Todo
write an intro for all OS?
Since Windows 2000, Windows offers a nice Unicode API and supports non-BMP characters. It uses Unicode strings implemented as wchar_t* strings (LPWSTR). wchar_t is 16 bits long on Windows and so it uses UTF-16: non-BMP characters are stored as two wchar_t (a surrogate pair), and the length of a string is the number of UTF-16 units and not the number of characters.
Windows 95, 98 and Me also had Unicode strings, but they were limited to BMP characters: they used UCS-2 instead of UTF-16.
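The difference between UTF-16 units and characters can be illustrated with a small, portable sketch. It does not use the Windows API: `uint16_t` stands in for the 16-bit Windows `wchar_t`, and the function names `utf16_encode` and `utf16_count_chars` are hypothetical helpers introduced for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16. Writes 1 or 2 units into out and
   returns that count, or 0 if cp is a surrogate or is above U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        /* BMP character: a single unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    /* Non-BMP character: a surrogate pair (high unit, then low unit) */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}

/* Count the characters stored in len UTF-16 units: a high surrogate
   followed by a low surrogate counts as one character. */
size_t utf16_count_chars(const uint16_t *s, size_t len)
{
    size_t i = 0, count = 0;
    assert(s != NULL || len == 0);
    while (i < len) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < len
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i += 2;
        else
            i += 1;
        count++;
    }
    return count;
}
```

For example, a string made of "A" followed by U+1F600 has a length of 3 UTF-16 units but contains only 2 characters.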
Todo
And Windows CE?
A Windows application has two encodings, called code pages (abbreviated “cp”): ANSI and OEM code pages. The ANSI code page, CP_ACP, is used for the ANSI version of the Windows API to decode byte strings to character strings and has a number between 874 and 1258. The OEM code page or “IBM PC” code page, CP_OEMCP, comes from MS-DOS, is used for the Windows console, contains glyphs to create text interfaces (draw boxes) and has a number between 437 and 874. Example of a French setup: ANSI is cp1252 and OEM is cp850.
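There are code page constants, including:
- CP_ACP: ANSI code page
- CP_OEMCP: OEM code page
- CP_THREAD_ACP: ANSI code page of the current thread
- CP_UTF7: UTF-7
- CP_UTF8: UTF-8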
Functions:
- GetACP(): get the ANSI code page number.
- GetOEMCP(): get the OEM code page number.
- SetThreadLocale(): set the locale. It can be used to change the ANSI code page of the current thread (CP_THREAD_ACP).
See also
Wikipedia article: Windows code page.
Encode and decode functions of <windows.h>.
MultiByteToWideChar(): decode a byte string from a code page to a character string. Use the MB_ERR_INVALID_CHARS flag to return an error on an undecodable byte sequence.
The default behaviour (flags=0) depends on the Windows version:
- Windows Vista and later: replace undecodable bytes
- Windows 2000, XP and 2003: ignore undecodable bytes
In strict mode (MB_ERR_INVALID_CHARS), the UTF-8 decoder (CP_UTF8) returns an error on surrogate characters in Windows Vista and later. On Windows XP, the UTF-8 decoder is not strict: surrogates can be decoded in any mode.
The UTF-7 decoder (CP_UTF7) only supports flags=0.
Examples on any Windows version:
| Flags | default (0) | MB_ERR_INVALID_CHARS |
|---|---|---|
| 0xFF, cp932 | {U+F8F3} | error |
| 0xE9 0x80, cp1252 | {U+00E9, U+20AC} | {U+00E9, U+20AC} |
| 0xFF, CP_UTF7 | {U+FF} | invalid flags |
| 0xC3 0xA9, CP_UTF8 | {U+00E9} | {U+00E9} |
Examples on Windows Vista and later:
| Flags | default (0) | MB_ERR_INVALID_CHARS |
|---|---|---|
| 0x81 0x00, cp932 | {U+30FB, U+0000} | error |
| 0xFF, CP_UTF8 | {U+FFFD} | error |
| 0xED 0xB2 0x80, CP_UTF8 | {U+FFFD, U+FFFD, U+FFFD} | error |
Examples on Windows 2000, XP, 2003:
| Flags | default (0) | MB_ERR_INVALID_CHARS |
|---|---|---|
| 0x81 0x00, cp932 | {U+0000} | error |
| 0xFF, CP_UTF8 | error | error |
| 0xED 0xB2 0x80, CP_UTF8 | {U+DC80} | {U+DC80} |
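A strict decoder can be sketched as a thin wrapper around MultiByteToWideChar(). The wrapper name `decode_strict` is hypothetical, and the `_WIN32` guard only exists so that the file still compiles on other systems, where the function is a stub. Since CP_UTF7 only supports flags=0, this sketch should not be used with that code page.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Decode len bytes from code_page in strict mode (MB_ERR_INVALID_CHARS).
   Returns the number of wchar_t units written to out, or -1 on error:
   undecodable byte sequence, output buffer too small, empty input,
   or a build for a system other than Windows. */
int decode_strict(unsigned int code_page, const char *bytes, int len,
                  wchar_t *out, int out_size)
{
    assert(bytes != NULL && out != NULL);
#ifdef _WIN32
    int n = MultiByteToWideChar(code_page, MB_ERR_INVALID_CHARS,
                                bytes, len, out, out_size);
    /* MultiByteToWideChar() returns 0 on failure; GetLastError()
       gives the reason (e.g. ERROR_NO_UNICODE_TRANSLATION). */
    return (n > 0) ? n : -1;
#else
    (void)code_page; (void)len; (void)out_size;
    return -1;  /* stub: the API only exists on Windows */
#endif
}
```

According to the tables above, decoding 0xE9 0x80 from cp1252 with this wrapper should give {U+00E9, U+20AC}, and decoding 0xFF from CP_UTF8 should fail.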
WideCharToMultiByte(): encode a character string to a byte string. The behaviour on unencodable characters depends on the code page, the Windows version and the flags.
| Code page | Windows version | Flags | Behaviour |
|---|---|---|---|
| CP_UTF8 | 2000, XP, 2003 | 0 | Encode surrogates |
| CP_UTF8 | Vista or later | 0 | Replace surrogates by U+FFFD |
| CP_UTF8 | Vista or later | WC_ERR_INVALID_CHARS | Strict |
| CP_UTF7 | all versions | 0 | Encode surrogates |
| Others | all versions | 0 | Replace by similar glyph |
| Others | all versions | WC_NO_BEST_FIT_CHARS | Replace by ? (1) |

(1) Strict if the pusedDefaultChar output is checked.
pusedDefaultChar is not supported by CP_UTF7 or CP_UTF8.
Use the WC_NO_BEST_FIT_CHARS flag and check pusedDefaultChar (or use the WC_ERR_INVALID_CHARS flag for CP_UTF8) to get a strict encoder, one that detects unencodable characters. By default, if a character cannot be encoded, it is replaced by a character with a similar glyph or by “?” (U+003F). For example, with cp1252, Ł (U+0141) is replaced by L (U+004C).
On Windows Vista or later with WC_ERR_INVALID_CHARS flag, the UTF-8 encoder (CP_UTF8) returns an error on surrogate characters. The default behaviour (flags=0) depends on the Windows version: surrogates are replaced by U+FFFD on Windows Vista and later, and are encoded to UTF-8 on older Windows versions. The WC_NO_BEST_FIT_CHARS flag is not supported by the UTF-8 encoder.
The WC_ERR_INVALID_CHARS flag is only supported by CP_UTF8 and only on Windows Vista or later.
The UTF-7 encoder (CP_UTF7) only supports flags=0. It is not strict: it encodes surrogate characters.
Examples (on any Windows version):
| Flags | default (0) | WC_NO_BEST_FIT_CHARS |
|---|---|---|
| ÿ (U+00FF), cp932 | 0x79 (y) | 0x3F (?) |
| Ł (U+0141), cp1252 | 0x4C (L) | 0x3F (?) |
| € (U+20AC), cp1252 | 0x80 | 0x80 |
| U+DC80, CP_UTF7 | 0x2b 0x33 0x49 0x41 0x2d (+3IA-) | invalid flags |
Examples on Windows Vista and later:
| Flags | default (0) | WC_ERR_INVALID_CHARS | WC_NO_BEST_FIT_CHARS |
|---|---|---|---|
| U+DC80, CP_UTF8 | 0xEF 0xBF 0xBD | error | invalid flags |
Examples on Windows 2000, XP, 2003:
| Flags | default (0) | WC_ERR_INVALID_CHARS | WC_NO_BEST_FIT_CHARS |
|---|---|---|---|
| U+DC80, CP_UTF8 | 0xED 0xB2 0x80 | invalid flags | invalid flags |
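For code pages other than CP_UTF7 and CP_UTF8, a strict encoder can be sketched by combining WC_NO_BEST_FIT_CHARS with the pusedDefaultChar output of WideCharToMultiByte(). The wrapper name `encode_strict` is hypothetical, and the `_WIN32` guard only exists so that the file still compiles on other systems, where the function is a stub.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Encode len wide characters to code_page without best-fit replacement.
   Returns the number of bytes written to out, or -1 on error: a character
   was replaced by the default character, the output buffer is too small,
   or the build is for a system other than Windows.
   Not usable with CP_UTF7 or CP_UTF8, which reject pusedDefaultChar. */
int encode_strict(unsigned int code_page, const wchar_t *text, int len,
                  char *out, int out_size)
{
    assert(text != NULL && out != NULL);
#ifdef _WIN32
    BOOL used_default = FALSE;
    int n = WideCharToMultiByte(code_page, WC_NO_BEST_FIT_CHARS,
                                text, len, out, out_size,
                                NULL, &used_default);
    /* 0 means failure; used_default means "?" replaced a character */
    if (n <= 0 || used_default)
        return -1;
    return n;
#else
    (void)code_page; (void)len; (void)out_size;
    return -1;  /* stub: the API only exists on Windows */
#endif
}
```

With cp1252, this wrapper is expected to encode € (U+20AC) to 0x80 and to reject Ł (U+0141) instead of silently replacing it with L or “?”.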
Note
MultiByteToWideChar() and WideCharToMultiByte() functions are similar to mbstowcs() and wcstombs() functions.
Todo
Document NormalizeString()
Todo
Document the replacement character?
Windows has two versions of each function of its API: the ANSI version using byte strings (A suffix) and the ANSI code page, and the wide version (W suffix) using character strings. There are also functions without suffix using TCHAR* strings: if the C define _UNICODE is defined, TCHAR is replaced by wchar_t and the Unicode functions are used; otherwise TCHAR is replaced by char and the ANSI functions are used. Example:
- CreateFileA(): bytes version, use byte strings encoded to the ANSI code page
- CreateFileW(): Unicode version, use wide character strings
- CreateFile(): TCHAR version depending on the _UNICODE define
Always prefer the Unicode version to avoid encoding/decoding errors, and use the W suffix directly to avoid compilation issues.
Note
There is a third version of the API: the MBCS API (multibyte character string). Use the TCHAR functions and define _MBCS to use the MBCS functions. For example, _tcsrev() is replaced by _mbsrev() if _MBCS is defined, by _wcsrev() if _UNICODE is defined, or by _strrev() otherwise.
Windows string types:
- LPSTR (LPCSTR): byte string, char* (const char*)
- LPWSTR (LPCWSTR): wide character string, wchar_t* (const wchar_t*)
- LPTSTR (LPCTSTR): byte or wide character string depending on the _UNICODE define, TCHAR* (const TCHAR*)
Windows stores filenames as Unicode in the filesystem. The filesystem also has a wide character POSIX-like API, for example _wfopen() and _wopen().
POSIX functions, like fopen(), use the ANSI code page to encode/decode strings.
Console functions:
- GetConsoleCP(): get the code page of the standard input (stdin) of the console.
- GetConsoleOutputCP(): get the code page of the standard output (stdout and stderr) of the console.
- WriteConsoleW(): write a character string into the console.
Todo
document ReadConsoleW()?
To improve the Unicode support of the console, set the console font to a TrueType font (e.g. “Lucida Console”) and use the wide character API.
If the console is unable to render a character, it tries to use a character with a similar glyph. For example, with OEM code page 850, Ł (U+0141) is replaced by L (U+004C). If no replacement character can be found, “?” (U+003F) is displayed instead.
In a console (cmd.exe), the chcp command can be used to display or to change the OEM code page (and the console code page). Changing the console code page is not a good idea because the ANSI API of the console still expects characters encoded to the previous console code page.
See also
Conventional wisdom is retarded, aka What the @#%&* is _O_U16TEXT? (Michael S. Kaplan, 2008) and the Python bug report #1602: windows console doesn’t print or input Unicode.
_setmode() and _wsopen() are special functions to set the encoding of a file, using modes such as _O_WTEXT, _O_U8TEXT and _O_U16TEXT.
fopen() can use these modes using ccs= in the file mode:
- ccs=UNICODE: _O_WTEXT
- ccs=UTF-8: _O_U8TEXT
- ccs=UTF-16LE: _O_U16TEXT
Todo
Consequences on TTY and pipes?
Mac OS X uses UTF-8 for the filenames. If a filename is an invalid UTF-8 byte string, Mac OS X returns an error. The filenames are decomposed to an incompatible variant of the Normal Form D (NFD). Extract of the Technical Q&A QA1173: “For example, HFS Plus uses a variant of Normal Form D in which U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF are not decomposed.”
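The ranges quoted from QA1173 can be written down as a small predicate. This is only a sketch of the exclusion ranges themselves: it does not perform any normalization, and the function name `hfs_plus_skips_decomposition` is a hypothetical name chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if the code point falls in one of the ranges that, according
   to the QA1173 extract, HFS Plus does not decompose; 0 otherwise. */
int hfs_plus_skips_decomposition(uint32_t cp)
{
    assert(cp <= 0x10FFFF);  /* must be a Unicode code point */
    return (cp >= 0x2000 && cp <= 0x2FFF)
        || (cp >= 0xF900 && cp <= 0xFAFF)
        || (cp >= 0x2F800 && cp <= 0x2FAFF);
}
```

For instance, U+212B (ANGSTROM SIGN) lies in the first range and is therefore left as is, whereas é (U+00E9) is outside all three ranges and is decomposed.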
To support different languages and encodings, UNIX and BSD operating systems have “locales”. Locales are process-wide: if a thread or a library changes the locale, the whole process is affected.
Locale categories:
- LC_COLLATE: compare and sort strings
- LC_CTYPE: decode byte strings and encode character strings
- LC_MESSAGES: language of messages
- LC_MONETARY: monetary formatting
- LC_NUMERIC: number formatting (e.g. thousands separator)
- LC_TIME: time and date formatting
LC_ALL is a special category: if you set a locale using this category, it sets the locale for all categories.
Each category has its own environment variable with the same name. For example, LC_MESSAGES=C displays error messages in English. To get the value of a locale category, the LC_ALL, LC_xxx (e.g. LC_CTYPE) and LANG environment variables are checked in this order, and the first non-empty variable is used. If all variables are unset, the C locale is used as a fallback.
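The lookup order of the environment variables can be sketched as follows. This only illustrates the precedence rule described above: in a real program, setlocale(category, "") performs this lookup itself. The names `pick_locale` and `locale_from_env` are hypothetical helpers.

```c
#include <assert.h>
#include <stdlib.h>

/* Apply the precedence rule: LC_ALL, then the category variable, then
   LANG. An unset (NULL) or empty value is skipped. Falls back to "C". */
const char *pick_locale(const char *lc_all, const char *lc_category,
                        const char *lang)
{
    const char *candidates[3];
    size_t i;

    candidates[0] = lc_all;
    candidates[1] = lc_category;
    candidates[2] = lang;
    for (i = 0; i < 3; i++) {
        if (candidates[i] != NULL && candidates[i][0] != '\0')
            return candidates[i];
    }
    return "C";
}

/* Same rule, reading the process environment.
   category_var is the variable name, e.g. "LC_CTYPE". */
const char *locale_from_env(const char *category_var)
{
    assert(category_var != NULL);
    return pick_locale(getenv("LC_ALL"), getenv(category_var),
                       getenv("LANG"));
}
```

With an empty LC_ALL, LC_CTYPE=fr_FR.UTF-8 and LANG=en_US.UTF-8, the rule selects fr_FR.UTF-8 for the LC_CTYPE category.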
Note
The gettext library reads LANGUAGE, LC_ALL and LANG environment variables (and some others) to get the user language. The LANGUAGE variable is specific to gettext and is not related to locales.
When a program starts, it does not directly get the user locale: it uses the default locale, which is called the “C” locale or the “POSIX” locale. This locale is also used if no locale environment variable is set. For LC_CTYPE, the C locale usually means ASCII, but not always (see the locale encoding section). For LC_MESSAGES, the C locale means to speak the original language of the program, which is usually English.
For Unicode, the most important locale category is LC_CTYPE: it is used to set the “locale encoding”.
To get the locale encoding:
- Copy the current locale: setlocale(LC_CTYPE, NULL)
- Set the current locale encoding to the user preference: setlocale(LC_CTYPE, "")
- Use nl_langinfo(CODESET) if available
- or setlocale(LC_CTYPE, NULL)
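The steps above can be sketched in C as follows. The sketch assumes that nl_langinfo(CODESET) is available (it does not implement the setlocale(LC_CTYPE, NULL) alternative) and restores the previous LC_CTYPE locale before returning. The function name `get_locale_encoding` is a hypothetical helper.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Copy the name of the user's locale encoding into buf (size bytes).
   Returns 0 on success, -1 on error. If the user locale cannot be set,
   the encoding of the current locale is reported instead. */
int get_locale_encoding(char *buf, size_t size)
{
    const char *current, *codeset;
    char *saved;
    int result = -1;

    assert(buf != NULL && size > 0);

    /* Copy the current locale: the string returned by setlocale() may
       be overwritten by the next call, so it must be duplicated. */
    current = setlocale(LC_CTYPE, NULL);
    if (current == NULL)
        return -1;
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    /* Set the locale encoding to the user preference (a failure leaves
       the current locale unchanged and is deliberately ignored here). */
    (void)setlocale(LC_CTYPE, "");

    /* Read the encoding name */
    codeset = nl_langinfo(CODESET);
    if (strlen(codeset) < size) {
        strcpy(buf, codeset);
        result = 0;
    }

    /* Restore the previous locale */
    setlocale(LC_CTYPE, saved);
    free(saved);
    return result;
}
```

The returned name depends on the platform and on the environment: typical values are "UTF-8", "ISO-8859-1", or an alias of ASCII for the C locale.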
Todo
write a full example in C
For the C locale, nl_langinfo(CODESET) returns ASCII, or an alias to this encoding (e.g. “US-ASCII” or “646”). But on FreeBSD, Solaris and Mac OS X, codec functions (e.g. mbstowcs()) use ISO 8859-1 even if nl_langinfo(CODESET) announces ASCII encoding. AIX uses ISO 8859-1 for the C locale (and nl_langinfo(CODESET) returns "ISO8859-1").
<locale.h> functions.
- setlocale(category, NULL): get the value of the specified locale category.
- setlocale(category, name): set the value of the specified locale category.
Todo
setlocale(“”) means user preference
<langinfo.h> functions.
- nl_langinfo(CODESET): get the name of the locale encoding.
<stdlib.h> functions.
- mbstowcs(): decode a byte string from the locale encoding to a character string. The decoder is strict: it returns an error on an undecodable byte sequence. If available, prefer the reentrant version: mbsrtowcs().
- wcstombs(): encode a character string to a byte string in the locale encoding. The encoder is strict: it returns an error if a character cannot be encoded. If available, prefer the reentrant version: wcsrtombs().
mbstowcs() and wcstombs() are strict and don’t support error handlers.
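The strict behaviour can be sketched with two small wrappers that allocate the result. The names `decode_locale` and `encode_locale` are hypothetical. The sketch assumes that mbstowcs() and wcstombs() accept a NULL destination to compute the required length, which is a common extension (documented by POSIX as an XSI extension) but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Decode a NUL-terminated byte string from the locale encoding.
   Returns a malloc()ed wide string, or NULL on an undecodable byte
   sequence or on memory allocation failure. */
wchar_t *decode_locale(const char *bytes)
{
    size_t n;
    wchar_t *wide;

    assert(bytes != NULL);
    n = mbstowcs(NULL, bytes, 0);   /* length, without the final NUL */
    if (n == (size_t)-1)
        return NULL;                /* strict: undecodable bytes */
    wide = malloc((n + 1) * sizeof(wchar_t));
    if (wide == NULL)
        return NULL;
    mbstowcs(wide, bytes, n + 1);
    return wide;
}

/* Encode a NUL-terminated wide string to the locale encoding.
   Returns a malloc()ed byte string, or NULL on an unencodable
   character or on memory allocation failure. */
char *encode_locale(const wchar_t *wide)
{
    size_t n;
    char *bytes;

    assert(wide != NULL);
    n = wcstombs(NULL, wide, 0);    /* length, without the final NUL */
    if (n == (size_t)-1)
        return NULL;                /* strict: unencodable character */
    bytes = malloc(n + 1);
    if (bytes == NULL)
        return NULL;
    wcstombs(bytes, wide, n + 1);
    return bytes;
}
```

Both functions use the current LC_CTYPE locale, so a program would normally call setlocale(LC_CTYPE, "") first. Without that call the C locale is used, and a character such as € (U+20AC) cannot be encoded.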
Note
“mbs” stands for “multibyte string” (byte string) and “wcs” stands for “wide character string”.
On Windows, the “locale encodings” are the ANSI and OEM code pages. A Windows program uses the user preferred code pages at startup, whereas a program starts with the C locale on UNIX.
CD-ROMs use the ISO 9660 filesystem, which stores filenames as byte strings. This filesystem is very restrictive: only A-Z, 0-9, _ and “.” are allowed. Microsoft developed the Joliet extension, which stores filenames as UCS-2, up to 64 characters (BMP only). It was first supported by Windows 95. Today, all operating systems are able to read it.
UDF (Universal Disk Format) is the filesystem of DVD: it stores filenames as character strings.
Todo
UDF encoding?
MS-DOS uses the FAT filesystems (FAT 12, FAT 16, FAT 32): filenames are stored as byte strings. Filenames are limited to 8+3 characters (8 for the name, 3 for the extension) and displayed differently depending on the code page (mojibake issue).
Microsoft extended its FAT filesystem in Windows 95: the Virtual FAT (VFAT) supports “long filenames”, which are stored as UCS-2, up to 255 characters (BMP only). Starting with Windows 2000, non-BMP characters can be used: UTF-16 replaces UCS-2 and the limit is now 255 UTF-16 units.
The NTFS filesystem stores filenames as character strings.
Todo
NTFS encoding
HFS stores filenames as byte strings.
HFS+ stores filenames as UTF-16: the maximum length is 255 UTF-16 units.
JFS and ZFS also use Unicode.
The ext family (ext2, ext3, ext4) stores filenames as byte strings.
Todo
Linux: mount options (FAT, NFSv3)
Todo
USB keys, camera, memory cards
Todo
Network filesystems like NFS (NFS4 supports Unicode?)