Wireshark-dev: Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

From: Guy Harris <guy@xxxxxxxxxxxx>
Date: Mon, 11 Jul 2011 16:54:57 -0700
On Jul 11, 2011, at 4:00 PM, Stephen Fisher wrote:

> The popular SecureCRT terminal emulator defaults to "default" (same as 
> local system) character encoding, at least on Windows systems.  This is 
> not compatible with UTF-8 in my experience.

Not surprising, given that "default"/"same as local system" probably means "local code page".  Win32 first appeared in NT 3.1 in 1993, and Unicode first appeared in 1991 (Microsoft joined the group doing it in 1990, at least according to the Wikipedia article), so Win32 could support Unicode from Day One: Microsoft could get away with saying "if you want Unicode, you have to use the Unicode versions of the APIs, and strings in those versions are UCS-2", with the legacy "ASCII"/"ANSI" APIs using code pages.  UN*X didn't have that advantage, so UN*X systems support Unicode using UTF-8 rather than with Shiny New APIs.

So, on Windows, consoles, whether from Microsoft or third parties, probably tend to use the local code page if they're not using UCS-2/UTF-16 characters.  For what it's worth, the Wikipedia article on the Win32 console:

	http://en.wikipedia.org/wiki/Win32_console

claims that

	Under Windows NT and CE based versions of Windows, the screen buffer uses four bytes per character cell: two bytes for character code, two bytes for attributes. The character is then encoded as a 16-bit subset of Unicode (UCS-2).[2] For backward compatibility, the console APIs exist in two versions: Unicode and non-Unicode. The non-Unicode versions of APIs can use code page switching to extend the range of displayed characters (but only if TrueType fonts are used for the console window, thereby extending the range of codes available). Even UTF-8 is available as "code page 65001".
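(For what it's worth, here's roughly what taking advantage of code page 65001 would look like - an untested sketch, not anything our code does today, and per the above it only helps if the console window is using a TrueType font:)

	#include <windows.h>
	#include <stdio.h>

	int main(void)
	{
		/* Remember the original output code page so we can restore it. */
		UINT original_cp = GetConsoleOutputCP();

		/* CP_UTF8 is 65001; after this, byte-oriented output such as
		   printf() is interpreted by the console as UTF-8. */
		if (!SetConsoleOutputCP(CP_UTF8)) {
			fprintf(stderr, "SetConsoleOutputCP failed: %lu\n", GetLastError());
			return 1;
		}

		/* A UTF-8 string: "héllo", with e-acute as the bytes 0xC3 0xA9. */
		printf("h\xc3\xa9llo, world\n");

		SetConsoleOutputCP(original_cp);
		return 0;
	}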

At least according to

	http://msdn.microsoft.com/en-us/library/ms683458(v=VS.85).aspx

the device-independent I/O functions ReadFile() and WriteFile() (for UN*X folks, think read() and write()) don't support Unicode:

	High-level I/O gives you a choice between the ReadFile and WriteFile functions and the ReadConsole and WriteConsole functions. They are identical, except for two important differences. The console functions support the use of either Unicode characters or the ANSI character set; the file I/O functions do not support Unicode. Also, the file I/O functions can be used to access files, pipes, and serial communications devices; the console functions can only be used with console handles. This distinction is important if an application relies on standard handles that may have been redirected.
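(Purely to illustrate the distinction that page is drawing - an untested sketch of the usual dance: check whether the standard output handle really is a console, use WriteConsoleW if so, and fall back to WriteFile with bytes if the handle has been redirected to a file or pipe:)

	#include <windows.h>
	#include <wchar.h>

	/* Write a UTF-16 string to the standard output handle, using
	   WriteConsoleW if it's really a console and falling back to
	   WriteFile - here with the string converted to the console
	   output code page - if the handle has been redirected. */
	static BOOL write_wstring(const wchar_t *ws)
	{
		HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
		DWORD mode, written;
		char buf[1024];
		int len;

		if (GetConsoleMode(h, &mode)) {
			/* A console handle: WriteConsoleW takes Unicode directly. */
			return WriteConsoleW(h, ws, (DWORD)wcslen(ws), &written, NULL);
		}

		/* Redirected: convert to some byte encoding and use WriteFile;
		   the console output code page here, but CP_UTF8 is another choice. */
		len = WideCharToMultiByte(GetConsoleOutputCP(), 0, ws, -1,
		                          buf, sizeof buf, NULL, NULL);
		if (len <= 0)
			return FALSE;
		/* len includes the terminating NUL; don't write that byte. */
		return WriteFile(h, buf, (DWORD)(len - 1), &written, NULL);
	}

	int main(void)
	{
		write_wstring(L"h\u00e9llo, world\r\n");	/* e-acute */
		return 0;
	}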

and I suspect that the C library _read() and _write() functions, and the "standard I/O library" functions that are presumably built atop them, probably ultimately run atop ReadFile() and WriteFile(), so that they're device-independent.

On UN*X, you probably get similar behavior, *mutatis mutandis* (e.g., replacing "the system code page setting" with "the code set portion of the setting of LANG or LC_CTYPE" or whatever), so we can't guarantee, on Windows or UN*X, that what gets printed with printf() or fprintf() can always be emitted in UTF-8, so

	1) we'd have to translate it to the appropriate character encoding (rough iconv() sketch after the list)

and

	2) not all Unicode characters can necessarily be represented in that encoding.
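(For the UN*X case, 1) would be something along the lines of the following - a rough, untested sketch using nl_langinfo() and iconv(), with 2) showing up as an iconv() failure:)

	#include <iconv.h>
	#include <langinfo.h>
	#include <locale.h>
	#include <stdio.h>
	#include <string.h>

	/* Convert a UTF-8 string to the current locale's encoding and print
	   it; characters with no representation in that encoding make
	   iconv() fail, and we just punt and print the raw UTF-8. */
	static void print_utf8_in_locale(const char *utf8)
	{
		char out[1024];
		char *inp = (char *)utf8, *outp = out;
		size_t inleft = strlen(utf8), outleft = sizeof out - 1;
		iconv_t cd;

		/* nl_langinfo(CODESET) gives the code set portion of the
		   locale - "UTF-8", "ISO-8859-1", and so on. */
		cd = iconv_open(nl_langinfo(CODESET), "UTF-8");
		if (cd == (iconv_t)-1) {
			fputs(utf8, stdout);	/* no converter; punt */
			return;
		}
		if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
			iconv_close(cd);
			fputs(utf8, stdout);	/* unconvertible character; punt */
			return;
		}
		*outp = '\0';
		fputs(out, stdout);
		iconv_close(cd);
	}

	int main(void)
	{
		setlocale(LC_ALL, "");	/* pick up LANG/LC_CTYPE */
		print_utf8_in_locale("h\xc3\xa9llo\n");
		return 0;
	}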

In the best of all possible worlds, all UN*X systems would be configured to use UTF-8 encoding and all Windows systems would be configured to use code page 65001, but....
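(Since we already link against GLib, at least checking whether we're in that world is cheap - g_get_charset() returns TRUE if the locale's encoding is UTF-8, and g_locale_from_utf8() would do the translation if it isn't. An untested sketch of the check:)

	#include <glib.h>
	#include <locale.h>
	#include <stdio.h>

	int main(void)
	{
		const char *charset;

		setlocale(LC_ALL, "");	/* honor LANG/LC_CTYPE */

		/* g_get_charset() returns TRUE if the locale's character
		   encoding is UTF-8, and points "charset" at its name
		   either way. */
		if (g_get_charset(&charset))
			printf("best of all possible worlds: %s\n", charset);
		else
			printf("would have to translate to %s\n", charset);
		return 0;
	}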