Wireshark-users: [Wireshark-users] Handling non-ASCII characters on Windows and on non-UTF-8 UN*X

From: Guy Harris <guy@xxxxxxxxxxxx>
Date: Thu, 4 Oct 2018 19:38:58 -0700
The Wireshark dissection code stores strings in UTF-8, and also uses UTF-8 for the format strings and other strings in its output.

The file I/O code assumes file names are in the current locale's encoding on UN*X and in UTF-8 on Windows.

The code that prints file names just puts out the file names as they are.

There are three types of systems we have to deal with:

1) UN*Xes where the current locale uses UTF-8.  This Just Works.

2) UN*Xes where the current locale uses some other encoding.  This works OK for file names, as we just use them as is, but doesn't work for text output from dissections, as that's in UTF-8, not in the locale's encoding.

3) Windows.  This Is A Mess:

	file names work fine through the GUI, although if they're written to a text file, that text file will be a UTF-8 file, not a file in the current code page;

	non-ASCII file names that can't be encoded in the current code page *don't* work from the command line - they can't be opened;

	non-ASCII file names that *can* be encoded in the current code page can be opened, but if there's an error, the error message on the console will be garbled because we're writing the names out in UTF-8, not in the current code page;

	text output has the same "UTF-8, not current code page" issue.
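
To make the "UTF-8, not current code page" symptom concrete, here is a minimal sketch of what a console does to our output when it interprets the bytes in a single-byte encoding. It assumes ISO 8859-1 as a stand-in for the locale's encoding or the current code page; show_as_latin1() is a hypothetical helper written for illustration, not Wireshark code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Treat every byte of `in` as one ISO 8859-1 character and write the
 * result to `out` as UTF-8.  This models a console that reads our UTF-8
 * output as if it were in a single-byte encoding.
 *
 * Returns the number of bytes written, not counting the terminating NUL,
 * or (size_t)-1 if `out` is too small.
 */
size_t show_as_latin1(const char *in, char *out, size_t outsz)
{
    size_t o = 0;

    assert(in != NULL && out != NULL);
    if (outsz == 0)
        return (size_t)-1;
    for (const unsigned char *p = (const unsigned char *)in; *p != 0; p++) {
        size_t need = (*p < 0x80) ? 1 : 2;

        /* Keep room for the terminating NUL. */
        if (o + need + 1 > outsz)
            return (size_t)-1;
        if (*p < 0x80) {
            out[o++] = (char)*p;              /* ASCII is unchanged */
        } else {
            /* U+0080..U+00FF take two bytes in UTF-8. */
            out[o++] = (char)(0xC0 | (*p >> 6));
            out[o++] = (char)(0x80 | (*p & 0x3F));
        }
    }
    out[o] = '\0';
    return o;
}
```

Given the UTF-8 encoding of "é" (the bytes C3 A9), this yields the UTF-8 encoding of "Ã©" (C3 83 C2 A9), which is the kind of garbling a user would see for a non-ASCII file name in an error message.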

"Non-ASCII file names that can't be encoded in the current code page *don't* work from the command line - they can't be opened" can be fixed by changing the command-line tools so that:

	the main routine is renamed to real_main();

	on Windows, we have a small wmain() routine that converts the arguments from UTF-16LE to UTF-8 and then calls real_main(), returning its return value;

	on UN*X, we just have main() pass its arguments to real_main() and return real_main()'s return value.
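
The wmain() wrapper described above could look roughly like the following. This is a sketch under stated assumptions, not the actual change: utf16_to_utf8() is a hypothetical hand-written converter (a real implementation might use a library routine instead), real_main() is assumed to be defined elsewhere in the tool, and the exit status 2 for an allocation failure is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Convert a NUL-terminated UTF-16 string (host byte order, which is
 * little-endian on Windows) to a newly allocated UTF-8 string.
 * Unpaired surrogates are replaced with U+FFFD.
 * Returns NULL if memory can't be allocated; the caller frees the result.
 */
char *utf16_to_utf8(const uint16_t *in)
{
    size_t n = 0, o = 0;
    char *out;

    assert(in != NULL);
    while (in[n] != 0)
        n++;
    /* One UTF-16 code unit never needs more than 3 UTF-8 bytes
     * (a surrogate pair is 2 units and becomes 4 bytes). */
    out = malloc(3 * n + 1);
    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            if (i + 1 < n && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                /* Combine a high/low surrogate pair. */
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                i++;
            } else {
                cp = 0xFFFD;            /* unpaired high surrogate */
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;                /* unpaired low surrogate */
        }
        if (cp < 0x80) {
            out[o++] = (char)cp;
        } else if (cp < 0x800) {
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (char)(0xF0 | (cp >> 18));
            out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';
    return out;
}

#ifdef _WIN32
/* The tool's former main(), assumed to be defined elsewhere. */
int real_main(int argc, char *argv[]);

/* Windows entry point: the arguments arrive as UTF-16LE. */
int wmain(int argc, wchar_t *wargv[])
{
    char **argv = calloc((size_t)argc + 1, sizeof *argv);

    if (argv == NULL)
        return 2;
    for (int i = 0; i < argc; i++) {
        /* wchar_t is a 16-bit type on Windows. */
        argv[i] = utf16_to_utf8((const uint16_t *)wargv[i]);
        if (argv[i] == NULL)
            return 2;
    }
    /* argv is deliberately not freed; the process exits right after. */
    return real_main(argc, argv);
}
#endif
```

On UN*X the counterpart is just a main() whose body is "return real_main(argc, argv);", so the rest of the tool sees the same interface on both platforms.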

That doesn't fix the other issues, however.

In theory, if your local code page on Windows is code page 65001, which uses an encoding called "UTF-8", Windows would act like a UN*X where the current locale uses UTF-8.  Unfortunately, not everything on Windows works well in that code page - Windows 7 has problems with CP 65001, and earlier versions had even more problems:

	https://www.dostips.com/forum/viewtopic.php?t=5357

The Wikipedia article claims that, until recently, you couldn't set the code page to 65001 at all:

	https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8

but I seem to remember having tried it, although I may just have done "chcp 65001" to change the cmd.exe code page.
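
Whether a given run is in the "Just Works" situation could be checked at run time. The following is only a sketch: codeset_is_utf8() and output_is_utf8() are hypothetical names, the UN*X branch assumes the program has already called setlocale(LC_ALL, ""), and the Windows branch looks only at the console output code page (the one "chcp" changes), not at the system code page.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <langinfo.h>
#endif

/*
 * Return 1 if `name` is a spelling of UTF-8 ("UTF-8", "utf8", ...),
 * otherwise 0.  Case, hyphens and underscores are ignored.
 */
int codeset_is_utf8(const char *name)
{
    static const char want[] = "utf8";
    size_t w = 0;

    assert(name != NULL);
    for (; *name != '\0'; name++) {
        if (*name == '-' || *name == '_')
            continue;
        if (want[w] == '\0' || tolower((unsigned char)*name) != want[w])
            return 0;
        w++;
    }
    return want[w] == '\0';
}

/* Return 1 if text written to the console should be UTF-8, else 0. */
int output_is_utf8(void)
{
#ifdef _WIN32
    /* CP_UTF8 is 65001. */
    return GetConsoleOutputCP() == CP_UTF8;
#else
    /* Reflects the locale in effect; see setlocale(). */
    return codeset_is_utf8(nl_langinfo(CODESET));
#endif
}
```

If output_is_utf8() returns 0, UTF-8 text and file names would need converting before being written to the console, which is the part that isn't being done today.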

Has anybody encountered these issues?  (I use a UN*X that *VERY* strongly prefers UTF-8 - I don't know how well it supports any other encodings - so *I* haven't encountered them.)