Ethereal-dev: Re: [Ethereal-dev] UTF-8 field in a dissector

Note: This archive is from the project's previous web site, ethereal.com. This list is no longer active.

From: Guy Harris <gharris@xxxxxxxxx>
Date: Fri, 12 Nov 2004 11:49:41 -0800
Martijn Schipper wrote:

I have created a dissector for a protocol and one of the fields is UTF-8 encoded. What should I do to display this field in the tree?

If you mean "what should I do to display all the characters in it correctly", the answer is "change Ethereal's handling of strings to allow a character encoding to be specified with the string, and add UTF-8 as one of the valid encodings". With such a change, the set of encodings should ultimately include:

	ASCII, meaning "display anything with the 8th bit set, as well as all control characters, as an escaped character";

	UTF-8;

	16-bit Unicode (big-endian and little-endian);

	various PC OEM character sets;

	various classic Mac OS character sets (OS X's native encoding is UTF-8, but the earlier versions might've used MacRoman, etc.);

	EBCDIC;

	ISO 8859/x;

	various EUC character sets;

	various other encodings (KOI-8, Shift-JIS, GBwhatever-that-Chinese-encoding-is, etc.).

Note that iconv isn't necessarily the answer, as we can't guarantee that the iconv implementation on a given platform will support all the character sets that Ethereal would need (it's not a question of what character sets the machine running Ethereal uses, because it has to deal with the character sets that the machines that transmitted the packets Ethereal is reading used). Perhaps incorporating a copy of GNU iconv into Ethereal, and having our own tables for character encodings, would be the answer.
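The portability concern above can be checked at run time. The following is a minimal sketch, assuming a POSIX-style iconv is available; the helper name iconv_supports() is hypothetical and not part of Ethereal. It only asks whether the local iconv can convert a given character set to UTF-8, which is the question a dissector would need answered before relying on it.

```c
#include <iconv.h>

/*
 * Hypothetical helper (not an Ethereal routine): return 1 if this
 * platform's iconv can convert text in "charset" to UTF-8, else 0.
 * iconv_open() returns (iconv_t)-1 when the conversion is unsupported.
 */
static int
iconv_supports(const char *charset)
{
    iconv_t cd = iconv_open("UTF-8", charset);

    if (cd == (iconv_t)-1)
        return 0;       /* this iconv doesn't know the character set */
    iconv_close(cd);
    return 1;
}
```

The answer depends on the iconv implementation rather than on the capture file, so the same capture might decode on one machine and not on another.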

Note also that to display, print, etc. these characters you have to deal with:

	GTK+ 1.2[.x], which expects text in whatever the encoding is for the font being used;

	GTK+ 1.3[.x] and 2.x, which expect UTF-8 text;

	formatting to a text file, which, on UN*X, should probably generate text in whatever the encoding is for the user's locale, and on Windows, should probably - what? ASCII? 16-bit Unicode? If 16-bit Unicode, how can it tag the file as such, so that Windows text editors can handle it? Begin the file with a byte-order mark?

	printing to a printer.
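If the byte-order-mark route were taken for text files on Windows, the output side might look like the sketch below. It is only an illustration: write_utf16le_with_bom() is a hypothetical helper, it handles ASCII input only, and it assumes little-endian 16-bit Unicode is the form wanted.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write the ASCII string "s" to "fp" as little-endian
 * 16-bit Unicode, preceded by a byte-order mark (U+FEFF, i.e. bytes FF FE).
 * Returns 0 on success, -1 on a write error or if "s" has a non-ASCII byte.
 */
static int
write_utf16le_with_bom(FILE *fp, const char *s)
{
    static const unsigned char bom[2] = { 0xff, 0xfe };

    if (fwrite(bom, 1, sizeof bom, fp) != sizeof bom)
        return -1;
    for (; *s != '\0'; s++) {
        unsigned char unit[2];

        if ((unsigned char)*s > 0x7f)
            return -1;          /* this sketch handles ASCII only */
        unit[0] = (unsigned char)*s;    /* low byte first */
        unit[1] = 0x00;                 /* high byte is zero for ASCII */
        if (fwrite(unit, 1, sizeof unit, fp) != sizeof unit)
            return -1;
    }
    return 0;
}
```

The file should be opened in binary mode so that the C library doesn't translate any of the bytes on the way out.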

If, however, you are willing to live with only ASCII characters being displayed correctly, then, if you're adding the strings as fields, Ethereal should properly escape non-ASCII characters, and if you're explicitly formatting with "proto_tree_add_text()" or "proto_tree_add_XXX_format()", use "format_text()" or "tvb_format_text()" with "%s" format items (which is what people should be doing *anyway*, to keep non-printable characters from screwing things up).
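To make the "only ASCII displayed correctly" behavior concrete, here is a rough standalone sketch of that kind of escaping. It is not Ethereal's "format_text()"; escape_non_ascii() and its \xNN output style are assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not Ethereal's format_text(): copy "len" bytes from
 * "src" into "dst", writing printable ASCII as-is and every other byte
 * (control characters, anything with the 8th bit set, and the backslash
 * itself) as a \xNN escape.  Output is truncated to fit "dstsize" and is
 * always NUL-terminated when "dstsize" is nonzero.
 */
static char *
escape_non_ascii(char *dst, size_t dstsize, const unsigned char *src, size_t len)
{
    size_t out = 0;
    size_t i;

    if (dstsize == 0)
        return dst;
    for (i = 0; i < len; i++) {
        unsigned char c = src[i];

        if (c >= 0x20 && c <= 0x7e && c != '\\') {
            if (out + 1 >= dstsize)
                break;          /* no room for the character plus NUL */
            dst[out++] = (char)c;
        } else {
            if (out + 4 >= dstsize)
                break;          /* no room for the escape plus NUL */
            snprintf(dst + out, dstsize - out, "\\x%02x", c);
            out += 4;
        }
    }
    dst[out] = '\0';
    return dst;
}
```

With this sketch, the UTF-8 bytes for "café" followed by a newline come out as caf\xc3\xa9\x0a: the two-byte sequence for the accented letter is shown as two escapes rather than one character. In a real dissector you'd use "format_text()" or "tvb_format_text()" instead of a private routine like this.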