Ethereal-dev: Re: [Ethereal-dev] While we're on the subject of new frametypes...

Note: This archive is from the project's previous web site, ethereal.com. This list is no longer active.

From: Guy Harris <guy@xxxxxxxxxx>
Date: Thu, 12 Dec 2002 23:14:42 -0800
On Fri, Dec 13, 2002 at 04:22:39PM +1100, Tim Potter wrote:
> How about a new frametype for unicode strings?

Big-endian, or little-endian?  You could tell "proto_tree_add_item()"
what the byte order is; "proto_tree_add_ustring()", however, would
probably need to take a byte order argument.

There's currently a commented-out FT_UCS2_LE in "epan/ftypes/ftypes.h",
for 2-byte little-endian Unicode.  We could perhaps implement that.
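
To make the byte-order question concrete, here is a minimal sketch of what decoding 2-byte Unicode with an explicit byte-order argument might look like. The names "ucs2_order" and "ucs2_to_utf8()" are made up for illustration and are not existing Ethereal APIs; producing UTF-8 as the output form is likewise just an assumption, and surrogate code units are replaced with U+FFFD rather than paired.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte-order selector; not an existing Ethereal type. */
enum ucs2_order { UCS2_BIG_ENDIAN, UCS2_LITTLE_ENDIAN };

/*
 * Convert "len" bytes of 2-byte Unicode at "src" into a NUL-terminated
 * UTF-8 string in "dst".  Returns the number of UTF-8 bytes written,
 * not counting the NUL; stops early if "dst" would overflow.
 */
size_t
ucs2_to_utf8(const uint8_t *src, size_t len, enum ucs2_order order,
             char *dst, size_t dst_size)
{
    size_t i, n = 0;

    if (dst_size == 0)
        return 0;
    for (i = 0; i + 1 < len; i += 2) {
        /* The byte-order argument decides which byte is the high one. */
        unsigned c = (order == UCS2_LITTLE_ENDIAN)
            ? (unsigned)(src[i] | (src[i + 1] << 8))
            : (unsigned)((src[i] << 8) | src[i + 1]);
        size_t need;

        /* Lone surrogates aren't valid characters; substitute U+FFFD. */
        if (c >= 0xD800 && c <= 0xDFFF)
            c = 0xFFFD;

        need = (c < 0x80) ? 1 : (c < 0x800) ? 2 : 3;
        if (n + need >= dst_size)
            break;              /* keep room for the terminating NUL */

        if (c < 0x80) {
            dst[n++] = (char)c;
        } else if (c < 0x800) {
            dst[n++] = (char)(0xC0 | (c >> 6));
            dst[n++] = (char)(0x80 | (c & 0x3F));
        } else {
            dst[n++] = (char)(0xE0 | (c >> 12));
            dst[n++] = (char)(0x80 | ((c >> 6) & 0x3F));
            dst[n++] = (char)(0x80 | (c & 0x3F));
        }
    }
    dst[n] = '\0';
    return n;
}
```

A little-endian-only field type would amount to always passing UCS2_LITTLE_ENDIAN here; a general string-adding routine would have to get the order from its caller.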

However, I think there are some things we should think about before
doing Unicode (even if we don't come to a conclusion on all of them
first - we might be able to temporarily punt on the display and printing
issues by discarding non-ASCII characters, or printing/displaying them
as escape sequences, so those issues may not require immediate
resolution):

	1) What should we do about other extended-ASCII character sets?
	   Currently, we don't do anything clever, which means that, for
	   example, ISO 8859/1 strings might work OK if you're running
	   on some UNIX flavor with the locale set to an 8859/1 locale,
	   but don't work in other locales.

	   Should we make them Unicode strings, and have the dissector
	   translate them from the character set in question to Unicode?
	   Making the character set a property of the field might not
	   work - for example, that wouldn't work for OEM character sets
	   in SMB, as that'd have to be something set by an SMB
	   preference item at run time.  It might work for the Mac
	   character set in AppleTalk, however.

	2) As long as we're going down that path, should we store *all*
	   strings as Unicode in the protocol tree, and just keep the
	   existing FT_STRING types, and:

		perhaps have the byte-order argument to
		"proto_tree_add_item()" specify, for FT_STRING types,
		the character set and, in cases where a multi-byte
		character type can come in either byte order, the byte
		order;

		add a character set+byte order argument to
		"proto_tree_add_string()"?

	   That complicates life for GTK+ 1.2[.x], as you have to figure
	   out what character encoding is being used for the font, and
	   translate into that.  However, GTK+ 2.x, and the Win32 GTK+
	   1.3[.x], use UTF-8, so we should be able to make that work
	   reasonably well.  Doing so *might* fix *some* of the problems
	   people are reporting on Windows.

	   Recent versions of Qt use Unicode or UTF-8, so a KDE version
	   should be able to handle that, if we do one.

	   I don't know offhand what Aqua uses, but I wouldn't be
	   surprised if you could get it to use Unicode or UTF-8.

	   You can use Unicode for applications running on Windows NT
	   (NT 4.0, 2K, XP, .NET Server), so any native Windows GUI (or
	   Packetyzer) should be able to make that work.  Windows OT
	   (95, 98, Me) is another matter; there is the "Microsoft Layer
	   for Unicode on Windows 95/98/Me Systems":

		http://msdn.microsoft.com/library/default.asp?url=/library/en-us/win9x/unilayer_4wj7.asp

	   which might help - however, that *might* also affect non-GUI
	   APIs, causing them to use Unicode as well.  If so, we'd have
	   to deal with that somehow.

	   Text output gets tricky.  On Windows, if you do a "print to
	   file" in Network Monitor 2.0, it prints out a Unicode text
	   file (which is a bit annoying if I wanted an ASCII text file,
	   although "tr"ing it on UNIX can end that annoyance by
	   stripping out the extra null bytes).  We could, I guess, do
	   that on Windows for Tethereal and printing, although we might
	   have to further Windowsify the printing code to make that
	   work right.

	   On UNIX, if we can find some way to translate from Unicode or
	   UTF-8 to the locale's character set, we could do that for
	   Tethereal and printing.  The iconv library *might* handle
	   that, but only if the native iconv library handles UTF-8 or
	   Unicode, and I'm not sure all of them do.  I seem to remember
	   some version of Solaris needing a special add-on developer's
	   pack for UTF-8 support, so iconv might not handle it in that
	   and earlier Solaris versions, although I think Solaris 8
	   handles it natively.  Failing that, we might have to require
	   GNU iconv on platforms that lack a version of iconv that can
	   handle Unicode or UTF-8.
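
As a rough sketch of the iconv approach for UNIX text output, the routine below converts a UTF-8 string to a named character set using the POSIX iconv_open()/iconv()/iconv_close() calls. The function name "utf8_to_charset()" is hypothetical, and the sketch assumes the target character set is passed in by the caller; in practice it would presumably come from the locale (for example via nl_langinfo(CODESET)). A failure from iconv_open() is exactly the "native iconv can't handle UTF-8" case discussed above.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Convert the NUL-terminated UTF-8 string "in" to "to_charset", writing
 * a NUL-terminated result into "out".  Returns the number of bytes
 * written, not counting the NUL, or -1 if the conversion is unavailable
 * or fails.
 */
long
utf8_to_charset(const char *to_charset, const char *in,
                char *out, size_t out_size)
{
    iconv_t cd;
    char *inp = (char *)in;     /* iconv() takes a non-const pointer */
    char *outp = out;
    size_t in_left = strlen(in);
    size_t out_left;
    size_t rc;

    if (out_size == 0)
        return -1;
    out_left = out_size - 1;    /* leave room for the NUL */

    cd = iconv_open(to_charset, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;              /* this iconv can't do the conversion */

    rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    if (rc != (size_t)-1) {
        /* Flush any shift state; matters only for stateful encodings. */
        rc = iconv(cd, NULL, NULL, &outp, &out_left);
    }
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;              /* bad input, unconvertible char, or no room */

    *outp = '\0';
    return (long)(outp - out);
}
```

What to do on failure (fall back to escapes, drop the character, or refuse to print) is a separate policy decision that this sketch doesn't try to answer.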

> Currently they can
> either be displayed as a normal string in which case you get the first
> character, or as a bunch of bytes which isn't very attractive.

Or you could de-Unicodeize them and use FT_STRING-family types, which is
better than a poke in the eye with a sharp stick, but doesn't handle
non-ASCII characters.  I think we do that in some places.
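
For illustration, a de-Unicodeizing routine that also applies the escape-sequence idea mentioned earlier might look something like the sketch below. This is not the code the existing dissectors use; the function name "ucs2le_to_ascii()" and the "\uXXXX" escape format are assumptions made for the example. It assumes little-endian input, copies printable ASCII through, and escapes everything else (including the backslash itself, so the output stays unambiguous).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Narrow "len" bytes of little-endian 2-byte Unicode at "src" to a
 * NUL-terminated ASCII string in "dst".  Returns the number of bytes
 * written, not counting the NUL; stops early if "dst" would overflow.
 */
size_t
ucs2le_to_ascii(const uint8_t *src, size_t len, char *dst, size_t dst_size)
{
    size_t i, n = 0;

    if (dst_size == 0)
        return 0;
    for (i = 0; i + 1 < len; i += 2) {
        unsigned c = (unsigned)(src[i] | (src[i + 1] << 8));

        if (c >= 0x20 && c < 0x7F && c != '\\') {
            /* Printable ASCII: copy as-is. */
            if (n + 1 >= dst_size)
                break;
            dst[n++] = (char)c;
        } else {
            /* Anything else: six-character escape, e.g. \u00E9. */
            if (n + 6 >= dst_size)
                break;
            n += (size_t)snprintf(dst + n, dst_size - n, "\\u%04X", c);
        }
    }
    dst[n] = '\0';
    return n;
}
```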