Wireshark-dev: [Wireshark-dev] Unicode support

From: Stephen Fisher <stephentfisher@xxxxxxxxx>
Date: Sat, 6 Dec 2008 18:17:10 -0700
This has been discussed before, but no formal decisions were made on the 
matter.  We need the ability to support non-ASCII character sets in 
Wireshark, in particular Unicode.

I know of at least two bugs off the top of my head that would be fixed 
by adding Unicode support in Wireshark (1827 & 1867).  Another bug, 
#1372, is titled "Wireshark doesn't support non-ASCII strings well" and 
refers to the UTF-16 nature of some Windows file sharing protocol 
traffic.  In that bug's case, our fake Unicode functions are mangling 
the actual UTF-16 characters.  Guy wrote a detailed comment on that bug 
in Feb '07 describing one method we could use to handle arbitrary 
character sets.

After some thought and research, I think it would be best to convert all 
strings into UTF-8 once read in from disk/network/user and keep them in 
UTF-8 all the way to display in GTK.  Pango renders text for GTK and 
expects UTF-8, so GTK in turn uses UTF-8.  In fact, Pango blows up if 
you don't pass it a valid UTF-8 string.  We're only getting by now 
because plain ASCII is a subset of UTF-8.
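
To make the "convert once on input" idea concrete, here is a rough 
sketch of turning UTF-16LE bytes (as seen in some Windows file sharing 
traffic) into UTF-8.  The function name utf16le_to_utf8 is hypothetical 
and is not an existing Wireshark or GLib routine.  Replacing unpaired 
surrogates with U+FFFD is an assumption on my part, not a decided 
policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an existing Wireshark/GLib function).
 * Converts `len` bytes of UTF-16LE from `in` into NUL-terminated UTF-8
 * in `out`.  Returns the number of bytes written, not counting the NUL,
 * or (size_t)-1 if `out` is too small.  A trailing odd byte is ignored.
 * Assumption: unpaired surrogates are replaced with U+FFFD. */
size_t utf16le_to_utf8(const uint8_t *in, size_t len,
                       char *out, size_t out_size)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    if (out_size == 0)
        return (size_t)-1;

    while (i + 1 < len) {
        uint32_t cp = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);
        size_t need;
        i += 2;

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: combine with a following low surrogate. */
            uint32_t lo = 0;
            if (i + 1 < len)
                lo = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD; /* low surrogate with no preceding high one */
        }

        need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (o + need + 1 > out_size)
            return (size_t)-1;

        if (need == 1) {
            out[o++] = (char)cp;
        } else if (need == 2) {
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (char)(0xF0 | (cp >> 18));
            out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }

    out[o] = '\0';
    return o;
}
```

In practice we would more likely lean on GLib's own conversion 
routines than hand-roll this; the sketch is only meant to show that the 
result is an ordinary NUL-terminated char string that can be carried 
through unchanged to the display code.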

I would like to start implementing some Unicode support in Wireshark, 
but we need to have a consensus first on going this way and how we're 
going to tackle it.  It should be possible to do it incrementally 
without causing any problems.

The GLib documentation on Unicode support is here:

  http://library.gnome.org/devel/glib/unstable/glib-Unicode-Manipulation.html

It offers gunichar characters that are always 4 bytes long.  Using 
those for every string would be wasteful, since most of our text is 
ASCII.  It then goes on to describe many UTF-8 handling functions that 
operate on an ordinary gchar/char string, since UTF-8 represents each 
character with as many bytes as it needs (1, 2, 3 or 4).
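
To illustrate the variable-length point, the sketch below derives the 
sequence length from the lead byte and counts characters in a plain 
char string.  The names utf8_seq_len and utf8_char_count are 
hypothetical; GLib provides g_utf8_strlen() for the counting part, and 
this standalone version is only meant to show the idea.  It assumes the 
input is already valid UTF-8 and does not validate continuation bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that
 * starts with `lead`, or 0 if `lead` cannot start a sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1; /* 0xxxxxxx: plain ASCII */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;                            /* continuation or invalid */
}

/* Hypothetical helper: counts characters (code points) in a
 * NUL-terminated string that is assumed to be valid UTF-8.
 * A stray byte that cannot start a sequence is counted as one
 * character so the loop always makes progress. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    while (*s != '\0') {
        size_t len = utf8_seq_len((unsigned char)*s);
        size_t k;

        if (len == 0)
            len = 1;
        /* Advance over the sequence, never stepping past the NUL. */
        for (k = 0; k < len && *s != '\0'; k++)
            s++;
        count++;
    }
    return count;
}
```

A string holding "A", U+00E9 and U+20AC takes 6 bytes in UTF-8 but 
would take 12 bytes as an array of 4-byte characters, and pure ASCII 
text stays at one byte per character.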

Thoughts? Concerns?


Steve