This has been discussed before, but no formal decisions were made on the
matter. We need the ability to support non-ASCII character sets in
Wireshark, in particular Unicode.
I know of at least two bugs off the top of my head that would be fixed
by adding Unicode support in Wireshark (1827 & 1867). Another bug,
#1372, is titled "Wireshark doesn't support non-ASCII strings well" and
refers to the UTF-16 nature of some Windows file sharing protocol
traffic. In that bug's case, our fake Unicode functions are mangling
the actual UTF-16 characters. Guy wrote a detailed comment on that bug
in Feb '07 describing one method we could use to handle arbitrary
character sets.
After some thought and research, I think it would be best to convert all
strings into UTF-8 once read in from disk/network/user and keep them in
UTF-8 all the way to display in GTK. Pango renders strings for GTK
and uses UTF-8, so GTK in turn expects UTF-8. In fact, Pango blows up
if you don't pass it a valid UTF-8 string. We're only getting by now
because plain ASCII happens to be a subset of UTF-8.
I would like to start implementing some Unicode support in Wireshark,
but we need to have a consensus first on going this way and how we're
going to tackle it. It should be possible to do it incrementally
without causing any problems.
The GLib documentation on Unicode support is here:
http://library.gnome.org/devel/glib/unstable/glib-Unicode-Manipulation.html
It offers a gunichar type whose characters are always 4 bytes long;
storing whole strings that way would be wasteful. It then goes on to
describe many UTF-8 handling functions that work on a typical
gchar/char string, since UTF-8 represents each multi-byte character
with only as many bytes as it needs (1, 2, 3 or 4).
Thoughts? Concerns?
Steve