Wireshark-bugs: [Wireshark-bugs] [Bug 1372] Wireshark doesn't support non-ASCII strings well

Date: Tue, 13 Feb 2007 18:49:38 +0000 (GMT)
http://bugs.wireshark.org/bugzilla/show_bug.cgi?id=1372


guy@xxxxxxxxxxxx changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|Minor                       |Enhancement
         OS/Version|Linux                       |All
           Platform|PC                          |All
            Summary|samba unicode filenames     |Wireshark doesn't support
                   |displayed as ASCII          |non-ASCII strings well




------- Comment #1 from guy@xxxxxxxxxxxx  2007-02-13 18:49 GMT -------
The underlying problem is that non-ASCII strings are not supported well in
Wireshark/TShark.

Here's the item in the Wireshark Wiki (on the Development/Wishlist page) about
this:

A way to handle strings in arbitrary character sets would be useful. A string
value might contain:

    1. a length in bytes, and a pointer to an array of that many bytes,
containing the raw data from the packet;

    2. an indication of the encoding of that array of bytes (UTF-8,
little-endian 16-bit Unicode, big-endian 16-bit Unicode, ISO 8859/1, ISO
8859/2, Windows code page XXX, MacRoman, etc.);

    3. a length in bytes, and a pointer to that many bytes, containing a UTF-8
translation of the string.

When the string is fetched, only the first two of those would be filled in. The
only reason to translate a string to UTF-8 would be to display it or to compare
it against another string in a filter expression; most strings in a protocol
tree probably won't be used in a filter expression, and if the only reason why
the protocol tree is being generated is to evaluate a filter expression, the
string won't be displayed.
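
The three-part string value and its lazy translation could be sketched in C
roughly as follows. This is only an illustration under stated assumptions, not
existing Wireshark code: every name in it (sketch_string, sketch_encoding,
sketch_string_utf8, and so on) is hypothetical, only three encodings are
handled, UTF-8 input is assumed to be valid already, and 16-bit Unicode is
treated as BMP-only, with surrogates turned into U+FFFD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical encoding tags; a real list would be far longer. */
typedef enum {
    SKETCH_ENC_UTF_8,
    SKETCH_ENC_ISO_8859_1,
    SKETCH_ENC_UCS_2_LE   /* little-endian 16-bit Unicode, BMP only here */
} sketch_encoding;

typedef struct {
    const uint8_t  *raw;      /* 1. raw packet bytes (borrowed, not owned) */
    size_t          raw_len;
    sketch_encoding encoding; /* 2. how to interpret those bytes */
    char           *utf8;     /* 3. UTF-8 translation; NULL until requested */
    size_t          utf8_len;
} sketch_string;

/* Write one BMP code point as UTF-8; returns the number of bytes written. */
static size_t sketch_put_utf8(char *out, uint32_t cp)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Fetching a string fills in only the first two parts. */
static void sketch_string_init(sketch_string *s, const uint8_t *raw,
                               size_t raw_len, sketch_encoding encoding)
{
    assert(raw != NULL || raw_len == 0);
    s->raw = raw;
    s->raw_len = raw_len;
    s->encoding = encoding;
    s->utf8 = NULL;
    s->utf8_len = 0;
}

/* Translate on first use and cache the result; NULL if malloc fails. */
static const char *sketch_string_utf8(sketch_string *s)
{
    size_t i, n = 0;
    char *buf;

    if (s->utf8 != NULL)
        return s->utf8;                 /* already translated */

    /* Worst case in this sketch is ISO 8859-1: 2 output bytes per byte. */
    buf = malloc(s->raw_len * 2 + 1);
    if (buf == NULL)
        return NULL;

    switch (s->encoding) {
    case SKETCH_ENC_UTF_8:
        if (s->raw_len > 0)
            memcpy(buf, s->raw, s->raw_len);   /* assumed valid, copied as-is */
        n = s->raw_len;
        break;
    case SKETCH_ENC_ISO_8859_1:
        for (i = 0; i < s->raw_len; i++)
            n += sketch_put_utf8(buf + n, s->raw[i]);
        break;
    case SKETCH_ENC_UCS_2_LE:
        for (i = 0; i + 1 < s->raw_len; i += 2) {
            uint32_t cp = (uint32_t)s->raw[i] | ((uint32_t)s->raw[i + 1] << 8);
            if (cp >= 0xD800 && cp <= 0xDFFF)
                cp = 0xFFFD;            /* surrogates are not handled here */
            n += sketch_put_utf8(buf + n, cp);
        }
        break;
    }
    buf[n] = '\0';
    s->utf8 = buf;
    s->utf8_len = n;
    return s->utf8;
}

/* Release the cached translation; the raw bytes are not owned. */
static void sketch_string_free(sketch_string *s)
{
    free(s->utf8);
    s->utf8 = NULL;
    s->utf8_len = 0;
}
```

A dissector would call sketch_string_init() when it fetches the field; only
the display code or the filter engine would ever call sketch_string_utf8(), so
strings that are never shown or compared cost no translation at all.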

We would probably have our own private copy of iconv and its databases, so
that we don't have to rely on the OS having iconv and information about the
relevant character sets. A UN*X version might not know about all the Windows
code pages we might want to handle, for example. Even if it did, I don't know
whether there are standard names for character encodings, so we couldn't rely
on a particular encoding having a particular name; the Single UNIX
Specification says the encoding names are implementation-dependent, and I think
I've seen some HP-UX documentation giving names for some encodings that are
different from the names used by GNU iconv.

Should we identify character sets using the values from
http://www.iana.org/assignments/character-sets? Does that include all the DOS
and Windows code pages, and all the Macintosh character sets, we'd need to
support? It appears to include EBCDIC in various national forms, as well as
various ISO 8859-n and EUC, but I'm not sure it has all the old Mac character
sets.

Displaying UTF-8 or UTF-16 or UTF-32 strings should be easy in GTK+ 2.x, as
the string routines take UTF-8 strings. It's harder in GTK+ 1.2[.x]; see the
GDK 1.2.x documentation on fonts
(http://developer.gnome.org/doc/API/gdk/gdk-fonts.html) for at least some of
the painful details.
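
On the question of identifying character sets by IANA registry values: one
possibility is to key an internal table on the registry's numeric MIBenum
rather than on any platform's iconv name. The sketch below assumes that
approach. The type and function names are hypothetical, the table is a tiny
sample rather than a complete list, and the MIBenum values should be checked
against the registry before anyone relies on them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: IANA MIBenum plus the registry's MIME name. */
typedef struct {
    int         mibenum;
    const char *name;
} sketch_charset;

/* A small sample only; values as I understand the IANA registry. */
static const sketch_charset sketch_charsets[] = {
    {    3, "US-ASCII"     },
    {    4, "ISO-8859-1"   },
    {    5, "ISO-8859-2"   },
    {  106, "UTF-8"        },
    { 1013, "UTF-16BE"     },
    { 1014, "UTF-16LE"     },
    { 2027, "macintosh"    },
    { 2252, "windows-1252" }
};

#define SKETCH_N_CHARSETS (sizeof sketch_charsets / sizeof sketch_charsets[0])

/* Return the name for a MIBenum, or NULL if it is not in the table. */
static const char *sketch_charset_name(int mibenum)
{
    size_t i;

    for (i = 0; i < SKETCH_N_CHARSETS; i++) {
        if (sketch_charsets[i].mibenum == mibenum)
            return sketch_charsets[i].name;
    }
    return NULL;
}

/*
 * Return the MIBenum for a name, or -1 if it is not in the table.
 * The comparison is case-sensitive here; registry names are meant to be
 * matched case-insensitively, which this sketch does not attempt.
 */
static int sketch_charset_mibenum(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < SKETCH_N_CHARSETS; i++) {
        if (strcmp(sketch_charsets[i].name, name) == 0)
            return sketch_charsets[i].mibenum;
    }
    return -1;
}
```

A private iconv copy could then be driven by the number, with the name used
only for display, so that nothing depends on what a given OS calls an
encoding.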

For other GUIs, if we do native versions (there's another item in the wishlist
about that):

    o Windows' Unicode interfaces can draw UTF-16 strings (older releases might
handle only the Basic Multilingual Plane, not all of Unicode), but you might
need the Microsoft Layer for Unicode (MSLU) (see
http://www.microsoft.com/globaldev/handson/dev/mslu_announce.mspx) on Windows
95/98/Me; we've dropped support for Windows 95/98/Me, so that should no longer
be an issue.  Building a native Windows Wireshark using the Unicode APIs means,
however, that we'd get file names in Unicode, so we'd have to handle those,
e.g. using the GLib wrappers that take path names in GLib's string encoding.
(Were we to revive the Windows 95/98/Me support, we wouldn't necessarily be
able to use the MSLU; at least according to the Open Layer for Unicode
(Opencow) site, the licensing terms for MSLU are not compatible with the GPL;
the license for MSLU requires that you prevent people from redistributing the
MSLU. Even if the terms were compatible, and even if bundling the MSLU with
the GPL'ed Wireshark counts as "mere aggregation" so that it doesn't have to
be GPL'ed, using it from Wireshark might still present a problem, as it
probably wouldn't be counted as a "system library". Opencow is a
free-software replacement for
MSLU. MSLU also apparently requires an "import library" to allow your
executable not to care whether it's running on Windows 95/98/Me or on NT
4.0/2K/XP/2K3/Vista/etc.; the import library is part of the Platform SDK -
there's a free replacement for it, libunicows.)

    o Qt 3.x uses QStrings as the values returned by a QListViewItem (a single
item in a QListView, which would be used to implement the packet list and
detail view), which can be constructed from UTF-8 or UTF-16(?) strings, so KDE
3.0 and later should handle Unicode strings.

    o OS X natively uses Unicode; NSStrings, used in Cocoa to supply string
data for NSTableView (for the packet list) and NSOutlineView (for the packet
detail), can be created from UTF-8 text strings.

BTW, any stuff we can't display (invalid UTF-8 sequences, characters in some
non-Unicode character set not found in Unicode) should probably be turned into
Unicode FFFD, the "REPLACEMENT CHARACTER", which displays as a white question
mark in a black diamond and is intended precisely for that use. I guess we can
display that as "?" on a dumb terminal....
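
A minimal sketch of that substitution, assuming the text has already been
handed over as bytes that are supposed to be UTF-8. The function names are
hypothetical, each offending byte becomes one U+FFFD (other replacement
policies exist), and the dumb-terminal fallback goes a step further than the
paragraph above by assuming an ASCII-only terminal and turning every non-ASCII
character into "?".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Length (1..4) of the valid UTF-8 sequence at p, or 0 if it is invalid. */
static size_t sketch_utf8_seq_len(const uint8_t *p, size_t avail)
{
    uint32_t cp, min;
    size_t need, i;

    if (avail == 0)
        return 0;
    if (p[0] < 0x80)
        return 1;
    if ((p[0] & 0xE0) == 0xC0)      { need = 2; cp = p[0] & 0x1F; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { need = 3; cp = p[0] & 0x0F; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { need = 4; cp = p[0] & 0x07; min = 0x10000; }
    else
        return 0;                   /* stray continuation or bad lead byte */

    if (avail < need)
        return 0;                   /* truncated sequence */
    for (i = 1; i < need; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < min)
        return 0;                   /* overlong encoding */
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;                   /* surrogate or out of range */
    return need;
}

/*
 * Copy the input, replacing each byte that does not start a valid sequence
 * with U+FFFD (EF BF BD). Returns a malloc'ed NUL-terminated string, or
 * NULL if malloc fails. Embedded NUL bytes are copied through unchanged.
 */
static char *sketch_utf8_sanitize(const uint8_t *in, size_t len)
{
    char *out;
    size_t i = 0, n = 0;

    assert(in != NULL || len == 0);
    out = malloc(len * 3 + 1);      /* worst case: every byte replaced */
    if (out == NULL)
        return NULL;
    while (i < len) {
        size_t k = sketch_utf8_seq_len(in + i, len - i);
        if (k == 0) {
            memcpy(out + n, "\xEF\xBF\xBD", 3);
            n += 3;
            i += 1;
        } else {
            memcpy(out + n, in + i, k);
            n += k;
            i += k;
        }
    }
    out[n] = '\0';
    return out;
}

/*
 * Dumb-terminal fallback: rewrite valid UTF-8 in place so that every
 * non-ASCII character (U+FFFD included) becomes a single '?'.
 */
static void sketch_utf8_to_ascii(char *s)
{
    const unsigned char *src = (const unsigned char *)s;
    char *dst = s;

    for (; *src != '\0'; src++) {
        if (*src < 0x80)
            *dst++ = (char)*src;    /* plain ASCII */
        else if (*src >= 0xC0)
            *dst++ = '?';           /* lead byte: one '?' per character */
        /* continuation bytes (0x80..0xBF) are dropped */
    }
    *dst = '\0';
}
```

The GUI code would display the output of sketch_utf8_sanitize() directly;
TShark on a terminal that can't show Unicode would pass the same string
through sketch_utf8_to_ascii() first.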

