Wireshark-bugs: [Wireshark-bugs] [Bug 1372] Wireshark doesn't support non-ASCII strings well
Date: Tue, 13 Feb 2007 18:49:38 +0000 (GMT)
http://bugs.wireshark.org/bugzilla/show_bug.cgi?id=1372

guy@xxxxxxxxxxxx changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|Minor                       |Enhancement
         OS/Version|Linux                       |All
           Platform|PC                          |All
            Summary|samba unicode filenames     |Wireshark doesn't support
                   |displayed as ASCII          |non-ASCII strings well

------- Comment #1 from guy@xxxxxxxxxxxx  2007-02-13 18:49 GMT -------
The underlying problem is that non-ASCII strings are not supported well in
Wireshark/TShark. Here's the item in the Wireshark Wiki (on the
Development/Wishlist page) about this:

A way to handle strings in arbitrary character sets would be useful. A string
value might contain:

   1. a length in bytes, and a pointer to an array of that many bytes,
      containing the raw data from the packet;
   2. an indication of the encoding of that array of bytes (UTF-8,
      little-endian 16-bit Unicode, big-endian 16-bit Unicode, ISO 8859/1,
      ISO 8859/2, Windows code page XXX, MacRoman, etc.);
   3. a length in bytes, and a pointer to that many bytes, containing a UTF-8
      translation of the string.

When the string is fetched, only the first two of those would be filled in.
The only reason to translate a string to UTF-8 would be to display it or to
compare it against another string in a filter expression; most strings in a
protocol tree probably won't be used in a filter expression, and if the only
reason why the protocol tree is being generated is to evaluate a filter
expression, the string won't be displayed.
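The three-part string value described above could be sketched roughly as follows. This is only an illustration in Python, not Wireshark code: the class name PacketString, its fields, and the use of Python codec names as the encoding indication are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PacketString:
    """Hypothetical string value: raw bytes, an encoding label, lazy UTF-8."""
    raw: bytes        # 1. raw data from the packet (len(raw) is the byte length)
    encoding: str     # 2. encoding of the raw bytes (a Python codec name here)
    _utf8: Optional[bytes] = field(default=None, repr=False)  # 3. cached UTF-8

    @property
    def translated(self) -> bool:
        """True once the UTF-8 translation has been produced."""
        return self._utf8 is not None

    def utf8(self) -> bytes:
        """Translate to UTF-8 on first use (display or filter) and cache it."""
        if self._utf8 is None:
            text = self.raw.decode(self.encoding, errors="replace")
            self._utf8 = text.encode("utf-8")
        return self._utf8


# Fetching the string from the packet fills in only the first two parts.
name = PacketString(raw="r\u00e9sum\u00e9.txt".encode("utf-16-le"),
                    encoding="utf-16-le")
untouched_after_fetch = not name.translated

# A filter comparison (or a display request) triggers the translation.
matches = name.utf8() == "r\u00e9sum\u00e9.txt".encode("utf-8")
```

Strings that are never displayed or filtered on never pay for the conversion, which is the point of keeping the third part empty until it is needed.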

We would probably have our own private copy of iconv and its databases, so
that we don't have to rely on the OS having iconv and information about the
relevant character sets (a UN*X version might not know about all the Windows
code pages we might want to handle, for example - and even if it did, I don't
know whether there are standard names for character encodings, so we couldn't
rely on a particular encoding having a particular name; the Single UNIX
Specification says the encoding names are implementation-dependent, and I
think I've seen some HP-UX documentation giving names for some encodings that
are different from the names used by GNU iconv).

Should we identify character sets using the values from
http://www.iana.org/assignments/character-sets? Does that include all the DOS
and Windows code pages, and all the Macintosh character sets, we'd need to
support? It appears to include EBCDIC in various national forms, as well as
various ISO 8859-n and EUC, but I'm not sure it has all the old Mac character
sets.

Displaying UTF-8 or UTF-16 or UTF-32 strings should be easy in GTK+ 2.x, as
the string routines take UTF-8 strings. It's harder in GTK+ 1.2[.x]; see the
GDK 1.2.x documentation on fonts
(http://developer.gnome.org/doc/API/gdk/gdk-fonts.html) for at least some of
the painful details.

For other GUIs, if we do native versions (there's another item in the wishlist
about that):

o Windows' Unicode interfaces can draw UTF-16 strings (older releases might
  handle only the Basic Multilingual Plane, not all of Unicode), but you might
  need the Microsoft Layer for Unicode (MSLU) (see
  http://www.microsoft.com/globaldev/handson/dev/mslu_announce.mspx) on
  Windows 95/98/Me; we've dropped support for Windows 95/98/Me, so that should
  no longer be an issue. Building a native Windows Wireshark using the Unicode
  APIs means, however, that we'd get file names in Unicode, so we'd have to
  handle those, e.g. using the GLib wrappers that take path names in GLib's
  string encoding.

  (Were we to revive the Windows 95/98/Me support, we wouldn't necessarily be
  able to use the MSLU; at least according to the Open Layer for Unicode
  (Opencow) site, the licensing terms for MSLU are not compatible with the
  GPL; the license for MSLU requires that you prevent people from
  redistributing the MSLU. Even if the terms were compatible, and even if
  bundling it with the GPL'ed Wireshark counts as "mere aggregation" so that
  it doesn't have to be GPL'ed, using it from Wireshark might present a
  problem, as it probably wouldn't be counted as a "system library". Opencow
  is a free-software replacement for MSLU. MSLU also apparently requires an
  "import library" to allow your executable not to care whether it's running
  on Windows 95/98/Me or on NT 4.0/2K/XP/2K3/Vista/etc.; the import library is
  part of the Platform SDK - there's a free replacement for it, libunicows.)

o Qt 3.x uses QStrings as the values returned by a QListViewItem (a single
  item in a QListView, which would be used to implement the packet list and
  detail view), which can be constructed from UTF-8 or UTF-16(?) strings, so
  KDE 3.0 and later should handle Unicode strings.

o OS X natively uses Unicode; NSStrings, used in Cocoa to supply string data
  for NSTableView (for the packet list) and NSOutlineView (for the packet
  detail), can be created from UTF-8 text strings.

BTW, any stuff we can't display (invalid UTF-8 sequences, characters in some
non-Unicode character set not found in Unicode) should probably be turned into
Unicode FFFD, the "REPLACEMENT CHARACTER", which displays as a white question
mark in a black diamond and is intended precisely for that use. I guess we can
display that as "?" on a dumb terminal....
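
The REPLACEMENT CHARACTER fallback suggested above might look something like the following. It is a rough Python sketch rather than anything from Wireshark: the function name to_display and its dumb_terminal flag are hypothetical, and it relies on Python's standard "replace" error handler, which substitutes U+FFFD for bytes that cannot be decoded.

```python
REPLACEMENT = "\ufffd"  # Unicode FFFD, REPLACEMENT CHARACTER


def to_display(raw: bytes, encoding: str, dumb_terminal: bool = False) -> str:
    """Decode packet bytes for display, never failing on bad input.

    Undecodable bytes become U+FFFD; on a terminal that cannot draw
    that character they are shown as a plain "?" instead.
    """
    text = raw.decode(encoding, errors="replace")
    if dumb_terminal:
        text = text.replace(REPLACEMENT, "?")
    return text


# 0xFF can never appear in valid UTF-8, so it is replaced.
gui_text = to_display(b"abc\xffdef", "utf-8")
tty_text = to_display(b"abc\xffdef", "utf-8", dumb_terminal=True)
```

One caveat with the "?" substitution is that a genuine question mark in the data becomes indistinguishable from a replaced byte, which is presumably acceptable only where U+FFFD really cannot be drawn.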