Wireshark-dev: Re: [Wireshark-dev] guint8* and gchar* ... and Vim ?! :)

From: Guy Harris <guy@xxxxxxxxxxxx>
Date: Thu, 14 Dec 2006 11:49:35 -0800
Sebastien Tandel wrote:

   is there any reason to use guint8* instead of gchar*?

For what purpose?

If you're dealing with an array of 8-bit bytes, or a pointer to a sequence of those, guint8 is the right type; it makes it clear that they're bytes, not characters (it might be binary, it might be a sequence of 16-bit "bytes" in a UTF-16-encoded string, it might be a UTF-8 string, etc.).

So tvb_get_ptr(), for example, should return a "guint8 *", as should tvb_memdup(), and the raw packet data you get from Wiretap should be pointed to by a "guint8 *".
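As a rough illustration of why raw packet data wants an unsigned byte type, here is a minimal sketch. It uses uint8_t as a stand-in for guint8 so that it builds without GLib, and read_be16() and read_be16_signed() are hypothetical helpers, not Wireshark functions. The signed variant shows how a byte with the 8th bit set corrupts the result on the usual two's-complement platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for GLib's guint8, so the sketch builds without GLib. */
typedef uint8_t byte_t;

/* Hypothetical helper: read a 16-bit big-endian value from raw bytes.
   Each byte is promoted to a non-negative int, so the OR is safe. */
static uint16_t read_be16(const byte_t *p)
{
    assert(p != NULL);
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* The same expression over signed chars. If the low byte has its 8th bit
   set, it is promoted to a negative int and its sign bits overwrite the
   high byte of the result. */
static uint16_t read_be16_signed(const signed char *p)
{
    assert(p != NULL);
    return (uint16_t)((p[0] << 8) | p[1]);
}
```

With the bytes 0x01 0x80, read_be16() gives 0x0180, while the signed variant gives 0xFF80 when handed the same bit patterns as signed chars.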

Note also that you can safely pass a guint8 or guchar to one of the <ctype.h> routines, but you can't safely pass a gchar to them, as it might get sign-extended into a negative value if the 8th bit is set (I think that none of the popular platforms for Windows and modern UN*Xes have C compilers with "char" as an unsigned type, so I think "might" can be replaced by "will" in practice).
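A minimal sketch of the <ctype.h> point, in plain C rather than GLib types: count_printable() is a hypothetical helper that takes its data as plain char, as much existing code does, and converts each element to unsigned char before calling isprint(). Without that conversion, a byte with the 8th bit set could be passed as a negative int, which the <ctype.h> routines are not required to handle.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count the printable characters in a buffer that is
   held as plain char. */
static size_t count_printable(const char *s, size_t len)
{
    size_t n = 0;

    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        /* Convert to unsigned char first, so that a byte with the 8th bit
           set is passed as 128..255 rather than as a negative value. */
        if (isprint((unsigned char)s[i]))
            n++;
    }
    return n;
}
```

If the buffer were declared as an array of guint8 or guchar in the first place, the cast would be unnecessary.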

   With gcc-4.0, there is the new feature warning you that "pointer target differs in signedness" (which is not such a bad thing).

I suspect most of those warnings are for cases where you're treating byte sequences as character strings.

What I think we *really* need to do, for those cases, is have a different way of handling strings. The current way we handle strings doesn't take into account the fact that there are a number of different character encodings for strings - "ASCII" (which would imply that a byte with the 8th bit set is an error), ISO 8859/n, the EUC encodings, Shift-JIS, KOI8, UTF-8, UTF-16, etc.

See the first item under "Dissector infrastructure" on the

	http://wiki.wireshark.org/Development/Wishlist

page. (That discusses two items - the dissector APIs for handling strings, and the UI aspects of this. The former doesn't require the latter - we can continue to display non-ASCII characters as escape sequences - but the latter, which is something we should ultimately do, requires some way of getting all strings from packets translated into Unicode.)
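To make the translation step concrete, here is a minimal sketch for one encoding only. latin1_to_utf8() is a hypothetical helper, not an existing or proposed Wireshark API, and uint8_t again stands in for guint8. It assumes the caller already knows the bytes are ISO 8859-1, which is the piece of information the current string handling doesn't carry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert ISO 8859-1 bytes to a NUL-terminated UTF-8
   string. Stops early if the output buffer is too small. Returns the number
   of bytes written, not counting the terminating NUL. */
static size_t latin1_to_utf8(const uint8_t *in, size_t inlen,
                             char *out, size_t outsize)
{
    size_t o = 0;

    assert(in != NULL && out != NULL && outsize > 0);
    for (size_t i = 0; i < inlen; i++) {
        uint8_t b = in[i];

        if (b < 0x80) {
            /* ASCII range: one byte, unchanged. */
            if (o + 1 >= outsize)
                break;
            out[o++] = (char)b;
        } else {
            /* 0x80..0xFF map to U+0080..U+00FF: two UTF-8 bytes. */
            if (o + 2 >= outsize)
                break;
            out[o++] = (char)(0xC0 | (b >> 6));
            out[o++] = (char)(0x80 | (b & 0x3F));
        }
    }
    out[o] = '\0';
    return o;
}
```

The same input byte would need a different mapping under KOI8 or Shift-JIS, which is why the input is typed as bytes and the encoding has to be supplied separately.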

   May we change these guint8* to gchar*? I mean, may we change the type of the variables concerned, rather than adding a cast to every function call?

Which ones are you thinking of? We shouldn't globally replace guint8 with gchar, for the reasons given at the beginning of this message.