Wireshark-dev: Re: [Wireshark-dev] How to print out string encoded data that contains nul chara

From: Guy Harris <guy@xxxxxxxxxxxx>
Date: Wed, 9 Apr 2014 14:24:53 -0700
On Apr 9, 2014, at 2:06 PM, "John Dill" <John.Dill@xxxxxxxxxxxxxxxxx> wrote:

> I have several character data fields that happen to contain sections of non-ascii binary data including nul characters.  I'd like to get a string display that shows all of the characters according to the length of the field, i.e.
> 
> 20 20 20 20 20 20 01 00 01 00 48 31 20 20 20 20
> 
> produces
> 
> "      \001\000\001\000H1    "
> 
> In proto.c, I see that all of the format_text calls use strlen(bytes) as the length.
> 
> case FT_STRING:
> case FT_STRINGZ:
> case FT_UINT_STRING:
>         bytes = (guint8 *)fvalue_get(&fi->value);
>         label_fill(label_str, hfinfo, format_text(bytes, strlen(bytes)));
> 
> What is the recommended way of creating a text string that uses the octal encoding '\xxx' for non-ASCII data including nul characters that uses the 'length' field of 'proto_tree_add_item'?

The right short-term way would be to use proto_tree_add_string_format_value() to add the field, and format the string's value yourself, using format_text() with a byte count rather than strlen().
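
As a rough illustration of the "format it yourself, using the field length" step, here is a minimal standalone sketch. escape_counted() is a hypothetical helper, not a Wireshark function; it stands in for calling format_text() with the field's byte count, and its escaping rules (printable ASCII copied, backslash doubled, everything else as \ooo) are assumptions chosen to reproduce the output asked for above.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for format_text() with an explicit length:
 * escape exactly `len` octets of `bytes` into `out`, never stopping at
 * an embedded NUL.  Printable ASCII is copied through, a backslash is
 * doubled, and every other octet becomes a three-digit octal escape.
 * Output is truncated (on an escape boundary) if `out` is too small.
 * Returns the number of characters written, excluding the final NUL.
 */
size_t
escape_counted(const unsigned char *bytes, size_t len,
               char *out, size_t out_size)
{
    size_t used = 0;

    if (out_size == 0)
        return 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = bytes[i];
        int printable = (c >= 0x20 && c <= 0x7e);
        size_t need = (c == '\\') ? 2 : (printable ? 1 : 4);

        if (used + need + 1 > out_size)
            break;                          /* no room left: truncate */

        if (c == '\\') {
            out[used++] = '\\';
            out[used++] = '\\';
        } else if (printable) {
            out[used++] = (char)c;
        } else {
            out[used++] = '\\';
            out[used++] = (char)('0' + ((c >> 6) & 3));
            out[used++] = (char)('0' + ((c >> 3) & 7));
            out[used++] = (char)('0' + (c & 7));
        }
    }
    out[used] = '\0';
    return used;
}
```

In a dissector, the escaped buffer would then be what you pass as the formatted value to proto_tree_add_string_format_value(), with the length taken from the field rather than from strlen().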

The right long-term way is to modify Wireshark so that this works.  The way we handle strings should probably be changed so that we:

	store the raw string octets as a counted array, along with the string encoding;

	convert the octets from the encoding to UTF-8 *with invalid octets and sequences shown as escapes* when displaying the strings;

	convert the octets from the encoding to UTF-8 with invalid octets and sequences shown as Unicode REPLACEMENT CHARACTERs when making the string available for processing by other software (e.g., "-T fields", etc.) (or somehow saying "this isn't a valid string in this encoding");

	somehow arrange that strings with invalid octets or sequences are *always* unequal to any character string in packet-matching expressions (display/read filters, color "filters", etc.), and perhaps allow strings to be compared against octet sequences (e.g. "foobar.name == 20:20:20:20:20:20:01:00:01:00:48:31:20:20:20:20" matches the raw octets of the string), and use that with "Prepare As Filter" etc.
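
To make the list above a bit more concrete, here is a small standalone sketch of what a counted string with those comparison rules might look like. Nothing here is existing Wireshark code: struct counted_string and all of the functions are hypothetical, the encoding is assumed to be ASCII (a real version would store the encoding alongside the octets), and "invalid" is assumed, for illustration only, to mean "not printable ASCII".

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string: raw octets plus their length.
 * Assumption: the encoding is ASCII, so it is not stored here. */
struct counted_string {
    const unsigned char *octets;
    size_t               len;
};

/* Assumption for this sketch: only printable ASCII is "valid". */
bool
octet_is_valid(unsigned char c)
{
    return c >= 0x20 && c <= 0x7e;
}

bool
counted_string_is_valid(const struct counted_string *s)
{
    for (size_t i = 0; i < s->len; i++) {
        if (!octet_is_valid(s->octets[i]))
            return false;
    }
    return true;
}

/* Text comparison: a string with invalid octets never matches. */
bool
counted_string_equals_text(const struct counted_string *s, const char *text)
{
    if (!counted_string_is_valid(s))
        return false;
    return strlen(text) == s->len &&
           (s->len == 0 || memcmp(s->octets, text, s->len) == 0);
}

/* Octet comparison: matches on the raw bytes, valid or not. */
bool
counted_string_equals_octets(const struct counted_string *s,
                             const unsigned char *octets, size_t len)
{
    return len == s->len &&
           (len == 0 || memcmp(s->octets, octets, len) == 0);
}

/* Export form: UTF-8 with each invalid octet replaced by U+FFFD.
 * Returns the number of bytes written, excluding the final NUL. */
size_t
counted_string_to_utf8(const struct counted_string *s,
                       char *out, size_t out_size)
{
    static const char replacement[] = "\xEF\xBF\xBD"; /* U+FFFD */
    size_t used = 0;

    if (out_size == 0)
        return 0;

    for (size_t i = 0; i < s->len; i++) {
        unsigned char c = s->octets[i];
        size_t need = octet_is_valid(c) ? 1 : 3;

        if (used + need + 1 > out_size)
            break;                      /* truncate rather than overflow */
        if (need == 1) {
            out[used++] = (char)c;
        } else {
            memcpy(out + used, replacement, 3);
            used += 3;
        }
    }
    out[used] = '\0';
    return used;
}
```

The display form would use escapes instead of the replacement character; the point of keeping the raw octets is that both forms, and the octet comparison, can be derived from the same stored value.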

Alternatively, if they're *not* really character strings, display them as a set of subfields, with the text part shown as strings and the binary data shown as whatever it is, e.g.

	Frobozz text 1: {blanks}
	Frobozz count 1: 1
	Frobozz count 2: 1
	Frobozz text 2: H1{and more blanks}

or whatever it is.
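
A sketch of that subfield split, under a guessed layout: a 6-octet text field, two 16-bit little-endian counts, and another 6-octet text field. That layout is only an assumption read off the sample octets; struct frobozz and frobozz_parse() are hypothetical names, and the real structure may well be different.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded form of the 16-octet field.
 * Assumed layout: text[6], uint16 LE, uint16 LE, text[6]. */
struct frobozz {
    char     text1[7];      /* 6 octets plus terminating NUL */
    uint16_t count1;
    uint16_t count2;
    char     text2[7];      /* 6 octets plus terminating NUL */
};

/* Split the raw field into its parts.
 * Returns 0 on success, -1 if fewer than 16 octets are available. */
int
frobozz_parse(const unsigned char *buf, size_t len, struct frobozz *out)
{
    if (len < 16)
        return -1;

    memcpy(out->text1, buf, 6);
    out->text1[6] = '\0';

    /* Assumed little-endian: "01 00" decodes to 1. */
    out->count1 = (uint16_t)(buf[6] | (buf[7] << 8));
    out->count2 = (uint16_t)(buf[8] | (buf[9] << 8));

    memcpy(out->text2, buf + 10, 6);
    out->text2[6] = '\0';
    return 0;
}
```

In a dissector, each part would instead be its own field added with proto_tree_add_item(), a string field for the text parts and an integer field for the counts, so that none of the string fields ever contains binary data.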