Wireshark-dev: Re: [Wireshark-dev] Replace TRUE/FALSE with proper ENC_* in proto_tree_add_item(

From: Guy Harris <guy@xxxxxxxxxxxx>
Date: Wed, 12 Oct 2011 13:16:51 -0700

On Oct 12, 2011, at 12:30 PM, Bill Meier wrote:

> I propose to do the following for
> the FT_STRING, FT_STRINGZ, FT_UINT_STRING "encoding" parameter:
> 
> Essentially: Specify a character encoding but specify endianness only where relevant.
> 
> Conversions:
> 1.  For other than FT_UINT_STRING, remove all existing TRUE/1/FALSE/0
>    & ENC_NA/ENC_BIG_ENDIAN/ENC_LITTLE_ENDIAN;

That's OK, modulo whether, for encodings that are sequences of octets (which means all of them, right now), the right thing to do is to specify no byte order or specify ENC_NA to say "for this particular encoding, the byte order doesn't matter".  My inclination might be to use ENC_NA.

> 2.  If there's no character encoding (ENC_ASCII, ...) specified
>    then use ENC_ASCII.
> 
>    As Guy noted re the choice of character encoding:
> > That, or ENC_UTF_8.  I suspect most new protocols support UTF-8;
> > older ones either only specify ASCII or use various legacy encodings.
> > Automated replacement will get it wrong for some protocols regardless
> > of whether we use ENC_ASCII or ENC_UTF_8; the question is which of
> > those would be worse, for some value of "worse".
> 
> I've no idea of which is "worse" (or how to decide) so I picked ENC_ASCII.

Currently, they behave the same.  At some point, ENC_UTF_8 will:

	if the string is valid UTF-8, display it correctly;

	if the string is not valid UTF-8, replace various invalid sequences with something such as the "substitute" character when it's displayed;

and ENC_ASCII will replace all octets with the 8th bit set with something such as the "substitute" character.
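
To make the difference concrete, here is a rough standalone sketch of the two display policies described above. It is not Wireshark code: the names show_ascii, show_utf8, utf8_seq_len and SUBST are hypothetical, '?' merely stands in for whatever "substitute" character would really be used, and the policy of replacing one octet at a time on invalid UTF-8 is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SUBST '?'   /* hypothetical stand-in for the "substitute" character */

/* ASCII policy: every octet with the 8th bit set becomes SUBST.
 * out must have room for len + 1 bytes. */
static void show_ascii(const unsigned char *in, size_t len, char *out)
{
    size_t i;

    assert(in != NULL && out != NULL);
    for (i = 0; i < len; i++)
        out[i] = (in[i] & 0x80) ? SUBST : (char)in[i];
    out[len] = '\0';
}

/* Length of the valid UTF-8 sequence starting at p, or 0 if the octets
 * there are not a valid sequence (bad lead octet, truncation, bad
 * continuation octet, overlong form, surrogate, or above U+10FFFF). */
static size_t utf8_seq_len(const unsigned char *p, size_t avail)
{
    unsigned char c = p[0];
    unsigned long cp;
    size_t n, i;

    if (c < 0x80)
        return 1;
    if (c >= 0xC2 && c <= 0xDF)      { n = 2; cp = c & 0x1F; }
    else if (c >= 0xE0 && c <= 0xEF) { n = 3; cp = c & 0x0F; }
    else if (c >= 0xF0 && c <= 0xF4) { n = 4; cp = c & 0x07; }
    else
        return 0;
    if (avail < n)
        return 0;
    for (i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (n == 3 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF)))
        return 0;
    if (n == 4 && (cp < 0x10000 || cp > 0x10FFFF))
        return 0;
    return n;
}

/* UTF-8 policy: valid sequences are copied through unchanged; each octet
 * that does not start a valid sequence becomes SUBST.
 * out must have room for len + 1 bytes. */
static void show_utf8(const unsigned char *in, size_t len, char *out)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    while (i < len) {
        size_t n = utf8_seq_len(in + i, len - i);

        if (n == 0) {
            out[o++] = SUBST;
            i++;
        } else {
            memcpy(out + o, in + i, n);
            o += n;
            i += n;
        }
    }
    out[o] = '\0';
}
```

With this sketch, a UTF-8 "café" (63 61 66 C3 A9) passes through show_utf8 unchanged but comes out of show_ascii with two substitute characters, while an ISO 8859-1 "café" (63 61 66 E9) gets one substitute character under either policy.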

With ENC_ASCII:

	people will probably be annoyed by the "substitute" character and either submit fixes to use the appropriate encoding or file bugs to request the appropriate encoding, which might involve adding support for the appropriate encoding if it's not UTF-8.

With ENC_UTF_8:

	people will probably be annoyed by the "substitute" character, or bogus characters, that you'll probably get for all strings that are neither ASCII nor valid UTF-8, and will either submit fixes or file bugs as in the previous item.

I'm not sure which would produce more annoyance and require more changes.  My guess is that:

	for protocols where the encoding is UTF-8, ENC_UTF_8 is (obviously) better;

	for other protocols, ENC_ASCII might not always be the right encoding (additional encodings would need to be added), but would probably produce a display that's more obviously wrong and where what's wrong is more obvious (i.e., both the fact that it's bad, and why it's bad, would be more obvious).

I *might* be inclined to go with ENC_ASCII as the first step even though it'd require more changes (e.g., to protocols where the encoding is UTF-8).
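
For what it's worth, the two conversion rules, with ENC_NA for the non-FT_UINT_STRING case and ENC_ASCII as the default character encoding, might be sketched as the small standalone helper below. It only produces the text of the new argument; the enum and the function name converted_encoding are hypothetical stand-ins, not the real ftenum values or any existing Wireshark function. It also assumes that FT_UINT_STRING keeps a byte order for its length field, and that the old TRUE/FALSE argument meant little-endian/big-endian.

```c
/* Hypothetical stand-ins for the three string field types;
 * these are not the real ftenum values. */
enum str_ft { STR_FT_STRING, STR_FT_STRINGZ, STR_FT_UINT_STRING };

/* Return the text of the replacement "encoding" argument.
 * old_arg is the legacy last argument: nonzero (TRUE) is assumed to have
 * meant little-endian, zero (FALSE) big-endian. */
static const char *converted_encoding(enum str_ft ft, int old_arg)
{
    /* Rule 1: no byte order for plain strings; say so with ENC_NA.
     * Rule 2: no character encoding was specified, so use ENC_ASCII. */
    if (ft != STR_FT_UINT_STRING)
        return "ENC_ASCII|ENC_NA";

    /* Assumption: the length prefix still needs a byte order. */
    return old_arg ? "ENC_ASCII|ENC_LITTLE_ENDIAN"
                   : "ENC_ASCII|ENC_BIG_ENDIAN";
}
```

So an FT_STRINGZ item that used to pass TRUE or FALSE would end up with ENC_ASCII|ENC_NA either way, and only FT_UINT_STRING items would carry the old byte order forward.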