Wireshark mailing list archives

Re: Replace TRUE/FALSE with proper ENC_* in proto_tree_add_item() using a script.


From: Guy Harris <guy () alum mit edu>
Date: Wed, 12 Oct 2011 13:16:51 -0700


On Oct 12, 2011, at 12:30 PM, Bill Meier wrote:

I propose to do the following for
the FT_STRING, FT_STRINGZ, FT_UINT_STRING "encoding" parameter:

Essentially: Specify a character encoding but specify endianness only where relevant.

Conversions:
1.  For other than FT_UINT_STRING, remove all existing TRUE/1/FALSE/0
   & ENC_NA/ENC_BIG_ENDIAN/ENC_LITTLE_ENDIAN;

That's OK, modulo whether, for encodings that are sequences of octets (which means all of them, right now), the right 
thing to do is to specify no byte order or specify ENC_NA to say "for this particular encoding, the byte order doesn't 
matter".  My inclination might be to use ENC_NA.

2.  If there's no character encoding (ENC_ASCII, ...) specified
   then use ENC_ASCII.

   As Guy noted re the choice of character encoding:
That, or ENC_UTF_8.  I suspect most new protocols support UTF-8;
older ones either only specify ASCII or use various legacy encodings.
Automated replacement will get it wrong for some protocols regardless
of whether we use ENC_ASCII or ENC_UTF_8; the question is which of
those would be worse, for some value of "worse".

I've no idea of which is "worse" (or how to decide) so I picked ENC_ASCII.
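
A minimal sketch of conversions 1 and 2 above, written as a self-contained C helper that rewrites the text of the "encoding" argument. The function name convert_string_encoding, the local enum, and the string-based approach are hypothetical and only illustrate the rules; this is not the actual conversion script. Two points are assumptions: the non-FT_UINT_STRING result carries ENC_NA (the inclination stated above), and FT_UINT_STRING keeps a byte order, with the old TRUE/1 read as little-endian.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical tags for this sketch only; no Wireshark headers are used. */
enum field_type { FT_STRING, FT_STRINGZ, FT_UINT_STRING };

/* Nonzero if the token is a legacy boolean or a byte-order flag. */
static int is_byte_order_token(const char *tok)
{
    static const char *const names[] = {
        "TRUE", "FALSE", "1", "0",
        "ENC_NA", "ENC_BIG_ENDIAN", "ENC_LITTLE_ENDIAN"
    };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        if (strcmp(tok, names[i]) == 0)
            return 1;
    return 0;
}

/*
 * Rewrite the textual encoding argument of a string field, e.g.
 * "TRUE" or "ENC_UTF_8|ENC_BIG_ENDIAN". Returns 0 on success and
 * -1 if the input is too long or the output buffer is too small.
 */
static int convert_string_encoding(enum field_type type, const char *old_arg,
                                   char *out, size_t out_size)
{
    char copy[128];
    const char *charset = "ENC_ASCII"; /* conversion 2: default charset */
    const char *order = "ENC_NA";      /* assumption: "byte order doesn't matter" */

    assert(old_arg != NULL && out != NULL);
    if (strlen(old_arg) >= sizeof copy)
        return -1;
    strcpy(copy, old_arg);

    for (char *tok = strtok(copy, "| \t"); tok != NULL;
         tok = strtok(NULL, "| \t")) {
        if (!is_byte_order_token(tok)) {
            charset = tok;             /* keep an explicit character encoding */
        } else if (type == FT_UINT_STRING) {
            /* Assumption: the length prefix still needs a byte order. */
            if (strcmp(tok, "TRUE") == 0 || strcmp(tok, "1") == 0 ||
                strcmp(tok, "ENC_LITTLE_ENDIAN") == 0)
                order = "ENC_LITTLE_ENDIAN";
            else if (strcmp(tok, "ENC_NA") != 0)
                order = "ENC_BIG_ENDIAN";
        }
        /* Conversion 1: for other types the byte order is simply dropped. */
    }

    int n = snprintf(out, out_size, "%s|%s", charset, order);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

Under these assumptions, an FT_STRING field that passed TRUE ends up with ENC_ASCII|ENC_NA, while an FT_STRINGZ field that already named ENC_UTF_8 keeps it and only loses its byte order.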

Currently, ENC_ASCII and ENC_UTF_8 behave the same.  At some point, ENC_UTF_8 will:

        if the string is valid UTF-8, display it correctly;

        if the string is not valid UTF-8, replace various invalid sequences with something such as the "substitute" 
character when it's displayed;

and ENC_ASCII will replace all octets with the 8th bit set with something such as the "substitute" character.
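
To make the difference concrete, here is a rough, self-contained C sketch of the two planned display behaviours. The function names sanitize_ascii and sanitize_utf8, the single-byte substitute parameter, and the "replace one octet and resynchronize" policy for bad UTF-8 are all assumptions for illustration; none of this is Wireshark code, and the eventual implementation may choose a different substitute character or replacement policy.

```c
#include <assert.h>
#include <stddef.h>

/* Length (1-4) of the valid UTF-8 sequence starting at s, or 0 if invalid. */
static size_t utf8_sequence_length(const unsigned char *s, size_t remaining)
{
    size_t len;
    unsigned long cp, min;

    if (remaining == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                                  /* plain ASCII octet */
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; min = 0x10000; }
    else
        return 0;                                  /* stray continuation or bad lead */

    if (remaining < len)
        return 0;                                  /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                              /* not a continuation octet */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return len;
}

/* ENC_ASCII-style: every octet with the 8th bit set becomes the substitute. */
static void sanitize_ascii(unsigned char *buf, size_t len, unsigned char substitute)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        if (buf[i] & 0x80)
            buf[i] = substitute;
}

/* ENC_UTF_8-style: valid sequences pass through, invalid octets are replaced. */
static void sanitize_utf8(unsigned char *buf, size_t len, unsigned char substitute)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        size_t n = utf8_sequence_length(buf + i, len - i);
        if (n == 0)
            buf[i++] = substitute;                 /* replace one octet, resync */
        else
            i += n;
    }
}
```

With '?' standing in for the substitute character, the UTF-8 octets for "café" (0xC3 0xA9 for the last letter) pass through sanitize_utf8 untouched but lose both trailing octets under sanitize_ascii, whereas the Latin-1 form (a single 0xE9) is replaced by both functions.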

With ENC_ASCII:

        people will probably be annoyed by the "substitute" character and either submit fixes to use the appropriate 
encoding or file bugs to request the appropriate encoding, which might involve adding support for the appropriate 
encoding if it's not UTF-8.

With ENC_UTF_8:

        people will probably be annoyed by the "substitute" characters, or bogus characters, they'll probably get for all 
strings that are neither ASCII nor UTF-8, and will either submit fixes or file bugs, as in the ENC_ASCII case above.

I'm not sure which would produce more annoyance and require more changes.  My guess is that:

        for protocols where the encoding is UTF-8, ENC_UTF_8 is (obviously) better;

        for other protocols, ENC_ASCII might not always be the right encoding (additional encodings would need to be 
added), but would probably produce a display that's more obviously wrong and where what's wrong is more obvious (i.e., 
both the fact that it's bad, and why it's bad, would be more obvious).

I *might* be inclined to go with ENC_ASCII as the first step even though it'd require more changes (e.g., to protocols 
where the encoding is UTF-8).

