Wireshark mailing list archives

Re: FT_STRING, FT_STRINGZPAD, and null padding


From: John Thacker <johnthacker () gmail com>
Date: Sat, 5 Sep 2020 12:51:09 -0400

On Sat, Sep 5, 2020 at 2:49 AM Guy Harris <gharris () sonic net> wrote:

OK, here's the types of strings that there can be in protocols.

1. Counted strings.  A counted string has a count of the number of octets
(it might be half the number of octets if it's UTF-16 or UCS-2, but it's
really a count of octets).  There is nothing special about null characters,
other than "if some software treats the string as a C string, it won't work
correctly if the string happens to include a null character, so maybe that
software shouldn't treat it as a C string".

2. Null-terminated, non-counted strings.  A null-terminated, non-counted,
string has no count of the number of octets; it runs until a null
terminator (8-bit, 16-bit, whatever) is seen.  Obviously, that string can't
contain a null terminator.

3. Padded strings in a fixed-size buffer.  If the string is the length of
the buffer, there is no null terminator; otherwise, there's a null
terminator; the protocol may specify that all octets after the null
terminator, if the string isn't the length of the buffer - 1, must be null,
or may say there is no such requirement.

4. Strings that are counted *and* null-terminated.  This is redundant, but
maybe somebody wants to make sure that there's a null at the end, to make
life easier for implementations that use C strings internally.  This
doesn't *really* make things easier, unless you're in a 100% controlled
environment, because, in any environment in which the software has to deal
with packets from buggy or malicious senders, and uses C strings
internally, they have to make sure there's a null terminator *anyway* -
they *can't* just copy the string and blithely assume that a terminating
null is part of the counted blob.

In the case of a counted string, if the string is a 7-octet value of "foo\0bar",
it should show up as exactly that in the packet tree item; if not, we
should fix it to do so.


So this does not happen, at least not when proto_tree_add_item() is used on
a FT_STRING.
Things that call tvb_format_text() will end up producing "foo\000bar"
(because \0 is not one of the special cases handled, it defaults to using
octal in format_text() in epan/strutil.c, which is used for non-printable
but valid ASCII characters. However, proto_tree_add_item() does not do that.

I tested this by hexediting a pcap (from here:
https://gitlab.com/wireshark/wireshark/-/issues/16208) to put a NUL in the
middle of a counted FT_STRING, and it looked like this:

I had applied the patch to allow SSIDs to be UTF-8 in that bug, it looks
like this:

Tag: SSID parameter set: admini\000tración
    Tag Number: SSID parameter set (0)
    Tag length: 15
    SSID: admini
        [Expert Info (Warning/Undecoded): Trailing stray characters]

(admini\000tración also showed up in the FT_COLUMN). In the first entry,
where \000 shows up, tvb_format_text() is used explicitly and then that
text is appended. In the one where it is cut off, proto_tree_add_item() is
used. So that is a bug that should be fixed, then, we want to display the
trailing characters as well as warn about them?

Currently, we use FT_STRING for the first of those, FT_STRINGZ for the
second of those, FT_STRINGZPAD for the third of those, and FT_STRINGZ for
the fourth of those; the difference between the two FT_STRINGZ cases is
whether a length was specified or not.

That helps, but isn't there also FT_UINT_STRING for the common case of
counted strings where the count of the number of octets appears immediately
before the string? It seems like mostly a helper combining "add a
FT_UINT[...] and return its value, then add a FT_STRING with that length."
So ideally detect_trailing_stray_characters() should also operate on the
string part of FT_UINT_STRING, and that just hasn't been added, correct?

Finally, detect_trailing_stray_characters() and tvb_format_string() also
don't do the right thing for counted strings because they (or their
callers) assume that the length in bytes of the string when being checked
is the same as the count in octets passed in originally. That's true before
get_ascii_string() or other places substitute the replacement character,
which is three bytes in UTF-8, for individual bytes with the high bit set
(for ASCII), or for illegal Unicode sequences that can be from 1 to 4
bytes, and thus changes the length of the string. So in the case of
something like "f\xf6\xf6\0bar" (intended to be "föö\0bar" in ISO-8859-1,
presumably) will be converted to "f��\0bar" which is 11 octets in length
for the 7 characters, detect_trailing_stray_characters() will still be
called with a length of 7 bytes (not UTF-8 characters), and thus won't
notice the trailing "bar." Though at the same time there should probably be
an expert info warning any time the UTF-8 replacement character is present
as a hint that maybe the encoding is wrong.

We can see that be doing the same thing and *not* applying the patch to
allow SSIDs to always be UTF-8. Treating it as ASCII, we see:

Tag: SSID parameter set: admini\000traci�
    Tag Number: SSID parameter set (0)
    Tag length: 15
    SSID: admini
        [Expert Info (Warning/Undecoded): Trailing stray characters]

(admini\000traci� also showed up in the FT_COLUMN). What's going on is that
"\xc3\xb3" ("ó" in UTF-8) is not recognized as ASCII and is being replaced
with "\xef\xbf\xbd\xef\xbf\bd", two copies of the 3 byte UTF-8
representation of UNREPR, so then some of the other functions that read the
string stop after the first three replacement character bytes (reaching the
length of 15 octets) and never get to the second replacement character or
to the 'n."

Conversely, if I put non ASCII characters earlier in the string, I can get
this nonsense:

Tag: SSID parameter set: a����i\000
    Tag Number: SSID parameter set (0)
    Tag length: 15
    SSID: a����i

Where the parts that call tvb_format_string() stop after 15 bytes are
written, which is right at the first null but before the trailing
characters that are in the 15 octet field, the part where
proto_tree_add_item() stops after before the first NUL, and the expert info
for detect_trailing_stray_characters() doesn't show up for the same reason
that tvb_format_string() stops in time-- it reaches 15 octets in the
substituted/validated string:

That's for a field with hexdump:

0000 61 c3 b3 c3 b3 69 00 74 72 61 63 69 c3 b3 6e a....i.t raci..n

which with the UTF-8 patch looks like a different sort of nonsense:

Tag: SSID parameter set: a����i\000
    Tag Number: SSID parameter set (0)
    Tag length: 15
    SSID: aóói
        [Expert Info (Warning/Undecoded): Trailing stray characters]

for something that looks like  aóói\0tración when interpreted as UTF-8

I do plan to file a bug or two, just wanted to check on the expected
behavior.

John Thacker
___________________________________________________________________________
Sent via:    Wireshark-dev mailing list <wireshark-dev () wireshark org>
Archives:    https://www.wireshark.org/lists/wireshark-dev
Unsubscribe: https://www.wireshark.org/mailman/options/wireshark-dev
             mailto:wireshark-dev-request () wireshark org?subject=unsubscribe

Current thread: