Wireshark mailing list archives

Wrongly escaped UTF-8 characters in JSON values ( epan/print.c )


From: Andrea Lo Pumo <alopumo () movia biz>
Date: Thu, 5 Jul 2018 16:01:05 +0200

I am using "tshark -T json -V -r file.pcap" and specifically I am looking
for the gsm_sms.sms_text field.
I get this output:

"gsm_sms.sms_text": "Ok per\u00c3\u00b2 non piove"

Instead, using "tshark -V -r file.pcap" I get:

SMS text: Ok però non piove

(There is an accent in the "o" of "però")

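The problem is that a \uXXXX escape denotes a UTF-16 code unit (see [1]),
while the "ò" in the capture is UTF-8, encoded as the two bytes c3 b2.
Wireshark escapes each of those bytes as if it were a UTF-16 code unit,
producing \u00c3\u00b2 ("Ã²") instead of \u00f2.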
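
To make the mismatch concrete, here is a minimal sketch of what a correct
\uXXXX escape would require: decoding each UTF-8 sequence to its code point
first, then writing that code point as one UTF-16 code unit (or a surrogate
pair above U+FFFF). The names utf8_decode() and json_escape_non_ascii() are
hypothetical helpers written for this illustration, not Wireshark functions,
and the sketch ignores quotes, backslashes and control characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: decode one UTF-8 sequence of at most len bytes.
 * Stores the code point in *cp and returns the bytes consumed, or 0 if
 * the sequence is malformed, truncated, overlong, or a surrogate. */
static size_t utf8_decode(const unsigned char *s, size_t len, unsigned long *cp)
{
    size_t n, i;
    unsigned long c, min;

    if (len == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;
    if (len < n) return 0;
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return n;
}

/* Hypothetical helper: copy the NUL-terminated UTF-8 string s into out,
 * writing every non-ASCII character as a \uXXXX escape.
 * Returns 0 on success, -1 on malformed input or if out is too small. */
static int json_escape_non_ascii(const char *s, char *out, size_t outsz)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len = strlen(s), used = 0;

    assert(out != NULL && outsz > 0);
    out[0] = '\0';
    while (len > 0) {
        unsigned long cp;
        size_t n = utf8_decode(p, len, &cp);
        int w;

        if (n == 0) return -1;
        if (cp < 0x80) {
            w = snprintf(out + used, outsz - used, "%c", (int)cp);
        } else if (cp <= 0xFFFF) {
            w = snprintf(out + used, outsz - used, "\\u%04lx", cp);
        } else {
            /* Above the BMP: split into a UTF-16 surrogate pair. */
            cp -= 0x10000;
            w = snprintf(out + used, outsz - used, "\\u%04lx\\u%04lx",
                         0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
        }
        if (w < 0 || (size_t)w >= outsz - used) return -1;
        used += (size_t)w;
        p += n;
        len -= n;
    }
    return 0;
}
```

With this approach the bytes c3 b2 decode to U+00F2 and are written as a
single \u00f2, rather than as one escape per byte.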

I solved the problem by changing print_escaped_bare() in epan/print.c as
follows. Replace

        default:
            if (g_ascii_isprint(*p))
                fputc(*p, fh);
            else {
                g_snprintf(temp_str, sizeof(temp_str), "\\u00%02x", (guint8)*p);
                fputs(temp_str, fh);
            }

with

        default:
            fputc(*p, fh);

I do not know the Wireshark code, so I am not submitting a patch. This,
however, should work because JSON supports UTF-8 (see again [1]).
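
One caveat with writing every byte unchanged: JSON still requires control
characters below 0x20 to be escaped inside strings. The following is a
minimal sketch of a pass-through variant that keeps that escaping, assuming
the input is already valid UTF-8. print_json_utf8() is a hypothetical
stand-alone function written for illustration. It is not the actual
print_escaped_bare() code, and it handles quotes and backslashes itself only
so that the example is complete.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not Wireshark code: write the NUL-terminated string s
 * to fh as the body of a JSON string. Assumes s is already valid UTF-8. */
static void print_json_utf8(FILE *fh, const char *s)
{
    const unsigned char *p;

    assert(fh != NULL && s != NULL);
    for (p = (const unsigned char *)s; *p != '\0'; p++) {
        if (strchr("\"\\", *p) != NULL) {
            /* Quote and backslash need a backslash prefix. */
            fputc('\\', fh);
            fputc(*p, fh);
        } else if (*p < 0x20) {
            /* Control characters must be escaped in JSON strings. */
            fprintf(fh, "\\u%04x", (unsigned int)*p);
        } else {
            /* Printable ASCII and UTF-8 bytes (>= 0x80) pass through. */
            fputc(*p, fh);
        }
    }
}
```

Calling print_json_utf8(stdout, "Ok per\xc3\xb2 non piove") writes the text
with the two bytes c3 b2 intact, so a JSON parser reading the output as
UTF-8 sees "ò".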

[1] From the JSON page on Wikipedia: JSON exchange in an open ecosystem
must be encoded in UTF-8 <https://en.wikipedia.org/wiki/UTF-8>. However, if
escaped, those characters must be written using UTF-16
<https://en.wikipedia.org/wiki/UTF-16> surrogate pairs, a detail missed by
some JSON parsers.
___________________________________________________________________________
Sent via:    Wireshark-dev mailing list <wireshark-dev () wireshark org>
Archives:    https://www.wireshark.org/lists/wireshark-dev
Unsubscribe: https://www.wireshark.org/mailman/options/wireshark-dev
             mailto:wireshark-dev-request () wireshark org?subject=unsubscribe
