Wireshark mailing list archives
Re: utf8 support on http dissectors
From: Roberto Ayuso <roberto.ayuso () gmail com>
Date: Mon, 19 Mar 2018 12:53:09 +0100
Thanks.

Really, I mean two fields, http.file_data and http.request_uri; both can contain non-ASCII characters, but the source code treats them only as ASCII. Could an option be added to manage that?

Best,
Roberto.

2018-03-19 9:54 GMT+01:00 Guy Harris <guy () alum mit edu>:
(Don't CC individual developers on messages to wireshark-dev; we're all on that list, and we shouldn't be singled out, as none of us individually "own" this issue.)

On Mar 18, 2018, at 11:28 PM, Roberto Ayuso <roberto.ayuso () gmail com> wrote:

    I have seen that http dissector only manages content on ASCII, I modified the source for my project changing it with ENC_UTF_8 on http.request_uri and http.data

    Can you consider put it as an option on the tshark command line? I have no enough skills to do by myself.

For request/response fields and headers:

To quote RFC 7230:

    Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.

RFC 2047 is "MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text", which describes the "=?iso-8859-1?q?this=20is=20some=20text?=" mechanism used to encode non-ASCII - and not necessarily UTF-8 - text in mail message headers.

So:

1) There appear to be "extended ASCII" encodings other than UTF-8 that have been used in HTTP requests and replies, so an option of that sort should perhaps allow more than just UTF-8 to be specified as the "default" encoding. (It would be implemented as a preference for the HTTP dissector, so it would allow a setting on the command line such as "-o http.charset=utf-8", but would also be settable through the GUI in Wireshark.)

2) Are there HTTP headers that are not in ASCII and that don't use percent-escaping for the non-ASCII characters?
3) RFC 3986 seems to be at least suggesting that percent-escape sequences in URLs represent UTF-8 encodings of characters (rather than, say, ISO 8859-n encodings, for some value of n); if that's the case, it would probably be appropriate to display the URL exactly as it appears in the message, *but* to also provide, as a separate field, the result of unescaping, *if* the result is valid UTF-8.

For the body:

There is no such field as "http.data". Did you mean "http.file_data", or something else?

The Content-Type header should, if the body is text, indicate what character encoding is used, e.g.

    Content-Type: text/plain;charset=utf-8

To quote RFC 2046:

    4.1.2. Charset Parameter

    A critical parameter that may be specified in the Content-Type field for "text/plain" data is the character set. This is specified with a "charset" parameter, as in:

        Content-type: text/plain; charset=iso-8859-1

    Unlike some other parameter values, the values of the charset parameter are NOT case sensitive. The default character set, which must be assumed in the absence of a charset parameter, is US-ASCII.

so if there's no "charset=", the character set must be assumed to be ASCII, not UTF-8.
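
As a rough illustration of point 3 and of the charset rule for bodies, here is a minimal sketch in Python rather than in the dissector's C code. The helper names unescape_if_utf8 and body_charset are hypothetical and are not Wireshark functions; the sketch only shows the logic, under the assumption that a failed strict UTF-8 decode means "show no unescaped field".

```python
from email.message import Message
from urllib.parse import unquote_to_bytes


def unescape_if_utf8(raw_uri):
    """Hypothetical helper: percent-unescape a URI and return the text
    only if the unescaped octets are valid UTF-8; otherwise return None
    so the caller can show just the URI as it appeared on the wire."""
    unescaped = unquote_to_bytes(raw_uri)
    try:
        return unescaped.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


def body_charset(content_type):
    """Hypothetical helper: return the charset parameter of a
    Content-Type value in lower case (the parameter value is not case
    sensitive), falling back to US-ASCII when no charset is given,
    as RFC 2046 requires."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset("us-ascii")


if __name__ == "__main__":
    for uri in ("/caf%C3%A9", "/caf%E9"):
        print(uri, "->", unescape_if_utf8(uri))
    for value in ("text/plain;charset=utf-8", "text/plain"):
        print(value, "->", body_charset(value))
```

The second URI carries an ISO 8859-1 style escape, so the strict decode fails and only the raw form would be displayed; a Content-Type with no "charset=" yields us-ascii rather than utf-8.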