Wireshark mailing list archives

Re: UTF8 vs. locale in error messages (bug 5715)


From: Guy Harris <guy () alum mit edu>
Date: Mon, 11 Jul 2011 16:54:57 -0700


On Jul 11, 2011, at 4:00 PM, Stephen Fisher wrote:

The popular SecureCRT terminal emulator defaults to "default" (same as 
local system) character encoding, at least on Windows systems.  This is 
not compatible with UTF-8 in my experience.

Not surprising, given that "default"/"same as local system" probably means "local code page".  Win32 first appeared in
NT 3.1 in 1993, and Unicode first appeared in 1991 (and Microsoft joined the group doing it in 1990, at least according
to the Wikipedia article), so Win32 could support Unicode from Day One, and Microsoft could get away with saying "if
you want Unicode, you have to use the Unicode versions of the APIs, and strings are UCS-2 in those versions of the
APIs", with the legacy "ASCII"/"ANSI" APIs using code pages.  UN*X didn't have that advantage, so UN*X systems support
Unicode using UTF-8 rather than with Shiny New APIs.
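
For the record, that A-vs-W split shows up as two parallel entry points for just about every string-taking call; here
is a minimal sketch of the two console-write flavors: WriteConsoleW takes UTF-16 wchar_t text regardless of any code
page, while WriteConsoleA runs the bytes through the console's current code page.  (The strings are just made-up
examples.)

        #include <windows.h>
        #include <string.h>
        #include <wchar.h>

        int main(void)
        {
            HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
            DWORD written;

            /* Legacy "ANSI" flavor: the bytes are interpreted in the console's
               current (non-Unicode) code page. */
            const char *a = "local-code-page text\n";
            WriteConsoleA(out, a, (DWORD)strlen(a), &written, NULL);

            /* Unicode flavor: the buffer is UTF-16 (originally UCS-2) wchar_t,
               independent of any code page. */
            const wchar_t *w = L"UTF-16 text: \x00e9 \x4e2d\x6587\n";
            WriteConsoleW(out, w, (DWORD)wcslen(w), &written, NULL);

            return 0;
        }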

So, on Windows, consoles, whether from Microsoft or third parties, probably tend to use the local code page when
they're not using UCS-2/UTF-16 characters.  For what it's worth, the Wikipedia article on the Win32 console:

        http://en.wikipedia.org/wiki/Win32_console

claims that

        Under Windows NT and CE based versions of Windows, the screen buffer uses four bytes per character cell: two
bytes for character code, two bytes for attributes. The character is then encoded as a 16-bit subset of Unicode
(UCS-2).[2] For backward compatibility, the console APIs exist in two versions: Unicode and non-Unicode. The
non-Unicode versions of APIs can use code page switching to extend the range of displayed characters (but only if
TrueType fonts are used for the console window, thereby extending the range of codes available). Even UTF-8 is
available as "code page 65001".

At least according to

        http://msdn.microsoft.com/en-us/library/ms683458(v=VS.85).aspx

the device-independent I/O functions ReadFile() and WriteFile() (for UN*X folks, think read() and write()) don't 
support Unicode:

        High-level I/O gives you a choice between the ReadFile and WriteFile functions and the ReadConsole and 
WriteConsole functions. They are identical, except for two important differences. The console functions support the use 
of either Unicode characters or the ANSI character set; the file I/O functions do not support Unicode. Also, the file 
I/O functions can be used to access files, pipes, and serial communications devices; the console functions can only be 
used with console handles. This distinction is important if an application relies on standard handles that may have 
been redirected.

and I suspect that the C library _read() and _write() functions, and the "standard I/O library" functions that are 
presumably built atop them, probably ultimately run atop ReadFile() and WriteFile(), so that they're device-independent.
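
That last sentence about redirected standard handles is what forces the usual dance of asking whether the handle
really is a console before picking an API; here's a sketch of that check (the emit_message() helper and the message
text are made up):

        #include <windows.h>
        #include <string.h>
        #include <wchar.h>

        /* Write a message to the standard error handle: WriteConsoleW when the
           handle really is a console, WriteFile (plain bytes in whatever encoding
           we've settled on) when it has been redirected to a file or pipe. */
        static void emit_message(const wchar_t *wmsg, const char *byte_msg)
        {
            HANDLE h = GetStdHandle(STD_ERROR_HANDLE);
            DWORD mode, written;

            if (GetConsoleMode(h, &mode)) {
                /* Real console: WriteConsoleW takes UTF-16 directly. */
                WriteConsoleW(h, wmsg, (DWORD)wcslen(wmsg), &written, NULL);
            } else {
                /* Redirected handle: the console functions won't work, so fall
                   back to the device-independent WriteFile. */
                WriteFile(h, byte_msg, (DWORD)strlen(byte_msg), &written, NULL);
            }
        }

        int main(void)
        {
            emit_message(L"example: something went wrong\n",
                         "example: something went wrong\n");
            return 0;
        }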

On UN*X, you probably get similar behavior, *mutatis mutandis* (e.g., replacing "the system code page setting" with
"the code set portion of the setting of LANG or LC_CTYPE" or whatever), so we can't guarantee, on Windows or UN*X,
that whatever gets printed with printf() or fprintf() can always be emitted as UTF-8, so

        1) we'd have to translate it to the appropriate character encoding (a sketch of doing that on UN*X follows below)

and

        2) not all Unicode characters can necessarily be represented in that encoding.
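
On the UN*X side, the translation in 1) would presumably be built on nl_langinfo(CODESET) and iconv(); a sketch
(utf8_to_locale() is just a made-up helper name), which also shows where 2) bites, since iconv() fails outright when
the locale's code set can't represent some character:

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <locale.h>
        #include <langinfo.h>
        #include <iconv.h>

        /* Convert a UTF-8 string to the current locale's code set for printing.
           Returns a malloc()ed string, or NULL if conversion fails, e.g. when
           some character can't be represented in that code set. */
        static char *utf8_to_locale(const char *utf8)
        {
            const char *codeset = nl_langinfo(CODESET);  /* e.g. "UTF-8", "ISO-8859-1" */
            iconv_t cd = iconv_open(codeset, "UTF-8");
            size_t inlen, outlen, outsize;
            char *out, *outp;

            if (cd == (iconv_t)-1)
                return NULL;

            inlen = strlen(utf8);
            outsize = inlen * 4 + 1;        /* worst-case growth, plus a NUL */
            out = outp = malloc(outsize);
            outlen = outsize - 1;
            if (out == NULL ||
                iconv(cd, (char **)&utf8, &inlen, &outp, &outlen) == (size_t)-1) {
                free(out);
                iconv_close(cd);
                return NULL;
            }
            *outp = '\0';
            iconv_close(cd);
            return out;
        }

        int main(void)
        {
            setlocale(LC_ALL, "");          /* honor LANG/LC_CTYPE */
            char *msg = utf8_to_locale("r\xC3\xA9seau");
            printf("%s\n", msg != NULL ? msg : "(not representable in this locale)");
            free(msg);
            return 0;
        }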

In the best of all possible worlds, all UN*X systems would be configured to use UTF-8 encoding and all Windows systems 
would be configured to use code page 65001, but....
___________________________________________________________________________
Sent via:    Wireshark-dev mailing list <wireshark-dev () wireshark org>
Archives:    http://www.wireshark.org/lists/wireshark-dev
Unsubscribe: https://wireshark.org/mailman/options/wireshark-dev
             mailto:wireshark-dev-request () wireshark org?subject=unsubscribe

