On 03/13/2013 02:16 PM, Marc Grondin wrote:
> Good Afternoon,

Hello, and thanks for the report.

>
> My client was attempting to run the command : od -c on this xml file (sample only)
> ------------------------------------------------------------------------------
>
>

Here, you are representing a character in UTF-8

> He was getting this output :
> ------------------------------------------------------------------------------
> 0000000 < ? x m l v e r s i o n =
> 0000020 ' 1 . 0 ' e n c o d i n g =
> 0000040 ' U T F - 8 ' ? > \n < t o p >
> 0000060 \n < x > � � � < / x > \n

and here, you were running od in a different character set:

> This all based on the LANG env. He was using :
> LANG=en_US.iso88591, instead of
> LANG=en_US.UTF-8

In ISO-8859-1, every byte is a character, and those particular bytes happen to be printable, so od was faithfully passing each one through as a printable character, only for your UTF-8 terminal to then show them as an invalid UTF-8 sequence. Mismatching character sets between your program and your terminal is always a recipe for confusion.

However, you HAVE identified a bug, in our documentation.

>
> ------------------------------------------------------------------------------
>
> Question :
> Since the output is based on the ASCII character set, should it not, in both cases give a numerical output (as it did in scenario #2)
> for a symbol outside the ascii/extended-ascii character set ?

Our documentation is lying. Here's what POSIX says about od -c:

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/od.html
"Interpret bytes as characters specified by the current setting of the LC_CTYPE category. Certain non-graphic characters appear as C escapes: "NUL=\0" , "BS=\b" , "FF=\f" , "NL=\n" , "CR=\r" , "HT=\t" ; others appear as 3-digit octal numbers."

Nothing in there restricts the output to ASCII only.

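To make that rule concrete, here is a rough Python sketch of the per-byte decision the POSIX text describes. It is not coreutils code: the names format_od_c, is_graphic and C_ESCAPES are made up for illustration, it assumes a single-byte character set, and it uses Python's str.isprintable() as a stand-in for the locale's notion of a graphic character.

```python
# Sketch only: approximates the od -c per-byte decision for single-byte
# character sets. All names here are hypothetical, not coreutils internals.

# The six C escapes listed in the POSIX description of od -c.
C_ESCAPES = {
    0x00: "\\0", 0x08: "\\b", 0x0C: "\\f",
    0x0A: "\\n", 0x0D: "\\r", 0x09: "\\t",
}


def is_graphic(byte, charset):
    """Assumption: a byte is 'graphic' if it decodes to a printable character."""
    try:
        ch = bytes([byte]).decode(charset)
    except UnicodeDecodeError:
        return False  # byte is not a character at all in this charset
    return ch.isprintable()


def format_od_c(data, charset):
    """Return one output cell per input byte, in the spirit of od -c."""
    cells = []
    for b in data:
        if b in C_ESCAPES:
            cells.append(C_ESCAPES[b])          # named C escape
        elif is_graphic(b, charset):
            cells.append(bytes([b]).decode(charset))  # shown as-is
        else:
            cells.append("%03o" % b)            # 3-digit octal
    return cells


if __name__ == "__main__":
    data = "\u4e38".encode("utf-8")         # the three bytes e4 b8 b8
    print(format_od_c(data, "latin-1"))     # ISO-8859-1: all three are graphic
    print(format_od_c(data, "ascii"))       # ASCII only: all three become octal
```

Under ISO-8859-1 the three bytes come back as three separate printable characters, which is exactly what a UTF-8 terminal then fails to display; under an ASCII-only character set the same bytes fall through to the octal branch.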
The bytes that are showing up as � are graphic characters in your current choice of LC_CTYPE, so no escaping is performed (escaping applies only to non-graphic characters). If your terminal were using the same character set that you ran od under, you would see proper graphic characters from the ISO-8859-1 set (but then again, you wouldn't see the nice 丸 character that the UTF-8 was representing).

Coreutils is properly obeying the locale; what is wrong is the info documentation, which stated:

`-c'
     Output as ASCII characters or backslash escapes.

In reality, that should state something like:

     Output as characters in the current locale, using octal sequences or backslash escapes for all non-graphic bytes.

Meanwhile, if you want to guarantee ASCII-only output from od, you have to use a different format, such as -b or -tx1, or use LC_ALL=C on a system where the C locale does not treat non-ASCII bytes as graphic characters (most glibc systems, including the one you are using, fit this bill).

-- 
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org