GNU bug report logs -
#13947
bug report for core-utils command : OD
On 03/13/2013 09:34 PM, Eric Blake wrote:
> On 03/13/2013 02:16 PM, Marc Grondin wrote:
>> Good Afternoon,
>
> Hello, and thanks for the report.
>
>>
>> My client was attempting to run the command : od -c on this xml file (sample only)
>> ------------------------------------------------------------------------------
>> <?xml version = '1.0' encoding = 'UTF-8'?>
>> <top>
>> <x>丸</x>
>
> Here, you are representing a character in UTF-8
>
>> He was getting this output :
>> ------------------------------------------------------------------------------
>> 0000000 < ? x m l v e r s i o n =
>> 0000020 ' 1 . 0 ' e n c o d i n g =
>> 0000040 ' U T F - 8 ' ? > \n < t o p >
>> 0000060 \n < x > � � � < / x > \n
>
> and here, you were running od in a different character set:
>
>> This all based on the LANG env. He was using :
>> LANG=en_US.iso88591, instead of
>> LANG=en_US.UTF-8
>
> In ISO-8859-1, every byte is a character, and those particular bytes
> happen to be printable, so od was faithfully replaying the character as
> printable, only to then be shown by your UTF-8 terminal as an invalid
> UTF-8 sequence. Mismatching character sets between your program and
> your terminal is always a recipe for confusion.
>
> However, you HAVE identified a bug, in our documentation.
>
>>
>> ------------------------------------------------------------------------------
>>
>> Question :
>> Since the output is based on the ASCII character set, should it not, in both cases give a numerical output (as it did in scenario #2)
>> for a symbol outside the ascii/extended-ascii character set ?
>
> Our documentation is lying. Here's what POSIX says about od -c:
>
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/od.html
> "Interpret bytes as characters specified by the current setting of the
> LC_CTYPE category. Certain non-graphic characters appear as C escapes:
> "NUL=\0" , "BS=\b" , "FF=\f" , "NL=\n" , "CR=\r" , "HT=\t" ; others
> appear as 3-digit octal numbers."
>
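The escaping rule quoted above can be checked with a minimal sketch. It assumes a POSIX `printf` and `od`, run in the C locale so that the result does not depend on the caller's LC_CTYPE; the variable name `escapes_out` is only illustrative, and exact column spacing may differ between od implementations.

```shell
# Feed od the six bytes POSIX lists as C escapes (NUL BS FF NL CR HT),
# followed by one other non-graphic byte (octal 001) and a plain 'A'.
# -An suppresses the address column; -c selects character output.
escapes_out=$(printf '\000\b\f\n\r\t\001A' | LC_ALL=C od -An -c)

# The six named bytes should appear as backslash escapes, the 001 byte
# as a 3-digit octal number, and 'A' as itself.
printf '%s\n' "$escapes_out"
```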
> Nothing in there restricts the output to ASCII only. The bytes that are
> showing up as � are graphic characters in your current choice of
> LC_CTYPE, so there is no escaping performed (since escaping is permitted
> only on non-graphic characters). If your terminal was using the same
> character set as you ran od under, you would see proper graphical
> characters in the ISO-8859-1 set (but then again, you wouldn't see the
> nice 丸 character that the UTF-8 was representing).
>
> Coreutils is properly obeying the locale; what is wrong is the info
> documentation, which stated:
>
> `-c'
> Output as ASCII characters or backslash escapes.
I agree. Thanks for the detailed description.
> In reality, that should state something like:
> Output as characters in the current locale, using octal sequences
> or backslash escapes for all non-graphic bytes.
Note we output spaces, so I'd s/non-graphic/non-printable/.
Also, multi-byte characters are always going to be problematic to display
in a grid like this, so we'll probably continue to do as
we do now for the UTF-8 example above and output octal and dots.
Therefore, s/characters/single byte characters/.
>
> Meanwhile, if you want to guarantee ASCII-only output from od, you have
> to use a different format, such as -b or -tx1, or use LC_ALL=C on a
> system where the C locale does not treat non-ascii bytes as graphical
> characters (most glibc systems, including the one you are using, fit
> this bill).
>
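The three options mentioned above can be compared side by side. This is a sketch, not part of the original exchange: it assumes a POSIX `printf` and `od`, and a C locale that treats bytes above 0x7f as non-printable (as the message says is the case on most glibc systems). The variable names are illustrative; the octal escapes `\344\270\270` are the three UTF-8 bytes of 丸 from the sample file.

```shell
# One line of the sample XML, with the UTF-8 bytes of 丸 written as
# octal escapes so the script itself stays pure ASCII.
sample='<x>\344\270\270</x>\n'

# -c under the C locale: the non-ASCII bytes fall back to 3-digit octal.
c_out=$(printf "$sample" | LC_ALL=C od -An -c)

# -tx1 and -b print every byte numerically (hex and octal respectively),
# regardless of the locale in effect.
hex_out=$(printf "$sample" | od -An -tx1)
oct_out=$(printf "$sample" | od -An -b)

printf '%s\n' "$c_out" "$hex_out" "$oct_out"
```

All three outputs contain only ASCII, so they display the same way whatever character set the terminal uses.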
cheers,
Pádraig.
This bug report was last modified 12 years and 59 days ago.