GNU bug report logs - #13947
bug report for core-utils command : OD

Package: coreutils;

Reported by: Marc Grondin <marc.grondin <at> oracle.com>

Date: Wed, 13 Mar 2013 20:25:02 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

Message #8 received at 13947 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Marc Grondin <marc.grondin <at> oracle.com>
Cc: Mark.Jaeger <at> oracle.com, 13947 <at> debbugs.gnu.org
Subject: Re: bug#13947: bug report for core-utils command :  OD
Date: Wed, 13 Mar 2013 15:34:14 -0600
On 03/13/2013 02:16 PM, Marc Grondin wrote:
> Good Afternoon, 

Hello, and thanks for the report.

> 
> My client was attempting to run the command : od -c on this xml file (sample only) 
> ------------------------------------------------------------------------------
> <?xml version = '1.0' encoding = 'UTF-8'?>
> <top>
>    <x>丸</x>

Here, you are representing a character in UTF-8.

> He was getting this output : 
> ------------------------------------------------------------------------------
> 0000000   <   ?   x   m   l       v   e   r   s   i   o   n       =    
> 0000020   '   1   .   0   '       e   n   c   o   d   i   n   g       =
> 0000040       '   U   T   F   -   8   '   ?   >  \n   <   t   o   p   >
> 0000060  \n               <   x   >   �   �   �   <   /   x   >  \n    

and here, you were running od in a different character set:

> This all based on the LANG env.  He was using : 
> LANG=en_US.iso88591, instead of
> LANG=en_US.UTF-8 

In ISO-8859-1, every byte is a character, and those particular bytes
happen to be printable, so od was faithfully replaying each byte as a
printable character, only to then be shown by your UTF-8 terminal as an
invalid UTF-8 sequence.  Mismatching character sets between your program
and your terminal is always a recipe for confusion.
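
To see the effect for yourself, here is a minimal sketch that runs od -c
on the three UTF-8 bytes of 丸 under both locales.  The locale names are
assumptions (check `locale -a` for what is installed on your system), and
the `sample` helper and variable names exist only for this illustration.

```shell
# The three UTF-8 bytes of 丸 (U+4E38), given as octal escapes so the
# result does not depend on how this script itself is encoded.
sample() { printf '\344\270\270'; }

# Locale names are assumptions; if one is not installed, od cannot
# switch to it and behaves as in the C locale.
for loc in en_US.UTF-8 en_US.ISO-8859-1; do
    out=$(sample | env LC_ALL="$loc" od -An -c)
    printf '%-18s %s\n' "$loc" "$out"
done

# Keep the UTF-8 locale result in a variable for closer inspection.
utf8_out=$(sample | env LC_ALL=en_US.UTF-8 od -An -c)
```

On a glibc system, the UTF-8 run should show the octal values 344 270 270,
since single bytes above 127 are not graphic characters there.  The
ISO-8859-1 run, when that locale is installed, emits the raw bytes, which
a UTF-8 terminal then displays as invalid sequences.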

However, you HAVE identified a bug, in our documentation.

> 
> ------------------------------------------------------------------------------
> 
> Question : 
> Since the output is based on the ASCII character set, should it not, in both cases give a numerical output (as it did in scenario #2) 
> for a symbol outside the ascii/extended-ascii character set ? 

Our documentation is lying.  Here's what POSIX says about od -c:

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/od.html
"Interpret bytes as characters specified by the current setting of the
LC_CTYPE category. Certain non-graphic characters appear as C escapes:
"NUL=\0" , "BS=\b" , "FF=\f" , "NL=\n" , "CR=\r" , "HT=\t" ; others
appear as 3-digit octal numbers."
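
As a quick check of that wording, the following sketch feeds the six
named control characters, plus one other non-graphic byte (001, picked
here purely for illustration), to od -c.  The variable name is
hypothetical.

```shell
# NUL, BS, FF, NL, CR, HT, then byte 001 as an example of "others".
# tr -s ' ' squeezes od's column padding down to single spaces.
escapes=$(printf '\0\b\f\n\r\t\001' | od -An -c | tr -s ' ')

# printf '%s\n' is used instead of echo so the backslashes are printed
# literally in every shell.
printf '%s\n' "$escapes"
```

With GNU od, this should print the six C escapes followed by 001, in any
locale, because none of these bytes is graphic.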

Nothing in there restricts the output to ASCII only.  The bytes that are
showing up as � are graphic characters in your current choice of
LC_CTYPE, so there is no escaping performed (since escaping is permitted
only on non-graphic characters).  If your terminal were using the same
character set as you ran od under, you would see proper graphical
characters in the ISO-8859-1 set (but then again, you wouldn't see the
nice 丸 character that the UTF-8 was representing).

Coreutils is properly obeying the locale; what is wrong is the info
documentation, which states:

`-c'
     Output as ASCII characters or backslash escapes.

In reality, that should state something like:
     Output as characters in the current locale, using octal sequences
or backslash escapes for all non-graphic bytes.

Meanwhile, if you want to guarantee ASCII-only output from od, you have
to use a different format, such as -b or -tx1, or use LC_ALL=C on a
system where the C locale does not treat non-ASCII bytes as graphical
characters (most glibc systems, including the one you are using, fit
this bill).
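
A short sketch of those three options, applied to the UTF-8 bytes of 丸.
The `sample` helper and the variable names are made up for this example,
and the LC_ALL=C result assumes a glibc-style C locale as described above.

```shell
# The three UTF-8 bytes of 丸 (U+4E38), as octal escapes.
sample() { printf '\344\270\270'; }

# Numeric formats print each byte as a number, whatever the locale.
hex=$(sample | od -An -tx1)    # one-byte hexadecimal
oct=$(sample | od -An -b)      # one-byte octal

# -c under the C locale: bytes above 127 are non-graphic on glibc, so
# od falls back to 3-digit octal instead of emitting the raw bytes.
c_locale=$(sample | env LC_ALL=C od -An -c)

printf '%s\n' "$hex" "$oct" "$c_locale"
```

On a GNU/glibc system this should show e4 b8 b8 for the hexadecimal dump
and 344 270 270 for both octal forms, all of it plain ASCII.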

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


This bug report was last modified 12 years and 59 days ago.
