GNU bug report logs - #8871
Bug with "sort -i" ?

Package: coreutils;

Reported by: Al Bogner <suse-linux <at> ml082.pinguin.uni.cc>

Date: Wed, 15 Jun 2011 16:04:02 UTC

Severity: normal

Tags: notabug

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

From: Eric Blake <eblake <at> redhat.com>
To: Al Bogner <suse-linux <at> ml082.pinguin.uni.cc>
Cc: 8871 <at> debbugs.gnu.org
Subject: bug#8871: Bug with "sort -i" ?
Date: Wed, 15 Jun 2011 14:08:49 -0600
retitle 8871 RFE enhance sort --debug -i
tag 8871 wishlist
thanks

On 06/15/2011 09:42 AM, Al Bogner wrote:
> Hi,
> 
> this looks like a bug to me:

Thanks for the report.  However, most likely this is not a bug in sort,
but a misunderstanding on your part about how locales affect which bytes
(or byte sequences, in multi-byte locales) are deemed printable.

> 
> var="φθινόπωρο,κισσός,Φύλλο"
> 
> 
> echo "$var" | sed -e 's/.*/\L&/' -e 's/,/_/g' | tr '_' '\n' | \

Wow, that's a complex way to change comma into newline.  Why not just:

var="φθινόπωρο
κισσός
Φύλλο"
echo "$var" | sort ...

[I'm assuming you've distilled this from a larger example where the
complex processing was actually useful rather than starting from the
right string to begin with]
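If the comma-separated form does have to be the starting point, a single tr call is enough. The following is a minimal sketch; split_commas is a hypothetical helper name used only for illustration, and the lowercasing step is left out on the assumption that sort -f already folds case:

```shell
# Split a comma-separated list into one item per line.
# split_commas is a hypothetical helper name, not part of coreutils.
split_commas() {
  # The comma is a single ASCII byte, so tr can replace it safely
  # even when the surrounding text is multi-byte UTF-8.
  printf '%s\n' "$1" | tr ',' '\n'
}

var="φθινόπωρο,κισσός,Φύλλο"
split_commas "$var"
```

The output can then be piped straight into sort with whatever options are under test.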

> sort -f -u
> κισσός
> φθινόπωρο
> φύλλο
> 
> echo "$var" | sed -e 's/.*/\L&/' -e 's/,/_/g' | tr '_' '\n' | \
> sort -f -i -u
> φθινόπωρο

Let's put the new 'sort --debug' option to use to point out the
difference a locale makes (and note that on my machine, the C locale
deems non-ASCII bytes as non-printable, even though they still render
just fine on my terminal).  First, without -i:

$ echo "$var" | LC_ALL=en_US.UTF-8 sort --debug -fu
sort: using `en_US.UTF-8' sorting rules
κισσός
______
φθινόπωρο
_________
Φύλλο
_____
$ echo "$var" | LC_ALL=C sort --debug -fu
sort: using simple byte comparison
Φύλλο
__________
κισσός
____________
φθινόπωρο
__________________


Did you notice how the line lengths differ between the en_US.UTF-8
locale (which knows how to treat multi-byte characters as single
characters) and the C locale (where every byte is a character, and the
multi-byte UTF-8 entities are treated as multiple non-printable characters)?
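The byte counts behind those longer underlines can be checked independently of sort. This is a small sketch, assuming the words are stored as precomposed UTF-8 (two bytes per Greek letter); byte_len is a hypothetical helper name:

```shell
# Print the number of bytes in a string, regardless of locale.
# byte_len is a hypothetical helper name.
byte_len() {
  # wc -c always counts bytes; tr strips the padding some wc builds add.
  printf '%s' "$1" | wc -c | tr -d ' '
}

for word in κισσός φθινόπωρο Φύλλο; do
  printf '%s: %s bytes\n' "$word" "$(byte_len "$word")"
done
```

Under that assumption the counts should line up with the 12, 18, and 10 underscores that the C locale run printed, twice the character counts shown for en_US.UTF-8.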

Then adding -i:

$ echo "$var" | LC_ALL=en_US.UTF-8 sort --debug -fui
sort: using `en_US.UTF-8' sorting rules
κισσός
______
φθινόπωρο
_________
Φύλλο
_____
$ echo "$var" | LC_ALL=C sort --debug -fui
coreutils/src/sort: using simple byte comparison
φθινόπωρο
__________________

When every byte is ignored as non-printable, all three lines compare as
identical (empty) keys, hence -u prints only one line.
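The collapse can be reproduced by counting how many lines survive. This is a rough sketch that assumes GNU sort on a system whose C locale classifies bytes above 0x7f as non-printable, as described above; distinct_in_c is a hypothetical helper name:

```shell
# Count the lines that sort -f -i -u keeps in the C locale.
# distinct_in_c is a hypothetical helper name.
distinct_in_c() {
  # Each argument becomes one input line.
  printf '%s\n' "$@" | LC_ALL=C sort -f -i -u | wc -l | tr -d ' '
}

# All-Greek words: every byte is ignored, so the keys are all empty.
distinct_in_c φθινόπωρο κισσός Φύλλο

# Plain ASCII words: the bytes are printable, so the lines stay distinct.
distinct_in_c abc abd
```

The ASCII case is there as a control: the same options leave ordinary printable input alone.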

However, I think this report _did_ find a valid tangential issue - the
'sort --debug' option ought to be enhanced to use a different character
than '_' when flagging which bytes were ignored by -i as unprintable
characters.  That is, I would find it much nicer to see:

$ echo 'aφc' | LC_ALL=C sort --debug -i
aφc
_.._

to make it obvious that the two bytes of φ were being excluded from the
sort key that I requested, because -i was in effect.  The same goes for
other sort options, such as 'sort -k1n' ignoring the characters after
the end of the first parsed number.
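The -k1n case can be observed today without any new debug output, because -u compares only the parsed key. A minimal sketch, assuming GNU sort; numeric_distinct is a hypothetical helper name:

```shell
# Count the lines that sort -k1n -u keeps.
# numeric_distinct is a hypothetical helper name.
numeric_distinct() {
  printf '%s\n' "$@" | LC_ALL=C sort -k1n -u | wc -l | tr -d ' '
}

# Both lines parse as the number 10; the trailing letters are ignored,
# so -u treats the lines as duplicates.
numeric_distinct 10abc 10xyz

# Different leading numbers remain distinct.
numeric_distinct 10abc 11abc
```

This shows only the effect of the ignored characters; which bytes were skipped is exactly what the proposed --debug enhancement would make visible.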

-- 
Eric Blake   eblake <at> redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org

This bug report was last modified 14 years and 36 days ago.