#8871 - Bug with "sort -i" ? - GNU bug report logs

Reported by: Al Bogner <suse-linux <at> ml082.pinguin.uni.cc>

Date: Wed, 15 Jun 2011 16:04:02 UTC

Severity: normal

Tags: notabug

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

From: Eric Blake <eblake <at> redhat.com>
To: Al Bogner <suse-linux <at> ml082.pinguin.uni.cc>
Cc: 8871 <at> debbugs.gnu.org
Subject: bug#8871: Bug with "sort -i" ?
Date: Wed, 15 Jun 2011 14:08:49 -0600

retitle 8871 RFE enhance sort --debug -i
tag 8871 wishlist
thanks

On 06/15/2011 09:42 AM, Al Bogner wrote:
> Hi,
>
> this looks like a bug for me:

Thanks for the report.  However, most likely this is not a bug in sort,
but a misunderstanding on your part about how locales affect which bytes
(or byte sequences, in multi-byte locales) are deemed printable.

>
> var="φθινόπωρο,κισσός,Φύλλο"
>
>
> echo "$var" | sed -e 's/.*/\L&/' -e 's/,/_/g' | tr '_' '\n' | \

Wow, that's a complex way to change comma into newline.  Why not just:

var="φθινόπωρο
κισσός
Φύλλο"
echo "$var" | sort ...

[I'm assuming you've distilled this from a larger example where the
complex processing was actually useful rather than starting from the
right string to begin with]

> sort -f -u
> κισσός
> φθινόπωρο
> φύλλο
>
> echo "$var" | sed -e 's/.*/\L&/' -e 's/,/_/g' | tr '_' '\n' | \
> sort -f -i -u
> φθινόπωρο

Let's put the new 'sort --debug' option to use to point out the
difference a locale makes (and note that on my machine, the C locale
deems non-ASCII bytes as non-printable, even though they still render
just fine on my terminal).

First, without -i:

$ echo "$var" | LC_ALL=en_US.UTF-8 sort --debug -fu
sort: using `en_US.UTF-8' sorting rules
κισσός
______
φθινόπωρο
_________
Φύλλο
_____
$ echo "$var" | LC_ALL=C sort --debug -fu
sort: using simple byte comparison
Φύλλο
__________
κισσός
____________
φθινόπωρο
__________________

Did you notice how the line lengths differ between the en_US.UTF-8
locale (which knows how to treat multi-byte characters as single
characters) and the C locale (where every byte is a character, and the
multi-byte UTF-8 entities are treated as multiple non-printable
characters)?
Then adding -i:

$ echo "$var" | LC_ALL=en_US.UTF-8 sort --debug -fui
sort: using `en_US.UTF-8' sorting rules
κισσός
______
φθινόπωρο
_________
Φύλλο
_____
$ echo "$var" | LC_ALL=C sort --debug -fui
coreutils/src/sort: using simple byte comparison
φθινόπωρο
__________________

When all of the bytes are ignored as non-printable, then all three lines
are identical, hence -u prints only one line.

However, I think this report _did_ find a valid tangential issue - the
'sort --debug' option ought to be enhanced to use a different character
than '_' when flagging which bytes were ignored by -i as unprintable
characters.  That is, I would find it much nicer to see:

$ echo 'aφc' | LC_ALL=C sort --debug -i
aφc
_.._

to make it obvious that the two bytes for φ were being ignored from the
particular sort field that I requested, because -i was in effect.  Same
thing goes for other sort options, such as 'sort -k1n' ignoring
characters after the end of the first parsed number.

-- 
Eric Blake   eblake <at> redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
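
[Editorial aside, not part of the original message.]  The explanation
above can be checked without 'sort --debug'.  The sketch below is only an
approximation of what -i does: it uses 'tr' to delete every byte that the
C locale treats as non-printable, which is roughly the key that 'sort -i'
ends up comparing.  It assumes a GNU-style userland, a script file saved
as UTF-8, and precomposed Greek letters (two bytes each); the variable
names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical reproduction script; only the sample words come from the report.
var='φθινόπωρο
κισσός
Φύλλο'

# 1. In the C locale every byte is one character.  wc -c counts bytes, so
#    the 6-letter word κισσός is reported as 12 (2 bytes per Greek letter).
bytes=$(printf '%s' 'κισσός' | LC_ALL=C wc -c | tr -d ' ')
echo "bytes in κισσός: $bytes"

# 2. Approximate the key that -i compares: delete every byte the C locale
#    deems non-printable.  Newlines are kept so line structure survives.
ascii_key=$(printf 'aφc\n' | LC_ALL=C tr -cd '[:print:]\n')
echo "key for aφc: $ascii_key"

# 3. Do the same for the three Greek words.  Every key becomes empty, so
#    only one distinct key is left.
distinct_keys=$(printf '%s\n' "$var" \
  | LC_ALL=C tr -cd '[:print:]\n' \
  | LC_ALL=C sort -u | wc -l | tr -d ' ')
echo "distinct keys: $distinct_keys"

# 4. Consequently sort -f -i -u in the C locale keeps a single line.
kept=$(printf '%s\n' "$var" | LC_ALL=C sort -f -i -u | wc -l | tr -d ' ')
echo "lines kept by sort -fiu: $kept"
```

Running the same pipeline with a UTF-8 locale in place of LC_ALL=C (for
example en_US.UTF-8, if it is installed) should keep all three lines,
since the Greek letters are then printable characters.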

This bug report was last modified 14 years and 36 days ago.
