retitle 8871 RFE enhance sort --debug -i
tag 8871 wishlist
thanks

On 06/15/2011 09:42 AM, Al Bogner wrote:
> Hi,
>
> this looks like a bug for me:

Thanks for the report.  However, most likely this is not a bug in sort,
but a misunderstanding on your part about how locales affect which bytes
(or byte sequences, in multi-byte locales) are deemed printable.

> var="φθινόπωρο,κισσός,Φύλλο"
>
> echo "$var" | sed -e 's/.*/\L&/' -e 's/,/_/g' | tr '_' '\n' | \

Wow, that's a complex way to change comma into newline.  Why not just:

var="φθινόπωρο
κισσός
Φύλλο"
echo "$var" | sort ...

[I'm assuming you've distilled this from a larger example where the
complex processing was actually useful rather than starting from the
right string to begin with]

> sort -f -u
> κισσός
> φθινόπωρο
> φύλλο
>
> echo "$var" | sed -e 's/.*/\L&/' -e 's/,/_/g' | tr '_' '\n' | \
> sort -f -i -u
> φθινόπωρο

Let's put the new 'sort --debug' option to use to point out the
difference a locale makes (and note that on my machine, the C locale
deems non-ASCII bytes as non-printable, even though they still render
just fine on my terminal).

First, without -i:

$ echo "$var" | LC_ALL=en_US.UTF-8 sort --debug -fu
sort: using `en_US.UTF-8' sorting rules
κισσός
______
φθινόπωρο
_________
Φύλλο
_____
$ echo "$var" | LC_ALL=C sort --debug -fu
sort: using simple byte comparison
Φύλλο
__________
κισσός
____________
φθινόπωρο
__________________

Did you notice how the line lengths differ between the en_US.UTF-8
locale (which knows how to treat multi-byte characters as single
characters) and the C locale (where every byte is a character, and the
multi-byte UTF-8 entities are treated as multiple non-printable
characters)?

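The byte-versus-character difference can also be seen without involving
sort at all.  What follows is only a minimal sketch, assuming a POSIX
shell with printf, wc and tr available; the variable names s and bytes
are made up for illustration, and the string is the 'aφc' sample written
with octal escapes so that its bytes are unambiguous:

```shell
# 'aφc' spelled with octal escapes: \317\206 is the two-byte UTF-8
# encoding of φ (U+03C6), so the string is 3 characters but 4 bytes.
s=$(printf 'a\317\206c')

# wc -c counts bytes; tr strips the padding that some wc
# implementations put in front of the number.
bytes=$(printf '%s' "$s" | LC_ALL=C wc -c | tr -d ' ')

echo "bytes: $bytes"   # prints "bytes: 4"
```

The same arithmetic accounts for the underlines above: the nine
characters of φθινόπωρο take two bytes each in UTF-8, which gives the
18 underscores shown in the C locale.
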
Then adding -i:

$ echo "$var" | LC_ALL=en_US.UTF-8 sort --debug -fui
sort: using `en_US.UTF-8' sorting rules
κισσός
______
φθινόπωρο
_________
Φύλλο
_____
$ echo "$var" | LC_ALL=C sort --debug -fui
coreutils/src/sort: using simple byte comparison
φθινόπωρο
__________________

When all of the bytes are ignored as non-printable, then all three lines
are identical, hence -u prints only one line.

However, I think this report _did_ find a valid tangential issue - the
'sort --debug' option ought to be enhanced to use a different character
than '_' when flagging which bytes were ignored by -i as unprintable
characters.  That is, I would find it much nicer to see:

$ echo 'aφc' | LC_ALL=C sort --debug -i
aφc
_.._

to make it obvious that the two bytes for φ were being ignored from the
particular sort field that I requested, because -i was in effect.  Same
thing goes for other sort options, such as 'sort -k1n' ignoring
characters after the end of the first parsed number.

-- 
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org