#17189 - Sort bug #2 - GNU bug report logs

Reported by: Nikos Balkanas <nbalkanas <at> gmail.com>

Date: Sat, 5 Apr 2014 04:39:01 UTC

Severity: normal

Tags: notabug

Merged with 17188

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

From: Bob Proulx <bob <at> proulx.com>
To: Nikos Balkanas <nbalkanas <at> gmail.com>
Cc: 17189 <at> debbugs.gnu.org
Subject: bug#17189: Sort bug #2
Date: Sat, 5 Apr 2014 14:37:39 -0600

Nikos Balkanas wrote:
> Thank you all. As I explained in my previous mail, an update of the man
> pages is essential. A change in the UI would also be desirable,
> if the standards allow it. Sorry, about my attitude, but I was getting
> pretty desperate. Thanks for not flaming.
>
> To make it up I will look into updating the man pages ;-)

Hopefully you will then see the WARNING section in the man page.

  *** WARNING *** The locale specified by the environment affects
  sort order. Set LC_ALL=C to get the traditional sort order that
  uses native byte values.

The sort documentation says this:

  Unless otherwise specified, all comparisons use the character
  collating sequence specified by the `LC_COLLATE' locale.(1)
  ...
  (1) If you use a non-POSIX locale (e.g., by setting `LC_ALL' to
  `en_US'), then `sort' may produce output that is sorted differently
  than you're accustomed to. In that case, set the `LC_ALL'
  environment variable to `C'. Note that setting only `LC_COLLATE'
  has two problems. First, it is ineffective if `LC_ALL' is also set.
  Second, it has undefined behavior if `LC_CTYPE' (or `LANG', if
  `LC_CTYPE' is unset) is set to an incompatible value. For example,
  you get undefined behavior if `LC_CTYPE' is `ja_JP.PCK' but
  `LC_COLLATE' is `en_US.UTF-8'.

> A suggestion. I think that sort should sort text based on the LOCALE
> of the file, not the system. Couldn't it detect automatically from
> the text, whether it is dealing with UTF-8 or iso? If dealing
> with Iso, it should employ the C Locale

US-ASCII is a subset of UTF-8. Every ASCII file is also a valid UTF-8
file. That is by design. But it also makes it impossible to make an
assumption like this. For example one would start out with:

  Lorem ipsum dolor sit amet
  Now is the time.
  Don't look Ethel!

That file would sort one way. Then someone would change the
apostrophe to the Unicode one.

  Lorem ipsum dolor sit amet
  Now is the time.
  Don’t look Ethel!

If sort tried to automatically detect behavior based upon the file
content, would the file now sort with dictionary sort ordering? I
think this would cause a large number of complaints. It would be
data-dependent behavior and would break a lot of things. Plus this
would require sort to add another pass to read the file first to
determine this before sorting it. Please no.

Besides... One person's file of human language is another person's
file of raw bytes. Can't make assumptions like this.

Bob
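
Two points from the message above can be sketched in a few lines of Python. This is only an illustration, not how sort itself is implemented. The helper names effective_collate_locale and is_pure_ascii are hypothetical, the sample text is modeled on the lines in the message, and the final "C" fallback is an assumption (POSIX leaves the default locale implementation-defined). The first helper follows the usual precedence LC_ALL, then LC_COLLATE, then LANG; the second stands in for the kind of content sniffing proposed in the quoted suggestion.

```python
from typing import Mapping


def effective_collate_locale(env: Mapping[str, str]) -> str:
    """Return the locale name that would govern collation (hypothetical helper).

    Precedence: LC_ALL, then LC_COLLATE, then LANG. Empty values count
    as unset. Falling back to "C" is an assumption for this sketch.
    """
    for name in ("LC_ALL", "LC_COLLATE", "LANG"):
        value = env.get(name, "")
        if value:
            return value
    return "C"


def is_pure_ascii(data: bytes) -> bool:
    """Return True when every byte is below 0x80 (7-bit ASCII)."""
    return all(byte < 0x80 for byte in data)


# Setting only LC_COLLATE has no effect when LC_ALL is also set.
env = {"LC_ALL": "en_US.UTF-8", "LC_COLLATE": "C"}
print(effective_collate_locale(env))  # en_US.UTF-8

# Sample text modeled on the message; the second copy differs only in
# using the Unicode apostrophe U+2019.
plain = "Lorem ipsum dolor sit amet\nNow is the time.\nDon't look Ethel!\n"
curly = plain.replace("'", "\u2019")

# An ASCII file has exactly the same bytes when treated as UTF-8.
assert plain.encode("ascii") == plain.encode("utf-8")

# A sniffer based on file content flips its answer after that one edit.
print(is_pure_ascii(plain.encode("utf-8")))  # True
print(is_pure_ascii(curly.encode("utf-8")))  # False
```

The last two lines show the data-dependent behavior the message warns about: a single changed character would switch the whole file from one sort order to another.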

This bug report was last modified 11 years and 100 days ago.
