GNU bug report logs -
#36674
Sort Suggestion
Reported by: Marshall Lake <mlake <at> mlake.net>
Date: Mon, 15 Jul 2019 18:53:01 UTC
Severity: normal
Tags: notabug
Done: Assaf Gordon <assafgordon <at> gmail.com>
Bug is archived. No further changes may be made.
tag 36674 notabug
close 36674
stop
Hello,
On Mon, Jul 15, 2019 at 11:42:01AM -0700, Marshall Lake wrote:
> Even though this isn't a bug, I was asked to send the following to this
> email address.
(General suggestions and discussions are better suited to the
coreutils <at> gnu.org mailing list; that way the system won't open a new
bug item.)
>
> Re: SORT Command from GNU coreutils 8.25
>
> A suggestion for an additional option to the SORT command is to ignore
> non-alphanumeric characters.
>
> As an example, in attempting to sort an index ...
>
> Abbott, William 259
>
> sorts before:
>
> Abbot, William 099
>
> If non-alphanumeric characters were ignored then the same two records
> would sort as:
>
> Abbot, William 099
> Abbott, William 259
>
>
There's actually something else at play here:
In your case, sort does ignore non-alphanumeric characters,
but it ALSO ignores white space.
That happens because your locale is set to some language
(for example, en_US.UTF8).
Using such a locale makes sort ignore all non-alphanumeric characters,
whitespace, and upper/lower case.
In essence, you are comparing "AbbottWilliam" (two 't's) to
"AbbotWilliam" (one 't'): the second 't' of "Abbott" is compared to the
'W' of "William", and is determined to come first.
If you force the POSIX/C locale, then all characters are considered,
and the result will be as you requested.
Observe the following:
$ printf "%s\n" AbbottWilliam AbbotWilliam | LC_ALL=en_CA.utf8 sort
AbbottWilliam
AbbotWilliam
$ printf "%s\n" "Abbott William" "Abbot William" | LC_ALL=en_CA.utf8 sort
Abbott William
Abbot William
$ printf "%s\n" "Abbott William" "Abbot William" | LC_ALL=C sort
Abbot William
Abbott William
$ printf "%s\n" "Abbott, William" "Abbot, William" | LC_ALL=C sort
Abbot, William
Abbott, William
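The comparison described above can be made visible with a rough sketch. The helper name approx_key is hypothetical, and the key it builds only approximates what a language locale compares first; real collation is multi-level and depends on the locale data installed on the system.

```shell
# approx_key (hypothetical helper): drop everything except letters and
# digits, then fold case. This only approximates the first comparison
# pass of a language locale; it is not how sort is implemented.
approx_key() {
  printf '%s\n' "$1" |
    LC_ALL=C tr -cd '[:alnum:]\n' |
    LC_ALL=C tr '[:upper:]' '[:lower:]'
}

approx_key "Abbott, William 259"   # abbottwilliam259
approx_key "Abbot, William 099"    # abbotwilliam099

# Sorting the approximate keys byte by byte reproduces the reported
# order: at the sixth character, 't' sorts before 'w'.
{ approx_key "Abbott, William 259"; approx_key "Abbot, William 099"; } |
  LC_ALL=C sort
```

The last command prints abbottwilliam259 before abbotwilliam099, which matches the order seen for the original lines under en_CA.utf8 above.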
Note that 'sort' already has an option for dictionary-style sorting:
-d, --dictionary-order: consider only blanks and alphanumeric characters.
However, locale rules take precedence over it, so effectively it only
works in the "C" locale:
$ printf "%s\n" "Ab,,b,,ott William" "Abbot William" | LC_ALL=C sort
Ab,,b,,ott William
Abbot William
$ printf "%s\n" "Ab,,b,,ott William" "Abbot William" | LC_ALL=C sort -d
Abbot William
Ab,,b,,ott William
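Applied to the index lines from the original report, the same idea might look like the sketch below. The wrapper name index_sort is hypothetical; it merely scopes LC_ALL=C to the one sort invocation, so the rest of the shell session keeps its usual locale.

```shell
# index_sort (hypothetical wrapper): dictionary-order sort with the C
# locale applied to this single sort call only.
index_sort() {
  LC_ALL=C sort -d "$@"
}

printf '%s\n' "Abbott, William 259" "Abbot, William 099" | index_sort
# Abbot, William 099
# Abbott, William 259
```

With -d the comma is skipped, so the sixth character compared is a blank (from "Abbot William") against 't' (from "Abbott William"), and the blank sorts first in the C locale.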
You can read past discussions about the confusion resulting from locale
sorting rules here:
https://debbugs.gnu.org/11621
https://debbugs.gnu.org/12783
As such, I'm closing this as "not a bug", but discussion can continue
by replying to this thread.
-assaf
This bug report was last modified 6 years and 4 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.