GNU bug report logs - #36674: Sort Suggestion
Reported by: Marshall Lake <mlake <at> mlake.net>
Date: Mon, 15 Jul 2019 18:53:01 UTC
Severity: normal
Tags: notabug
Done: Assaf Gordon <assafgordon <at> gmail.com>
Bug is archived. No further changes may be made.
Message #10 received at control <at> debbugs.gnu.org (full text, mbox):
tag 36674 notabug
close 36674
stop
Hello,
On Mon, Jul 15, 2019 at 11:42:01AM -0700, Marshall Lake wrote:
> Even though this isn't a bug, I was asked to send the following to this
> email address.
(General suggestions and discussions are better suited to the
coreutils <at> gnu.org mailing list; that way the system won't open a new
bug item.)
>
> Re: SORT Command from GNU coreutils 8.25
>
> A suggestion for an additional option to the SORT command is to ignore
> non-alphanumeric characters.
>
> As an example, in attempting to sort an index ...
>
> Abbott, William 259
>
> sorts before:
>
> Abbot, William 099
>
> If non-alphanumeric characters were ignored then the same two records
> would sort as:
>
> Abbot, William 099
> Abbott, William 259
>
>
There's actually something else at play here:
In your case, sort does ignore non-alphanumeric characters,
but it ALSO ignores white space.
That happens because your locale is set to some language
(for example, en_US.UTF8).
Using such a locale makes sort ignore all non-alphanumeric characters,
whitespace, and upper/lower case.
In essence, you are comparing "AbbottWilliam" (two 't's) to
"AbbotWilliam" (one 't'): the second 't' is then compared to a 'w',
and is determined to come first.
If you force a POSIX/C locale, then all characters are considered,
and the result will be as you requested.
Observe the following:
$ printf "%s\n" AbbottWilliam AbbotWilliam | LC_ALL=en_CA.utf8 sort
AbbottWilliam
AbbotWilliam
$ printf "%s\n" "Abbott William" "Abbot William" | LC_ALL=en_CA.utf8 sort
Abbott William
Abbot William
$ printf "%s\n" "Abbott William" "Abbot William" | LC_ALL=C sort
Abbot William
Abbott William
$ printf "%s\n" "Abbott, William" "Abbot, William" | LC_ALL=C sort
Abbot, William
Abbott, William
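To see which collation rules apply on a given system, it can help to
inspect the locale variables before sorting. The following is only a
minimal sketch: the function name sort_bytewise is a hypothetical helper
introduced for illustration, while locale, LC_ALL, LC_COLLATE and LANG
are the standard names.

```shell
# Print the settings that decide collation.
# Precedence: LC_ALL first, then LC_COLLATE, then LANG.
locale | grep -E '^(LANG|LC_COLLATE|LC_ALL)='

# Hypothetical wrapper: force byte-wise (C locale) comparison
# for this one sort invocation only, leaving the shell's locale alone.
sort_bytewise() {
  LC_ALL=C sort "$@"
}

printf '%s\n' 'Abbott, William 259' 'Abbot, William 099' | sort_bytewise
# Abbot, William 099
# Abbott, William 259
```

Setting the variable on the command line (rather than exporting it)
keeps the change local to that single command.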
Note that 'sort' already has an option for dictionary-style sorting:
-d, --dictionary-order: consider only blanks and alphanumeric characters.
However, locale rules take precedence over it, so effectively it only
works in the "C" locale:
$ printf "%s\n" "Ab,,b,,ott William" "Abbot William" | LC_ALL=C sort
Ab,,b,,ott William
Abbot William
$ printf "%s\n" "Ab,,b,,ott William" "Abbot William" | LC_ALL=C sort -d
Abbot William
Ab,,b,,ott William
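If case should be ignored as well while blanks stay significant, -d can
be combined with -f (--ignore-case) in the "C" locale. This is a sketch
under that assumption, not something proposed in the original report;
sort_dict_nocase is a hypothetical name.

```shell
# Hypothetical wrapper: dictionary order (-d) plus case folding (-f),
# compared byte-wise in the C locale.
sort_dict_nocase() {
  LC_ALL=C sort -d -f "$@"
}

printf '%s\n' 'Abbott, William' 'abbot, william' | sort_dict_nocase
# abbot, william
# Abbott, William
```

Here the commas are skipped by -d and the letters are folded by -f, so
the blank after "abbot" is compared with the second 't' of "Abbott" and
sorts first.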
You can read past discussions about the confusion resulting from locale
sorting rules here:
https://debbugs.gnu.org/11621
https://debbugs.gnu.org/12783
As such, I'm closing this as "not a bug", but discussion can continue
by replying to this thread.
-assaf
This bug report was last modified 6 years and 3 days ago.