GNU bug report logs -
#42340
"join" reports that "sort"ed input is not sorted
Previous Next
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 42340 in the body.
You can then email your comments to 42340 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-coreutils <at> gnu.org
:
bug#42340
; Package
coreutils
.
(Mon, 13 Jul 2020 00:36:01 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Beth Andres-Beck <bandresbeck <at> gmail.com>
:
New bug report received and forwarded. Copy sent to
bug-coreutils <at> gnu.org
.
(Mon, 13 Jul 2020 00:36:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
In trying to use `join` with `sort` I discovered odd behavior: even after
running a file through `sort` using the same delimiter, `join` would still
complain that it was out of order.
The field I am sorting on is ip addresses, which means that depending on
which digits are zero they can be of different lengths, and the fields
include periods as well as alpha-numeric characters.
Here is a way to reproduce the problem:
> printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | sort -t, > a.txt
> printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | sort -t, > b.txt
> join -t, a.txt b.txt
join: b.txt:2: is not sorted: 1.1.1,b
The expected behavior would be that if a file has been sorted by "sort" it
will also be considered sorted by join.
---
I traced this back to what I believe to be a bug in sort.c when sorting on
a field other than the last field, where the original pointer is being
incremented one further than it ought to be.
On line 1675 it will always increment the pointer one position beyond the
delimiter unless the field is the last field. If both `eword` and `echar`
are 0 we incremented `eword` on line 1661.
Later when we use keylim (where the limfield value is stored) to calculate
the length of the field, it will include the delimiter in the comparison.
We can illustrate that the problem is including the delimiter because the
following case runs correctly without error:
> printf '1.1.1Z2\n1.1.12Z2\n1.1.2Z1' | sort -tZ > a.txt
> printf '1.1.12Za\n1.1.1Zb\n1.1.21Zc' | sort -tZ > b.txt
> join -tZ a.txt b.txt
In join.c, in comparison, we are comparing the contents of the keys without
the delimiter (on join.c:283 we call extract_field with `ptr` pointing to
the start of the key and len defined as `sep - ptr`, where `sep` is the
position of the tab character).
Cases illustrating the bug in sort:
> printf '12,\n1,\n' | sort -t, -k1
1,
12,
> printf '12,a\n1,a\n' | sort -t, -k1
12,a
1,a
Thank you,
Beth Andres-Beck
[Message part 2 (text/html, inline)]
Information forwarded
to
bug-coreutils <at> gnu.org
:
bug#42340
; Package
coreutils
.
(Mon, 13 Jul 2020 06:59:01 GMT)
Full text and
rfc822 format available.
Message #8 received at 42340 <at> debbugs.gnu.org (full text, mbox):
tags 42340 notabug
close 42340
stop
Hello,
On 2020-07-12 5:57 p.m., Beth Andres-Beck wrote:
> In trying to use `join` with `sort` I discovered odd behavior: even after
> running a file through `sort` using the same delimiter, `join` would still
> complain that it was out of order.
[...]
> Here is a way to reproduce the problem:
>
>> printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | sort -t, > a.txt
>> printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | sort -t, > b.txt
>> join -t, a.txt b.txt
> join: b.txt:2: is not sorted: 1.1.1,b
>
> The expected behavior would be that if a file has been sorted by "sort" it
> will also be considered sorted by join.
[...]
> I traced this back to what I believe to be a bug in sort.c
This is not a bug in sort or join, just a side-effect of the locale on
your system on the sorting results.
By forcing a C locale with "LC_ALL=C" (meaning simple ASCII order),
the files are ordered in the same way 'join' expected them to be:
$ printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | LC_ALL=C sort -t, > a.txt
$ printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | LC_ALL=C sort -t, > b.txt
$ join -t, a.txt b.txt
1.1.1,2,b
1.1.12,2,a
---
More details:
I'm going to assume your system uses some locale based on UTF-8.
You can check it by running 'locale', e.g. on my system:
$ locale
LANG=en_CA.utf8
LANGUAGE=en_CA:en
LC_CTYPE="en_CA.utf8"
..
..
Under most UTF-8 locales, punctuation characters are *ignored* in the
compared input lines. This might be confusing and non-intuitive, but
that's the way most systems have been working for many years (locale
ordering is defined in the GNU C Library, and coreutils has no way to
change it).
Observe the following:
$ printf '12,a\n1,b\n' | LC_ALL=en_CA.utf8 sort
12,a
1,b
$ printf '12,a\n1,b\n' | LC_ALL=C sort
1,b
12,a
With a UTF-8 locale, the comma character is ignored, and then "12a"
appears before "1b" (since the character '2' comes before the character
'b').
With "C" locale, forcing ASCII or "byte comparison", punctuation
characters are not ignored, and "1,b" appears before "12,a" (because
the comma ',' ASCII value is 44 , which is smaller then the ASCII value
digit '2').
---
Somewhat related:
Your sort command defines the delimiter ("-t,") but does not define
which columns to sort by; sort then uses the entire input line - and
there's no need to specify delimiter at all.
---
As such, I'm closing this as "not a bug", but discussion can continue by
replying to this thread.
regards,
- assaf
Added tag(s) notabug.
Request was from
Assaf Gordon <assafgordon <at> gmail.com>
to
control <at> debbugs.gnu.org
.
(Mon, 13 Jul 2020 06:59:02 GMT)
Full text and
rfc822 format available.
bug closed, send any further explanations to
42340 <at> debbugs.gnu.org and Beth Andres-Beck <bandresbeck <at> gmail.com>
Request was from
Assaf Gordon <assafgordon <at> gmail.com>
to
control <at> debbugs.gnu.org
.
(Mon, 13 Jul 2020 06:59:02 GMT)
Full text and
rfc822 format available.
Information forwarded
to
bug-coreutils <at> gnu.org
:
bug#42340
; Package
coreutils
.
(Wed, 15 Jul 2020 23:12:02 GMT)
Full text and
rfc822 format available.
Message #15 received at 42340 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
If that is the intended behavior, the bug is that:
> printf '12,\n1,\n' | sort -t, -k1 -s
1,
12,
does _not_ take the remainder of the line into account, and only sorts on
the initial field, prioritizing length.
It is at the very least unexpected that adding an `a` to the end of both
lines would change the sort order of those lines:
> printf '12,a\n1,a\n' | sort -t, -k1 -s
12,a
1,a
On Sun, Jul 12, 2020 at 11:58 PM Assaf Gordon <assafgordon <at> gmail.com> wrote:
> tags 42340 notabug
> close 42340
> stop
>
> Hello,
>
> On 2020-07-12 5:57 p.m., Beth Andres-Beck wrote:
> > In trying to use `join` with `sort` I discovered odd behavior: even after
> > running a file through `sort` using the same delimiter, `join` would
> still
> > complain that it was out of order.
> [...]
> > Here is a way to reproduce the problem:
> >
> >> printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | sort -t, > a.txt
> >> printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | sort -t, > b.txt
> >> join -t, a.txt b.txt
> > join: b.txt:2: is not sorted: 1.1.1,b
> >
> > The expected behavior would be that if a file has been sorted by "sort"
> it
> > will also be considered sorted by join.
> [...]
> > I traced this back to what I believe to be a bug in sort.c
>
> This is not a bug in sort or join, just a side-effect of the locale on
> your system on the sorting results.
>
> By forcing a C locale with "LC_ALL=C" (meaning simple ASCII order),
> the files are ordered in the same way 'join' expected them to be:
>
> $ printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | LC_ALL=C sort -t, > a.txt
> $ printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | LC_ALL=C sort -t, > b.txt
> $ join -t, a.txt b.txt
> 1.1.1,2,b
> 1.1.12,2,a
>
> ---
>
> More details:
> I'm going to assume your system uses some locale based on UTF-8.
> You can check it by running 'locale', e.g. on my system:
> $ locale
> LANG=en_CA.utf8
> LANGUAGE=en_CA:en
> LC_CTYPE="en_CA.utf8"
> ..
> ..
>
> Under most UTF-8 locales, punctuation characters are *ignored* in the
> compared input lines. This might be confusing and non-intuitive, but
> that's the way most systems have been working for many years (locale
> ordering is defined in the GNU C Library, and coreutils has no way to
> change it).
>
> Observe the following:
>
> $ printf '12,a\n1,b\n' | LC_ALL=en_CA.utf8 sort
> 12,a
> 1,b
>
> $ printf '12,a\n1,b\n' | LC_ALL=C sort
> 1,b
> 12,a
>
> With a UTF-8 locale, the comma character is ignored, and then "12a"
> appears before "1b" (since the character '2' comes before the character
> 'b').
>
> With "C" locale, forcing ASCII or "byte comparison", punctuation
> characters are not ignored, and "1,b" appears before "12,a" (because
> the comma ',' ASCII value is 44 , which is smaller then the ASCII value
> digit '2').
>
> ---
>
> Somewhat related:
> Your sort command defines the delimiter ("-t,") but does not define
> which columns to sort by; sort then uses the entire input line - and
> there's no need to specify delimiter at all.
>
> ---
>
> As such, I'm closing this as "not a bug", but discussion can continue by
> replying to this thread.
>
> regards,
> - assaf
>
>
[Message part 2 (text/html, inline)]
Information forwarded
to
bug-coreutils <at> gnu.org
:
bug#42340
; Package
coreutils
.
(Thu, 16 Jul 2020 00:39:01 GMT)
Full text and
rfc822 format available.
Message #18 received at 42340 <at> debbugs.gnu.org (full text, mbox):
Hello,
On 2020-07-15 2:12 p.m., Beth Andres-Beck wrote:
> If that is the intended behavior, the bug is that:
>> printf '12,\n1,\n' | sort -t, -k1 -s
> 1,
> 12,
>
> does _not_ take the remainder of the line into account, and only sorts on
> the initial field, prioritizing length.
>
> It is at the very least unexpected that adding an `a` to the end of both
> lines would change the sort order of those lines:
>> printf '12,a\n1,a\n' | sort -t, -k1 -s
> 12,a
> 1,a
>
Not a bug, just an incomplete usage :)
sort's -k/--key parameter takes two values (the second being optional):
the first and last column to use as the key. If the second value is
omitted (as in your case), then the key is taken from the first field
to the end of the line.
And so:
"sort -k1,1" means take the first *and only the first* field as the key.
"sort -k1" means take the first field until the end of the line as the key.
"sort -k1,3" means take the first,second and third fields as the single key.
"sort -k1,1 -k2,2 -k3,3" means take the first field as the first key,
second field as the second key, and third field as the third key.
---
The "--debug" option can help illustrate what sort is doing,
by adding underscore characters to show which characters are being used
as keys in each line. Consider the following:
$ printf '12,\n1,\n' | sort -t, -k1 -s --debug
sort: using ‘en_CA.utf8’ sorting rules
1,
__
12,
___
$ printf '12,\n1,\n' | sort -t, -k1,1 -s --debug
sort: using ‘en_CA.utf8’ sorting rules
1,
_
12,
__
In the first example, the "-k1" means from first field till end of line,
the underscore includes the "," characters.
In the second example, the "-k1,1" means only the first field, and the
comma is not used.
Now consider your second case of adding an "a" at the end of each line:
$ printf '12,a\n1,a\n' | sort -t, -k1 -s --debug
sort: using ‘en_CA.utf8’ sorting rules
12,a
____
1,a
___
$ printf '12,a\n1,a\n' | sort -t, -k1,1 -s --debug
sort: using ‘en_CA.utf8’ sorting rules
1,a
_
12,a
__
In the first example, "-k1" means: from first field until the end of the
line, and so the entire string "12,a" is compared against "1,a".
**AND**, because the locale is a "utf-8" locale, punctuation characters
are ignored (as mentioned in the previous email in this thread).
So effectively the compared strings are "12a" vs "1a".
The ASCII value of "2" is smaller than the ASCII value of "a", and
therefore "12a" appears before "1a".
If we force C locale, then the order is reversed:
$ printf '12,a\n1,a\n' | LC_ALL=C sort -t, -k1 -s --debug
sort: using simple byte comparison
1,a
___
12,a
____
Because now punctuation characters are used, and the ASCII value of ","
is smaller than the ASCII value of "2".
**HOWEVER**, this result of using "LC_ALL=C" together with "-k1" is
only correct by a happy accident :)
it is still very likely that "-k1" is not what you wanted - you
probably meant to do "-k1,1".
---
Lastly, the "-s/--stable" option in the above contrived examples is
superfluous - it doesn't affect the output order because there are no
equal field values (i.e. "1" vs "12").
A slightly better example to illustrate how "-s" affects ordering is this:
$ printf "2,x\n1,a\n2,b\n" | sort -t, -k1,1
1,a
2,b
2,x
$ printf "2,x\n1,a\n2,b\n" | sort -t, -k1,1 -s
1,a
2,x
2,b
Here, "1" comes before "2" - that's obvious. But should "2,b" come
before "2,x" ?
If we do not use "-s/--stable", then "sort" ALSO does one additional
comparison of the entire line as a last step (hence "sort --help" says
"[disable] last-resort comparison" about "-s/--stable").
The substring ",b" comes before ",x" - therefore "2,b" appears first.
If we add "-s/--stable", the last comparison step of the entire line is
skipped, and the lines of "2" appear in the order they were in the input
(hence - "stable").
By using "--debug" we can see the additional comparison step (indicated
by additional underscore lines);
$ printf "2,x\n1,a\n2,b\n" | sort -t, -k1,1 --debug
sort: using ‘en_CA.utf8’ sorting rules
1,a
_
___
2,b
_
___
2,x
_
___
$ printf "2,x\n1,a\n2,b\n" | sort -t, -k1,1 -s --debug
sort: using ‘en_CA.utf8’ sorting rules
1,a
_
2,x
_
2,b
_
---
Hope this helps.
regards,
- assaf
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Thu, 13 Aug 2020 11:24:05 GMT)
Full text and
rfc822 format available.
This bug report was last modified 5 years and 32 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.