GNU bug report logs - #42340
"join" reports that "sort"ed input is not sorted

Package: coreutils;

Reported by: Beth Andres-Beck <bandresbeck <at> gmail.com>

Date: Mon, 13 Jul 2020 00:36:01 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.


Message #12 received at control <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Beth Andres-Beck <bandresbeck <at> gmail.com>, 42340 <at> debbugs.gnu.org
Subject: Re: bug#42340: "join" reports that "sort"ed input is not sorted
Date: Mon, 13 Jul 2020 00:58:32 -0600

tags 42340 notabug
close 42340
stop

Hello,

On 2020-07-12 5:57 p.m., Beth Andres-Beck wrote:
> In trying to use `join` with `sort` I discovered odd behavior: even after
> running a file through `sort` using the same delimiter, `join` would still
> complain that it was out of order.
[...]
> Here is a way to reproduce the problem:
> 
>> printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | sort -t, > a.txt
>> printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | sort -t, > b.txt
>> join -t, a.txt b.txt
>   join: b.txt:2: is not sorted: 1.1.1,b
> 
> The expected behavior would be that if a file has been sorted by "sort" it
> will also be considered sorted by join.
[...]
> I traced this back to what I believe to be a bug in sort.c 

This is not a bug in sort or join, just a side effect of your system's
locale on the sorting results.

Forcing the C locale with "LC_ALL=C" (meaning simple ASCII order)
orders the files the way 'join' expects them to be:

 $ printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | LC_ALL=C sort -t, > a.txt
 $ printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | LC_ALL=C sort -t, > b.txt
 $ join -t, a.txt b.txt
 1.1.1,2,b
 1.1.12,2,a
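
One way to keep sort and join from disagreeing is to pin a single
locale for the whole pipeline instead of setting it per command. The
following is only a sketch: the scratch directory and the variable
names "workdir" and "joined" are made up for illustration, and it
assumes that byte order (the C locale) is acceptable for your data.

```shell
# Pin one collation order for every command that follows, so that
# sort and join both compare bytes.
export LC_ALL=C

# Scratch directory for the two input files (illustrative name).
workdir=$(mktemp -d)

printf '1.1.1,2\n1.1.12,2\n1.1.2,1\n'  | sort > "$workdir/a.txt"
printf '1.1.12,a\n1.1.1,b\n1.1.21,c\n' | sort > "$workdir/b.txt"

# join now sees both files in the order it expects.
joined=$(join -t, "$workdir/a.txt" "$workdir/b.txt")
printf '%s\n' "$joined"

rm -rf "$workdir"
```

Exporting the variable once avoids the case where one command in a
script runs under C and another under the session's UTF-8 locale.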

---

More details:
I'm going to assume your system uses some locale based on UTF-8.
You can check it by running 'locale', e.g. on my system:
  $ locale
  LANG=en_CA.utf8
  LANGUAGE=en_CA:en
  LC_CTYPE="en_CA.utf8"
  ..
  ..
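
If you only want to know which setting decides the sort order, the
usual precedence is LC_ALL first, then LC_COLLATE, then LANG. The
helper below is a hypothetical sketch of that lookup (the function
name "effective_collate" is made up, it is not part of coreutils), and
it assumes the C locale applies when none of the three is set.

```shell
# Hypothetical helper: print the locale name that governs collation,
# following the LC_ALL > LC_COLLATE > LANG precedence.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Assumption: with nothing set, the C/POSIX locale applies.
        printf 'C\n'
    fi
}

# Show the value for the current shell session.
effective_collate
```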

Under most UTF-8 locales, punctuation characters are *ignored* when
input lines are compared. This might be confusing and non-intuitive, but
that's the way most systems have worked for many years (locale
ordering is defined in the GNU C Library, and coreutils has no way to
change it).

Observe the following:

  $ printf '12,a\n1,b\n' | LC_ALL=en_CA.utf8 sort
  12,a
  1,b

  $ printf '12,a\n1,b\n' | LC_ALL=C sort
  1,b
  12,a

With a UTF-8 locale, the comma character is ignored, and then "12a" 
appears before "1b" (since the character '2' comes before the character
'b').
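
A rough way to see this effect without having en_CA.utf8 installed is
to delete the comma yourself and then sort by bytes. This is only an
approximation for illustration (real locale collation is more involved
than stripping characters), and the variable name "approx" is made up.

```shell
# Approximate the "punctuation is ignored" behaviour: drop the comma,
# then compare the remaining characters in plain byte order.
approx=$(printf '12,a\n1,b\n' | tr -d ',' | LC_ALL=C sort)
printf '%s\n' "$approx"
```

With the comma gone, "12a" sorts ahead of "1b", which matches the
order the UTF-8 locale produced for the original lines.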

With "C" locale, forcing ASCII or "byte comparison", punctuation 
characters are not ignored, and "1,b" appears before "12,a" (because
the ASCII value of the comma ',' is 44, which is smaller than 50, the
ASCII value of the digit '2').
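
The two byte values can be checked from the shell. In this sketch the
variable names "comma" and "two" are made up; it relies on the
standard printf behaviour where an argument starting with a quote
character is converted to the numeric value of the character that
follows it.

```shell
# A leading quote makes printf emit the numeric value of the next
# character.
comma=$(printf '%d' "',")
two=$(printf '%d' "'2")

printf 'comma=%s digit2=%s\n' "$comma" "$two"

# In byte order the smaller value sorts first.
if [ "$comma" -lt "$two" ]; then
    printf '%s\n' "comma sorts before 2 in byte order"
fi
```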

---

Somewhat related:
Your sort command defines the delimiter ("-t,") but does not define 
which columns to sort by, so sort compares entire input lines and
there's no need to specify the delimiter at all.
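
If the intent was to sort on the join field only, the key has to be
given explicitly with "-k1,1" so that sort compares the same field
join does. A minimal sketch, with a made-up scratch directory and
variable names ("tmp", "by_key"), and with LC_ALL=C pinned only so the
result is reproducible:

```shell
# Assumption: byte order is used here purely for reproducibility; the
# point is that sort and join run under the same locale.
export LC_ALL=C

tmp=$(mktemp -d)

# -t, sets the delimiter; -k1,1 restricts the sort key to field 1,
# which is the field join compares by default.
printf '1.1.1,2\n1.1.12,2\n1.1.2,1\n'  | sort -t, -k1,1 > "$tmp/a.txt"
printf '1.1.12,a\n1.1.1,b\n1.1.21,c\n' | sort -t, -k1,1 > "$tmp/b.txt"

by_key=$(join -t, "$tmp/a.txt" "$tmp/b.txt")
printf '%s\n' "$by_key"

rm -rf "$tmp"
```

With the key restricted to the first field, the "-t," option does
matter, because it tells sort where that field ends.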

---

As such, I'm closing this as "not a bug", but discussion can continue by
replying to this thread.

regards,
 - assaf





This bug report was last modified 5 years and 32 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.