GNU bug report logs - #23665
spaces in keys: doc, --debug in LC_ALL=C

Package: coreutils;

Reported by: Karl Berry <karl <at> freefriends.org>

Date: Tue, 31 May 2016 18:33:02 UTC

Severity: normal

Tags: fixed

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Karl Berry <karl <at> freefriends.org>, 23665 <at> debbugs.gnu.org
Subject: bug#23665: spaces in keys: doc, --debug in LC_ALL=C
Date: Tue, 31 May 2016 15:11:10 -0400
Hello Karl!

On 05/31/2016 02:32 PM, Karl Berry wrote:
> I run
>    LC_ALL=en_US.UTF-8 sort --debug -k 2 /tmp/foo  # or -k 2,2 et al.
> And get the nicely explanatory output for the "surprising" result:
[...]

Just to verify: is the surprising result the one in the C locale?

Here is what I'm seeing. With "en_US.UTF-8" I get the order I'd expect, but the "C" locale output is surprising:

    $ cat -A k.txt
    M  Build/zfile$
    M  Master/mfile$
    MM Build/afile$

    $ LC_ALL=en_US.UTF-8 sort -k2 k.txt
    MM Build/afile
    M  Build/zfile
    M  Master/mfile

    $ LC_ALL=C sort -k2 k.txt
    M  Build/zfile
    M  Master/mfile
    MM Build/afile
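
To make the C-locale order above easier to follow, here is a rough sketch of the text that -k2 ends up comparing. It is only an approximation of sort's field splitting, not sort's own code: the sed expression is my assumption for illustration, and it only holds for lines that start with a non-blank character.

```shell
# Sample lines from the message above.
input=$(printf 'M  Build/zfile\nM  Master/mfile\nMM Build/afile\n')

# Approximate the text that -k2 compares: drop the first run of non-blank
# characters (field 1) and keep everything after it, leading blanks included.
# The brackets only make the blanks visible.
keys=$(printf '%s\n' "$input" | sed 's/^[^[:blank:]]*\(.*\)$/[\1]/')

printf '%s\n' "$keys"
```

The first two keys begin with two blanks and the third with one blank followed by "B". Compared byte by byte, a blank sorts before "B", which is why "MM Build/afile" comes last in the C locale.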

 
> But the information is just as valid in C as in UTF-8, so far as I can
> see.  Thus it would be nice for it to be present.

If I understand correctly, one could argue the warning is even more important in the C locale than in UTF-8 locales,
since the collation rules of UTF-8 locales make leading spaces less significant.

As in:

    $ cat -A s.txt
    M A$
    M  B$
    M   D$
    M  C$

The en_US.UTF-8 collation rules make leading spaces less important:

    $ LC_ALL=en_US.UTF-8 sort -k2 s.txt
    M A
    M  B
    M  C
    M   D

In the C locale, spaces (compared as simple bytes) do matter:

    $ LC_ALL=C sort -k2 s.txt
    M   D
    M  B
    M  C
    M A

The 'b' modifier (as in -k2b) skips leading blanks in the key:

    $ LC_ALL=C sort -k2b s.txt
    M A
    M  B
    M  C
    M   D
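
For anyone who wants to check this without creating s.txt, the following is a small self-contained sketch. The variable names are mine, and it uses only the C locale so the result does not depend on which locales are installed.

```shell
# Same four lines as s.txt above, with one to three blanks after "M".
input=$(printf 'M A\nM  B\nM   D\nM  C\n')

# Key 2 starts right after "M", so the blanks belong to the key and are
# compared as plain bytes.
with_blanks=$(printf '%s\n' "$input" | LC_ALL=C sort -k2)

# The 'b' modifier skips the leading blanks before the key is compared.
without_blanks=$(printf '%s\n' "$input" | LC_ALL=C sort -k2b)

printf '%s\n---\n%s\n' "$with_blanks" "$without_blanks"
```

The first result puts "M   D" first and "M A" last, while the second gives the A, B, C, D order.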


> More importantly, I urge that the documentation for sort give an example
> of this.  The idea that following blanks after the first become part of
> the next field is highly counter-intuitive.

I agree.
I can add the above example to the documentation (and possibly also to the FAQ or Gotchas pages?).
What do you think?

The condition for printing this message is here:
 http://lingrok.org/xref/coreutils/src/sort.c#2435
I can try to suggest a patch that prints it in the C locale as well (hopefully tonight).


> It would also be nice if the definition of "key 1" was stated.
> Awfully easy to misread that as "field 1".

How about "leading blanks are significant in sort key [...]" ?
(in http://lingrok.org/xref/coreutils/src/sort.c#2439 )


regards,
 - assaf

This bug report was last modified 6 years and 245 days ago.
