GNU bug report logs - #23665
spaces in keys: doc, --debug in LC_ALL=C

Package: coreutils;

Reported by: Karl Berry <karl <at> freefriends.org>

Date: Tue, 31 May 2016 18:33:02 UTC

Severity: normal

Tags: fixed

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 23665 in the body.
You can then email your comments to 23665 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-coreutils <at> gnu.org:
bug#23665; Package coreutils. (Tue, 31 May 2016 18:33:02 GMT)

Acknowledgement sent to Karl Berry <karl <at> freefriends.org>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 31 May 2016 18:33:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Karl Berry <karl <at> freefriends.org>
To: bug-coreutils <at> gnu.org
Subject: spaces in keys: doc, --debug in LC_ALL=C
Date: Tue, 31 May 2016 18:32:27 GMT
Consider this three-line source file, say /tmp/foo:
M  Build/zfile
M  Master/mfile
MM Build/afile
There are two spaces after the M on the first two lines (and no trailing
spaces on any line).  I was trying to sort on the second "field".

I run
  LC_ALL=en_US.UTF-8 sort --debug -k 2 /tmp/foo  # or -k 2,2 et al.
And get the nicely explanatory output for the "surprising" result:
  sort: using ‘en_US.UTF-8’ sorting rules
  sort: leading blanks are significant in key 1; consider also specifying 'b'
  MM Build/afile
  ...

However, if I run that same command in the C locale:
  LC_ALL=C sort --debug -k 2 /tmp/foo  # or -k 2,2 et al.
the output lacks that crucial commentary line:
  sort: leading blanks are significant ...

But the information is just as valid in C as in UTF-8, so far as I can
see.  Thus it would be nice for it to be present.

It would also be nice if the definition of "key 1" was stated.
Awfully easy to misread that as "field 1".

More importantly, I urge that the documentation for sort give an example
of this.  The idea that following blanks after the first become part of
the next field is highly counter-intuitive.  The information is
implicitly there in "non-blank to blank transition", but it is a common
confounding of expectations and deserves explicit mention, IMHO.  (If
it's there, sorry, I didn't see it.)

This is with coreutils 8.25 (from original source).

Thanks,
Karl
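
For readers who want to reproduce the report, the following is a minimal sketch rather than anything posted in the thread. It recreates the three-line file in a temporary location (the report used /tmp/foo) and sorts it in the C locale with and without the 'b' modifier. The variable names are illustrative assumptions.

```shell
# Recreate the report's input: two spaces after "M" on the first two lines.
tmp=$(mktemp)
printf 'M  Build/zfile\nM  Master/mfile\nMM Build/afile\n' > "$tmp"

# Without 'b', the blanks that precede the second field belong to the key,
# so the keys compared are "  Build/zfile", "  Master/mfile" and " Build/afile".
plain=$(LC_ALL=C sort -k 2 "$tmp")

# With 'b' on the start position, leading blanks are skipped before comparing.
skipped=$(LC_ALL=C sort -k 2b "$tmp")

printf '%s\n\n%s\n' "$plain" "$skipped"
rm -f "$tmp"
```

In the C locale the first command leaves the input order unchanged, because a space byte sorts before any letter; the second puts the "MM Build/afile" line first.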




Information forwarded to bug-coreutils <at> gnu.org:
bug#23665; Package coreutils. (Tue, 31 May 2016 19:12:01 GMT)

Message #8 received at 23665 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Karl Berry <karl <at> freefriends.org>, 23665 <at> debbugs.gnu.org
Subject: Re: bug#23665: spaces in keys: doc, --debug in LC_ALL=C
Date: Tue, 31 May 2016 15:11:10 -0400
Hello Karl!

On 05/31/2016 02:32 PM, Karl Berry wrote:
> I run
>    LC_ALL=en_US.UTF-8 sort --debug -k 2 /tmp/foo  # or -k 2,2 et al.
> And get the nicely explanatory output for the "surprising" result:
[...]

Just to verify, the surprising result is in C locale?

I'm seeing the following. For "en_US.UTF-8" it's the order I'd expect, but the "C" result is surprising:

    $ cat -A k.txt
    M  Build/zfile$
    M  Master/mfile$
    MM Build/afile$

    $ LC_ALL=en_US.UTF-8 sort -k2 k.txt
    MM Build/afile
    M  Build/zfile
    M  Master/mfile

    $ LC_ALL=C sort -k2 k.txt
    M  Build/zfile
    M  Master/mfile
    MM Build/afile

 
> But the information is just as valid in C as in UTF-8, so far as I can
> see.  Thus it would be nice for it to be present.

If I understand correctly, one could argue the warning is even more important in C locale than in UTF-8 locales,
as collating rules for UTF-8 make leading spaces less significant.

As in:

    $ cat -A s.txt
    M A$
    M  B$
    M   D$
    M  C$

UTF-8 makes leading spaces less important:

    $ LC_ALL=en_US.UTF-8 sort -k2 s.txt
    M A
    M  B
    M  C
    M   D

in C locale, spaces (as simple bytes) do matter:

    $ LC_ALL=C sort -k2 s.txt
    M   D
    M  B
    M  C
    M A

-b skips leading spaces:

    $ LC_ALL=C sort -k2b s.txt
    M A
    M  B
    M  C
    M   D


> More importantly, I urge that the documentation for sort give an example
> of this.  The idea that following blanks after the first become part of
> the next field is highly counter-intuitive.

I agree,
I can add the above example to the documentation (also possibly to the FAQ or Gotcha pages?).
What do you think?

The condition to print this message is here:
 http://lingrok.org/xref/coreutils/src/sort.c#2435
I can try to suggest a patch to print it in C locale as well (hopefully tonight).


> It would also be nice if the definition of "key 1" was stated.
> Awfully easy to misread that as "field 1".

How about "leading blanks are significant in sort key [...]" ?
(in http://lingrok.org/xref/coreutils/src/sort.c#2439 )


regards,
 - assaf









Information forwarded to bug-coreutils <at> gnu.org:
bug#23665; Package coreutils. (Tue, 31 May 2016 22:47:02 GMT)

Message #11 received at 23665 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Assaf Gordon <assafgordon <at> gmail.com>, Karl Berry <karl <at> freefriends.org>,
 23665 <at> debbugs.gnu.org
Subject: Re: bug#23665: spaces in keys: doc, --debug in LC_ALL=C
Date: Tue, 31 May 2016 23:46:47 +0100
On 31/05/16 20:11, Assaf Gordon wrote:
> Hello Karl!
>
> On 05/31/2016 02:32 PM, Karl Berry wrote:
>> I run
>>     LC_ALL=en_US.UTF-8 sort --debug -k 2 /tmp/foo  # or -k 2,2 et al.
>> And get the nicely explanatory output for the "surprising" result:
> [...]
>
> Just to verify, the surprising result is in C locale?
>
> I'm seeing the following. For "en_US.UTF-8" it's the order I'd expect, but the "C" result is surprising:
>
>       $ cat -A k.txt
>       M  Build/zfile$
>       M  Master/mfile$
>       MM Build/afile$
>
>       $ LC_ALL=en_US.UTF-8 sort -k2 k.txt
>       MM Build/afile
>       M  Build/zfile
>       M  Master/mfile
>
>       $ LC_ALL=C sort -k2 k.txt
>       M  Build/zfile
>       M  Master/mfile
>       MM Build/afile
>
>
>> But the information is just as valid in C as in UTF-8, so far as I can
>> see.  Thus it would be nice for it to be present.
>
> If I understand correctly, one could argue the warning is even more important in C locale than in UTF-8 locales,
> as collating rules for UTF-8 make leading spaces less significant.
>
> As in:
>
>       $ cat -A s.txt
>       M A$
>       M  B$
>       M   D$
>       M  C$
>
> UTF-8 makes leading spaces less important:
>
>       $ LC_ALL=en_US.UTF-8 sort -k2 s.txt
>       M A
>       M  B
>       M  C
>       M   D
>
> in C locale, spaces (as simple bytes) do matter:
>
>       $ LC_ALL=C sort -k2 s.txt
>       M   D
>       M  B
>       M  C
>       M A
>
> -b skips leading spaces:
>
>       $ LC_ALL=C sort -k2b s.txt
>       M A
>       M  B
>       M  C
>       M   D
>
>
>> More importantly, I urge that the documentation for sort give an example
>> of this.  The idea that following blanks after the first become part of
>> the next field is highly counter-intuitive.
>
> I agree,
> I can add the above example to the documentation (also possibly to the FAQ or Gotcha pages?).
> What do you think?
>
> The condition to print this message is here:
>    http://lingrok.org/xref/coreutils/src/sort.c#2435
> I can try to suggest a patch to print it in C locale as well (hopefully tonight).

The warning was suppressed in this case as one might be using
such a command to sort right aligned indexes:
http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=v8.5-40-g63761c0
I was probably overthinking that a bit,
so I'd be happy to see maybe_space_aligned removed from the condition.

cheers,
Pádraig.
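
As an illustration of the right-aligned case mentioned in this message, here is an assumed example that is not taken from the linked commit. When a column is padded on the left to a fixed width, leaving the leading blanks significant makes a plain byte comparison in the C locale agree with numeric order, whereas 'b' throws the padding away.

```shell
# A second column right-aligned to width 3 (hypothetical data).
tmp=$(mktemp)
printf 'x 100\nx   9\nx  10\n' > "$tmp"

# Keys are "   9", "  10" and " 100": the padding keeps them in numeric order.
aligned=$(LC_ALL=C sort -k 2 "$tmp")

# With 'b' the keys become "9", "10" and "100", which compare as plain text.
stripped=$(LC_ALL=C sort -k 2b "$tmp")

printf '%s\n\n%s\n' "$aligned" "$stripped"
rm -f "$tmp"
```

The first result lists 9, 10, 100; the second lists 10, 100, 9. This is one reading of why the warning could be considered noise for such input.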





Information forwarded to bug-coreutils <at> gnu.org:
bug#23665; Package coreutils. (Tue, 31 May 2016 23:16:02 GMT)

Message #14 received at 23665 <at> debbugs.gnu.org:

From: Karl Berry <karl <at> freefriends.org>
To: assafgordon <at> gmail.com
Cc: 23665 <at> debbugs.gnu.org
Subject: Re: bug#23665: spaces in keys: doc, --debug in LC_ALL=C
Date: Tue, 31 May 2016 23:15:21 GMT
    Just to verify, the surprising result is in C locale?

Yes.

    as collating rules for UTF-8 make leading spaces less significant.

Yes, which is a different problem, in itself.  Let me ask this: Are the
collation rules for en_US.UTF-8 documented or even reasonably
comprehensively described anywhere?  Or just buried in the bowels of
libc code?  I looked online and found nothing usable (surprisingly).

    I can add the above example to the documentation 

Yes please.  The C example.  It can probably be cut down to two lines
and "foo" vs. "bar" or whatever.

    (also possibly to the FAQ or Gotcha pages?).  What do you think?

You (or Paul, Jim, Bob, ...) would know better than me what deserves to
be on those pages (wherever they are).  My feeling would be yes.  Along
with mentioning --debug to, well, debug such things.

    How about "leading blanks are significant in sort key [...]" ?

I'm not sure what you mean by [...].  The %lu?
Are you proposing to just add the word "sort"?  That's not needed IMHO.

What I was thinking was something like this:
  sort: leading blanks are significant in key 1 [-k 2]; consider also specifying 'b'

Since this is debugging output, the more information the better, in
theory, seems to me.  Maybe it is not feasible.  No biggie.

    (in http://lingrok.org/xref/coreutils/src/sort.c#2439 )

I don't see the change.  Sorry.  --thanks, karl.
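
To make the "key 1" versus "field 1" distinction concrete, here is a hedged sketch that is not from the thread. It assumes, as Karl's suggested "key 1 [-k 2]" wording implies, that the number in the diagnostic is the position of the -k option on the command line. It passes two keys, so any warning about the second -k option should name key 2 even though that option selects field 2 onward. The exact wording of the diagnostics may differ between coreutils versions, so no output is shown for the --debug run.

```shell
tmp=$(mktemp)
printf 'M A\nM  C\nM   D\nM  B\n' > "$tmp"

# Key 1 is "-k 1b,1" (field 1 only); key 2 is "-k 2" (field 2 to end of line).
sorted=$(LC_ALL=C sort -k 1b,1 -k 2 "$tmp")
printf '%s\n' "$sorted"

# --debug is a GNU extension, so only call it where it is accepted.
# Print just the diagnostics (stderr) and discard the annotated lines.
if sort --debug /dev/null >/dev/null 2>&1; then
  LC_ALL=C sort --debug -k 1b,1 -k 2 "$tmp" 2>&1 >/dev/null
fi
rm -f "$tmp"
```

Since field 1 is "M" on every line, the order is decided entirely by the second key, blanks included.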




Information forwarded to bug-coreutils <at> gnu.org:
bug#23665; Package coreutils. (Tue, 31 May 2016 23:37:01 GMT)

Message #17 received at 23665 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Karl Berry <karl <at> freefriends.org>, assafgordon <at> gmail.com
Cc: 23665 <at> debbugs.gnu.org
Subject: Re: bug#23665: spaces in keys: doc, --debug in LC_ALL=C
Date: Tue, 31 May 2016 16:36:13 -0700
On 05/31/2016 04:15 PM, Karl Berry wrote:
> Are the
> collation rules for en_US.UTF-8 documented or even reasonably
> comprehensively described anywhere?

Although I think they are taken from ISO/IEC 14651, I expect they've 
diverged from the standard by now, as a new version of the standard came 
out this year. I don't know of any documentation other than the glibc 
source code itself.





Information forwarded to bug-coreutils <at> gnu.org:
bug#23665; Package coreutils. (Wed, 01 Jun 2016 00:39:01 GMT)

Message #20 received at 23665 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Karl Berry <karl <at> freefriends.org>
Cc: 23665 <at> debbugs.gnu.org
Subject: Re: bug#23665: spaces in keys: doc, --debug in LC_ALL=C
Date: Tue, 31 May 2016 20:38:11 -0400
Hello Karl and all,

> On May 31, 2016, at 19:15, Karl Berry <karl <at> freefriends.org> wrote:
[...]
> I'm not sure what you mean by [...].  The %lu?
> Are you proposing to just add the word "sort"?  That's not needed IMHO.

I was suggesting exactly that :)
Also, the word "key" appears in a few other messages; the attached patch adds "sort" to them all. This is of course just a suggestion, and we can use another syntax like the one you listed.

Attached are 3 patches, not finalized but good as a starting point for comments.

1. rephrase "key" with "sort key". Perhaps this is superfluous - thoughts ?

2. add a bit more verbose progress information to the 'sort-debug-warn.sh' test - just so it'll be easier to discuss the changed messages.

3. removes the 'maybe_space_aligned' and modifies the condition a bit.

Expanding on the third patch:
In the following two tests, the "leading space" warning is now printed (the numbers refer to the numbers added in patch 2):
   #7 
   sort -gbr -k1,1n -k1,1r --debug /dev/null
   #11
   sort -k1,1r --debug /dev/null

I would say this is correct, as spaces do matter for LC_ALL=C with "-r" sorting:

    $ cat -A 1.txt 
    x  A$
    x   B$
    x C$

    $ LC_ALL=C ./src/sort -k2,2r 1.txt 
    x C
    x  A
    x   B

    $ LC_ALL=C ./src/sort -k2b,2r 1.txt 
    x C
    x   B
    x  A

The "leading space" warning is removed from the last test, because there's no point in printing the warning for keys that are zero width and therefore ignored.


Comments welcomed,
  - assaf


[0001-sort-clearify-key-meaning-in-debug-warnings.patch (application/octet-stream, attachment)]

[0002-tests-sort-debug-warn-add-progress-information-lines.patch (application/octet-stream, attachment)]

[0003-sort-modify-leading-spaces-debug-warning-scenarios.patch (application/octet-stream, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#23665; Package coreutils. (Wed, 01 Jun 2016 00:55:02 GMT)

Message #23 received at 23665 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Assaf Gordon <assafgordon <at> gmail.com>, Karl Berry <karl <at> freefriends.org>
Cc: 23665 <at> debbugs.gnu.org
Subject: Re: bug#23665: spaces in keys: doc, --debug in LC_ALL=C
Date: Wed, 1 Jun 2016 01:54:40 +0100
On 01/06/16 01:38, Assaf Gordon wrote:
> Hello Karl and all,
>
>> On May 31, 2016, at 19:15, Karl Berry <karl <at> freefriends.org> wrote:
> [...]
>> I'm not sure what you mean by [...].  The %lu?
>> Are you proposing to just add the word "sort"?  That's not needed IMHO.
>
> I was suggesting exactly that :)
> Also, the word "key" appears in a few other messages; the attached patch adds "sort" to them all. This is of course just a suggestion, and we can use another syntax like the one you listed.
>
> Attached are 3 patches, not finalized but good as a starting point for comments.
>
> 1. rephrase "key" with "sort key". Perhaps this is superfluous - thoughts ?
>
> 2. add a bit more verbose progress information to the 'sort-debug-warn.sh' test - just so it'll be easier to discuss the changed messages.
>
> 3. removes the 'maybe_space_aligned' and modifies the condition a bit.

I'm 50:50 on 1.
2 and 3 are good to push.

thanks!





Information forwarded to bug-coreutils <at> gnu.org:
bug#23665; Package coreutils. (Wed, 01 Jun 2016 01:53:01 GMT)

Message #26 received at 23665 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Assaf Gordon <assafgordon <at> gmail.com>, Karl Berry <karl <at> freefriends.org>
Cc: 23665 <at> debbugs.gnu.org
Subject: Re: bug#23665: spaces in keys: doc, --debug in LC_ALL=C
Date: Tue, 31 May 2016 18:52:21 -0700
Assaf Gordon wrote:
> 1. rephrase "key" with "sort key". Perhaps this is superfluous - thoughts ?

I would leave it alone. This is the 'sort' program, after all, so it's hard to 
misinterpret "key". Plus, "obsolescent sort key" might be misinterpreted as 
meaning a key for an obsolescent sort, not a sort key that is obsolescent.




Information forwarded to bug-coreutils <at> gnu.org:
bug#23665; Package coreutils. (Wed, 01 Jun 2016 02:16:01 GMT)

Message #29 received at 23665 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: 23665 <at> debbugs.gnu.org, Karl Berry <karl <at> freefriends.org>
Subject: Re: bug#23665: spaces in keys: doc, --debug in LC_ALL=C
Date: Tue, 31 May 2016 22:15:22 -0400
> On May 31, 2016, at 20:54, Pádraig Brady <P <at> draigBrady.com> wrote:
> 
> On 01/06/16 01:38, Assaf Gordon wrote:
>> 
>> 2. add a bit more verbose progress information to the 'sort-debug-warn.sh' test - just so it'll be easier to discuss the changed messages.
>> 
>> 3. removes the 'maybe_space_aligned' and modifies the condition a bit.
> 
> 2 and 3 are good to push.

Thank you, pushed in:
 http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=d548f87595a193e21b170368bc8fc2ded4dadb73
 http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=6223bf94bfeac75fb4252095864a80545ba00a0d

====

Regarding documentation: how about the following?

'sort' without '-t' separates fields by whitespace (tab and space characters) and considers the whitespace characters to be part of the field's text. Use the 'b' sorting option to skip leading spaces.

Example:

   $ cat  s.txt
   M A
   M  C
   M   D
   M  B

Without 'b', leading spaces affect the sorting order of the second field:

   $ LC_ALL=C sort -k2 s.txt
   M   D
   M  B
   M  C
   M A

With 'b', leading spaces are skipped:

   $ LC_ALL=C sort -k2b s.txt
   M A
   M  B
   M  C
   M   D

For troubleshooting use 'sort --debug':

   $ LC_ALL=C ./src/sort --debug -k2 s.txt 
   sort: using simple byte comparison
   sort: leading blanks are significant in key 1; consider also specifying 'b'
   M   D
    ____
   _____
   M  B
    ___
   ____
   M  C
    ___
   ____
   M A
    __
   ___

=========

Should such an example go in the documentation, or in the new 'gotcha' page?

I can shorten the example (e.g. with only two letters, such as 'printf "A\n B\n"'), but perhaps a slightly longer, more verbose example would help readers understand the issue at a glance.
The fixed first field "M" is there to make it visually clear where the spaces are.

comments welcomed,
 - assaf









Information forwarded to bug-coreutils <at> gnu.org:
bug#23665; Package coreutils. (Wed, 01 Jun 2016 11:06:01 GMT)

Message #32 received at 23665 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 23665 <at> debbugs.gnu.org, Karl Berry <karl <at> freefriends.org>
Subject: Re: bug#23665: spaces in keys: doc, --debug in LC_ALL=C
Date: Wed, 1 Jun 2016 12:05:07 +0100
On 01/06/16 03:15, Assaf Gordon wrote:
>
>> On May 31, 2016, at 20:54, Pádraig Brady <P <at> draigBrady.com> wrote:
>>
>> On 01/06/16 01:38, Assaf Gordon wrote:
>>>
>>> 2. add a bit more verbose progress information to the 'sort-debug-warn.sh' test - just so it'll be easier to discuss the changed messages.
>>>
>>> 3. removes the 'maybe_space_aligned' and modifies the condition a bit.
>>
>> 2 and 3 are good to push.
>
> Thank you, pushed in:
>   http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=d548f87595a193e21b170368bc8fc2ded4dadb73
>   http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=6223bf94bfeac75fb4252095864a80545ba00a0d
>
> ====
>
> Regarding documentation: how about the following?
>
> 'sort' without '-t' separates fields by whitespace (tab and space characters) and considers the whitespace characters to be part of the field's text. Use the 'b' sorting option to skip leading spaces.
>

I've added essentially that summary to the --key description in the attached.

> Example:

I think these examples are a bit verbose for the info docs
for the amount of info they convey. There are so many combinations
of field handling options that it's best to give the rules
and defer to the --debug option for what's actually happening.

What I have done is to expand this discussion on field handling in:
http://www.pixelbeat.org/docs/coreutils-gotchas.html#sort

and, drilling down from there, have given an example of a comparison
where leading blanks are significant (and useful) at the bottom of:
http://www.pixelbeat.org/patches/coreutils/sort-debug/

cheers,
Pádraig.
[sort-key-docs.patch (text/x-patch, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#23665; Package coreutils. (Sun, 28 Oct 2018 06:03:02 GMT)

Message #35 received at 23665 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: 23665 <at> debbugs.gnu.org
Subject: Re: bug#23665: spaces in keys: doc, --debug in LC_ALL=C
Date: Sun, 28 Oct 2018 00:02:28 -0600
tags 23665 fixed
close 23665
stop

(triaging old bugs)

On 2016-05-31 8:15 p.m., Assaf Gordon wrote:
> 
>> On May 31, 2016, at 20:54, Pádraig Brady <P <at> draigBrady.com> wrote:
>>
>> On 01/06/16 01:38, Assaf Gordon wrote:
>>>
>>> 2. add a bit more verbose progress information to the 'sort-debug-warn.sh' test - just so it'll be easier to discuss the changed messages.
>>>
>>> 3. removes the 'maybe_space_aligned' and modifies the condition a bit.
>>
>> 2 and 3 are good to push.
> 
> Thank you, pushed in:
>   http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=d548f87595a193e21b170368bc8fc2ded4dadb73
>   http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=6223bf94bfeac75fb4252095864a80545ba00a0d
> 

With no further follow-ups in 2 years, I'm closing this as fixed.

-assaf




Added tag(s) fixed. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Sun, 28 Oct 2018 06:03:02 GMT)

bug closed, send any further explanations to 23665 <at> debbugs.gnu.org and Karl Berry <karl <at> freefriends.org> Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Sun, 28 Oct 2018 06:03:02 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 25 Nov 2018 12:24:10 GMT)

This bug report was last modified 6 years and 245 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.