GNU bug report logs - #10985
sort -k behavior possible problem: field span across the boundaries

Package: coreutils;

Reported by: Oleg Moskalenko <oleg.moskalenko <at> citrix.com>

Date: Fri, 9 Mar 2012 19:58:01 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.


Report forwarded to bug-coreutils <at> gnu.org:
bug#10985; Package coreutils. (Fri, 09 Mar 2012 19:58:01 GMT)

Acknowledgement sent to Oleg Moskalenko <oleg.moskalenko <at> citrix.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Fri, 09 Mar 2012 19:58:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Oleg Moskalenko <oleg.moskalenko <at> citrix.com>
To: "'bug-coreutils <at> gnu.org'" <bug-coreutils <at> gnu.org>
Subject: sort -k behavior possible problem: field span across the boundaries
Date: Fri, 9 Mar 2012 11:46:45 -0800
Hi

While testing different GNU coreutils sort versions on different platforms (Linux and FreeBSD) I found that some behavior is probably not what a utility user expects.

Let's say we have to sort just two lines, numerically and stably:

$ sort -t "|" -ns -k2.3,2.7 <<!
1|234
1|2|34
!

The GNU sort output is:

1|234
1|2|34


The correct output (from my point of view) must be:

1|2|34
1|234

My reasoning is that applying the key specs "-k2.3,2.7" to string "1|234" we obtain the key "4", and applying the same key to the string "1|2|34" we must obtain "" (empty string), because the second field is just "2" and the characters from the 3rd to the 7th position of it give us an empty string. Numerically, the empty string is smaller than a number, according to "info sort".

On the other hand, GNU sort (I suppose) just takes an offset from the field start, without taking the actual field length into account. That yields the key "34", which is numerically larger than "4".

I do not know whether this is an intended behavior or a bug, but this is definitely non-intuitive and not what a reasonable user would expect.

Thanks a lot!
Oleg Moskalenko

Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Fri, 09 Mar 2012 20:22:01 GMT)

Reply sent to Eric Blake <eblake <at> redhat.com>:
You have taken responsibility. (Fri, 09 Mar 2012 20:22:02 GMT)

Notification sent to Oleg Moskalenko <oleg.moskalenko <at> citrix.com>:
bug acknowledged by developer. (Fri, 09 Mar 2012 20:22:02 GMT)

Message #12 received at 10985-done <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Oleg Moskalenko <oleg.moskalenko <at> citrix.com>
Cc: 10985-done <at> debbugs.gnu.org
Subject: Re: bug#10985: sort -k behavior possible problem: field span across
	the boundaries
Date: Fri, 09 Mar 2012 13:20:48 -0700
tag 10985 notabug
thanks

On 03/09/2012 12:46 PM, Oleg Moskalenko wrote:
> Hi
> 
> While testing different GNU coreutils sort versions on different platforms (Linux and FreeBSD) I found that some behavior is probably not what a utility user expects.

Thanks for the report.  However, you probably found behavior that is
required by POSIX.

> 
> Let's say we have to sort just two lines, numerically and stably:
> 
> $ sort -t "|" -ns -k2.3,2.7 <<!
> 1|234
> 1|2|34
> !

Let's use 'sort --debug' to see what really happened:

$ LC_ALL=C sort --debug -t\| -ns -k2.3,2.7 <<a
> 1|234
> 1|2|34
> a
sort: using simple byte comparison
1|234
    _
1|2|34
    __

So sort located the start of the second field ("234" on one line, and
"2|34" on the other), then began the key at the 3rd byte counting from
that location (even if that byte is in the next field).

This behavior is required by POSIX:

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html

> 
> The correct output (from my point of view) must be:
> 
> 1|2|34
> 1|234

Sorry, but that interpretation does not match POSIX.

> 
> My reasoning is that applying the key specs "-k2.3,2.7" to string "1|234" we obtain the key "4", and applying the same key to the string "1|2|34" we must obtain "" (empty string),

That's where you are wrong.  POSIX states:

>> The notation:
>> 
>> -k field_start[type][,field_end[type]]
>> 
>> shall define a key field that begins at field_start and ends at field_end inclusive, unless field_start falls beyond the end of the line or after field_end, in which case the key field is empty. A missing field_end shall mean the last character of the line.
>> 
>> A field comprises a maximal sequence of non-separating characters and, in the absence of option -t, any preceding field separator.
>> 
>> The field_start portion of the keydef option-argument shall have the form:
>> 
>> field_number[.first_character]
>> 
>> Fields and characters within fields shall be numbered starting with 1. The field_number and first_character pieces, interpreted as positive decimal integers, shall specify the first character to be used as part of a sort key. If .first_character is omitted, it shall refer to the first character of the field.

That is, the field_start 2.3 means to start at the third character
counting from the start of the second field, regardless of whether any
field separators intervene, and _only_ the end of a line (not another
field separator) can result in an empty key field.
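
As a rough illustration of that reading (a sketch, not how sort is
actually implemented), the helper below prints the characters that a key
such as -k2.3,2.7 would cover. The name posix_key is hypothetical, and the
sketch assumes a single-character separator "|" and ignores -b and
multibyte locales.

```shell
# posix_key: hypothetical helper (not part of coreutils). For each input
# line it prints characters 3 through 7 counted from the start of field 2,
# without stopping at later "|" separators.
posix_key() {
    awk '{
        sep = index($0, "|")              # position of the first separator
        if (sep == 0) { print ""; next }  # no second field: empty key
        rest = substr($0, sep + 1)        # text from the start of field 2
        print substr(rest, 3, 5)          # only the end of line cuts this short
    }'
}

printf '1|234\n1|2|34\n' | posix_key
```

For the two lines from the report this prints "4" and "34", the same spans
that --debug underlines above.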

> 
> I do not know whether this is an intended behavior or a bug,

Intended and mandated by the standards.

> but this is definitely non-intuitive and not what a reasonable user would expect.

Perhaps so, but if you want it changed, you need to file a bug report
against POSIX.  As such, I'm going to close out this coreutils bug.
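
If the ordering from the report is what is wanted, one possible workaround
is to clip the key to the field before sorting. The sketch below is an
assumption-laden illustration, not a sort feature: the helper name
clipped_sort is hypothetical, the separator is fixed to "|", and the key is
fixed to characters 3 through 7 of field 2.

```shell
# clipped_sort: hypothetical helper. It sorts on characters 3-7 of field 2
# only, never reading past the end of that field.
clipped_sort() {
    # Decorate each line with its clipped key and a tab, sort stably and
    # numerically on that key alone, then strip the key again.
    tab=$(printf '\t')
    awk -F'|' '{ print substr($2, 3, 5) "\t" $0 }' |
        LC_ALL=C sort -t "$tab" -s -n -k1,1 |
        cut -f2-
}

printf '1|234\n1|2|34\n' | clipped_sort
```

Here "1|2|34" gets an empty key, which compares as zero under -n, so it
sorts ahead of "1|234" with its key "4".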

-- 
Eric Blake   eblake <at> redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Message #13 received at 10985-done <at> debbugs.gnu.org:

From: Oleg Moskalenko <oleg.moskalenko <at> citrix.com>
To: 'Eric Blake' <eblake <at> redhat.com>
Cc: "10985-done <at> debbugs.gnu.org" <10985-done <at> debbugs.gnu.org>
Subject: RE: bug#10985: sort -k behavior possible problem: field span across
	the boundaries
Date: Fri, 9 Mar 2012 12:28:13 -0800
Hi Blake

Thank you for the reply and the explanation!

Best regards,
Oleg

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 07 Apr 2012 11:24:04 GMT)