GNU bug report logs - #10985
sort -k behavior possible problem: field span across the boundaries
Report forwarded to bug-coreutils <at> gnu.org: bug#10985; Package coreutils. (Fri, 09 Mar 2012 19:58:01 GMT)
Acknowledgement sent to Oleg Moskalenko <oleg.moskalenko <at> citrix.com>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Fri, 09 Mar 2012 19:58:01 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hi
While testing different GNU coreutils sort versions on different platforms (Linux and FreeBSD) I found that some behavior is probably not what a utility user expects.
Let's say we have to sort just two lines, numerically and stably:
$ sort -t "|" -ns -k2.3,2.7 <<!
1|234
1|2|34
!
The GNU sort output is:
1|234
1|2|34
The correct output (from my point of view) must be:
1|2|34
1|234
My reasoning is that applying the key spec "-k2.3,2.7" to the string "1|234" yields the key "4", while applying the same key spec to the string "1|2|34" must yield "" (the empty string), because the second field is just "2" and the characters from the 3rd to the 7th position of that field give an empty string. And the empty string compares as smaller than a positive number in a numeric sort, according to "info sort".
On the other hand, GNU sort (I suppose) just takes an offset from the field start, without taking the real field length into account. For the second line it yields the key "34", which is numerically larger than "4".
I do not know whether this is intended behavior or a bug, but it is definitely non-intuitive and not what a reasonable user would expect.
Thanks a lot !
Oleg Moskalenko
Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Fri, 09 Mar 2012 20:22:01 GMT)
Reply sent to Eric Blake <eblake <at> redhat.com>: You have taken responsibility. (Fri, 09 Mar 2012 20:22:02 GMT)
Notification sent to Oleg Moskalenko <oleg.moskalenko <at> citrix.com>: bug acknowledged by developer. (Fri, 09 Mar 2012 20:22:02 GMT)
Message #12 received at 10985-done <at> debbugs.gnu.org (full text, mbox):
tag 10985 notabug
thanks
On 03/09/2012 12:46 PM, Oleg Moskalenko wrote:
> Hi
>
> While testing different GNU coreutils sort versions on different platforms (Linux and FreeBSD) I found that some behavior is probably not what a utility user expects.
Thanks for the report. However, you probably found behavior that is
required by POSIX.
>
> Let's say we have to sort just two lines, numerically and stably:
>
> $ sort -t "|" -ns -k2.3,2.7 <<!
> 1|234
> 1|2|34
> !
Let's use 'sort --debug' to see what really happened:
$ LC_ALL=C sort --debug -t\| -ns -k2.3,2.7 <<a
> 1|234
> 1|2|34
> a
sort: using simple byte comparison
1|234
    _
1|2|34
    __
So sort located the start of the second field ("234" on one line and
"2|34" on the other), then began the key at the 3rd byte counting from
that location (even where that byte lies in the next field).
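The key selection described above can be reproduced by hand. The following is only a rough sketch, assuming a POSIX shell and `cut`; the helper name `key_2_3_to_2_7` is made up for illustration and is not part of sort. It models this one key spec (-t '|' -k2.3,2.7) and assumes each line contains at least one '|'.

```shell
# Hypothetical helper, not part of sort: prints the bytes that
# -t '|' -k2.3,2.7 would select, following the description above.
key_2_3_to_2_7() {
    line=$1
    # Everything from the start of field 2 onward (the text after the
    # first '|'). Assumes the line contains at least one '|'.
    rest=${line#*"|"}
    # Bytes 3 through 7 counted from there; any later '|' is treated
    # as an ordinary byte, so the key can run into the next field.
    printf '%s\n' "$rest" | cut -b 3-7
}

key_2_3_to_2_7 '1|234'     # prints 4
key_2_3_to_2_7 '1|2|34'    # prints 34
```

The two keys match the parts underlined by 'sort --debug': "4" for the first line and "34" for the second, so a numeric sort leaves "1|234" first.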
This behavior is required by POSIX:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html
>
> The correct output (from my point of view) must be:
>
> 1|2|34
> 1|234
Sorry, but that interpretation does not match POSIX.
>
> My reasoning is that applying the key specs "-k2.3,2.7" to string "1|234" we obtain the key "4", and applying the same key to the string "1|2|34" we must obtain "" (empty string),
That's where you are wrong. POSIX states:
>> The notation:
>>
>> -k field_start[type][,field_end[type]]
>>
>> shall define a key field that begins at field_start and ends at field_end inclusive, unless field_start falls beyond the end of the line or after field_end, in which case the key field is empty. A missing field_end shall mean the last character of the line.
>>
>> A field comprises a maximal sequence of non-separating characters and, in the absence of option -t, any preceding field separator.
>>
>> The field_start portion of the keydef option-argument shall have the form:
>>
>> field_number[.first_character]
>>
>> Fields and characters within fields shall be numbered starting with 1. The field_number and first_character pieces, interpreted as positive decimal integers, shall specify the first character to be used as part of a sort key. If .first_character is omitted, it shall refer to the first character of the field.
That is, the field_start 2.3 means to start at the third character
counting from the start of the second field, regardless of whether any
field separators intervene, and _only_ the end of a line (not another
field separator) can result in an empty key field.
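For contrast, here is a small sketch of the one case where the key does come out empty: the line ends before the third character of field 2. It assumes a sort that accepts the same options used elsewhere in this report (-t, -n, -s, -k); the function name `demo_empty_key` is made up for illustration.

```shell
# Hypothetical demo wrapper. In the second input line, field 2 is "2"
# and the line ends right after it, so the key 2.3,2.7 is empty there
# and compares as 0 under -n. The first line's key is "4".
demo_empty_key() {
    printf '%s\n' '1|234' '1|2' | LC_ALL=C sort -t '|' -ns -k2.3,2.7
}

demo_empty_key
# prints:
# 1|2
# 1|234
```

Here "1|2" moves ahead of "1|234" because its key is empty, which is the ordering the original report expected to get from a field separator as well.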
>
> I do not know whether this is an intended behavior or a bug,
Intended and mandated by the standards.
> but this is definitely non-intuitive and not what a reasonable user would expect.
Perhaps so, but if you want it changed, you need to file a bug report
against POSIX. As such, I'm going to close out this coreutils bug.
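For readers who want keys that stop at the field boundary, one possible workaround is to compute the key outside of sort. This is a sketch only, not something proposed in this thread: the function name `sort_by_field2_chars_3_to_7` is hypothetical, and it assumes `awk`, `sort`, and `cut` are available. Note that awk's `substr` counts characters rather than bytes in multibyte locales.

```shell
# Hypothetical workaround: prepend a field-bounded key, sort on it,
# then strip it again (decorate, sort, undecorate).
sort_by_field2_chars_3_to_7() {
    tab=$(printf '\t')
    # substr($2, 3, 5) takes characters 3..7 of field 2 only, so the
    # key is empty when field 2 is shorter than 3 characters.
    awk -F'|' '{ print substr($2, 3, 5) "\t" $0 }' |
        LC_ALL=C sort -t "$tab" -s -k1,1n |
        cut -f 2-
}

printf '%s\n' '1|234' '1|2|34' | sort_by_field2_chars_3_to_7
# prints:
# 1|2|34
# 1|234
```

With the key bounded by the field, "1|2|34" gets an empty key and sorts first, which is the output the original report asked for.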
--
Eric Blake eblake <at> redhat.com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Message #13 received at 10985-done <at> debbugs.gnu.org (full text, mbox):
Hi Blake
Thank you for the reply and explanations !
Best regards,
Oleg
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 07 Apr 2012 11:24:04 GMT)
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.