GNU bug report logs -
#10985
sort -k behavior possible problem: field span across the boundaries
Message #7 received at control <at> debbugs.gnu.org:
tag 10985 notabug
thanks
On 03/09/2012 12:46 PM, Oleg Moskalenko wrote:
> Hi
>
> While testing different GNU coreutils sort versions on different platforms (Linux and FreeBSD) I found that some behavior is probably not what a utility user expects.
Thanks for the report. However, you probably found behavior that is
required by POSIX.
>
> Let's, say, we have to sort (numerically stable) just two lines:
>
> $ sort -t "|" -ns -k2.3,2.7 <<!
> 1|234
> 1|2|34
> !
Let's use 'sort --debug' to see what really happened:
$ LC_ALL=C sort --debug -t\| -ns -k2.3,2.7 <<a
> 1|234
> 1|2|34
> a
sort: using simple byte comparison
1|234
    _
1|2|34
    __
So sort located the start of the second field ("234" on one line,
"2|34" on the other), then began the key at the 3rd byte counting from
that location, even where that byte lies in the next field. The
resulting keys are "4" and "34".
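To make the offsets concrete, here is a rough Python model of that key extraction. It is only a sketch of the rule as described in this message, not the coreutils source: the names field_start_offset and extract_key are made up for illustration, and it assumes -t is given, no 'b' modifier, and single-byte characters.

```python
def field_start_offset(line, sep, field):
    """Offset of the first character of the 1-based field `field`.

    Hypothetical helper: assumes -t SEP is in effect, so fields are
    simply separated by single SEP characters.
    """
    pos = 0
    for _ in range(field - 1):
        nxt = line.find(sep, pos)
        if nxt == -1:
            return len(line)  # fewer fields than requested: end of line
        pos = nxt + 1
    return pos


def extract_key(line, sep, start, end):
    """Key for -k F1.C1,F2.C2 given as start=(F1, C1), end=(F2, C2).

    Character offsets are counted from the start of the field and are
    NOT clipped at the next separator, only at the end of the line.
    """
    (sf, sc), (ef, ec) = start, end
    begin = min(len(line), field_start_offset(line, sep, sf) + sc - 1)
    limit = min(len(line), field_start_offset(line, sep, ef) + ec)
    return line[begin:limit] if begin < limit else ""


lines = ["1|234", "1|2|34"]
for line in lines:
    print(repr(line), "->", repr(extract_key(line, "|", (2, 3), (2, 7))))

# Numeric comparison of the two keys; int() is enough for this input only.
ordered = sorted(lines, key=lambda l: int(extract_key(l, "|", (2, 3), (2, 7))))
print(ordered)
```

Under these assumptions the keys come out as "4" and "34", so "1|234" sorts first, which matches the order sort produced.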
This behavior is required by POSIX:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html
>
> The correct output (from my point of view) must be:
>
> 1|2|34
> 1|234
Sorry, but that interpretation does not match POSIX.
>
> My reasoning is that applying the key specs "-k2.3,2.7" to string "1|234" we obtain the key "4", and applying the same key to the string "1|2|34" we must obtain "" (empty string),
That's where you are wrong. POSIX states:
>> The notation:
>>
>> -k field_start[type][,field_end[type]]
>>
>> shall define a key field that begins at field_start and ends at field_end inclusive, unless field_start falls beyond the end of the line or after field_end, in which case the key field is empty. A missing field_end shall mean the last character of the line.
>>
>> A field comprises a maximal sequence of non-separating characters and, in the absence of option -t, any preceding field separator.
>>
>> The field_start portion of the keydef option-argument shall have the form:
>>
>> field_number[.first_character]
>>
>> Fields and characters within fields shall be numbered starting with 1. The field_number and first_character pieces, interpreted as positive decimal integers, shall specify the first character to be used as part of a sort key. If .first_character is omitted, it shall refer to the first character of the field.
That is, the field_start 2.3 means to start at the third character
counting from the start of the second field, regardless of whether any
field separators intervene, and _only_ the end of a line (not another
field separator) can result in an empty key field.
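As a small illustration of that last point, the following Python sketch computes the key for a field_start of 2.3 with no field_end. The function key_from is hypothetical and written only for this message; it assumes -t '|', no 'b' modifier, and single-byte characters.

```python
def key_from(line, sep, field, char):
    """Text from character `char` of field `field` to the end of the line.

    Illustrative only: models a -k FIELD.CHAR key with field_end omitted,
    which POSIX defines as running to the last character of the line.
    """
    pos = 0
    for _ in range(field - 1):
        nxt = line.find(sep, pos)
        if nxt == -1:
            pos = len(line)  # the requested field does not exist
            break
        pos = nxt + 1
    # Count from the start of the field; only the end of the line stops us.
    begin = min(len(line), pos + char - 1)
    return line[begin:]


for line in ("1|234", "1|2|34", "1|2"):
    print(repr(line), "->", repr(key_from(line, "|", 2, 3)))
```

In this model only "1|2" yields an empty key, because the third character counted from the start of its second field lies beyond the end of the line. For "1|2|34" the count simply runs across the separator and the key is "34".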
>
> I do not know whether this is an intended behavior or a bug,
Intended and mandated by the standards.
> but this is definitely non-intuitive and not what a reasonable user would expect.
Perhaps so, but if you want it changed, you need to file a bug report
against POSIX. As such, I'm going to close out this coreutils bug.
--
Eric Blake eblake <at> redhat.com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
This bug report was last modified 13 years and 71 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.