GNU bug report logs - #10985
sort -k behavior possible problem: field span across the boundaries


Package: coreutils;

Reported by: Oleg Moskalenko <oleg.moskalenko <at> citrix.com>

Date: Fri, 9 Mar 2012 19:58:01 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Eric Blake <eblake <at> redhat.com>
Cc: tracker <at> debbugs.gnu.org
Subject: bug#10985: closed (sort -k behavior possible problem: field span
 across the boundaries)
Date: Fri, 09 Mar 2012 20:22:02 +0000
[Message part 1 (text/plain, inline)]
Your message dated Fri, 09 Mar 2012 13:20:48 -0700
with message-id <4F5A6620.8050008 <at> redhat.com>
and subject line Re: bug#10985: sort -k behavior possible problem: field span across the boundaries
has caused the debbugs.gnu.org bug report #10985,
regarding sort -k behavior possible problem: field span across the boundaries
to be marked as done.

(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)


-- 
10985: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=10985
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Oleg Moskalenko <oleg.moskalenko <at> citrix.com>
To: "'bug-coreutils <at> gnu.org'" <bug-coreutils <at> gnu.org>
Subject: sort -k behavior possible problem: field span across the boundaries
Date: Fri, 9 Mar 2012 11:46:45 -0800
[Message part 3 (text/plain, inline)]
Hi

While testing different GNU coreutils sort versions on different platforms (Linux and FreeBSD) I found that some behavior is probably not what a utility user expects.

Let's say we have to sort just two lines, numerically and stably:

$ sort -t "|" -ns -k2.3,2.7 <<!
1|234
1|2|34
!

The GNU sort output is:

1|234
1|2|34


The correct output (from my point of view) must be:

1|2|34
1|234

My reasoning is that applying the key spec "-k2.3,2.7" to the string "1|234" yields the key "4", while applying the same key spec to the string "1|2|34" must yield "" (the empty string), because the second field is just "2" and the characters from the 3rd to the 7th position of that field form an empty string. And the empty string is numerically smaller than a number, according to "info sort".

GNU sort, on the other hand, (I suppose) just takes an offset from the start of the field, without taking the actual field length into account. That yields the key "34", which is numerically larger than "4".

I do not know whether this is an intended behavior or a bug, but this is definitely non-intuitive and not what a reasonable user would expect.

Thanks a lot !
Oleg Moskalenko

[Message part 5 (message/rfc822, inline)]
From: Eric Blake <eblake <at> redhat.com>
To: Oleg Moskalenko <oleg.moskalenko <at> citrix.com>
Cc: 10985-done <at> debbugs.gnu.org
Subject: Re: bug#10985: sort -k behavior possible problem: field span across
	the boundaries
Date: Fri, 09 Mar 2012 13:20:48 -0700
[Message part 6 (text/plain, inline)]
tag 10985 notabug
thanks

On 03/09/2012 12:46 PM, Oleg Moskalenko wrote:
> Hi
> 
> While testing different GNU coreutils sort versions on different platforms (Linux and FreeBSD) I found that some behavior is probably not what a utility user expects.

Thanks for the report.  However, you probably found behavior that is
required by POSIX.

> 
> Let's say we have to sort just two lines, numerically and stably:
> 
> $ sort -t "|" -ns -k2.3,2.7 <<!
> 1|234
> 1|2|34
> !

Let's use 'sort --debug' to see what really happened:

$ LC_ALL=C sort --debug -t\| -ns -k2.3,2.7 <<a
> 1|234
> 1|2|34
> a
sort: using simple byte comparison
1|234
    _
1|2|34
    __

So sort located the start of the second field (the "2" that begins
"234" on one line and "2|34" on the other), then started the key at the
3rd byte counting from that location (even if that byte lies in the
next field).

This behavior is required by POSIX:

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html

> 
> The correct output (from my point of view) must be:
> 
> 1|2|34
> 1|234

Sorry, but that interpretation does not match POSIX.

> 
> My reasoning is that applying the key spec "-k2.3,2.7" to the string "1|234" yields the key "4", while applying the same key spec to the string "1|2|34" must yield "" (the empty string),

That's where you are wrong.  POSIX states:

>> The notation:
>> 
>> -k field_start[type][,field_end[type]]
>> 
>> shall define a key field that begins at field_start and ends at field_end inclusive, unless field_start falls beyond the end of the line or after field_end, in which case the key field is empty. A missing field_end shall mean the last character of the line.
>> 
>> A field comprises a maximal sequence of non-separating characters and, in the absence of option -t, any preceding field separator.
>> 
>> The field_start portion of the keydef option-argument shall have the form:
>> 
>> field_number[.first_character]
>> 
>> Fields and characters within fields shall be numbered starting with 1. The field_number and first_character pieces, interpreted as positive decimal integers, shall specify the first character to be used as part of a sort key. If .first_character is omitted, it shall refer to the first character of the field.

That is, the field_start 2.3 means to start at the third character
counting from the start of the second field, regardless of whether any
field separators intervene, and _only_ the end of the line (not another
field separator) can result in an empty key field.
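
The two readings discussed above can be compared with a small awk
sketch. It only models the key selection for -k2.3,2.7 with "|" as the
separator and is not how sort is implemented. The function name "keys"
and the output labels are made up for illustration, and single-byte
characters are assumed.

```shell
# Hypothetical helper: show the key each reading selects for -k2.3,2.7.
keys() {
  awk '{
    n = split($0, f, "|")
    # Reporter reading: characters 3..7 of field 2, clipped to that field.
    bounded = (n >= 2) ? substr(f[2], 3, 5) : ""
    # Offset reading: 5 characters starting at the 3rd character counted
    # from the start of field 2, clipped only by the end of the line.
    start = index($0, "|")
    posix = (start > 0) ? substr($0, start + 3, 5) : ""
    printf "%s posix=[%s] bounded=[%s]\n", $0, posix, bounded
  }'
}

printf '%s\n' '1|234' '1|2|34' | keys
# 1|234 posix=[4] bounded=[4]
# 1|2|34 posix=[34] bounded=[]
```

The "posix" column agrees with the underlined bytes in the --debug
output earlier in this message.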

> 
> I do not know whether this is an intended behavior or a bug,

Intended and mandated by the standards.

> but this is definitely non-intuitive and not what a reasonable user would expect.

Perhaps so, but if you want it changed, you need to file a bug report
against POSIX.  As such, I'm going to close out this coreutils bug.
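
For anyone who needs the field-bounded ordering in the meantime, one
possible workaround is to extract the key outside of sort and sort on
that instead. The following is a minimal sketch, assuming "|" as the
separator and that field 2 never contains a TAB; the function name
"sort_bounded" is hypothetical.

```shell
# Hypothetical helper: sort numerically and stably on characters 3..7 of
# field 2, clipped to that field.
# Step 1 (awk):  prefix each line with the extracted key and a TAB.
# Step 2 (sort): compare only the prefixed key, numerically and stably.
# Step 3 (cut):  drop the prefixed key again.
sort_bounded() {
  awk -F'|' '{ print substr($2, 3, 5) "\t" $0 }' |
    LC_ALL=C sort -t "$(printf '\t')" -s -n -k1,1 |
    cut -f2-
}

printf '%s\n' '1|234' '1|2|34' | sort_bounded
# 1|2|34
# 1|234
```

Here the empty key of "1|2|34" compares numerically as zero, so that
line sorts first, which is the order the reporter expected.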

-- 
Eric Blake   eblake <at> redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


This bug report was last modified 13 years and 70 days ago.
