GNU bug report logs - #14988 sort enhancement request
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 14988 in the body.
You can then email your comments to 14988 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-coreutils <at> gnu.org: bug#14988; Package coreutils. (Tue, 30 Jul 2013 21:45:03 GMT)
Full text and rfc822 format available.
Acknowledgement sent to Danny Nicholas <danny.nicholas <at> pinnacledatasystems.com>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 30 Jul 2013 21:45:04 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Hi guys,
I am presently using version 7.1 on a Solaris box. I downloaded 8.21 and really love the improvement in speed (almost 50% in some tests). I am looking to replace the commercial product NSORT and would like this feature in the source instead of a wrapper. If I have a file
XXXX300001XXXX
XXXX300002XXXX
XXXX300003XXXX
XXXX300003XXXX
XXXX300003XXXX
XXXX300003XXXX
XXXX300004XXXX
XXXX300005XXXX
XXXX300006XXXX
XXXX300007XXXX
NSORT keeps the 4 300003 records together in entry sequence. My present work-around is to use a Python script that reads in the whole file and creates a pseudo-key that is 30000X plus an 8 digit sequence number (I process millions of records). What I am thinking of is an -es (--entry-sequence) that would add a hidden -k to process on this internal sequence. If I figure out how to do this on my own, I will submit it to you.
Thanks,
Danny Nicholas
Applications Programmer
Pinnacle Data Systems L.L.C.
Office: (205) 307-6874
danny.nicholas <at> pinnacledatasystems.com
www.pinnacledatasystems.com<http://www.pinnacledatasystems.com/>
CONFIDENTIALITY: This email (including any attachments) may contain confidential, proprietary and privileged information, and unauthorized disclosure or use is prohibited. If you received this email in error, please notify the sender and delete this email from your system.
Information forwarded to bug-coreutils <at> gnu.org: bug#14988; Package coreutils. (Tue, 30 Jul 2013 22:35:01 GMT)
Message #8 received at 14988 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
tag 14988 needinfo
thanks
On 07/30/2013 02:51 PM, Danny Nicholas wrote:
> Hi guys,
[can you convince your mailer to wrap long lines?]
> I am presently using version 7.1 on a Solaris box. I downloaded 8.21 and really love the improvement in speed (almost 50% in some tests). I am looking to replace the commercial product NSORT and would like this feature in the source instead of a wrapper. If I have a file
> XXXX300001XXXX
> XXXX300002XXXX
> XXXX300003XXXX
> XXXX300003XXXX
> XXXX300003XXXX
> XXXX300003XXXX
> XXXX300004XXXX
> XXXX300005XXXX
> XXXX300006XXXX
> XXXX300007XXXX
As written, your example input is already in sorted order, and with no
other distinguishing features on the lines, you haven't shown that sort
is failing to output lines in the order you want. I also can't tell
whether the XXXX are the actual bytes you are sorting or placeholders
in a sanitized version of your actual data set. You'll need to give us
an actual example of lines that nsort and GNU sort order differently,
along with the command line options you attempted for GNU sort, before
we can tell you what to try next.
>
> NSORT keeps the 4 300003 records together in entry sequence. My present work-around is to use a Python script that reads in the whole file and creates a pseudo-key that is 30000X plus an 8 digit sequence number (I process millions of records). What I am thinking of is an -es (--entry-sequence) that would add a hidden -k to process on this internal sequence. If I figure out how to do this on my own, I will submit it to you.
Short options must be one letter long; writing your proposed 'sort -es'
would be the same as 'sort -e -s'. Also, we are reluctant to burn short
options; these days, it's better to add a long option only, until it
proves its popularity, so that we don't collide with any future
standardized short options.
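The bundling rule can be seen with the POSIX shell's getopts builtin,
which follows the same option syntax conventions. This is only a
minimal sketch: the helper name parse_opts is made up, and the letters
e and s are illustrative (sort itself has no -e option).

```shell
# Hypothetical helper: list the options getopts recognizes in its arguments.
parse_opts() {
  OPTIND=1            # reset the parser so the function can be called repeatedly
  seen=
  while getopts es opt "$@"; do
    seen="$seen -$opt"
  done
  printf '%s\n' "${seen# }"
}

parse_opts -es      # parsed as two separate options, same as: parse_opts -e -s
```

Both spellings should print "-e -s", which is why a two-letter short
option such as -es cannot be distinguished from -e followed by -s.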
It SOUNDS like you are merely asking for a stable sort option. Have you
tried the -s/--stable option? That effectively adds an invisible key of
last resort that says if two lines otherwise compare equal, sort them so
that the line occurring first in input also occurs first in output.
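As a rough illustration of what -s does, here is a sketch that assumes
GNU sort and uses made-up input: the "-d", "-a" and "-c" suffixes are
not part of the original data and are there only to make the input
order of the equal-key lines visible.

```shell
# Sort on bytes 5-10 only; -s keeps lines with equal keys in input order.
stable_by_digits() {
  LC_ALL=C sort -s -k1.5,1.10n
}

printf '%s\n' \
  'XXXX300003XXXX-d' \
  'XXXX300001XXXX-x' \
  'XXXX300003XXXX-a' \
  'XXXX300002XXXX-y' \
  'XXXX300003XXXX-c' | stable_by_digits
```

The 300001 and 300002 lines should come first, followed by the three
300003 lines in their original d, a, c order.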
At any rate, I'm marking this bug as 'needinfo' so that we can get more
feedback on whether --stable already meets your needs, or at least so we
can get a test case that we can play with to see what you are really
asking for.
Also, have you played with 'sort --debug'? It shows you a lot more
details on EXACTLY what sort is looking at. For example, I am able to
do a numeric sort on JUST the 6 digits in between the XXXX fillers of
the example you listed:
$ printf 'XXXX300002XXXX\nXXXX300001XXXX\n' \
| LC_ALL=C sort --debug -k1.5,1.10n -s
sort: using simple byte comparison
XXXX300001XXXX
______
XXXX300002XXXX
______
>
> CONFIDENTIALITY: This email (including any attachments) may contain confidential,
Sorry, but this disclaimer is unenforceable on publicly archived lists.
It is considered poor netiquette to use your employer's email if they
insist on adding this on your behalf, and you may be better off sending
the mail from a personal account.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
[signature.asc (application/pgp-signature, attachment)]
Added tag(s) moreinfo. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Tue, 30 Jul 2013 22:35:03 GMT)
Information forwarded to bug-coreutils <at> gnu.org: bug#14988; Package coreutils. (Tue, 30 Jul 2013 22:44:05 GMT)
Message #13 received at 14988 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On 07/30/2013 04:33 PM, Eric Blake wrote:
> It SOUNDS like you are merely asking for a stable sort option. Have you
> tried the -s/--stable option? That effectively adds an invisible key of
> last resort that says if two lines otherwise compare equal, sort them so
> that the line occurring first in input also occurs first in output.
An alternative view of how -s works (that more closely matches what you
will see in 'sort --debug' output) is that POSIX requires that 'sort'
behave as if a key of -k1 were always the last key present (if two lines
otherwise compare equal, sort them by a strcoll() comparison of the
entire line), where -s is the GNU extension that disables this POSIX
implicit full-line sort key, so that you are left with a stable sort.
Since 'adding an option to remove a key' sounds a little fishy on the
surface, you can see why I gave my first explanation instead. But at
the end of the day, all that matters is the result, so pick whichever
mental representation you find easier to understand :)
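The two mental models can be compared side by side. This is a sketch
under assumptions: it needs GNU sort, it pins the C locale so that every
comparison is a plain byte comparison, and the three sample lines and
the helper name sample are invented for illustration.

```shell
# Hypothetical sample: two lines share the key "k" but differ afterwards.
sample() { printf '%s\n' 'k zebra' 'k apple' 'j mango'; }

sample | LC_ALL=C sort -k1,1          # ties broken by the implicit whole-line key
sample | LC_ALL=C sort -s -k1,1       # ties left in input order
sample | LC_ALL=C sort -s -k1,1 -k1   # -s plus an explicit whole-line key
```

In this locale the first and third commands should produce the same
output (apple before zebra), while the second keeps zebra ahead of
apple because that is how they appeared in the input.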
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
[signature.asc (application/pgp-signature, attachment)]
Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Wed, 31 Jul 2013 14:00:06 GMT)
Reply sent to Eric Blake <eblake <at> redhat.com>: You have taken responsibility. (Wed, 31 Jul 2013 14:00:10 GMT)
Notification sent to Danny Nicholas <danny.nicholas <at> pinnacledatasystems.com>: bug acknowledged by developer. (Wed, 31 Jul 2013 14:00:13 GMT)
Message #20 received at 14988-done <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
tag 14988 notabug
thanks
[re-adding the list; and please don't top-post on technical lists]
On 07/31/2013 07:19 AM, Danny Nicholas wrote:
> Thank you Eric. We have two sorts on our system. Our /usr/bin/sort does not support the -s option,
Makes sense - the '-s' option is a GNU extension, and your /usr/bin/sort
is probably not GNU sort. If you want stable sorting using only POSIX
features, then you have to supply enough sort keys so that no two lines
ever compare equal (since POSIX has no way to disable the full-line sort
of last resort). Depending on the input to be sorted, this may
indeed require a pre-filter run that adds line numbering (by the way,
sed's '=' command can do this much more efficiently than a Python
script), then sorting, then a post-filter run that removes the line number.
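One way the pre-filter, sort, post-filter pipeline might look using
only POSIX features is sketched below. The helper name
posix_stable_sort, the tab used as a separator, and the choice of the
first byte as the sort key are all assumptions made for the example; it
also assumes the input lines contain no tab characters.

```shell
# Hypothetical helper: stable sort on the first byte of each line, POSIX tools only.
posix_stable_sort() {
  tab=$(printf '\t')
  sed = |                                       # emit each line number on its own line
    sed "N;s/\n/$tab/" |                        # join number and line with a tab
    LC_ALL=C sort -t "$tab" -k2.1,2.1 -k1,1n |  # real key first, then line number
    cut -f2-                                    # strip the line number again
}

printf '%s\n' 'b zulu' 'a kilo' 'b alfa' 'a echo' | posix_stable_sort
```

Because every line number is unique, no two decorated lines compare
equal, so the full-line comparison of last resort never gets a say. The
output should be "a kilo", "a echo", "b zulu", "b alfa", that is, input
order within each key.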
> but our /usr/local/bin/sort does.
Indeed - life is simpler if you can write your script to ensure that it
always sets PATH to use the full power of the GNU tools.
> Unfortunately, that did not resolve the issue. Here is a portion of the file I'm trying to sort
Thank you - THIS makes much more sense for understanding your problem.
> 010_000001_0000731_00001_200000081610_<Customer>
> 010_000001_0000731_00002_200000081610_ <CCODEPAGE>4102 LANGUAGE EN</CCODEPAGE>
> 010_000001_0000731_00003_200000081610_ <FirstCopy>YES</FirstCopy>
> 010_000001_0000731_00003_200000081610_ <eapprovetype>010</eapprovetype>
> 010_000001_0000731_00003_200000081610_ <lastpaymentdate>06/12/2013</lastpaymentdate>
> 010_000001_0000731_00003_200000081610_ <lastpaymentamount> 277.59</lastpaymentamount>
> 010_000001_0000731_00003_200000081610_ <SuppressOutBadVariableCopies></SuppressOutBadVariableCopies>
> 010_000001_0000731_00003_200000081610_ <CPAGENAME>PAGE1</CPAGENAME>
> 010_000001_0000731_00004_200000081610_ <DG_BILL_LAYOUT>REGULAR</DG_BILL_LAYOUT>
> 010_000001_0000731_00005_200000081610_ <DC-DEVICE>PRINTER</DC-DEVICE>
> 010_000001_0000731_00006_200000081610_ <DC-RDI>S</DC-RDI>
> 010_000001_0000731_00007_200000081610_ <DC-SENDTYPE>PRINTER</DC-SENDTYPE>
> 010_000001_0000731_00008_200000081610_ <DSY-SYSID>R3P</DSY-SYSID>
>
> What I am executing is /usr/local/bin/sort -k 1,36 -s file -o file2
So, with "-k1,36" you asked sort to treat as its sort key the portion of
the line ranging from the first field to the 36th field. I only see 2
fields in most of the lines (a few have more, but none of them with 36
fields), so you are basically sorting by the entire line. You didn't
provide any other keys, but since your first key is already botched as
the ENTIRE line, there were no lines that compared equal for -s to make
any difference. Again, sort --debug makes this clear (using a subset of
just two lines of your input):
>> $ printf '010_000001_0000731_00003_200000081610_ <SuppressOutBadVariableCopies></SuppressOutBadVariableCopies>\n010_000001_0000731_00003_200000081610_ <CPAGENAME>PAGE1</CPAGENAME>\n' \
>> | LC_ALL=C sort --debug -k1,36 -s
>> sort: using simple byte comparison
>> 010_000001_0000731_00003_200000081610_ <CPAGENAME>PAGE1</CPAGENAME>
>> _______________________________________________________________________
>> 010_000001_0000731_00003_200000081610_ <SuppressOutBadVariableCopies></SuppressOutBadVariableCopies>
>> ________________________________________________________________________________________________________
But it appears that what you WANTED was to sort on just the first 36
bytes, with a stable sort of the results. If so, then ASK for that, by
using the correct -k option:
>> $ printf '010_000001_0000731_00003_200000081610_ <SuppressOutBadVariableCopies></SuppressOutBadVariableCopies>\n010_000001_0000731_00003_200000081610_ <CPAGENAME>PAGE1</CPAGENAME>\n' \
>> | LC_ALL=C sort --debug -k1,1.36 -s
>> sort: using simple byte comparison
>> 010_000001_0000731_00003_200000081610_ <SuppressOutBadVariableCopies></SuppressOutBadVariableCopies>
>> ____________________________________
>> 010_000001_0000731_00003_200000081610_ <CPAGENAME>PAGE1</CPAGENAME>
>> ____________________________________
Note how I asked for a sort key -k1,1.36, which says to start in the
first field, and end 36 bytes into the first field (hmm, it looks like
you actually want 38 bytes - but I'll leave that for you to decide).
Also note that -s now makes a difference: the content of that first
sort key is identical in both lines, so when -s is not used, the
last-resort full-line comparison reorders them:
>> $ printf '010_000001_0000731_00003_200000081610_ <SuppressOutBadVariableCopies></SuppressOutBadVariableCopies>\n010_000001_0000731_00003_200000081610_ <CPAGENAME>PAGE1</CPAGENAME>\n' \
>> | LC_ALL=C sort --debug -k1,1.36
>> sort: using simple byte comparison
>> 010_000001_0000731_00003_200000081610_ <CPAGENAME>PAGE1</CPAGENAME>
>> ____________________________________
>> _______________________________________________________________________
>> 010_000001_0000731_00003_200000081610_ <SuppressOutBadVariableCopies></SuppressOutBadVariableCopies>
>> ____________________________________
>> ________________________________________________________________________________________________________
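The difference between a field count and a byte offset in -k can also
be seen on a smaller scale. A sketch assuming GNU sort, with invented
sample lines and an invented helper name; the 9 here plays the role of
the 36 in the commands above.

```shell
# Hypothetical records: a 9-byte first field followed by a payload.
records() {
  printf '%s\n' 'AAAA_0003 <zeta>' 'AAAA_0003 <alpha>' 'AAAA_0001 <omega>'
}

records | LC_ALL=C sort -s -k1,9     # fields 1 through 9: effectively the whole line
records | LC_ALL=C sort -s -k1,1.9   # only the first 9 bytes of field 1
```

The first command should put <alpha> ahead of <zeta>, since the payload
is part of the key. The second should leave <zeta> ahead of <alpha>,
because their 9-byte keys are equal and -s preserves input order.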
As this is a case of you not passing the correct command line arguments,
rather than a bug in sort, I am marking this bug as closed. However,
feel free to continue to comment on the topic (preferably on-list) if
you have more questions.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
[signature.asc (application/pgp-signature, attachment)]
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 29 Aug 2013 11:24:03 GMT)
This bug report was last modified 12 years and 16 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.