GNU bug report logs - #15450
SORT failing on some lines

Package: coreutils;

Reported by: sam <at> netinetics.com

Date: Mon, 23 Sep 2013 21:57:05 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 15450 in the body.
You can then email your comments to 15450 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-coreutils <at> gnu.org:
bug#15450; Package coreutils. (Mon, 23 Sep 2013 21:57:05 GMT)

Acknowledgement sent to sam <at> netinetics.com:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Mon, 23 Sep 2013 21:57:05 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: sam <at> netinetics.com
To: bug-coreutils <at> gnu.org
Subject: SORT failing on some lines
Date: Mon, 23 Sep 2013 05:28:38 +0300
Hi there.
I am using Ubuntu Linux and the SORT command to sort a 605MB index  
file I have created from a Wikipedia dump. Each line of the index file  
contains an article name followed by a separator and then the byte  
offsets of that article within the Wikipedia dump file.

This index file needs to be sorted into alphabetical  
order so that it can be searched quickly. I have found that although  
most of the lines in the file are sorted correctly, some are not, and  
this is throwing off the index searching.

While most items are alphabetically sorted, the following occurs (for  
example):

"Universe (1960 film)"
"Universe"

"Yellow 2G"
"Yellow"

the lines are in the wrong order. My C++ program which searches the  
index expects that "Universe" comes before "Universe (1960 film)" when  
doing a string compare.

Interestingly, if I copy these problem lines into a separate text file  
and run SORT on them, it sorts correctly.
I have tried every switch combination I can think of but the problem remains.
I am wondering if it is something to do with the size of the file I am  
trying to sort. 605 megabytes, about 10,000,000 lines of text. Again,  
most of the lines are sorted correctly, but some (and I haven't  
checked exactly how many, but am finding them at random) are not.

Would appreciate any help or comments you could offer
Many thanks

Best regards,
Sam


Sam Brown
Netinetics Oy
PL 23
00251
Helsinki
Finland






Information forwarded to bug-coreutils <at> gnu.org:
bug#15450; Package coreutils. (Mon, 23 Sep 2013 22:24:01 GMT)

Message #8 received at 15450 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: sam <at> netinetics.com
Cc: 15450 <at> debbugs.gnu.org
Subject: Re: bug#15450: SORT failing on some lines
Date: Mon, 23 Sep 2013 16:23:08 -0600
tag 15450 needinfo
thanks

On 09/22/2013 08:28 PM, sam <at> netinetics.com wrote:

> While most items are alphabetically sorted, the following occurs (for
> example):
> 
> "Universe (1960 film)"
> "Universe"
> 
> "Yellow 2G"
> "Yellow"

It sounds like you might be falling foul of a FAQ:
https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021

But to know that for sure, you need to provide more information: what
locale settings are you using, and does your locale treat punctuation as
insignificant when doing strcoll()?  Are you using LC_ALL=C to force C
locale sorting?

> 
> the lines are in the wrong order. My C++ program which searches the
> index expects that "Universe" comes before "Universe (1960 film)" when
> doing a string compare.
> 
> Interestingly, if I copy these problem lines into a separate text file
> and run SORT on them, it sorts correctly.
> I have tried every switch combination I can think of but the problem
> remains.

You didn't show the actual options you are trying, so it's hard to say
without more information.  Are those the full lines that you are sorting,
or are you sorting something more like this?

$ LC_ALL=en_US.UTF-8 sort -t/ foo
Yellow 2G/1
Yellow/2
$ LC_ALL=en_US.UTF-8 sort -t/ -k1,1 foo
Yellow/2
Yellow 2G/1

Note how in the en_US locale, which ignores punctuation, I was able to
get a different sort order depending on whether I remembered to
terminate the sort key at the separator, vs. letting it strcoll() on the
full line.
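
The same key-termination effect can be reproduced without relying on an
installed en_US locale. The following is only a sketch, assuming a POSIX
shell and the same two hypothetical sample lines as above, run under
LC_ALL=C so that comparison is by byte value:

```shell
# Hypothetical sample lines, using "/" as the field separator.
lines='Yellow 2G/1
Yellow/2'

# No key given: the whole line is compared byte by byte. The space (0x20)
# after "Yellow" sorts before "/" (0x2F), so "Yellow 2G/1" comes first.
whole_line=$(printf '%s\n' "$lines" | LC_ALL=C sort -t/)

# Key ends at field 1: "Yellow" is a prefix of "Yellow 2G", so the shorter
# key comes first.
first_field=$(printf '%s\n' "$lines" | LC_ALL=C sort -t/ -k1,1)

printf 'whole line:\n%s\nfirst field only:\n%s\n' "$whole_line" "$first_field"
```

Even in the C locale the two invocations disagree, because without -k1,1
the separator and everything after it take part in the comparison.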

Have you played with the --debug option, to make sure you are sorting on
what you THINK you should be sorting on?
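
As a rough illustration, assuming GNU sort (--debug is a GNU extension) and
the same hypothetical two-line input, the option can be tried like this:

```shell
# --debug annotates each output line by underlining the part that was used
# as the sort key; diagnostics about the locale go to stderr, which is
# discarded here to keep the output short.
debug_out=$(printf 'Yellow 2G/1\nYellow/2\n' |
  LC_ALL=C sort --debug -t/ -k1,1 2>/dev/null)

printf '%s\n' "$debug_out"
```

If the underlined region extends past the separator, the key is wider than
intended.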

> I am wondering if it is something to do with the size of the file I am
> trying to sort. 605 megabytes, about 10,000,000 lines of text. Again,
> most of the lines are sorted correctly, but some (and I haven't checked
> exactly how many, but am finding them at random) are not.

Most likely, the size of the file has nothing to do with it.
To guarantee it is not a bad merge when sort uses multiple files, rerun
your command with 'sort --parallel=1 $your_options...' to ensure that
there are no temporary files to be merged (if there IS a bug with how
temporaries are merged, we definitely want to fix that; it would show up
with --parallel larger than 1).
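
One way to run that comparison, sketched here with a tiny hypothetical
input in place of the real file and options (--parallel is specific to GNU
sort):

```shell
# Hypothetical stand-in for the real input; substitute your file and options.
input='pear
apple
fig'

# Sort once restricted to a single thread and once with the default setting.
single=$(printf '%s\n' "$input" | LC_ALL=C sort --parallel=1)
multi=$(printf '%s\n' "$input" | LC_ALL=C sort)

# Any difference between the two results would point at sort itself.
if [ "$single" = "$multi" ]; then
  echo "outputs match"
else
  echo "outputs differ"
fi
```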

Again, I suspect it is in your locale or command line, but without
enough details I can't prove that.  So I'll leave this bug open while
waiting for more details.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Added tag(s) moreinfo. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Mon, 23 Sep 2013 22:24:02 GMT)

Reply sent to Eric Blake <eblake <at> redhat.com>:
You have taken responsibility. (Wed, 25 Sep 2013 19:42:03 GMT)

Notification sent to sam <at> netinetics.com:
bug acknowledged by developer. (Wed, 25 Sep 2013 19:42:03 GMT)

Message #15 received at 15450-done <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: sam <at> netinetics.com, 15450-done <at> debbugs.gnu.org
Subject: Re: bug#15450: SORT failing on some lines
Date: Wed, 25 Sep 2013 13:41:52 -0600
tag 15450 -moreinfo
tag 15450 +notabug
thanks

On 09/25/2013 12:28 PM, sam <at> netinetics.com wrote:
> 
> Hello Eric,
> Thank you kindly for your speedy reply.
> I should apologize for the lack of information included with my email.
> It was a hurried one.

Re-adding the list for closure, with permission.

> 
> In fact your suggestions and link and a bit of tinkering have cured the
> problem. SORT works fine it seems. I should have had more faith.
> The problem was purely with Locale, which I read up on in the FAQ link
> you sent. I had looked at Locale previously but didn't seem to have any
> success with it. I had also been trying various options for SORT,
> including -i, -d and even the field separation. (-t'#' -k1,1) I didn't
> have any luck but I realized after reading through your reply that it
> was the combination of these things which hadn't come right.
> 
> I'd just like to add here for anybody else who stumbles across this same
> problem, a description of the problem I was having in more detail (now
> solved)
> 
> The text file was a 605MB list of title texts extracted from Wikipedia,
> separated by a #--# and followed by the 'long long' integer offsets of
> where the article appeared in the dump file. (XML)
> Example lines:
> 
> Alps Electric#--#7701298893,12,24,364,394,420
> Alps Electric Co.#--#4280442890,12,28,339,3144,3170
> Alps Electric Corporation#--#9562165739,12,36,447,477,503
> 
> My machine was set to en-GB locale, although I had switched this to
> en-US with same (wrong) results.
> 
> It was necessary to set the locale to LC_ALL=C and also to instruct SORT
> only to look at the first field (up to the first #) using the -t'#' and
> -k1,1 switches as you mentioned.
> Obvious really, but the combination of the two is what caused my confusion.
> 
> It is really worth reading up on Locale for anybody using SORT and other
> utilities as it can profoundly change the results of an operation.
> Even setting locale to en-US doesn't help, as I read in the FAQ you
> linked, because en-US quite drastically reduces sort possibilities
> (case, punctuation etc ignored)
> 
> I'm sorry for the bother - but you put me on the right track.
> Many thanks for that.

Glad to hear it.  As such, I've closed the bug in the tracker.
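
For anyone landing here with the same symptom, the fix described in the
quoted message can be sketched as follows. This assumes a POSIX shell and
uses the three example lines quoted above as a hypothetical miniature of
the real index:

```shell
# Three sample lines in the "title#--#offsets" layout, deliberately unsorted.
sample='Alps Electric Corporation#--#9562165739,12,36,447,477,503
Alps Electric Co.#--#4280442890,12,28,339,3144,3170
Alps Electric#--#7701298893,12,24,364,394,420'

# C locale alone is not enough: the whole line is compared, and the space
# (0x20) in "Alps Electric Co." sorts before the "#" (0x23) that follows
# the bare title "Alps Electric", so the bare title ends up last.
bytes_only=$(printf '%s\n' "$sample" | LC_ALL=C sort)

# C locale plus a key that stops at the first "#": the bare title is now a
# prefix of the longer titles and sorts first.
title_key=$(printf '%s\n' "$sample" | LC_ALL=C sort -t'#' -k1,1)

# sort -c exits with status 0 only if the input is already in order for the
# given locale and key, which makes it a cheap check on a large index.
printf '%s\n' "$title_key" | LC_ALL=C sort -c -t'#' -k1,1 && echo "index is sorted"

# Show just the titles in their final order.
printf '%s\n' "$title_key" | cut -d'#' -f1
```

The program that searches the index must compare titles the same way,
byte by byte on the text before the first "#", or lookups will still miss.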

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Removed tag(s) moreinfo. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Wed, 25 Sep 2013 19:45:02 GMT)

Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Wed, 25 Sep 2013 19:45:02 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 24 Oct 2013 11:24:04 GMT)

This bug report was last modified 11 years and 239 days ago.
