GNU bug report logs - #15450
SORT failing on some lines


Package: coreutils

Reported by: sam <at> netinetics.com

Date: Mon, 23 Sep 2013 21:57:05 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.


From: sam <at> netinetics.com
To: 15450 <at> debbugs.gnu.org
Subject: bug#15450: SORT failing on some lines
Date: Mon, 23 Sep 2013 05:28:38 +0300
Hi there.
I am using Ubuntu Linux and the sort command to sort a 605 MB index
file that I created from a Wikipedia dump. Each line of the index file
contains the article name, followed by a separator and then the
reference numbers giving the byte offset within the Wikipedia dump file.

This index file needs to be sorted into alphabetical order so that it
can be searched quickly. I have found that although most of the lines
in the file are sorted correctly, some are not, and this is throwing
off the index searching.

While most items are alphabetically sorted, the following occurs (for  
example):

"Universe (1960 film)"
"Universe"

"Yellow 2G"
"Yellow"

These lines are in the wrong order. My C++ program which searches the
index expects that "Universe" comes before "Universe (1960 film)" when
doing a string compare.
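
The expectation described above matches plain byte-wise comparison, in
which a string that is a proper prefix of another compares as smaller.
The following is a minimal sketch only: the report does not show the
actual search code, so the helper name byte_order_less and the use of
std::string::compare are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the report): true if a sorts strictly
// before b under byte-wise comparison. std::string::compare compares
// character by character; if one string is a proper prefix of the
// other, the shorter string compares as smaller.
bool byte_order_less(const std::string& a, const std::string& b) {
    return a.compare(b) < 0;
}

// Self-check using the two article-name pairs quoted in the report.
void check_report_examples() {
    assert(byte_order_less("Universe", "Universe (1960 film)"));
    assert(byte_order_less("Yellow", "Yellow 2G"));
}
```

Note that this compares the article names alone. Comparing whole index
lines would also involve the separator character that follows each
name, so the result for full lines may differ.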

Interestingly, if I copy these problem lines into a separate text file
and run sort on them, it sorts them correctly.
I have tried every switch combination I can think of, but the problem
remains.
I am wondering if it has something to do with the size of the file I am
trying to sort, which is 605 megabytes, or about 10,000,000 lines of
text. Again, most of the lines are sorted correctly, but some are not.
I haven't checked exactly how many, but I am finding them at random.

I would appreciate any help or comments you could offer.
Many thanks

Best regards,
Sam


Sam Brown
Netinetics Oy
PL 23
00251
Helsinki
Finland





