tag 15450 -moreinfo
tag 15450 +notabug
thanks

On 09/25/2013 12:28 PM, sam@netinetics.com wrote:
>
> Hello Eric,
> Thank you kindly for your speedy reply.
> I should apologize for the lack of information included with my email.
> It was a hurried one.

Re-adding the list for closure, with permission.

>
> In fact your suggestions and link and a bit of tinkering have cured the
> problem. SORT works fine it seems. I should have had more faith.
> The problem was purely with Locale, which I read up on in the FAQ link
> you sent. I had looked at Locale previously but didn't seem to have any
> success with it. I had also been trying various options for SORT,
> including -i, -d and even the field separation. (-t'#' -k1,1) I didn't
> have any luck but I realized after reading through your reply that it
> was the combination of these things which hadn't come right.
>
> I'd just like to add here for anybody else who stumbles across this same
> problem, a description of the problem I was having in more detail (now
> solved)
>
> The text file was a 605MB list of title texts extracted from Wikipedia,
> separated by a #--# and followed by the 'long long' integer offsets of
> where the article appeared in the dump file. (XML)
> Example lines:
>
> Alps Electric#--#7701298893,12,24,364,394,420
> Alps Electric Co.#--#4280442890,12,28,339,3144,3170
> Alps Electric Corporation#--#9562165739,12,36,447,477,503
>
> My machine was set to en-GB locale, although I had switched this to
> en-US with same (wrong) results.
>
> It was necessary to set the locale to LC_ALL=C and also to instruct SORT
> only to look at the first field (up to the first #) using the -t'#' and
> -k1,1 switches as you mentioned.
> Obvious really, but the combination of the two is what caused my confusion.
>
> It is really worth reading up on Locale for anybody using SORT and other
> utilities as it can profoundly change the results of an operation.
> Even setting locale to en-US doesn't help, as I read in the FAQ you
> linked, because en-US quite drastically reduces sort possibilities
> (case, punctuation etc ignored)
>
> I'm sorry for the bother - but you put me on the right track.
> Many thanks for that.

Glad to hear it.  As such, I've closed the bug in the tracker.

--
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
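
For anyone who finds this thread later, the fix described above can be sketched roughly as follows. This is only an illustration, not the exact command that was run: the three sample lines are the ones quoted in the report (fed in deliberately out of order), and the variable name `sorted` is made up for the example.

```shell
# Sample input: the three example lines from the report, out of order.
# LC_ALL=C makes sort compare raw bytes instead of using locale collation.
# -t'#' sets '#' as the field separator; -k1,1 limits the sort key to the
# first field, i.e. the title text before the first '#'.
sorted=$(printf '%s\n' \
  'Alps Electric Corporation#--#9562165739,12,36,447,477,503' \
  'Alps Electric#--#7701298893,12,24,364,394,420' \
  'Alps Electric Co.#--#4280442890,12,28,339,3144,3170' \
  | LC_ALL=C sort -t'#' -k1,1)

printf '%s\n' "$sorted"
```

With both settings in place the titles come out as "Alps Electric", "Alps Electric Co.", "Alps Electric Corporation". Note that LC_ALL=C on its own would not be enough here: when whole lines are compared byte by byte, the space in "Alps Electric Co." (0x20) sorts before the '#' (0x23) that follows "Alps Electric", so the shorter title would end up last. That matches the observation above that it was the combination of the two that mattered.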