#21916 - sort -u drops unique lines with some locales

GNU bug report logs - #21916
sort -u drops unique lines with some locales

Reported by: Christoph Anton Mitterer <calestyo <at> scientia.net>

Date: Sat, 14 Nov 2015 05:39:02 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

Message #8 received at 21916 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Christoph Anton Mitterer <calestyo <at> scientia.net>, 21916 <at> debbugs.gnu.org
Subject: Re: bug#21916: sort -u drops unique lines with some locales
Date: Sat, 14 Nov 2015 11:06:22 +0000

tag 21916 notabug
close 21916
stop

On 14/11/15 05:38, Christoph Anton Mitterer wrote:
> Hey.
>
> (GNU coreutils 8.23)
>
> Attached is a file, that, when sort -u'ed in my locale, looses lines
> which are however unique.
>
> I've also attached the locale, since it's a custom made one, but the
> same seem to happen with "standard" locales as well, see e.g.
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=695489
>
> Cheers,
> Chris.
>
> PS: Please keep me CCed, as I'm writing off list.

Unfortunately the roman numeral code points compare equal:

  $ printf '%s\n' Ⅱ Ⅰ | ltrace -e strcoll sort
  sort->strcoll("\342\205\241", "\342\205\240") = 0
  Ⅱ
  Ⅰ

If you compare at the byte level you'll get appropriate grouping:

  $ printf '%s\n' Ⅱ Ⅰ | LC_ALL=C sort
  Ⅰ
  Ⅱ

The same goes for other similar representations,
like full width forms of latin numbers:

  $ printf '%s\n' ２ １ | ltrace -e strcoll sort
  sort->strcoll("\357\274\222", "\357\274\221") = 0
  ２
  １

That's a bit surprising, though maybe since only a limited
number of these representations are provided, it was not
thought appropriate to provide collation orders for them.
There are details on the unicode representation at:
https://en.wikipedia.org/wiki/Numerals_in_Unicode#Roman_numerals_in_Unicode
Where it says "[f]or most purposes, it is preferable to compose the Roman
numerals from sequences of the appropriate Latin letters"

For example you could mix ISO 8859-1 and ISO 8859-5 to get appropriate sorting:

  $ printf '%s\n' I II III IV V VI VII VIII ІХ Х ХI ХII ХIII ХIV ХV ХVI ХVII ХVIII ХІХ | sort
  I
  II
  III
  IV
  V
  VI
  VII
  VIII
  ІХ
  Х
  ХI
  ХII
  ХIII
  ХIV
  ХV
  ХVI
  ХVII
  ХVIII
  ХІХ

If there were only portions of the line that was appropriate to treat
in the C locale (not for your grouping case really, but generally for
sorting for example), then you'd need to consider transformations like
enclosed, fullwidth, halfwidth -> ASCII which might be done with a
separate utility, and for number specific transformations like the above,
handled within the numfmt utility?

One thing we might do immediately, is maybe with the sort --debug option,
to provide some indication when strcoll() and memcmp() differ in direction.

cheers,
Pádraig.
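
Until sort itself offers such a diagnostic, the effect described in the message above can be checked from the shell. The following is only a rough sketch, not anything from coreutils: `dropped_by_collation` is a hypothetical helper name, and it assumes `mktemp` and `comm` are available. It compares a locale-aware `sort -u` against a byte-wise `LC_ALL=C sort -u` and prints the lines that only the byte-wise run keeps.

```shell
#!/bin/sh
# Sketch only: list lines that a locale-aware `sort -u` drops even though
# they differ byte-for-byte from every line that is kept.
# `dropped_by_collation` is a hypothetical helper, not a coreutils tool.
dropped_by_collation() {
    input=$1
    bytewise=$(mktemp) || return 1
    localized=$(mktemp) || { rm -f "$bytewise"; return 1; }

    # Byte-wise uniqueness: every distinct byte sequence is kept.
    LC_ALL=C sort -u -- "$input" > "$bytewise"

    # Locale-aware uniqueness, re-sorted byte-wise so comm can compare.
    sort -u -- "$input" | LC_ALL=C sort > "$localized"

    # Lines present only in the byte-wise result were merged by collation.
    LC_ALL=C comm -23 "$bytewise" "$localized"

    rm -f "$bytewise" "$localized"
}
```

Calling `dropped_by_collation FILE` under the affected locale should print nothing when collation and byte order agree on which lines are distinct. Any line it does print is one that `sort -u` would lose, in which case running `LC_ALL=C sort -u FILE` is the byte-level workaround shown earlier in the message.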

This bug report was last modified 6 years and 207 days ago.

