GNU bug report logs - #12285
uniq on a UTF8 file with roman numerals
Reported by: "P. Michaud" <pierrecmichaud <at> aol.com>
Date: Sun, 26 Aug 2012 19:04:03 UTC
Severity: normal
Tags: notabug
Done: Pádraig Brady <P <at> draigBrady.com>
Bug is archived. No further changes may be made.
tag 12285 + notabug
close 12285
stop
more info below...
On 08/26/2012 09:53 PM, Pádraig Brady wrote:
> On 08/26/2012 06:49 PM, P. Michaud wrote:
>> Hello,
>>
>> I used the command
>>
>> "uniq -dc myfile.txt"
>>
>> here are some lines of the output
>>
>> 2 ☼ turvy
>> 2 ☼ with gay abandon
>> 2 ☼ with reckless abandon
>> 10 ☼ yyⅰ
>> 9 ☼ yyⅹⅲ
>> 2 ☼ yyⅺ
>> 12 ☼ zzⅰ
>>
>>
>> The first three lines above are correct and correspond to real duplicate lines in the file, but the counts on the last four are erroneous: each of them corresponds to a single line in the file.
>>
>> Yours faithfully.
>>
>> Pierre Michaud
>
> What system are you on?
> What version of uniq?
> What is the input exactly?
>
> I suspect your locale is equating roman numerals (though that is surprising),
It seems that these roman numerals are treated as equal in collating order,
so uniq is behaving as expected:
$ sort <(printf "%s\n" ⅲ ⅱ ⅰ)
ⅲ
ⅱ
ⅰ
$ uniq -dc <(printf "%s\n" ⅲ ⅱ ⅰ)
3 ⅲ
You can avoid this behaviour by forcing a byte comparison
with LC_ALL=C:
$ LC_ALL=C sort <(printf "%s\n" ⅲ ⅱ ⅰ)
ⅰ
ⅱ
ⅲ
$ LC_ALL=C uniq -c <(printf "%s\n" ⅲ ⅱ ⅰ)
1 ⅲ
1 ⅱ
1 ⅰ
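
The workaround above can be wrapped in a small helper. This is only a sketch: `dups_bytewise` is a hypothetical name, not part of coreutils, and it assumes you want to sort the input first so that all duplicates are adjacent before uniq sees them.

```shell
# Hypothetical helper (not part of coreutils): report duplicated lines
# using byte-wise comparison, regardless of the caller's locale.
dups_bytewise() {
    # LC_ALL=C makes both sort and uniq compare raw bytes, so distinct
    # characters that collate as equal in some locales stay distinct.
    LC_ALL=C sort -- "$@" | LC_ALL=C uniq -dc
}

# Example: the three numerals differ byte-wise, so no duplicates are reported.
printf '%s\n' ⅲ ⅱ ⅰ | dups_bytewise
```

With no file arguments the helper reads standard input; with arguments it passes them to sort. Genuine duplicates are still reported, since identical lines are also identical byte for byte.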
thanks,
Pádraig.