GNU bug report logs - #12285
uniq on a UTF8 file with roman numerals
Reported by: "P. Michaud" <pierrecmichaud <at> aol.com>
Date: Sun, 26 Aug 2012 19:04:03 UTC
Severity: normal
Tags: notabug
Done: Pádraig Brady <P <at> draigBrady.com>
Bug is archived. No further changes may be made.
Report forwarded to bug-coreutils <at> gnu.org: bug#12285; Package coreutils. (Sun, 26 Aug 2012 19:04:03 GMT)
Acknowledgement sent to "P. Michaud" <pierrecmichaud <at> aol.com>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sun, 26 Aug 2012 19:04:03 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hello,
I used the command
"uniq -dc myfile.txt"
here are some lines of the output
2 ☼ turvy
2 ☼ with gay abandon
2 ☼ with reckless abandon
10 ☼ yyⅰ
9 ☼ yyⅹⅲ
2 ☼ yyⅺ
12 ☼ zzⅰ
The first three lines above are correct and correspond to real duplicate lines in the file, but the counts on the last four are erroneous: each of them corresponds to a single line in the file.
Yours faithfully.
Pierre Michaud
Information forwarded to bug-coreutils <at> gnu.org: bug#12285; Package coreutils. (Sun, 26 Aug 2012 20:31:01 GMT)
Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):
On Sun, 2012-08-26 at 13:49 -0400, P. Michaud wrote:
> "uniq -dc myfile.txt"
Can you provide a copy of this file (or, if it's not a file you want to
make public, a modified version of it that causes the same problem)?
Information forwarded to bug-coreutils <at> gnu.org: bug#12285; Package coreutils. (Sun, 26 Aug 2012 20:55:02 GMT)
Message #11 received at 12285 <at> debbugs.gnu.org (full text, mbox):
On 08/26/2012 06:49 PM, P. Michaud wrote:
> Hello,
>
> I used the command
>
> "uniq -dc myfile.txt"
>
> here are some lines of the output
>
> 2 ☼ turvy
> 2 ☼ with gay abandon
> 2 ☼ with reckless abandon
> 10 ☼ yyⅰ
> 9 ☼ yyⅹⅲ
> 2 ☼ yyⅺ
> 12 ☼ zzⅰ
>
>
> The first three lines above are correct and correspond to real duplicate lines in the file, but the counts on the last four are erroneous: each of them corresponds to a single line in the file.
>
> Yours faithfully.
>
> Pierre Michaud
What system are you on?
What version of uniq?
What is the input exactly?
I suspect your locale is equating roman numerals (though that is surprising),
but I can't reproduce with the following on coreutils-8.10-2.fc15.x86_64 at least.
# try every installed locale; show only lines whose count is not 2
locale -a | while read -r locale; do
    LC_ALL="$locale" uniq -dc t.in
done | grep -v "^ *2 "
cheers,
Pádraig.
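
The suspicion above can be tested directly by feeding two candidate lines to uniq under a chosen locale and checking whether they collapse into one. The following is a minimal sketch, assuming a POSIX shell; the helper name same_for_uniq is hypothetical and does not come from the thread. The result depends on the installed locale data and on the uniq version, and a locale name that is not installed typically falls back to C behaviour without an error.

```shell
# Hypothetical helper (not from the thread): succeeds when uniq, run under
# the locale named in $1, collapses the two lines $2 and $3 into one.
same_for_uniq() {
    n=$(printf '%s\n' "$2" "$3" | LC_ALL="$1" uniq | wc -l | tr -d ' ')
    [ "$n" -eq 1 ]
}

# Byte comparison: distinct strings are never merged in the C locale,
# so this prints "distinct".
if same_for_uniq C 'ⅰ' 'ⅱ'; then
    echo "merged"
else
    echo "distinct"
fi
```

Replacing C with the reporter's locale name (for example the value of $LANG) would show whether that locale equates the two numerals.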
Information forwarded to bug-coreutils <at> gnu.org: bug#12285; Package coreutils. (Mon, 27 Aug 2012 00:18:02 GMT)
Message #14 received at 12285 <at> debbugs.gnu.org (full text, mbox):
tag 12285 + notabug
close 12285
stop
more info below...
On 08/26/2012 09:53 PM, Pádraig Brady wrote:
> I suspect your locale is equating roman numerals (though that is surprising),
It seems that these roman numerals are treated as equal in collating order,
so uniq is behaving as expected:
$ sort <(printf "%s\n" ⅲ ⅱ ⅰ)
ⅲ
ⅱ
ⅰ
$ uniq -dc <(printf "%s\n" ⅲ ⅱ ⅰ)
3 ⅲ
You can avoid this behaviour by forcing a byte comparison
with LC_ALL=C.
$ LC_ALL=C sort <(printf "%s\n" ⅲ ⅱ ⅰ)
ⅰ
ⅱ
ⅲ
$ LC_ALL=C uniq -c <(printf "%s\n" ⅲ ⅱ ⅰ)
1 ⅲ
1 ⅱ
1 ⅰ
thanks,
Pádraig.
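
The byte-comparison workaround can be wrapped so that sorting and counting always use the same locale. The following is a minimal sketch, assuming a POSIX shell and the mktemp utility; the function name dups_bytewise is hypothetical and does not come from the thread.

```shell
# Hypothetical helper (not from the thread): print duplicated lines of the
# file $1 with their counts, comparing raw bytes instead of locale collation.
dups_bytewise() {
    LC_ALL=C sort -- "$1" | LC_ALL=C uniq -dc
}

# Demo on a throwaway file: only the genuinely repeated line is reported.
tmp=$(mktemp)
printf '%s\n' 'ⅲ' 'ⅱ' 'ⅰ' 'turvy' 'turvy' > "$tmp"
dups_bytewise "$tmp"
rm -f "$tmp"
```

Sorting under the same locale matters because uniq only compares adjacent lines, so the sort order and the equality test need to agree.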
Added tag(s) notabug. Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Mon, 27 Aug 2012 00:18:03 GMT)
bug closed, send any further explanations to 12285 <at> debbugs.gnu.org and "P. Michaud" <pierrecmichaud <at> aol.com>. Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Mon, 27 Aug 2012 00:18:03 GMT)
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 24 Sep 2012 11:24:05 GMT)
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.