GNU bug report logs - #8598
Bug in uniq?


Package: coreutils

Reported by: emijrp <emijrp <at> gmail.com>

Date: Sat, 30 Apr 2011 17:21:01 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.




Report forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#8598; Package coreutils. (Sat, 30 Apr 2011 17:21:01 GMT)

Acknowledgement sent to emijrp <emijrp <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sat, 30 Apr 2011 17:21:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: emijrp <emijrp <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: Bug in uniq?
Date: Sat, 30 Apr 2011 14:03:22 +0200
Hi all;

I'm not sure if this is a bug.

If I download this file[1], unzip and do:

grep "<title>" wikiindexorg-20110409-history.xml | sort | uniq -D

It shows:

    <title>Felix Pleşoianu Wiki</title>
    <title>Felix Pleșoianu Wiki</title>
    <title>ᐧᐃᑭᐱᑎᔭ</title>
    <title>위키낱말사전</title>
    <title>ウィクショナリー</title>
    <title>언사이클로피디어</title>
    <title>ไทย Wikipedia</title>
    <title>한국어 Wikipedia</title>

But obviously, they are all different lines. Why?

Thanks,
emijrp

[1]
http://code.google.com/p/wikiteam/downloads/detail?name=wikiindexorg-20110409-history.xml.7z

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#8598; Package coreutils. (Sat, 30 Apr 2011 17:58:01 GMT)

Message #8 received at 8598 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: emijrp <emijrp <at> gmail.com>
Cc: 8598 <at> debbugs.gnu.org
Subject: Re: bug#8598: Bug in uniq?
Date: Sat, 30 Apr 2011 11:56:55 -0600
On 04/30/2011 06:03 AM, emijrp wrote:
> Hi all;
> 
> I'm not sure if this is a bug.

Most likely not a bug, but a function of your locale.

> 
> If I download this file[1], unzip and do:
> 
> grep "<title>" wikiindexorg-20110409-history.xml | sort | uniq -D
> 
> It shows:
> 
>     <title>Felix Pleşoianu Wiki</title>
>     <title>Felix Pleșoianu Wiki</title>

Identical.  (How, you ask? Read on.)

>     <title>ᐧᐃᑭᐱᑎᔭ</title>
>     <title>위키낱말사전</title>

Identical.

>     <title>ウィクショナリー</title>
>     <title>언사이클로피디어</title>

Identical.

>     <title>ไทย Wikipedia</title>
>     <title>한국어 Wikipedia</title>

Identical.

> 
> But obviously, they are all different lines. Why?

That depends on your locale.  In the C locale, lines are compared byte
by byte, so all of those lines are distinct.  But in other locales,
uniq compares lines with strcoll(), and if your current locale punts
and collates all of those non-ASCII characters as the same collation
symbol, then each pair of lines compares as identical.
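
The effect can be checked without downloading the dump.  The following
is only a sketch: count_unique is a hypothetical helper name, and it
assumes that an en_US.UTF-8 locale is installed.  If it is not, uniq
typically falls back to the C locale, and newer locale data may no
longer collate these characters as equal.  It feeds two of the titles
above to uniq under each locale and counts the lines that survive.

```shell
# Two of the reported titles; they differ at the byte level.
sample=$(printf '%s\n' \
    '    <title>ᐧᐃᑭᐱᑎᔭ</title>' \
    '    <title>위키낱말사전</title>')

# Hypothetical helper: count the lines uniq keeps under the given locale.
count_unique() {
    printf '%s\n' "$sample" | LC_ALL=$1 uniq | wc -l | tr -d ' '
}

c_count=$(count_unique C)
utf8_count=$(count_unique en_US.UTF-8)

echo "C locale:           $c_count line(s) kept"
echo "en_US.UTF-8 locale: $utf8_count line(s) kept"
```

The C locale keeps both lines.  A count of 1 under en_US.UTF-8 would
reproduce the behavior from the report; a count of 2 suggests that the
installed locale data tells the two lines apart.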

I was able to reproduce your results with the en_US.UTF-8 locale that
ships with Fedora 14.  To see the difference, try again with:

$ grep "<title>" wikiindexorg-20110409-history.xml | sort \
    | LC_ALL=C uniq --all-repeated=separate
$ grep "<title>" wikiindexorg-20110409-history.xml | sort \
    | LC_ALL=en_US.UTF-8 uniq --all-repeated=separate

    <title>Felix Pleşoianu Wiki</title>
    <title>Felix Pleșoianu Wiki</title>

    <title>ᐧᐃᑭᐱᑎᔭ</title>
    <title>위키낱말사전</title>

    <title>ウィクショナリー</title>
    <title>언사이클로피디어</title>

    <title>ไทย Wikipedia</title>
    <title>한국어 Wikipedia</title>

This is because that particular locale does not try to distinguish a
collation sequence for non-English characters.
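
One possible way to list only byte-identical duplicates is to run both
sort and uniq in the C locale, so that the sort order and the
comparison agree.  A minimal sketch with made-up sample data (the
"Example Wiki" lines are hypothetical, added so that one genuine
duplicate exists; -D is the GNU uniq option used in the report):

```shell
# Hypothetical sample: two titles from the report plus one real duplicate.
titles=$(printf '%s\n' \
    '<title>한국어 Wikipedia</title>' \
    '<title>ไทย Wikipedia</title>' \
    '<title>Example Wiki</title>' \
    '<title>Example Wiki</title>')

# Sort and compare byte-wise; only the genuinely repeated line is printed.
dups=$(printf '%s\n' "$titles" | LC_ALL=C sort | LC_ALL=C uniq -D)
printf '%s\n' "$dups"
```

With this input only the two "Example Wiki" lines are reported, because
the Thai and Korean titles differ at the byte level.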

-- 
Eric Blake   eblake <at> redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org


Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Sat, 30 Apr 2011 18:00:04 GMT)

bug closed, send any further explanations to 8598 <at> debbugs.gnu.org and emijrp <emijrp <at> gmail.com>. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Sat, 30 Apr 2011 18:00:05 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 29 May 2011 11:24:04 GMT)

