From: emijrp
Date: Sat, 30 Apr 2011 14:03:22 +0200
Subject: bug#8598: Bug in uniq?
To: 8598@debbugs.gnu.org
X-Debbugs-Original-To: bug-coreutils@gnu.org
X-GNU-PR-Package: coreutils

Hi all;

I'm not sure if this is a bug.

If I download this file[1], unzip and do:

grep "<title>" wikiindexorg-20110409-history.xml | sort | uniq -D

It shows:

    <title>Felix Pleşoianu Wiki</title>
    <title>Felix Pleșoianu Wiki</title>
    <title>ᐧᐃᑭᐱᑎᔭ</title>
    <title>위키낱말사전</title>
    <title>ウィクショナリー</title>
    <title>언사이클로피디어</title>
    <title>ไทย Wikipedia</title>
    <title>한국어 Wikipedia</title>

But obviously, they are all different lines. Why?

Thanks,
emijrp

[1] http://code.google.com/p/wikiteam/downloads/detail?name=wikiindexorg-20110409-history.xml.7z

From: Eric Blake
Organization: Red Hat
Date: Sat, 30 Apr 2011 11:56:55 -0600
Subject: bug#8598: Bug in uniq?
To: emijrp
Cc: 8598@debbugs.gnu.org
Message-ID: <4DBC4D67.30208@redhat.com>

On 04/30/2011 06:03 AM, emijrp wrote:
> Hi all;
>
> I'm not sure if this is a bug.

Most likely not a bug, but a function of your locale.

>
> If I download this file[1], unzip and do:
>
> grep "<title>" wikiindexorg-20110409-history.xml | sort | uniq -D
>
> It shows:
>
>     <title>Felix Pleşoianu Wiki</title>
>     <title>Felix Pleșoianu Wiki</title>

Identical. (How, you ask? Read on.)

>     <title>ᐧᐃᑭᐱᑎᔭ</title>
>     <title>위키낱말사전</title>

Identical.

>     <title>ウィクショナリー</title>
>     <title>언사이클로피디어</title>

Identical.

>     <title>ไทย Wikipedia</title>
>     <title>한국어 Wikipedia</title>

Identical.

>
> But obviously, they are all different lines. Why?

That depends on your locale. In the C locale, lines are compared byte
by byte, so all of those lines are distinct, including the first two,
which differ only in one character (U+015F versus U+0219). But in other
locales, lines are compared with strcoll(), whose result depends on
your current locale; if that locale punts and collates all non-ASCII
characters as the same collation symbol, then those lines compare as
identical. I was able to reproduce your results with the en_US.UTF-8
locale that ships with Fedora 14.

To see the difference, try again with:

$ grep "<title>" wikiindexorg-20110409-history.xml | sort \
  | LC_ALL=C uniq --all-repeated=separate
$ grep "<title>" wikiindexorg-20110409-history.xml | sort \
  | LC_ALL=en_US.UTF-8 uniq --all-repeated=separate
    <title>Felix Pleşoianu Wiki</title>
    <title>Felix Pleșoianu Wiki</title>

    <title>ᐧᐃᑭᐱᑎᔭ</title>
    <title>위키낱말사전</title>

    <title>ウィクショナリー</title>
    <title>언사이클로피디어</title>

    <title>ไทย Wikipedia</title>
    <title>한국어 Wikipedia</title>

This is because that particular locale does not try to distinguish a
collation sequence for non-English characters.

-- 
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org

From: Eric Blake
Organization: Red Hat
Date: Sat, 30 Apr 2011 11:59:53 -0600
Subject: close 8598
To: control@debbugs.gnu.org
Message-ID: <4DBC4E19.6030006@redhat.com>

tag 8598 notabug
close 8598
thanks
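
The locale explanation in the reply above can be checked on a small scale without downloading the dump. What follows is a minimal sketch, not part of the original thread. It assumes a POSIX sh and GNU uniq (for --all-repeated). The titles helper and the variable names are made up for illustration. It feeds the first two titles from the report, byte for byte, through uniq under the C locale and then under whatever locale the shell is running in.

```shell
#!/bin/sh
# Emit the first two titles from the report with exact bytes (octal escapes):
#   U+015F is C5 9F (\305\237), U+0219 is C8 99 (\310\231).
# "titles" is a hypothetical helper, not something from the thread.
titles() {
    printf '    <title>Felix Ple\305\237oianu Wiki</title>\n'
    printf '    <title>Felix Ple\310\231oianu Wiki</title>\n'
}

# C locale: lines are compared byte by byte, so nothing counts as repeated.
c_dups=$(titles | LC_ALL=C uniq --all-repeated=separate)
if [ -z "$c_dups" ]; then
    echo "C locale: no repeated lines"
else
    echo "C locale: repeated lines found"
fi

# Current locale: the result depends on the collation data installed.
effective=${LC_ALL:-${LC_COLLATE:-${LANG:-unset}}}
env_dups=$(titles | uniq --all-repeated=separate)
if [ -z "$env_dups" ]; then
    echo "locale '$effective': no repeated lines"
else
    echo "locale '$effective': repeated lines found"
fi
```

The C-locale branch should always report no repeated lines, since the two titles differ in two bytes. The second branch is deliberately left without an expected result: the thread reports duplicates with the en_US.UTF-8 locale shipped in Fedora 14, but other systems and later locale data may collate these characters differently.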