From unknown Fri Jun 20 18:16:49 2025 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Mailer: MIME-tools 5.509 (Entity 5.509) Content-Type: text/plain; charset=utf-8 From: bug#12285 <12285@debbugs.gnu.org> To: bug#12285 <12285@debbugs.gnu.org> Subject: Status: uniq on a UTF8 file with roman numerals Reply-To: bug#12285 <12285@debbugs.gnu.org> Date: Sat, 21 Jun 2025 01:16:49 +0000 retitle 12285 uniq on a UTF8 file with roman numerals reassign 12285 coreutils submitter 12285 "P. Michaud" severity 12285 normal tag 12285 notabug thanks From debbugs-submit-bounces@debbugs.gnu.org Sun Aug 26 15:03:38 2012 Received: (at submit) by debbugs.gnu.org; 26 Aug 2012 19:03:38 +0000 Received: from localhost ([127.0.0.1]:51091 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T5i7V-0005lk-MP for submit@debbugs.gnu.org; Sun, 26 Aug 2012 15:03:38 -0400 Received: from eggs.gnu.org ([208.118.235.92]:45065) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T5gyR-00048e-Rz for submit@debbugs.gnu.org; Sun, 26 Aug 2012 13:50:13 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1T5gxe-0007CC-F3 for submit@debbugs.gnu.org; Sun, 26 Aug 2012 13:49:23 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=5.0 tests=BAYES_00,FREEMAIL_FROM, HTML_MESSAGE,RCVD_IN_DNSWL_HI,T_DKIM_INVALID autolearn=unavailable version=3.3.2 Received: from lists.gnu.org ([208.118.235.17]:57831) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1T5gxe-0007C8-Bo for submit@debbugs.gnu.org; Sun, 26 Aug 2012 13:49:22 -0400 Received: from eggs.gnu.org ([208.118.235.92]:33095) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1T5gxd-0002QZ-GN for bug-coreutils@gnu.org; Sun, 26 Aug 2012 13:49:22 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 
1T5gxc-0007BX-6j for bug-coreutils@gnu.org; Sun, 26 Aug 2012 13:49:21 -0400 Received: from imr-da03.mx.aol.com ([205.188.105.145]:52207) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1T5gxc-0007Ai-02 for bug-coreutils@gnu.org; Sun, 26 Aug 2012 13:49:20 -0400 Received: from mtaomg-ma06.r1000.mx.aol.com (mtaomg-ma06.r1000.mx.aol.com [172.29.41.13]) by imr-da03.mx.aol.com (8.14.1/8.14.1) with ESMTP id q7QHnCpk009792 for ; Sun, 26 Aug 2012 13:49:12 -0400 Received: from core-dlb005c.r1000.mail.aol.com (core-dlb005.r1000.mail.aol.com [172.29.180.209]) by mtaomg-ma06.r1000.mx.aol.com (OMAG/Core Interface) with ESMTP id 7F3E0E000088 for ; Sun, 26 Aug 2012 13:49:12 -0400 (EDT) To: bug-coreutils@gnu.org Subject: uniq on a UTF8 file with roman numerals X-MB-Message-Source: WebUI X-MB-Message-Type: User MIME-Version: 1.0 From: "P. Michaud" Content-Type: multipart/alternative; boundary="--------MB_8CF51CA3C8671BF_FA0_7FB91_webmail-d132.sysops.aol.com" X-Mailer: AOL WebMail 36912 - STANDARD Received: from 109.10.25.227 by webmail-d132.sysops.aol.com (149.174.18.22) with HTTP (WebMailUI); Sun, 26 Aug 2012 13:49:12 -0400 Message-Id: <8CF51CA3C7CEC3B-FA0-24063@webmail-d132.sysops.aol.com> X-Originating-IP: [109.10.25.227] Date: Sun, 26 Aug 2012 13:49:12 -0400 (EDT) x-aol-global-disposition: G DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=mx.aol.com; s=20110426; t=1346003352; bh=NYyTs9tvvqFLeSA9ltjok5f2a/QSfW1YKWetn3HdrR0=; h=From:To:Subject:Message-Id:Date:MIME-Version:Content-Type; b=nBued+osEB7/Dp9n29P0z3vcHuJ2aTYSYu0q+eF2JMLBsURSZ9qAWGIskiIjiUx7A mQCiZ3mbHDCV7duIZjH2rB610GwzDSMBr8L6oi+IA9KxJkj2xXLvVvhmpOM77O+kBo RTYt4c7QbtpntWv2ggRNcQW/FUJmbEUnFuPclgfk= X-AOL-SCOLL-SCORE: 0:2:319966656:93952408 X-AOL-SCOLL-URL_COUNT: 0 x-aol-sid: 3039ac1d290d503a61982ab3 X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3) X-Received-From: 208.118.235.17 X-Spam-Score: -6.1 (------) 
X-Debbugs-Envelope-To: submit
X-Mailman-Approved-At: Sun, 26 Aug 2012 15:03:37 -0400
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.13
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Sender: debbugs-submit-bounces@debbugs.gnu.org
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
X-Spam-Score: -6.1 (------)

This is a multi-part message in MIME format.
----------MB_8CF51CA3C8671BF_FA0_7FB91_webmail-d132.sysops.aol.com
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Hello,

I used the command

"uniq -dc myfile.txt"

Here are some lines of the output:

      2 ☼ turvy
      2 ☼ with gay abandon
      2 ☼ with reckless abandon
     10 ☼ yyⅰ
      9 ☼ yyⅹⅲ
      2 ☼ yyⅺ
     12 ☼ zzⅰ


The first three lines above are correct and correspond to real duplicate lines in the file, but the counts on the last four are erroneous; each of them corresponds to a single line in the file.

Yours faithfully.

Pierre Michaud

----------MB_8CF51CA3C8671BF_FA0_7FB91_webmail-d132.sysops.aol.com
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset="utf-8"

Hello,

I used the command

"uniq -dc myfile.txt"

Here are some lines of the output:

      2 ☼ turvy
      2 ☼ with gay abandon
      2 ☼ with reckless abandon
     10 ☼ yyⅰ
      9 ☼ yyⅹⅲ
      2 ☼ yyⅺ
     12 ☼ zzⅰ


The first three lines above are correct and correspond to real duplicate lines in the file, but the counts on the last four are erroneous; each of them corresponds to a single line in the file.

Yours faithfully.

Pierre Michaud



----------MB_8CF51CA3C8671BF_FA0_7FB91_webmail-d132.sysops.aol.com-- From debbugs-submit-bounces@debbugs.gnu.org Sun Aug 26 16:30:13 2012 Received: (at submit) by debbugs.gnu.org; 26 Aug 2012 20:30:13 +0000 Received: from localhost ([127.0.0.1]:51182 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T5jTI-0007i2-Rx for submit@debbugs.gnu.org; Sun, 26 Aug 2012 16:30:13 -0400 Received: from eggs.gnu.org ([208.118.235.92]:37309) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T5jTG-0007hv-8H for submit@debbugs.gnu.org; Sun, 26 Aug 2012 16:30:11 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1T5jSS-0000Xe-Av for submit@debbugs.gnu.org; Sun, 26 Aug 2012 16:29:21 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=5.0 tests=BAYES_00,FREEMAIL_FROM, RCVD_IN_DNSWL_HI,T_DKIM_INVALID autolearn=unavailable version=3.3.2 Received: from lists.gnu.org ([208.118.235.17]:57259) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1T5jSS-0000Xa-7l for submit@debbugs.gnu.org; Sun, 26 Aug 2012 16:29:20 -0400 Received: from eggs.gnu.org ([208.118.235.92]:53373) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1T5jSR-0000cY-F5 for bug-coreutils@gnu.org; Sun, 26 Aug 2012 16:29:20 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1T5jSP-0000XN-Er for bug-coreutils@gnu.org; Sun, 26 Aug 2012 16:29:19 -0400 Received: from mail-wg0-f45.google.com ([74.125.82.45]:52712) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1T5jSP-0000XI-8Q for bug-coreutils@gnu.org; Sun, 26 Aug 2012 16:29:17 -0400 Received: by wgbdq12 with SMTP id dq12so2429206wgb.26 for ; Sun, 26 Aug 2012 13:29:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=message-id:subject:from:to:cc:date:in-reply-to:references 
:content-type:x-mailer:mime-version:content-transfer-encoding; bh=1NW3dJPPuP4ulq7M6i9N2AFUWzGwXD8HM0aLhTNAePQ=; b=EaQ0llaSENOEpV3uRUPbCnE+hah7yYHG0IeEzoV2HZyS2GzRxfWW5fl6uDDdjFlQW8 O/uCPbU5o3hm4O5bHCWIN8hk4nDR2rAjIRHu5l4J5iD8QujpRepLjMtlBWykyOwKfhKF w4wtB9cgiBFAZIDYF4hIZJr5sbofjTP6/L75t/c7kF/tqaAl/W0ypMdsuLCHfbdE7Hzp ucqmuxmNKMLbjBxE3qMqHIB6qFLsuoiarWKMd29mdFAw6DzX3uVL8wV1a/GYCFgbtIRq xygx3aoopgMumGOCVG1T49YRsgEbeDmrfzBdy/1ED0ufn5qox4NhhEIq+6Jq57Yox91w NE5Q== Received: by 10.216.123.69 with SMTP id u47mr5954130weh.89.1346012956245; Sun, 26 Aug 2012 13:29:16 -0700 (PDT) Received: from [192.168.0.49] (cpc7-enfi18-2-0-cust239.20-2.cable.virginmedia.com. [82.45.247.240]) by mx.google.com with ESMTPS id l5sm16790090wix.5.2012.08.26.13.29.14 (version=TLSv1/SSLv3 cipher=OTHER); Sun, 26 Aug 2012 13:29:15 -0700 (PDT) Message-ID: <1346012953.5642.9.camel@rivertam> Subject: Re: bug#12285: uniq on a UTF8 file with roman numerals From: Robert Day To: bug-coreutils@gnu.org Date: Sun, 26 Aug 2012 21:29:13 +0100 In-Reply-To: <8CF51CA3C7CEC3B-FA0-24063@webmail-d132.sysops.aol.com> References: <8CF51CA3C7CEC3B-FA0-24063@webmail-d132.sysops.aol.com> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.4.3 (3.4.3-2.fc17) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3) X-Received-From: 208.118.235.17 X-Spam-Score: -6.1 (------) X-Debbugs-Envelope-To: submit Cc: "P. Michaud" X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -6.1 (------) On Sun, 2012-08-26 at 13:49 -0400, P. 
Michaud wrote:
> "uniq -dc myfile.txt"

Can you provide a copy of this file (or, if it's not a file you want to make public, a modified version of it that causes the same problem)?

From debbugs-submit-bounces@debbugs.gnu.org Sun Aug 26 16:54:49 2012
Received: (at 12285) by debbugs.gnu.org; 26 Aug 2012 20:54:49 +0000
Received: from localhost ([127.0.0.1]:51195 helo=debbugs.gnu.org)
 by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from )
 id 1T5jr6-0008HA-RY for submit@debbugs.gnu.org;
 Sun, 26 Aug 2012 16:54:49 -0400
Received: from mx1.redhat.com ([209.132.183.28]:63163)
 by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from )
 id 1T5jr3-0008H0-Eu for 12285@debbugs.gnu.org;
 Sun, 26 Aug 2012 16:54:46 -0400
Received: from int-mx09.intmail.prod.int.phx2.redhat.com
 (int-mx09.intmail.prod.int.phx2.redhat.com [10.5.11.22])
 by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id q7QKrrIj016856
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK);
 Sun, 26 Aug 2012 16:53:53 -0400
Received: from [10.36.116.35] (ovpn-116-35.ams2.redhat.com [10.36.116.35])
 by int-mx09.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4)
 with ESMTP id q7QKrp6Z003839
 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO);
 Sun, 26 Aug 2012 16:53:52 -0400
Message-ID: <503A8CDF.4010704@draigBrady.com>
Date: Sun, 26 Aug 2012 21:53:51 +0100
From: =?UTF-8?B?UMOhZHJhaWcgQnJhZHk=?=
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20110816 Thunderbird/6.0
MIME-Version: 1.0
To: "P.
Michaud"
Subject: Re: bug#12285: uniq on a UTF8 file with roman numerals
References: <8CF51CA3C7CEC3B-FA0-24063@webmail-d132.sysops.aol.com>
In-Reply-To: <8CF51CA3C7CEC3B-FA0-24063@webmail-d132.sysops.aol.com>
X-Enigmail-Version: 1.3.2
Content-Type: text/plain; charset=UTF-8
X-Scanned-By: MIMEDefang 2.68 on 10.5.11.22
Content-Transfer-Encoding: quoted-printable
X-MIME-Autoconverted: from 8bit to quoted-printable by mx1.redhat.com id q7QKrrIj016856
X-Spam-Score: -6.9 (------)
X-Debbugs-Envelope-To: 12285
Cc: 12285@debbugs.gnu.org
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.13
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Sender: debbugs-submit-bounces@debbugs.gnu.org
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
X-Spam-Score: -6.9 (------)

On 08/26/2012 06:49 PM, P. Michaud wrote:
> Hello,
>
> I used the command
>
> "uniq -dc myfile.txt"
>
> Here are some lines of the output:
>
>       2 ☼ turvy
>       2 ☼ with gay abandon
>       2 ☼ with reckless abandon
>      10 ☼ yyⅰ
>       9 ☼ yyⅹⅲ
>       2 ☼ yyⅺ
>      12 ☼ zzⅰ
>
>
> The first three lines above are correct and correspond to real duplicate lines in the file, but the counts on the last four are erroneous; each of them corresponds to a single line in the file.
>
> Yours faithfully.
>
> Pierre Michaud

What system are you on?
What version of uniq?
What is the input exactly?

I suspect your locale is equating roman numerals (though that is surprising),
but I can't reproduce with the following on coreutils-8.10-2.fc15.x86_64 at least.

locale -a | while read locale; do
  LC_ALL=$locale uniq -dc t.in
done | grep -v " *2"

cheers,
Pádraig.

From debbugs-submit-bounces@debbugs.gnu.org Sun Aug 26 20:17:24 2012 Received: (at 12285) by debbugs.gnu.org; 27 Aug 2012 00:17:24 +0000 Received: from localhost ([127.0.0.1]:51412 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T5n19-0004Kx-DM for submit@debbugs.gnu.org; Sun, 26 Aug 2012 20:17:24 -0400 Received: from mx1.redhat.com ([209.132.183.28]:22141) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T5n16-0004Kl-3Z; Sun, 26 Aug 2012 20:17:21 -0400 Received: from int-mx11.intmail.prod.int.phx2.redhat.com (int-mx11.intmail.prod.int.phx2.redhat.com [10.5.11.24]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id q7R0GRmF014910 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Sun, 26 Aug 2012 20:16:27 -0400 Received: from [10.36.116.19] (ovpn-116-19.ams2.redhat.com [10.36.116.19]) by int-mx11.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id q7R0GOpX015074 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Sun, 26 Aug 2012 20:16:26 -0400 Message-ID: <503ABC58.6060209@draigBrady.com> Date: Mon, 27 Aug 2012 01:16:24 +0100 From: =?UTF-8?B?UMOhZHJhaWcgQnJhZHk=?= User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20110816 Thunderbird/6.0 MIME-Version: 1.0 To: "P. 
Michaud"
Subject: Re: bug#12285: uniq on a UTF8 file with roman numerals
References: <8CF51CA3C7CEC3B-FA0-24063@webmail-d132.sysops.aol.com> <503A8CDF.4010704@draigBrady.com>
In-Reply-To: <503A8CDF.4010704@draigBrady.com>
X-Enigmail-Version: 1.3.2
Content-Type: text/plain; charset=UTF-8
X-Scanned-By: MIMEDefang 2.68 on 10.5.11.24
Content-Transfer-Encoding: quoted-printable
X-MIME-Autoconverted: from 8bit to quoted-printable by mx1.redhat.com id q7R0GRmF014910
X-Spam-Score: -6.9 (------)
X-Debbugs-Envelope-To: 12285
Cc: 12285@debbugs.gnu.org
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.13
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Sender: debbugs-submit-bounces@debbugs.gnu.org
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
X-Spam-Score: -6.9 (------)

tag 12285 + notabug
close 12285
stop

more info below...

On 08/26/2012 09:53 PM, Pádraig Brady wrote:
> On 08/26/2012 06:49 PM, P. Michaud wrote:
>> Hello,
>>
>> I used the command
>>
>> "uniq -dc myfile.txt"
>>
>> Here are some lines of the output:
>>
>>       2 ☼ turvy
>>       2 ☼ with gay abandon
>>       2 ☼ with reckless abandon
>>      10 ☼ yyⅰ
>>       9 ☼ yyⅹⅲ
>>       2 ☼ yyⅺ
>>      12 ☼ zzⅰ
>>
>>
>> The first three lines above are correct and correspond to real duplicate lines in the file, but the counts on the last four are erroneous; each of them corresponds to a single line in the file.
>>
>> Yours faithfully.
>>
>> Pierre Michaud
>
> What system are you on?
> What version of uniq?
> What is the input exactly?
>
> I suspect your locale is equating roman numerals (though that is surprising),

It seems that these roman numerals are treated as equal in collating order,
so uniq is behaving as expected:

$ sort <(printf "%s\n" ⅲ ⅱ ⅰ)
ⅲ
ⅱ
ⅰ
$ uniq -dc <(printf "%s\n" ⅲ ⅱ ⅰ)
      3 ⅲ

You can avoid this behaviour by forcing a byte comparison with LC_ALL=C.

$ LC_ALL=C sort <(printf "%s\n" ⅲ ⅱ ⅰ)
ⅰ
ⅱ
ⅲ
$ LC_ALL=C uniq -c <(printf "%s\n" ⅲ ⅱ ⅰ)
      1 ⅲ
      1 ⅱ
      1 ⅰ

thanks,
Pádraig.

From unknown Fri Jun 20 18:16:49 2025
Received: (at fakecontrol) by fakecontrolmessage;
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request
Subject: Internal Control
Message-Id: bug archived.
Date: Mon, 24 Sep 2012 11:24:05 +0000
User-Agent: Fakemail v42.6.9

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator
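
The byte-comparison workaround described in the last reply can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name `bytewise_dups` and the temporary sample file are hypothetical and do not appear in the thread; only `LC_ALL=C`, `sort` and `uniq -dc` come from the messages above.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): list lines that are duplicated
# byte for byte in a file, together with their counts.
bytewise_dups() {
    # LC_ALL=C makes sort and uniq compare raw bytes rather than
    # locale collation keys, so distinct characters never compare equal.
    LC_ALL=C sort -- "$1" | LC_ALL=C uniq -dc
}

# Sample input (an assumption for illustration): three distinct small
# roman numerals, one of which really does occur twice.
tmp=$(mktemp)
printf '%s\n' ⅲ ⅱ ⅰ ⅱ > "$tmp"

# Reports only ⅱ, with a count of 2.
bytewise_dups "$tmp"

rm -f "$tmp"
```

Running the same pipeline without `LC_ALL=C` may give different counts, depending on the collation tables of the active locale, which is the behaviour reported in this bug.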