From unknown Sat Jun 21 05:19:22 2025
Subject: bug#38627: uniq -c gets wrong count with non-ascii strings
From: Roy Smith
To: 38627@debbugs.gnu.org
Date: Sun, 15 Dec 2019 14:40:14 -0500
X-GNU-PR-Message: report 38627
X-GNU-PR-Package: coreutils

With the following input:

> $ cat x
> "ⁿᵘˡˡ"
> "ܥܝܪܐܩ"

Running "uniq -c" says there's two copies of the same line!

> $ uniq -c x
>       2 "ⁿᵘˡˡ"

I've attached a copy of the test file, and here's the octal dump:

> $ od -b x
> 0000000 042 342 201 277 341 265 230 313 241 313 241 042 012 042 334 245
> 0000020 334 235 334 252 334 220 334 251 042 012
> 0000032

I'm getting this on:

> Linux tools-sgebastion-08 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64 GNU/Linux
> uniq (GNU coreutils) 8.26

My MacOS 10.13.6 box gets it right:

> $ uniq -c x
>    1 "ⁿᵘˡˡ"
>    1 "ܥܝܪܐܩ"

Content-Disposition: attachment; filename=x
Content-Type: application/octet-stream; x-unix-mode=0644; name="x"
Content-Transfer-Encoding: base64

IuKBv+G1mMuhy6EiCiLcpdyd3KrckNypIgo=
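The report can be reproduced without the attachment. The sketch below, assuming a POSIX shell with GNU `base64` and `cmp` available (the scratch file names are illustrative and not from the report), rebuilds the file from the base64 payload above and checks whether the two lines differ byte for byte:

```shell
# Rebuild the test file from the base64 attachment in the report.
# Assumption: GNU coreutils base64 (-d decodes); scratch paths are illustrative.
dir=$(mktemp -d)
printf '%s\n' 'IuKBv+G1mMuhy6EiCiLcpdyd3KrckNypIgo=' | base64 -d > "$dir/x"

# The size should match the final offset of the od -b dump (octal 32 = 26).
size=$(wc -c < "$dir/x" | tr -d ' ')

# Split the file into its two lines and compare them as raw bytes.
LC_ALL=C sed -n 1p "$dir/x" > "$dir/line1"
LC_ALL=C sed -n 2p "$dir/x" > "$dir/line2"
if cmp -s "$dir/line1" "$dir/line2"; then
  verdict=identical
else
  verdict=different
fi
echo "size=$size, lines are $verdict"
```

Because the lines differ as byte strings, a count of 2 can only come from the comparison function treating them as equal.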
From unknown Sat Jun 21 05:19:22 2025
Subject: bug#38627: uniq -c gets wrong count with non-ascii strings
From: Paul Eggert
Organization: UCLA Computer Science Department
To: Roy Smith
Cc: Jim Meyering, 38627@debbugs.gnu.org
Date: Mon, 16 Dec 2019 01:41:13 -0800
Message-ID: <871e974e-0fdd-062d-13b5-53676ee78538@cs.ucla.edu>
X-GNU-PR-Message: followup 38627
X-GNU-PR-Package: coreutils

On 12/15/19 11:40 AM, Roy Smith wrote:
> With the following input:
>
>> $ cat x
>> "ⁿᵘˡˡ"
>> "ܥܝܪܐܩ"
>
>
> Running "uniq -c" says there's two copies of the same line!
>
>> $ uniq -c x
>>       2 "ⁿᵘˡˡ"

Thanks for the bug report. I expect this is because GNU 'uniq' uses the
equivalent of strcoll (locale-dependent comparison) to compare lines, whereas
macOS 'uniq' uses the equivalent of strcmp (byte comparison). Since the two
lines compare equal in your locale, GNU 'uniq' says there's just one line.

The GNU 'uniq' behavior appears to be a consequence of this commit:

commit 545c2323d493c7ed9c770d9b8e45a15db6f615bc
Author: Jim Meyering
Date:   Fri Aug 2 14:42:37 2002 +0000

with a change noted this way in NEWS:

* uniq now obeys the LC_COLLATE locale, as per POSIX 1003.1-2001 TC1.

However, the 2016 edition of POSIX removed mention of LC_COLLATE from 'uniq',
and I expect this means that the 2002 commit should be reverted so that GNU
'uniq' behaves like macOS 'uniq' (a behavior that I think makes more sense anyway).

I'll CC: this email to Jim Meyering to see whether he has an opinion about this.

In the meantime you can work around the problem by using 'LC_ALL=C uniq' instead
of plain 'uniq' in your shell script.

From unknown Sat Jun 21 05:19:22 2025
Subject: bug#38627: uniq -c gets wrong count with non-ascii strings
From: Roy Smith
To: Paul Eggert
Cc: Jim Meyering, 38627@debbugs.gnu.org
Date: Mon, 16 Dec 2019 19:46:39 -0500
Message-Id: <815E72D0-3240-45E5-94F1-A31B2F276657@panix.com>
In-Reply-To: <871e974e-0fdd-062d-13b5-53676ee78538@cs.ucla.edu>
References: <871e974e-0fdd-062d-13b5-53676ee78538@cs.ucla.edu>
X-GNU-PR-Message: followup 38627
X-GNU-PR-Package: coreutils

Yup, this does depend on the locale. In my original example, I had
LANG=en_US.UTF-8. Setting it to C.UTF-8 gets me the right result:

> $ LANG=C.UTF-8 uniq -c x
>       1 "ⁿᵘˡˡ"
>       1 "ܥܝܪܐܩ"

But, that doesn't fully explain what's going on. I find it difficult to
believe that there's any collation sequence in the world where those two
strings should compare the same. I've been playing around with the ICU
string compare demo and can't reproduce this there. Possibly I just haven't
hit upon the right combination of options to set, but I think it's
far-fetched that there's any such combination for which those two strings
comparing equal is legitimate.
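The locale dependence discussed above can be checked side by side. This is a sketch only, assuming a POSIX shell; `count_groups` is a hypothetical helper name introduced here, and the `en_US.UTF-8` result depends on the installed coreutils, glibc and locale data, so no particular output is claimed for it:

```shell
# Recreate the two input lines from the octal bytes in the od -b dump.
input=$(mktemp)
printf '"\342\201\277\341\265\230\313\241\313\241"\n' > "$input"
printf '"\334\245\334\235\334\252\334\220\334\251"\n' >> "$input"

# Hypothetical helper: number of groups uniq reports under a given locale.
count_groups() {
  LC_ALL=$1 uniq -c "$input" | wc -l | tr -d ' '
}

# In the C locale the comparison is byte-wise, so both lines are kept.
c_groups=$(count_groups C)
echo "C locale: $c_groups group(s)"

# An affected coreutils/glibc combination reports 1 here; others report 2.
echo "en_US.UTF-8: $(count_groups en_US.UTF-8) group(s)"
```

Setting `LC_ALL` rather than `LANG` is deliberate in this sketch: `LC_ALL` overrides any `LC_COLLATE` value already present in the environment, whereas `LANG` does not.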
From unknown Sat Jun 21 05:19:22 2025
Subject: bug#38627: uniq -c gets wrong count with non-ascii strings
From: Roy Smith
To: Paul Eggert
Cc: Jim Meyering, 38627@debbugs.gnu.org
Date: Tue, 17 Dec 2019 12:25:54 -0500
In-Reply-To: <815E72D0-3240-45E5-94F1-A31B2F276657@panix.com>
References: <871e974e-0fdd-062d-13b5-53676ee78538@cs.ucla.edu> <815E72D0-3240-45E5-94F1-A31B2F276657@panix.com>
X-GNU-PR-Message: followup 38627
X-GNU-PR-Package: coreutils

I stopped short of actually building uniq.c from source (bootstrap,
prerequisites, ...), but looking at the code, it looks like the call
chain is:

different()
xmemcoll()
memcoll()
strcoll()

so I tried a little test at the strcoll() level:

#include <stdio.h>
#include <unistd.h>
#include <string.h>

int
main (int argc, char **argv)
{
  unsigned char null[] = {
    0342, 0201, 0277, 0341, 0265, 0230, 0313, 0241, 0313, 0241, 0
  };
  unsigned char iraq[] = {
    0334, 0245, 0334, 0235, 0334, 0252, 0334, 0220, 0334, 0251, 0};

  printf("%s\n", null);
  printf("%s\n", iraq);

  int m = strcoll(null, iraq);
  printf("m = %d\n", m);
}

That correctly says the strings are different:

$ LANG=en_US.UTF-8 ./a.out
ⁿᵘˡˡ
ܥܝܪܐܩ
m = 6

> On Dec 16, 2019, at 7:46 PM, Roy Smith wrote:
>
> Yup, this does depend on the locale. In my original example, I had
> LANG=en_US.UTF-8. Setting it to C.UTF-8 gets me the right result:
>
>> $ LANG=C.UTF-8 uniq -c x
>>       1 "ⁿᵘˡˡ"
>>       1 "ܥܝܪܐܩ"
>
>
> But, that doesn't fully explain what's going on. I find it difficult to
> believe that there's any collation sequence in the world where those two
> strings should compare the same. I've been playing around with the ICU
> string compare demo and can't reproduce this there. Possibly I just
> haven't hit upon the right combination of options to set, but I think
> it's far-fetched that there's any such combination for which those two
> strings comparing equal is legitimate.
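One detail about the strcoll() test in this message is worth keeping in mind: it never calls setlocale(), and a C program starts in the "C" locale regardless of LANG, so the strcoll() call there may not have consulted the en_US.UTF-8 collation data at all. On glibc the "C" locale case reduces to a byte comparison, and the reported m = 6 is consistent with that, since it equals the difference between the first bytes of the two arrays. A small sketch of that arithmetic, assuming a POSIX shell (which reads a leading 0 as an octal constant):

```shell
# First bytes of the two arrays in the test program are 0342 and 0334 (octal).
# POSIX shell arithmetic treats a leading 0 as an octal constant.
diff=$((0342 - 0334))
echo "a plain byte comparison would give m = $diff"
```

Calling setlocale (LC_ALL, "") before strcoll(), with <locale.h> included, would make such a test follow LANG.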
From unknown Sat Jun 21 05:19:22 2025
Subject: bug#38627: uniq -c gets wrong count with non-ascii strings
From: Jim Meyering
To: Paul Eggert
Cc: Roy Smith, 38627@debbugs.gnu.org
Date: Tue, 17 Dec 2019 15:10:33 -0800
In-Reply-To: <871e974e-0fdd-062d-13b5-53676ee78538@cs.ucla.edu>
References: <871e974e-0fdd-062d-13b5-53676ee78538@cs.ucla.edu>
X-GNU-PR-Message: followup 38627
X-GNU-PR-Package: coreutils

On Mon, Dec 16, 2019 at 1:41 AM Paul Eggert wrote:
> On 12/15/19 11:40 AM, Roy Smith wrote:
> > With the following input:
> >
> >> $ cat x
> >> "ⁿᵘˡˡ"
> >> "ܥܝܪܐܩ"
> >
> >
> > Running "uniq -c" says there's two copies of the same line!
> >
> >> $ uniq -c x
> >>       2 "ⁿᵘˡˡ"
>
> Thanks for the bug report. I expect this is because GNU 'uniq' uses the
> equivalent of strcoll (locale-dependent comparison) to compare lines, whereas
> macOS 'uniq' uses the equivalent of strcmp (byte comparison). Since the two
> lines compare equal in your locale, GNU 'uniq' says there's just one line.
>
> The GNU 'uniq' behavior appears to be a consequence of this commit:
>
> commit 545c2323d493c7ed9c770d9b8e45a15db6f615bc
> Author: Jim Meyering
> Date:   Fri Aug 2 14:42:37 2002 +0000
>
> with a change noted this way in NEWS:
>
> * uniq now obeys the LC_COLLATE locale, as per POSIX 1003.1-2001 TC1.
>
> However, the 2016 edition of POSIX removed mention of LC_COLLATE from 'uniq',
> and I expect this means that the 2002 commit should be reverted so that GNU
> 'uniq' behaves like macOS 'uniq' (a behavior that I think makes more sense anyway).
>
> I'll CC: this email to Jim Meyering to see whether he has an opinion about this.
>
> In the meantime you can work around the problem by using 'LC_ALL=C uniq' instead
> of plain 'uniq' in your shell script.

Thanks for the report, Roy, and thanks Paul for diving in.
I confess I haven't done more than look at that old diff, but this
sure sounds like a bug we must fix, to get in line with the much
more recent POSIX spec.

From unknown Sat Jun 21 05:19:22 2025
Subject: bug#38627: uniq -c gets wrong count with non-ascii strings
From: Bruno Haible
To: eggert@cs.ucla.edu
Cc: 38627@debbugs.gnu.org
Date: Wed, 18 Dec 2019 05:39:38 +0100
Message-ID: <6530924.vR5LBQhLdA@omega>
X-GNU-PR-Message: followup 38627
X-GNU-PR-Package: coreutils

> However, the 2016 edition of POSIX removed mention of LC_COLLATE from 'uniq'

Indeed. The change was done in . Quote:

  "On Page: 3309 Line: 111067 Section: uniq

   In the ENVIRONMENT VARIABLES section, delete:

     LC_COLLATE
       Determine the locale for ordering rules."

Bruno

From unknown Sat Jun 21 05:19:22 2025
From: help-debbugs@gnu.org (GNU bug Tracking System)
To: Roy Smith
Subject: bug#38627: closed (Re: bug#38627: uniq -c gets wrong count with non-ascii strings)
References: <8d9261ea-7355-fa06-a286-81268178e706@draigBrady.com>
Reply-To: 38627@debbugs.gnu.org
Date: Sun, 23 Feb 2020 19:44:01 +0000
X-Gnu-PR-Message: they-closed 38627
X-Gnu-PR-Package: coreutils

Your bug report

#38627: uniq -c gets wrong count with non-ascii strings

which was filed against the coreutils package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 38627@debbugs.gnu.org.

-- 
38627: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=38627
GNU Bug Tracking System
Contact help-debbugs@gnu.org with problems

Subject: Re: bug#38627: uniq -c gets wrong count with non-ascii strings
From: Pádraig Brady
To: Roy Smith
Cc: 38627-done@debbugs.gnu.org
Date: Sun, 23 Feb 2020 19:43:27 +0000
Message-ID: <8d9261ea-7355-fa06-a286-81268178e706@draigBrady.com>
References: <871e974e-0fdd-062d-13b5-53676ee78538@cs.ucla.edu> <815E72D0-3240-45E5-94F1-A31B2F276657@panix.com>

On 17/12/2019 17:25, Roy Smith wrote:
> I stopped short of actually building uniq.c from source (bootstrap,
> prerequisites, ...), but looking at the code, it looks like the call
> chain is:
>
> different()
> xmemcoll()
> memcoll()
> strcoll()
>
> so I tried a little test at the strcoll() level:
>
> #include <stdio.h>
> #include <unistd.h>
> #include <string.h>
>
> int
> main (int argc, char **argv)
> {
>   unsigned char null[] = {
>     0342, 0201, 0277, 0341, 0265, 0230, 0313, 0241, 0313, 0241, 0
>   };
>   unsigned char iraq[] = {
>     0334, 0245, 0334, 0235, 0334, 0252, 0334, 0220, 0334, 0251, 0};
>
>   printf("%s\n", null);
>   printf("%s\n", iraq);
>
>   int m = strcoll(null, iraq);
>   printf("m = %d\n", m);
> }
>
> That correctly says the strings are different:
>
> $ LANG=en_US.UTF-8 ./a.out
> ⁿᵘˡˡ
> ܥܝܪܐܩ
> m = 6
>
>> On Dec 16, 2019, at 7:46 PM, Roy Smith wrote:
>>
>> Yup, this does depend on the locale. In my original example, I had
>> LANG=en_US.UTF-8. Setting it to C.UTF-8 gets me the right result:
>>
>>> $ LANG=C.UTF-8 uniq -c x
>>>       1 "ⁿᵘˡˡ"
>>>       1 "ܥܝܪܐܩ"
>>
>>
>> But, that doesn't fully explain what's going on. I find it difficult to
>> believe that there's any collation sequence in the world where those two
>> strings should compare the same. I've been playing around with the ICU
>> string compare demo and can't reproduce this there. Possibly I just
>> haven't hit upon the right combination of options to set, but I think
>> it's far-fetched that there's any such combination for which those two
>> strings comparing equal is legitimate.

I think you ran your test on a newer glibc.
Testing on older glibc-2.22 I see the issue with strcoll()
returning 0 for the above strings, while it returns
an expected difference on glibc-2.30 at least.
There are a few things to reason about with removing strcoll(), namely: buggy strcoll implementations inconsistent unicode normalization mismatched locale settings and data handling of characters ignored in collation order tl;dr is that strcoll() should be removed for all these reasons, and I've added a test for each of the 4 cases above in the attached patch, which I'll push later. Marking this as done. thanks, Pádraig --------------AD43522172AE4BA28DDAD868 Content-Type: text/x-patch; charset=UTF-8; name="uniq-no-strcoll.patch" Content-Transfer-Encoding: 8bit Content-Disposition: attachment; filename="uniq-no-strcoll.patch" >From 439a0f7fe0b89ebc371a80b30b07e3fd8b0c1b4e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?P=C3=A1draig=20Brady?= Date: Sun, 23 Feb 2020 13:20:08 +0000 Subject: [PATCH] uniq: avoid strcoll() to improve performance and consistency strcoll() is only significant to uniq(1) if it returns 0, and it generally only does so with buggy locales or mismatched locales and data. Some systems may have strcoll() return 0 for equivalent normalized unicode forms, but for consistency across platforms strcoll() is avoided. The various cases are defined in the new test. This is consistent with newer POSIX standards as discussed at: https://www.austingroupbugs.net/view.php?id=963 * src/uniq.c: s/xstrcoll/memcmp/. * tests/local.mk: Reference the new test. * tests/misc/uniq-collate.sh: Add a new test. * NEWS: Mention the change in behavior. Fixes https://bugs.gnu.org/38627 --- NEWS | 3 ++ src/uniq.c | 13 +------- tests/local.mk | 1 + tests/misc/uniq-collate.sh | 64 ++++++++++++++++++++++++++++++++++++++ 4 files changed, 69 insertions(+), 12 deletions(-) create mode 100755 tests/misc/uniq-collate.sh diff --git a/NEWS b/NEWS index 8a349634e..6afb9cb6d 100644 --- a/NEWS +++ b/NEWS @@ -65,6 +65,9 @@ GNU coreutils NEWS -*- outline -*- [The old behavior was introduced in sh-utils 2.0.15 ca. 1999, predating coreutils package.] 
+ uniq no longer uses strcoll() to determine string equivalence, + and so will operate more efficiently and consistently. + ** New Features ls now supports the --time=birth option to display and sort by diff --git a/src/uniq.c b/src/uniq.c index 0fcf50a16..6608b6c62 100644 --- a/src/uniq.c +++ b/src/uniq.c @@ -30,7 +30,6 @@ #include "hard-locale.h" #include "posixver.h" #include "stdio--.h" -#include "xmemcoll.h" #include "xstrtol.h" #include "memcasecmp.h" #include "quote.h" @@ -52,9 +51,6 @@ } \ while (0) -/* True if the LC_COLLATE locale is hard. */ -static bool hard_LC_COLLATE; - /* Number of fields to skip on each line when doing comparisons. */ static size_t skip_fields; @@ -220,7 +216,6 @@ characters. Fields are skipped before chars.\n\ \n\ Note: 'uniq' does not detect repeated lines unless they are adjacent.\n\ You may want to sort the input first, or use 'sort -u' without 'uniq'.\n\ -Also, comparisons honor the rules specified by 'LC_COLLATE'.\n\ "), stdout); emit_ancillary_info (PROGRAM_NAME); } @@ -293,12 +288,7 @@ different (char *old, char *new, size_t oldlen, size_t newlen) newlen = check_chars; if (ignore_case) - { - /* FIXME: This should invoke strcoll somehow. 
*/ - return oldlen != newlen || memcasecmp (old, new, oldlen); - } - else if (hard_LC_COLLATE) - return xmemcoll (old, oldlen, new, newlen) != 0; + return oldlen != newlen || memcasecmp (old, new, oldlen); else return oldlen != newlen || memcmp (old, new, oldlen); } @@ -501,7 +491,6 @@ main (int argc, char **argv) setlocale (LC_ALL, ""); bindtextdomain (PACKAGE, LOCALEDIR); textdomain (PACKAGE); - hard_LC_COLLATE = hard_locale (LC_COLLATE); atexit (close_stdout); diff --git a/tests/local.mk b/tests/local.mk index bbcb9d413..0aabdaacc 100644 --- a/tests/local.mk +++ b/tests/local.mk @@ -438,6 +438,7 @@ all_tests = \ tests/misc/unexpand.pl \ tests/misc/uniq.pl \ tests/misc/uniq-perf.sh \ + tests/misc/uniq-collate.sh \ tests/misc/xattr.sh \ tests/misc/yes.sh \ tests/tail-2/wait.sh \ diff --git a/tests/misc/uniq-collate.sh b/tests/misc/uniq-collate.sh new file mode 100755 index 000000000..6767848a4 --- /dev/null +++ b/tests/misc/uniq-collate.sh @@ -0,0 +1,64 @@ +#!/bin/sh +# before coreutils-8.32, uniq would not distinguish +# items which compared equal with strcoll() +# So ensure we avoid strcoll() for the following cases. + +# Copyright (C) 2020 Free Software Foundation, Inc. + +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. + +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. + +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . + +. 
"${srcdir=.}/tests/init.sh"; path_prepend_ ./src +print_ver_ uniq printf + +gen_input() +{ + env LC_ALL=$LOCALE_FR_UTF8 printf \ + "$@" > in || framework_failure_ +} + +# strcoll() used to return 0 comparing the following strings +# which was fixed somewhere between glibc-2.22 and glibc-2.30 +gen_input '%s\n' 'ⁿᵘˡˡ' 'ܥܝܪܐܩ' > in || framework_failure_ +test $(LC_ALL=$LOCALE_FR_UTF8 uniq < in | wc -l) = 2 || fail=1 + +# normalization in strcoll is inconsistent across platforms. +# glibc based systems at least do _not_ normalize in strcoll, +# while cygwin systems for example may do so. +# á composed, decomposed +gen_input '\u00E1\na\u0301\n' > in || framework_failure_ +test $(LC_ALL=$LOCALE_FR_UTF8 uniq < in | wc -l) = 2 || fail=1 +# Similarly with hangul +gen_input '\uAC01\n\u1100\u1161\u11A8' > in || framework_failure_ +test $(LC_ALL=ko_KR.utf8 uniq < in | wc -l) = 2 || fail=1 + +# Note if running in the wrong locale, strcoll may +# indicate the strings match when they don't +# I.e., cjk and hangul will now work even if +# uniq is running in the wong locale +# hangul (ko_KR.utf8) +gen_input '\uAC00\n\uAC01\n' > in || framework_failure_ +test $(LC_ALL=en_US.utf8 uniq < in | wc -l) = 2 || fail=1 +# CJK (zh_CN.utf8) +gen_input '\u3400\n\u3401\n' > in || framework_failure_ +test $(LC_ALL=en_US.utf8 uniq < in | wc -l) = 2 || fail=1 + +# Note strcoll() ignores certain characters, +# but not if the strings are otherwise equal. 
+# I.e., the following on glibc-2.30 at least +# does not print a single item as expected, +# but testing here for illustration +gen_input ',a\n.a\n' > in || framework_failure_ +test $(LC_ALL=$LOCALE_FR_UTF8 uniq < in | wc -l) = 2 || fail=1 + +Exit $fail -- 2.24.1 --------------AD43522172AE4BA28DDAD868-- ------------=_1582487041-19374-1 Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit Received: (at submit) by debbugs.gnu.org; 15 Dec 2019 19:40:20 +0000 Received: from localhost ([127.0.0.1]:37131 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1igZka-0002U1-Gz for submit@debbugs.gnu.org; Sun, 15 Dec 2019 14:40:20 -0500 Received: from lists.gnu.org ([209.51.188.17]:47165) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1igZkY-0002Ts-Gl for submit@debbugs.gnu.org; Sun, 15 Dec 2019 14:40:18 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]:36522) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1igZkX-0004X5-9s for bug-coreutils@gnu.org; Sun, 15 Dec 2019 14:40:18 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=-1.5 required=5.0 tests=BAYES_50,HTML_MESSAGE, RCVD_IN_DNSWL_MED autolearn=disabled version=3.3.2 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1igZkW-0000Bc-7q for bug-coreutils@gnu.org; Sun, 15 Dec 2019 14:40:17 -0500 Received: from mailbackend.panix.com ([166.84.1.89]:52231) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1igZkW-00007i-0P for bug-coreutils@gnu.org; Sun, 15 Dec 2019 14:40:16 -0500 Received: from [10.0.1.14] (ool-45734927.dyn.optonline.net [69.115.73.39]) by mailbackend.panix.com (Postfix) with ESMTPSA id 47bZW318YVz1pcS for ; Sun, 15 Dec 2019 14:40:14 -0500 (EST) From: Roy Smith Content-Type: multipart/alternative; 
boundary="Apple-Mail=_319628D4-F62F-467C-A73E-667B8418C1C0" Mime-Version: 1.0 (Mac OS X Mail 11.5 \(3445.9.1\)) Subject: uniq -c gets wrong count with non-ascii strings Message-Id: Date: Sun, 15 Dec 2019 14:40:14 -0500 To: bug-coreutils@gnu.org X-Mailer: Apple Mail (2.3445.9.1) X-detected-operating-system: by eggs.gnu.org: GNU/Linux (Android) [fuzzy] X-Received-From: 166.84.1.89 X-Spam-Score: -1.6 (-) X-Debbugs-Envelope-To: submit X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.6 (--) --Apple-Mail=_319628D4-F62F-467C-A73E-667B8418C1C0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 With the following input: > $ cat x > "=E2=81=BF=E1=B5=98=CB=A1=CB=A1" > "=DC=A5=DC=9D=DC=AA=DC=90=DC=A9" Running "uniq -c" says there's two copies of the same line! > $ uniq -c x > 2 "=E2=81=BF=E1=B5=98=CB=A1=CB=A1" I've attached a copy of the test file, and here's the octal dump: > $ od -b x > 0000000 042 342 201 277 341 265 230 313 241 313 241 042 012 042 334 = 245 > 0000020 334 235 334 252 334 220 334 251 042 012 > 0000032 I'm getting this on: > Linux tools-sgebastion-08 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 = (2018-10-27) x86_64 GNU/Linux > uniq (GNU coreutils) 8.26 My MacOS 10.13.6 box gets it right: > $ uniq -c x > 1 "=E2=81=BF=E1=B5=98=CB=A1=CB=A1" > 1 "=DC=A5=DC=9D=DC=AA=DC=90=DC=A9" --Apple-Mail=_319628D4-F62F-467C-A73E-667B8418C1C0 Content-Type: multipart/mixed; boundary="Apple-Mail=_6E85B91E-BD97-4FE9-843F-21F5E789A4D7" --Apple-Mail=_6E85B91E-BD97-4FE9-843F-21F5E789A4D7 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=utf-8 With = the following input:

$ cat x
"ⁿᵘˡˡ"
"ܥܝܪܐܩ"

Running "uniq -c" says there's two copies of the same line!

$ uniq -c x
      2 "ⁿᵘˡˡ"

I've attached a copy of the test file, and here's the octal dump:

$ od -b x
0000000 042 342 201 277 341 265 230 313 241 313 241 042 012 042 334 245
0000020 334 235 334 252 334 220 334 251 042 012
0000032


I'm getting this on:

Linux tools-sgebastion-08 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64 GNU/Linux
uniq (GNU coreutils) 8.26

My MacOS 10.13.6 box gets it right:

$ uniq -c x
   1 "ⁿᵘˡˡ"
   1 "ܥܝܪܐܩ"


= --Apple-Mail=_6E85B91E-BD97-4FE9-843F-21F5E789A4D7 Content-Disposition: attachment; filename=x Content-Type: application/octet-stream; x-unix-mode=0644; name="x" Content-Transfer-Encoding: base64 IuKBv+G1mMuhy6EiCiLcpdyd3KrckNypIgo= --Apple-Mail=_6E85B91E-BD97-4FE9-843F-21F5E789A4D7 Content-Transfer-Encoding: 7bit Content-Type: text/html; charset=us-ascii
--Apple-Mail=_6E85B91E-BD97-4FE9-843F-21F5E789A4D7-- --Apple-Mail=_319628D4-F62F-467C-A73E-667B8418C1C0-- ------------=_1582487041-19374-1-- From unknown Sat Jun 21 05:19:22 2025 X-Loop: help-debbugs@gnu.org Subject: bug#38627: uniq -c gets wrong count with non-ascii strings Resent-From: Andreas Schwab Original-Sender: "Debbugs-submit" Resent-CC: bug-coreutils@gnu.org Resent-Date: Sun, 23 Feb 2020 20:03:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 38627 X-GNU-PR-Package: coreutils X-GNU-PR-Keywords: To: 38627@debbugs.gnu.org Cc: roy@panix.com, P@draigBrady.com Received: via spool by 38627-submit@debbugs.gnu.org id=B38627.158248815628918 (code B ref 38627); Sun, 23 Feb 2020 20:03:01 +0000 Received: (at 38627) by debbugs.gnu.org; 23 Feb 2020 20:02:36 +0000 Received: from localhost ([127.0.0.1]:51638 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1j5xSW-0007WL-NF for submit@debbugs.gnu.org; Sun, 23 Feb 2020 15:02:36 -0500 Received: from mail-out.m-online.net ([212.18.0.10]:40786) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1j5xSU-0007W9-B9 for 38627@debbugs.gnu.org; Sun, 23 Feb 2020 15:02:34 -0500 Received: from frontend01.mail.m-online.net (unknown [192.168.8.182]) by mail-out.m-online.net (Postfix) with ESMTP id 48QbhS21tzz1rfPq; Sun, 23 Feb 2020 21:02:31 +0100 (CET) Received: from localhost (dynscan1.mnet-online.de [192.168.6.70]) by mail.m-online.net (Postfix) with ESMTP id 48QbhR65PFz1qqkL; Sun, 23 Feb 2020 21:02:31 +0100 (CET) X-Virus-Scanned: amavisd-new at mnet-online.de Received: from mail.mnet-online.de ([192.168.8.182]) by localhost (dynscan1.mail.m-online.net [192.168.6.70]) (amavisd-new, port 10024) with ESMTP id XqTsgbIIJrT8; Sun, 23 Feb 2020 21:02:31 +0100 (CET) X-Auth-Info: X/l5WYEWy+e+arQ7okMpJ0MQ7Sb3+gkQ7dQYtl6BSTekkoqG3cE7gGi68DMBrMin Received: from igel.home (ppp-46-244-169-177.dynamic.mnet-online.de [46.244.169.177]) (using TLSv1.2 with cipher 
ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.mnet-online.de (Postfix) with ESMTPSA; Sun, 23 Feb 2020 21:02:30 +0100 (CET) Received: by igel.home (Postfix, from userid 1000) id 5E01D2C2917; Sun, 23 Feb 2020 21:02:30 +0100 (CET) From: Andreas Schwab References: <871e974e-0fdd-062d-13b5-53676ee78538@cs.ucla.edu> <815E72D0-3240-45E5-94F1-A31B2F276657@panix.com> <8d9261ea-7355-fa06-a286-81268178e706@draigBrady.com> X-Yow: Vote for ME -- I'm well-tapered, half-cocked, ill-conceived and TAX-DEFERRED! Date: Sun, 23 Feb 2020 21:02:30 +0100 In-Reply-To: <8d9261ea-7355-fa06-a286-81268178e706@draigBrady.com> ("=?UTF-8?Q?P=C3=A1draig?= Brady"'s message of "Sun, 23 Feb 2020 19:43:27 +0000") Message-ID: <87a759138p.fsf@igel.home> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Spam-Score: -0.5 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.5 (-) On Feb 23 2020, Pádraig Brady wrote: > On 17/12/2019 17:25, Roy Smith wrote: >> I stopped short of actually building uniq.c from source (bootstrap, prerequisites, ...), but looking at the code, it looks like the call chain is: >> >> different() >> xmemcoll() >> memcoll() >> strcoll() >> >> so I tried a little test at the strcoll() level: >> >> #include >> #include >> #include >> >> int >> main (int argc, char **argv) >> { >> unsigned char null[] = { >> >> 0342, 0201, 0277, 0341, 0265, 0230, 0313, 0241, 0313, 0241, 0 >> }; >> unsigned char iraq[] = { >> 0334, 0245, 0334, 0235, 0334, 0252, 0334, 0220, 0334, 0251, 0}; >> >> printf("%s\n", null); >> printf("%s\n", iraq); >> >> int m = strcoll(null, iraq); >> printf("m = %d\n", m); >> } This lacks setlocale. Andreas. 
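For reference, a C program that never calls setlocale() runs in the "C" locale, where strcoll() behaves like strcmp(). The reported "m = 6" is consistent with that, since it is what a plain comparison of the first bytes (0342 and 0334) would give. Below is a minimal sketch of the test with the missing call added. The helper collate_in_locale() and the array names are hypothetical, chosen here for illustration only.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* The two UTF-8 strings from the report, as octal escapes.  */
const char null_utf8[] = "\342\201\277\341\265\230\313\241\313\241";
const char iraq_utf8[] = "\334\245\334\235\334\252\334\220\334\251";

/* Hypothetical helper: select LOCALE for collation, then compare A and B
   with strcoll().  Returns 0 and stores the comparison in *RESULT on
   success, or -1 if the locale is not available on this system.
   Note that this changes the process-wide LC_COLLATE setting.  */
int
collate_in_locale (const char *locale, const char *a, const char *b,
                   int *result)
{
  assert (locale != NULL && a != NULL && b != NULL && result != NULL);
  if (setlocale (LC_COLLATE, locale) == NULL)
    return -1;
  *result = strcoll (a, b);
  return 0;
}
```

Passing "" as the locale picks up LANG and LC_ALL from the environment, in the same way as the setlocale (LC_ALL, "") call in uniq's main(). Passing "en_US.UTF-8" exercises the locale from the original report, provided it is installed.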
-- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1 "And now for something completely different." From unknown Sat Jun 21 05:19:22 2025 X-Loop: help-debbugs@gnu.org Subject: bug#38627: uniq -c gets wrong count with non-ascii strings Resent-From: Paul Eggert Original-Sender: "Debbugs-submit" Resent-CC: bug-coreutils@gnu.org Resent-Date: Sun, 23 Feb 2020 23:50:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 38627 X-GNU-PR-Package: coreutils X-GNU-PR-Keywords: To: P@draigBrady.com Cc: roy@panix.com, 38627@debbugs.gnu.org Received: via spool by 38627-submit@debbugs.gnu.org id=B38627.158250179919281 (code B ref 38627); Sun, 23 Feb 2020 23:50:01 +0000 Received: (at 38627) by debbugs.gnu.org; 23 Feb 2020 23:49:59 +0000 Received: from localhost ([127.0.0.1]:51711 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1j610Z-00050u-9N for submit@debbugs.gnu.org; Sun, 23 Feb 2020 18:49:59 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:49436) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1j610X-00050e-92 for 38627@debbugs.gnu.org; Sun, 23 Feb 2020 18:49:58 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 9C41816009E; Sun, 23 Feb 2020 15:49:50 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id dimxohJ0hCAU; Sun, 23 Feb 2020 15:49:50 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 037DE1600A2; Sun, 23 Feb 2020 15:49:50 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id e-wyZ80ydGZq; Sun, 23 Feb 2020 15:49:49 -0800 (PST) Received: from [192.168.1.9] 
(cpe-23-242-74-103.socal.res.rr.com [23.242.74.103]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id D11B416009E; Sun, 23 Feb 2020 15:49:49 -0800 (PST) References: <871e974e-0fdd-062d-13b5-53676ee78538@cs.ucla.edu> <815E72D0-3240-45E5-94F1-A31B2F276657@panix.com> <8d9261ea-7355-fa06-a286-81268178e706@draigBrady.com> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <14382c81-d021-635a-f468-1d549b9d4cd5@cs.ucla.edu> Date: Sun, 23 Feb 2020 15:49:46 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.4.1 MIME-Version: 1.0 In-Reply-To: <8d9261ea-7355-fa06-a286-81268178e706@draigBrady.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.3 (--) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) On 2/23/20 11:43 AM, P=C3=A1draig Brady wrote: > #include "hard-locale.h" > #include "posixver.h" > #include "stdio--.h" > -#include "xmemcoll.h" Please also remove the '#include "hard-locale.h"' line. Thanks for fixing this.