From unknown Wed Jun 25 03:51:21 2025 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Mailer: MIME-tools 5.509 (Entity 5.509) Content-Type: text/plain; charset=utf-8 From: bug#16168 <16168@debbugs.gnu.org> To: bug#16168 <16168@debbugs.gnu.org> Subject: Status: uniq mis-handles UTF8 (8bit) characters Reply-To: bug#16168 <16168@debbugs.gnu.org> Date: Wed, 25 Jun 2025 10:51:21 +0000 retitle 16168 uniq mis-handles UTF8 (8bit) characters reassign 16168 coreutils submitter 16168 Shlomo Urbach severity 16168 normal thanks From debbugs-submit-bounces@debbugs.gnu.org Mon Dec 16 11:55:54 2013 Received: (at submit) by debbugs.gnu.org; 16 Dec 2013 16:55:54 +0000 Received: from localhost ([127.0.0.1]:54164 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1VsbST-0003dy-Lr for submit@debbugs.gnu.org; Mon, 16 Dec 2013 11:55:54 -0500 Received: from eggs.gnu.org ([208.118.235.92]:36300) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1VsYZG-0006bx-3e for submit@debbugs.gnu.org; Mon, 16 Dec 2013 08:50:42 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1VsYZE-0003HM-IG for submit@debbugs.gnu.org; Mon, 16 Dec 2013 08:50:41 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=BAYES_20,HTML_MESSAGE, T_DKIM_INVALID autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:37357) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1VsYZE-0003HH-Em for submit@debbugs.gnu.org; Mon, 16 Dec 2013 08:50:40 -0500 Received: from eggs.gnu.org ([2001:4830:134:3::10]:38021) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1VsYZD-0006v1-8C for bug-coreutils@gnu.org; Mon, 16 Dec 2013 08:50:40 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1VsYZC-0003Gw-5b for 
bug-coreutils@gnu.org; Mon, 16 Dec 2013 08:50:39 -0500 Received: from mail-ob0-x22b.google.com ([2607:f8b0:4003:c01::22b]:37489) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1VsYZC-0003Gh-0P for bug-coreutils@gnu.org; Mon, 16 Dec 2013 08:50:38 -0500 Received: by mail-ob0-f171.google.com with SMTP id wp18so4810237obc.16 for ; Mon, 16 Dec 2013 05:50:35 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:from:date:message-id:subject:to:content-type; bh=Fw8+3U/lLfzgLMc4Wfblpvy6N9I6j9t06eLtOo/fxFg=; b=Kq8IaN8CSaim6OqUhWbl/aZZEEVEzXR4r+9FNQkKIetTBsNEwF/ElMyF3lhXAItOEK /yNsB0sVB8WhbuM0BP7rReFqTPkPVzCilK7oMzD3sKiN8a/TwVl+9rlRett8HndNW+6r XUuYAZQjbUe6rHUjDLWzQzsEcNUUUjMvypa7siXTzpA5rEU8v3HxTuz8MlPpQOy64bJo Qf6Egg5NljbYySIj4AkT5RRDSBzQWvkaIM0mla+2fykCtnvoEVaB3BmGvUUPH+LuP9v8 w7Jc3NnBd0QCioFBpAhFwDr+kS1FVxwX/lmcKt35hO5a6K47TzVxa6Xvd5X7cDf7Z0ui aIBw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:from:date:message-id:subject:to :content-type; bh=Fw8+3U/lLfzgLMc4Wfblpvy6N9I6j9t06eLtOo/fxFg=; b=ewId979zGTu9XJVUmeUCPlFgE67dDOy/T+ZgJgCawCrc6tTvvaaBMZPPuiK8ZEeFra iJ/WrkoX2lQaa3cy6xngtGMglydAXw5y65plY6x8q/bOoRm4jEKTxCcg4ZT/D2Q7jK1Y NEmdZufipBpuEykv7TYX4c8ZdG0wDPVeMVSMRIcF19hyG+XAlFczkAvfhKgQeAPdzSCQ HHZsm9KwX/2tN1coG4ux+8PcwcT1ZqlnSZNTPLwsDYssjZzpv8v5fMuUtraDqUPHe1i3 iBQPZv1CTtoCbPFwK1JDNHHurX0yikkZ7tpYtvy+kTVTz+Qpr2wKb1EWiKkyTPfoQtyU NNuQ== X-Gm-Message-State: ALoCoQkxDZFU3+thd5VSoLBt3Mu3r0rUvpWgWtheLhJL7yoyHuymWXh9k8DWFbj6eiFbRizyhGgWKzl8pvlZi1OgHvJMttAaU64Dvf16URlpc9GOmK7BZ4k2EaWuZ4HavCRNnP8+JdMnJwGFuczME6a6lsKPl1MO7imknsx1MzfJ7NCoyWqTMaqO9AFL6/xwgfqOmvFMMaRrwrOMPzqm1Zev/Pe1wTRFcQ== X-Received: by 10.60.136.132 with SMTP id qa4mr1563944oeb.68.1387201835704; Mon, 16 Dec 2013 05:50:35 -0800 (PST) MIME-Version: 1.0 Received: by 10.182.80.166 with HTTP; Mon, 16 Dec 2013 05:50:15 -0800 (PST) From: Shlomo Urbach Date: Mon, 16 Dec 2013 15:50:15 
+0200 Message-ID: Subject: uniq mis-handles UTF8 (8bit) characters To: bug-coreutils@gnu.org Content-Type: multipart/alternative; boundary=047d7b414f40a63e3104eda718f6 X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.0 (----) X-Debbugs-Envelope-To: submit X-Mailman-Approved-At: Mon, 16 Dec 2013 11:55:51 -0500 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.0 (----) --047d7b414f40a63e3104eda718f6 Content-Type: text/plain; charset=ISO-8859-1 Lines with CJK letters are deemed equal by length only, since the characters seem to be ignored. I understand this is due to locale. But, it would be nice if a simple flag would do a locale-free comparison (i.e. equal = all bytes are equal). --047d7b414f40a63e3104eda718f6 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable
Lines with CJK letters are deemed equal by length only, since the characters seem to be ignored.
I understand this is due to locale.
But, it would be nice if a simple flag would do a locale-free comparison (i.e. equal = all bytes are equal).

--047d7b414f40a63e3104eda718f6-- From debbugs-submit-bounces@debbugs.gnu.org Mon Dec 16 12:33:29 2013 Received: (at 16168-done) by debbugs.gnu.org; 16 Dec 2013 17:33:29 +0000 Received: from localhost ([127.0.0.1]:54226 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Vsc2q-0005xq-Lm for submit@debbugs.gnu.org; Mon, 16 Dec 2013 12:33:28 -0500 Received: from mail3.vodafone.ie ([213.233.128.45]:46087) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Vsc2n-0005xc-Jv for 16168-done@debbugs.gnu.org; Mon, 16 Dec 2013 12:33:26 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AjkDAE44r1JtT9qD/2dsb2JhbAANTIcasmmDAQMCgTuDGQEBAQQjDwFGEAsNCwICBRYLAgIJAwIBAgFFBg0BBwEBiAWvEnaYJReBKY1wB4JugUgBA58CjlI Received: from unknown (HELO [192.168.1.79]) ([109.79.218.131]) by mail3.vodafone.ie with ESMTP; 16 Dec 2013 17:33:23 +0000 Message-ID: <52AF3963.6020003@draigBrady.com> Date: Mon, 16 Dec 2013 17:33:23 +0000 From: =?UTF-8?B?UMOhZHJhaWcgQnJhZHk=?= User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2 MIME-Version: 1.0 To: Shlomo Urbach Subject: Re: bug#16168: uniq mis-handles UTF8 (8bit) characters References: In-Reply-To: X-Enigmail-Version: 1.6 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 16168-done Cc: 16168-done@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/) tag 16168 notabug close 16168 stop On 12/16/2013 01:50 PM, Shlomo Urbach wrote: > Lines with CJK letters are deemed equal by length only, since the > characters seem to be ignored. > I understand this is due to locale. > But, it would be nice if a simple flag would do a locale-free comparison > (i.e. 
equal = all bytes are equal). If you want to compare byte by byte: LC_ALL=C uniq .... thanks, Pǽdraig. From debbugs-submit-bounces@debbugs.gnu.org Mon Dec 16 13:02:14 2013 Received: (at 16168) by debbugs.gnu.org; 16 Dec 2013 18:02:14 +0000 Received: from localhost ([127.0.0.1]:54305 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1VscUg-0006tk-Dw for submit@debbugs.gnu.org; Mon, 16 Dec 2013 13:02:14 -0500 Received: from ishtar.tlinx.org ([173.164.175.65]:48723) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1VscUe-0006tb-Dv for 16168@debbugs.gnu.org; Mon, 16 Dec 2013 13:02:13 -0500 Received: from [192.168.4.12] (Athenae [192.168.4.12]) by Ishtar.tlinx.org (8.14.7/8.14.4/SuSE Linux 0.8) with ESMTP id rBGI28W0088871; Mon, 16 Dec 2013 10:02:10 -0800 Message-ID: <52AF4020.5010505@tlinx.org> Date: Mon, 16 Dec 2013 10:02:08 -0800 From: Linda Walsh User-Agent: Thunderbird MIME-Version: 1.0 To: 16168@debbugs.gnu.org, P@draigBrady.com, urbach@google.com Subject: Re: bug#16168: uniq mis-handles UTF8 (8bit) characters References: <52AF3963.6020003@draigBrady.com> In-Reply-To: <52AF3963.6020003@draigBrady.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Spam-Score: -0.5 (/) X-Debbugs-Envelope-To: 16168 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.5 (/) Maybe he was hoping for a uniq [-b|--bytes] ? 
Suggestion to Shlomo (if you use bash): alias uniq='LC_ALL=C \uniq' or, if you want it in your shell scripts too: uniq() { LC_ALL=C "$(type -P uniq)" "$@" ; }; export -f uniq On 12/16/2013 9:33 AM, Pádraig Brady wrote: > tag 16168 notabug > close 16168 > stop > > On 12/16/2013 01:50 PM, Shlomo Urbach wrote: >> Lines with CJK letters are deemed equal by length only, since the >> characters seem to be ignored. >> I understand this is due to locale. >> But, it would be nice if a simple flag would do a locale-free comparison >> (i.e. equal = all bytes are equal). > > If you want to compare byte by byte: > > LC_ALL=C uniq .... > > thanks, > Pǽdraig. > > > From debbugs-submit-bounces@debbugs.gnu.org Mon Dec 16 15:19:55 2013 Received: (at 16168) by debbugs.gnu.org; 16 Dec 2013 20:19:55 +0000 Received: from localhost ([127.0.0.1]:54462 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Vsedu-0002CW-Uz for submit@debbugs.gnu.org; Mon, 16 Dec 2013 15:19:55 -0500 Received: from mail-oa0-f54.google.com ([209.85.219.54]:34261) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Vseds-0002CM-Lk for 16168@debbugs.gnu.org; Mon, 16 Dec 2013 15:19:53 -0500 Received: by mail-oa0-f54.google.com with SMTP id h16so5668699oag.27 for <16168@debbugs.gnu.org>; Mon, 16 Dec 2013 12:19:51 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type; bh=bd4WOo3s5qgTqHkprO/99KrMrV5XlAHW5SvnASOba3E=; b=gsBwTmCdCaz6HbXU1NnzxoYgzQwCeQuNbvwQtt+kDzY5udYmMiJedVlso/36KbQEPg XS9iM3aZOtU21KRT/Ec6WyR5KkZa3iwv83TnGpQuvhm7nqSJR8GqDF6sIdyyABb1OeFF BjBI7BYD61FgEbG4mlH51Ts6rtD7Zd0Tx6sp3LtPZeVdFBSXQEp/6XkAdJEaNBPgX55Z TJelE8R0dclICYeZLnYh8Is4wilBMb9NXrsrlXnWj/jgR6JIHSchuf+eoSKfJIMWVZdA HHzUkA2koxngcj8g/hiqwWJZu7IbzLPWTUcS9LaS3dV2nSn980Q30qsgKgvKU0Oe3BdK lBOg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; 
h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc:content-type; bh=bd4WOo3s5qgTqHkprO/99KrMrV5XlAHW5SvnASOba3E=; b=IFTBHwaEBw6ya/4FljTsI6OlmJ3wIzSyicIgejUMJHqAJZvl4ZTbnlYN08Hn6dwJmx S0nUTMjjAPCgsLggDOticE9Jdu6sfsNFhY1op9cC8WOG2NbHWh1KFKqb9Kh36k2Amt8G v0lAW/bFYJxezBhR7B0lY2epQfVqbd6WhAsXlV0krqmgChQm+pB9fpIEGCMGC8sRb21S V0LG4IOtFzqAY1zkxmPYKVBhoDO18o375Pd1Z+p0WOTGUd4DSW7m4gmDpmciwR/xgVx9 sJk0O6k8z0aDZd+Z4yF87oo+JZFXGK3EWvHnmZA2p3gXOsLodePRxHtXBmXyG/L8uYdV BAlg== X-Gm-Message-State: ALoCoQmKZ+GC1BmCBpF4T2YTPWT4TIGfQTJN1d0rWoHVDZI3Eh5EfJka6zaB6CNCmI5wlc++OzlI0xOvS+UKxA5BKKuaB0Oepb1Z2v7bE9gXFEm86SIjHdlQK+HwVb8fkWAIcHDIJ8mk/iOpuqU9uYv/ooyyMkyRjf0V/SF2fhOeeSREogQL6huD1hlKEx5pLTx2xz6DFNucllnnnPb3w9BtXYH0L+NENA== X-Received: by 10.182.250.200 with SMTP id ze8mr3239756obc.72.1387225191632; Mon, 16 Dec 2013 12:19:51 -0800 (PST) MIME-Version: 1.0 Received: by 10.182.80.166 with HTTP; Mon, 16 Dec 2013 12:19:31 -0800 (PST) In-Reply-To: <52AF4020.5010505@tlinx.org> References: <52AF3963.6020003@draigBrady.com> <52AF4020.5010505@tlinx.org> From: Shlomo Urbach Date: Mon, 16 Dec 2013 22:19:31 +0200 Message-ID: Subject: Re: bug#16168: uniq mis-handles UTF8 (8bit) characters To: Linda Walsh Content-Type: multipart/alternative; boundary=089e0160c660c568f904edac88b4 X-Spam-Score: -1.2 (-) X-Debbugs-Envelope-To: 16168 Cc: 16168@debbugs.gnu.org, P@draigbrady.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.2 (-) --089e0160c660c568f904edac88b4 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Thanks, this works great. But, I'm sure the general public doesn't know of this issue. Shlomo On Mon, Dec 16, 2013 at 8:02 PM, Linda Walsh wrote: > Maybe he was hoping for a uniq [-b|--bytes] ? 
>
> Suggestion to Shlomo (if you use bash):
>
>   alias uniq='LC_ALL=C \uniq'
>
> or, if you want it in your shell scripts too:
>
>   uniq() { LC_ALL=C "$(type -P uniq)" "$@" ; }; export -f uniq
>
>
>
> On 12/16/2013 9:33 AM, Pádraig Brady wrote:
>
>> tag 16168 notabug
>> close 16168
>> stop
>>
>> On 12/16/2013 01:50 PM, Shlomo Urbach wrote:
>>
>>> Lines with CJK letters are deemed equal by length only, since the
>>> characters seem to be ignored.
>>> I understand this is due to locale.
>>> But, it would be nice if a simple flag would do a locale-free comparison
>>> (i.e. equal = all bytes are equal).
>>>
>>
>> If you want to compare byte by byte:
>>
>> LC_ALL=C uniq ....
>>
>> thanks,
>> Pǽdraig.
>>
>>
>>
>>
--089e0160c660c568f904edac88b4
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Thanks,

this works great.
But, I'm sure the general public doesn't know of this issue.

Shlomo


On Mon, Dec 16, 2013 at 8:02 PM, Linda Walsh <coreutils@tlinx.org> wrote:
Maybe he was hoping for a uniq [-b|--bytes] ?

Suggestion to Shlomo (if you use bash):

  alias uniq='LC_ALL=C \uniq'

or, if you want it in your shell scripts too:

  uniq() { LC_ALL=C "$(type -P uniq)" "$@" ; }; export -f uniq



On 12/16/2013 9:33 AM, Pádraig Brady wrote:
tag 16168 notabug
close 16168
stop

On 12/16/2013 01:50 PM, Shlomo Urbach wrote:
Lines with CJK letters are deemed equal by length only, since the
characters seem to be ignored.
I understand this is due to locale.
But, it would be nice if a simple flag would do a locale-free comparison (i.e. equal = all bytes are equal).

If you want to compare byte by byte:

LC_ALL=C uniq ....

thanks,
Pǽdraig.




--089e0160c660c568f904edac88b4-- From unknown Wed Jun 25 03:51:21 2025 Received: (at fakecontrol) by fakecontrolmessage; To: internal_control@debbugs.gnu.org From: Debbugs Internal Request Subject: Internal Control Message-Id: bug archived. Date: Tue, 14 Jan 2014 12:24:04 +0000 User-Agent: Fakemail v42.6.9 # This is a fake control message. # # The action: # bug archived. thanks # This fakemail brought to you by your local debbugs # administrator
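
For reference, here is a minimal sketch of the byte-by-byte comparison suggested in this thread, assuming a POSIX-style shell with uniq on the PATH. The wrapper name uniq_bytes and the sample input lines are made up for illustration; neither comes from the thread or from coreutils.

```shell
# uniq_bytes is a hypothetical wrapper name, not a coreutils command.
# LC_ALL=C is set only for this one invocation, so the caller's locale
# is left untouched; `command` bypasses any shell function named uniq.
uniq_bytes() {
    LC_ALL=C command uniq "$@"
}

# Made-up sample input: two different two-character CJK lines,
# the second one repeated.
printf '%s\n' '日本' '中国' '中国' | uniq_bytes
# prints:
# 日本
# 中国
```

Writing the assignment as a prefix to the command (LC_ALL=C command uniq) rather than as a separate statement (LC_ALL=C; ...) keeps the locale change from leaking into the rest of the shell session, which is probably what a wrapper like this wants.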