GNU bug report logs - #16168
uniq mis-handles UTF8 (8bit) characters


Package: coreutils;

Reported by: Shlomo Urbach <urbach <at> google.com>

Date: Mon, 16 Dec 2013 16:56:03 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16168 in the body.
You can then email your comments to 16168 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-coreutils <at> gnu.org:
bug#16168; Package coreutils. (Mon, 16 Dec 2013 16:56:03 GMT)

Acknowledgement sent to Shlomo Urbach <urbach <at> google.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Mon, 16 Dec 2013 16:56:03 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Shlomo Urbach <urbach <at> google.com>
To: bug-coreutils <at> gnu.org
Subject: uniq mis-handles UTF8 (8bit) characters
Date: Mon, 16 Dec 2013 15:50:15 +0200
Lines with CJK letters are deemed equal by length only, since the
characters seem to be ignored.
I understand this is due to locale.
But, it would be nice if a simple flag would do a locale-free comparison
(i.e. equal = all bytes are equal).

Reply sent to Pádraig Brady <P <at> draigBrady.com>:
You have taken responsibility. (Mon, 16 Dec 2013 17:34:02 GMT)

Notification sent to Shlomo Urbach <urbach <at> google.com>:
bug acknowledged by developer. (Mon, 16 Dec 2013 17:34:02 GMT)

Message #10 received at 16168-done <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Shlomo Urbach <urbach <at> google.com>
Cc: 16168-done <at> debbugs.gnu.org
Subject: Re: bug#16168: uniq mis-handles UTF8 (8bit) characters
Date: Mon, 16 Dec 2013 17:33:23 +0000
tag 16168 notabug
close 16168
stop

On 12/16/2013 01:50 PM, Shlomo Urbach wrote:
> Lines with CJK letters are deemed equal by length only, since the
> characters seem to be ignored.
> I understand this is due to locale.
> But, it would be nice if a simple flag would do a locale-free comparison
> (i.e. equal = all bytes are equal).

If you want to compare byte by byte:

LC_ALL=C uniq ....

thanks,
Pádraig.
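
A minimal sketch of the suggestion above, assuming a POSIX shell and a UTF-8 locale otherwise in effect. The sample lines are made up for illustration and are not taken from the report:

```shell
# Made-up sample: two different CJK lines of the same length, then a repeat.
input=$(printf '日本\n中国\n中国\n')

# LC_ALL=C applies to this one uniq invocation only, so adjacent lines are
# compared as raw bytes rather than through the locale's collation rules.
output=$(printf '%s\n' "$input" | LC_ALL=C uniq)

printf '%s\n' "$output"
```

With byte-wise comparison the two distinct lines are both kept and only the adjacent repeat is collapsed.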




Information forwarded to bug-coreutils <at> gnu.org:
bug#16168; Package coreutils. (Mon, 16 Dec 2013 18:03:01 GMT)

Message #13 received at 16168 <at> debbugs.gnu.org:

From: Linda Walsh <coreutils <at> tlinx.org>
To: 16168 <at> debbugs.gnu.org, P <at> draigBrady.com, urbach <at> google.com
Subject: Re: bug#16168: uniq mis-handles UTF8 (8bit) characters
Date: Mon, 16 Dec 2013 10:02:08 -0800
Maybe he was hoping for a uniq [-b|--bytes] ?

Suggestion to Shlomo (if you use bash):

  alias uniq='LC_ALL=C \uniq'

or, if you want it in your shell scripts too:

  uniq() { LC_ALL=C "$(type -P uniq)" "$@" ; }; export -f uniq
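
A portable variant of the wrapper idea, as a sketch only: it swaps bash's `type -P` for the POSIX `command` builtin and leaves out `export -f`, which is bash-specific. The name `uniq_bytes` is hypothetical, chosen here so the real `uniq` stays reachable under its own name:

```shell
# Hypothetical wrapper name, chosen so the real uniq is not shadowed.
uniq_bytes() {
  # The assignment prefixes the command, so it is scoped to this one call;
  # `command` skips shell functions when looking up uniq.
  LC_ALL=C command uniq "$@"
}

# Made-up sample input: the adjacent duplicate collapses, the rest is kept.
result=$(printf 'a\na\nb\n' | uniq_bytes)
printf '%s\n' "$result"
```

Writing the assignment as a prefix (no semicolon after `LC_ALL=C`) keeps the caller's locale settings untouched after the function returns.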


On 12/16/2013 9:33 AM, Pádraig Brady wrote:
> tag 16168 notabug
> close 16168
> stop
> 
> On 12/16/2013 01:50 PM, Shlomo Urbach wrote:
>> Lines with CJK letters are deemed equal by length only, since the
>> characters seem to be ignored.
>> I understand this is due to locale.
>> But, it would be nice if a simple flag would do a locale-free comparison
>> (i.e. equal = all bytes are equal).
> 
> If you want to compare byte by byte:
> 
> LC_ALL=C uniq ....
> 
> thanks,
> Pádraig.
> 
> 
> 




Information forwarded to bug-coreutils <at> gnu.org:
bug#16168; Package coreutils. (Mon, 16 Dec 2013 20:20:02 GMT)

Message #16 received at 16168 <at> debbugs.gnu.org:

From: Shlomo Urbach <urbach <at> google.com>
To: Linda Walsh <coreutils <at> tlinx.org>
Cc: 16168 <at> debbugs.gnu.org, P <at> draigbrady.com
Subject: Re: bug#16168: uniq mis-handles UTF8 (8bit) characters
Date: Mon, 16 Dec 2013 22:19:31 +0200
Thanks,

this works great.
But, I'm sure the general public doesn't know of this issue.

Shlomo


On Mon, Dec 16, 2013 at 8:02 PM, Linda Walsh <coreutils <at> tlinx.org> wrote:

> Maybe he was hoping for a uniq [-b|--bytes] ?
>
> Suggestion to Shlomo (if you use bash):
>
>   alias uniq='LC_ALL=C \uniq'
>
> or, if you want it in your shell scripts too:
>
>   uniq() { LC_ALL=C "$(type -P uniq)" "$@" ; }; export -f uniq
>
>
>
> On 12/16/2013 9:33 AM, Pádraig Brady wrote:
>
>> tag 16168 notabug
>> close 16168
>> stop
>>
>> On 12/16/2013 01:50 PM, Shlomo Urbach wrote:
>>
>>> Lines with CJK letters are deemed equal by length only, since the
>>> characters seem to be ignored.
>>> I understand this is due to locale.
>>> But, it would be nice if a simple flag would do a locale-free comparison
>>> (i.e. equal = all bytes are equal).
>>>
>>
>> If you want to compare byte by byte:
>>
>> LC_ALL=C uniq ....
>>
>> thanks,
>> Pádraig.
>>
>>
>>
>>

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 14 Jan 2014 12:24:04 GMT)

This bug report was last modified 11 years and 164 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.