GNU bug report logs -
#16168
uniq mis-handles UTF8 (8bit) characters
Reported by: Shlomo Urbach <urbach <at> google.com>
Date: Mon, 16 Dec 2013 16:56:03 UTC
Severity: normal
Done: Pádraig Brady <P <at> draigBrady.com>
Bug is archived. No further changes may be made.
Maybe he was hoping for a uniq [-b|--bytes] ?
Suggestion to Shlomo (if you use bash):
alias uniq='LC_ALL=C \uniq'
or, if you want it in your shell scripts too:
uniq() { LC_ALL=C "$(type -P uniq)" "$@"; }; export -f uniq
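For illustration, here is a minimal sketch of what the byte-wise comparison does. The sample lines and the helper name bytewise_uniq are made up for this example and are not from the report. In the C locale two lines are equal only if all their bytes are equal, so only the true duplicate is collapsed. What a UTF-8 locale prints for the same input depends on the system's collation data, so that case is not shown.

```shell
# bytewise_uniq is a hypothetical helper name used only in this sketch.
# "command" bypasses any shell function or alias that is also called uniq.
bytewise_uniq() {
    LC_ALL=C command uniq "$@"
}

# Two different CJK lines of equal length, then a real duplicate.
# Only the repeated line is dropped; the two distinct lines both survive.
printf '日本\n中文\n中文\n' | bytewise_uniq
```

Setting LC_ALL on the command line itself keeps the change local to that one invocation, so the rest of the shell session keeps its locale.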
On 12/16/2013 9:33 AM, Pádraig Brady wrote:
> tag 16168 notabug
> close 16168
> stop
>
> On 12/16/2013 01:50 PM, Shlomo Urbach wrote:
>> Lines with CJK letters are deemed equal by length only, since the
>> characters seem to be ignored.
>> I understand this is due to locale.
>> But, it would be nice if a simple flag would do a locale-free comparison
>> (i.e. equal = all bytes are equal).
>
> If you want to compare byte by byte:
>
> LC_ALL=C uniq ....
>
> thanks,
> Pádraig.
>
This bug report was last modified 11 years and 185 days ago.