GNU bug report logs - #36718
uniq treats distinct Korean characters equal

Reported by: Felix Hamme <fhamme <at> united-internet.de>

Date: Thu, 18 Jul 2019 14:49:01 UTC

Severity: normal

To reply to this bug, email your comments to 36718 AT debbugs.gnu.org.

Report forwarded to bug-coreutils <at> gnu.org:
bug#36718; Package coreutils. (Thu, 18 Jul 2019 14:49:01 GMT)

Acknowledgement sent to Felix Hamme <fhamme <at> united-internet.de>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Thu, 18 Jul 2019 14:49:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Felix Hamme <fhamme <at> united-internet.de>
To: bug-coreutils <at> gnu.org
Cc: Gerhard Dittes <gerhard.dittes <at> ionos.com>
Subject: uniq treats distinct Korean characters equal
Date: Thu, 18 Jul 2019 16:08:57 +0200

Dear all,

I found that uniq treats some distinct Korean characters as equal (it
counts them as duplicates) although the characters aren't equal.
Specifically, this happened with the characters 프 (U+D504) and
틀 (U+D2C0).

An example (input, expected output, actual output) can be found in the
attachment.
I tested this with uniq (GNU coreutils) 8.30.

Greetings
Felix Hamme

[uniq-korean-characters-bug.tar.gz (application/gzip, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#36718; Package coreutils. (Thu, 18 Jul 2019 22:44:02 GMT)

Message #8 received at 36718 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Felix Hamme <fhamme <at> united-internet.de>, 36718 <at> debbugs.gnu.org
Cc: Gerhard Dittes <gerhard.dittes <at> ionos.com>
Subject: Re: bug#36718: uniq treats distinct Korean characters equal
Date: Thu, 18 Jul 2019 15:43:48 -0700

uniq just calls strcoll, and if strcoll (A, B) returns 0 then uniq assumes the 
lines are equal. So my guess is that your problem has something to do with 
strcoll, not with coreutils per se.
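
To illustrate the point, here is a minimal Python sketch of the idea, not the actual coreutils code: adjacent lines are dropped whenever the comparison function returns 0. The names uniq_adjacent, codepoint_compare and weightless_compare are hypothetical. weightless_compare is only an assumed stand-in for a locale whose collation data gives Hangul syllables no weight; it is not how any real C library implements strcoll.

```python
import locale
from typing import Callable, Iterable, List


def uniq_adjacent(
    lines: Iterable[str],
    compare: Callable[[str, str], int] = locale.strcoll,
) -> List[str]:
    """Keep the first line of each run of adjacent lines that compare equal."""
    kept: List[str] = []
    for line in lines:
        # A result of 0 means "equal" to the comparator, so the line is
        # treated as a duplicate of the previous kept line and skipped.
        if kept and compare(kept[-1], line) == 0:
            continue
        kept.append(line)
    return kept


def codepoint_compare(a: str, b: str) -> int:
    """Plain code point comparison, similar in spirit to the C locale."""
    return (a > b) - (a < b)


def weightless_compare(a: str, b: str) -> int:
    """Hypothetical comparator that ignores Hangul syllables entirely.

    This only imitates a locale that assigns these characters no
    collation weight (an assumption made for the demonstration).
    """
    def strip_hangul(s: str) -> str:
        return "".join(ch for ch in s if not ("\uac00" <= ch <= "\ud7a3"))

    return codepoint_compare(strip_hangul(a), strip_hangul(b))


if __name__ == "__main__":
    lines = ["프", "틀"]
    print(uniq_adjacent(lines, weightless_compare))  # ['프']
    print(uniq_adjacent(lines, codepoint_compare))   # ['프', '틀']
```

With the weightless comparator both lines compare as equal, so only the first survives. That matches the behaviour in the report, assuming the locale in use really does return 0 for this pair.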

Information forwarded to bug-coreutils <at> gnu.org:
bug#36718; Package coreutils. (Fri, 19 Jul 2019 11:58:02 GMT)

Message #11 received at 36718 <at> debbugs.gnu.org (full text, mbox):

From: Felix Hamme <fhamme <at> united-internet.de>
To: Paul Eggert <eggert <at> cs.ucla.edu>, 36718 <at> debbugs.gnu.org
Cc: Gerhard Dittes <gerhard.dittes <at> ionos.com>
Subject: Re: bug#36718: uniq treats distinct Korean characters equal
Date: Fri, 19 Jul 2019 12:18:32 +0200

Thanks @Paul Eggert, it seems like this isn't a bug at all.

My locale (de_DE.utf8) appears to lack definitions for the mentioned
Korean characters. After setting my system language to Korean
(ko_KR.utf8) uniq produces the expected output.
For my purpose, I'll set my environment to LC_COLLATE=C, which forces
byte-wise comparison and should work for all languages.
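
As a rough check of that workaround, the following Python sketch switches only the collation category to the C locale and asks strcoll whether two strings differ. The helper name differ_in_c_locale is hypothetical. Python passes wide strings to the C library here, so this is an analogue of what uniq does rather than the same code path.

```python
import locale


def differ_in_c_locale(a: str, b: str) -> bool:
    """Return True if strcoll reports a difference under LC_COLLATE=C."""
    # Calling setlocale with only a category queries the current setting.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        # Switch only the collation category to the C locale.
        locale.setlocale(locale.LC_COLLATE, "C")
        return locale.strcoll(a, b) != 0
    finally:
        # Restore whatever collation locale was active before.
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    print(differ_in_c_locale("프", "틀"))
```

On the shell side, note that LC_ALL takes precedence over LC_COLLATE when it is set in the environment, so LC_COLLATE=C only has an effect if LC_ALL is unset.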

Admittedly, I could have searched for this first:
https://unix.stackexchange.com/questions/373848/why-does-uniq-think-%E3%81%82%E3%81%84-and-%E3%81%84%E3%81%82-are-the-same

This bug report was last modified 6 years and 22 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
