GNU bug report logs - #25455
uniq considers all the full-width punctuation and Japanese kana as the same under zh_CN.UTF-8 locale


Package: coreutils

Reported by: Icenowy Zheng <icenowy <at> aosc.xyz>

Date: Sun, 15 Jan 2017 23:10:01 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.




Report forwarded to bug-coreutils <at> gnu.org:
bug#25455; Package coreutils. (Sun, 15 Jan 2017 23:10:02 GMT)

Acknowledgement sent to Icenowy Zheng <icenowy <at> aosc.xyz>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sun, 15 Jan 2017 23:10:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Icenowy Zheng <icenowy <at> aosc.xyz>
To: bug-coreutils <at> gnu.org
Cc: arthur2e5 <at> aosc.xyz
Subject: uniq considers all the full-width punctuation and Japanese kana as
 the same under zh_CN.UTF-8 locale
Date: Mon, 16 Jan 2017 04:01:05 +0800

Problem:
When the input consists of lines that each contain only a Chinese full-width
punctuation mark or a Japanese kana, and the locale is zh_CN.UTF-8, uniq treats
all of those lines as identical and wrongly removes distinct characters.

Steps to reproduce:

Run the following command:

```
printf "%s\n" , 。 : ¥ あ か ア カ a b c , . : $ | LC_ALL=zh_CN.UTF-8 uniq
```

Comments:
The printf command prints:
```
,
。
:
¥
あ
か
ア
カ
a
b
c
,
.
:
$
```

Every line is different.

However, after passing through uniq, the output is:
```
,
a
b
c
,
.
:
$
```

The problem also occurs under the zh_TW.UTF-8 locale, but it does not occur under ja_JP.UTF-8 or C.
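
The unaffected C locale gives a baseline to compare against. A minimal sketch, assuming a POSIX shell; the helper name `count_kept_lines_c` is hypothetical and not part of coreutils. It forces byte-wise comparison with LC_ALL=C and counts the lines uniq keeps, so it does not require the zh_CN.UTF-8 locale to be installed:

```shell
# Hypothetical helper: count the lines that uniq keeps when comparison
# is forced to plain byte order via the C locale.
count_kept_lines_c() {
    # tr strips the padding that some wc implementations add.
    LC_ALL=C uniq | wc -l | tr -d ' '
}

# The same 15 sample lines as in the report; all are distinct,
# so none should be dropped.
printf '%s\n' , 。 : ¥ あ か ア カ a b c , . : '$' | count_kept_lines_c
# prints 15
```

Running the same pipeline with the affected locale in place of C and getting a smaller count would confirm the behaviour described above.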

Version info:
```
$ uniq --version
uniq (GNU coreutils) 8.26
... ...
$ /lib/libc.so.6 
GNU C Library (2.24-2_AOSC_OS) stable release version 2.24, by Roland McGrath et al.
... ...
```

Architecture:

The test fails on both the x86_64 and armv7l architectures.




Information forwarded to bug-coreutils <at> gnu.org:
bug#25455; Package coreutils. (Tue, 17 Jan 2017 19:27:02 GMT)

Message #8 received at submit <at> debbugs.gnu.org:

From: "Mingye Wang (Arthur2e5)" <arthur2e5 <at> aosc.xyz>
To: Icenowy Zheng <icenowy <at> aosc.xyz>,
 "bug-coreutils <at> gnu.org" <bug-coreutils <at> gnu.org>
Subject: Re: uniq considers all the full-width punctuation and Japanese kana
 as the same under zh_CN.UTF-8 locale
Date: Tue, 17 Jan 2017 18:22:48 +0000

15.01.2017, 20:01, "Icenowy Zheng" <icenowy <at> aosc.xyz>:
> Problem:
> When the input consists of lines that each contain only a Chinese full-width
> punctuation mark or a Japanese kana, and the locale is zh_CN.UTF-8, uniq treats
> all of those lines as identical and wrongly removes distinct characters.

To narrow the scope down a bit, I should mention that LC_COLLATE is enough to trigger the bug:

printf '%s\n' 。 , ? ! a b c | LC_COLLATE=zh_CN.UTF-8 uniq

-- 
Regards,

Arthur2e5
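
Since LC_COLLATE alone is enough to trigger the problem, overriding only that category should be enough to sidestep it while leaving the other locale categories as they are. A sketch under that assumption, for a POSIX shell; `uniq_bytewise` is a hypothetical helper name. LC_ALL is unset in a subshell first because, when set, it takes precedence over LC_COLLATE:

```shell
# Hypothetical wrapper: run uniq with byte-order collation only.
uniq_bytewise() {
    (
        # LC_ALL, if set, overrides every LC_* category, so clear it
        # inside the subshell; the caller's environment is untouched.
        unset LC_ALL
        LC_COLLATE=C uniq "$@"
    )
}

# All seven sample lines are distinct and should be kept.
printf '%s\n' 。 , ? ! a b c | uniq_bytewise
```

The trade-off is that any option relying on locale-aware comparison loses it, which may or may not matter for a given input.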




Information forwarded to bug-coreutils <at> gnu.org:
bug#25455; Package coreutils. (Sat, 21 Jan 2017 03:09:02 GMT)

Message #11 received at 25455 <at> debbugs.gnu.org:

From: Mike Frysinger <vapier <at> gentoo.org>
To: Icenowy Zheng <icenowy <at> aosc.xyz>
Cc: 25455 <at> debbugs.gnu.org, arthur2e5 <at> aosc.xyz
Subject: Re: bug#25455: uniq considers all the full-width punctuation and
 Japanese kana as the same under zh_CN.UTF-8 locale
Date: Fri, 20 Jan 2017 22:08:33 -0500

On 16 Jan 2017 04:01, Icenowy Zheng wrote:
> When the input consists of lines that each contain only a Chinese full-width
> punctuation mark or a Japanese kana, and the locale is zh_CN.UTF-8, uniq treats
> all of those lines as identical and wrongly removes distinct characters.

this is a problem with glibc, not coreutils.  you can follow the upstream bug:
https://sourceware.org/bugzilla/show_bug.cgi?id=13063
-mike

Information forwarded to bug-coreutils <at> gnu.org:
bug#25455; Package coreutils. (Sun, 28 Oct 2018 07:53:01 GMT)

Message #14 received at 25455 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: 25455 <at> debbugs.gnu.org
Subject: Re: bug#25455: uniq considers all the full-width punctuation and
 Japanese kana as the same under zh_CN.UTF-8 locale
Date: Sun, 28 Oct 2018 01:52:13 -0600

tags 25455 notabug
close 25455
stop

(triaging old bugs)

On 2017-01-20 8:08 p.m., Mike Frysinger wrote:
> On 16 Jan 2017 04:01, Icenowy Zheng wrote:
>> When the input consists of lines that each contain only a Chinese full-width
>> punctuation mark or a Japanese kana, and the locale is zh_CN.UTF-8, uniq treats
>> all of those lines as identical and wrongly removes distinct characters.
> 
> this is a problem with glibc, not coreutils.  you can follow the upstream bug:
> https://sourceware.org/bugzilla/show_bug.cgi?id=13063

Given the above, and with no further comments
in more than a year, I'm closing this bug.
Discussion can continue by replying to this thread.

-assaf





Added tag(s) notabug. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Sun, 28 Oct 2018 07:53:02 GMT)

Bug closed; send any further explanations to 25455 <at> debbugs.gnu.org and Icenowy Zheng <icenowy <at> aosc.xyz>. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Sun, 28 Oct 2018 07:53:02 GMT)

Bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 25 Nov 2018 12:24:09 GMT)
