GNU bug report logs - #25455
uniq considers all the full-width punctuation and Japanese kana as the same under zh_CN.UTF-8 locale


Package: coreutils

Reported by: Icenowy Zheng <icenowy <at> aosc.xyz>

Date: Sun, 15 Jan 2017 23:10:01 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.



Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Icenowy Zheng <icenowy <at> aosc.xyz>
To: bug-coreutils <at> gnu.org
Cc: arthur2e5 <at> aosc.xyz
Subject: uniq considers all the full-width punctuation and Japanese kana as
 the same under zh_CN.UTF-8 locale
Date: Mon, 16 Jan 2017 04:01:05 +0800

Problem:
When the input consists of lines that each contain only a Chinese full-width
punctuation mark or a Japanese kana, and the locale is zh_CN.UTF-8, uniq treats
all of these lines as identical and wrongly removes distinct characters.

Reproduce steps:

Run the following command:

```
printf "%s\n" , 。 : ¥ あ か ア カ a b c , . : $ | LC_ALL=zh_CN.UTF-8 uniq
```

Comments:
The printf command prints:
```
,
。
:
¥
あ
か
ア
カ
a
b
c
,
.
:
$
```

Every line is different.

However, after passing through uniq, the output is:
```
,
a
b
c
,
.
:
$
```
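To narrow down which adjacent lines are being merged, something like the following sketch might help. The helper name `pairs_equal` is made up for illustration. It feeds two lines to uniq under a given locale and checks whether only one line comes out. If the named locale is not installed, the tools will most likely fall back to C-locale behaviour and report nothing.

```shell
# pairs_equal LOCALE A B (hypothetical helper, not part of coreutils):
# exit status 0 if uniq under LOCALE collapses lines A and B into one.
pairs_equal() {
    n=$(printf '%s\n' "$2" "$3" | LC_ALL="$1" uniq | wc -l | tr -d ' ')
    [ "$n" -eq 1 ]
}

# Walk the kana from the report and flag adjacent pairs that uniq merges.
prev=
for ch in あ か ア カ; do
    if [ -n "$prev" ] && pairs_equal zh_CN.UTF-8 "$prev" "$ch"; then
        echo "merged under zh_CN.UTF-8: $prev $ch"
    fi
    prev=$ch
done
```

On an affected system this should print one line per merged pair; on an unaffected one it should print nothing.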

Under the zh_TW.UTF-8 locale the problem also occurs, but under ja_JP.UTF-8 or C it does not.
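Since the C locale is not affected, a possible workaround (a sketch only, not a fix for the underlying behaviour) is to force byte-wise comparison for the uniq step. The variable name `count` is just for illustration. The input is the same 15 lines as above, with the full-width punctuation written out explicitly.

```shell
# Run uniq under the C locale so lines are compared byte by byte,
# then count how many lines survive.
count=$(printf '%s\n' , 。 : ¥ あ か ア カ a b c , . : '$' \
        | LC_ALL=C uniq | wc -l | tr -d ' ')
echo "$count"
```

All 15 lines should be kept, because no two adjacent lines are byte-identical.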

Version info:
```
$ uniq --version
uniq (GNU coreutils) 8.26
... ...
$ /lib/libc.so.6 
GNU C Library (2.24-2_AOSC_OS) stable release version 2.24, by Roland McGrath et al.
... ...
```

Architecture:

The test fails on both x86_64 and armv7l.




This bug report was last modified 6 years and 265 days ago.
