GNU bug report logs - #53145
"cut" can't segment Chinese characters correctly?


Package: coreutils

Reported by: zendas <zendas <at> protonmail.com>

Date: Sun, 9 Jan 2022 19:13:01 UTC

Severity: wishlist


From: Bernhard Voelker <mail <at> bernhard-voelker.de>
To: zendas <zendas <at> protonmail.com>, 53145 <at> debbugs.gnu.org
Subject: bug#53145: Re: bug#53145: Acknowledgement ("cut" can't segment Chinese characters correctly?)
Date: Wed, 12 Jan 2022 13:25:17 +0100
On 1/12/22 12:19, zendas via GNU coreutils Bug Reports wrote:
> I have considered handling this problem by working directly with three bytes per character instead, but I have two doubts. First, I can correctly use wc -m to count the characters in the same environment (but cut can't?). Second, my script's goal is to recognize Chinese, so it is more likely to be run on platforms that support a Chinese environment. In addition, the fixed three-byte approach cannot handle mixed full-width and half-width content. I would need a lot of checks and conversions, which would greatly increase the possibility of errors.

As Bob wrote, some downstream distributions have had multi-byte support in cut(1) for many years,
e.g. RHEL/Fedora and SUSE/openSUSE.

E.g. here on my openSUSE system:

  $ echo "你好啊" | LC_ALL=zh_CN.UTF-8 cut -c 1
  你
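
To illustrate the difference between byte-oriented and character-oriented
selection, and why the fixed three-byte workaround breaks on mixed
full-width/half-width input, here is a minimal Python sketch. It is only an
illustration under my own assumptions, not how cut(1) is implemented; the
helper name decodes_cleanly and the sample string "a你b好" are made up for
this example.

```python
# Illustration only: contrasts byte-based and character-based selection.
# This is NOT the cut(1) implementation; all names here are hypothetical.

def decodes_cleanly(chunk: bytes) -> bool:
    """Return True if the bytes form complete, valid UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = "你好啊"
data = text.encode("utf-8")   # each of these characters takes 3 bytes in UTF-8

first_byte = data[:1]         # byte-oriented: only a fragment of the first character
first_char = text[:1]         # character-oriented: the whole first character

# Mixed half-width (ASCII, 1 byte) and full-width (3 bytes) content.
mixed = "a你b好".encode("utf-8")

# The fixed three-byte approach: split the byte stream every 3 bytes.
chunks = [mixed[i:i + 3] for i in range(0, len(mixed), 3)]
chunk_validity = [decodes_cleanly(c) for c in chunks]

print(len(data), first_byte, first_char)
print(chunks, chunk_validity)
```

In this sample every three-byte chunk of the mixed string cuts through a
character, so none of the chunks is valid UTF-8 on its own. That is the
problem with the fixed-width approach described in the quoted text above.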

Have a nice day,
Berny




This bug report was last modified 3 years and 118 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.