GNU bug report logs - #53145
"cut" can't segment Chinese characters correctly?
On 1/12/22 12:19, zendas via GNU coreutils Bug Reports wrote:
> I have considered handling this problem by working directly with three bytes per character instead, but I have two doubts. First, wc -m counts the characters correctly in the same environment (so why can't cut?). Second, my script's goal is to recognize Chinese, so it is more likely to be run on platforms that support a Chinese environment. In addition, a fixed three-byte approach cannot handle content that mixes full-width and half-width characters. I would need a lot of checks and conversions, which would greatly increase the chance of errors.
As Bob wrote, some downstream distributions, e.g. RHEL/Fedora and SUSE/openSUSE, have had multi-byte support in cut(1) for many years.
For example, here on my openSUSE system:
$ echo "你好啊" | LC_ALL=zh_CN.UTF-8 cut -c 1
你
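
For comparison, here is a minimal sketch of the fixed three-byte approach mentioned in the quoted message, using only byte offsets (cut -b) in the C locale, so it should not depend on a multi-byte-aware cut. It assumes the script itself is saved as UTF-8; the variable names are just illustrative. It also shows why mixed full-width and half-width input is awkward: the byte offsets shift as soon as an ASCII character appears.

```shell
# Sketch only: byte-based slicing of UTF-8 text (assumes this file is UTF-8).
s='你好啊'
mixed='a你好'

# Each of these CJK characters takes 3 bytes in UTF-8,
# so bytes 1-3 are exactly the first character.
first=$(printf '%s\n' "$s" | LC_ALL=C cut -b 1-3)

# wc -c always counts bytes; tr strips padding some wc versions add.
nbytes=$(printf '%s' "$s" | wc -c | tr -d ' ')

# With a leading ASCII character the first two characters span
# 1 + 3 = 4 bytes, so a fixed 3-byte stride no longer lines up.
mixed_two=$(printf '%s\n' "$mixed" | LC_ALL=C cut -b 1-4)

printf 'first=%s nbytes=%s mixed_two=%s\n' "$first" "$nbytes" "$mixed_two"
# prints: first=你 nbytes=9 mixed_two=a你
```

A character-aware cut -c, as in the openSUSE example above, avoids this offset bookkeeping entirely.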
Have a nice day,
Berny
This bug report was last modified 3 years and 118 days ago.