GNU bug report logs - #53145
"cut" can't segment Chinese characters correctly?

Reported by: zendas <zendas <at> protonmail.com>

Date: Sun, 9 Jan 2022 19:13:01 UTC

Severity: wishlist

To reply to this bug, email your comments to 53145 AT debbugs.gnu.org.


Report forwarded to bug-coreutils <at> gnu.org:
bug#53145; Package coreutils. (Sun, 09 Jan 2022 19:13:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to zendas <zendas <at> protonmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sun, 09 Jan 2022 19:13:01 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: zendas <zendas <at> protonmail.com>
To: "bug-coreutils <at> gnu.org" <bug-coreutils <at> gnu.org>
Subject: "cut" can't segment Chinese characters correctly?
Date: Sun, 09 Jan 2022 18:53:29 +0000

[Message part 1 (text/plain, inline)]

Hello, I need to get Chinese characters from the string. I googled a lot of documents, it seems that the -c parameter of cut should be able to meet my needs, but I even directly execute the instructions on the web page, and the result is different from the demonstration. I have searched dozens of pages but the results are not the same as the demo, maybe this is a bug?

For example:
https://blog.csdn.net/xuzhangze/article/details/80930714
[20180705173450701.png]
the result of my attempt:
[螢幕快照 2022-01-10 02:49:46.png]

[Message part 2 (text/html, inline)]

[螢幕快照 2022-01-10 024946.png (image/png, inline)]

[20180705173450701.png (image/png, inline)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#53145; Package coreutils. (Sun, 09 Jan 2022 19:41:02 GMT) Full text and rfc822 format available.

Message #8 received at 53145 <at> debbugs.gnu.org (full text, mbox):

From: Bob Proulx <bob <at> proulx.com>
To: 53145 <at> debbugs.gnu.org, zendas <zendas <at> protonmail.com>
Subject: Re: bug#53145: "cut" can't segment Chinese characters correctly?
Date: Sun, 9 Jan 2022 12:40:20 -0700

zendas wrote:
> Hello, I need to get Chinese characters from the string. I googled a
> lot of documents, it seems that the -c parameter of cut should be
> able to meet my needs, but I even directly execute the instructions
> on the web page, and the result is different from the
> demonstration. I have searched dozens of pages but the results are
> not the same as the demo, maybe this is a bug?

Unfortunately the example was attached as images instead of as plain
text.  Please in the future copy and paste the example as text rather
than as an image.  As an image it is impossible to reproduce by trying
to copy and paste the image.  As an image it is impossible to search
for the strings.

The images were also lost somehow from the various steps in the
mailing list pipelines with this message.  First it was classified as
spam by the anti-spam robot (SpamAssassin-Bogofilter-CRM114).  I
caught it in review and re-sent the message.  That may have been the
problem specifically with images.

> For example:
> https://blog.csdn.net/xuzhangze/article/details/80930714
> [20180705173450701.png]
> the result of my attempt:
> [螢幕快照 2022-01-10 02:49:46.png]

One of the two images:

    https://debbugs.gnu.org/cgi/bugreport.cgi?msg=5;bug=53145;att=3;filename=20180705173450701.png

Second problem is that the first image shows as being corrupted.  I
can view the original however.  To my eye they are similar enough that
the one above is sufficient and I do not need to re-send the corrupted
image.

As to the problem you have reported it is due to lack of
internationalization support for characters.  -c is the same as -b at
this moment.

    https://www.gnu.org/software/coreutils/manual/html_node/cut-invocation.html#cut-invocation

    ‘-c CHARACTER-LIST’
    ‘--characters=CHARACTER-LIST’
         Select for printing only the characters in positions listed in
         CHARACTER-LIST.  The same as ‘-b’ for now, but internationalization
         will change that.  Tabs and backspaces are treated like any other
         character; they take up 1 character.  If an output delimiter is
         specified, (see the description of ‘--output-delimiter’), then
         output that string between ranges of selected bytes.

For multi-byte UTF-8 characters the -c option will operate the same as
the -b option as of the current version and is not suitable for
dealing with multi-byte characters.
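
As a rough illustration of why a byte-wise -c mangles this kind of text, the following Python sketch mimics byte selection on the UTF-8 encoding of a string. The helper name cut_bytes is hypothetical and is not part of coreutils; it only models the 1-based, inclusive range semantics of cut -b.

```python
def cut_bytes(line: str, start: int, end: int) -> bytes:
    """Return bytes start..end (1-based, inclusive) of the UTF-8 encoding of line."""
    return line.encode("utf-8")[start - 1:end]


text = "螢幕快照"

# Each of these characters occupies three bytes in UTF-8.
print(len(text), len(text.encode("utf-8")))  # 4 12

# Byte 1 alone is an incomplete sequence, so it decodes to U+FFFD.
first_byte = cut_bytes(text, 1, 1)
print(first_byte.decode("utf-8", errors="replace") == "\ufffd")  # True

# Bytes 1-3 happen to line up with the first character.
print(cut_bytes(text, 1, 3).decode("utf-8"))  # 螢
```

The same effect appears in the shell whenever a byte range does not end on a character boundary.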

    $ echo '螢幕快照'
    螢幕快照
    $ echo '螢幕快照' | cut -c 1
    ?
    $ echo '螢幕快照' | cut -c 1-3
    螢
    $ echo '螢幕快照' | cut -b 1-3
    螢

If the characters are known to be 3-byte multi-byte characters then I
might suggest using -b to work around the problem, assuming 3 bytes
per character.  Eventually, when -c is coded to handle multi-byte
characters, its current byte-wise behavior will change.  Using -b
would avoid that change.
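
The arithmetic behind that workaround can be sketched in Python. This is only an illustration under the stated assumption that every character is exactly three bytes long; the helper names byte_range_for_char and select are hypothetical.

```python
from typing import Tuple


def byte_range_for_char(n: int, width: int = 3) -> Tuple[int, int]:
    """1-based inclusive byte range of the n-th character, assuming every
    character is exactly `width` bytes long (an assumption, not a guarantee)."""
    return (width * (n - 1) + 1, width * n)


def select(line: str, n: int) -> bytes:
    """Pick the bytes that the fixed-width range would hand to cut -b."""
    lo, hi = byte_range_for_char(n)
    return line.encode("utf-8")[lo - 1:hi]


print(byte_range_for_char(3))               # (7, 9)
print(select("星期一", 3).decode("utf-8"))    # 一
```

Under that assumption, the third character would be requested as cut -b 7-9. The mapping stops being valid as soon as a line contains a character of a different width, such as an ASCII letter.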

Some operating systems have patched that specific version of utilities
locally to add multi-byte character handling.  But the patches have
not been found acceptable for inclusion.  That is why there are
differences between different operating systems.

Bob

Information forwarded to bug-coreutils <at> gnu.org:
bug#53145; Package coreutils. (Sun, 09 Jan 2022 19:52:02 GMT) Full text and rfc822 format available.

Message #11 received at 53145 <at> debbugs.gnu.org (full text, mbox):

From: zendas <zendas <at> protonmail.com>
To: 53145 <at> debbugs.gnu.org, zendas <zendas <at> protonmail.com>
Subject: Re: bug#53145: "cut" can't segment Chinese characters correctly?
Date: Sun, 09 Jan 2022 19:51:33 +0000

Create a new test2.txt, the content is
星期一
星期二
星期三
星期四
星期五
星期六
星期日
=============================
zendas@Backup-Server:/tmp$ cat test2.txt
星期一
星期二
星期三
星期四
星期五
星期六
星期日
zendas@Backup-Server:/tmp$
=============================
zendas@Backup-Server:/tmp$ cut -c 1 test2.txt
�
�
�
�
�
�
�
zendas@Backup-Server:/tmp$ cut -c 2 test2.txt
�
�
�
�
�
�
�
zendas@Backup-Server:/tmp$ cut -c 1-3 test2.txt
星
星
星
星
星
星
星
zendas@Backup-Server:/tmp$
=============================
Reference source:
https://blog.csdn.net/m0_38110132/article/details/79883827

My environment is:
zendas@Backup-Server:~$ cat /etc/debian_version
11.1
zendas@Backup-Server:~$ cut --version
cut (GNU coreutils) 8.32
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by David M. Ihnat, David MacKenzie, and Jim Meyering.


Information forwarded to bug-coreutils <at> gnu.org:
bug#53145; Package coreutils. (Mon, 10 Jan 2022 05:15:01 GMT) Full text and rfc822 format available.

Message #14 received at 53145 <at> debbugs.gnu.org (full text, mbox):

From: zendas <zendas <at> protonmail.com>
To: 53145 <at> debbugs.gnu.org, zendas <zendas <at> protonmail.com>
Subject: 回覆: Re: bug#53145: "cut" can't segment Chinese characters correctly?
Date: Mon, 10 Jan 2022 05:14:41 +0000

zendas@Backup-Server:/tmp$ echo "你好嗎" | cut -c 1-3 | od -b
0000000 344 275 240 012
0000004
zendas@Backup-Server:/tmp$ echo "你好嗎" | cut -c 1 | od -b
0000000 344 012
0000002
zendas@Backup-Server:/tmp$ echo "你好嗎" | cut -b 1 | od -b
0000000 344 012
0000002
zendas@Backup-Server:/tmp$ echo "你好嗎" | cut -nb 1 | od -b
0000000 344 012
0000002
zendas@Backup-Server:/tmp$ echo "你好嗎" | cut -c 1-3
你
zendas@Backup-Server:/tmp$ echo "你好嗎" | cut -c 1
�
zendas@Backup-Server:/tmp$

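
To read the od -b dumps in this message: od -b prints an offset followed by each byte in octal. The Python sketch below converts one such line back to bytes; the helper name decode_od_octal is hypothetical and it handles only the simple one-line layout shown in the thread.

```python
def decode_od_octal(line: str) -> bytes:
    """Convert one `od -b` output line (offset, then octal byte values) to bytes."""
    fields = line.split()
    # fields[0] is the offset column; the rest are octal byte values.
    return bytes(int(field, 8) for field in fields[1:])


full = decode_od_octal("0000000 344 275 240 012")
print(full)                  # b'\xe4\xbd\xa0\n'
print(full.decode("utf-8"))  # 你 followed by a newline

partial = decode_od_octal("0000000 344 012")
print(partial)               # b'\xe4\n'
```

The first dump is a complete three-byte character plus the newline, while the second holds only the lead byte 0xE4, which is not valid UTF-8 on its own.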

Information forwarded to bug-coreutils <at> gnu.org:
bug#53145; Package coreutils. (Wed, 12 Jan 2022 11:20:02 GMT) Full text and rfc822 format available.

Message #17 received at 53145 <at> debbugs.gnu.org (full text, mbox):

From: zendas <zendas <at> protonmail.com>
To: 53145 <at> debbugs.gnu.org
Subject: 回覆: bug#53145: Acknowledgement ("cut" can't segment Chinese characters correctly?)
Date: Wed, 12 Jan 2022 11:19:01 +0000

I have considered working around this by treating every character as three bytes, but I have two concerns. First, wc -m counts the characters correctly in the same environment, so why can't cut? Second, my script is meant to recognize Chinese text, so it is most likely to run on platforms that support a Chinese environment. In addition, a fixed three-byte approach cannot handle content that mixes full-width and half-width characters. I would need a lot of checks and conversions, which would greatly increase the chance of errors.

On Monday, January 10, 2022 at 3:13 AM, <help-debbugs <at> gnu.org> wrote:

> 53145: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=53145
>
> GNU Bug Tracking System
>
> Contact help-debbugs <at> gnu.org with problems
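
The concern about mixed full-width and half-width content can be made concrete with a small Python sketch. The helper name utf8_widths is hypothetical. Counting code points is only roughly analogous to what wc -m reports in a UTF-8 locale, and counting encoded bytes to what wc -c reports.

```python
from typing import List


def utf8_widths(text: str) -> List[int]:
    """Number of UTF-8 bytes used by each character of text."""
    return [len(ch.encode("utf-8")) for ch in text]


# Half-width A, full-width A (U+FF21), and one CJK character.
mixed = "A\uff21星"

print(len(mixed))                  # 3 characters
print(len(mixed.encode("utf-8")))  # 7 bytes
print(utf8_widths(mixed))          # [1, 3, 3]

# Selecting by character position does not depend on the byte widths.
print(mixed[1:3])                  # characters 2-3
```

Because the widths differ within one line, no single fixed byte width maps character positions to byte ranges.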

Information forwarded to bug-coreutils <at> gnu.org:
bug#53145; Package coreutils. (Wed, 12 Jan 2022 12:26:01 GMT) Full text and rfc822 format available.

Message #20 received at 53145 <at> debbugs.gnu.org (full text, mbox):

From: Bernhard Voelker <mail <at> bernhard-voelker.de>
To: zendas <zendas <at> protonmail.com>, 53145 <at> debbugs.gnu.org
Subject: Re: bug#53145: 回覆: bug#53145: Acknowledgement ("cut" can't segment Chinese characters correctly?)
Date: Wed, 12 Jan 2022 13:25:17 +0100

On 1/12/22 12:19, zendas via GNU coreutils Bug Reports wrote:
> I have considered working around this by treating every character as three bytes, but I have two concerns. First, wc -m counts the characters correctly in the same environment, so why can't cut? Second, my script is meant to recognize Chinese text, so it is most likely to run on platforms that support a Chinese environment. In addition, a fixed three-byte approach cannot handle content that mixes full-width and half-width characters. I would need a lot of checks and conversions, which would greatly increase the chance of errors.

As Bob wrote, some downstream distributions have had multi-byte support in cut(1) for many years,
e.g. RHEL/Fedora and SUSE/openSUSE.

E.g. here on my openSUSE system:

  $ echo "你好嗎" | LC_ALL=zh_CN.UTF-8 cut -c 1
  你

Have a nice day,
Berny
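
For readers without a patched cut, character-aware selection can be approximated in a few lines of Python. This is a minimal sketch and not the downstream patch: the names parse_list and cut_characters are hypothetical, and only simple LIST forms such as 1, 3-5, 2- and -3 are handled.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, Optional[int]]


def parse_list(spec: str) -> List[Range]:
    """Parse a cut-style LIST such as '1,3-5' into 1-based inclusive ranges."""
    ranges: List[Range] = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-", 1)
            # An omitted start means 1; an omitted end means end of line.
            ranges.append((int(lo) if lo else 1, int(hi) if hi else None))
        else:
            ranges.append((int(part), int(part)))
    return ranges


def cut_characters(line: str, spec: str) -> str:
    """Select characters (code points, not bytes) at the listed positions."""
    ranges = parse_list(spec)
    return "".join(
        ch
        for pos, ch in enumerate(line, start=1)
        if any(lo <= pos and (hi is None or pos <= hi) for lo, hi in ranges)
    )


print(cut_characters("你好", "1"))        # 你
print(cut_characters("星期一", "2-"))     # 期一
print(cut_characters("abcdef", "1,3-5"))  # acde
```

The sketch counts Unicode code points, so combining sequences would still be split; it assumes the input has already been decoded from UTF-8.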

Severity set to 'wishlist' from 'normal' Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Mon, 21 Feb 2022 09:55:02 GMT) Full text and rfc822 format available.

This bug report was last modified 3 years and 170 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
