GNU bug report logs - #13362
multibyte: tr: TR operates on bytes, not characters
To reply to this bug, email your comments to 13362 AT debbugs.gnu.org.
Report forwarded to bug-coreutils <at> gnu.org: bug#13362; Package coreutils. (Sat, 05 Jan 2013 17:28:02 GMT)
Acknowledgement sent to Urs Thuermann <urs <at> isnogud.escape.de>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sat, 05 Jan 2013 17:28:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
The tr utility from coreutils-8.20 does not handle multi-byte
characters in UTF-8 correctly. It seems the arguments and standard
input are read byte-by-byte instead of character-by-character.
Here are two examples, using the following UTF-8 characters (which are
also available in latin1, since this is what my mail software still
uses):
ä (c3 a4), ö (c3 b6), ü (c3 bc), ¼ (c2 bc), ½ (c2 bd)
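The byte values listed above can be double-checked with a short Python sketch. The helper name `encodings` is hypothetical and only serves this illustration; it prints hex digits only, so it does not depend on the terminal's encoding.

```python
def encodings(ch: str) -> tuple[str, str]:
    """Return the UTF-8 and Latin-1 encodings of one character as hex strings."""
    return ch.encode("utf-8").hex(" "), ch.encode("latin-1").hex(" ")

for ch in "äöü¼½":
    utf8, latin1 = encodings(ch)
    # Each of these characters takes two bytes in UTF-8 but one byte in Latin-1.
    print(f"U+{ord(ch):04X}  utf-8: {utf8}  latin-1: {latin1}")
```

Note that ü (c3 bc) and ¼ (c2 bc) share their second byte, which is why the examples below are sensitive to byte-wise processing.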
1. A call to tr -d ü does not delete that two-byte sequence from the
input but deletes any occurrence of c3 or bc:
urs@bit:~/coreutils-8.20$ locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=
urs@bit:~/coreutils-8.20$ echo äöü¼|od -tx1
0000000 c3 a4 c3 b6 c3 bc c2 bc 0a
0000011
urs@bit:~/coreutils-8.20$ echo äöü¼|tr -d ü|od -tx1
0000000 a4 b6 c2 0a
0000004
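The output above is what one would expect if each byte of the argument is treated as a separate member of the deletion set. A minimal Python sketch of that behavior, next to a character-wise variant, is given here. It only simulates the observed results and is not tr's actual code; the function names `delete_bytewise` and `delete_charwise` are hypothetical.

```python
def delete_bytewise(data: bytes, chars: str) -> bytes:
    """Delete every byte that occurs in the UTF-8 encoding of chars."""
    return data.translate(None, delete=chars.encode("utf-8"))

def delete_charwise(data: bytes, chars: str) -> bytes:
    """Decode first, then delete whole characters that occur in chars."""
    text = data.decode("utf-8")
    kept = "".join(ch for ch in text if ch not in chars)
    return kept.encode("utf-8")

data = "äöü¼\n".encode("utf-8")              # c3 a4 c3 b6 c3 bc c2 bc 0a
print(delete_bytewise(data, "ü").hex(" "))   # a4 b6 c2 0a (matches the report)
print(delete_charwise(data, "ü").hex(" "))   # c3 a4 c3 b6 c2 bc 0a
```

The byte-wise result is not valid UTF-8 any more, since the lead byte c3 of ä and ö and the trailing byte bc of ¼ are removed along with ü.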
2. Replacing the single character ü (c3 bc) by the single character ½
(c2 bd) instead replaces each c3 by c2 and each bc by bd:
urs@bit:~/coreutils-8.20$ echo äöü¼|tr ü ½|od -tx1
0000000 c2 a4 c2 b6 c2 bd c2 bd 0a
0000011
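The same effect can be reproduced for translation. The following Python sketch builds a byte-to-byte table from the encoded arguments (c3 to c2, bc to bd) and compares it with a character-wise mapping. As before, this is an illustration under the assumption that the arguments are split into bytes, not tr's implementation, and `translate_bytewise` and `translate_charwise` are hypothetical names.

```python
def translate_bytewise(data: bytes, src: str, dst: str) -> bytes:
    """Map the i-th byte of src's encoding to the i-th byte of dst's encoding."""
    # bytes.maketrans needs equal lengths; ü and ½ are both two bytes in UTF-8.
    table = bytes.maketrans(src.encode("utf-8"), dst.encode("utf-8"))
    return data.translate(table)

def translate_charwise(data: bytes, src: str, dst: str) -> bytes:
    """Decode first, then map whole characters of src to those of dst."""
    text = data.decode("utf-8")
    return text.translate(str.maketrans(src, dst)).encode("utf-8")

data = "äöü¼\n".encode("utf-8")                    # c3 a4 c3 b6 c3 bc c2 bc 0a
print(translate_bytewise(data, "ü", "½").hex(" "))  # c2 a4 c2 b6 c2 bd c2 bd 0a
print(translate_charwise(data, "ü", "½").hex(" "))  # c3 a4 c3 b6 c2 bd c2 bc 0a
```

In the character-wise result only ü is changed to ½, and ä, ö and ¼ keep their original bytes.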
urs
Information forwarded to bug-coreutils <at> gnu.org: bug#13362; Package coreutils. (Sun, 06 Jan 2013 12:24:02 GMT)
Message #8 received at 13362 <at> debbugs.gnu.org (full text, mbox):
forcemerge 13362 9365
thanks
On 01/05/2013 11:53 AM, Urs Thuermann wrote:
> The tr utility from coreutils-8.20 does not handle multi-byte
> characters in UTF-8 correctly. It seems the arguments and standard
> input are read byte-by-byte instead of character-by-character.
We all agree that this is an issue.
Someone just needs to get the time to implement it.
thanks,
Pádraig.
Information forwarded to bug-coreutils <at> gnu.org: bug#13362; Package coreutils. (Fri, 27 Jun 2014 17:07:02 GMT)
Message #13 received at 13362 <at> debbugs.gnu.org (full text, mbox):
Dear sirs:
This bug has been causing errors for many years (at least twelve (!)
[https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=139861]), and let's face
it, if we don't change our point of view it will never get solved. Meanwhile,
the effects of this bug will keep damaging the work of Linux users, and our
reputation.
sed can handle UTF-8 correctly. What about asking the sed developers for help?
The sed developers could even refactor the tr code so that it reuses sed code,
so that at least this bug would stop causing errors for Linux users.
Moreover, the sed developers may find a better solution than that one.
Thank you.
Severity set to 'wishlist' from 'normal'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 15 Oct 2018 14:07:02 GMT)
Changed bug title to 'multibyte: tr: TR operates on bytes, not characters' from 'tr does not work with UTF-8 locales'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 15 Oct 2018 14:07:02 GMT)
This bug report was last modified 6 years and 245 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.