GNU bug report logs - #13362
multibyte: tr: TR operates on bytes, not characters


Package: coreutils;

Reported by: Urs Thuermann <urs <at> isnogud.escape.de>

Date: Sat, 5 Jan 2013 17:28:01 UTC

Severity: wishlist

Merged with 9365, 9569, 10880, 12192

To reply to this bug, email your comments to 13362 AT debbugs.gnu.org.


Report forwarded to bug-coreutils <at> gnu.org:
bug#13362; Package coreutils. (Sat, 05 Jan 2013 17:28:02 GMT)

Acknowledgement sent to Urs Thuermann <urs <at> isnogud.escape.de>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sat, 05 Jan 2013 17:28:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Urs Thuermann <urs <at> isnogud.escape.de>
To: bug-coreutils <at> gnu.org
Subject: tr does not work with UTF-8 locales
Date: 05 Jan 2013 12:53:00 +0100

The tr utility from coreutils-8.20 does not handle multi-byte
characters in UTF-8 correctly.  It seems the arguments and standard
input are read byte-by-byte instead of character-by-character.

Here are two examples, using the following UTF-8 characters (which are
also available in latin1, since this is what my mail software still
uses):

        ä (c3 a4), ö (c3 b6), ü (c3 bc), ¼ (c2 bc), ½ (c2 bd)

1. A call to tr -d ü does not delete that two-byte sequence from the
   input but deletes any occurrence of c3 or bc:

    urs <at> bit:~/coreutils-8.20$ locale
    LANG=C.UTF-8
    LANGUAGE=
    LC_CTYPE="C.UTF-8"
    LC_NUMERIC="C.UTF-8"
    LC_TIME="C.UTF-8"
    LC_COLLATE="C.UTF-8"
    LC_MONETARY="C.UTF-8"
    LC_MESSAGES="C.UTF-8"
    LC_PAPER="C.UTF-8"
    LC_NAME="C.UTF-8"
    LC_ADDRESS="C.UTF-8"
    LC_TELEPHONE="C.UTF-8"
    LC_MEASUREMENT="C.UTF-8"
    LC_IDENTIFICATION="C.UTF-8"
    LC_ALL=
    urs <at> bit:~/coreutils-8.20$ echo äöü¼|od -tx1
    0000000 c3 a4 c3 b6 c3 bc c2 bc 0a
    0000011
    urs <at> bit:~/coreutils-8.20$ echo äöü¼|tr -d ü|od -tx1
    0000000 a4 b6 c2 0a
    0000004

2. Replacing the single character ü (c3 bc) by the single character ½
   (c2 bd) instead replaces each c3 with c2 and each bc with bd:

    urs <at> bit:~/coreutils-8.20$ echo äöü¼|tr ü ½|od -tx1
    0000000 c2 a4 c2 b6 c2 bd c2 bd 0a
    0000011
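
Both results above follow from tr treating each byte of its arguments as a separate set member. The sketch below reproduces them without depending on the terminal encoding, by spelling the bytes as octal escapes and forcing the C locale. The helper names (hex, sample, del_bytes, map_bytes) are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Sketch only: hex, sample, del_bytes and map_bytes are hypothetical helpers.

# Print stdin as one compact hex string, e.g. "c3a40a".
hex() { od -An -tx1 | tr -d ' \n'; }

# The report's input "äöü¼" plus a newline, spelled as octal escapes.
sample() { printf '\303\244\303\266\303\274\302\274\n'; }

# What `tr -d ü` amounts to when ü (c3 bc) is read as two separate bytes:
# every c3 and every bc is deleted, wherever it occurs.
del_bytes() { LC_ALL=C tr -d '\303\274'; }

# What `tr ü ½` amounts to: c3 -> c2 and bc -> bd, each byte mapped on its own.
map_bytes() { LC_ALL=C tr '\303\274' '\302\275'; }

sample | del_bytes | hex; echo   # a4b6c20a
sample | map_bytes | hex; echo   # c2a4c2b6c2bdc2bd0a
```

The two hex strings match the od output in examples 1 and 2.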

urs

Information forwarded to bug-coreutils <at> gnu.org:
bug#13362; Package coreutils. (Sun, 06 Jan 2013 12:24:02 GMT)

Message #8 received at 13362 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Urs Thuermann <urs <at> isnogud.escape.de>
Cc: 13362 <at> debbugs.gnu.org
Subject: Re: bug#13362: tr does not work with UTF-8 locales
Date: Sun, 06 Jan 2013 12:22:56 +0000

forcemerge 13362 9365
thanks

On 01/05/2013 11:53 AM, Urs Thuermann wrote:
> The tr utility from coreutils-8.20 does not handle multi-byte
> characters in UTF-8 correctly.  It seems the arguments and standard
> input are read byte-by-byte instead of character-by-character.

We all agree that this is an issue.
Someone just needs to find the time to implement it.

thanks,
Pádraig.

Forcibly Merged 9365 9569 10880 12192 13362. Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Sun, 06 Jan 2013 12:24:03 GMT)

Information forwarded to bug-coreutils <at> gnu.org:
bug#13362; Package coreutils. (Fri, 27 Jun 2014 17:07:02 GMT)

Message #13 received at 13362 <at> debbugs.gnu.org:

From: Ganton <kubry <at> gmx.com>
To: 13362 <at> debbugs.gnu.org
Subject: GNU bug report logs - #13362 tr does not work with UTF-8 locales
Date: Fri, 27 Jun 2014 19:01:14 +0200

Dear sirs:

This bug has been causing errors for many years (at least twelve (!)
[https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=139861]), and let's face
it, unless we change our approach it will never get solved. Meanwhile,
the effects of this bug will keep damaging the work of Linux users, and our
reputation.

sed handles UTF-8 correctly. What about asking the sed developers for help?
The sed developers could even refactor the tr code so that sed code could be
reused, so at least this bug would stop causing errors for Linux users.
Moreover, the sed developers may find a better solution than that one.
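
As a rough illustration of a sed-based workaround (not a fix for tr itself), the sketch below deletes or replaces ü by matching its whole byte sequence with sed's s command. It relies on a property of UTF-8: the encoding of one character can only match at a character boundary, so the match is safe even in the C locale. The helper names (hex, sample, del_seq, map_seq) are hypothetical, and this covers only one-for-one substitutions, not tr's ranges or character classes.

```shell
#!/bin/sh
# Sketch only: hex, sample, del_seq and map_seq are hypothetical helpers.

# Print stdin as one compact hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# The input "äöü¼" plus a newline, spelled as octal escapes.
sample() { printf '\303\244\303\266\303\274\302\274\n'; }

u=$(printf '\303\274')      # ü as the byte sequence c3 bc
half=$(printf '\302\275')   # ½ as the byte sequence c2 bd

# Match the whole two-byte sequence, never its individual bytes.
del_seq() { LC_ALL=C sed "s/$u//g"; }
map_seq() { LC_ALL=C sed "s/$u/$half/g"; }

sample | del_seq | hex; echo   # c3a4c3b6c2bc0a
sample | map_seq | hex; echo   # c3a4c3b6c2bdc2bc0a
```

Here ä, ö and ¼ pass through untouched, which is the result the original report expected from tr.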

Thank you.

Severity set to 'wishlist' from 'normal' Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 15 Oct 2018 14:07:02 GMT)

Changed bug title to 'multibyte: tr: TR operates on bytes, not characters' from 'tr does not work with UTF-8 locales' Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 15 Oct 2018 14:07:02 GMT)

This bug report was last modified 6 years and 245 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
