GNU bug report logs - #13362
multibyte: tr: TR operates on bytes, not characters


Package: coreutils

Reported by: Urs Thuermann <urs <at> isnogud.escape.de>

Date: Sat, 5 Jan 2013 17:28:01 UTC

Severity: wishlist

Merged with 9365, 9569, 10880, 12192


From: Urs Thuermann <urs <at> isnogud.escape.de>
To: 13362 <at> debbugs.gnu.org
Subject: bug#13362: tr does not work with UTF-8 locales
Date: 05 Jan 2013 12:53:00 +0100
The tr utility from coreutils-8.20 does not handle multi-byte
characters in UTF-8 correctly.  It seems the arguments and standard
input are read byte-by-byte instead of character-by-character.

Here are two examples, using the following UTF-8 characters (which are
also available in latin1, since this is what my mail software still
uses):

        ä (c3 a4), ö (c3 b6), ü (c3 bc), ¼ (c2 bc), ½ (c2 bd)

1. A call to tr -d ü does not delete that two-byte sequence from the
   input but deletes every occurrence of c3 or bc:

    urs <at> bit:~/coreutils-8.20$ locale
    LANG=C.UTF-8
    LANGUAGE=
    LC_CTYPE="C.UTF-8"
    LC_NUMERIC="C.UTF-8"
    LC_TIME="C.UTF-8"
    LC_COLLATE="C.UTF-8"
    LC_MONETARY="C.UTF-8"
    LC_MESSAGES="C.UTF-8"
    LC_PAPER="C.UTF-8"
    LC_NAME="C.UTF-8"
    LC_ADDRESS="C.UTF-8"
    LC_TELEPHONE="C.UTF-8"
    LC_MEASUREMENT="C.UTF-8"
    LC_IDENTIFICATION="C.UTF-8"
    LC_ALL=
    urs <at> bit:~/coreutils-8.20$ echo äöü¼|od -tx1
    0000000 c3 a4 c3 b6 c3 bc c2 bc 0a
    0000011
    urs <at> bit:~/coreutils-8.20$ echo äöü¼|tr -d ü|od -tx1
    0000000 a4 b6 c2 0a
    0000004

2. Replacing the single character ü (c3 bc) with the single character ½
   (c2 bd) instead replaces each c3 with c2 and each bc with bd:

    urs <at> bit:~/coreutils-8.20$ echo äöü¼|tr ü ½|od -tx1
    0000000 c2 a4 c2 b6 c2 bd c2 bd 0a
    0000011
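
Both results are consistent with tr treating its arguments and its
input as plain byte strings.  The following is a minimal sketch in
Python (not tr's actual code; the function names are hypothetical and
chosen only for illustration) that models the byte-wise behaviour seen
above next to the character-wise behaviour one would presumably expect
in a UTF-8 locale:

```python
# Model of the two tr calls from this report.  This is NOT tr's source;
# all function names are hypothetical and exist only for illustration.

A_UML, O_UML, U_UML = "\u00e4", "\u00f6", "\u00fc"   # ä, ö, ü
QUARTER, HALF = "\u00bc", "\u00bd"                   # ¼, ½


def delete_bytewise(data: bytes, set1: bytes) -> bytes:
    """Drop every byte that occurs anywhere in set1 (observed behaviour)."""
    return bytes(b for b in data if b not in set1)


def translate_bytewise(data: bytes, set1: bytes, set2: bytes) -> bytes:
    """Map the i-th byte of set1 to the i-th byte of set2 (equal lengths)."""
    return data.translate(bytes.maketrans(set1, set2))


def delete_charwise(text: str, set1: str) -> str:
    """Drop every character that occurs in set1 (assumed expected behaviour)."""
    return text.translate({ord(c): None for c in set1})


def translate_charwise(text: str, set1: str, set2: str) -> str:
    """Map the i-th character of set1 to the i-th character of set2."""
    return text.translate(str.maketrans(set1, set2))


line = A_UML + O_UML + U_UML + QUARTER + "\n"        # what echo äöü¼ writes
raw = line.encode("utf-8")                           # c3 a4 c3 b6 c3 bc c2 bc 0a

# Byte-wise: reproduces the od output shown in examples 1 and 2.
print(delete_bytewise(raw, U_UML.encode("utf-8")).hex(" "))
# a4 b6 c2 0a
print(translate_bytewise(raw, U_UML.encode("utf-8"), HALF.encode("utf-8")).hex(" "))
# c2 a4 c2 b6 c2 bd c2 bd 0a

# Character-wise: only the ü is affected.
print(delete_charwise(line, U_UML).encode("utf-8").hex(" "))
# c3 a4 c3 b6 c2 bc 0a
print(translate_charwise(line, U_UML, HALF).encode("utf-8").hex(" "))
# c3 a4 c3 b6 c2 bd c2 bc 0a
```

The first two lines of output match the od dumps above byte for byte;
the last two are what a character-aware tr would be expected to
produce for the same input.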

urs




This bug report was last modified 6 years and 245 days ago.


