GNU bug report logs - #26362
multibyte: tr: "tr -cd" -- Problem with UTF-8?

Package: coreutils

Reported by: Ronald Schaten <ronald <at> schatenseite.de>

Date: Tue, 4 Apr 2017 15:25:02 UTC

Severity: wishlist

Tags: notabug

To reply to this bug, email your comments to 26362 AT debbugs.gnu.org.


Report forwarded to bug-coreutils <at> gnu.org:
bug#26362; Package coreutils. (Tue, 04 Apr 2017 15:25:02 GMT)

Acknowledgement sent to Ronald Schaten <ronald <at> schatenseite.de>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 04 Apr 2017 15:25:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Ronald Schaten <ronald <at> schatenseite.de>
To: bug-coreutils <at> gnu.org
Subject: tr -cd -- Problem with UTF-8?
Date: Tue, 4 Apr 2017 16:01:52 +0200
Hey...

I'm not sure if this is a bug or if I'm using it wrong. As a matter of
fact, I tested this on several systems, and on BSD-based systems (Mac)
the tr tool gives different results -- the one I expected.

The simplest way to reproduce this looks like this (sorry, umlaut
ahead):

$ echo -ne "\xc3\x82" | tr -cd "ä" | xxd
% 00000000: c3                                       .

The echo prints a capital A with a circumflex (Â), and I expect the tr
command to delete everything except the small umlaut ä. It looks as if
tr just deletes the second byte.

When I try without the umlaut it gives me the empty result, as expected:

$ echo -ne "\xc3\x82" | tr -cd "a" | xxd
[empty result]

I tested several systems, the oldest is a Debian with coreutils 8.5, the
newest an Ubuntu with coreutils 8.25.


For the moment, I'll try to solve my problem differently, but... is this
a bug? Thanks in advance!


Regards,
Ronald.

-- 
There is no reason for any individual to have a computer in his home.
(Ken Olsen, DEC)




Information forwarded to bug-coreutils <at> gnu.org:
bug#26362; Package coreutils. (Wed, 05 Apr 2017 02:20:01 GMT)

Message #8 received at 26362 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Ronald Schaten <ronald <at> schatenseite.de>
Cc: 26362 <at> debbugs.gnu.org
Subject: Re: bug#26362: tr -cd -- Problem with UTF-8?
Date: Tue, 4 Apr 2017 22:19:15 -0400
tags 26362 notabug wishlist
stop 26362

Hello,

> On Apr 4, 2017, at 10:01, Ronald Schaten <ronald <at> schatenseite.de> wrote:
> 
> I'm not sure if this is a bug or if I'm using it wrong.

Neither - it is simply that GNU tr does not yet support multibyte characters.

> The simplest way to reproduce this looks like this (sorry, umlaut
> ahead):
> 
> $ echo -ne "\xc3\x82" | tr -cd "ä" | xxd
> % 00000000: c3                                       .
> 
> The echo prints a capital A with a circumflex (Â), and I expect the tr
> command to delete everything except the small umlaut ä. It looks as if
> tr just deletes the second byte.

What happened here is this:
'tr' currently reads the input string parameter (SET1) as single-byte, and so
treats it as if you've given two octets: \xC3 \xA4 (which is the UTF-8 encoding
of small A with umlaut).
Then, it reads the input octet-by-octet, keeps \xC3 and deletes \x82.

> When I try without the umlaut it gives me the empty result, as expected:
> 
> $ echo -ne "\xc3\x82" | tr -cd "a" | xxd

Indeed, because here you're asking to
keep only octets whose value is \x61 (the ASCII value of 'a') -
neither "\xC3" nor "\x82" matches, and so both are deleted.
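
The two cases explained above can be reproduced without depending on the
terminal's encoding. This is only a minimal sketch, assuming a POSIX shell
with 'tr' and 'od' available. The variable names are made up for
illustration, LC_ALL=C is set only to make the single-byte handling explicit,
and octal escapes stand in for the UTF-8 bytes of ä (\303\244) and Â (\303\202):

```shell
# Input: the two bytes of "Â" (0xC3 0x82), written as octal escapes.
# SET1:  the two bytes of "ä" (0xC3 0xA4), which a byte-oriented tr
#        treats as two independent single-byte characters.
kept=$(printf '\303\202' | LC_ALL=C tr -cd '\303\244' | od -An -tx1 | tr -d ' \n')
echo "bytes kept with SET1=ä: $kept"    # 0xC3 is in SET1, 0x82 is not

# With SET1 = "a" (0x61) neither input byte matches, so nothing survives.
kept_a=$(printf '\303\202' | LC_ALL=C tr -cd 'a' | od -An -tx1 | tr -d ' \n')
echo "bytes kept with SET1=a: ${kept_a:-<none>}"
```

The trailing "tr -d ' \n'" only strips the padding that 'od' adds, so the
hex digits can be compared as a plain string.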


> For the moment, I'll try to solve my problem differently, but... is this
> a bug? Thanks in advance!

Not a bug - but a yet-missing feature.
For relevant discussion see here:
   https://debbugs.gnu.org/cgi/bugreport.cgi?bug=24924#8

As a temporary work-around, you can use GNU sed, which is multibyte-aware:

  $ printf "abc \xc3\xA4\xc3\x82 def\n" | sed 's/[^ä]//g'
  ä

'sed' also supports "character equivalence classes".
In the following example, all characters except those that are equivalent to 'a'
will be deleted:

  $ printf "abc \xc3\xA4\xc3\x82 def\n" | sed 's/[^[=a=]]//g'
  aäÂ

Character equivalence classes will work with a future 'tr' as well,
once multibyte support is added.

Lastly,
"echo -en" is not portable. It is recommended to use "printf" instead.
"printf" has the added advantage that it supports Unicode code points
directly, instead of having to know the UTF-8 encoding of a Unicode character,
e.g.:
     printf "\u00c2\n"
will print capital A with circumflex (and will work in other locales if they
support this character, not just UTF-8).
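
One caveat on the escapes themselves: POSIX specifies only octal escapes
(\ddd) for the printf format string, while \xHH and \uHHHH are extensions
offered by, for example, bash's builtin and GNU coreutils printf. Where only
a plain POSIX shell can be assumed, a sketch like the following (the variable
name is made up for illustration) emits the UTF-8 bytes of Â with octal
escapes and shows them in hex:

```shell
# "Â" is U+00C2; its UTF-8 encoding is the two bytes 0xC3 0x82,
# i.e. octal 303 202. Octal escapes are the form POSIX printf specifies.
a_circumflex_hex=$(printf '\303\202' | od -An -tx1 | tr -d ' \n')
echo "UTF-8 bytes: $a_circumflex_hex"
```

The trade-off is that the octal form always writes these exact UTF-8 bytes,
whereas "\u00c2" is converted to whatever encoding the current locale uses.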


I'm thus marking this item as "wishlist" and "notabug",
but I'll keep it open until it is implemented.
Discussion can continue by replying to this thread.

regards,
 - assaf





Added tag(s) notabug. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Wed, 05 Apr 2017 02:20:02 GMT)

Severity set to 'wishlist' from 'normal'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 29 Oct 2018 03:05:02 GMT)

Changed bug title to 'multibyte: tr: "tr -cd" -- Problem with UTF-8?' from 'tr -cd -- Problem with UTF-8?'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 29 Oct 2018 03:05:02 GMT)

This bug report was last modified 6 years and 292 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.