GNU bug report logs -
#26362
multibyte: tr: "tr -cd" -- Problem with UTF-8?
To reply to this bug, email your comments to 26362 AT debbugs.gnu.org.
Report forwarded to bug-coreutils <at> gnu.org: bug#26362; Package coreutils. (Tue, 04 Apr 2017 15:25:02 GMT)
Acknowledgement sent to Ronald Schaten <ronald <at> schatenseite.de>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 04 Apr 2017 15:25:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hey...
I'm not sure if this is a bug or if I'm using it wrong. As a matter of
fact, I tested this on several systems, and on BSD-based systems (Mac)
the tr tool gives different results -- the one I expected.
The simplest way to reproduce this looks like this (sorry, umlaut
ahead):
$ echo -ne "\xc3\x82" | tr -cd "ä" | xxd
% 00000000: c3 .
The echo prints a capital A with a circumflex (Â), and I expect the tr
command to delete everything except the small umlaut ä. It looks as if
tr just deletes the second byte.
When I try without the umlaut it gives me the empty result, as expected:
$ echo -ne "\xc3\x82" | tr -cd "a" | xxd
[empty result]
I tested several systems, the oldest is a Debian with coreutils 8.5, the
newest an Ubuntu with coreutils 8.25.
For the moment, I'll try to solve my problem differently, but... is this
a bug? Thanks in advance!
Regards,
Ronald.
--
There is no reason for any individual to have a computer in his home.
(Ken Olsen, DEC)
Information forwarded to bug-coreutils <at> gnu.org: bug#26362; Package coreutils. (Wed, 05 Apr 2017 02:20:01 GMT)
Message #8 received at 26362 <at> debbugs.gnu.org (full text, mbox):
tags 26362 notabug wishlist
stop 26362
Hello,
> On Apr 4, 2017, at 10:01, Ronald Schaten <ronald <at> schatenseite.de> wrote:
>
> I'm not sure if this is a bug or if I'm using it wrong.
Neither: it is simply that GNU tr does not yet support multibyte characters.
> The simplest way to reproduce this looks like this (sorry, umlaut
> ahead):
>
> $ echo -ne "\xc3\x82" | tr -cd "ä" | xxd
> % 00000000: c3 .
>
> The echo prints a capital A with a circumflex (Â), and I expect the tr
> command to delete everything except the small umlaut ä. It looks as if
> tr just deletes the second byte.
What happened here is this:
'tr' currently reads the SET1 argument byte by byte, so it treats "ä" as two
separate octets, \xC3 and \xA4 (the UTF-8 encoding of small a with umlaut).
With -c and -d, it then reads the input octet by octet and deletes every octet
that is not in that set: \xC3 is kept and \x82 is deleted.
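The byte-wise behavior described above can be reproduced without depending on
the terminal's encoding. What follows is a minimal sketch, assuming POSIX
printf, tr and od are available. The function name kept_bytes is made up for
this illustration, and SET1 is spelled with octal escapes so that the two
octets of "ä" are explicit:

```shell
# Hypothetical helper: run the report's command at the byte level and
# show, as hex, which input octets survive.
kept_bytes() {
  # Input: \303\202 = c3 82 (UTF-8 for "Â").
  # SET1:  \303\244 = c3 a4 (UTF-8 for "ä"), which tr sees as two octets.
  printf '\303\202' | LC_ALL=C tr -cd '\303\244' | od -An -tx1 | tr -d ' \n'
}

kept_bytes   # prints: c3
echo
```

Only the lead octet c3 survives, because it happens to be a member of the
two-octet set; the continuation octet 82 is not, so it is deleted.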
> When I try without the umlaut it gives me the empty result, as expected:
>
> $ echo -ne "\xc3\x82" | tr -cd "a" | xxd
Indeed, because here you're asking to
keep only octets whose value is \x61 (the ASCII value of 'a');
neither "\xC3" nor "\x82" matches, and so both are deleted.
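For an ASCII-only SET1, byte-wise and character-wise matching agree, which is
why this case behaves as expected. Here is a small sketch, assuming POSIX
printf and tr; keep_only_a is a name invented here. It runs on a longer string
that mixes ASCII text with the UTF-8 octets for "ä" and "Â":

```shell
# Hypothetical helper: keep only the octet 0x61 ('a') from a mixed string.
keep_only_a() {
  # c3 a4 = "ä", c3 82 = "Â"; none of those octets equals 0x61,
  # so every octet of both multibyte characters is deleted.
  printf 'abc \303\244\303\202 def\n' | LC_ALL=C tr -cd 'a'
}

keep_only_a   # prints: a
echo
```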
> For the moment, I'll try to solve my problem differently, but... is this
> a bug? Thanks in advance!
Not a bug, but a feature that is not yet implemented.
For relevant discussion see here:
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=24924#8
As a temporary work-around, you can use GNU sed, which is multibyte-aware:
$ printf "abc \xc3\xA4\xc3\x82 def\n" | sed 's/[^ä]//g'
ä
'sed' also supports "character equivalence classes".
In the following example, all characters except those that are equivalent to 'a'
will be deleted:
$ printf "abc \xc3\xA4\xc3\x82 def\n" | sed 's/[^[=a=]]//g'
aäÂ
Character equivalence classes will work with a future 'tr' as well,
once multibyte support is added.
Lastly,
"echo -en" is not portable. It is recommended to use "printf" instead.
"printf" has the added advantage that it supports Unicode code points
directly, so you do not have to know the UTF-8 encoding of a Unicode character,
e.g.:
printf "\u00c2\n"
will print capital A with circumflex (and will work in other locales if they
support this character, not just UTF-8).
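For reference, the UTF-8 octets behind such a code point can be worked out by
hand. The sketch below is an illustration only: utf8_two_bytes is a
hypothetical helper, it relies on POSIX shell arithmetic, and it handles just
the two-byte range U+0080 to U+07FF. It shows that "Â" (U+00C2) and "ä"
(U+00E4) share the lead octet c3, which is the octet a byte-wise tr keeps in
the original report.

```shell
# Hypothetical helper: print the two UTF-8 octets (hex) of a code point
# in the range U+0080..U+07FF. No range checking is done.
utf8_two_bytes() {
  cp=$(( $1 ))
  # Lead octet:         110xxxxx, carrying the top 5 bits of the code point.
  # Continuation octet: 10xxxxxx, carrying the low 6 bits.
  printf '%02x %02x\n' $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
}

utf8_two_bytes 0xC2   # prints: c3 82
utf8_two_bytes 0xE4   # prints: c3 a4
```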
I'm thus marking this item as "wishlist" and "notabug",
but I'll keep it open until it is implemented.
Discussion can continue by replying to this thread.
regards,
- assaf
Added tag(s) notabug. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Wed, 05 Apr 2017 02:20:02 GMT)
Severity set to 'wishlist' from 'normal'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 29 Oct 2018 03:05:02 GMT)
Changed bug title to 'multibyte: tr: "tr -cd" -- Problem with UTF-8?' from 'tr -cd -- Problem with UTF-8?'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 29 Oct 2018 03:05:02 GMT)
This bug report was last modified 6 years and 292 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.