GNU bug report logs - #12192
multibyte: tr: TR operates on bytes, not characters

Package: coreutils;

Reported by: Michael Stummvoll <michael <at> stummi.org>

Date: Mon, 13 Aug 2012 13:02:02 UTC

Severity: wishlist

Merged with 9365, 9569, 10880, 13362

To reply to this bug, email your comments to 12192 AT debbugs.gnu.org.


Report forwarded to bug-coreutils <at> gnu.org:
bug#12192; Package coreutils. (Mon, 13 Aug 2012 13:02:02 GMT)

Acknowledgement sent to Michael Stummvoll <michael <at> stummi.org>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Mon, 13 Aug 2012 13:02:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Michael Stummvoll <michael <at> stummi.org>
To: bug-coreutils <at> gnu.org
Subject: tr - bytes vs characters
Date: Mon, 13 Aug 2012 14:52:22 +0200
Hi GNU folks,

As is already known, tr cannot handle multibyte encodings such as UTF-8:

> mst <at> eddie:~$ echo "foo" | tr o ö
> fÃÃ
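
The output above is what a purely byte-wise translation would produce: in UTF-8, "ö" is the two bytes 0xC3 0xB6, so "o" gets paired with 0xC3 alone, and a Latin-1 terminal shows that byte as "Ã". The following is a minimal Python sketch of the difference, not GNU tr's actual code. The helper names tr_bytes and tr_chars are hypothetical, operands are simply paired by position, and the trailing newline added by echo is left out.

```python
def tr_bytes(data: bytes, set1: bytes, set2: bytes) -> bytes:
    """Byte-wise translation: pair set1 and set2 byte by byte."""
    n = min(len(set1), len(set2))  # simplification: unpaired bytes are ignored
    return data.translate(bytes.maketrans(set1[:n], set2[:n]))


def tr_chars(text: str, set1: str, set2: str) -> str:
    """Character-wise translation: pair set1 and set2 by code point."""
    n = min(len(set1), len(set2))
    return text.translate(str.maketrans(set1[:n], set2[:n]))


# "ö" encodes to b'\xc3\xb6', but only its first byte is paired with b'o'.
out = tr_bytes("foo".encode("utf-8"), "o".encode("utf-8"), "ö".encode("utf-8"))
print(out)                        # b'f\xc3\xc3' (no longer valid UTF-8)
print(out.decode("latin-1"))      # fÃÃ
print(tr_chars("foo", "o", "ö"))  # föö
```

The byte-wise result is not merely displayed wrongly; it is an invalid UTF-8 sequence, because the continuation byte 0xB6 is never emitted.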

I know that multibyte encoding support is not needed for
POSIX compliance, BUT:

The man page of tr says the following:

> Translate, squeeze, and/or delete characters from standard input,
> writing to standard output.

and that is the inconsistency, IMHO.

The typical interpretation of "character" in such a context is one
character on display, regardless of which encoding is used or how many
bytes are needed to represent it. So if tr really translates "characters",
it should preserve the encoding. If it does not, it translates "bytes",
not "characters". So I see two ways forward:

- add multibyte encoding support to tr
or
- change the man page and help text to say "bytes" instead of "characters"

Since nobody seems to want to add that support to tr, updating
the man page would be the easier way to restore consistency.

Kind regards,
Michael




Information forwarded to bug-coreutils <at> gnu.org:
bug#12192; Package coreutils. (Mon, 13 Aug 2012 14:03:01 GMT)

Message #8 received at 12192 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Michael Stummvoll <michael <at> stummi.org>
Cc: 12192 <at> debbugs.gnu.org
Subject: Re: bug#12192: tr - bytes vs characters
Date: Mon, 13 Aug 2012 07:54:02 -0600
On 08/13/2012 06:52 AM, Michael Stummvoll wrote:
> Hi GNU folks,
> 
> As is already known, tr cannot handle multibyte encodings such as UTF-8:
> 
>> mst <at> eddie:~$ echo "foo" | tr o ö
>> fÃÃ
> 
> I know that multibyte encoding support is not needed for
> POSIX compliance,

Actually, POSIX _does_ require multi-byte support; it's just that no one
has yet contributed code for this upstream that is easy enough to
maintain and does not penalize single-byte locales.  Patches are welcome.

-- 
Eric Blake   eblake <at> redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Information forwarded to bug-coreutils <at> gnu.org:
bug#12192; Package coreutils. (Tue, 14 Aug 2012 02:55:01 GMT)

Message #11 received at 12192 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eric Blake <eblake <at> redhat.com>
Cc: 12192 <at> debbugs.gnu.org, Michael Stummvoll <michael <at> stummi.org>
Subject: Re: bug#12192: tr - bytes vs characters
Date: Mon, 13 Aug 2012 19:45:54 -0700
On 08/13/2012 06:54 AM, Eric Blake wrote:
> POSIX _does_ require multi-byte support

The last time I checked, POSIX did not require
the implementation to provide any multibyte locales.
Has this changed?

But yes, the main thing is for someone to contribute
correct, easy-to-maintain, and efficient code.




Information forwarded to bug-coreutils <at> gnu.org:
bug#12192; Package coreutils. (Tue, 14 Aug 2012 05:44:02 GMT)

Message #14 received at 12192 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 12192 <at> debbugs.gnu.org, Michael Stummvoll <michael <at> stummi.org>
Subject: Re: bug#12192: tr - bytes vs characters
Date: Mon, 13 Aug 2012 23:34:16 -0600
On 08/13/2012 08:45 PM, Paul Eggert wrote:
> On 08/13/2012 06:54 AM, Eric Blake wrote:
>> POSIX _does_ require multi-byte support
> 
> The last time I checked, POSIX did not require
> the implementation to provide any multibyte locales.
> Has this changed?

Fair enough - POSIX does not require the existence of a multibyte
locale; an embedded system that provides only single-byte encodings can
still be POSIX-compliant.  But POSIX _does_ require that tr be
locale-aware, and therefore if an implementation provides multibyte
locales (which most desktop glibc-based GNU/Linux systems do), then tr
should honor those locales, including multibyte character support.
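
One way to picture "honoring the locale" is a translation routine that takes the locale's charset as input: it keeps a plain 256-entry byte table when every character is a single byte, and decodes to characters only for multibyte charsets. This is a rough Python sketch under stated assumptions, not a proposal for the coreutils C code. The name tr_locale is hypothetical, the SINGLE_BYTE set is a deliberately incomplete stand-in for a real charset query, and the charset is passed in explicitly rather than read from the environment.

```python
import codecs

# Assumption: a tiny, incomplete list of single-byte charsets, keyed by
# Python's canonical codec names. A real tool would ask the C library.
SINGLE_BYTE = {"ascii", "iso8859-1", "iso8859-15"}


def tr_locale(data: bytes, set1: str, set2: str, charset: str) -> bytes:
    """Translate characters of set1 to set2, reading data in charset."""
    n = min(len(set1), len(set2))  # simplification: pair by position only
    if codecs.lookup(charset).name in SINGLE_BYTE:
        # Fast path: one byte per character, so a byte table is enough
        # and single-byte locales pay no decoding cost.
        table = bytes.maketrans(set1[:n].encode(charset),
                                set2[:n].encode(charset))
        return data.translate(table)
    # Multibyte path: decode, map whole characters, re-encode.
    text = data.decode(charset)
    return text.translate(str.maketrans(set1[:n], set2[:n])).encode(charset)


print(tr_locale(b"foo", "o", "ö", "latin-1"))  # b'f\xf6\xf6'
print(tr_locale(b"foo", "o", "ö", "utf-8"))    # b'f\xc3\xb6\xc3\xb6'
```

In both cases the output is a valid encoding of "föö" in the charset that was requested, which is the behavior the paragraph above describes.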

> 
> But yes, the main thing is for someone to contribute
> correct, easy-to-maintain, and efficient code.

We're in violent agreement on this point :)

-- 
Eric Blake   eblake <at> redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Information forwarded to bug-coreutils <at> gnu.org:
bug#12192; Package coreutils. (Tue, 14 Aug 2012 07:54:01 GMT)

Message #17 received at 12192 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eric Blake <eblake <at> redhat.com>
Cc: 12192 <at> debbugs.gnu.org, Michael Stummvoll <michael <at> stummi.org>
Subject: Re: bug#12192: tr - bytes vs characters
Date: Tue, 14 Aug 2012 00:44:24 -0700
On 08/13/2012 10:34 PM, Eric Blake wrote:
> But POSIX _does_ require that tr be
> locale-aware, and therefore if an implementation provides multibyte
> locales (which most desktop glibc-based GNU/Linux systems do), then tr
> should honor those locales, including multibyte character support.

All this is absolutely correct; but still, if the issue is merely POSIX
conformance, these glibc-based GNU/Linux systems do conform to POSIX,
since the POSIX-conformance document for these systems can state that
the supported locales are merely the single-byte locales.  Admittedly this
is legal hairsplitting, but if POSIX compliance is the issue
then one is in legal-hairsplitting mode already....




Information forwarded to bug-coreutils <at> gnu.org:
bug#12192; Package coreutils. (Fri, 17 Aug 2012 12:13:02 GMT)

Message #20 received at 12192 <at> debbugs.gnu.org:

From: Michael Stummvoll <michael <at> stummi.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 12192 <at> debbugs.gnu.org
Subject: Re: bug#12192: tr - bytes vs characters
Date: Fri, 17 Aug 2012 14:03:42 +0200
Hi there,
> But yes, the main thing is for someone to contribute
> correct, easy-to-maintain, and efficient code.

Just for the record, in case somebody wants to take this on some day:

I just noticed that the "tr" from 9base handles UTF-8 correctly.
9base is a Unix port of the Plan 9 utilities: http://tools.suckless.org/9base

I haven't yet taken a closer look at the sources of either GNU tr or
9base tr, but maybe somebody else can benefit from them.

Kind Regards,
Michael






Information forwarded to bug-coreutils <at> gnu.org:
bug#12192; Package coreutils. (Sat, 15 Sep 2012 10:30:02 GMT)

Message #23 received at 12192 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Michael Stummvoll <michael <at> stummi.org>
Cc: 12192 <at> debbugs.gnu.org
Subject: Re: bug#12192: tr - bytes vs characters
Date: Sat, 15 Sep 2012 12:28:54 +0200
forcemerge 12192 9365
thanks

Michael Stummvoll wrote:
> Hi GNU folks,
>
> As is already known, tr cannot handle multibyte encodings such as UTF-8:
>
>> mst <at> eddie:~$ echo "foo" | tr o ö
>> fÃÃ
>
> I know that multibyte encoding support is not needed for
> POSIX compliance, BUT:
>
> The man page of tr says the following:
>
>> Translate, squeeze, and/or delete characters from standard input,
>> writing to standard output.
>
> and that is the inconsistency, IMHO.
>
> The typical interpretation of "character" in such a context is one
> character on display, regardless of which encoding is used or how many
> bytes are needed to represent it. So if tr really translates "characters",
> it should preserve the encoding. If it does not, it translates "bytes",
> not "characters". So I see two ways forward:
>
> - add multibyte encoding support to tr
> or
> - change the man page and help text to say "bytes" instead of "characters"
>
> Since nobody seems to want to add that support to tr, updating
> the man page would be the easier way to restore consistency.

Thanks for the report.
I'm merging this issue with the others that relate to tr
and multi-byte support.




Forcibly Merged 9365 9569 10880 12192. Request was from Jim Meyering <jim <at> meyering.net> to control <at> debbugs.gnu.org. (Sat, 15 Sep 2012 10:30:04 GMT)

Forcibly Merged 9365 9569 10880 12192 13362. Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Sun, 06 Jan 2013 12:24:03 GMT)

Severity set to 'wishlist' from 'normal'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 15 Oct 2018 14:07:02 GMT)

Changed bug title to 'multibyte: tr: TR operates on bytes, not characters' from 'tr - bytes vs characters'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 15 Oct 2018 14:07:02 GMT)

This bug report was last modified 6 years and 249 days ago.
