GNU bug report logs - #12192
multibyte: tr: TR operates on bytes, not characters
To reply to this bug, email your comments to 12192 AT debbugs.gnu.org.
Report forwarded to bug-coreutils <at> gnu.org: bug#12192; Package coreutils. (Mon, 13 Aug 2012 13:02:02 GMT) Full text and rfc822 format available.
Acknowledgement sent to Michael Stummvoll <michael <at> stummi.org>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Mon, 13 Aug 2012 13:02:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
Hi GNU folks,
as is already known, tr cannot handle multibyte encodings like UTF-8:
> mst@eddie:~$ echo "foo" | tr o ö
> fÃÃ
I know that multibyte encoding support is not needed for
POSIX compliance, BUT:
the manpage of tr says the following:
> Translate, squeeze, and/or delete characters from standard input,
> writing to standard output.
and that's the inconsistency, IMHO.
The typical interpretation of "character" in such a context is one
character on display, regardless of which encoding is used or how many
bytes are used to represent it. So, if tr really translates "characters",
it should preserve the encoding. If it doesn't, it does not
translate "characters" but "bytes". So I see two ways forward:
- add multibyte encoding support to tr
or
- change the manpage and help text to say "bytes" instead of "characters"
Since it doesn't seem that anybody wants to add the support to tr, an
update of the manpage would be the easier way to ensure consistency.
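The output in the example can be reproduced with a small model of byte-wise translation. This is only a sketch in Python, not GNU tr's actual code: the name `tr_bytes` is hypothetical, and it assumes the arguments arrive UTF-8 encoded and that extra bytes at the end of the second set are ignored.

```python
def tr_bytes(data: bytes, set1: bytes, set2: bytes) -> bytes:
    """Translate byte by byte, the way a byte-oriented tr would."""
    if not set2:
        raise ValueError("set2 must not be empty")
    # Extra bytes at the end of set2 are ignored; a short set2 is padded
    # with its last byte so that both sets have the same length.
    set2 = set2[:len(set1)].ljust(len(set1), set2[-1:])
    return data.translate(bytes.maketrans(set1, set2))


# "ö" is two bytes in UTF-8 (0xC3 0xB6), so only 0xC3 is paired with "o".
out = tr_bytes(b"foo\n", b"o", "ö".encode("utf-8"))
print(out)  # b'f\xc3\xc3\n'

# Shown on a Latin-1 terminal, the byte 0xC3 is rendered as "Ã".
assert out.decode("latin-1") == "fÃÃ\n"

# Translating decoded characters instead keeps the encoding intact.
assert "foo\n".translate(str.maketrans("o", "ö")) == "föö\n"
```

The result `f\xc3\xc3` is not valid UTF-8, which is why the terminal shows garbage instead of "föö".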
Kind regards,
Michael
Information forwarded to bug-coreutils <at> gnu.org: bug#12192; Package coreutils. (Mon, 13 Aug 2012 14:03:01 GMT)
Message #8 received at 12192 <at> debbugs.gnu.org:
On 08/13/2012 06:52 AM, Michael Stummvoll wrote:
> Hi GNU folks,
>
> as is already known, tr cannot handle multibyte encodings like UTF-8:
>
>> mst@eddie:~$ echo "foo" | tr o ö
>> fÃÃ
>
> I know that multibyte encoding support is not needed for
> POSIX compliance,
Actually, POSIX _does_ require multi-byte support; it's just that no one
has yet contributed code for this upstream that is easy enough to
maintain and does not penalize single-byte locales. Patches are welcome.
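One way to read the "without penalizing single-byte locales" constraint is a two-path design: keep the plain byte table whenever it is provably safe, and decode only when a set contains a multibyte character. The Python sketch below illustrates that idea under stated assumptions; it is not a proposed patch. `translate` and the `BYTE_SAFE` whitelist are hypothetical names, and the whitelist is illustrative rather than complete.

```python
import codecs

# Hypothetical whitelist: encodings in which a one-byte character can never
# occur inside a longer sequence, so a 256-entry byte table is safe.
BYTE_SAFE = {codecs.lookup(name).name for name in ("ascii", "latin-1", "utf-8")}


def translate(data: bytes, set1: str, set2: str, encoding: str = "utf-8") -> bytes:
    """Map each character of set1 to the character at the same index in set2."""
    if len(set1) != len(set2):
        raise ValueError("this sketch requires sets of equal length")
    b1, b2 = set1.encode(encoding), set2.encode(encoding)
    # Every character encodes to at least one byte, so equal lengths mean
    # that each character in both sets is exactly one byte long.
    one_byte_each = len(b1) == len(set1) and len(b2) == len(set2)
    if one_byte_each and codecs.lookup(encoding).name in BYTE_SAFE:
        # Fast path: one table lookup per byte, no decoding at all.
        return data.translate(bytes.maketrans(b1, b2))
    # Slow path: decode, translate characters, re-encode. surrogateescape
    # lets undecodable input bytes pass through unchanged.
    text = data.decode(encoding, errors="surrogateescape")
    table = str.maketrans(set1, set2)
    return text.translate(table).encode(encoding, errors="surrogateescape")
```

With this split, `translate(b"abc", "a", "x")` never decodes its input, while `translate(b"foo\n", "o", "ö")` takes the slow path and yields the UTF-8 bytes of "föö".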
--
Eric Blake eblake <at> redhat.com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Information forwarded to bug-coreutils <at> gnu.org: bug#12192; Package coreutils. (Tue, 14 Aug 2012 02:55:01 GMT)
Message #11 received at 12192 <at> debbugs.gnu.org:
On 08/13/2012 06:54 AM, Eric Blake wrote:
> POSIX _does_ require multi-byte support
The last time I checked, POSIX did not require
the implementation to provide any multibyte locales.
Has this changed?
But yes, the main thing is for someone to contribute
correct, easy-to-maintain, and efficient code.
Information forwarded to bug-coreutils <at> gnu.org: bug#12192; Package coreutils. (Tue, 14 Aug 2012 05:44:02 GMT)
Message #14 received at 12192 <at> debbugs.gnu.org:
On 08/13/2012 08:45 PM, Paul Eggert wrote:
> On 08/13/2012 06:54 AM, Eric Blake wrote:
>> POSIX _does_ require multi-byte support
>
> The last time I checked, POSIX did not require
> the implementation to provide any multibyte locales.
> Has this changed?
Fair enough - POSIX does not require the existence of a multibyte
locale; an embedded system that provides only single-byte encodings can
still be POSIX-compliant. But POSIX _does_ require that tr be
locale-aware, and therefore if an implementation provides multibyte
locales (which most desktop glibc-based GNU/Linux systems do), then tr
should honor those locales, including multibyte character support.
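To make "honor those locales" concrete: the same input bytes and the same argument bytes can denote different characters depending on the locale's codeset, so a locale-aware tool may legitimately produce different output. The Python sketch below models deletion (as in `tr -d`); `delete_chars` is a hypothetical helper, and its `encoding` parameter stands in for the codeset a real program would take from the environment.

```python
def delete_chars(data: bytes, set1: bytes, encoding: str) -> bytes:
    """Delete from data every character that occurs in set1.

    Both arguments are raw bytes, as a program receives them; `encoding`
    stands in for the codeset of the current locale.
    """
    text = data.decode(encoding)
    doomed = set(set1.decode(encoding))
    return "".join(ch for ch in text if ch not in doomed).encode(encoding)


data = "fö¶".encode("utf-8")  # b'f\xc3\xb6\xc2\xb6'
arg = "ö".encode("utf-8")     # b'\xc3\xb6'

# UTF-8 locale: the argument is one character, and only "ö" is removed.
print(delete_chars(data, arg, "utf-8"))    # b'f\xc2\xb6'
# Latin-1 locale: the argument is two characters, 0xC3 and 0xB6, and every
# occurrence of either byte is removed on its own.
print(delete_chars(data, arg, "latin-1"))  # b'f\xc2'
```

The Latin-1 result matches what a purely byte-oriented tool does in every locale, which is the behavior the report describes.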
>
> But yes, the main thing is for someone to contribute
> correct, easy-to-maintain, and efficient code.
We're in violent agreement on this point :)
--
Eric Blake eblake <at> redhat.com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Information forwarded to bug-coreutils <at> gnu.org: bug#12192; Package coreutils. (Tue, 14 Aug 2012 07:54:01 GMT)
Message #17 received at 12192 <at> debbugs.gnu.org:
On 08/13/2012 10:34 PM, Eric Blake wrote:
> But POSIX _does_ require that tr be
> locale-aware, and therefore if an implementation provides multibyte
> locales (which most desktop glibc-based GNU/Linux systems do), then tr
> should honor those locales, including multibyte character support.
All this is absolutely correct; but still, if the issue is merely POSIX
conformance, these glibc-based GNU/Linux systems do conform to POSIX,
since the POSIX-conformance document for these systems can state that
the supported locales are merely the single-byte locales. Admittedly this
is legal hairsplitting, but if POSIX compliance is the issue
then one is in legal-hairsplitting mode already....
Information forwarded to bug-coreutils <at> gnu.org: bug#12192; Package coreutils. (Fri, 17 Aug 2012 12:13:02 GMT)
Message #20 received at 12192 <at> debbugs.gnu.org:
Hi there,
> But yes, the main thing is for someone to contribute
> correct, easy-to-maintain, and efficient code.
Just for the record, in case somebody wants to tackle this some day:
I just noticed that the "tr" from 9base handles UTF-8 correctly.
9base is a Unix port of the Plan 9 utilities: http://tools.suckless.org/9base
I haven't yet taken a closer look at the sources of either GNU tr or
9base tr, but maybe somebody else could benefit from them.
Kind Regards,
Michael
Information forwarded to bug-coreutils <at> gnu.org: bug#12192; Package coreutils. (Sat, 15 Sep 2012 10:30:02 GMT)
Message #23 received at 12192 <at> debbugs.gnu.org:
forcemerge 12192 9365
thanks
Michael Stummvoll wrote:
> Hi GNU folks,
>
> as is already known, tr cannot handle multibyte encodings like UTF-8:
>
>> mst@eddie:~$ echo "foo" | tr o ö
>> fÃÃ
>
> I know that multibyte encoding support is not needed for
> POSIX compliance, BUT:
>
> the manpage of tr says the following:
>
>> Translate, squeeze, and/or delete characters from standard input,
>> writing to standard output.
>
> and that's the inconsistency, IMHO.
>
> The typical interpretation of "character" in such a context is one
> character on display, regardless of which encoding is used or how many
> bytes are used to represent it. So, if tr really translates "characters",
> it should preserve the encoding. If it doesn't, it does not
> translate "characters" but "bytes". So I see two ways forward:
>
> - add multibyte encoding support to tr
> or
> - change the manpage and help text to say "bytes" instead of "characters"
>
> Since it doesn't seem that anybody wants to add the support to tr, an
> update of the manpage would be the easier way to ensure consistency.
Thanks for the report.
I'm merging this issue with the others that relate to tr
and multi-byte support.
Severity set to 'wishlist' from 'normal'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 15 Oct 2018 14:07:02 GMT)
Changed bug title to 'multibyte: tr: TR operates on bytes, not characters' from 'tr - bytes vs characters'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 15 Oct 2018 14:07:02 GMT)
This bug report was last modified 6 years and 249 days ago.