GNU bug report logs -
#10880
multibyte: tr: TR operates on bytes, not characters
To reply to this bug, email your comments to 10880 AT debbugs.gnu.org.
Report forwarded to bug-coreutils <at> gnu.org: bug#10880; Package coreutils. (Fri, 24 Feb 2012 17:31:02 GMT)
Acknowledgement sent to "Marton Kadar" <marton.kadar <at> mail.com>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Fri, 24 Feb 2012 17:31:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Don't know which is the official way to report a bug in 'tr'
so I will copy to this list too. CC me on replies as I am not
subscribing.
> ----- Original Message -----
> From: Marton Kadar
> Sent: 02/24/12 03:18 PM
> To: 9365 <at> debbugs.gnu.org
> Subject: Example
>
> Environment for Hungary where á and í are proper lowercase letters
> but for example Spanish has these letters too:
>
> $ set | grep ^L
> LANG=hu_HU.UTF-8
> LC_ALL=hu_HU.UTF-8
> LINES=73
> LOGNAME=kadar1marto518
>
> Now let's see the bytestream for the following string
> (which means flood in Hungarian):
>
> $ echo árvíz | od -c
> 0000000 303 241 r v 303 255 z \n
> 0000010
>
> Let us try to delete a character and see if it worked:
>
> $ echo árvíz | tr -d á | od -c
> 0000000 r v 255 z \n
> 0000005
>
> The correct, expected behavior would instead be:
>
> $ echo árvíz | tr -d á | od -c
> 0000000 r v 303 255 z \n
> 0000006
>
> I'll check the source for tr myself, although I have never coded in C.
> This should be a trivial fix. The problem is especially annoying
> because we currently have no simple, good general-purpose case
> conversion tool (correct me if I'm wrong, but tr should be this
> tool).
>
> Marton Kadar
Forcibly Merged 9365 9569 10880. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Fri, 24 Feb 2012 18:33:02 GMT)
Information forwarded to bug-coreutils <at> gnu.org: bug#10880; Package coreutils. (Sat, 25 Feb 2012 03:32:01 GMT)
Message #10 received at 10880 <at> debbugs.gnu.org (full text, mbox):
On 02/24/2012 07:29 AM, Marton Kadar wrote:
> Don't know which is the official way to report a bug in 'tr'
> so I will copy to this list too. CC me on replies as I am not
> subscribing.
Sending mail to coreutils <at> gnu.org _is_ what creates a bug on
debbugs.gnu.org, so you have managed to create a duplicate. Paul Eggert
has already merged 9365, 10880, and 9569, so now, replying to any one of
those three is merely adding information to the same report.
>>
>> Let us try to delete a character and see if it worked:
>>
>> $ echo árvíz | tr -d á | od -c
>> 0000000 r v 255 z \n
>> 0000005
Please keep in mind that upstream coreutils is not yet converted over to
multibyte support. This is evidence of one of the places that multibyte
support is required, and therefore, where you cannot expect things to
work yet. No one has yet contributed a maintainable patch that does not
penalize single-byte locales, at least not upstream. Several distros
have their own UTF-8 patches that they apply, but then, this would be a
bug you report to your distro and not upstream.
>> I'll check the source for tr myself, although I have never coded in C.
>> This should be a trivial fix.
Alas, dealing with multibyte characters without penalizing single-byte
locales is NOT trivial, or it would have been done long ago.
--
Eric Blake eblake <at> redhat.com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
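
The truncated output quoted in this message follows directly from deleting bytes rather than characters. The sketch below is only an illustration of that difference, not how tr is implemented: `delete_bytewise`, `delete_charwise`, and `od_c` are hypothetical helper names, UTF-8 input is assumed, and `od_c` only roughly imitates the `od -c` listing (no offsets).

```python
def delete_bytewise(data: bytes, delete_set: bytes) -> bytes:
    """Drop every byte that occurs in delete_set (byte-oriented deletion)."""
    return data.translate(None, delete_set)


def delete_charwise(data: bytes, delete_set: bytes, encoding: str = "utf-8") -> bytes:
    """Decode first, drop whole characters, then re-encode."""
    doomed = set(delete_set.decode(encoding))
    text = data.decode(encoding)
    return "".join(ch for ch in text if ch not in doomed).encode(encoding)


def od_c(data: bytes) -> str:
    """Rough imitation of an 'od -c' row: printable ASCII as-is, the rest in octal."""
    special = {0x0A: "\\n"}
    return " ".join(
        special.get(b) or (chr(b) if 0x20 < b < 0x7F else f"{b:03o}")
        for b in data
    )


word = "\u00e1rv\u00edz\n".encode("utf-8")   # "árvíz" plus newline, 8 bytes
a_acute = "\u00e1".encode("utf-8")           # bytes 303 241 in octal

print(od_c(word))
print(od_c(delete_bytewise(word, a_acute)))  # the lead byte 303 of "í" is lost too
print(od_c(delete_charwise(word, a_acute)))  # "í" survives intact
```

Because á and í share the lead byte 303, removing the two bytes of á also strips the first byte of í, which is what the reported `od -c` output shows.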
Information forwarded to bug-coreutils <at> gnu.org: bug#10880; Package coreutils. (Sat, 25 Feb 2012 22:11:01 GMT)
Message #13 received at 10880 <at> debbugs.gnu.org (full text, mbox):
> ----- Original Message -----
> From: Eric Blake
> Sent: 02/25/12 04:28 AM
> To: Marton Kadar
> Subject: Re: bug#10880: instead of characters, tr works on bytes
>
> On 02/24/2012 07:29 AM, Marton Kadar wrote:
> > Don't know which is the official way to report a bug in 'tr'
> > so I will copy to this list too. CC me on replies as I am not
> > subscribing.
>
> Sending mail to coreutils <at> gnu.org _is_ what creates a bug on
> debbugs.gnu.org, so you have managed to create a duplicate. Paul Eggert
> has already merged 9365, 10880, and 9569, so now, replying to any one of
> those three is merely adding information to the same report.
>
> >>
> >> Let us try to delete a character and see if it worked:
> >>
> >> $ echo árvíz | tr -d á | od -c
> >> 0000000 r v 255 z \n
> >> 0000005
>
> Please keep in mind that upstream coreutils is not yet converted over to
> multibyte support. This is evidence of one of the places that multibyte
> support is required, and therefore, where you cannot expect things to
> work yet. No one has yet contributed a maintainable patch that does not
> penalize single-byte locales, at least not upstream. Several distros
> have their own UTF-8 patches that they apply, but then, this would be a
> bug you report to your distro and not upstream.
>
> >> I'll check the source for tr myself, although I have never coded in C.
> >> This should be a trivial fix.
>
> Alas, dealing with multibyte characters without penalizing single-byte
> locales is NOT trivial, or it would have been done long ago.
"Penalizing" single-byte locales - did you mean in performance or in functionality?
I understand that a generalized algorithm would probably be slower than one tuned for the single-byte case.
But I suspect that you are also referring to some functional implication, since avoiding a purely performance-related penalty in text-handling command-line utilities can never justify incorrect functionality.
Besides, the execution path (single-byte specific or generalized multibyte capable) can be determined at program startup, so in the worst case there can be a tr and a tr-slow-but-multibyte version, the former calling the latter when the locale settings require it.
A minimal "solution" could also be to put a warning on each affected program's man page: "Multibyte locales currently unsupported!". It is not always immediately apparent what the problem is: in many special cases the tools happen to work as expected, while in very similar cases they do not.
>
> --
> Eric Blake eblake <at> redhat.com +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
Information forwarded to bug-coreutils <at> gnu.org: bug#10880; Package coreutils. (Sat, 25 Feb 2012 23:24:02 GMT)
Message #16 received at 10880 <at> debbugs.gnu.org (full text, mbox):
On 02/25/2012 02:07 PM, Marton Kadar wrote:
> the execution path (single-byte specific or generalized
> multibyte capable) can be determined at program startup, so in the
> worst case there can be a tr and a tr-slow-but-multibyte version,
> former calling the latter when so directed by the locale settings.
Something like that should work, yes. Unfortunately so far nobody has
volunteered to do it. The task would not be trivial. We don't want
to maintain two copies of the code, one for single-byte and one for
multibyte, as that'd be a maintenance problem. Instead, we'd like to
have just one copy of the code, which is easy to read and which
compiles into either unibyte or multibyte versions.
> avoiding a solely performance related penalty in text handling
> command line utilities can never be a justifiable reason for
> incorrect functionality.
As far as I know there is no requirement in POSIX that applications
must support multibyte locales, and there's no documentation claiming
that the utilities in question support multibyte locales, so this is
not a bug; it's a feature request.
My opinion about this may be colored by an experience I had yesterday
with the latest version of GNU sed. Single-byte it worked fine;
multibyte it was so slow that I gave up. We don't want this to
happen with the core utilities.
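
The startup-time choice of execution path discussed in this exchange can be sketched as follows. This is only an illustration under stated assumptions: `MULTIBYTE_CODECS`, `is_multibyte`, and `make_deleter` are hypothetical names, and the codec list is a crude stand-in for a C-level check such as `MB_CUR_MAX > 1` after `setlocale`. It shows the dispatch idea only; it does not address the stated goal of a single source that compiles into unibyte and multibyte versions.

```python
import codecs
from typing import Callable

# Hypothetical stand-in for the C-level test "MB_CUR_MAX > 1"; a real tool
# would query the locale instead of keeping a list like this.
MULTIBYTE_CODECS = {"utf-8", "euc_jp", "shift_jis", "gb18030"}


def is_multibyte(encoding: str) -> bool:
    """Return True if the codec may use more than one byte per character."""
    return codecs.lookup(encoding).name in MULTIBYTE_CODECS


def make_deleter(encoding: str, delete_chars: str) -> Callable[[bytes], bytes]:
    """Pick the deletion routine once, before any input is processed."""
    if not is_multibyte(encoding):
        doomed_bytes = delete_chars.encode(encoding)
        # Fast path: one byte is one character, so a byte filter is exact.
        return lambda data: data.translate(None, doomed_bytes)

    doomed_chars = set(delete_chars)

    def delete_decoded(data: bytes) -> bytes:
        # Slow path: decode, filter whole characters, re-encode.
        text = data.decode(encoding)
        return "".join(c for c in text if c not in doomed_chars).encode(encoding)

    return delete_decoded


word = "\u00e1rv\u00edz\n"  # "árvíz" plus newline
for enc in ("iso8859-2", "utf-8"):
    result = make_deleter(enc, "\u00e1")(word.encode(enc))
    print(enc, result == "rv\u00edz\n".encode(enc))
```

Single-byte locales keep the cheap byte filter, and only multibyte locales pay for decoding, which is the trade-off both sides of the thread are weighing.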
Information forwarded to bug-coreutils <at> gnu.org: bug#10880; Package coreutils. (Mon, 27 Feb 2012 06:16:01 GMT)
Message #19 received at submit <at> debbugs.gnu.org (full text, mbox):
On Fri, Feb 24, 2012 at 09:29:12AM EST, Marton Kadar wrote:
[..]
> > $ set | grep ^L
> > LANG=hu_HU.UTF-8
> > LC_ALL=hu_HU.UTF-8
> > LINES=73
> > LOGNAME=kadar1marto518
> >
> > Now let's see the bytestream for the following string
> > (which means flood in Hungarian):
> >
> > $ echo árvíz | od -c
> > 0000000 303 241 r v 303 255 z \n
> > 0000010
> >
> > Let us try to delete a character and see if it worked:
> >
> > $ echo árvíz | tr -d á | od -c
> > 0000000 r v 255 z \n
> > 0000005
[..]
Try this for size...
$ echo árvíz | od -t x1z -w16
$ echo árvíz | tr -d é | od -t x1z -w16
$ echo árvíz | tr -d é > /tmp/u.txt
$ isutf8 /tmp/u.txt
And there is not even an ‘é’ in ‘árvíz’..
CJ
P.S. Though you do have to look for it a bit, the coreutils manual
clearly states that only single-byte encodings are supported:
http://www.gnu.org/software/coreutils/manual/html_node/tr-invocation.html
--
Mooo Canada!!!!
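
The `tr -d é` experiment in this message can be reproduced outside the shell. The sketch below assumes UTF-8 input; `delete_bytewise` and `is_utf8` are hypothetical helper names, with `is_utf8` standing in for the `isutf8` command used above.

```python
def delete_bytewise(data: bytes, delete_set: bytes) -> bytes:
    """Drop every byte that occurs in delete_set (byte-oriented deletion)."""
    return data.translate(None, delete_set)


def is_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


word = "\u00e1rv\u00edz\n".encode("utf-8")   # "árvíz": c3 a1 72 76 c3 ad 7a 0a
e_acute = "\u00e9".encode("utf-8")           # "é": c3 a9, not present in the word

damaged = delete_bytewise(word, e_acute)

print(word.hex(" "))
print(damaged.hex(" "))
print(is_utf8(word), is_utf8(damaged))
```

Although é never occurs in the input, its lead byte c3 is shared with á and í, so deleting it leaves the bare continuation bytes a1 and ad behind and the result is no longer valid UTF-8.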
Severity set to 'wishlist' from 'normal'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 15 Oct 2018 14:07:02 GMT)
Changed bug title to 'multibyte: tr: TR operates on bytes, not characters' from 'instead of characters, tr works on bytes'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 15 Oct 2018 14:07:02 GMT)
This bug report was last modified 6 years and 301 days ago.