GNU bug report logs -
#36094
Possible sed bug
Report forwarded to bug-sed <at> gnu.org: bug#36094; Package sed. (Wed, 05 Jun 2019 02:16:02 GMT)
Acknowledgement sent to Roel Van de Paar <roel.vandepaar <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-sed <at> gnu.org. (Wed, 05 Jun 2019 02:16:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
See attached 'test' file with some non-ASCII chars.
$ cat test
a�-�-
Then:
$ sed -i "s|.*|allgone|" test && cat test
allgone�-�-
Or (using a fresh copy of 'test'):
$ sed -i "s|.*|allgone|gi" test && cat test
allgone�allgone�allgone
Expected output in both cases would seem to be "allgone" on the line and
nothing else?
God Bless,
Roel
[test (application/octet-stream, attachment)]
Information forwarded to bug-sed <at> gnu.org: bug#36094; Package sed. (Wed, 05 Jun 2019 14:18:02 GMT)
Message #8 received at 36094 <at> debbugs.gnu.org:
tag 36094 notabug
close 36094
stop
Hello,
On Wed, Jun 05, 2019 at 10:38:53AM +1000, Roel Van de Paar wrote:
> $ cat test
> a�-�-
>
> $ sed -i "s|.*|allgone|gi" test && cat test
> allgone�allgone�allgone
>
> Expected output in both cases would seem to be "allgone" on the line and
> nothing else?
This is not a bug, but a side effect of having invalid UTF-8 byte sequences
in the input file while working in a UTF-8 locale.
POSIX requires that the regular expression '.*' not match invalid
characters.
The 'test' input file contains two bytes with the value 255 (\xFF). These
are invalid in a UTF-8 locale, and regex matching stops at each of them.
The remaining characters in the file are matched as three separate matches
(due to the "g" flag).
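Before running a substitution, it can help to check whether a file contains such bytes at all. The following is a minimal sketch, not something from the original report: the helper name is_valid_utf8 is hypothetical, and it assumes an iconv that exits non-zero on an illegal input sequence (as glibc's iconv does).

```shell
# Hypothetical helper (not part of sed): succeeds only if the file
# decodes cleanly as UTF-8. Assumes iconv fails on illegal input.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Recreate the reported input: 'a', 0xFF, '-', 0xFF, '-', newline.
# Octal escapes (\377 = 0xFF) are used because they are portable
# across printf implementations.
sample=$(mktemp)
printf 'a\377-\377-\n' > "$sample"

if is_valid_utf8 "$sample"; then
    echo "valid UTF-8"
else
    echo "invalid UTF-8: '.' will not match every byte in a UTF-8 locale"
fi
rm -f "$sample"
```

A check like this does not depend on which locales are installed, since iconv is told the encoding explicitly.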
The simplest solution when working with such files is to force the C locale,
where all bytes are considered valid (but then you lose UTF-8
capabilities). Compare:
$ LC_ALL=en_CA.utf8 sed "s|.*|allgone|g" test | od -An -c
a l l g o n e 255 a l l g o n e 255
a l l g o n e \n
$ LC_ALL=C sed "s|.*|allgone|g" test | od -An -c
a l l g o n e \n
But then multi-byte UTF8 characters are processed as individual bytes:
$ printf "\U1011\n" | LC_ALL=en_CA.utf8 sed 's/./A/g'
A
$ printf "\U1011\n" | LC_ALL=C sed 's/./A/g'
AAA
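What is lost in the C locale is character-level interpretation, not the bytes themselves. As a rough illustration (the sample strings are made up for this sketch), a pattern that only mentions ASCII text still works on UTF-8 input, while '.' counts bytes instead of characters:

```shell
# 'FOO' is pure ASCII, so a byte-wise match in the C locale leaves the
# two-byte UTF-8 encoding of e-acute (\303\251) untouched in the output.
printf 'caf\303\251 FOO\n' | LC_ALL=C sed 's/FOO/bar/'

# '.' now matches single bytes: the same two-byte character yields "AA".
printf '\303\251\n' | LC_ALL=C sed 's/./A/g'
```

Setting LC_ALL=C on the single sed invocation, rather than exporting it, keeps the rest of the shell session in its usual locale.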
As a side note, this is the reason GNU sed has the non-standard 'z' command
to clear the pattern space: the more intuitive 's/.*//' command fails
to clear a pattern space containing invalid characters.
$ printf "FOO \xFF\n" | LC_ALL=en_CA.utf8 sed 's/.*//g' | od -An -tx1
ff 0a
$ printf "FOO \xFF\n" | LC_ALL=en_CA.utf8 sed 'z' | od -An -tx1
0a
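Applied to the original report, 'z' can be combined with a substitution to replace a whole line regardless of locale. This is only a sketch: the helper name replace_whole_line is hypothetical, it relies on the GNU-specific 'z' command, and it assumes the replacement text contains no '/', '&', or backslash.

```shell
# Hypothetical helper: replace every input line with the text in $1.
# 'z' (GNU extension) empties the pattern space without using a regex,
# so invalid bytes cannot interrupt it; 's/^/.../' then inserts the text
# at the start of the now-empty pattern space.
# Assumption: $1 contains no '/', '&', or backslash.
replace_whole_line() {
    sed "z; s/^/$1/"
}

# Same bytes as the reported 'test' file (\377 = 0xFF).
printf 'a\377-\377-\n' | replace_whole_line allgone
```

With GNU sed this should print a single line reading allgone, whether or not the current locale is UTF-8.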
I'm closing this as "not a bug", but discussion can continue by replying
to this thread.
regards,
- assaf
Added tag(s) notabug. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Wed, 05 Jun 2019 14:18:02 GMT)
bug closed, send any further explanations to 36094 <at> debbugs.gnu.org and Roel Van de Paar <roel.vandepaar <at> gmail.com>. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Wed, 05 Jun 2019 14:18:03 GMT)
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 04 Jul 2019 11:24:08 GMT)
This bug report was last modified 6 years and 46 days ago.