GNU bug report logs -
#36094
Possible sed bug
tag 36094 notabug
close 36094
stop
Hello,
On Wed, Jun 05, 2019 at 10:38:53AM +1000, Roel Van de Paar wrote:
> $ cat test
> a�-�-
>
> $ sed -i "s|.*|allgone|gi" test && cat test
> allgone�allgone�allgone
>
> Expected output in both cases would seem to be "allgone" on the line and
> nothing else?
This is not a bug, but a side effect of having invalid UTF-8 byte sequences
in the input file while working in a UTF-8 locale.
POSIX requires that the '.*' regular expression not match invalid
characters.
The 'test' input file contains two bytes with the value 255 (\xFF). These
are invalid in a UTF-8 locale, and regex matching stops at these bytes.
The remaining characters in the file are therefore matched as three
separate matches (due to the "g" flag).
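One way to confirm that a file contains bytes that are invalid in UTF-8 is
to run it through iconv, which does not depend on the current locale. This
is only a sketch: the file name sample.txt and the variable utf8_status are
made up for illustration, and it assumes an iconv that exits non-zero on
invalid input (as the glibc one does).

```shell
# Create a sample file with two bytes of value 255 (octal 377).
# "sample.txt" is a hypothetical name used only for this sketch.
printf 'a\377-\377-\n' > sample.txt

# Converting UTF-8 to UTF-8 is a no-op for valid input; iconv is
# assumed to fail (non-zero exit status) on an invalid byte sequence.
if iconv -f UTF-8 -t UTF-8 sample.txt >/dev/null 2>&1; then
    utf8_status=valid
else
    utf8_status=invalid
fi
echo "$utf8_status"
```

If the check reports invalid input, that is a hint that '.' and '.*' may
stop matching partway through a line in a UTF-8 locale.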
The simplest solution when working with such files is to force the C locale,
where all bytes are considered valid (but then you lose UTF-8
capabilities). Compare:
$ LC_ALL=en_CA.utf8 sed "s|.*|allgone|g" test | od -An -c
a l l g o n e 255 a l l g o n e 255
a l l g o n e \n
$ LC_ALL=C sed "s|.*|allgone|g" test | od -An -c
a l l g o n e \n
In the C locale, however, multi-byte UTF-8 characters are processed as individual bytes:
$ printf "\U1011\n" | LC_ALL=en_CA.utf8 sed 's/./A/g'
A
$ printf "\U1011\n" | LC_ALL=C sed 's/./A/g'
AAA
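Both sides of this trade-off can be reproduced without a UTF-8 locale
installed, since the C locale is always available. A small sketch, assuming
a printf that accepts octal escapes; sample.txt and the variable names are
hypothetical. Here \377 is the byte 255, and \341\200\221 is the UTF-8
encoding of U+1011.

```shell
# Input with two invalid bytes, similar to the file in the report.
printf 'a\377-\377-\n' > sample.txt

# In the C locale every byte is a valid character, so '.*' spans
# the whole line and a single substitution replaces all of it.
whole_line=$(LC_ALL=C sed 's|.*|allgone|' sample.txt)
echo "$whole_line"

# The cost: one three-byte UTF-8 character is seen as three
# separate characters, so '.' matches three times.
per_byte=$(printf '\341\200\221\n' | LC_ALL=C sed 's/./A/g')
echo "$per_byte"
```

Setting LC_ALL on the sed command alone keeps the rest of the shell session
in its usual locale.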
As a side note, this is the reason GNU sed has the non-standard 'z' command
to clear the pattern space - the more intuitive 's/.*//' command will fail
to clear a pattern space containing invalid characters.
$ printf "FOO \xFF\n" | LC_ALL=en_CA.utf8 sed 's/.*//g' | od -An -tx1
ff 0a
$ printf "FOO \xFF\n" | LC_ALL=en_CA.utf8 sed 'z' | od -An -tx1
0a
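For scripts that may also run with a sed that lacks 'z', one possible
approach is to probe for the command and fall back to 's/.*//' in the C
locale. This is a sketch, not something GNU sed provides: clear_line and
cleared_hex are hypothetical names, and the probe assumes that a sed
without 'z' exits non-zero on the unknown command.

```shell
# clear_line: hypothetical helper that empties every input line.
clear_line() {
    # Probe on a throwaway pipe so the function's own stdin is untouched.
    if printf 'x\n' | sed 'z' >/dev/null 2>&1; then
        sed 'z'                  # GNU extension: clear the pattern space
    else
        LC_ALL=C sed 's/.*//'    # fallback: every byte is valid in C
    fi
}

# Same input as above: "FOO " followed by byte 255 (octal 377).
cleared_hex=$(printf 'FOO \377\n' | clear_line | od -An -tx1 | tr -d ' \n')
echo "$cleared_hex"
```

Only the newline (0a) should remain in the output, matching the 'z'
example above.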
I'm closing this as "not a bug", but discussion can continue by replying
to this thread.
regards,
- assaf
This bug report was last modified 6 years and 46 days ago.