GNU bug report logs - #36094
Possible sed bug


Package: sed;

Reported by: Roel Van de Paar <roel.vandepaar <at> gmail.com>

Date: Wed, 5 Jun 2019 02:16:02 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 36094 in the body.
You can then email your comments to 36094 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-sed <at> gnu.org:
bug#36094; Package sed. (Wed, 05 Jun 2019 02:16:02 GMT)

Acknowledgement sent to Roel Van de Paar <roel.vandepaar <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-sed <at> gnu.org. (Wed, 05 Jun 2019 02:16:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Roel Van de Paar <roel.vandepaar <at> gmail.com>
To: bug-sed <at> gnu.org
Subject: Possible sed bug
Date: Wed, 5 Jun 2019 10:38:53 +1000
See attached 'test' file with some non-ASCII chars.

$ cat test
a�-�-

Then:

$ sed -i "s|.*|allgone|" test && cat test
allgone�-�-

Or (using a fresh copy of 'test'):

$ sed -i "s|.*|allgone|gi" test && cat test
allgone�allgone�allgone

Expected output in both cases would seem to be "allgone" on the line and
nothing else?

God Bless,
Roel
[test (application/octet-stream, attachment)]

Information forwarded to bug-sed <at> gnu.org:
bug#36094; Package sed. (Wed, 05 Jun 2019 14:18:02 GMT)

Message #8 received at 36094 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Roel Van de Paar <roel.vandepaar <at> gmail.com>
Cc: 36094 <at> debbugs.gnu.org
Subject: Re: bug#36094: Possible sed bug
Date: Wed, 5 Jun 2019 08:17:32 -0600
tag 36094 notabug
close 36094
stop

Hello,

On Wed, Jun 05, 2019 at 10:38:53AM +1000, Roel Van de Paar wrote:
> $ cat test
> a�-�-
> 
> $ sed -i "s|.*|allgone|gi" test && cat test
> allgone�allgone�allgone
> 
> Expected output in both cases would seem to be "allgone" on the line and
> nothing else?

This is not a bug, but a side-effect of having invalid UTF8 characters in
the input file, while working with a UTF8 locale.

POSIX requires that the '.*' regular expression not match invalid
characters.
The 'test' input file contains two bytes with value 255 (\xFF). These are
invalid under a UTF8 locale, and regex matching stops at these bytes.
The remaining characters in the file form three separate matches
(due to the "g" flag), and each match is replaced.
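
To see the bytes being discussed, the input can be recreated and dumped. This is only a minimal sketch: the file name 'sample' is a placeholder, and its contents ('a', 0xFF, '-', 0xFF, '-', newline) are inferred from the description above, not copied from the actual attachment.

```shell
# Recreate an input like the one described above (contents are an assumption:
# 'a', byte 0xFF, '-', byte 0xFF, '-', newline). Octal \377 is byte 255.
printf 'a\377-\377-\n' > sample

# Dump every byte in hex so the invalid 0xFF bytes become visible.
od -An -tx1 sample
```

Octal escapes are used here because any POSIX printf understands them, whereas \x escapes are an extension.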

The simplest solution when working with such files is to force the C locale,
where all bytes are considered valid (but then you lose UTF8
capabilities). Compare:

    $ LC_ALL=en_CA.utf8 sed "s|.*|allgone|g" test | od -An -c
       a   l   l   g   o   n   e 255   a   l   l   g   o   n   e 255
       a   l   l   g   o   n   e  \n

    $ LC_ALL=C sed "s|.*|allgone|g" test | od -An -c
       a   l   l   g   o   n   e  \n
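
If the C locale should only be forced when it is actually needed, the input can be checked first. The following is a sketch under stated assumptions: 'has_invalid_utf8' and 'sample_bad' are hypothetical names, and the check relies on iconv exiting with a non-zero status when the input does not decode as UTF-8.

```shell
# has_invalid_utf8 FILE (hypothetical helper): exit status 0 if FILE does not
# decode as UTF-8. Relies on iconv failing on undecodable input; note that an
# unreadable FILE is also reported as invalid.
has_invalid_utf8() {
    ! iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Sample input (assumed contents, matching the bytes described above).
printf 'a\377-\377-\n' > sample_bad

if has_invalid_utf8 sample_bad; then
    # Every byte is a valid character in the C locale, so '.*' spans the line.
    LC_ALL=C sed 's|.*|allgone|' sample_bad
else
    # Input is clean UTF-8: keep the current locale and its multi-byte handling.
    sed 's|.*|allgone|' sample_bad
fi
```

Setting LC_ALL on the sed command alone keeps the rest of the shell session in its original locale; only that one invocation treats every byte as a valid single-byte character.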

But then multi-byte UTF8 characters are processed as individual bytes:

    $ printf "\U1011\n" | LC_ALL=en_CA.utf8 sed 's/./A/g'
    A

    $ printf "\U1011\n" | LC_ALL=C sed 's/./A/g'
    AAA
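
The three 'A's correspond to the three bytes of the UTF-8 encoding of U+1011. A small sketch that makes this visible, written with octal escapes because the \U escape is an extension of bash and GNU printf and may not be available in other shells:

```shell
# U+1011 encodes in UTF-8 as the three bytes E1 80 91 (octal 341 200 221).
printf '\341\200\221\n' | od -An -tx1

# In the C locale each byte is one character, so '.' matches three times.
printf '\341\200\221\n' | LC_ALL=C sed 's/./A/g'
```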



As a side note, this is the reason GNU sed has the non-standard 'z' command
to clear the pattern space: the more intuitive 's/.*//' command will fail
to clear a pattern space containing invalid characters.

    $ printf "FOO \xFF\n" | LC_ALL=en_CA.utf8 sed 's/.*//g' | od -An -tx1
     ff 0a
    $ printf "FOO \xFF\n" | LC_ALL=en_CA.utf8 sed 'z' | od -An -tx1
     0a
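
For the goal in the original report (replacing the whole line regardless of the bytes it holds), 'z' can be combined with a substitution. This is a sketch that assumes GNU sed, since 'z' is not part of POSIX sed; the input bytes are the same assumed contents as in the report.

```shell
# GNU sed only. 'z' empties the pattern space whatever bytes it holds;
# 's/^/allgone/' then inserts the replacement text into the empty line.
printf 'a\377-\377-\n' | sed 'z; s/^/allgone/'
```

Unlike the LC_ALL=C approach, this keeps the current locale for any other commands in the same sed script.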


I'm closing this as "not a bug", but discussion can continue by replying
to this thread.

regards,
 - assaf




Added tag(s) notabug. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Wed, 05 Jun 2019 14:18:02 GMT)

bug closed, send any further explanations to 36094 <at> debbugs.gnu.org and Roel Van de Paar <roel.vandepaar <at> gmail.com> Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Wed, 05 Jun 2019 14:18:03 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 04 Jul 2019 11:24:08 GMT)



