GNU bug report logs - #36094
Possible sed bug

Package: sed

Reported by: Roel Van de Paar <roel.vandepaar <at> gmail.com>

Date: Wed, 5 Jun 2019 02:16:02 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Roel Van de Paar <roel.vandepaar <at> gmail.com>
Cc: 36094 <at> debbugs.gnu.org
Subject: bug#36094: Possible sed bug
Date: Wed, 5 Jun 2019 08:17:32 -0600

tag 36094 notabug
close 36094
stop

Hello,

On Wed, Jun 05, 2019 at 10:38:53AM +1000, Roel Van de Paar wrote:
> $ cat test
> a�-�-
> 
> $ sed -i "s|.*|allgone|gi" test && cat test
> allgone�allgone�allgone
> 
> Expected output in both cases would seem to be "allgone" on the line and
> nothing else?

This is not a bug, but a side-effect of having invalid UTF8 characters in
the input file, while working with a UTF8 locale.

POSIX requires that the regular expression '.*' not match invalid
characters.
The 'test' input file contains two bytes with the value 255 (\xFF). These
are invalid under a UTF8 locale, so regex matching stops at each of them.
The remaining characters in the file are matched as three separate
matches (due to the "g" flag), and each match is replaced on its own.

The simplest solution when working with such files is to force the C
locale, where all bytes are considered valid (but then you lose UTF8
capabilities). Compare:

    $ LC_ALL=en_CA.utf8 sed "s|.*|allgone|g" test | od -An -c
       a   l   l   g   o   n   e 255   a   l   l   g   o   n   e 255
       a   l   l   g   o   n   e  \n

    $ LC_ALL=C sed "s|.*|allgone|g" test | od -An -c
       a   l   l   g   o   n   e  \n
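
If it is not known in advance whether a file contains invalid bytes, one
possible approach is to check it first and fall back to the C locale only
when needed. The following is a minimal sketch, not part of sed itself:
the helper name is_valid_utf8 is hypothetical, the check relies on iconv
being installed, and the printf line assumes the byte content of the
'test' file from the report (a, \xFF, -, \xFF, -).

```shell
# Hypothetical helper (not part of sed): succeeds only if the file "$1"
# decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1
}

tmp=$(mktemp)
printf 'a\377-\377-\n' > "$tmp"   # assumed byte content of the 'test' file

if is_valid_utf8 "$tmp"; then
    # Valid input: keep the current (possibly UTF-8) locale.
    result=$(sed 's|.*|allgone|g' "$tmp")
else
    # Invalid bytes found: use the C locale so '.' matches every byte.
    result=$(LC_ALL=C sed 's|.*|allgone|g' "$tmp")
fi
printf '%s\n' "$result"
rm -f "$tmp"
```

Here the file fails the check, so the C-locale branch runs and the whole
line is replaced in one match.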

But then multi-byte UTF8 characters are processed as individual bytes:

    $ printf "\U1011\n" | LC_ALL=en_CA.utf8 sed 's/./A/g'
    A

    $ printf "\U1011\n" | LC_ALL=C sed 's/./A/g'
    AAA
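
The three A's correspond to the three bytes of the UTF-8 encoding of
U+1011. As a small illustration (a sketch assuming GNU sed, with the bytes
written as octal escapes so that it does not depend on the shell's \U
support):

```shell
# U+1011 is encoded in UTF-8 as the bytes E1 80 91 (octal 341 200 221).
printf '\341\200\221\n' | od -An -tx1

# In the C locale every byte is a character of its own, so '.' matches
# three times and each byte is replaced separately.
bytes_as_chars=$(printf '\341\200\221\n' | LC_ALL=C sed 's/./A/g')
printf '%s\n' "$bytes_as_chars"
```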

As a side note, this is the reason GNU sed has the non-standard 'z'
command to clear the pattern space: the more intuitive 's/.*//' command
will fail to clear a pattern space containing invalid characters.

    $ printf "FOO \xFF\n" | LC_ALL=en_CA.utf8 sed 's/.*//g' | od -An -tx1
     ff 0a
    $ printf "FOO \xFF\n" | LC_ALL=en_CA.utf8 sed 'z' | od -An -tx1
     0a
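
Applied to the original report, the same idea can replace a whole line
regardless of locale or invalid bytes. This is only a sketch, not
something proposed in the report; it assumes GNU sed (for 'z') and the
same assumed byte content of the 'test' file:

```shell
# 'z' empties the pattern space whatever it contains; 's/^/allgone/' then
# inserts the new text into the now-empty line. Requires GNU sed.
replaced=$(printf 'a\377-\377-\n' | sed 'z; s/^/allgone/')
printf '%s\n' "$replaced" | od -An -c
```

Because 'z' does not use a regex at all, the result should not depend on
whether the locale is UTF8 or C.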

I'm closing this as "not a bug", but discussion can continue by replying
to this thread.

regards,
 - assaf

This bug report was last modified 6 years and 46 days ago.
