GNU bug report logs - #22237
sed no longer removes high-ascii characters as it did formerly.

Package: sed;

Reported by: Brian Tew <montanalag <at> gmail.com>

Date: Fri, 25 Dec 2015 18:52:02 UTC

Severity: normal

Tags: notabug

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.


Message #10 received at 22237-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Brian Tew <montanalag <at> gmail.com>
Cc: 22237-done <at> debbugs.gnu.org
Subject: Re: bug#22237: sed no longer removes high-ascii characters as it did
 formerly.
Date: Sat, 26 Dec 2015 13:19:07 -0800

On Fri, Dec 25, 2015 at 4:21 AM, Brian Tew <montanalag <at> gmail.com> wrote:
> Well, sometimes it do and sometimes it don't.
>
> Script started on Fri 25 Dec 2015 05:53:04 AM CS
> ~$ed sample
> 50
> l
> subject now that thanksgiving has come and gone\342\246$
> q
> ~$
> ~$sed -i 's/[^a-z 0-9]//g' sample

To remove everything except the listed bytes, you probably want
something like this instead:

  LC_ALL=C sed -i 's/[^[:alnum:] ]//g' sample

Note I've done two things: used LC_ALL=C to override your default
locale (probably a UTF8 one), and used [:alnum:] in place of the
nonportable a-z and 0-9 ranges.
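
A minimal sketch of that approach, assuming a POSIX shell and sed. The
helper name strip_non_alnum and the shortened sample line are
hypothetical, made up for illustration; only the two trailing bytes
are taken from the ed listing above.

```shell
# Hypothetical helper: delete every byte that is not alphanumeric or a space.
# LC_ALL=C makes sed treat each byte as one character, so stray high bytes
# are matched by the negated bracket expression and removed.
strip_non_alnum() {
  LC_ALL=C sed 's/[^[:alnum:] ]//g'
}

# Recreate a line that ends in the two bytes \342\246 and filter it.
printf 'come and gone\342\246\n' | strip_non_alnum
```

The helper reads standard input rather than editing in place, so it
can be tried on a copy of the data before running sed -i on the real
file.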

In general, with UTF8-based locales, a byte sequence like your
\342\246 will match no regular expression, since it is not a valid
UTF8 character.
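
One way to check whether a file contains such a sequence is sketched
below. It assumes an iconv command is installed; the helper name
is_valid_utf8 is hypothetical. Converting from UTF-8 to UTF-8 changes
nothing in well-formed input, but iconv exits with a non-zero status
when it meets an invalid or incomplete sequence.

```shell
# Hypothetical helper: succeed only if standard input is well-formed UTF-8.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# \342 (0xE2) opens a three-byte sequence, but only one continuation
# byte (\246, 0xA6) follows before the newline, so the sequence is incomplete.
if printf 'gone\342\246\n' | is_valid_utf8; then
  echo "valid UTF-8"
else
  echo "invalid UTF-8"
fi
```

With the two bytes from the report, the check should take the
"invalid UTF-8" branch.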

What probably changed is that older versions of sed did not properly
handle multi-byte locales, or your other experience was using a
single-byte locale.

If you still think there is a problem with sed-4.2.2, please provide
more detail and I'll reopen this issue.




This bug report was last modified 9 years and 204 days ago.
