GNU bug report logs -
#22237
sed no longer removes high-ascii characters as it did formerly.
Reported by: Brian Tew <montanalag <at> gmail.com>
Date: Fri, 25 Dec 2015 18:52:02 UTC
Severity: normal
Tags: notabug
Done: Jim Meyering <jim <at> meyering.net>
Bug is archived. No further changes may be made.
Message #10 received at 22237-done <at> debbugs.gnu.org:
On Fri, Dec 25, 2015 at 4:21 AM, Brian Tew <montanalag <at> gmail.com> wrote:
> Well, sometimes it do and sometimes it don't.
>
> Script started on Fri 25 Dec 2015 05:53:04 AM CS
> ~$ed sample
> 50
> l
> subject now that thanksgiving has come and gone\342\246$
> q
> ~$
> ~$sed -i 's/[^a-z 0-9]//g' sample
To remove every byte other than letters, digits and spaces, you
probably want something like this instead:
LC_ALL=C sed -i 's/[^[:alnum:] ]//g' sample
Note I've changed two things: I used LC_ALL=C to override your default
locale (probably a UTF-8 one), and I used [:alnum:] in place of the
nonportable a-z and 0-9 ranges.
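The suggestion above can be tried on a throwaway file. This is only a sketch: the temporary directory, the `cleaned` output file and the shortened sample text are assumptions for illustration, and it writes to a separate file rather than using `-i` so the input is left untouched. Note that `[:alnum:]` in the C locale also keeps uppercase letters, which the original `a-z` range would have removed.

```shell
# Work in a scratch directory (hypothetical location) so nothing real is overwritten.
tmp=$(mktemp -d)

# Recreate a line like the reporter's: ASCII text followed by the stray bytes \342\246.
printf 'thanksgiving has come and gone\342\246\n' > "$tmp/sample"

# In the C locale every byte is one character, so the negated bracket
# expression matches each stray byte and the g flag deletes all of them.
LC_ALL=C sed 's/[^[:alnum:] ]//g' "$tmp/sample" > "$tmp/cleaned"

# Dump the result byte by byte to confirm that only ASCII remains.
od -c "$tmp/cleaned"
```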
In general, with UTF-8-based locales, a byte sequence like your
\342\246 will not be matched by any regular expression, since it is
not a valid UTF-8 character: \342 begins a three-byte sequence and
the third byte is missing.
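One way to check whether a file contains such invalid sequences is to ask iconv to convert it from UTF-8 to UTF-8, which fails on malformed input. This is a sketch rather than part of the original reply; the scratch file and the `verdict` variable are hypothetical names used only here.

```shell
# Build a small test file ending in the incomplete sequence \342\246.
tmp=$(mktemp -d)
printf 'come and gone\342\246\n' > "$tmp/sample"

# iconv exits with a non-zero status when it meets a byte sequence
# that is not valid in the declared source encoding.
if iconv -f UTF-8 -t UTF-8 "$tmp/sample" > /dev/null 2>&1; then
    verdict="valid UTF-8"
else
    verdict="invalid UTF-8"
fi
echo "sample is $verdict"
```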
What probably changed is that older versions of sed did not properly
handle multi-byte locales, or your other experience was using a
single-byte locale.
If you still think there is a problem with sed-4.2.2, please provide
more detail and I'll reopen this issue.
This bug report was last modified 9 years and 204 days ago.