GNU bug report logs - #18266
grep -P and invalid exits with error
Reported by: Santiago <santiago <at> debian.org>
Date: Thu, 14 Aug 2014 15:43:02 UTC
Severity: wishlist
Merged with 18455
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Message #134 received at 18266 <at> debbugs.gnu.org:
Vincent Lefevre wrote:
> Glibc regards it as ASCII:
You're right. Sorry, I was confused. FreeBSD, Solaris, and AIX work
the way that I thought, though. Plus, in GNU regular expressions the
pattern "." works the way that I thought with LC_ALL=C; my guess
(without investigating this) is that this is because whoever wrote the
regex code assumed the BSDish behavior. Arguably this is a glitch in
the GNU regex code, in that for consistency "." should not match
encoding errors in unibyte locales.
Here's a pair of test cases to illustrate the glitch:
$ printf '\200\n' | LC_ALL=en_US.utf8 grep '.' | wc
0 0 0
$ printf '\200\n' | LC_ALL=C grep '.' | wc
1 0 2
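As a rough illustration of why the byte \200 (0x80) is an encoding error under both UTF-8 and ASCII, the sketch below decodes it strictly with each encoding. Python's codecs stand in for the locale encodings here; that is an assumption made for illustration only, not how grep or glibc decide, and `is_encoding_error` is a hypothetical helper name.

```python
def is_encoding_error(data: bytes, encoding: str) -> bool:
    """Return True if data cannot be decoded strictly in the given encoding."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return True
    return False


lone_byte = b"\x80"  # the same byte that printf '\200' emits

# A lone 0x80 is a stray continuation byte in UTF-8 and is outside ASCII.
print(is_encoding_error(lone_byte, "utf-8"))
print(is_encoding_error(lone_byte, "ascii"))
# An ordinary ASCII letter decodes cleanly.
print(is_encoding_error(b"a", "ascii"))
```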
> I just mean that "grep ." is a method given by some people, that
> was working before UTF-8.
And it still works, if by "." one means "match one character".
Unfortunately there is no POSIX regular expression that does what you're
looking for (match either one character, or a single byte that is an
encoding error). This is because POSIX says the behavior is undefined
on encoding errors. The GNU syntax for regular expressions extends
POSIX and does not dump core, but it still provides no way to write the
pattern you're asking for, and the behavior is unspecified on encoding
errors. Perhaps this should be improved by fixing the above-mentioned
glitch and by providing a syntax extension for matching encoding errors,
though we'd need a volunteer to do that.
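To make the requested semantics concrete ("match either one character, or a single byte that is an encoding error"), here is a minimal sketch outside grep, using a Python bytes regex. It assumes the data is meant to be UTF-8 and spells out well-formed UTF-8 sequences by hand; the names `UTF8_CHAR_OR_BAD_BYTE` and `split_units` are hypothetical, and this is not a syntax that GNU grep or POSIX provides.

```python
import re

# One well-formed UTF-8 character, captured as "char", or else any single
# leftover non-ASCII byte, captured as "bad" (an encoding error).
UTF8_CHAR_OR_BAD_BYTE = re.compile(
    rb"(?P<char>"
    rb"[\x00-\x7F]"                         # 1-byte (ASCII)
    rb"|[\xC2-\xDF][\x80-\xBF]"             # 2-byte
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"         # 3-byte, no overlong forms
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"  # 3-byte
    rb"|\xED[\x80-\x9F][\x80-\xBF]"         # 3-byte, no surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"      # 4-byte, no overlong forms
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"          # 4-byte
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"      # 4-byte, up to U+10FFFF
    rb")|(?P<bad>[\x80-\xFF])"
)


def split_units(data: bytes):
    """Yield (kind, raw_bytes) pairs covering data; kind is 'char' or 'bad'."""
    for m in UTF8_CHAR_OR_BAD_BYTE.finditer(data):
        yield (m.lastgroup, m.group())


# Every input byte ends up in exactly one unit, valid or not.
for kind, raw in split_units(b"a\x80\xc3\xa9\n"):
    print(kind, raw)
```

Because the two alternatives together accept every possible byte, the scan never skips input, which is the property the plain "." pattern lacks on data containing encoding errors.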
The situation with libpcre is weirder: there's a pattern '\C' for
matching a single byte even if it's an encoding error, but as far as I
can tell there's no way to use regular expressions safely on arbitrary
data containing encoding errors unless you're in unibyte mode (in which
case '\C' provides no extra power). I.e., \C appears to be useless in
any program for which undefined behavior is unacceptable.
This bug report was last modified 10 years and 248 days ago.