#18266 - grep -P and invalid exits with error

GNU bug report logs - #18266
grep -P and invalid exits with error

Package: grep;

Reported by: Santiago <santiago <at> debian.org>

Date: Thu, 14 Aug 2014 15:43:02 UTC

Severity: wishlist

Merged with 18455

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Message #134 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18266 <at> debbugs.gnu.org, 758105 <at> bugs.debian.org
Subject: Re: bug#18266: handling bytes not part of the charset, and other garbage
Date: Fri, 12 Sep 2014 09:16:45 -0700

Vincent Lefevre wrote:
> Glibc regards it as ASCII:

You're right. Sorry, I was confused. FreeBSD, Solaris, and AIX work
the way that I thought, though. Plus, in GNU regular expressions the
pattern "." works the way that I thought with LC_ALL=C; my guess
(without investigating this) is that this is because whoever wrote the
regex code assumed the BSDish behavior. Arguably this is a glitch in
the GNU regex code, in that for consistency "." should not match
encoding errors in unibyte locales. Here's a pair of test cases to
illustrate the glitch:

$ printf '\200\n' | LC_ALL=en_US.utf8 grep '.' | wc
      0       0       0
$ printf '\200\n' | LC_ALL=C grep '.' | wc
      1       0       2

> I just mean that "grep ." is a method given by some people, that
> was working before UTF-8.

And it still works, if by "." one means "match one character".
Unfortunately there is no POSIX regular expression that does what
you're looking for (match either one character, or a single byte that
is an encoding error). This is because POSIX says the behavior is
undefined on encoding errors. The GNU syntax for regular expressions
extends POSIX and does not dump core, but it still provides no way to
write the pattern you're asking for, and the behavior is unspecified on
encoding errors. Perhaps this should be improved by fixing the
abovementioned glitch and by providing a syntax extension for matching
encoding errors, though we'd need a volunteer to do that.

The situation with libpcre is weirder: there's a pattern '\C' for
matching a single byte even if it's an encoding error, but as far as I
can tell there's no way to use regular expressions safely on arbitrary
data containing encoding errors unless you're in unibyte mode (in which
case '\C' provides no extra power). I.e., \C appears to be useless in
any program for which undefined behavior is unacceptable.
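
For readers who want to see the "one character, or a single byte that is an encoding error" unit spelled out, here is a minimal sketch in Python. It is not part of grep, GNU regex, or libpcre, and it assumes the multibyte encoding is UTF-8. The function name split_chars_and_errors is hypothetical and exists only for this illustration.

```python
from typing import Iterator, Tuple


def split_chars_and_errors(data: bytes) -> Iterator[Tuple[bytes, bool]]:
    """Yield (chunk, is_valid) pairs covering all of `data`.

    Each chunk is either the bytes of one complete UTF-8 character
    (is_valid=True) or a single byte that is an encoding error
    (is_valid=False).  Hypothetical helper, for illustration only.
    """
    i = 0
    while i < len(data):
        # A UTF-8 character is 1 to 4 bytes long.  Trying the shortest
        # length first means the first successful decode is exactly one
        # character.
        for length in range(1, 5):
            chunk = data[i:i + length]
            try:
                chunk.decode("utf-8")  # strict decoding rejects bad input
            except UnicodeDecodeError:
                continue
            yield chunk, True
            i += len(chunk)
            break
        else:
            # No prefix decoded: treat this one byte as an encoding error
            # and resume scanning at the next byte.
            yield data[i:i + 1], False
            i += 1


if __name__ == "__main__":
    # The \200 byte from the test cases above, surrounded by valid text.
    for chunk, ok in split_chars_and_errors(b"a\x80\xc3\xa9\n"):
        print(chunk, "character" if ok else "encoding error")
```

A matcher built on this unit would let a hypothetical "any unit" operator consume either kind of chunk, which is the behavior that neither POSIX nor the GNU syntax currently offers.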

This bug report was last modified 10 years and 347 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
