GNU bug report logs - #18266
grep -P and invalid exits with error


Package: grep;

Reported by: Santiago <santiago <at> debian.org>

Date: Thu, 14 Aug 2014 15:43:02 UTC

Severity: wishlist

Merged with 18455

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.



Message #134 received at 18266 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18266 <at> debbugs.gnu.org, 758105 <at> bugs.debian.org
Subject: Re: bug#18266: handling bytes not part of the charset, and other garbage
Date: Fri, 12 Sep 2014 09:16:45 -0700

Vincent Lefevre wrote:
> Glibc regards it as ASCII:

You're right.  Sorry, I was confused.  FreeBSD, Solaris, and AIX work 
the way that I thought, though.  Plus, in GNU regular expressions the 
pattern "." works the way that I thought with LC_ALL=C; my guess 
(without investigating this) is that this is because whoever wrote the 
regex code assumed the BSDish behavior.  Arguably this is a glitch in 
the GNU regex code, in that for consistency "." should not match 
encoding errors in unibyte locales.

Here's a pair of test cases to illustrate the glitch:

$ printf '\200\n' | LC_ALL=en_US.utf8 grep '.' | wc
      0       0       0
$ printf '\200\n' | LC_ALL=C grep '.' | wc
      1       0       2

> I just mean that "grep ." is a method given by some people, that
> was working before UTF-8.

And it still works, if by "." one means "match one character".

Unfortunately there is no POSIX regular expression that does what you're 
looking for (match either one character, or a single byte that is an 
encoding error).  This is because POSIX says the behavior is undefined 
on encoding errors.  The GNU syntax for regular expressions extends 
POSIX and does not dump core, but it still provides no way to write the 
pattern you're asking for, and the behavior is unspecified on encoding 
errors.  Perhaps this should be improved by fixing the above-mentioned 
glitch and by providing a syntax extension for matching encoding errors, 
though we'd need a volunteer to do that.
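
To make the requested semantics concrete ("match either one character, or a 
single byte that is an encoding error"), here is a minimal sketch in Python 
rather than in grep or POSIX regex syntax.  It is an illustration only, not 
how grep or GNU regex works: it assumes the input is meant to be UTF-8 and 
relies on Python's "surrogateescape" error handler, which maps each 
undecodable byte to a lone surrogate so that "." can then match it as one 
unit.  The helper names (decode_lenient, units, is_encoding_error) are 
hypothetical.

```python
import re


def decode_lenient(data: bytes) -> str:
    """Decode as UTF-8; each undecodable byte 0xNN becomes the lone surrogate U+DCNN."""
    return data.decode("utf-8", errors="surrogateescape")


def units(line: bytes) -> list:
    """Split a line into units: valid characters or single invalid bytes."""
    # DOTALL so that "." also matches a newline, i.e. every unit is returned.
    return re.findall(r".", decode_lenient(line), flags=re.DOTALL)


def is_encoding_error(unit: str) -> bool:
    """True if the unit is an escaped invalid byte rather than a real character."""
    return "\udc80" <= unit <= "\udcff"


if __name__ == "__main__":
    sample = b"a\x80\xc3\xa9"  # 'a', a stray 0x80 byte, then U+00E9 encoded in UTF-8
    for u in units(sample):
        kind = "encoding error" if is_encoding_error(u) else "character"
        # Re-encode with the same handler to show the original bytes.
        print(repr(u.encode("utf-8", errors="surrogateescape")), kind)
```

With this approach a line holding only the byte \200 yields exactly one unit, 
so a "non-empty line" test behaves the same whether or not the line contains 
encoding errors.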

The situation with libpcre is weirder: there's a pattern '\C' for 
matching a single byte even if it's an encoding error, but as far as I 
can tell there's no way to use regular expressions safely on arbitrary 
data containing encoding errors unless you're in unibyte mode (in which 
case '\C' provides no extra power).  I.e., \C appears to be useless in 
any program for which undefined behavior is unacceptable.




This bug report was last modified 10 years and 248 days ago.


