GNU bug report logs - #47264
RFE: pcre2 support

Package: grep;

Reported by: Jaroslav Skarvada <jskarvad <at> redhat.com>

Date: Fri, 19 Mar 2021 15:23:01 UTC

Severity: wishlist

Merged with 22345, 40395

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Carlo Arenas <carenas <at> gmail.com>
Cc: Jaroslav Skarvada <jskarvad <at> redhat.com>, 47264 <at> debbugs.gnu.org
Subject: bug#47264: [PATCH v2] pcre: migrate to pcre2
Date: Mon, 15 Nov 2021 08:17:02 -0800

On 11/14/21 20:44, Carlo Arenas wrote:

>> This shouldn't be a problem in practice. Surely PCRE2_SIZE_MAX is for
>> forward compatibility to a potential future version of PCRE2 that may
>> define PCRE2_SIZE to be some other type. For PCRE2 10.20 and earlier
>> PCRE2_SIZE is hardwired to size_t, so there is only one plausible
>> default for PCRE2_SIZE_MAX, namely SIZE_MAX.
> 
> which is why I mention that it will be better to at least document
> that in a comment, as it was done everywhere else where assumptions
> made in the pcre library were used.

What sort of documentation did you have in mind, exactly?

> Interestingly enough this discussion gave me an idea for a feature in
> PCRE where that value will be set to something else than SIZE_MAX and
> that might break grep in a future release if it lands.

How would it break grep? I'm not following. If a future version of PCRE2
defines PCRE2_SIZE_MAX to something other than SIZE_MAX, grep should work
just fine because it will use what PCRE2 defines.

>>> As I mentioned before, in an early draft that also had this change
>>> reversed, PCRE matches the Perl definition.
>>
>> I see that PCRE2 documents that PCRE2_EXTRA_MATCH_WORD surrounds the
>> pattern with "\b(?:" and ")\b". However, this is bogus: it doesn't
>> correspond to the intuitive meaning of "match words", and it doesn't
>> correspond to how grep -w behaves for any grep that I know of.
> 
> It all comes from what perl defines[1] as a word character (\w)

No it doesn't. It comes merely from how PCRE2 documents and implements 
PCRE2_EXTRA_MATCH_WORD.

Perl's definition of \w does not determine how PCRE2_EXTRA_MATCH_WORD 
behaves; it determines only which characters are word characters and 
which are not. As things stand, PCRE2_EXTRA_MATCH_WORD is bizarre 
because it causes 'pcre2grep -w' to match strings consisting entirely of 
non-word (i.e., non-\w) characters. This cannot be right.


> that is indeed likely a "bug", but is one that PCRE shares with perl
> (and at least JavaScript, Java, .NET, Python and Ruby):
> 
>    $ echo 'a,a' | perl -nle '/\b(,)\b/ and print "$1"'

That is a different issue. \b matches word *boundaries*; it's different 
from -w which is supposed to match *words*. There is indeed a word 
boundary between "a" (a \w character) and "," (a non-\w character), and 
another word boundary between "," and the following "a", but this 
doesn't mean "," is a word.
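
The boundary-versus-word distinction can be checked with a small sketch. It uses Python's re module as a stand-in for Perl/PCRE2 (an assumption, though Python is one of the engines listed above as sharing this behavior for ASCII input). It lists where \b matches in "a,a" and shows that a \b-wrapped "," is accepted even though "," contains no word character:

```python
import re

text = "a,a"

# Offsets at which \b succeeds: every transition between a \w character
# and a non-\w character, plus the string edges that touch a \w character.
boundaries = [m.start() for m in re.finditer(r"\b", text)]

# The pattern "," wrapped the way PCRE2 documents PCRE2_EXTRA_MATCH_WORD.
wrapped = re.search(r"\b(?:,)\b", text)

print("boundary offsets:", boundaries)
print("wrapped match:", repr(wrapped.group(0)) if wrapped else None)
```

Both edges of the comma are boundaries, so the wrapped pattern succeeds, which is the behavior being objected to here.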

Attempting to implement -w with \b is a mistake. That mistake is made in 
PCRE2 and the mistake should be corrected. PCRE2 should implement 
PCRE2_EXTRA_MATCH_WORD the same way that grep -P implements -w.
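
A rough sketch of the two approaches, again in Python rather than PCRE2. The helper names are hypothetical, and the lookaround wrapping is only an assumption about what a word-oriented -w could look like (no \w character may touch the match); it is not quoted from the grep or PCRE2 sources:

```python
import re

def wrap_boundary(pattern):
    # The wrapping PCRE2 documents for PCRE2_EXTRA_MATCH_WORD.
    return r"\b(?:" + pattern + r")\b"

def wrap_lookaround(pattern):
    # Hypothetical alternative: the match must not be adjacent to a
    # \w character on either side.
    return r"(?<!\w)(?:" + pattern + r")(?!\w)"

def matches(regex, line):
    # True if the regex matches anywhere in the line.
    return re.search(regex, line) is not None

cases = [("foo", "foo bar"), ("foo", "foobar"), (",", "a,a")]
for pattern, line in cases:
    print(pattern, repr(line),
          "boundary:", matches(wrap_boundary(pattern), line),
          "lookaround:", matches(wrap_lookaround(pattern), line))
```

For ordinary word patterns the two wrappings agree; they differ on the "," in "a,a" case, where only the \b wrapping reports a match.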




This bug report was last modified 3 years and 184 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.