#47264 - RFE: pcre2 support - GNU bug report logs

GNU bug report logs - #47264
RFE: pcre2 support

Package: grep;

Reported by: Jaroslav Skarvada <jskarvad <at> redhat.com>

Date: Fri, 19 Mar 2021 15:23:01 UTC

Severity: wishlist

Merged with 22345, 40395

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Carlo Arenas <carenas <at> gmail.com>
Cc: Jaroslav Skarvada <jskarvad <at> redhat.com>, 47264 <at> debbugs.gnu.org
Subject: bug#47264: [PATCH v2] pcre: migrate to pcre2
Date: Mon, 15 Nov 2021 08:17:02 -0800

On 11/14/21 20:44, Carlo Arenas wrote:

>> This shouldn't be a problem in practice. Surely PCRE2_SIZE_MAX is for
>> forward compatibility to a potential future version of PCRE2 that may
>> define PCRE2_SIZE to be some other type. For PCRE2 10.20 and earlier
>> PCRE2_SIZE is hardwired to size_t, so there is only one plausible
>> default for PCRE2_SIZE_MAX, namely SIZE_MAX.
>
> which is why I mention that it will be better to at least document
> that in a comment, as it was done everywhere else where assumptions
> made in the pcre library were used.

What sort of documentation did you have in mind, exactly?

> Interestingly enough this discussion gave me an idea for a feature in
> PCRE where that value will be set to something else than SIZE_MAX and
> that might break grep in a future release if it lands.

How would it break grep? I'm not following. If a future version of PCRE
defines PCRE2_SIZE_MAX to something other than SIZE_MAX, grep should
work just fine because it will use what PCRE defines.

>>> As I mentioned before, PCRE matches the Perl definition as mentioned
>>> before in an early draft that also had this change reversed.
>>
>> I see that PCRE2 documents that PCRE2_EXTRA_MATCH_WORD surrounds the
>> pattern with "\b(?:" and ")\b". However, this is bogus: it doesn't
>> correspond to the intuitive meaning of "match words", and it doesn't
>> correspond to how grep -w behaves for any grep that I know of.
>
> It all comes from what perl defines[1] as a word character (\w)

No it doesn't. It comes merely from how PCRE2 documents and implements
PCRE2_EXTRA_MATCH_WORD. Perl's definition of \w does not determine how
PCRE2_EXTRA_MATCH_WORD behaves; it determines only which characters are
word characters and which are not.

As things stand, PCRE2_EXTRA_MATCH_WORD is bizarre because it causes
'pcre2grep -w' to match strings consisting entirely of non-word (i.e.,
non-\w) characters. This cannot be right.

> that is indeed likely a "bug", but is one that PCRE shares with perl
> (and at least JavaScript, Java, Net, Python and Ruby) :
>
> $ echo 'a,a' | perl -nle '/\b(,)\b/ and print "$1"'

That is a different issue. \b matches word *boundaries*; it's different
from -w which is supposed to match *words*. There is indeed a word
boundary between "a" (a \w character) and "," (a non-\w character), and
another word boundary between "," and the following "a", but this
doesn't mean "," is a word.

Attempting to implement -w with \b is a mistake. That mistake is made in
PCRE2 and the mistake should be corrected. PCRE2 should implement
PCRE2_EXTRA_MATCH_WORD the same way that grep -P implements -w.
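
The difference between matching at word boundaries and matching words, as argued in the message above, can be tried out with a small sketch. It uses Python's `re` module only because the quoted text lists Python among the engines that share the `\b` behavior. The `\b(?:` ... `)\b` wrapper is the one the message attributes to PCRE2_EXTRA_MATCH_WORD. The lookaround wrapper is an assumption made for illustration: the message does not say how grep -P implements -w, and the helper names `wrap_boundary` and `wrap_lookaround` are hypothetical.

```python
import re


def wrap_boundary(pattern: str) -> str:
    # Wrapping described in the message for PCRE2_EXTRA_MATCH_WORD.
    return r"\b(?:" + pattern + r")\b"


def wrap_lookaround(pattern: str) -> str:
    # Assumption, not taken from the message: require that the match is
    # not adjacent to a word (\w) character on either side.
    return r"(?<!\w)(?:" + pattern + r")(?!\w)"


line = "a,a"

# "," sits between two word boundaries, so the \b wrapper accepts it
# even though "," contains no word characters.
boundary_hit = re.search(wrap_boundary(","), line)

# The lookaround wrapper rejects it because "a" precedes the comma.
lookaround_hit = re.search(wrap_lookaround(","), line)

# An ordinary word surrounded by spaces is accepted by both wrappers.
word_hit_boundary = re.search(wrap_boundary("foo"), "a foo b")
word_hit_lookaround = re.search(wrap_lookaround("foo"), "a foo b")

print("boundary wrapper matches ',' in 'a,a':", boundary_hit is not None)
print("lookaround wrapper matches ',' in 'a,a':", lookaround_hit is not None)
```

The two wrappers agree on ordinary words and differ only when the pattern itself starts or ends with a non-word character, which is the case the message objects to.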

This bug report was last modified 3 years and 244 days ago.

