GNU bug report logs -
#47264
RFE: pcre2 support
Message #84 received at 47264 <at> debbugs.gnu.org (full text, mbox):
On 11/14/21 20:44, Carlo Arenas wrote:
>> This shouldn't be a problem in practice. Surely PCRE2_SIZE_MAX is for
>> forward compatibility to a potential future version of PCRE2 that may
>> define PCRE2_SIZE to be some other type. For PCRE2 10.20 and earlier
>> PCRE2_SIZE is hardwired to size_t, so there is only one plausible
>> default for PCRE2_SIZE_MAX, namely SIZE_MAX.
>
> which is why I mention that it will be better to at least document
> that in a comment, as it was done everywhere else where assumptions
> made in the pcre library were used.

What sort of documentation did you have in mind, exactly?

> Interestingly enough this discussion gave me an idea for a feature in
> PCRE where that value will be set to something else than SIZE_MAX and
> that might break grep in a future release if it lands.

How would it break grep? I'm not following. If a future version of PCRE
defines PCRE2_SIZE_MAX to something other than SIZE_MAX, grep should work
just fine because it will use what PCRE defines.

>>> As I mentioned before, PCRE matches the Perl definition as mentioned
>>> before in an early draft that also had this change reversed.
>>
>> I see that PCRE2 documents that PCRE2_EXTRA_MATCH_WORD surrounds the
>> pattern with "\b(?:" and ")\b". However, this is bogus: it doesn't
>> correspond to the intuitive meaning of "match words", and it doesn't
>> correspond to how grep -w behaves for any grep that I know of.
>
> It all comes from what perl defines[1] as a word character (\w)

No it doesn't. It comes merely from how PCRE2 documents and implements
PCRE2_EXTRA_MATCH_WORD.

Perl's definition of \w does not determine how PCRE2_EXTRA_MATCH_WORD
behaves; it determines only which characters are word characters and
which are not. As things stand, PCRE2_EXTRA_MATCH_WORD is bizarre
because it causes 'pcre2grep -w' to match strings consisting entirely of
non-word (i.e., non-\w) characters. This cannot be right.

> that is indeed likely a "bug", but is one that PCRE shares with perl
> (and at least JavaScript, Java, Net, Python and Ruby) :
>
> $ echo 'a,a' | perl -nle '/\b(,)\b/ and print "$1"'

That is a different issue. \b matches word *boundaries*; it's different
from -w which is supposed to match *words*. There is indeed a word
boundary between "a" (a \w character) and "," (a non-\w character), and
another word boundary between "," and the following "a", but this
doesn't mean "," is a word.
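
The boundary positions in the quoted example can be listed directly. This is a minimal sketch in Python rather than perl, on the assumption that Python's `re` module treats \b and \w the same way for this ASCII input; the variable names are only for illustration.

```python
import re

text = "a,a"

# \b is a zero-width assertion: it holds wherever a \w character is
# adjacent to a non-\w character or to an edge of the string.
boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(boundaries)

# Both sides of "," are word boundaries, so the comma matches even
# though it contains no word characters at all.
m = re.search(r"\b(,)\b", text)
print(m.group(1))
```

The match succeeds because of the neighbouring "a" characters, not because "," is itself a word.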
Attempting to implement -w with \b is a mistake. That mistake is made in
PCRE2 and the mistake should be corrected. PCRE2 should implement
PCRE2_EXTRA_MATCH_WORD the same way that grep -P implements -w.
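
To illustrate the difference, here is a rough sketch that compares the documented "\b(?:" ... ")\b" wrapping with a wrapping based on lookarounds, "(?<!\w)(?:" ... ")(?!\w)". The lookaround form is an assumption chosen for illustration and is not taken from this message. The helper names are hypothetical, and Python's `re` stands in for PCRE2.

```python
import re


def wrap_with_b(pattern):
    # The wrapping PCRE2 documents for PCRE2_EXTRA_MATCH_WORD.
    return r"\b(?:" + pattern + r")\b"


def wrap_with_lookaround(pattern):
    # Assumed alternative: require that no \w character touches
    # either end of the match.
    return r"(?<!\w)(?:" + pattern + r")(?!\w)"


def matches(wrapped, line):
    # True if the wrapped pattern matches anywhere in the line.
    return re.search(wrapped, line) is not None


for line in ["a,a", "a foo b", "afoo"]:
    for pat in [",", "foo"]:
        print(repr(line), repr(pat),
              matches(wrap_with_b(pat), line),
              matches(wrap_with_lookaround(pat), line))
```

For an ordinary word such as "foo" the two wrappings agree on these inputs. They differ for "," in "a,a", where only the \b form reports a match.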
This bug report was last modified 3 years and 184 days ago.