#47264 - RFE: pcre2 support - GNU bug report logs

GNU bug report logs - #47264
RFE: pcre2 support

Package: grep

Reported by: Jaroslav Skarvada <jskarvad <at> redhat.com>

Date: Fri, 19 Mar 2021 15:23:01 UTC

Severity: wishlist

Merged with 22345, 40395

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Carlo Arenas <carenas <at> gmail.com>
Cc: Jaroslav Skarvada <jskarvad <at> redhat.com>, 47264 <at> debbugs.gnu.org
Subject: bug#47264: [PATCH v2] pcre: migrate to pcre2
Date: Sun, 14 Nov 2021 19:17:58 -0800

On 11/14/21 14:25, Carlo Arenas wrote:

> the one in patch6 where a uint32_t option is doubled, triggers
> warnings because of comparing an unsigned variable with 0 AFAIK, but
> there are several of those in the upstream gnulib so presumably not a
> concern?

Yes, that's right. intprops.h can generate tons of bogus warnings with older or non-GCC compilers. We typically don't worry about those warnings. Recent GCC should be OK here.

> using idx_t instead of size_t should be fine (if only halves the max
> size of the objects managed), but I am concerned that assuming
> PCRE2_SIZE_MAX is always equivalent to SIZE_MAX (as done in patch 4)
> might be risky (at least without a comment), and considering that is
> part of the API anyway might be better if kept as PCRE2_SIZE_MAX IMHO.

This shouldn't be a problem in practice. Surely PCRE2_SIZE_MAX is for forward compatibility to a potential future version of PCRE2 that may define PCRE2_SIZE to be some other type. For PCRE2 10.20 and earlier PCRE2_SIZE is hardwired to size_t, so there is only one plausible default for PCRE2_SIZE_MAX, namely SIZE_MAX.

> As I mentioned before, PCRE matches the Perl definition as mentioned
> before in an early draft that also had this change reversed.

I see that PCRE2 documents that PCRE2_EXTRA_MATCH_WORD surrounds the pattern with "\b(?:" and ")\b". However, this is bogus: it doesn't correspond to the intuitive meaning of "match words", and it doesn't correspond to how grep -w behaves for any grep that I know of.

Which "early draft" are you talking about?

This appears to be merely a bug in libpcre2's documentation and implementation.

> I would suggest instead that -P should also follow perl convention
> instead when used together with -w, but maybe that is something that a
> -P feature flag could enable or disable as needed?

I can't imagine anybody intuitively saying in an English locale that "%%" is a word in the string "aa%%aa". PCRE2 is broken, that's all. If a user really wants PCRE2's buggy interpretation, they can simply surround their regexp with "\b(?:" and ")\b" and not use -w; so there's no need to have a different flag for pcre2grep's bizarre interpretation of -w.

Here's another reason why pcre2grep -w is obviously busted:

  $ pcre2grep -w ',' <<'EOF'
  > a,a
  > a, a
  > a,
  > EOF
  a,a

Why is "," a word in the first input line, but not in the second or third? pcre2grep is simply wrong here.

> Note that "word" definition also has a different meaning in a post
> Unicode world

Yes, but that's an independent issue.
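
As a rough illustration of the -w disagreement in the message above, the following Python sketch compares the two interpretations. It is not grep or PCRE2 code: Python's re module stands in for PCRE2, the function names are hypothetical, and the lookaround form is only an assumed approximation of the grep -w rule that a match must be bounded by line edges or non-word characters.

```python
import re


def pcre2_style(pattern: str, line: str) -> bool:
    """Wrap the pattern in \\b(?: ... )\\b, the wrapping the message
    attributes to PCRE2_EXTRA_MATCH_WORD."""
    return re.search(r"\b(?:" + pattern + r")\b", line) is not None


def grep_style(pattern: str, line: str) -> bool:
    """Assumed approximation of grep -w: the match may not touch a word
    character (letter, digit, underscore) on either side."""
    return re.search(r"(?<!\w)(?:" + pattern + r")(?!\w)", line) is not None


# The three input lines from the pcre2grep example, then the "%%" case.
for line in ["a,a", "a, a", "a,"]:
    print(repr(line), pcre2_style(",", line), grep_style(",", line))
print(repr("aa%%aa"), pcre2_style("%%", "aa%%aa"), grep_style("%%", "aa%%aa"))
```

Under the \b wrapping, "," counts as a word only when it sits directly between two word characters, which is why only "a,a" is selected; the lookaround variant rejects all three lines and also rejects "%%" inside "aa%%aa".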


