GNU bug report logs - #47264
RFE: pcre2 support

Previous Next

Package: grep;

Reported by: Jaroslav Skarvada <jskarvad <at> redhat.com>

Date: Fri, 19 Mar 2021 15:23:01 UTC

Severity: wishlist

Merged with 22345, 40395

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Carlo Marcelo Arenas Belón <carenas <at> gmail.com>, 47264 <at> debbugs.gnu.org
Subject: bug#47264: [PATCH] pcre: migrate to pcre2
Date: Sun, 7 Nov 2021 16:30:30 -0800
On 11/7/21 11:26, Carlo Marcelo Arenas Belón wrote:
> Mostly a bug by bug translation of the original code to the PCRE2 API.
> but includes a couple of fixes as well that might be worth doing in
> independent patches, if a straight translation is preferable.

Yes, that might be preferable so that if there are problems we can 
bisect better.


> The API changes the sign of some types and therefore some ugly casts
> were needed,

I see one ugly cast from char const * to PCRE2_SPTR8; it seems 
unavoidable unless we go through void * (and I see little point to doing 
that). Were there some other ugly casts that I missed?

> some of the changes are just to make sure all variables
> fit into the newer types better.

Yes, we do want to avoid overflow bugs.

> Includes backward compatibility and could be made to build all the way
> to 10.00, but assumes a recent version ~10.30; the configure rule sets
> a strict minimum of 10.34 as that is required to pass all tests, even
> if the issues are minimal and likely to be real bugs that the old PCRE
> just hide, there is likely more work pending in this area.

pcre2 10.34 is two years old. Seems old enough to me (tho perhaps others 
can chime in).

> -          int search_offset, int options, int *sub)
> +          int search_offset, uint32_t options)

Let's stick with 'int' unless we absolutely need uint32_t. uint32_t is 
ugly and is not required by the C standard and although PCRE2 evidently 
requires it, I'd rather stick to signed types for the usual reasons.

> +          if (lim > INT32_MAX)
> +            return e;
> +          lim *= 2;

This should use UINT32_MAX / 2 instead of INT32_MAX (the two expressions 
have different values on some oddball platforms). Better yet, replace 
the whole thing with "if (INT_MULTIPLY_WRAPV (lim, 2, &lim)) return e;".

> +  unsigned char *re = xnmalloc (4, size + (fix_len_max + 4 - 1) / 4);

We don't need to multiply by 4 any more, right? Because we no longer 
need to do the trick of replacing NUL with \000 in the pattern.

> +      flags |= PCRE2_UTF;
> +#ifdef PCRE2_NEVER_BACKSLASH_C
> +      /* do not match individual code units but only UTF-8  */
> +      flags |= PCRE2_NEVER_BACKSLASH_C;
> +#endif
> +#ifdef PCRE2_MATCH_INVALID_UTF
> +      /* consider invalid UTF-8 as a barrier, instead of error  */
> +      flags |= PCRE2_MATCH_INVALID_UTF;
> +#endif

Which versions of PCRE2 lack these flags? Should we bother to support 
these old versions?

If memory serves grep currently takes care to not pass invalid UTF-8 in 
the buffer or pattern. Does PCRE2_MATCH_INVALID_UTF make this work obsolete?

> +  PCRE2_UCHAR8 ep[128];

How do we know 128 is long enough? Deserves a comment.

> Performance seems equivalent, and it also seems functionally complete.

Very good news. Thanks for working on this.




This bug report was last modified 3 years and 184 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.