GNU bug report logs -
#47264
RFE: pcre2 support
Previous Next
Full log
Message #27 received at 47264 <at> debbugs.gnu.org (full text, mbox):
On 11/7/21 11:26, Carlo Marcelo Arenas Belón wrote:
> Mostly a bug by bug translation of the original code to the PCRE2 API.
> but includes a couple of fixes as well that might be worth doing in
> independent patches, if a straight translation is preferable.
Yes, that might be preferable so that if there are problems we can
bisect better.
> The API changes the sign of some types and therefore some ugly casts
> were needed,
I see one ugly cast from char const * to PCRE2_SPTR8; it seems
unavoidable unless we go through void * (and I see little point to doing
that). Were there some other ugly casts that I missed?
> some of the changes are just to make sure all variables
> fit into the newer types better.
Yes, we do want to avoid overflow bugs.
> Includes backward compatibility and could be made to build all the way
> to 10.00, but assumes a recent version ~10.30; the configure rule sets
> a strict minimum of 10.34 as that is required to pass all tests, even
> if the issues are minimal and likely to be real bugs that the old PCRE
> just hide, there is likely more work pending in this area.
pcre2 10.34 is two years old. Seems old enough to me (tho perhaps others
can chime in).
> - int search_offset, int options, int *sub)
> + int search_offset, uint32_t options)
Let's stick with 'int' unless we absolutely need uint32_t. uint32_t is
ugly and is not required by the C standard and although PCRE2 evidently
requires it, I'd rather stick to signed types for the usual reasons.
> + if (lim > INT32_MAX)
> + return e;
> + lim *= 2;
This should use UINT32_MAX / 2 instead of INT32_MAX (the two expressions
have different values on some oddball platforms). Better yet, replace
the whole thing with "if (INT_MULTIPLY_WRAPV (lim, 2, &lim)) return e;".
> + unsigned char *re = xnmalloc (4, size + (fix_len_max + 4 - 1) / 4);
We don't need to multiply by 4 any more, right? Because we no longer
need to do the trick of replacing NUL with \000 in the pattern.
> + flags |= PCRE2_UTF;
> +#ifdef PCRE2_NEVER_BACKSLASH_C
> + /* do not match individual code units but only UTF-8 */
> + flags |= PCRE2_NEVER_BACKSLASH_C;
> +#endif
> +#ifdef PCRE2_MATCH_INVALID_UTF
> + /* consider invalid UTF-8 as a barrier, instead of error */
> + flags |= PCRE2_MATCH_INVALID_UTF;
> +#endif
Which versions of PCRE2 lack these flags? Should we bother to support
these old versions?
If memory serves grep currently takes care to not pass invalid UTF-8 in
the buffer or pattern. Does PCRE2_MATCH_INVALID_UTF make this work obsolete?
> + PCRE2_UCHAR8 ep[128];
How do we know 128 is long enough? Deserves a comment.
> Performance seems equivalent, and it also seems functionally complete.
Very good news. Thanks for working on this.
This bug report was last modified 3 years and 184 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.