#60690 - [PATCH v2] grep: correctly identify utf-8 characters with \{b,w} in -P

GNU bug report logs - #60690
[PATCH v2] grep: correctly identify utf-8 characters with \{b,w} in -P

Package: grep;

Reported by: Ævar Arnfjörð Bjarmason <avarab <at> gmail.com>

Date: Mon, 9 Jan 2023 12:19:01 UTC

Severity: normal

Tags: patch


From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: demerphq <at> gmail.com, 60690 <at> debbugs.gnu.org, mega lith01 <megalith01 <at> gmail.com>, Carlo Marcelo Arenas Belón <carenas <at> gmail.com>, Ævar Arnfjörð Bjarmason <avarab <at> gmail.com>, gitster <at> pobox.com, pcre-dev <at> exim.org, Tukusej’s Sirs <tukusejssirs <at> protonmail.com>, git <at> vger.kernel.org
Subject: bug#60690: -P '\d' in GNU and git grep
Date: Mon, 3 Apr 2023 20:30:02 -0700

On Mon, Apr 3, 2023 at 2:39 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> I've recently done some bug-report maintenance about a set of GNU grep
> bug reports related to whether "grep -P '\d'" should match
> non-ASCII digits, and have some thoughts about coordinating GNU grep
> with git grep in this department.
>
> GNU Bug#62605[1] "`[\d]` does not work with PCRE" has been fixed on
> Savannah's copy of GNU grep, and some sort of fix should appear in the
> next grep release. However, I'm leaving the GNU grep bug report open for
> now because it's related to Bug#60690[2] "[PATCH v2] grep: correctly
> identify utf-8 characters with \{b,w} in -P" and to Bug#62552[3] "Bug
> found in latest stable release v3.10 of grep". I merged these related
> bug reports, and the oldest one, Bug#60690, is now the representative
> displayed in the GNU grep bug list[4].
>
> For this set of grep bug reports there's still a pending issue discussed
> in my recent email[5], which proposes a patch so I've tagged Bug#60690
> with "patch". The proposal is that GNU grep -P '\d' should revert to the
> grep 3.9 behavior, i.e., that in a UTF-8 locale, \d should also match
> non-ASCII decimal digits.
>
> In researching this a bit further, I found that on March 23 Git disabled
> the use of PCRE2_UCP in PCRE2 10.34 or earlier[6], due to a PCRE2 bug
> that can cause a crash when PCRE2_UCP is used[7]. A bug fix[8] should
> appear in the next PCRE2 release.
>
> When PCRE2 10.35 comes out,

Thanks for finding that. It's clearly a good idea to disable PCRE2_UCP
for those using those older, known-buggy versions of pcre2.
The latest is 10.42, per https://github.com/PCRE2Project/pcre2/releases

> it appears that 'git grep -P' will behave
> like 'grep -P' only if GNU grep adopts something like the solution
> proposed in [5].
>
> [1]: https://bugs.gnu.org/62605
> [2]: https://bugs.gnu.org/60690
> [3]: https://bugs.gnu.org/62552
> [4]: https://debbugs.gnu.org/cgi/pkgreport.cgi?package=grep
> [5]: https://lists.gnu.org/archive/html/grep-devel/2023-04/msg00004.html
> [6]: https://github.com/git/git/commit/14b9a044798ebb3858a1f1a1377309a3d6054ac8
> [7]: https://lore.kernel.org/git/7E83DAA1-F9A9-4151-8D07-D80EA6D59EEA <at> clumio.com/
> [8]: https://github.com/git/git/commit/14b9a044798ebb3858a1f1a1377309a3d6054ac8

Thanks for all of the links.

However, have you seen justification (other than for compatibility
with some other tool or language) for allowing \d to match non-ASCII
by default, in spite of the risks?

IMHO, we have an obligation to retain compatibility with how grep -P
'\d' has worked since -P was added. I'd be happy to see an option to
enable the match-multibyte-digits behavior, but making it the default
seems too likely to introduce unwarranted risk.
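
For readers who want to see the distinction under discussion (an ASCII-only \d versus a Unicode-aware \d), here is a minimal sketch using Python's re module. It is an analogy only: Python's re is not PCRE2 and is not what grep uses, and the sample string is made up for illustration. For str patterns, Python's \d matches any Unicode decimal digit by default, roughly the behavior the thread associates with PCRE2_UCP in a UTF-8 locale, while re.ASCII restricts \d to 0-9.

```python
import re

# Hypothetical sample: three ASCII digits, a space, then three
# Arabic-Indic digits (U+0661..U+0663).
text = "123 \u0661\u0662\u0663"

# Unicode-aware \d: Python's default for str patterns.
# Assumed here to stand in for "PCRE2_UCP enabled".
unicode_digits = re.findall(r"\d", text)

# ASCII-only \d: re.ASCII limits \d to [0-9].
# Assumed here to stand in for "PCRE2_UCP disabled".
ascii_digits = re.findall(r"\d", text, flags=re.ASCII)

print(len(unicode_digits))  # 6
print(len(ascii_digits))    # 3
```

A pattern such as ^\d+$ used to validate input therefore accepts different strings depending on which mode is in effect, which is one way to picture the compatibility concern raised in the message.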

This bug report was last modified 2 years and 127 days ago.

