GNU bug report logs - #60690
[PATCH v2] grep: correctly identify utf-8 characters with \{b,w} in -P

Package: grep;

Reported by: Ævar Arnfjörð Bjarmason <avarab <at> gmail.com>

Date: Mon, 9 Jan 2023 12:19:01 UTC

Severity: normal

Tags: patch

Merged with 62552, 62605

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Ævar Arnfjörð Bjarmason <avarab <at> gmail.com>, Carlo Marcelo Arenas Belón <carenas <at> gmail.com>
Cc: demerphq <at> gmail.com, pcre-dev <at> exim.org, 60690 <at> debbugs.gnu.org, git <at> vger.kernel.org, gitster <at> pobox.com
Subject: bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b,w} in -P
Date: Mon, 9 Jan 2023 10:40:16 -0800

On 1/9/23 03:35, Ævar Arnfjörð Bjarmason wrote:

> You almost never want "everything Unicode considers a digit", and if you
> do using e.g. \p{Nd} instead of \d would be better in terms of
> expressing your intent.

For GNU grep, PCRE2_UCP is needed because of examples like the ones Gro-Tsen
and Karl Petterssen supplied. If there's some disagreement about how \d
should behave with UTF-8 data, the GNU grep hackers should let the Perl
community decide that; that is, GNU grep can simply follow PCRE2's lead.
But GNU grep does need PCRE2_UCP for \b etc.
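
To illustrate why ASCII-only word semantics give surprising \b results on UTF-8 text, here is a minimal sketch using Python's re module rather than PCRE2. The re.ASCII flag is only a loose stand-in for matching without PCRE2_UCP; that correspondence, and the sample word, are assumptions made for illustration and are not taken from the examples mentioned above.

```python
import re

# Sample text (an assumption for illustration): "naive" spelled with
# U+00EF LATIN SMALL LETTER I WITH DIAERESIS in the middle.
text = "na\u00efve"

# ASCII-only semantics: the accented letter is not a word character,
# so the word is split and a spurious boundary appears before "ve".
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
ascii_boundary = re.search(r"\bve\b", text, flags=re.ASCII) is not None

# Unicode-aware semantics (Python's default for str patterns): the whole
# word is one run of word characters, so there is no boundary inside it.
unicode_words = re.findall(r"\w+", text)
unicode_boundary = re.search(r"\bve\b", text) is not None

print(ascii_words, ascii_boundary)
print(unicode_words, unicode_boundary)
```

With ASCII-only semantics the pattern \bve\b matches inside the word, which is the kind of false positive that Unicode-aware \b avoids.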

> 	$ diff <(git -P grep -P '\d+') <(git -P grep -P '(*UCP)\d')
> 	53360a53361,53362
> 	> git-gui/po/ja.po:"- 第１行: 何をしたか、を１行で要約。\n"
> 	> git-gui/po/ja.po:"- 第２行: 空白\n"

Although I don't speak Japanese, I dealt with quite a bit of Japanese
text in a previous job, and personally I would prefer \d to match those
two lines, as they do contain digits. So to me this particular case is
not a good argument that git grep should not match those lines.

Of course other people might prefer differently, and there are cases 
where I want to match only ASCII digits. I've learned in the past to use 
[0-9] for that. I hope PCRE2 never changes [0-9] to match anything but 
ASCII digits when searching UTF-8 text.
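
The difference between \d and [0-9] can be sketched with Python's re module, used here only as an analogy: it is not PCRE2, and its default Unicode handling of \d merely resembles what PCRE2_UCP enables. The sample string reuses the fullwidth digit from the ja.po lines quoted above.

```python
import re
import unicodedata

# U+FF11 FULLWIDTH DIGIT ONE, as in the quoted ja.po line.
fullwidth_one = "\uff11"
line = "- \u7b2c" + fullwidth_one + "\u884c"

# Unicode classifies the fullwidth digit as a decimal number (Nd).
category = unicodedata.category(fullwidth_one)

# Unicode-aware \d (Python's default for str patterns) matches it.
unicode_digit = re.search(r"\d", line) is not None

# An explicit ASCII range does not match it.
ascii_range = re.search(r"[0-9]", line) is not None

# ASCII-only \d (loosely analogous to matching without PCRE2_UCP)
# does not match it either.
ascii_digit = re.search(r"\d", line, flags=re.ASCII) is not None

print(category, unicode_digit, ascii_range, ascii_digit)
```

In this sketch [0-9] stays ASCII-only whichever mode \d is in, which is the property hoped for above.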

This bug report was last modified 2 years and 127 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.