GNU bug report logs - #60690
[PATCH v2] grep: correctly identify utf-8 characters with \{b,w} in -P
On 1/9/23 03:35, Ævar Arnfjörð Bjarmason wrote:
> You almost never want "everything Unicode considers a digit", and if you
> do using e.g. \p{Nd} instead of \d would be better in terms of
> expressing your intent.
For GNU grep, PCRE2_UCP is needed because of examples like what Gro-Tsen
and Karl Petterssen supplied. If there's some disagreement about how \d
should behave with UTF-8 data the GNU grep hackers should let the Perl
community decide that; that is, GNU grep can simply follow PCRE2's lead.
But GNU grep does need PCRE2_UCP for \b etc.
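To illustrate why \b depends on Unicode-aware character classes, here is a minimal sketch using Python's re module as a stand-in. It is not PCRE2 and not grep's code path: the re.ASCII flag only approximates matching without PCRE2_UCP, and the sample text is made up for illustration.

```python
import re

# Hypothetical sample text; "é" is a letter outside ASCII.
text = "café élan"

# Unicode-aware mode (Python's default for str patterns): "é" counts as a
# word character, so \b falls only at the true word edges.
unicode_words = re.findall(r"\b\w+\b", text)

# ASCII-only mode: "é" is not a word character, so \b appears right next
# to it and the words are cut short.
ascii_words = re.findall(r"\b\w+\b", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

In the ASCII-only run the accented letters are dropped from the matches, which is the kind of surprise that Unicode-aware \b avoids.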
> $ diff <(git -P grep -P '\d+') <(git -P grep -P '(*UCP)\d')
> 53360a53361,53362
> > git-gui/po/ja.po:"- 第１行: 何をしたか、を１行で要約。\n"
> > git-gui/po/ja.po:"- 第２行: 空白\n"
Although I don't speak Japanese I have dealt with quite a bit of
Japanese text in a previous job, and personally I would prefer \d to
match those two lines as they do contain digits. So to me this
particular case is not a good argument that git grep should not match
those lines.
Of course other people might prefer differently, and there are cases
where I want to match only ASCII digits. I've learned in the past to use
[0-9] for that. I hope PCRE2 never changes [0-9] to match anything but
ASCII digits when searching UTF-8 text.
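The difference between \d and [0-9] can be sketched as follows, again using Python's re module as an analogy rather than PCRE2 itself. The sample string is an assumption made for illustration: it mixes a fullwidth digit (U+FF11) with an ASCII digit.

```python
import re

# Hypothetical sample: one fullwidth digit (U+FF11) and one ASCII digit.
sample = "第\uff11行 line 2"

# Unicode-aware \d (Python's default for str patterns) matches both digits.
unicode_digits = re.findall(r"\d", sample)

# An explicit [0-9] class matches only ASCII digits, in any mode.
ascii_class = re.findall(r"[0-9]", sample)

# \d restricted to ASCII behaves like [0-9].
ascii_flag = re.findall(r"\d", sample, flags=re.ASCII)

print(unicode_digits, ascii_class, ascii_flag)
```

Writing [0-9] states the ASCII-only intent in the pattern itself, so it does not depend on which mode the regex engine happens to be in.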
This bug report was last modified 2 years and 70 days ago.