GNU bug report logs - #60690
[PATCH v2] grep: correctly identify utf-8 characters with \{b,w} in -P
Message #14 received at 60690 <at> debbugs.gnu.org (full text, mbox):
On 1/9/23 11:51, Ævar Arnfjörð Bjarmason wrote:
> /b:
> 155781
> (*UCP)/b:
> 46035
> /s:
> 0
> (*UCP)/s:
> 0
> /w:
> 142468
> (*UCP)/w:
> 9706
>
> So the output still differs, and some of those differences may or may
> not be wanted.
I took a look at the output, and by and large I'd want the differences;
that is, I'd want the UCP version, which generates less output. This is
because several Emacs source files are not UTF-8, and \b has nonsense
matches when searching text files encoded via Shift-JIS or Big 5 or
whatever. For this sort of thing, the fewer matches the better.
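To illustrate the kind of nonsense match meant here, the following is a minimal sketch using Python's re module as a stand-in; it is not PCRE2 or grep -P, and the sample string is made up for the illustration. In Shift-JIS, the second byte of many double-byte characters falls in the ASCII letter range, so a byte-oriented \b sees "words" that are not there.

```python
import re

# Three katakana characters, written as escapes so the code points are unambiguous.
text = "\u30A2\u30A4\u30A6"
raw = text.encode("shift_jis")        # bytes 83 41 83 43 83 45; each trail byte is an ASCII letter

word = r"\b[A-Za-z]+\b"               # "a whole word of ASCII letters"

# Byte-oriented search: 0x83 is not a word byte, so every trail byte
# looks like a standalone one-letter word.
byte_hits = re.findall(word.encode("ascii"), raw)

# Search on correctly decoded text: there are no ASCII letters at all.
text_hits = re.findall(word, text)

print(byte_hits)
print(text_hits)
```

The byte-oriented search reports three one-letter "words" while the decoded search reports none, which is the sense in which fewer matches is the better outcome for such files.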
> If all you're doing is matching either ASCII or Japanese text and you
> want "locale-aware numbers" it might do the wrong thing.
I'm not seeing much of a problem here. When searching Japanese text, I
would expect \d and [0-9０-９] (using both ASCII and full-width digits) to
be equivalent so (assuming UCP) it's not a big deal as to which regex
you use, since Japanese text won't contain Bengali (or whatever) digits.
And when searching binary data, I'd expect a bunch of garbage no matter
how \d is interpreted.
Here I'm assuming [０-９] (using full-width digits) has the expected
meaning in PCRE2, i.e., that PCRE2 didn't make the same mistake that
POSIX made.
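As a rough illustration of the \d point, here is a sketch in Python's re module, whose \d on str patterns matches any Unicode decimal digit, loosely analogous to PCRE2 with UCP. It is an analogy only and says nothing about how PCRE2 itself treats the range; the sample strings and the helper count() are invented for the example.

```python
import re

# Sample strings built from escapes so the code points are unambiguous.
japanese_sample = "\uFF11\uFF12\uFF13\u5186 and 135\u5186"  # full-width 123, then ASCII 135
bengali_sample = "\u09E7\u09E8\u09E9"                       # Bengali digits 1, 2, 3

unicode_d = re.compile(r"\d")                  # Unicode-aware for str patterns
ascii_d = re.compile(r"\d", re.ASCII)          # restricted to ASCII 0-9
explicit = re.compile(r"[0-9\uFF10-\uFF19]")   # ASCII plus full-width digits (U+FF10..U+FF19)

def count(pattern, text):
    """Number of single-character matches of pattern in text."""
    return len(pattern.findall(text))

print(count(unicode_d, japanese_sample), count(explicit, japanese_sample))
print(count(ascii_d, japanese_sample))
print(count(unicode_d, bengali_sample), count(explicit, bengali_sample))
```

On the Japanese sample the Unicode-aware \d and the explicit class agree, and the ASCII-only \d misses the full-width digits. The two only diverge on text containing other scripts' digits, such as the Bengali sample.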
This bug report was last modified 2 years and 70 days ago.