GNU bug report logs -
#60690
[PATCH v2] grep: correctly identify utf-8 characters with \{b,w} in -P
Previous Next
Full log
View this message in rfc822 format
On Fri, Apr 7, 2023 at 12:00 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>
> On 2023-04-06 06:39, demerphq wrote:
>
> > Unicode specifies that \d match any digit
> > in any script that it supports.
>
> "Specifies" is too strong. The Unicode Regular Expressions technical
> standard (UTS#18) mentions \d only in Annex C[1], next to the word
> "digit" in a column labeled "Property" (even though \d is really syntax
> not a property). This is at best an informal recommendation, not a
> requirement, as UTS#18 0.2[2] says that UTS#18's syntax is only for
> illustration and that although it's similar to Perl's, the two syntax
> forms may not be exactly the same. So we can't look to UTS#18 for a
> definitive way out of the \d mess, as the Unicode folks specifically
> delegated matters to us.
>
> Even ignoring the \d issue the digit situation is messy. UTS#18 Annex C
> says "\p{gc=Decimal_Number}" is the standard recommended syntax
> assignment for digits. However, PCRE2 does not support this syntax; it
> supports another variant \p{Nd} that UTS#18 also recommends. So it
> appears that PCRE2 already does not implement every recommended aspect
> of UTS#18 syntax. PCRE2 also doesn't match Perl, which does support
> "\p{gc=Decimal_Number}".
Not sure I follow the whole logic here, but PCRE2[3] (search for
"general category" which is what the "gc" above stands for) only
supports the abbreviated form of the unicode classes and `Nd` is
indeed the one that corresponds to `Decimal_Number`.
Carlo
[1]: https://unicode.org/reports/tr18/#Compatibility_Properties
[2]: https://unicode.org/reports/tr18/#Conformance
[3]: https://pcre2project.github.io/pcre2/doc/html/pcre2pattern.html
This bug report was last modified 2 years and 70 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.