GNU bug report logs - #60690
[PATCH v2] grep: correctly identify utf-8 characters with \{b,w} in -P
Message #67 received at 60690 <at> debbugs.gnu.org:
On 2023-04-06 06:39, demerphq wrote:
> Unicode specifies that \d match any digit
> in any script that it supports.
"Specifies" is too strong. The Unicode Regular Expressions technical
standard (UTS#18) mentions \d only in Annex C[1], next to the word
"digit" in a column labeled "Property" (even though \d is really syntax
not a property). This is at best an informal recommendation, not a
requirement, as UTS#18 0.2[2] says that UTS#18's syntax is only for
illustration and that although it's similar to Perl's, the two syntax
forms may not be exactly the same. So we can't look to UTS#18 for a
definitive way out of the \d mess, as the Unicode folks specifically
delegated matters to us.
Even ignoring the \d issue, the digit situation is messy. UTS#18 Annex C
says "\p{gc=Decimal_Number}" is the standard recommended syntax for
digits. However, PCRE2 does not support that syntax; it supports another
variant, \p{Nd}, that UTS#18 also recommends. So it appears that PCRE2
already does not implement every recommended aspect of UTS#18 syntax.
PCRE2 also doesn't match Perl, which does support
"\p{gc=Decimal_Number}".
Anyway, since grep -P '\p{Nd}' implements Unicode's decimal digit class,
that's clearly enough for grep -P to conform to UTS#18 with respect to
digits.
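To make that concrete, here is a small sketch. It assumes a grep built
with PCRE2 support, a UTF-8 locale, and a printf that understands \uHHHH
escapes (bash's builtin and coreutils printf both do):

    # U+0663 ARABIC-INDIC DIGIT THREE is in Unicode's Nd (decimal digit) class.
    printf '\u0663\n' | LC_ALL=en_US.utf8 grep -P '\p{Nd}'
    # The long-form property name that Perl accepts is not supported by the
    # PCRE2 versions discussed here, so expect a pattern error instead:
    printf '3\n' | LC_ALL=en_US.utf8 grep -P '\p{gc=Decimal_Number}'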
> A) how do you tell the regular expression
> engine what semantics you want and B) how does the regular expression
> library identify the encoding in the file, and how does it handle
> malformed content in that file.
Here's how GNU grep does it (a short illustration follows the list):
* RE semantics are specified via command-line options like -P.
* Text encoding is specified by locale, e.g., LC_ALL='en_US.utf8'.
* REs do not match encoding errors.
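A hedged illustration of those three points, assuming a PCRE2-enabled
grep and that the en_US.utf8 and en_US.iso88591 locales are installed;
the sample text and byte values are made up for demonstration:

    # 1. Semantics: -P selects Perl-compatible syntax such as \d
    #    (here with -o to print only the matched part, i.e. "42").
    echo 'item 42' | grep -oP '\d+'

    # 2. Encoding: the locale decides how the file's bytes are interpreted.
    printf 'caf\351\n' > latin1.txt            # "café" encoded in ISO-8859-1
    LC_ALL=en_US.iso88591 grep 'caf.' latin1.txt   # 0xE9 is a character here

    # 3. Encoding errors: in a UTF-8 locale the lone 0xE9 byte is an
    #    encoding error, so '.' should not match it and nothing is printed.
    LC_ALL=en_US.utf8 grep 'caf.' latin1.txt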
> on *nix there is no tradition of using BOM's to
> distinguish the 6 different possible encodings of Unicode (UTF-8,
> UTF-EBCDIC, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE)
Yes, GNU/Linux never really experienced the joys of UTF-EBCDIC, Oracle
UTFE, UTF-16LE vs UTF-16BE, etc. If you're running legacy IBM mainframe
or MS-Windows code, these encodings are obviously a big deal.
However, there seems little reason to force their nontrivial hassles
onto every GNU/Linux program that processes text. A few specialized apps
like 'iconv' deal with offbeat encodings, and that is probably a better
approach all around.
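For instance (a sketch only; the file names are hypothetical), one pass
through iconv lets every ordinary text tool work in the locale's
encoding:

    # Convert a UTF-16LE file to UTF-8 once, then use ordinary tools on it.
    iconv -f UTF-16LE -t UTF-8 report-utf16.txt > report.txt
    LC_ALL=en_US.utf8 grep -P '\p{Nd}' report.txt

    # Or do it in a pipeline, without a temporary file:
    iconv -f UTF-16LE -t UTF-8 report-utf16.txt | grep 'needle'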
> there seems
> to be some level of desire of matching with unicode semantics against
> files that are not uniformly encoded in one of these formats.
That is a use case, yes. It's what 'strings' and 'grep' do.
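A hedged sketch of that use case; the exact output naturally depends on
which binary you point it at:

    # 'strings' extracts printable runs from arbitrary binary data, and
    # 'grep' then searches them, without requiring the input to be
    # well-formed in any single encoding.
    strings /usr/bin/grep | grep -i 'usage'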
[1]: https://unicode.org/reports/tr18/#Compatibility_Properties
[2]: https://unicode.org/reports/tr18/#Conformance