GNU bug report logs - #60690
[PATCH v2] grep: correctly identify utf-8 characters with \{b,w} in -P
Message #67 received at 60690 <at> debbugs.gnu.org:
On 2023-04-06 06:39, demerphq wrote:
> Unicode specifies that \d match any digit
> in any script that it supports.
"Specifies" is too strong. The Unicode Regular Expressions technical
standard (UTS#18) mentions \d only in Annex C[1], next to the word
"digit" in a column labeled "Property" (even though \d is really syntax
not a property). This is at best an informal recommendation, not a
requirement, as UTS#18 0.2[2] says that UTS#18's syntax is only for
illustration and that although it's similar to Perl's, the two syntax
forms may not be exactly the same. So we can't look to UTS#18 for a
definitive way out of the \d mess, as the Unicode folks specifically
delegated matters to us.
Even ignoring the \d issue, the digit situation is messy. UTS#18 Annex C
says "\p{gc=Decimal_Number}" is the standard recommended syntax for
digits. However, PCRE2 does not support that syntax; it supports another
variant, \p{Nd}, that UTS#18 also recommends. So it appears that PCRE2
already does not implement every recommended aspect of UTS#18 syntax.
PCRE2 also doesn't match Perl, which does support
"\p{gc=Decimal_Number}".
Anyway, since grep -P '\p{Nd}' implements Unicode's decimal digit class,
that's clearly enough for grep -P to conform to UTS#18 with respect to
digits.
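To make that concrete, here is a small sketch. It assumes a grep built
with PCRE2 support, a UTF-8 locale, and a printf that understands \uHHHH
escapes (bash's builtin and coreutils printf both do):

    # U+0663 ARABIC-INDIC DIGIT THREE is in Unicode's Nd (decimal digit) class.
    printf '\u0663\n' | LC_ALL=en_US.utf8 grep -P '\p{Nd}'
    # The long-form property name that Perl accepts is not supported by the
    # PCRE2 versions discussed here, so expect a pattern error instead:
    printf '3\n' | LC_ALL=en_US.utf8 grep -P '\p{gc=Decimal_Number}'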
> A) how do you tell the regular expression
> engine what semantics you want and B) how does the regular expression
> library identify the encoding in the file, and how does it handle
> malformed content in that file.
Here's how GNU grep does it (a short illustration follows the list):
* RE semantics are specified via command-line options like -P.
* Text encoding is specified by locale, e.g., LC_ALL='en_US.utf8'.
* REs do not match encoding errors.
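A hedged illustration of those three points, assuming a PCRE2-enabled
grep and that the en_US.utf8 and en_US.iso88591 locales are installed;
the sample text and byte values are made up for demonstration:

    # 1. Semantics: -P selects Perl-compatible syntax such as \d
    #    (here with -o to print only the matched part, i.e. "42").
    echo 'item 42' | grep -oP '\d+'

    # 2. Encoding: the locale decides how the file's bytes are interpreted.
    printf 'caf\351\n' > latin1.txt            # "café" encoded in ISO-8859-1
    LC_ALL=en_US.iso88591 grep 'caf.' latin1.txt   # 0xE9 is a character here

    # 3. Encoding errors: in a UTF-8 locale the lone 0xE9 byte is an
    #    encoding error, so '.' should not match it and nothing is printed.
    LC_ALL=en_US.utf8 grep 'caf.' latin1.txt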
> on *nix there is no tradition of using BOM's to
> distinguish the 6 different possible encodings of Unicode (UTF-8,
> UTF-EBCDIC, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE)
Yes, GNU/Linux never really experienced the joys of UTF-EBCDIC, Oracle
UTFE, UTF-16LE vs UTF-16BE, etc. If you're running legacy IBM mainframe
or MS-Windows code, these encodings are obviously a big deal.
However, there seems little reason to force their nontrivial hassles
onto every GNU/Linux program that processes text. A few specialized apps
like 'iconv' deal with offbeat encodings, and that is probably a better
approach all around.
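For instance (a sketch only; the file names are hypothetical), one pass
through iconv lets every ordinary text tool work in the locale's
encoding:

    # Convert a UTF-16LE file to UTF-8 once, then use ordinary tools on it.
    iconv -f UTF-16LE -t UTF-8 report-utf16.txt > report.txt
    LC_ALL=en_US.utf8 grep -P '\p{Nd}' report.txt

    # Or do it in a pipeline, without a temporary file:
    iconv -f UTF-16LE -t UTF-8 report-utf16.txt | grep 'needle'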
> there seems
> to be some level of desire of matching with unicode semantics against
> files that are not uniformly encoded in one of these formats.
That is a use case, yes. It's what 'strings' and 'grep' do.
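A hedged sketch of that use case; the exact output naturally depends on
which binary you point it at:

    # 'strings' extracts printable runs from arbitrary binary data, and
    # 'grep' then searches them, without requiring the input to be
    # well-formed in any single encoding.
    strings /usr/bin/grep | grep -i 'usage'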
[1]: https://unicode.org/reports/tr18/#Compatibility_Properties
[2]: https://unicode.org/reports/tr18/#Conformance