#60690 - [PATCH v2] grep: correctly identify utf-8 characters with \{b,w} in -P

GNU bug report logs - #60690
[PATCH v2] grep: correctly identify utf-8 characters with \{b,w} in -P

Package: grep

Reported by: Ævar Arnfjörð Bjarmason <avarab <at> gmail.com>

Date: Mon, 9 Jan 2023 12:19:01 UTC

Severity: normal

Tags: patch

Message #64 received at 60690 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: demerphq <demerphq <at> gmail.com>
Cc: Carlo Arenas <carenas <at> gmail.com>, 60690 <at> debbugs.gnu.org, mega lith01 <megalith01 <at> gmail.com>, Philip.Hazel <at> gmail.com, Ævar Arnfjörð Bjarmason <avarab <at> gmail.com>, git <at> vger.kernel.org, Junio C Hamano <gitster <at> pobox.com>, Tukusej’s Sirs <tukusejssirs <at> protonmail.com>, pcre-dev <at> exim.org
Subject: Re: bug#60690: -P '\d' in GNU and git grep
Date: Fri, 7 Apr 2023 09:48:40 -0700

On 2023-04-06 08:45, demerphq wrote:

>> Although this causes pcre2grep to mishandle Unicode characters:
>>
>>    $ echo 'Ævar' | pcre2grep '[Ssß]'
>>    Ævar
>>
>> it mimics Perl 5.36:
>>
>>    $ echo 'Ævar' | perl -ne 'print $_ if /[Ssß]/'
>>    Ævar
>>
>> so this seems to be what Perl users expect, despite its infelicities.

> Actually no, I think you have misunderstood what is happening at the
> different layers involved here.

No, I understood what was going on. My point was that Perl users seem to have accepted this behavior, even though it does not match what people would ordinarily expect.

> What you should have done is something like this:

No, for two reasons. First, I'm no Perl expert and so I don't know (and don't particularly want to learn) its complicated Unicode options and calls. Second, /[Ss\x{DF}]/u is hard to read. If I want the S letters of traditional German, I'll write them in the obvious way, as [Ssß]. No doubt Perl will let me do this somehow - but it is telling that none of your examples do it in such a straightforward way.

> $ echo 'Ævar' | perl -ne 'utf8::decode($_); print $_ if /[Ss\x{DF}]/u'
> $ echo 'baß' | perl -MEncode -ne 'utf8::decode($_); print
> encode_utf8($_) if /[Ss\x{DF}]/u'
> baß
> $ echo 'Ævar' | perl -MEncode -ne 'utf8::decode($_); print
> encode_utf8($_) if /[Ss\x{C6}]/u'
> Ævar
> $ echo 'Ævar' | perl -MEncode -ne 'utf8::decode($_); print
> encode_utf8($_) if /[Ss\x{e6}]/ui'
> Ævar
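
The "different layers" at issue can be modelled outside Perl. The sketch below uses Python's `re` module as a stand-in; it is not taken from the thread, and it assumes that the surprising match comes from both pattern and input being handled as raw UTF-8 bytes. In UTF-8, 'Æ' is the bytes C3 86 and 'ß' is C3 9F, so a byte-oriented class `[Ssß]` contains the byte C3 and matches the first byte of 'Ævar'. All variable names are illustrative.

```python
import re

line = "Ævar"
raw = line.encode("utf-8")  # b'\xc3\x86var'

# Byte-oriented view (assumed to mirror the undecoded one-liner):
# the class holds 'S', 's' and the two UTF-8 bytes of 'ß' (0xC3, 0x9F).
byte_class = re.compile(b"[Ss" + "ß".encode("utf-8") + b"]")
byte_hit = byte_class.search(raw)

# Character-oriented view (pattern and input both decoded):
# the class holds the three characters 'S', 's', 'ß'.
char_class = re.compile("[Ssß]")
char_hit_aevar = char_class.search(line)
char_hit_bass = char_class.search("baß")

# Case-insensitive character match, analogous to the quoted /[Ss\x{e6}]/ui:
# U+00E6 (æ) matches U+00C6 (Æ) under case folding.
ci_hit = re.compile("[Ss\u00e6]", re.IGNORECASE).search(line)

print(byte_hit.group() if byte_hit else None)        # b'\xc3'
print(char_hit_aevar)                                # None
print(char_hit_bass.group() if char_hit_bass else None)  # ß
print(ci_hit.group() if ci_hit else None)            # Æ
```

In this model the byte-oriented class reports a match on 'Ævar' only because of the shared lead byte, while the decoded class matches 'baß' and rejects 'Ævar', which is the behavior the quoted `utf8::decode` examples produce.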

This bug report was last modified 2 years and 125 days ago.
