GNU bug report logs - #60690
[PATCH v2] grep: correctly identify utf-8 characters with \{b,w} in -P


Package: grep

Reported by: Ævar Arnfjörð Bjarmason <avarab <at> gmail.com>

Date: Mon, 9 Jan 2023 12:19:01 UTC

Severity: normal

Tags: patch

Merged with 62552, 62605



Message #64 received at 60690 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: demerphq <demerphq <at> gmail.com>
Cc: Carlo Arenas <carenas <at> gmail.com>, 60690 <at> debbugs.gnu.org,
 mega lith01 <megalith01 <at> gmail.com>, Philip.Hazel <at> gmail.com,
 Ævar Arnfjörð Bjarmason <avarab <at> gmail.com>,
 git <at> vger.kernel.org, Junio C Hamano <gitster <at> pobox.com>,
 Tukusej’s Sirs <tukusejssirs <at> protonmail.com>,
 pcre-dev <at> exim.org
Subject: Re: bug#60690: -P '\d' in GNU and git grep
Date: Fri, 7 Apr 2023 09:48:40 -0700
On 2023-04-06 08:45, demerphq wrote:
>> Although this causes pcre2grep to mishandle Unicode characters:
>>
>>     $ echo 'Ævar' | pcre2grep '[Ssß]'
>>     Ævar
>>
>> it mimics Perl 5.36:
>>
>>     $ echo 'Ævar' | perl -ne 'print $_ if /[Ssß]/'
>>     Ævar
>>
>> so this seems to be what Perl users expect, despite its infelicities.
> Actually no, I think you have misunderstood what is happening at the
> different layers involved here.

No, I understood what was going on. My point was that Perl users seem to 
have accepted this behavior, even though it does not match what people 
would ordinarily expect.


> What you should have done is something like this:

No, for two reasons. First, I'm no Perl expert and so I don't know (and 
don't particularly want to learn) its complicated Unicode options and 
calls. Second, /[Ss\x{DF}]/u is hard to read. If I want the S letters of 
traditional German, I'll write them in the obvious way, as [Ssß]. No 
doubt Perl will let me do this somehow - but it is telling that none of 
your examples do it in such a straightforward way.

> $ echo 'Ævar' | perl -ne 'utf8::decode($_); print $_ if /[Ss\x{DF}]/u'
> $ echo 'baß' | perl -MEncode -ne 'utf8::decode($_); print
> encode_utf8($_) if /[Ss\x{DF}]/u'
> baß
> $ echo 'Ævar' | perl -MEncode -ne 'utf8::decode($_); print
> encode_utf8($_) if /[Ss\x{C6}]/u'
> Ævar
> $ echo 'Ævar' | perl -MEncode -ne 'utf8::decode($_); print
> encode_utf8($_) if /[Ss\x{e6}]/ui'
> Ævar
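
A minimal Python sketch of the byte-versus-character distinction behind the examples above. It assumes the false match on 'Ævar' arises because both the pattern and the input are compared as raw UTF-8 bytes, so the class [Ssß] also contains the lead byte 0xC3 that 'Æ' shares with 'ß'. This only illustrates the general mechanism; it is not Perl's or PCRE2's actual implementation, and the function names are hypothetical.

```python
import re

PATTERN = "[Ssß]"  # the "obvious" way to write the S letters of traditional German


def matches_as_bytes(pattern: str, text: str):
    """Match with pattern and text both treated as undecoded UTF-8 bytes."""
    # 'ß' (U+00DF) encodes to the bytes C3 9F, so the byte-level class
    # contains four members: 'S', 's', 0xC3 and 0x9F.
    return re.search(pattern.encode("utf-8"), text.encode("utf-8"))


def matches_as_chars(pattern: str, text: str):
    """Match with pattern and text both treated as Unicode characters."""
    return re.search(pattern, text)


# 'Æ' (U+00C6) encodes to C3 86; its first byte, 0xC3, is in the byte-level class.
byte_hit = matches_as_bytes(PATTERN, "Ævar")
char_hit = matches_as_chars(PATTERN, "Ævar")

print("bytes:", byte_hit is not None)  # spurious match on the shared lead byte
print("chars:", char_hit is not None)  # no S, s or ß in the decoded text
print("baß as chars:", matches_as_chars(PATTERN, "baß") is not None)
```

In this sketch the byte-level search reports a match on 'Ævar' while the character-level search does not, and 'baß' matches at the character level as expected.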






This bug report was last modified 2 years and 125 days ago.


