GNU bug report logs - #60690
[PATCH v2] grep: correctly identify utf-8 characters with \{b,w} in -P
Message #64 received at 60690 <at> debbugs.gnu.org:
On 2023-04-06 08:45, demerphq wrote:
>> Although this causes pcre2grep to mishandle Unicode characters:
>>
>> $ echo 'Ævar' | pcre2grep '[Ssß]'
>> Ævar
>>
>> it mimics Perl 5.36:
>>
>> $ echo 'Ævar' | perl -ne 'print $_ if /[Ssß]/'
>> Ævar
>>
>> so this seems to be what Perl users expect, despite its infelicities.
> Actually no, I think you have misunderstood what is happening at the
> different layers involved here.
No, I understood what was going on. My point was that Perl users seem to
have accepted this behavior, even though it does not match what people
would ordinarily expect.
> What you should have done is something like this:
No, for two reasons. First, I'm no Perl expert and so I don't know (and
don't particularly want to learn) its complicated Unicode options and
calls. Second, /[Ss\x{DF}]/u is hard to read. If I want the S letters of
traditional German, I'll write them in the obvious way, as [Ssß]. No
doubt Perl will let me do this somehow - but it is telling that none of
your examples do it in such a straightforward way.
> $ echo 'Ævar' | perl -ne 'utf8::decode($_); print $_ if /[Ss\x{DF}]/u'
> $ echo 'baß' | perl -MEncode -ne 'utf8::decode($_); print encode_utf8($_) if /[Ss\x{DF}]/u'
> baß
> $ echo 'Ævar' | perl -MEncode -ne 'utf8::decode($_); print encode_utf8($_) if /[Ss\x{C6}]/u'
> Ævar
> $ echo 'Ævar' | perl -MEncode -ne 'utf8::decode($_); print encode_utf8($_) if /[Ss\x{e6}]/ui'
> Ævar
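
The byte-level overlap behind the [Ssß] examples earlier in this message can be inspected directly. What follows is only a minimal sketch, assuming a POSIX shell with od and tr and a script saved as UTF-8; the helper name hex is hypothetical and does not come from the thread. It prints the UTF-8 bytes of a string, which shows that ß and Æ share the lead byte 0xC3. That shared byte is presumably why an undecoded [Ssß] matches 'Ævar' when pattern and input are both treated as raw bytes.

```shell
#!/bin/sh
# hex: hypothetical helper that prints the bytes of its argument as
# lowercase hex with no separators. od -An -tx1 dumps one byte per
# field; tr strips the spaces and the trailing newline.
hex() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# UTF-8 encodings of the characters involved in the examples.
echo "ß = $(hex 'ß')"   # U+00DF encodes as c3 9f
echo "Æ = $(hex 'Æ')"   # U+00C6 encodes as c3 86
echo "S = $(hex 'S')"   # plain ASCII, a single byte 53
```

Both non-ASCII characters begin with c3, so a character class built from the undecoded bytes of ß contains a byte that also occurs in Ævar.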
This bug report was last modified 2 years and 125 days ago.