GNU bug report logs - #60618
unicode characters are not identified as such for \w and \b with -P
Reported by: Carlo Arenas <carenas <at> gmail.com>
Date: Sat, 7 Jan 2023 03:49:01 UTC
Severity: normal
Merged with 60621
Done: Jim Meyering <jim <at> meyering.net>
Bug is archived. No further changes may be made.
On Fri, Jan 6, 2023 at 7:49 PM Carlo Arenas <carenas <at> gmail.com> wrote:
> Reported to PCRE[1] with mention of GNU grep being also affected.
>
> [1] https://github.com/PCRE2Project/pcre2/issues/185
Yikes. This is a big deal.
Thank you for the patch and added test.
I made a tiny comment tweak and this test logic change that was
required to make the new test pass with the fixed version.
-grep -Po 'r\w' in > out && fail=1
+grep -Po 'r\w' in > out || fail=1
Also, "make syntax-check" required a change to the argument order, e.g.,
-compare out exp || fail=1
+compare exp out || fail=1
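The reason for the && to || switch can be sketched as follows: grep exits 0 when it finds a match and 1 when it does not, so a test that expects a match has to flag failure on a nonzero exit. This is only an illustration, not the actual test from the patch. It uses ASCII input and plain -o so that it does not depend on the locale or on -P support, and cmp stands in for the test suite's compare helper.

```shell
#!/bin/sh
# Illustrative sketch only; the file names mirror the test, the rest is assumed.
fail=0
tmp=$(mktemp -d) || exit 1

# ASCII-only input so the result does not depend on locale or PCRE support.
printf 'Peru\n' > "$tmp/in"

# A match is expected here, so a nonzero exit status must mark a failure.
# With '&&' instead, the test would fail exactly when grep behaves correctly.
grep -o 'r[a-z]' "$tmp/in" > "$tmp/out" || fail=1

# Expected output first, actual output second (the order syntax-check wants).
# cmp is a stand-in for the framework's 'compare' function.
printf 'ru\n' > "$tmp/exp"
cmp -s "$tmp/exp" "$tmp/out" || fail=1

echo "fail=$fail"
rm -rf "$tmp"
```

In the original, pre-fix form of the test, 'r\w' was not expected to match, which is presumably why && was used there.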
Every bug fix needs a NEWS entry, so I added this:
With -P, some non-ASCII UTF8 characters were not recognized as
word-constituent due to our omission of the PCRE2_UCP flag. E.g.,
given f(){ echo Perú|LC_ALL=en_US.UTF-8 grep -Po "$1"; } and
this command, echo $(f 'r\w'):$(f '.\b'), before it would print ":r".
After the fix, it prints the correct results: "rú:ú".
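For anyone who wants to check a local grep against that NEWS entry, here is a small sketch. The function f is copied from the entry; the status variable and its labels are my own additions for illustration. It assumes an installed en_US.UTF-8 locale and a grep built with -P support; if either is missing the result falls into the "unknown" branch.

```shell
#!/bin/sh
# f is taken from the NEWS entry above.
f() { echo Perú | LC_ALL=en_US.UTF-8 grep -Po "$1"; }

# Same command as in the NEWS entry, captured instead of echoed directly.
result="$(f 'r\w'):$(f '.\b')"

# The two strings below are the ones quoted in the NEWS entry;
# the status labels are hypothetical names used only in this sketch.
case $result in
  'rú:ú') status=fixed ;;     # \w and \b treat ú as a word character
  ':r')   status=affected ;;  # ú not recognized as word-constituent
  *)      status=unknown ;;   # no -P support, missing locale, or other output
esac

echo "$result ($status)"
```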
Finally, I expanded the ChangeLog entry and gave credit where due.
I'll push this tomorrow:
[grep-pcre-fix.diff (application/octet-stream, attachment)]
This bug report was last modified 2 years and 132 days ago.