GNU bug report logs -
#72246
Possible PCRE bug in grep 3.11
Message #8 received at 72246 <at> debbugs.gnu.org (full text, mbox):
On 2024-07-22 11:25, Glenn Golden wrote:
> str=$(printf "begin\xe2\x80\x99end")
>
> #
> # grep 3.11 using PCRE '[\x80-\xFF]' doesn't find any of them,
> # and exits with 1, indicating no match.
> #
> printf "Using grep 3.11:\n"
> printf "${str}\n" | grep --color=auto -P -e '[\x80-\xFF]'
This asks 'grep' to output all lines containing characters in the range
\x80 through \xFF. In a single-byte locale this matches any line
containing a byte in that range (i.e., any byte with the top bit set),
and 'grep' will output the line and exit with status zero.
However, in a UTF-8 locale this will match any line containing the
characters U+0080 (a nameless control character) through U+00FF (LATIN
SMALL LETTER Y WITH DIAERESIS, or "ÿ"). Because the bytes E2, 80, 99 in
'str' represent U+2019 RIGHT SINGLE QUOTATION MARK, there is no match so
grep doesn't output anything and exits with status 1.
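To make the arithmetic concrete, here is a minimal sketch that decodes the three bytes by hand using the standard UTF-8 layout for a three-byte sequence (1110xxxx 10xxxxxx 10xxxxxx). The variable names are only for illustration and are not part of the original report.

```shell
# Decode the three-byte UTF-8 sequence E2 80 99 into a code point.
b1=0xE2 b2=0x80 b3=0x99

# Lead byte contributes its low 4 bits; each continuation byte its low 6 bits.
cp=$(( (($b1 & 0x0F) << 12) | (($b2 & 0x3F) << 6) | ($b3 & 0x3F) ))
printf 'U+%04X\n' "$cp"

# Check whether that code point lies in U+0080..U+00FF, which is what
# '[\x80-\xFF]' means when the pattern is interpreted as characters.
if [ "$cp" -ge $((0x80)) ] && [ "$cp" -le $((0xFF)) ]; then
  in_range=yes
else
  in_range=no
fi
printf 'within U+0080..U+00FF: %s\n' "$in_range"
```

The decoded value is U+2019, which lies well above U+00FF, so a character-wise range ending at U+00FF cannot match it.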
In short, to get the behavior you want, set LC_ALL="C" in the environment.
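A minimal sketch of that workaround, assuming a 'grep' built with PCRE support. The helper name has_high_byte is hypothetical, and the test string is written with octal escapes because not every shell's printf understands \x.

```shell
# Hypothetical helper: succeeds if some input line contains a byte in
# the range 0x80..0xFF. LC_ALL=C makes grep treat each byte as a character.
has_high_byte() {
  LC_ALL=C grep -q -P '[\x80-\xFF]'
}

# Same bytes as in the report: "begin", E2 80 99, "end".
str=$(printf 'begin\342\200\231end')

if printf '%s\n' "$str" | has_high_byte; then
  result=match
else
  result=no-match
fi
printf '%s\n' "$result"
```

Setting LC_ALL only for the one 'grep' invocation, as here, leaves the rest of the script in the user's own locale.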
If pcregrep finds a match in a UTF-8 locale then that would appear to be
a bug in pcregrep; you might report it to the pcregrep maintainer.
This bug report was last modified 1 year and 25 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.