GNU bug report logs - #72246
Possible PCRE bug in grep 3.11

Package: grep

Date: Mon, 22 Jul 2024 18:26:01 UTC

Severity: normal

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: gdg <at> zplane.com
Cc: 72246 <at> debbugs.gnu.org
Subject: bug#72246: Possible PCRE bug in grep 3.11
Date: Mon, 22 Jul 2024 12:00:21 -0700

On 2024-07-22 11:25, Glenn Golden wrote:
> str=$(printf "begin\xe2\x80\x99end")
> 
> #
> # grep 3.11 using PCRE '[\x80-\xFF]' doesn't find any of them,
> # and exits with 1, indicating no match.
> #
> printf "Using grep 3.11:\n"
> printf "${str}\n" | grep --color=auto -P -e '[\x80-\xFF]'

This asks 'grep' to output all lines containing characters in the range 
\x80 through \xFF. In a single-byte locale this matches any line 
containing a byte in that range (i.e., any byte with the top bit set), 
and 'grep' will output the line and exit with status zero.

However, in a UTF-8 locale this will match any line containing the 
characters U+0080 (a nameless control character) through U+00FF (LATIN 
SMALL LETTER Y WITH DIAERESIS, or "ÿ"). Because the bytes E2, 80, 99 in 
'str' represent U+2019 RIGHT SINGLE QUOTATION MARK, there is no match so 
grep doesn't output anything and exits with status 1.
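
To make the arithmetic concrete, here is a minimal sketch that decodes the 
three bytes by hand using only shell arithmetic. The variable names (b1, 
b2, b3, cp, in_range) are made up for illustration and are not part of 
grep or of the original report.

```shell
# A three-byte UTF-8 sequence has the bit layout
#   1110xxxx 10xxxxxx 10xxxxxx
# so the code point is assembled from the low 4, 6 and 6 bits.
b1=0xE2
b2=0x80
b3=0x99

cp=$(( (($b1 & 0x0F) << 12) | (($b2 & 0x3F) << 6) | ($b3 & 0x3F) ))
printf 'U+%04X\n' "$cp"

# Check whether the code point falls in the range that
# '[\x80-\xFF]' covers in a UTF-8 locale (U+0080 through U+00FF).
if [ "$cp" -ge $((0x80)) ] && [ "$cp" -le $((0xFF)) ]; then
  in_range=yes
else
  in_range=no
fi
printf 'in U+0080..U+00FF: %s\n' "$in_range"
```

The decoded value is U+2019, which lies well above U+00FF, so the bracket 
expression cannot match it when the pattern is interpreted as characters.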

In short, to get the behavior you want, set LC_ALL="C" in the environment.
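
As a rough sketch of that workaround, the setting can be applied to the 
single grep invocation rather than exported for the whole session. This 
assumes a grep built with PCRE support; the variable c_result is only an 
illustrative name. Octal escapes are used in printf because not every 
shell's printf understands \x escapes.

```shell
# Same test string as in the report, written with octal escapes.
str=$(printf 'begin\342\200\231end')

# LC_ALL=C here affects only this grep process, so the pattern
# is matched against individual bytes rather than UTF-8 characters.
if printf '%s\n' "$str" | LC_ALL=C grep -q -P -e '[\x80-\xFF]'; then
  c_result=match
else
  c_result=nomatch
fi
printf 'C locale: %s\n' "$c_result"
```

Prefixing the command this way leaves the locale of the rest of the 
script untouched, which matters if later commands rely on UTF-8 handling.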

If pcregrep finds a match in a UTF-8 locale then that would appear to be 
a bug in pcregrep; you might report it to the pcregrep maintainer.

This bug report was last modified 1 year and 25 days ago.
