GNU bug report logs - #72246
Possible PCRE bug in grep 3.11

Package: grep;

Reported by: gdg <at> zplane.com

Date: Mon, 22 Jul 2024 18:26:01 UTC

Severity: normal


From: Paul Eggert <eggert <at> cs.ucla.edu>
To: gdg <at> zplane.com
Cc: 72246 <at> debbugs.gnu.org
Subject: bug#72246: Possible PCRE bug in grep 3.11
Date: Mon, 22 Jul 2024 12:00:21 -0700

On 2024-07-22 11:25, Glenn Golden wrote:
> str=$(printf "begin\xe2\x80\x99end")
> 
> #
> # grep 3.11 using PCRE '[\x80-\xFF]' doesn't find any of them,
> # and exits with 1, indicating no match.
> #
> printf "Using grep 3.11:\n"
> printf "${str}\n" | grep --color=auto -P -e '[\x80-\xFF]'

This asks 'grep' to output all lines containing characters in the range 
\x80 through \xFF. In a single-byte locale this matches any line 
containing a byte in that range (i.e., any byte with the top bit set), 
and 'grep' will output the line and exit with status zero.

However, in a UTF-8 locale this will match any line containing the 
characters U+0080 (a nameless control character) through U+00FF (LATIN 
SMALL LETTER Y WITH DIAERESIS, or "ÿ"). Because the bytes E2, 80, 99 in 
'str' represent U+2019 RIGHT SINGLE QUOTATION MARK, there is no match so 
grep doesn't output anything and exits with status 1.

In short, to get the behavior you want, set LC_ALL="C" in the environment.
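
A minimal sketch of that suggestion, assuming a GNU grep built with PCRE 
support and an 'od' utility on the PATH. It dumps the three non-ASCII bytes 
of the test string and then runs the same pattern with the locale forced to 
C for the grep invocation only. The variable names 'bytes' and 'result' are 
illustrative, and octal escapes replace the \x escapes because not every 
printf implementation accepts \x.

```shell
# Same input as in the report: "begin", bytes E2 80 99, "end".
str=$(printf 'begin\342\200\231end')

# Show the three non-ASCII bytes in hex; each has the top bit set.
bytes=$(printf '\342\200\231' | od -An -tx1 | tr -d ' \n')
printf 'bytes: %s\n' "$bytes"

# With LC_ALL=C, grep treats the input as single bytes, so the
# range \x80-\xFF is compared against each byte individually.
if printf '%s\n' "$str" | LC_ALL=C grep -q -P -e '[\x80-\xFF]'; then
  result=match
else
  result=no-match
fi
printf 'LC_ALL=C: %s\n' "$result"
```

Setting LC_ALL on the grep command alone leaves the rest of the shell 
session in its original locale.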

If pcregrep finds a match in a UTF-8 locale then that would appear to be 
a bug in pcregrep; you might report it to the pcregrep maintainer.




This bug report was last modified 330 days ago.
