GNU bug report logs - #21989
grep search by ASCII code unsuccessful

Package: grep

Reported by: Shivanshu Goyal <shivanshu3 <at> gmail.com>

Date: Mon, 23 Nov 2015 07:57:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Shivanshu Goyal <shivanshu3 <at> gmail.com>
Cc: 21989 <at> debbugs.gnu.org
Subject: bug#21989: grep search by ASCII code unsuccessful
Date: Mon, 23 Nov 2015 15:05:14 +0000

2015-11-22 21:24:05 -0800, Shivanshu Goyal:
[...]
> I think I found a bug which did not exist in version 2.14, but does seem to
> exist in versions 2.16 and 2.22. I have not tested any other versions.
> 
> Say there is a file with the following contents:
> 
> shivanshu <at> thetis:tmp$ cat temp | xxd
> 0000000: 68e2 8093 680a                           h...h.
> 
> The following is the grep 2.14 command and output:
> 
> shivanshu <at> thetis:tmp$ cat temp | grep -P '\xe2\x80\x93'
> h–h
> 
> The following is the grep 2.16/2.22 command and output:
> 
> shivanshu <at> thetis:tmp$ cat temp | grep -P '\xe2\x80\x93'
> d1y8 <at> thetis:tmp$
[...]

If you read the pcrepattern man page, you'll see that \xe2
matches the character whose code is e2, not the byte e2.

If you're in a UTF-8 locale, \xe2 would match the character of
Unicode code point e2 (LATIN SMALL LETTER A WITH CIRCUMFLEX)
which in UTF-8 is written as the bytes c3 a2.
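
The difference between a code point and its UTF-8 bytes can be
checked without grep. A minimal sketch, where hex_bytes is a
hypothetical helper name (not from any tool) and the bytes are
written as octal printf escapes so that the result does not depend
on the locale:

```shell
# hex_bytes: print the bytes produced by a printf format string
# (octal escapes) as lowercase hex digits, without separators.
hex_bytes() { printf "$1" | od -An -tx1 | tr -d ' \n'; }

hex_bytes '\303\242'; echo        # UTF-8 encoding of U+00E2: prints c3a2
hex_bytes '\342\200\223'; echo    # UTF-8 encoding of U+2013: prints e28093
```

The second result is the e2 80 93 sequence shown by xxd in the
report: three bytes, one character.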

The byte sequence e2 80 93 is the UTF-8 encoding of the single
character U+2013 (EN DASH). So, here, you want either:

LC_ALL=C grep -P '\xe2\x80\x93'

That is, use a locale where characters are single-byte and their
code is the byte value. Or, assuming the current locale is UTF-8,
use:

grep -P '\x{2013}'

Or, regardless of the locale:

grep -P '(*UTF8)\x{2013}'
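
To try this on a given system, the line from the report can be
recreated with printf and piped to grep. A rough sketch, assuming a
GNU grep built with -P support; the function names are made up for
illustration, and en_US.UTF-8 is only an assumed default locale
name that may not exist everywhere:

```shell
# sample: the reported line, "h", EN DASH (bytes e2 80 93), "h".
sample() { printf 'h\342\200\223h\n'; }

# Byte-wise: in the C locale every byte is one character, so the three
# \xHH escapes match the three bytes of the UTF-8 encoded EN DASH.
match_bytes() { sample | LC_ALL=C grep -q -P '\xe2\x80\x93'; }

# Character-wise: in a UTF-8 locale, \x{2013} names the code point.
# The default locale name is an assumption; pass one that `locale -a` lists.
match_codepoint() { sample | LC_ALL="${1:-en_US.UTF-8}" grep -q -P '\x{2013}'; }

match_bytes && echo "byte pattern matches in the C locale"
```

match_codepoint can be called with an explicit locale, for example
match_codepoint C.UTF-8, if that locale is installed.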

-- 
Stephane




This bug report was last modified 9 years and 181 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.