GNU bug report logs - #21989
grep search by ASCII code unsuccessful


Package: grep

Reported by: Shivanshu Goyal <shivanshu3 <at> gmail.com>

Date: Mon, 23 Nov 2015 07:57:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 21989 in the body.
You can then email your comments to 21989 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-grep <at> gnu.org:
bug#21989; Package grep. (Mon, 23 Nov 2015 07:57:02 GMT)

Acknowledgement sent to Shivanshu Goyal <shivanshu3 <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 23 Nov 2015 07:57:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Shivanshu Goyal <shivanshu3 <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: grep search by ASCII code unsuccessful
Date: Sun, 22 Nov 2015 21:24:05 -0800

Hi,

I think I found a bug which did not exist in version 2.14, but does seem to
exist in versions 2.16 and 2.22. I have not tested any other versions.

Say there is a file with the following contents:

shivanshu@thetis:tmp$ cat temp | xxd
0000000: 68e2 8093 680a                           h...h.

The following is the grep 2.14 command and output:

shivanshu@thetis:tmp$ cat temp | grep -P '\xe2\x80\x93'
h–h

The following is the grep 2.16/2.22 command and output:

shivanshu@thetis:tmp$ cat temp | grep -P '\xe2\x80\x93'
d1y8@thetis:tmp$

Thanks,
Shivanshu Goyal
shivanshu.ca

Information forwarded to bug-grep <at> gnu.org:
bug#21989; Package grep. (Mon, 23 Nov 2015 15:06:02 GMT)

Message #8 received at 21989 <at> debbugs.gnu.org:

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Shivanshu Goyal <shivanshu3 <at> gmail.com>
Cc: 21989 <at> debbugs.gnu.org
Subject: Re: bug#21989: grep search by ASCII code unsuccessful
Date: Mon, 23 Nov 2015 15:05:14 +0000

2015-11-22 21:24:05 -0800, Shivanshu Goyal:
[...]
> I think I found a bug which did not exist in version 2.14, but does seem to
> exist in versions 2.16 and 2.22. I have not tested any other versions.
> 
> Say there is a file with the following contents:
> 
> shivanshu@thetis:tmp$ cat temp | xxd
> 0000000: 68e2 8093 680a                           h...h.
> 
> The following is the grep 2.14 command and output:
> 
> shivanshu@thetis:tmp$ cat temp | grep -P '\xe2\x80\x93'
> h–h
> 
> The following is the grep 2.16/2.22 command and output:
> 
> shivanshu@thetis:tmp$ cat temp | grep -P '\xe2\x80\x93'
> d1y8@thetis:tmp$
[...]

If you read the pcrepattern man page, you'll see that \xe2
matches the character with code e2, not the byte e2.

If you're in a UTF-8 locale, \xe2 matches the character at
Unicode code point U+00E2 (LATIN SMALL LETTER A WITH CIRCUMFLEX),
which UTF-8 encodes as the bytes c3 a2.

The sequence e2 80 93 is actually the single character U+2013 (EN
DASH). So, here, you either want:

LC_ALL=C grep -P '\xe2\x80\x93'

That is, use a locale where characters are single-byte and their
code is the byte value. Or, assuming the current locale is UTF-8,
use:

grep -P '\x{2013}'

Or, regardless of the locale:

grep -P '(*UTF8)\x{2013}'

-- 
Stephane
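
The byte-oriented suggestion in the message above can be tried with a
short sketch like the following. It assumes a POSIX shell, od, and a
grep built with PCRE support (the -P option); the file name temp is
taken from the report. Octal escapes are used to write the file because
POSIX printf does not guarantee \x escapes. The UTF-8 variants are left
out because they depend on which locales are installed on the machine.

```shell
# Recreate the reporter's file: 'h', the bytes e2 80 93 (U+2013 EN DASH
# encoded as UTF-8), 'h', newline.
printf 'h\342\200\223h\n' > temp

# Show the bytes on disk; they should read 68 e2 80 93 68 0a.
od -An -tx1 temp

# In the C locale every character is a single byte, so each \xHH in the
# pattern stands for exactly one byte. -c prints the number of matching
# lines, which is 1 for this file.
LC_ALL=C grep -c -P '\xe2\x80\x93' temp
```

Using -c keeps the output to a single count, so the result reads the
same whether or not grep decides to treat the file as binary.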




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Mon, 23 Nov 2015 16:17:02 GMT)

Notification sent to Shivanshu Goyal <shivanshu3 <at> gmail.com>:
bug acknowledged by developer. (Mon, 23 Nov 2015 16:17:02 GMT)

Message #13 received at 21989-done <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>,
 Shivanshu Goyal <shivanshu3 <at> gmail.com>
Cc: 21989-done <at> debbugs.gnu.org
Subject: Re: bug#21989: grep search by ASCII code unsuccessful
Date: Mon, 23 Nov 2015 08:16:32 -0800

Thanks, Stephane, for diagnosing the problem. Closing the bug.




Information forwarded to bug-grep <at> gnu.org:
bug#21989; Package grep. (Mon, 23 Nov 2015 16:45:03 GMT)

Message #16 received at submit <at> debbugs.gnu.org:

From: Shivanshu Goyal <shivanshu3 <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: Re: grep search by ASCII code unsuccessful
Date: Mon, 23 Nov 2015 05:31:26 +0000

Correction:

The following is the grep 2.16/2.22 command and output:
(It doesn't output anything)

shivanshu@thetis:tmp$ cat temp | grep -P '\xe2\x80\x93'
shivanshu@thetis:tmp$


bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 22 Dec 2015 12:24:04 GMT)

This bug report was last modified 9 years and 180 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.