GNU bug report logs - #24975
Matching issues with characters whose encoding ends in some other character


Package: grep

Reported by: Stephane Chazelas <stephane.chazelas <at> gmail.com>

Date: Sun, 20 Nov 2016 21:51:01 UTC

Severity: normal

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.



Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: Matching issues with characters whose encoding ends in some other
 character
Date: Sun, 20 Nov 2016 21:50:28 +0000

$ locale charmap
GB18030
$ printf '\uC9\n' | grep  '.*7'  | hd
00000000  81 30 87 37 0a                                    |.0.7.|
00000005

U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).
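
As a rough illustration (a Python sketch, not grep's actual code path), the difference between searching the raw bytes and searching the decoded characters can be shown with Python's built-in gb18030 codec. The variable names are made up for this example.

```python
# U+00C9 (É) followed by a newline, encoded in GB18030 as in the hex dump above.
encoded = "\u00c9\n".encode("gb18030")
print(encoded.hex(" "))  # 81 30 87 37 0a

# Byte-wise search: the ASCII byte for "7" (0x37) is the last byte of the
# four-byte sequence, so a naive byte search reports a hit at offset 3.
byte_hit = encoded.find(b"7")

# Character-wise search: once decoded, the line holds only "É" and a newline,
# so there is no "7" character to find (find returns -1).
char_hit = encoded.decode("gb18030").find("7")

print(byte_hit, char_hit)  # 3 -1
```

A matcher that respects the locale's character boundaries would be expected to behave like the second search.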

$ printf '\uC9\n' | grep  '.*0'

fails.

$ printf '\uC9\n' | grep  -o '.*7'

returns with a zero exit status but outputs nothing. It's as if
.*7 matched an empty string somewhere.

$ printf '\uC9\n' | grep  '\(.*7\)\1'

fails.

So do the following (same printf '\uC9\n' input unless shown otherwise):

grep 7
grep '7$'
grep '.7'
grep '[^x]*7'
printf 'x\uC9\n' | grep -E '.+7'

These match (same input unless shown otherwise):

grep '.\{0,1\}7'
grep -E '.?7'
printf '\uC9x\n' | grep  '.*7x' # still outputs nothing with -o

That's not confined to GB18030. You get similar issues with
BIG5-HKSCS, BIG5 or GBK.

$ locale charmap
BIG5-HKSCS
$ printf '\ue9\n' | grep  '.*m'  | hd
00000000  88 6d 0a                                          |.m.|
00000003
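
To see why several charsets are affected, here is a small Python sketch that lists characters whose multibyte encoding ends in an ASCII-range byte. The helper name `ascii_trail_bytes` and the sample characters are chosen for illustration only, and Python's codecs are assumed to agree with the system charmaps for these particular code points.

```python
def ascii_trail_bytes(text, encoding):
    """Return (char, trailing byte as a str) for each multibyte character
    in `text` whose last encoded byte falls in the ASCII range."""
    hits = []
    for ch in text:
        raw = ch.encode(encoding)
        if len(raw) > 1 and raw[-1] < 0x80:
            hits.append((ch, chr(raw[-1])))
    return hits

# GB18030: U+00C9 is 81 30 87 37, so the trailing byte reads as "7".
print(ascii_trail_bytes("\u00c9", "gb18030"))
# GBK: U+4E02 is 81 40, so the trailing byte reads as "@".
print(ascii_trail_bytes("\u4e02", "gbk"))
# BIG5: U+4E00 is A4 40, so the trailing byte reads as "@".
print(ascii_trail_bytes("\u4e00", "big5"))
# UTF-8 never reuses ASCII bytes inside a multibyte sequence.
print(ascii_trail_bytes("\u00c9", "utf-8"))
```

UTF-8 yields an empty list here, which is consistent with the problem showing up only in charsets whose trailing bytes overlap ASCII.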

Reproduced with grep 2.25, 2.26 and the current git HEAD on Ubuntu 16.04 amd64.

-- 
Stephane




This bug report was last modified 8 years and 258 days ago.


