GNU bug report logs - #24975
Matching issues with characters whose encoding ends in some other character

Package: grep;

Reported by: Stephane Chazelas <stephane.chazelas <at> gmail.com>

Date: Sun, 20 Nov 2016 21:51:01 UTC

Severity: normal

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Subject: bug#24975: closed (Re: bug#24975: Matching issues with characters
 whose encoding ends in some other character)
Date: Mon, 28 Nov 2016 00:00:03 +0000
[Message part 1 (text/plain, inline)]
Your bug report

#24975: Matching issues with characters whose encoding ends in some other character

which was filed against the grep package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 24975 <at> debbugs.gnu.org.

-- 
24975: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=24975
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Jim Meyering <jim <at> meyering.net>
To: Stephane Chazelas <stephane.chazelas <at> gmail.com>
Cc: "bug-gnulib <at> gnu.org List" <bug-gnulib <at> gnu.org>, 24975-done <at> debbugs.gnu.org
Subject: Re: bug#24975: Matching issues with characters whose encoding ends in
 some other character
Date: Sun, 27 Nov 2016 15:59:05 -0800
[Message part 3 (text/plain, inline)]
On Sun, Nov 20, 2016 at 9:53 PM, Jim Meyering <jim <at> meyering.net> wrote:
> On Sun, Nov 20, 2016 at 2:59 PM, Stephane Chazelas
> <stephane.chazelas <at> gmail.com> wrote:
>> 2016-11-20 21:50:28 +0000, Stephane Chazelas:
>>> $ locale charmap
>>> GB18030
>>> $ printf '\uC9\n' | grep  '.*7'  | hd
>>> 00000000  81 30 87 37 0a                                    |.0.7.|
>>> 00000005
>>>
>>> U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).
>> [...]
>>> Reproduced with 2.25, 2.26 and the current git head on ubuntu 16.04 amd64.
>> [...]
>>
>> Same behaviour with 2.26 on Solaris 11.
>
> Thank you for the report.
> I can reproduce that error on Fedora 25 with this:
>
>   $ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep '.*7' k)|wc -c
>   5
>
> I confirmed that the problem does not arise (i.e., no match, with exit
> status of 1) when we force the use of glibc's regex matcher by
> inserting a trivial back-reference:
>
>   $ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep -E '()\1.*7' k); echo $?
>   1
>
> This bisected to v2.18-54-g3ef4c8e, but that commit was just the
> messenger: it exposed the latent bug by making it so this case was no
> longer handled by glibc's regexp matcher, but rather by grep's dfa.c.

I've fixed this by forcing any non-UTF8 multibyte locale to use the
regex matcher rather than the DFA matcher, with the following patches.
The gnulib/dfa patch makes that change; the grep patch updates to the
latest gnulib and adds tests and a NEWS entry.

I suspect this won't be the last word in this area, because it feels
like we should be able to adjust DFA's tables so that people using
such locales can retain DFA's efficiency without the bug in the
current implementation.
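
A minimal sketch for checking whether an installed grep still shows the
false match discussed in this thread. Assumptions: the locale name
zh_CN.gb18030 is taken from the reproduction quoted above and may not be
installed on a given system; the function name is hypothetical; the
bytes 81 30 87 37 are written as octal escapes so the script does not
depend on printf's '\u' support.

```shell
#!/bin/sh
# Hypothetical helper, not part of grep or gnulib.
# Prints "affected", "ok" or "skipped".
gb18030_false_match_check() {
    loc=${1:-zh_CN.gb18030}   # assumed locale name; pass another if needed

    # If the locale is missing, "locale charmap" does not report GB18030,
    # so skip the check instead of reporting a misleading result.
    if [ "$(LC_ALL=$loc locale charmap 2>/dev/null)" != GB18030 ]; then
        echo "skipped"
        return 0
    fi

    # U+00C9 in GB18030 is the byte sequence 81 30 87 37; the line holds
    # no "7" character, so a match on '.*7' is the false match reported.
    if printf '\201\060\207\067\n' | LC_ALL=$loc grep -q '.*7'; then
        echo "affected"
    else
        echo "ok"
    fi
}

gb18030_false_match_check
```

Per this report, "affected" would be expected from grep 2.25 and 2.26 in
a GB18030 locale, and "ok" from a build that includes the fix.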
[gnulib-dfa-mb-non-UTF8-fix.diff (text/plain, attachment)]
[grep-fix-false-matches-mb-non-UTF8.diff (text/plain, attachment)]
[Message part 6 (message/rfc822, inline)]
From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: Matching issues with characters whose encoding ends in some other
 character
Date: Sun, 20 Nov 2016 21:50:28 +0000
$ locale charmap
GB18030
$ printf '\uC9\n' | grep  '.*7'  | hd
00000000  81 30 87 37 0a                                    |.0.7.|
00000005

U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).

$ printf '\uC9\n' | grep  '.*0'

fails.

$ printf '\uC9\n' | grep  -o '.*7'

returns with a zero exit status but outputs nothing. It's as if
.*7 matched an empty string somewhere.

$ printf '\uC9\n' | grep  '\(.*7\)\1'

fails.

So do:

grep 7
grep '7$'
grep '.7'
grep '[^x]*7'
printf 'x\uC9\n' | grep -E '.+7'

These match:

grep '.\{0,1\}7'
grep -E '.?7'
printf '\uC9x\n' | grep  '.*7x' # still outputs nothing with -o

That's not confined to GB18030. You get similar issues with
BIG5-HKSCS, BIG5 or GBK.

$ locale charmap
BIG5-HKSCS
$ printf '\ue9\n' | grep  '.*m'  | hd
00000000  88 6d 0a                                          |.m.|
00000003

Reproduced with 2.25, 2.26 and the current git head on ubuntu 16.04 amd64.
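
The byte overlap behind the examples above can be inspected without
installing any of these locales. This is only an illustrative sketch:
the byte values are copied from the hexdumps in this report, and the
function names are hypothetical.

```shell
#!/bin/sh
# Bytes from the hexdumps above, written as octal escapes so the output
# is the same in any locale.
gb18030_e_acute()   { printf '\201\060\207\067\n'; }   # U+00C9: 81 30 87 37
big5hkscs_e_acute() { printf '\210\155\n'; }           # U+00E9: 88 6d

# hexbytes: hypothetical helper, prints stdin as one lowercase hex string.
hexbytes() { od -An -tx1 | tr -d ' \t\n'; echo; }

gb18030_e_acute   | hexbytes    # prints 813087370a
big5hkscs_e_acute | hexbytes    # prints 886d0a

# In the C locale grep matches byte by byte, so the trailing byte of each
# multibyte character is found as the ASCII character "7" or "m".
gb18030_e_acute   | LC_ALL=C grep -c '7'    # prints 1
big5hkscs_e_acute | LC_ALL=C grep -c 'm'    # prints 1
```

A matcher that respects character boundaries in GB18030 or BIG5-HKSCS
should not report those matches, since each line holds a single
multibyte character followed by a newline.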

-- 
Stephane



This bug report was last modified 8 years and 258 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.