GNU bug report logs - #15440
[PATCH] dfa: fix \s and \S to work for multibyte


Package: grep

Reported by: Jim Meyering <jim <at> meyering.net>

Date: Mon, 23 Sep 2013 05:18:02 UTC

Severity: normal

Tags: patch

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 15440 in the body.
You can then email your comments to 15440 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-grep <at> gnu.org:
bug#15440; Package grep. (Mon, 23 Sep 2013 05:18:02 GMT)

Acknowledgement sent to Jim Meyering <jim <at> meyering.net>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 23 Sep 2013 05:18:03 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: bug-grep <at> gnu.org
Subject: [PATCH] dfa: fix \s and \S to work for multibyte
Date: Sun, 22 Sep 2013 22:17:01 -0700
This one really surprised me.
Learning that multibyte \s and \S had been broken since grep-2.6 did
not make my day. But fixing it helped.

Here's how it started:

To demonstrate the (first) bug, set up to use a UTF-8 locale:

    export LC_ALL=en_US.UTF-8

then run this and note that it matches:

    $ printf '\x82\n' > in; ./grep -q '\S' in && echo match
    match

Now, require a back-reference (forcing switch from grep's DFA matcher
to use of the regex functions), and you see there is no match:

    $ printf '\x82\x82\n' > in; ./grep -qE '(\S)\1' in && echo match
    $
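For readers following along: the byte 0x82 used above is a UTF-8 continuation byte, so on its own it cannot form a valid character in a UTF-8 locale. The sketch below classifies the first byte of its input by range. `utf8_first_byte_kind` is a hypothetical helper written for this note, not anything in grep, and it uses octal escapes because `\x` escapes in `printf` are not portable across shells.

```shell
# Classify the first byte read from stdin by the role it can play in UTF-8.
# utf8_first_byte_kind is a hypothetical helper, not part of grep.
utf8_first_byte_kind() {
    # od prints the byte as an unsigned decimal; tr keeps only the digits.
    byte=$(od -An -tu1 -N1 | tr -dc '0-9')
    if [ -z "$byte" ]; then echo empty
    elif [ "$byte" -lt 128 ]; then echo ascii          # 0x00-0x7F
    elif [ "$byte" -lt 192 ]; then echo continuation   # 0x80-0xBF
    elif [ "$byte" -lt 194 ]; then echo invalid        # 0xC0-0xC1 (overlong)
    elif [ "$byte" -lt 224 ]; then echo lead2          # 0xC2-0xDF
    elif [ "$byte" -lt 240 ]; then echo lead3          # 0xE0-0xEF
    elif [ "$byte" -lt 245 ]; then echo lead4          # 0xF0-0xF4
    else echo invalid                                  # 0xF5-0xFF
    fi
}

printf '\202' | utf8_first_byte_kind       # 0x82: prints "continuation"
printf '\302\240' | utf8_first_byte_kind   # 0xC2 0xA0: prints "lead2"
```

A line that starts with a continuation byte holds no valid character, which is why the two matchers can disagree on whether `\S` should match it.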

One fix would be to make dfaexec's \S-processing fail to match an
invalid multibyte sequence, just as its "."-processing does.
That led me to this realization:

Uh oh.  This is worse: \s is not multi-byte aware.
The two-byte "NO-BREAK SPACE" character is not matched by \s.

This fails:
    $ printf 'a\xc2\xa0b\n'|./grep 'a\sb'
    $

This matches in spite of the fact that grep.texi says \s is
     equivalent to [[:space:]] :
    $ printf 'a\xc2\xa0b\n'|./grep 'a[[:space:]]b'
    a b

GNU grep fails:
(but if I do s/\\s/[[:space:]]/ to the RE, then it does match)
    $ printf 'a\xc2\xa0ba\xc2\xa0b\n'|./grep -E '(a\sb)\1' grep:
    $
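As a cross-check on the bytes used in the commands above, here is a minimal sketch that decodes a two-byte UTF-8 sequence into its code point. `utf8_decode2` is a hypothetical helper for illustration only; it handles nothing but two-byte sequences, and the input is written with octal escapes (`\302\240` is the same pair as `\xc2\xa0`).

```shell
# Decode one two-byte UTF-8 sequence from stdin and print its code point.
# utf8_decode2 is a hypothetical helper for illustration only.
utf8_decode2() {
    # Split od's decimal dump into positional parameters: $1 lead, $2 trail.
    set -- $(od -An -tu1 -N2)
    if [ "$#" -ne 2 ] || [ "$1" -lt 194 ] || [ "$1" -gt 223 ] ||
       [ "$2" -lt 128 ] || [ "$2" -gt 191 ]; then
        echo 'not a two-byte UTF-8 sequence' >&2
        return 1
    fi
    # The lead byte carries 5 payload bits and the trailing byte carries 6.
    printf 'U+%04X\n' "$(( (($1 & 31) << 6) | ($2 & 63) ))"
}

printf '\302\240' | utf8_decode2   # prints U+00A0 (NO-BREAK SPACE)
```

So a byte-at-a-time \s can never match this character: neither 0xC2 nor 0xA0 is a space on its own.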

Patch attached:
[0003-dfa-fix-s-and-S-to-work-for-multibyte.patch (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#15440; Package grep. (Mon, 23 Sep 2013 21:05:01 GMT)

Message #8 received at 15440 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Aharon Robbins <arnold <at> skeeve.com>, 15440 <at> debbugs.gnu.org
Subject: Re: bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte
Date: Mon, 23 Sep 2013 14:04:09 -0700
[using the right bug address, this time]

On Mon, Sep 23, 2013 at 11:26 AM, Aharon Robbins <arnold <at> skeeve.com> wrote:
> Hi.
>
>>     $ printf '\x82\n' > in; ./grep -q '\S' in && echo match
>>     match
>>
>> Now, require a back-reference (forcing switch from grep's DFA matcher
>> to use of the regex functions), and you see there is no match:
>>
>>     $ printf '\x82\x82\n' > in; ./grep -qE '(\S)\1' in && echo match
>>     $
>
> I see similar results with gawk, accounting for syntactic difference
> and a different way to force the regex matcher.
>
> So far so good.
>
>> Uh oh.  This is worse: \s is not multi-byte aware.
>> The two-byte "NO-BREAK SPACE" character is not matched by \s.
>>
>> This fails:
>>     $ printf 'a\xc2\xa0b\n'|./grep 'a\sb'
>>     $
>>
>> This matches in spite of the fact that grep.texi says \s is
>>      equivalent to [[:space:]] :
>>     $ printf 'a\xc2\xa0b\n'|./grep 'a[[:space:]]b'
>>     a b
>>
>> GNU grep fails:
>> (but if I do s/\\s/[[:space:]]/ to the RE, then it does match)
>>     $ printf 'a\xc2\xa0ba\xc2\xa0b\n'|./grep -E '(a\sb)\1' grep:
>>     $
>
> I cannot reproduce this with gawk.  Setting GAWK_NO_DFA=1 in the
> environment causes gawk to bypass dfa. For these it makes no
> difference:
>
> $ printf 'a\xc2\xa0b\n' | ./gawk '/a\sb/'
> $ printf 'a\xc2\xa0b\n' | GAWK_NO_DFA=1 ./gawk '/a\sb/'
>
> No result from either, and similar results for [[:space:]].

Hi Arnold,
[re-adding CC to the bug tracker]

Thanks for testing.
When I test on glibc, I confirm what you report: [[:space:]] fails to
match NBSP.  Makes me think either glibc's UTF8 attribute tables are
wrong, or there's a bug in regex:

  $ printf 'a\xc2\xa0b\n'|LC_ALL=en_US.UTF-8 grep 'a[[:space:]]b'
  [Exit 1]

Initially, I considered constructing a DFA that would match all UTF8
white space characters (see the FIXME comment), and another that would
match the complement of that set minus the set of invalid UTF8 bytes,
but ended up preferring the simpler change.
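To give a feel for what a DFA over "all UTF-8 white space" would have to recognize, the sketch below prints the byte sequences for three code points that carry the Unicode White_Space property. `utf8_encode` is a hypothetical helper, and the three code points are picked for illustration; they are not taken from the patch or its FIXME comment.

```shell
# Print the UTF-8 encoding of a code point (U+0000..U+FFFF) as hex bytes.
# utf8_encode is a hypothetical helper; surrogates are not rejected.
utf8_encode() {
    cp=$(( $1 ))   # accepts decimal or 0x-prefixed hex
    if [ "$cp" -lt 128 ]; then
        printf '%02x\n' "$cp"
    elif [ "$cp" -lt 2048 ]; then
        printf '%02x %02x\n' "$(( 192 | (cp >> 6) ))" "$(( 128 | (cp & 63) ))"
    elif [ "$cp" -lt 65536 ]; then
        printf '%02x %02x %02x\n' "$(( 224 | (cp >> 12) ))" \
            "$(( 128 | ((cp >> 6) & 63) ))" "$(( 128 | (cp & 63) ))"
    else
        echo 'code point out of range for this sketch' >&2
        return 1
    fi
}

utf8_encode 0x00A0   # NO-BREAK SPACE:    c2 a0
utf8_encode 0x2003   # EM SPACE:          e2 80 83
utf8_encode 0x3000   # IDEOGRAPHIC SPACE: e3 80 80
```

The sequences differ in both length and lead byte, so such a DFA needs a separate path for each prefix. That is part of why the simpler change is attractive.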

FTR, I tested this only on a system for which all tests passed (OS X).
I'm very surprised to find it doesn't work on a glibc-based system.




Information forwarded to bug-grep <at> gnu.org:
bug#15440; Package grep. (Tue, 24 Sep 2013 12:25:02 GMT)

Message #11 received at 15440 <at> debbugs.gnu.org:

From: Aharon Robbins <arnold <at> skeeve.com>
To: jim <at> meyering.net, arnold <at> skeeve.com, 15440 <at> debbugs.gnu.org
Subject: Re: bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte
Date: Tue, 24 Sep 2013 15:24:45 +0300
Hi Jim.

I should note that gawk uses its own regex, although it does rely
on glibc for isspace / iswspace etc...

Can you test gawk (using the master branch is fine) on Mac OS X?
Basically you'd want to enclose the pattern in /.../ on the command
line and use GAWK_NO_DFA=1 to force use of regex.

In any case, once you push the changes I'll pick them up.

Thanks,

Arnold

P.S. To test gawk, cut and paste:

	git clone git://git.savannah.gnu.org/gawk.git
	cd gawk
	./bootstrap.sh && ./configure && make -j 10 # or whatever
	make check	# optional

	printf '....' | ./gawk '/.../'	# your tests here. :-)

Much thanks!





Information forwarded to bug-grep <at> gnu.org:
bug#15440; Package grep. (Wed, 02 Oct 2013 00:40:04 GMT)

Message #14 received at 15440 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: Aharon Robbins <arnold <at> skeeve.com>
Cc: 15440 <at> debbugs.gnu.org
Subject: Re: bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte
Date: Tue, 1 Oct 2013 17:38:40 -0700
On Tue, Sep 24, 2013 at 5:24 AM, Aharon Robbins <arnold <at> skeeve.com> wrote:
> Hi Jim.
>
> I should note that gawk uses its own regex, although it does rely
> on glibc for isspace / iswspace etc...
...

close 15440
thanks

I've pushed my grep patches, but chose to omit 4 multibyte space
characters from the list in the test, since each of those would
provoke a failure on recent glibc-based systems (fedora 19).  That
seems to be due to errors in glibc's UTF-8 multibyte flags (wrong
whitespace bit) for those characters.
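A per-character probe along these lines makes such libc differences easy to see. This is only a sketch: `space_probe` is a hypothetical helper, not code from grep's test suite, and the thread does not say which four characters were omitted. The result for any multibyte character depends on the installed libc and locale data, so none is asserted here. If the requested locale is not installed, grep runs in the C locale, where no multibyte character can match.

```shell
# Report whether grep's [[:space:]] matches a character in a given locale.
# Usage: space_probe LOCALE OCTAL_ESCAPED_BYTES
# space_probe is a hypothetical helper, not taken from grep's test suite.
space_probe() {
    # $2 is used as a printf format so octal escapes such as \302\240 expand;
    # it must therefore not contain a literal '%'.
    if printf "$2\n" | LC_ALL=$1 grep -q '[[:space:]]'; then
        echo match
    else
        echo 'no match'
    fi
}

space_probe C ' '                    # ASCII space
space_probe en_US.UTF-8 '\302\240'   # NO-BREAK SPACE; result depends on libc
```

Running the probe over a list of candidate characters on each target system shows which ones that system's locale data treats as white space.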

Arnold, I tried your latest gawk on a Fedora 19 system, and see the
same failure for those four characters, e.g.,

$ printf '\xc2\xa0\n' | LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 ./gawk '/[[:space:]]/'|wc -c
0




bug closed, send any further explanations to 15440 <at> debbugs.gnu.org and Jim Meyering <jim <at> meyering.net> Request was from Jim Meyering <jim <at> meyering.net> to control <at> debbugs.gnu.org. (Mon, 28 Oct 2013 00:22:02 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 25 Nov 2013 12:24:05 GMT)

This bug report was last modified 11 years and 266 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.