GNU bug report logs - #15440
[PATCH] dfa: fix \s and \S to work for multibyte

Package: grep

Reported by: Jim Meyering <jim <at> meyering.net>

Date: Mon, 23 Sep 2013 05:18:02 UTC

Severity: normal

Tags: patch

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

Message #8 received at 15440 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Aharon Robbins <arnold <at> skeeve.com>, 15440 <at> debbugs.gnu.org
Subject: Re: bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte
Date: Mon, 23 Sep 2013 14:04:09 -0700
[using the right bug address, this time]

On Mon, Sep 23, 2013 at 11:26 AM, Aharon Robbins <arnold <at> skeeve.com> wrote:
> Hi.
>
>>     $ printf '\x82\n' > in; ./grep -q '\S' in && echo match
>>     match
>>
>> Now, require a back-reference (forcing switch from grep's DFA matcher
>> to use of the regex functions), and you see there is no match:
>>
>>     $ printf '\x82\x82\n' > in; ./grep -qE '(\S)\1' in && echo match
>>     $
>
> I see similar results with gawk, accounting for syntactic difference
> and a different way to force the regex matcher.
>
> So far so good.
>
>> Uh oh.  This is worse: \s is not multi-byte aware.
>> The two-byte "NO-BREAK SPACE" character is not matched by \s.
>>
>> This fails:
>>     $ printf 'a\xc2\xa0b\n'|./grep 'a\sb'
>>     $
>>
>> This matches in spite of the fact that grep.texi says \s is
>>      equivalent to [[:space:]] :
>>     $ printf 'a\xc2\xa0b\n'|./grep 'a[[:space:]]b'
>>     a b
>>
>> GNU grep fails:
>> (but if I do s/\\s/[[:space:]]/ to the RE, then it does match)
>>     $ printf 'a\xc2\xa0ba\xc2\xa0b\n'|./grep -E '(a\sb)\1' grep:
>>     $
>
> I cannot reproduce this with gawk.  Setting GAWK_NO_DFA=1 in the
> environment causes gawk to bypass dfa. For these it makes no
> difference:
>
> $ printf 'a\xc2\xa0b\n' | ./gawk '/a\sb/'
> $ printf 'a\xc2\xa0b\n' | GAWK_NO_DFA=1 ./gawk '/a\sb/'
>
> No result from either, and similar results for [[:space:]].

Hi Arnold,
[re-adding CC to the bug tracker]

Thanks for testing.
When I test on glibc, I confirm what you report: [[:space:]] fails to
match NBSP.  Makes me think either glibc's UTF8 attribute tables are
wrong, or there's a bug in regex:

  $ printf 'a\xc2\xa0b\n'|LC_ALL=en_US.UTF-8 grep 'a[[:space:]]b'
  [Exit 1]
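
To repeat this comparison on another system, something like the following sketch may help. It feeds the same "a<NBSP>b" line to grep once per pattern and reports whether each matched. The helper name probe_nbsp and the default locale en_US.UTF-8 are assumptions made for illustration; the locale has to be installed for the result to mean anything, and \s is a GNU extension that other grep implementations may not accept.

```shell
# Hypothetical helper: report whether PATTERN matches the line "a<NBSP>b"
# under the given locale (default en_US.UTF-8, which is an assumption).
probe_nbsp() {
    pattern=$1
    locale=${2:-en_US.UTF-8}
    # \302\240 is the octal spelling of the UTF-8 bytes C2 A0 (U+00A0),
    # used instead of \xc2\xa0 so that a plain POSIX printf accepts it.
    if printf 'a\302\240b\n' | LC_ALL=$locale grep -q -- "$pattern"; then
        echo "match: $pattern"
    else
        echo "no match: $pattern"
    fi
}

# Compare the two spellings that grep.texi describes as equivalent.
probe_nbsp 'a\sb'
probe_nbsp 'a[[:space:]]b'
```

If the two lines of output disagree, the difference lies in how grep handles \s; if they agree on "no match", the locale's own character classification is the more likely place to look.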

Initially, I considered constructing a DFA that would match all UTF8
white space characters (see the FIXME comment), and another that would
match the complement of that set minus the set of invalid UTF8 bytes,
but ended up preferring the simpler change.
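
As a rough illustration of why such a DFA would have to deal with multi-byte sequences, the sketch below dumps the UTF-8 bytes of a few characters that Unicode marks as white space. The helper name hex_bytes is hypothetical, and whether a particular C library puts each of these characters in its space class is a separate question that this does not answer.

```shell
# Hypothetical helper: print the bytes produced by a printf format as hex.
hex_bytes() {
    printf "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

hex_bytes '\011'           # U+0009 TAB, a single byte
hex_bytes '\302\240'       # U+00A0 NO-BREAK SPACE, two bytes
hex_bytes '\342\200\203'   # U+2003 EM SPACE, three bytes
hex_bytes '\343\200\200'   # U+3000 IDEOGRAPHIC SPACE, three bytes
```

A byte-oriented matcher sees each of these as a sequence of one to three states, not as a single character.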

FTR, I tested this only on a system for which all tests passed (OS/X).
I am very surprised to find it doesn't work on a glibc-based system.




This bug report was last modified 11 years and 267 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.