GNU bug report logs - #15440
[PATCH] dfa: fix \s and \S to work for multibyte
Reported by: Jim Meyering <jim <at> meyering.net>
Date: Mon, 23 Sep 2013 05:18:02 UTC
Severity: normal
Tags: patch
Done: Jim Meyering <jim <at> meyering.net>
Bug is archived. No further changes may be made.
Message #11 received at 15440 <at> debbugs.gnu.org (full text, mbox):
Hi Jim.
I should note that gawk uses its own regex, although it does rely
on glibc for isspace / iswspace etc...
Can you test gawk (using the master branch is fine) on Mac OS X?
Basically you'd want to enclose the pattern in /.../ on the command
line and use GAWK_NO_DFA=1 to force use of regex.
In any case, once you push the changes I'll pick them up.
Thanks,
Arnold
P.S. To test gawk, cut and paste:
git clone git://git.savannah.gnu.org/gawk.git
cd gawk
./bootstrap.sh && ./configure && make -j 10 # or whatever
make check # optional
printf '....' | ./gawk '/.../' # your tests here. :-)
Much thanks!
> From: Jim Meyering <jim <at> meyering.net>
> Date: Mon, 23 Sep 2013 14:04:09 -0700
> Subject: Re: bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte
> To: Aharon Robbins <arnold <at> skeeve.com>, 15440 <at> debbugs.gnu.org
>
> [using the right bug address, this time]
>
> On Mon, Sep 23, 2013 at 11:26 AM, Aharon Robbins <arnold <at> skeeve.com> wrote:
> > Hi.
> >
> >> $ printf '\x82\n' > in; ./grep -q '\S' in && echo match
> >> match
> >>
> >> Now, require a back-reference (forcing switch from grep's DFA matcher
> >> to use of the regex functions), and you see there is no match:
> >>
> >> $ printf '\x82\x82\n' > in; ./grep -qE '(\S)\1' in && echo match
> >> $
> >
> > I see similar results with gawk, accounting for syntactic difference
> > and a different way to force the regex matcher.
> >
> > So far so good.
> >
> >> Uh oh. This is worse: \s is not multi-byte aware.
> >> The two-byte "NO-BREAK SPACE" character is not matched by \s.
> >>
> >> This fails:
> >> $ printf 'a\xc2\xa0b\n'|./grep 'a\sb'
> >> $
> >>
> >> This matches in spite of the fact that grep.texi says \s is
> >> equivalent to [[:space:]] :
> >> $ printf 'a\xc2\xa0b\n'|./grep 'a[[:space:]]b'
> >> a b
> >>
> >> GNU grep fails:
> >> (but if I do s/\\s/[[:space:]]/ to the RE, then it does match)
> >> $ printf 'a\xc2\xa0ba\xc2\xa0b\n'|./grep -E '(a\sb)\1'
> >> $
> >
> > I cannot reproduce this with gawk. Setting GAWK_NO_DFA=1 in the
> > environment causes gawk to bypass dfa. For these it makes no
> > difference:
> >
> > $ printf 'a\xc2\xa0b\n' | ./gawk '/a\sb/'
> > $ printf 'a\xc2\xa0b\n' | GAWK_NO_DFA=1 ./gawk '/a\sb/'
> >
> > No result from either, and similar results for [[:space:]].
>
> Hi Arnold,
> [re-adding CC to the bug tracker]
>
> Thanks for testing.
> When I test on glibc, I confirm what you report: [[:space:]] fails to
> match NBSP. Makes me think either glibc's UTF8 attribute tables are
> wrong, or there's a bug in regex:
>
> $ printf 'a\xc2\xa0b\n'|LC_ALL=en_US.UTF-8 grep 'a[[:space:]]b'
> [Exit 1]
>
> Initially, I considered constructing a DFA that would match all UTF8
> white space characters (see the FIXME comment), and another that would
> match the complement of that set minus the set of invalid UTF8 bytes,
> but ended up preferring the simpler change.
>
> FTR, I tested this only on a system for which all tests passed (OS/X).
> Very surprised to find it doesn't work on a glibc-based system.
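
For anyone reproducing the NBSP cases quoted above, here is a minimal sketch
that runs \s and [[:space:]] side by side against the same input. It is not
part of the patch. The helper names (nbsp_line, nbsp_hex, try_pattern) are
made up for illustration, it assumes the en_US.UTF-8 locale from the thread is
installed, and it calls whatever grep is on PATH rather than a freshly built
./grep. Results are expected to vary by platform and libc, so none are claimed.

```shell
#!/bin/sh
# NO-BREAK SPACE (U+00A0) is the two bytes 0xC2 0xA0 in UTF-8. Octal escapes
# are used here because POSIX printf does not guarantee \x support.
nbsp_line() { printf 'a\302\240b\n'; }

# Dump the raw bytes so it is clear what the matcher actually sees.
nbsp_hex() { nbsp_line | od -An -tx1 | tr -d ' \n'; }

# Report whether pattern $1 matches the NBSP line.
# $2 is the locale to use; the default is an assumption taken from the thread.
try_pattern() {
    if nbsp_line | LC_ALL="${2:-en_US.UTF-8}" grep -q -- "$1"; then
        echo "match:    $1"
    else
        echo "no match: $1"
    fi
}

nbsp_hex; echo
try_pattern 'a\sb'            # \s is a GNU extension
try_pattern 'a[[:space:]]b'   # documented in grep.texi as equivalent to \s
```

If the two patterns disagree, the \s handling in the matcher differs from the
bracket-expression path. If both fail, the locale's character classification
of NBSP is the more likely suspect.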
This bug report was last modified 11 years and 267 days ago.