GNU bug report logs -
#16912
[PATCH] no longer use CSET for non-UTF8 locale in DFA engine
Previous Next
Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Date: Sat, 1 Mar 2014 09:49:02 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16912 in the body.
You can then email your comments to 16912 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-grep <at> gnu.org
:
bug#16912
; Package
grep
.
(Sat, 01 Mar 2014 09:49:03 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Norihiro Tanaka <noritnk <at> kcn.ne.jp>
:
New bug report received and forwarded. Copy sent to
bug-grep <at> gnu.org
.
(Sat, 01 Mar 2014 09:49:03 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Package: grep
Tags: patch
I have overlooked the important thing about optimization by
trivial_case_ignore. After optimization by trivial_case_ignore,
kwset engine can be used yet. However, if remove trivial_case_ignore,
it's never used longer because kwsmusts does nothing when MB_CUR_MAX > 1
&& match_icase.
The patch reverts removal of trivial_case_ignore and fixes 200x slower
for non-UTF8 locales with another approach. It always prefers CSET to
replacement to OR and no longer use CSET for non-UTF8 locales in DFA
engine.
It can also optimize by trivial_case_ignore and enables to speed-up >20x
for non-UTF8 locales. (I tested it with euc-jp)
Norihiro
[patch.txt (application/octet-stream, attachment)]
[tests.txt (application/octet-stream, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16912
; Package
grep
.
(Sun, 02 Mar 2014 00:14:01 GMT)
Full text and
rfc822 format available.
Message #8 received at 16912 <at> debbugs.gnu.org (full text, mbox):
Thanks for looking into this. Unfortunately the combination of the two
patches causes "make check" to fail, because it reintroduces a titlecase
bug. I can draft a further patch for that, but in the meantime can you
look at a few other things?
First, why does the first patch add those four using_utf8 calls to
parse_bracket_exp? Isn't that optimization valid regardless of whether
the multibyte encoding is UTF-8?
Second, the comment "UTF-8 allows treating a simple, non-inverted MBCSET
like a CSET." no longer seems to match the code, since addtok no longer
invokes using_utf8.
Third, could you please draft a proper commit message? The format is
something like this:
grep: minor tuning for mb_case_map_apply
* src/kwsearch.c (mb_case_map_apply): Avoid unnecessary widening of
size_t to intmax_t. Avoid unnecessary reinitialization of k.
That is, a first line of the form "program: short description". Then an
empty line. Then a ChangeLog entry in standard GNU format.
I'll take a look at the second patch later.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16912
; Package
grep
.
(Sun, 02 Mar 2014 01:24:02 GMT)
Full text and
rfc822 format available.
Message #11 received at 16912 <at> debbugs.gnu.org (full text, mbox):
Hi Paul
Thank you for checking the patch.
> First, why does the first patch add those four using_utf8 calls to
> parse_bracket_exp? Isn't that optimization valid regardless of
> whether the multibyte encoding is UTF-8?
The optimization which MBCSET is changed into CSET in addtok is completed
on UTF8 locale only, because even if work_mbc->cset is defined in non-UTF8
locales, it's treated as not CSET but MBCSET. So if not CSET to replacement
to OR, dfa will keep MBCSET until last and return backref. I want to
avoid it.
However I don't understand why the optimization isn't completed on
non-UTF8 locale only. Can you explain it?
Norihiro
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16912
; Package
grep
.
(Sun, 02 Mar 2014 09:49:01 GMT)
Full text and
rfc822 format available.
Message #14 received at 16912 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
I have added several modifications to the patch.
First, I fixed the bug for titlecase.
Second, I changed it so that prefered replacement to OR to CSET in order
to reduce a number of states.
Third, I modified comments in source code and put drafts of commit
messages in the patch.
Norihiro
[patch.txt (application/octet-stream, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16912
; Package
grep
.
(Mon, 03 Mar 2014 06:14:01 GMT)
Full text and
rfc822 format available.
Message #17 received at 16912 <at> debbugs.gnu.org (full text, mbox):
Norihiro Tanaka wrote:
> However I don't understand why the optimization isn't completed on
> non-UTF8 locale only. Can you explain it?
Sorry, no; there's a lot about that code I don't yet understand.
Reply sent
to
Paul Eggert <eggert <at> cs.ucla.edu>
:
You have taken responsibility.
(Mon, 03 Mar 2014 07:08:02 GMT)
Full text and
rfc822 format available.
Notification sent
to
Norihiro Tanaka <noritnk <at> kcn.ne.jp>
:
bug acknowledged by developer.
(Mon, 03 Mar 2014 07:08:03 GMT)
Full text and
rfc822 format available.
Message #22 received at 16912-done <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Thanks, I tweaked the ChangeLog entries a bit and pushed that. I also
pushed the attached patch, which fixes some new bugs and some bugs that
were reintroduced by the revival of trivial_case_ignore. I wish we
didn't need that function, as it is a bit of a kludge.
[0001-grep-fix-some-unlikely-bugs-in-trivial_case_ignore.patch (text/plain, attachment)]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16912
; Package
grep
.
(Tue, 04 Mar 2014 15:51:02 GMT)
Full text and
rfc822 format available.
Message #25 received at 16912 <at> debbugs.gnu.org (full text, mbox):
Il 03/03/2014 07:13, Paul Eggert ha scritto:
> Norihiro Tanaka wrote:
>> However I don't understand why the optimization isn't completed on
>> non-UTF8 locale only. Can you explain it?
>
> Sorry, no; there's a lot about that code I don't yet understand.
IIRC it's because a CSET matches any byte, while the corresponding
MBCSET only matches that byte if it is a single-byte character. So for
example, say "\x83A" is a two-byte character. The CSET "A" will match
it but the corresponding MBCSET will not.
This can happen in the Shift-JIS encoding.
Paolo
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16912
; Package
grep
.
(Tue, 04 Mar 2014 23:13:02 GMT)
Full text and
rfc822 format available.
Message #28 received at 16912 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert wrote:
> IIRC it's because a CSET matches any byte, while the corresponding
> MBCSET only matches that byte if it is a single-byte character.
> So for example, say "\x82\x61" is a two-byte character. The CSET "A"
> will match it but the corresponding MBCSET will not.
>
> This can happen in the Shift-JIS encoding.
First, I also thoutht such a case. But perhaps it's no problem, because
DFA will never come across CSET on second byte in Shift_JIS.
"grep -i A" -> [Aa] -> CSET
"grep -i $"\x82A" -> [$"\x82\x82A"$"\x82\x82"] -> \x82 A CAT \x82 \x82 CAT OR
Laster will be never \x82 [A\x82] -> \x82 CSET CAT.
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16912
; Package
grep
.
(Wed, 05 Mar 2014 08:01:02 GMT)
Full text and
rfc822 format available.
Message #31 received at 16912 <at> debbugs.gnu.org (full text, mbox):
Il 05/03/2014 00:12, Norihiro Tanaka ha scritto:
> First, I also thoutht such a case. But perhaps it's no problem, because
> DFA will never come across CSET on second byte in Shift_JIS.
>
> "grep -i A" -> [Aa] -> CSET
> "grep -i $"\x82A" -> [$"\x82\x82A"$"\x82\x82"] -> \x82 A CAT \x82 \x82 CAT OR
>
> Laster will be never \x82 [A\x82] -> \x82 CSET CAT.
What about these two commands:
grep [a]
grep -i A
Would they match \x82\x61 ("B", U+0FF22) with your patch? And without it?
Paolo
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16912
; Package
grep
.
(Wed, 05 Mar 2014 13:42:02 GMT)
Full text and
rfc822 format available.
Message #34 received at 16912 <at> debbugs.gnu.org (full text, mbox):
Paolo Bonzini wrote:
> What about these two commands:
>
> grep [a]
> grep -i A
>
> Would they match \x82\x61 ("B", U+0FF22) with your patch? And without it?
No match for all.
--
Before the patch:
$ locale -a | grep sjis
ja_JP.sjis
$ printf "\x82\x61\n" | env LC_ALL=ja_JP.sjis src/grep -i 'A'
dfaanalyze:
0:A 1:a 2:OR 3:END 4:CAT
$ printf "\x82\x61\n" | env LC_ALL=ja_JP.sjis src/grep '[a]'
dfaanalyze:
0:MBCSET 1:END 2:CAT
After the patch:
$ locale -a | grep sjis
ja_JP.sjis
$ printf "\x82\x61\n" | env LC_ALL=ja_JP.sjis src/grep -i 'A'
dfaanalyze:
0:CSET 1:END 2:CAT
$ printf "\x82\x61\n" | env LC_ALL=ja_JP.sjis src/grep '[a]'
dfaanalyze:
0:CSET 1:END 2:CAT
--
Norihiro
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16912
; Package
grep
.
(Wed, 05 Mar 2014 14:33:02 GMT)
Full text and
rfc822 format available.
Message #37 received at 16912 <at> debbugs.gnu.org (full text, mbox):
Il 05/03/2014 14:41, Norihiro Tanaka ha scritto:
> Paolo Bonzini wrote:
>
>> What about these two commands:
>>
>> grep [a]
>> grep -i A
>>
>> Would they match \x82\x61 ("B", U+0FF22) with your patch? And without it?
>
> No match for all.
Right, it's handled by SKIP_REMAINS_MB_IF_INITIAL_STATE.
dfa.c never stops surprising me. Great catch!
Paolo
> --
> Before the patch:
>
> $ locale -a | grep sjis
> ja_JP.sjis
> $ printf "\x82\x61\n" | env LC_ALL=ja_JP.sjis src/grep -i 'A'
> dfaanalyze:
> 0:A 1:a 2:OR 3:END 4:CAT
> $ printf "\x82\x61\n" | env LC_ALL=ja_JP.sjis src/grep '[a]'
> dfaanalyze:
> 0:MBCSET 1:END 2:CAT
>
> After the patch:
>
> $ locale -a | grep sjis
> ja_JP.sjis
> $ printf "\x82\x61\n" | env LC_ALL=ja_JP.sjis src/grep -i 'A'
> dfaanalyze:
> 0:CSET 1:END 2:CAT
> $ printf "\x82\x61\n" | env LC_ALL=ja_JP.sjis src/grep '[a]'
> dfaanalyze:
> 0:CSET 1:END 2:CAT
> --
>
> Norihiro
>
>
Information forwarded
to
bug-grep <at> gnu.org
:
bug#16912
; Package
grep
.
(Wed, 05 Mar 2014 23:48:01 GMT)
Full text and
rfc822 format available.
Message #40 received at 16912 <at> debbugs.gnu.org (full text, mbox):
Paolo Bonzini wrote:
> Right, it's handled by SKIP_REMAINS_MB_IF_INITIAL_STATE.
Yes. It's handled by SKIP_REMAINS_MB_IF_INITIAL_STATE, so no problem.
Norihiro
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Thu, 03 Apr 2014 11:24:03 GMT)
Full text and
rfc822 format available.
This bug report was last modified 11 years and 78 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.