GNU bug report logs - #16912
[PATCH] no longer use CSET for non-UTF8 locale in DFA engine

Previous Next

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Sat, 1 Mar 2014 09:49:02 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16912 in the body.
You can then email your comments to 16912 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#16912; Package grep. (Sat, 01 Mar 2014 09:49:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sat, 01 Mar 2014 09:49:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: submit <at> debbugs.gnu.org
Subject: [PATCH] no longer use CSET for non-UTF8 locale in DFA engine
Date: Sat, 01 Mar 2014 18:48:22 +0900
[Message part 1 (text/plain, inline)]
Package: grep
Tags: patch

I have overlooked the important thing about optimization by
trivial_case_ignore.  After optimization by trivial_case_ignore,
kwset engine can be used yet.  However, if remove trivial_case_ignore,
it's never used longer because kwsmusts does nothing when MB_CUR_MAX > 1
&& match_icase.

The patch reverts removal of trivial_case_ignore and fixes 200x slower
for non-UTF8 locales with another approach.  It always prefers CSET to
replacement to OR and no longer use CSET for non-UTF8 locales in DFA
engine.

It can also optimize by trivial_case_ignore and enables to speed-up >20x
for non-UTF8 locales. (I tested it with euc-jp)

Norihiro
[patch.txt (application/octet-stream, attachment)]
[tests.txt (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16912; Package grep. (Sun, 02 Mar 2014 00:14:01 GMT) Full text and rfc822 format available.

Message #8 received at 16912 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 16912 <at> debbugs.gnu.org
Subject: Re: bug#16912: [PATCH] no longer use CSET for non-UTF8 locale in
 DFA engine
Date: Sat, 01 Mar 2014 16:13:14 -0800
Thanks for looking into this.  Unfortunately the combination of the two 
patches causes "make check" to fail, because it reintroduces a titlecase 
bug.  I can draft a further patch for that, but in the meantime can you 
look at a few other things?

First, why does the first patch add those four using_utf8 calls to 
parse_bracket_exp?  Isn't that optimization valid regardless of whether 
the multibyte encoding is UTF-8?

Second, the comment "UTF-8 allows treating a simple, non-inverted MBCSET 
like a CSET." no longer seems to match the code, since addtok no longer 
invokes using_utf8.

Third, could you please draft a proper commit message?  The format is 
something like this:

grep: minor tuning for mb_case_map_apply

* src/kwsearch.c (mb_case_map_apply): Avoid unnecessary widening of
size_t to intmax_t.  Avoid unnecessary reinitialization of k.


That is, a first line of the form "program: short description".  Then an 
empty line.  Then a ChangeLog entry in standard GNU format.

I'll take a look at the second patch later.




Information forwarded to bug-grep <at> gnu.org:
bug#16912; Package grep. (Sun, 02 Mar 2014 01:24:02 GMT) Full text and rfc822 format available.

Message #11 received at 16912 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 16912 <at> debbugs.gnu.org
Subject: Re: bug#16912: [PATCH] no longer use CSET for non-UTF8 locale in DFA
 engine
Date: Sun, 02 Mar 2014 10:23:28 +0900
Hi Paul

Thank you for checking the patch.

> First, why does the first patch add those four using_utf8 calls to
> parse_bracket_exp?  Isn't that optimization valid regardless of
> whether the multibyte encoding is UTF-8?

The optimization which MBCSET is changed into CSET in addtok is completed
on UTF8 locale only, because even if work_mbc->cset is defined in non-UTF8
locales, it's treated as not CSET but MBCSET.  So if not CSET to replacement
to OR, dfa will keep MBCSET until last and return backref.  I want to
avoid it.

However I don't understand why the optimization isn't completed on
non-UTF8 locale only.  Can you explain it?

Norihiro





Information forwarded to bug-grep <at> gnu.org:
bug#16912; Package grep. (Sun, 02 Mar 2014 09:49:01 GMT) Full text and rfc822 format available.

Message #14 received at 16912 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 16912 <at> debbugs.gnu.org
Subject: Re: bug#16912: [PATCH] no longer use CSET for non-UTF8 locale in DFA
 engine
Date: Sun, 02 Mar 2014 18:47:59 +0900
[Message part 1 (text/plain, inline)]
I have added several modifications to the patch.

First, I fixed the bug for titlecase.

Second, I changed it so that prefered replacement to OR to CSET in order
to reduce a number of states.

Third, I modified comments in source code and put drafts of commit
messages in the patch.

Norihiro
[patch.txt (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16912; Package grep. (Mon, 03 Mar 2014 06:14:01 GMT) Full text and rfc822 format available.

Message #17 received at 16912 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 16912 <at> debbugs.gnu.org
Subject: Re: bug#16912: [PATCH] no longer use CSET for non-UTF8 locale in
 DFA engine
Date: Sun, 02 Mar 2014 22:13:25 -0800
Norihiro Tanaka wrote:
> However I don't understand why the optimization isn't completed on
> non-UTF8 locale only.  Can you explain it?

Sorry, no; there's a lot about that code I don't yet understand.




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Mon, 03 Mar 2014 07:08:02 GMT) Full text and rfc822 format available.

Notification sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
bug acknowledged by developer. (Mon, 03 Mar 2014 07:08:03 GMT) Full text and rfc822 format available.

Message #22 received at 16912-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 16912-done <at> debbugs.gnu.org
Subject: Re: bug#16912: [PATCH] no longer use CSET for non-UTF8 locale in
 DFA engine
Date: Sun, 02 Mar 2014 23:07:41 -0800
[Message part 1 (text/plain, inline)]
Thanks, I tweaked the ChangeLog entries a bit and pushed that.  I also 
pushed the attached patch, which fixes some new bugs and some bugs that 
were reintroduced by the revival of trivial_case_ignore.  I wish we 
didn't need that function, as it is a bit of a kludge.


[0001-grep-fix-some-unlikely-bugs-in-trivial_case_ignore.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16912; Package grep. (Tue, 04 Mar 2014 15:51:02 GMT) Full text and rfc822 format available.

Message #25 received at 16912 <at> debbugs.gnu.org (full text, mbox):

From: Paolo Bonzini <bonzini <at> gnu.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>, Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 16912 <at> debbugs.gnu.org
Subject: Re: bug#16912: [PATCH] no longer use CSET for non-UTF8 locale in
 DFA engine
Date: Tue, 04 Mar 2014 16:49:59 +0100
Il 03/03/2014 07:13, Paul Eggert ha scritto:
> Norihiro Tanaka wrote:
>> However I don't understand why the optimization isn't completed on
>> non-UTF8 locale only.  Can you explain it?
>
> Sorry, no; there's a lot about that code I don't yet understand.

IIRC it's because a CSET matches any byte, while the corresponding 
MBCSET only matches that byte if it is a single-byte character.  So for 
example, say "\x83A" is a two-byte character.  The CSET "A" will match 
it but the corresponding MBCSET will not.

This can happen in the Shift-JIS encoding.

Paolo





Information forwarded to bug-grep <at> gnu.org:
bug#16912; Package grep. (Tue, 04 Mar 2014 23:13:02 GMT) Full text and rfc822 format available.

Message #28 received at 16912 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paolo Bonzini <bonzini <at> gnu.org>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 16912 <at> debbugs.gnu.org
Subject: Re: bug#16912: [PATCH] no longer use CSET for non-UTF8 locale in DFA
 engine
Date: Wed, 05 Mar 2014 08:12:42 +0900
Paul Eggert wrote:
> IIRC it's because a CSET matches any byte, while the corresponding
> MBCSET only matches that byte if it is a single-byte character.
> So for example, say "\x82\x61" is a two-byte character.  The CSET "A"
> will match it but the corresponding MBCSET will not.
> 
> This can happen in the Shift-JIS encoding.

First, I also thoutht such a case.  But perhaps it's no problem, because
DFA will never come across CSET on second byte in Shift_JIS.

  "grep -i A" -> [Aa] -> CSET
  "grep -i $"\x82A" -> [$"\x82\x82A"$"\x82\x82"] -> \x82 A CAT \x82 \x82 CAT OR

Laster will be never \x82 [A\x82] -> \x82 CSET CAT.





Information forwarded to bug-grep <at> gnu.org:
bug#16912; Package grep. (Wed, 05 Mar 2014 08:01:02 GMT) Full text and rfc822 format available.

Message #31 received at 16912 <at> debbugs.gnu.org (full text, mbox):

From: Paolo Bonzini <bonzini <at> gnu.org>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 16912 <at> debbugs.gnu.org
Subject: Re: bug#16912: [PATCH] no longer use CSET for non-UTF8 locale in
 DFA engine
Date: Wed, 05 Mar 2014 08:59:52 +0100
Il 05/03/2014 00:12, Norihiro Tanaka ha scritto:
> First, I also thoutht such a case.  But perhaps it's no problem, because
> DFA will never come across CSET on second byte in Shift_JIS.
>
>   "grep -i A" -> [Aa] -> CSET
>   "grep -i $"\x82A" -> [$"\x82\x82A"$"\x82\x82"] -> \x82 A CAT \x82 \x82 CAT OR
>
> Laster will be never \x82 [A\x82] -> \x82 CSET CAT.

What about these two commands:

   grep [a]
   grep -i A

Would they match \x82\x61 ("B", U+0FF22) with your patch?  And without it?

Paolo




Information forwarded to bug-grep <at> gnu.org:
bug#16912; Package grep. (Wed, 05 Mar 2014 13:42:02 GMT) Full text and rfc822 format available.

Message #34 received at 16912 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paolo Bonzini <bonzini <at> gnu.org>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 16912 <at> debbugs.gnu.org
Subject: Re: bug#16912: [PATCH] no longer use CSET for non-UTF8 locale in DFA
 engine
Date: Wed, 05 Mar 2014 22:41:31 +0900
Paolo Bonzini wrote:

> What about these two commands:
> 
>     grep [a]
>     grep -i A
> 
> Would they match \x82\x61 ("B", U+0FF22) with your patch?  And without it?

No match for all.

--
Before the patch:

$ locale -a | grep sjis
ja_JP.sjis
$ printf "\x82\x61\n" | env LC_ALL=ja_JP.sjis src/grep -i 'A'
dfaanalyze:
 0:A 1:a 2:OR 3:END 4:CAT
$ printf "\x82\x61\n" | env LC_ALL=ja_JP.sjis src/grep '[a]'
dfaanalyze:
 0:MBCSET 1:END 2:CAT

After the patch:

$ locale -a | grep sjis
ja_JP.sjis
$ printf "\x82\x61\n" | env LC_ALL=ja_JP.sjis src/grep -i 'A'
dfaanalyze:
 0:CSET 1:END 2:CAT
$ printf "\x82\x61\n" | env LC_ALL=ja_JP.sjis src/grep '[a]'
dfaanalyze:
 0:CSET 1:END 2:CAT
--

Norihiro





Information forwarded to bug-grep <at> gnu.org:
bug#16912; Package grep. (Wed, 05 Mar 2014 14:33:02 GMT) Full text and rfc822 format available.

Message #37 received at 16912 <at> debbugs.gnu.org (full text, mbox):

From: Paolo Bonzini <bonzini <at> gnu.org>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 16912 <at> debbugs.gnu.org
Subject: Re: bug#16912: [PATCH] no longer use CSET for non-UTF8 locale in
 DFA engine
Date: Wed, 05 Mar 2014 15:31:54 +0100
Il 05/03/2014 14:41, Norihiro Tanaka ha scritto:
> Paolo Bonzini wrote:
>
>> What about these two commands:
>>
>>     grep [a]
>>     grep -i A
>>
>> Would they match \x82\x61 ("B", U+0FF22) with your patch?  And without it?
>
> No match for all.

Right, it's handled by SKIP_REMAINS_MB_IF_INITIAL_STATE.
dfa.c never stops surprising me.  Great catch!

Paolo

> --
> Before the patch:
>
> $ locale -a | grep sjis
> ja_JP.sjis
> $ printf "\x82\x61\n" | env LC_ALL=ja_JP.sjis src/grep -i 'A'
> dfaanalyze:
>  0:A 1:a 2:OR 3:END 4:CAT
> $ printf "\x82\x61\n" | env LC_ALL=ja_JP.sjis src/grep '[a]'
> dfaanalyze:
>  0:MBCSET 1:END 2:CAT
>
> After the patch:
>
> $ locale -a | grep sjis
> ja_JP.sjis
> $ printf "\x82\x61\n" | env LC_ALL=ja_JP.sjis src/grep -i 'A'
> dfaanalyze:
>  0:CSET 1:END 2:CAT
> $ printf "\x82\x61\n" | env LC_ALL=ja_JP.sjis src/grep '[a]'
> dfaanalyze:
>  0:CSET 1:END 2:CAT
> --
>
> Norihiro
>
>





Information forwarded to bug-grep <at> gnu.org:
bug#16912; Package grep. (Wed, 05 Mar 2014 23:48:01 GMT) Full text and rfc822 format available.

Message #40 received at 16912 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paolo Bonzini <bonzini <at> gnu.org>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 16912 <at> debbugs.gnu.org
Subject: Re: bug#16912: [PATCH] no longer use CSET for non-UTF8 locale in DFA
 engine
Date: Thu, 06 Mar 2014 08:47:41 +0900
Paolo Bonzini wrote:
> Right, it's handled by SKIP_REMAINS_MB_IF_INITIAL_STATE.

Yes.  It's handled by SKIP_REMAINS_MB_IF_INITIAL_STATE, so no problem.

Norihiro





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 03 Apr 2014 11:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 11 years and 78 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.