GNU bug report logs - #18777
[PATCH] dfa: improvement for checking of multibyte character boundary


Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Mon, 20 Oct 2014 15:05:01 UTC

Severity: normal

Tags: patch

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 18777 in the body.
You can then email your comments to 18777 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Mon, 20 Oct 2014 15:05:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 20 Oct 2014 15:05:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: bug-grep <at> gnu.org
Subject: [PATCH] dfa: improvement for checking of multibyte character boundary
Date: Tue, 21 Oct 2014 00:04:02 +0900
[Message part 1 (text/plain, inline)]
This patch improves performance for input that does not match even the
first part of a pattern.  It has little effect on grep, which uses a
superset of DFA, but gawk speeds up by about 40%.

$ time -p env LC_ALL=ja_JP.eucJP ./gawk '/k/ { print }' ../k

(before)
  real 2.85  user 2.79  sys 0.05

(after)
  real 1.70  user 1.64  sys 0.06

I think that this improvement should have been performed in bug#17576.
[0001-dfa-improvement-for-checking-of-multibyte-character-.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Mon, 20 Oct 2014 15:22:01 GMT) Full text and rfc822 format available.

Message #8 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
 character boundary
Date: Tue, 21 Oct 2014 00:21:02 +0900
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> $ time -p env LC_ALL=ja_JP.eucJP ./gawk '/k/ { print }' ../k

The file `k' was generated as follows.

  $ yes `printf '%040d' 0` | head -10000000 >../k





Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Mon, 20 Oct 2014 16:08:01 GMT) Full text and rfc822 format available.

Message #11 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
 character boundary
Date: Mon, 20 Oct 2014 10:07:20 -0600
[Message part 1 (text/plain, inline)]
On 10/20/2014 09:04 AM, Norihiro Tanaka wrote:
> This patch improves performance for input that does not match even the
> first part of a pattern.  It has little effect on grep, which uses a
> superset of DFA, but gawk speeds up by about 40%.
> 

> 
> When a newline is found, we can skip the check for a multibyte character
> boundary before that character, since we treat newline as a single-byte
> character.

POSIX requires that NUL, slash, dot, newline, and carriage return all be
single bytes that cannot occur inside a multibyte character (because
they have special meaning to file name resolution and/or terminal
interaction); it added this requirement fairly recently, but only after
confirming that common existing locales satisfy this constraint.  (The
same is not true for most any other character; even though POSIX
requires that a-z, A-Z, and 0-9 be single bytes, it does not forbid
those characters from also being bytes embedded within multibyte
characters).  Is it worth extending your optimization to all five of the
POSIX-guaranteed single byte characters?

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Mon, 20 Oct 2014 23:10:03 GMT) Full text and rfc822 format available.

Message #14 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Eric Blake <eblake <at> redhat.com>
Cc: 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
 character boundary
Date: Tue, 21 Oct 2014 08:09:24 +0900
Eric Blake <eblake <at> redhat.com> wrote:
> Is it worth extending your optimization to all five of the
> POSIX-guaranteed single byte characters?

Thanks, but I don't want to do that immediately.  DFA already treats
newline as a single-byte character, but not the others yet.  So we may
need to make many changes to handle invalid locales and sequences that
do not conform to the rule.  If we omitted that, we might end up
restricting the locales to which DFA can be applied.  Therefore, it
should be done carefully.





Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Tue, 21 Oct 2014 06:24:02 GMT) Full text and rfc822 format available.

Message #17 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: arnold <at> skeeve.com
To: noritnk <at> kcn.ne.jp, eblake <at> redhat.com
Cc: 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
 character boundary
Date: Tue, 21 Oct 2014 00:23:07 -0600
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:

> Eric Blake <eblake <at> redhat.com> wrote:
> > Is it worth extending your optimization to all five of the
> > POSIX-guaranteed single byte characters?
>
> Thanks, but I don't want to do that immediately.  DFA already treats
> newline as a single-byte character, but not the others yet.  So we may
> need to make many changes to handle invalid locales and sequences that
> do not conform to the rule.  If we omitted that, we might end up
> restricting the locales to which DFA can be applied.  Therefore, it
> should be done carefully.

I would think adding a check for '\r' would be safe and would help
too; given that on Windows systems '\r' generally occurs just as
frequently as '\n', it should give a nice speedup for gawk on those
systems.

The other characters that Eric cited seem less like a big issue to me.

Thanks,

Arnold




Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Tue, 21 Oct 2014 13:26:02 GMT) Full text and rfc822 format available.

Message #20 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: arnold <at> skeeve.com
Cc: eblake <at> redhat.com, 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
 character boundary
Date: Tue, 21 Oct 2014 22:25:21 +0900
arnold <at> skeeve.com wrote:
> I would think adding a check for '\r' would be safe and would help
> too; given that on Windows systems '\r' generally occurs just as
> frequently as '\n', it should give a nice speedup for gawk on those
> systems.

As I understand it, DFA and regex do not support multiple eolbytes such
as CR-LF, so I can't see where we could use that change.  Grep converts
Windows text to Unix text by removing CR in advance.

BTW, although I say `newline', strictly speaking it is `eolbyte', which
may be either LF or NUL.





Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Tue, 21 Oct 2014 14:54:02 GMT) Full text and rfc822 format available.

Message #23 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: arnold <at> skeeve.com
To: noritnk <at> kcn.ne.jp, arnold <at> skeeve.com
Cc: eblake <at> redhat.com, 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
 character boundary
Date: Tue, 21 Oct 2014 08:43:47 -0600
Hi.

Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:

> arnold <at> skeeve.com wrote:
> > I would think adding a check for '\r' would be safe and would help
> > too; given that on Windows systems '\r' generally occurs just as
> > frequently as '\n', it should give a nice speedup for gawk on those
> > systems.
>
> As I understand it, DFA and regex do not support multiple eolbytes such
> as CR-LF, so I can't see where we could use that change.  Grep converts
> Windows text to Unix text by removing CR in advance.

Gawk does not remove CR in advance, unless someone specifically
set RS = "\r\n", in which case the full regex matcher is used
to first find \r\n in the raw input buffer.

So for gawk, adding a check for (c == eolbyte || c == '\r')
should produce more speedup on Windows.

(Hmm, on Windows the default is probably text mode which causes
the library/OS to hide the \r anyway. Harumph.  But if binary mode
was requested then it could still make a difference.)

> BTW, although I say `newline', strictly speaking it is `eolbyte', which
> may be either LF or NUL.

Understood and agreed.

Adding a check for \r isn't a big deal in any case, but of the 5
characters Eric mentioned originally, that is the only one where I
see a potential for a check to really make a difference.

Thanks!

Arnold




Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Wed, 22 Oct 2014 15:29:02 GMT) Full text and rfc822 format available.

Message #26 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: arnold <at> skeeve.com
Cc: eblake <at> redhat.com, 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
 character boundary
Date: Thu, 23 Oct 2014 00:28:35 +0900
arnold <at> skeeve.com wrote:
> Gawk does not remove CR in advance, unless someone specifically
> set RS = "\r\n", in which case the full regex matcher is used
> to first find \r\n in the raw input buffer.

Thanks, I also confirmed it on source code of Gawk.

> So for gawk, adding a check for (c == eolbyte || c == '\r')
> should produce more speedup on Windows.
> 
> (Hmm, on Windows the default is probably text mode which causes
> the library/OS to hide the \r anyway. Harumph.  But if binary mode
> was requested then it could still make a difference.)

I think it's better to build a KWset than to rely on checking for '\r'
in the non-UTF-8 multibyte mode of DFA.

Furthermore, even if we add a check for '\r' to DFA, I don't think we can
use it to speed things up on Windows, because DFA can't correctly locate
the matched position unless the pattern is a fixed string.





Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Mon, 15 Dec 2014 15:00:04 GMT) Full text and rfc822 format available.

Message #29 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Eric Blake <eblake <at> redhat.com>
Cc: 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
 character boundary
Date: Mon, 15 Dec 2014 23:59:32 +0900
[Message part 1 (text/plain, inline)]
On Mon, 20 Oct 2014 10:07:20 -0600
Eric Blake <eblake <at> redhat.com> wrote:

> POSIX requires that NUL, slash, dot, newline, and carriage return all be
> single bytes that cannot occur inside a multibyte character (because
> they have special meaning to file name resolution and/or terminal
> interaction); it added this requirement fairly recently, but only after
> confirming that common existing locales satisfy this constraint.  (The
> same is not true for most any other character; even though POSIX
> requires that a-z, A-Z, and 0-9 be single bytes, it does not forbid
> those characters from also being bytes embedded within multibyte
> characters).  Is it worth extending your optimization to all five of the
> POSIX-guaranteed single byte characters?

I rewrote the patch so that NUL, slash, dot and carriage return, as well
as newline, are also treated as special characters.
[0001-dfa-improvement-for-checking-of-multibyte-character-.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Mon, 15 Dec 2014 17:45:02 GMT) Full text and rfc822 format available.

Message #32 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, Eric Blake <eblake <at> redhat.com>
Cc: 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
 character boundary
Date: Mon, 15 Dec 2014 09:43:54 -0800
On 12/15/2014 06:59 AM, Norihiro Tanaka wrote:
> +/* True if each byte can not occur inside a multibyte character  */
> +static bool always_single_byte[NOTCHAR];
> +
> +static void
> +dfaalwayssb (void)
> +{
> +  size_t i;
> +  unsigned char const uc[] = { '\0', '\n', '\r', '.', '/' };
> +  for (i = 0; i < sizeof uc / sizeof uc[0]; ++i)
> +    always_single_byte[uc[i]] = true;
> +}

Can't we improve this when using_utf8 () is true?  In that case, every 
ASCII character is always single byte.  Also, the bytes 0xc0, 0xc1, and 
0xf5 through 0xff can be added to the table: they are not single-byte 
characters but they are always encoding errors so they will be a 
character boundary as far as skip_remains_mb is concerned.  This 
suggests that the table 'always_single_byte' should be renamed to 
something like 'always_character_boundary'.

>     wint_t wc = WEOF;
> +  if (always_single_byte[*p])
> +    return p;

This won't assign anything to *WCP, contrary to the documented API for
skip_remains_mb.  This is OK (as callers don't care) but the API
documentation should be changed to reflect the actual behavior.
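
The table construction suggested in this review could look roughly like
the following.  This is a minimal sketch under stated assumptions, not the
code from the attached patch: the helper name init_boundary_table and its
UTF8 parameter are hypothetical, and the caller is assumed to know whether
the locale's encoding is UTF-8 (dfa.c uses using_utf8 () for that).

```c
#include <stdbool.h>
#include <stddef.h>

enum { NOTCHAR = 256 };   /* number of distinct byte values */

/* always_character_boundary[B] is true if byte B can never be a
   non-initial byte of a multibyte character.  */
static bool always_character_boundary[NOTCHAR];

/* Hypothetical helper: fill the table.  UTF8 tells whether the
   locale's encoding is UTF-8.  */
static void
init_boundary_table (bool utf8)
{
  /* The five bytes that POSIX guarantees are single-byte characters
     and never part of a multibyte character.  */
  static unsigned char const posix_single[] = { '\0', '\n', '\r', '.', '/' };
  size_t i;

  for (i = 0; i < NOTCHAR; i++)
    always_character_boundary[i] = false;

  for (i = 0; i < sizeof posix_single / sizeof posix_single[0]; i++)
    always_character_boundary[posix_single[i]] = true;

  if (utf8)
    {
      /* Every ASCII byte is a complete character in UTF-8.  */
      for (i = 0; i < 0x80; i++)
        always_character_boundary[i] = true;

      /* 0xc0, 0xc1 and 0xf5..0xff are always encoding errors, so
         they also count as boundaries.  */
      always_character_boundary[0xc0] = true;
      always_character_boundary[0xc1] = true;
      for (i = 0xf5; i <= 0xff; i++)
        always_character_boundary[i] = true;
    }
}
```

With this table, a scanner that lands on a byte B with
always_character_boundary[B] set can treat that position as a character
boundary without rescanning from the last known boundary.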




Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Tue, 16 Dec 2014 12:43:02 GMT) Full text and rfc822 format available.

Message #35 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Eric Blake <eblake <at> redhat.com>, 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
 character boundary
Date: Tue, 16 Dec 2014 21:42:32 +0900
[Message part 1 (text/plain, inline)]
On Mon, 15 Dec 2014 09:43:54 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Can't we improve this when using_utf8 () is true?  In that case, every
> ASCII character is always single byte.  Also, the bytes 0xc0, 0xc1,
> and 0xf5 through 0xff can be added to the table: they are not
> single-byte characters but they are always encoding errors so they will
> be a character boundary as far as skip_remains_mb is concerned.  This
> suggests that the table 'always_single_byte' should be renamed to
> something like 'always_character_boundary'.
> 
> >     wint_t wc = WEOF;
> > +  if (always_single_byte[*p])
> > +    return p;

Thanks for the review and suggestion.  If using_utf8 () is true, we can
set always_character_boundary to true except 0x80-0xbf.

> This won't assign anything to *WCP, contrary to the documented API for
> skip_remains_mb.  This is OK (as callers don't care) but the API
> documentation should be changed to reflect the actual behavior.

Oh!  If WCP is needed, we must go through the bytes step by step, because
the wide character before P has to be stored in *WCP.  I fixed it and
updated the API documentation.
[0001-dfa-improvement-for-checking-of-multibyte-character-.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Tue, 16 Dec 2014 17:13:02 GMT) Full text and rfc822 format available.

Message #38 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: Eric Blake <eblake <at> redhat.com>, 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
 character boundary
Date: Tue, 16 Dec 2014 09:12:21 -0800
On 12/16/2014 04:42 AM, Norihiro Tanaka wrote:
> Thanks for the review and suggestion.  If using_utf8 () is true, we can
> set always_character_boundary to true except 0x80-0xbf.

Even better, thanks.


>> > This won't assign anything to *WCP, contrary to the documented API for
>> > skip_remains_mb.  This is OK (as callers don't care) but the API
>> > documentation should be changed to reflect the actual behavior.
> Oh!  If WCP is needed, we must go through the bytes step by step, because
> the wide character before P has to be stored in *WCP.  I fixed it and
> updated the API documentation.

This part of the patch does too much work, as the caller inspects *WCP 
only when skip_remains_mb returns a value not equal to p.  So there's no 
need for the "wcp == NULL &&" test in the patch. Instead, the documented 
API can change, saying that *WCP is assigned to only if WCP is non-NULL 
and the result is greater than p.




Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Tue, 16 Dec 2014 23:23:02 GMT) Full text and rfc822 format available.

Message #41 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Eric Blake <eblake <at> redhat.com>, 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
 character boundary
Date: Wed, 17 Dec 2014 08:22:20 +0900
On Tue, 16 Dec 2014 09:12:21 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> 
> This part of the patch does too much work, as the caller inspects *WCP
> only when skip_remains_mb returns a value not equal to p.  So there's
> no need for the "wcp == NULL &&" test in the patch. Instead, the
> documented API can change, saying that *WCP is assigned to only if WCP
> is non-NULL and the result is greater than p.

Thanks, you are right.  However, first, the code is no longer portable
once that test is removed.  Second, if it is compiled with GCC 4.3 or
later, the function is inlined and "WCP == NULL &&" will be pruned.





Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Wed, 17 Dec 2014 00:08:02 GMT) Full text and rfc822 format available.

Message #44 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: Eric Blake <eblake <at> redhat.com>, 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
 character boundary
Date: Tue, 16 Dec 2014 16:06:54 -0800
Norihiro Tanaka wrote:
> However, first, the code is no longer portable
> once that test is removed.

"portable"?  This issue is independent of platform, surely.  By "portable" did
you mean "robust in the presence of future changes"?

> Second, if it is compiled with GCC 4.3 or later, the function
> is inlined and "WCP == NULL &&" will be pruned.

True, but I wasn't worried so much about that.  I was worried about the case 
where WCP != NULL: there, the inlined function will be slower because it won't 
use the faster approach of checking always_character_boundary[*p]: it'll always 
use the much-slower loop.




Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Wed, 17 Dec 2014 17:22:01 GMT) Full text and rfc822 format available.

Message #47 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Eric Blake <eblake <at> redhat.com>, 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
 character boundary
Date: Thu, 18 Dec 2014 02:21:30 +0900
On Tue, 16 Dec 2014 16:06:54 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> did you mean "robust in the presence of future changes"?

Yes.  However, I may have made too much of the "portable" concern.

> True, but I wasn't worried so much about that. I was worried about the
> case where WCP != NULL: there, the inlined function will be slower
> because it won't use the faster approach of checking
> always_character_boundary[*p]: it'll always use the much-slower loop. 

If WCP != NULL, all of the following code will be pruned, although I
think the effect on performance is negligible.

  if (wcp == NULL && always_character_boundary[*p])
    return p;





Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Wed, 17 Dec 2014 17:47:02 GMT) Full text and rfc822 format available.

Message #50 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: Eric Blake <eblake <at> redhat.com>, 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
 character boundary
Date: Wed, 17 Dec 2014 09:46:09 -0800
On 12/17/2014 09:21 AM, Norihiro Tanaka wrote:
> If WCP != NULL, all of the following code will be pruned, although I
> think the effect on performance is negligible.
>
>    if (wcp == NULL && always_character_boundary[*p])
>      return p;

Yes, and that's the point: we don't want this if-statement to be pruned 
if WCP != NULL.  We want the code to return P right away in the typical 
case where P is at a character boundary.  If MBP is way less than P, 
this will save the work of the following loop.




Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Wed, 17 Dec 2014 23:51:01 GMT) Full text and rfc822 format available.

Message #53 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Eric Blake <eblake <at> redhat.com>, 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
 character boundary
Date: Thu, 18 Dec 2014 08:50:19 +0900
On Wed, 17 Dec 2014 09:46:09 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Yes, and that's the point: we don't want this if-statement to be pruned
> if WCP != NULL.  We want the code to return P right away in the typical
> case where P is at a character boundary.  If MBP is way less than P,
> this will save the work of the following loop.

In the case where we return P, we must set *WCP to the wide character
for the previous character, not the next one.

For example, consider the following sequence in a Shift_JIS locale, where
the pair 0x95 0x5c is a single multibyte character.  Suppose that
skip_remains_mb() is called with MBP = position (a) and P = position (d).

      0x41 0x95 0x5c 0x0a
    (a)  (b)  (c)  (d)

If WCP == NULL, we can return P right away.  On the other hand, if
WCP != NULL, we must set *WCP to the wide character for 0x95 0x5c before
returning P.

Do you have any ideas for making use of always_character_boundary in that
case?





Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Thu, 18 Dec 2014 09:41:03 GMT) Full text and rfc822 format available.

Message #56 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: Eric Blake <eblake <at> redhat.com>, 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
 character boundary
Date: Thu, 18 Dec 2014 01:40:18 -0800
Norihiro Tanaka wrote:

> if WCP != NULL, we must set *WCP to the wide character for 0x95 0x5c before returning P.

Why?  The (only) caller with WCP != NULL doesn't use *WCP when skip_remains_mb 
(D, P, ..., WCP) returns P.  So it's OK to not set *WCP in that case.
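
The contract described here can be sketched with the Shift_JIS byte
sequence from the earlier message.  This is an illustration under stated
assumptions, not the code in dfa.c: sjis_decode is a toy stand-in for the
locale's decoder (it only recognizes lead bytes 0x81..0x9f and
0xe0..0xfc), never_inside replaces the lookup table with the five
POSIX-guaranteed bytes, skip_to_boundary is a hypothetical name, and an
unsigned code value stands in for the wide character stored through WCP.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy decoder (an assumption, not the real locale code): decode the
   character starting at S, where N > 0 bytes are available.  Store a
   code value in *CODE and return the character's length in bytes.  */
static size_t
sjis_decode (unsigned char const *s, size_t n, unsigned *code)
{
  bool lead = ((0x81 <= s[0] && s[0] <= 0x9f)
               || (0xe0 <= s[0] && s[0] <= 0xfc));
  if (lead && n >= 2)
    {
      *code = (unsigned) s[0] << 8 | s[1];
      return 2;
    }
  *code = s[0];
  return 1;
}

/* True for the bytes that never occur inside a multibyte character.  */
static bool
never_inside (unsigned char b)
{
  return b == '\0' || b == '\n' || b == '\r' || b == '.' || b == '/';
}

/* Return the first character boundary at or after P, given that
   MBP <= P is a known boundary and P < END.  *WCP is assigned only
   if WCP is non-null and the result is greater than P.  */
static unsigned char const *
skip_to_boundary (unsigned char const *p, unsigned char const *mbp,
                  unsigned char const *end, unsigned *wcp)
{
  unsigned code = 0;

  /* Fast path: P is certainly a boundary, so skip the scan and
     leave *WCP untouched.  */
  if (never_inside (*p))
    return p;

  /* Slow path: decode character by character from the last known
     boundary until reaching or passing P.  */
  while (mbp < p)
    mbp += sjis_decode (mbp, (size_t) (end - mbp), &code);

  if (wcp != NULL && p < mbp)
    *wcp = code;
  return mbp;
}
```

For the bytes 0x41 0x95 0x5c 0x0a with MBP at position (a), calling the
function with P at position (d) takes the fast path and leaves the output
untouched, while P at position (c) falls inside the two-byte character, so
the scan returns position (d) and stores the code for 0x95 0x5c.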




Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Thu, 18 Dec 2014 15:56:02 GMT) Full text and rfc822 format available.

Message #59 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Eric Blake <eblake <at> redhat.com>, 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
 character boundary
Date: Fri, 19 Dec 2014 00:54:58 +0900
[Message part 1 (text/plain, inline)]
On Thu, 18 Dec 2014 01:40:18 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Why?  The (only) caller with WCP != NULL doesn't use *WCP when
> skip_remains_mb (D, P, ..., WCP) returns P.  So it's OK to not set *WCP
> in that case.

Thanks, I understand what you said.  You are right.  I changed the patch
so that the always_character_boundary check is not pruned even if
WCP != NULL, and fixed the API documentation.
[0001-dfa-improvement-for-checking-of-multibyte-character-.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Sat, 17 Jan 2015 04:28:02 GMT) Full text and rfc822 format available.

Message #62 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <ADS18022 <at> nifty.com>
To: 18777 <at> debbugs.gnu.org
Cc: Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
 character boundary
Date: Sat, 17 Jan 2015 10:54:21 +0900
[Message part 1 (text/plain, inline)]
On Fri, 19 Dec 2014 00:54:58 +0900
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:

> On Thu, 18 Dec 2014 01:40:18 -0800
> Thanks, I understand what you said.  You are right.  I changed the patch
> so that the always_character_boundary check is not pruned even if
> WCP != NULL, and fixed the API documentation.

I fixed a mismatch with the comment.  It does not change the behavior.
We expect that skip_remains_mb() is inlined, and that "*WCP = WC" is
merged into "IF (P < MBP)" in the caller.

--
+   exceeds P.  If WCP is non-NULL and the result is greater than p, set
+   *WCP to the final wide character processed, or if no wide character
+   is processed, set it to WEOF.  Both P and MBP must be no larger than
+   END.
    ........
-  if (wcp != NULL)
+  if (wcp != NULL && p < mbp)

[0001-dfa-improvement-for-checking-of-multibyte-character-.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Thu, 21 Apr 2016 06:22:02 GMT) Full text and rfc822 format available.

Message #65 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <ADS18022 <at> nifty.com>
Cc: 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
Date: Wed, 20 Apr 2016 23:21:28 -0700
[Message part 1 (text/plain, inline)]
I'm attaching a revised patch, relative to the latest grep, to implement the 
idea of the Bug#18777 patch. This revision calls the new array "never_trail" 
instead of "always_character_boundary" to nail down the concept a bit more 
precisely. It also removes what appears to be an unnecessary p < mbp test, and 
adjusts to more-recent changes in the code.

I'm not installing this into the master branch on savannah, as we'd like to 
release a new 'grep' soon and this patch should probably wait until after the 
release.
[0001-dfa-speed-up-checking-for-character-boundary.patch (text/x-diff, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18777; Package grep. (Thu, 21 Apr 2016 06:49:02 GMT) Full text and rfc822 format available.

Message #68 received at 18777 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Norihiro Tanaka <ADS18022 <at> nifty.com>, 18777 <at> debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte
Date: Wed, 20 Apr 2016 23:47:37 -0700
On Wed, Apr 20, 2016 at 11:21 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> I'm attaching a revised patch, relative to the latest grep, to implement the
> idea of the Bug#18777 patch. This revision calls the new array "never_trail"
> instead of "always_character_boundary" to nail down the concept a bit more
> precisely. It also removes what appears to be an unnecessary p < mbp test,
> and adjusts to more-recent changes in the code.
>
> I'm not installing this into the master branch on savannah, as we'd like to
> release a new 'grep' soon and this patch should probably wait until after
> the release.

Thanks for deferring that.
I hope to have time to release grep-2.25 tomorrow evening.




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Mon, 02 May 2016 06:00:03 GMT) Full text and rfc822 format available.

Notification sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
bug acknowledged by developer. (Mon, 02 May 2016 06:00:03 GMT) Full text and rfc822 format available.

Message #73 received at 18777-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 18777-done <at> debbugs.gnu.org
Subject: Re: [PATCH] dfa: improvement for checking of multibyte character
 boundary
Date: Sun, 1 May 2016 22:59:14 -0700
I have installed this and am closing the bug report.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 30 May 2016 11:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 9 years and 73 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.