GNU bug report logs - #16544
Optimazation for is_mb_middle

Previous Next

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Sat, 25 Jan 2014 02:40:02 UTC

Severity: normal

Tags: patch

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16544 in the body.
You can then email your comments to 16544 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#16544; Package grep. (Sat, 25 Jan 2014 02:40:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sat, 25 Jan 2014 02:40:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: submit <at> debbugs.gnu.org
Subject: Optimazation for is_mb_middle
Date: Sat, 25 Jan 2014 11:39:11 +0900
[Message part 1 (text/plain, inline)]
Package: grep
Tags: patch

When matched characters to a regular expression is found by kwsexec or
dfaexec, we need check whether it is in the middle of a multi-byte character.

`is_mb_middle' of searchutils.c is used for it. However, it's expensive,
even if most of them contain constitute with single-byte characters.
For example, a source code written in a language with multibyte characters,
has a lot of single-byte characters.

Now, I post the patch which optimizes `is_mb_middle'. It checks whether
each single-byte is completion as an character before execution, and
caches them. In addition, for UTF-8 further optimization is performed.

Only when it's impossible to determine the length of a multibyte character
with caches, the length is determined with `mbrlen' in execution.
[is_mb_middle.txt (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16544; Package grep. (Sun, 26 Jan 2014 23:53:02 GMT) Full text and rfc822 format available.

Message #8 received at 16544 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 16544 <at> debbugs.gnu.org
Subject: Re: bug#16544: Optimazation for is_mb_middle
Date: Sun, 26 Jan 2014 15:52:22 -0800
[Message part 1 (text/plain, inline)]
On Fri, Jan 24, 2014 at 6:39 PM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> Package: grep
> Tags: patch
>
> When matched characters to a regular expression is found by kwsexec or
> dfaexec, we need check whether it is in the middle of a multi-byte character.
>
> `is_mb_middle' of searchutils.c is used for it. However, it's expensive,
> even if most of them contain constitute with single-byte characters.
> For example, a source code written in a language with multibyte characters,
> has a lot of single-byte characters.
>
> Now, I post the patch which optimizes `is_mb_middle'. It checks whether
> each single-byte is completion as an character before execution, and
> caches them. In addition, for UTF-8 further optimization is performed.
>
> Only when it's impossible to determine the length of a multibyte character
> with caches, the length is determined with `mbrlen' in execution.

Thank you for the patch.
If you performed profiling that led you to optimize that function,
would you please share that data? In any case, can you give
examples showing how much of a difference this change makes?

Also, when I applied that patch and attempted to compile, gcc printed
an error and two new warnings (see attached).  Would you please
address those and send a new version?
[k.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16544; Package grep. (Tue, 28 Jan 2014 15:37:01 GMT) Full text and rfc822 format available.

Message #11 received at 16544 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Jim Meyering <jim <at> meyering.net>
Cc: 16544 <at> debbugs.gnu.org
Subject: Re: bug#16544: Optimazation for is_mb_middle
Date: Wed, 29 Jan 2014 00:35:54 +0900
[Message part 1 (text/plain, inline)]
I'm sorry that I don't test the patch sufficiently.

I fixed several bugs in the patch.  In addition to the patch, I attach
the results of the compile and the performance test.
[is_mb_middle.txt (application/octet-stream, attachment)]
[make.txt (application/octet-stream, attachment)]
[test.txt (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16544; Package grep. (Tue, 28 Jan 2014 18:38:03 GMT) Full text and rfc822 format available.

Message #14 received at 16544 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, Jim Meyering <jim <at> meyering.net>
Cc: 16544 <at> debbugs.gnu.org
Subject: Re: bug#16544: Optimazation for is_mb_middle
Date: Tue, 28 Jan 2014 10:37:42 -0800
Thanks for the further patch.  Some comments:

> +      if (!mbsinit (&mbs))
> +        memset (&mbs, 0, sizeof mbs);

Why bother with mbsinit?  Just do the memset, or better yet declare and 
initialize mbs here.

> +      mbclen_guess[i] = mbrlen ((const char *) &i, 1, &mbs);

This assumes a little-endian machine, which is not portable.  Please use 
something like this instead:

  for (i = CHAR_MIN; i <= CHAR_MAX; i++)
    {
       char c = i;
       unsigned char uc = i;
mbstate_t mbs = { 0 };
       mbclen_guess[uc] = mbrlen (&c, 1, &mbs);
    }

Here I'm using a style that avoids casts, as casts in general can be 
dangerous.

A minor question about naming: in what sense is mbclen_guess a guess?  
It doesn't seem to be guessing anything. Perhaps rename it to mbclen_cache?




Information forwarded to bug-grep <at> gnu.org:
bug#16544; Package grep. (Wed, 29 Jan 2014 13:08:02 GMT) Full text and rfc822 format available.

Message #17 received at 16544 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Jim Meyering <jim <at> meyering.net>, 16544 <at> debbugs.gnu.org
Subject: Re: bug#16544: Optimazation for is_mb_middle
Date: Wed, 29 Jan 2014 22:07:22 +0900
[Message part 1 (text/plain, inline)]
Hi Paul,

Thank you for reviewing tha patch.

> Please use something like this instead

All right.

> A minor question about naming: in what sense is mbclen_guess a guess?  

Because mbclen_guess always returns -2 for characters of two or more bytes,
I consider that what isn't mbclen_cache should be used instead of it. But
I don't particularly attach one's mind to the name. Now, I rename it to
mbclen_cache.
[is_mb_middle.txt (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16544; Package grep. (Sun, 02 Feb 2014 01:19:02 GMT) Full text and rfc822 format available.

Message #20 received at 16544 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 16544 <at> debbugs.gnu.org
Subject: Re: bug#16544: Optimazation for is_mb_middle
Date: Sat, 1 Feb 2014 17:18:07 -0800
[Message part 1 (text/plain, inline)]
Thank you for the updated patch.
I am attaching it (with a detailed commit/ChangeLog I have added,
and a NEWS entry), along with four additional commits.
One fixes an actual bug.  I expect to merge those commits into yours,
discarding their individual commit logs.  I leave them separate here
solely to make the changes easier to review.

By the way, I measured the speed-up before and after your change
on an idle AMD FX-4100 (best of 10), running these commands:

  yes $(printf '%078dm' 0)|head -1000000 > in
  for i in $(seq 10); do env LC_ALL=ja_JP.eucJP time src/grep -v m in; done

Before: 2.46
After: 0.36

Let me know if you see anything that can be improved,

Jim
[k.txt (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#16544; Package grep. (Sun, 02 Feb 2014 13:16:01 GMT) Full text and rfc822 format available.

Message #23 received at 16544 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Jim Meyering <jim <at> meyering.net>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 16544 <at> debbugs.gnu.org
Subject: Re: bug#16544: Optimazation for is_mb_middle
Date: Sun, 02 Feb 2014 22:15:07 +0900
Hi Jim,

Thank you for the review, test and fix for the patch.
I have nothing that can be improved after your change.

Norihiro





Reply sent to Jim Meyering <jim <at> meyering.net>:
You have taken responsibility. (Sun, 02 Feb 2014 16:36:02 GMT) Full text and rfc822 format available.

Notification sent to Norihiro Tanaka <noritnk <at> kcn.ne.jp>:
bug acknowledged by developer. (Sun, 02 Feb 2014 16:36:03 GMT) Full text and rfc822 format available.

Message #28 received at 16544-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 16544-done <at> debbugs.gnu.org
Subject: Re: bug#16544: Optimazation for is_mb_middle
Date: Sun, 2 Feb 2014 08:35:15 -0800
Merged and pushed.




Information forwarded to bug-grep <at> gnu.org:
bug#16544; Package grep. (Mon, 10 Feb 2014 04:57:02 GMT) Full text and rfc822 format available.

Message #31 received at 16544-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 16544-done <at> debbugs.gnu.org
Subject: Re: bug#16544: Optimazation for is_mb_middle
Date: Sun, 9 Feb 2014 20:55:51 -0800
I have just noticed that this (currently topmost) commit lists my name
as the author.
That was not my intention.  Norihiro Tanaka should be listed as the author.
I plan to take the unusual step of replacing that public/pushed commit with an
identical one with that important correction.  Then I will post
instructions telling people
how to deal with this unusual situation.

Norihiro, you can help avoid recurrence by submitting future patches created
by running "git format-patch --stdout -1".  That way, your name is recorded
in the patch, and I don't have to take extra steps to add it when
creating a commit.




Information forwarded to bug-grep <at> gnu.org:
bug#16544; Package grep. (Mon, 10 Feb 2014 13:52:02 GMT) Full text and rfc822 format available.

Message #34 received at 16544-done <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Jim Meyering <jim <at> meyering.net>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 16544-done <at> debbugs.gnu.org
Subject: Re: bug#16544: Optimazation for is_mb_middle
Date: Mon, 10 Feb 2014 22:50:30 +0900
Hi Jim,

Sorry for the trouble.  When I submit future patches, I will create them
with "git format-patch --stdout -1".





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 11 Mar 2014 11:24:06 GMT) Full text and rfc822 format available.

This bug report was last modified 11 years and 162 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.