GNU bug report logs - #20768
RFC: Multithreaded grep

Previous Next

Package: grep;

Reported by: Zev Weiss <zev <at> bewilderbeest.net>

Date: Mon, 8 Jun 2015 05:33:01 UTC

Severity: normal

Tags: patch

To reply to this bug, email your comments to 20768 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#20768; Package grep. (Mon, 08 Jun 2015 05:33:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Zev Weiss <zev <at> bewilderbeest.net>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 08 Jun 2015 05:33:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Zev Weiss <zev <at> bewilderbeest.net>
To: bug-grep <at> gnu.org
Subject: RFC: Multithreaded grep
Date: Sun, 7 Jun 2015 23:25:43 -0500
Hello,

After what looks like 16+ years as an entry in the TODO file (back to
the first commit in the git history), I decided to see if I couldn't
get grep going multithreaded.  I now have a working version, so I
figured I'd try to get some feedback on it in its current form and
hopefully ultimately get it included in GNU grep.

At a high level: I've added a new flag, -M/--parallel[=N], that
enables multithreaded operation.  The N worker threads (defaulting to
the number of available CPUs) operate in parallel at file granularity,
so it's only of use when operating on multiple files (perhaps via
-R/-r/--directories=recurse).  The main thread opens each file and
adds it to a global queue of files to be searched ('workqueue' in
src/grep.c), and the worker threads then pull items off the queue to
search through.

For some rough performance numbers, on a 3.5GHz 4-core/8-thread
Haswell running Linux 3.16 with a warm pagecache:

 $ time ./src/grep -r FOOB $DIR
 ...

 real    0m13.740s
 user    0m12.424s
 sys     0m2.376s


 $ time ./src/grep -M -r FOOB $DIR
 ...

 real    0m3.348s
 user    0m19.428s
 sys     0m3.480s

$DIR here is a 15GB tree (24299 directories, 305282 files) of assorted
code -- git repos, tarballs, etc.  [Side note: this workload (perhaps
grep in general, not sure) suffered a major performance regression
between v2.20 and v2.21, which I bisected to commit cd36abd4 -- commit
dfff75a4 then recovered some of the lost performance, but it's still
substantially slower than it had been.]

If preferred I can send all the patches directly to the mailing list
(or as one combined diff, whatever's convenient), but for now it's
currently viewable as a series of mostly fairly fine-grained git
commits in the 'multithread' branch at:

 https://github.com/zevweiss/grep

(A few of the small cleanup commits might also be of use independently
of the multithreading effort.)

Some notes, caveats and such:

- It doesn't presently build -- all my testing & development has
  involved running the final link command manually, because (despite
  hours of bashing my head against it) I've been unable to convince
  the bootstrap/autotools/gnulib/etc agglomeration to add -lpthread
  to the command line.  Any advice here would be appreciated.

- There's a slight quirk in command-line interpretation: since the
  '-M' short flag takes an optional numeric argument, 'grep -M4' will
  be interpreted as '--parallel=4', not '--parallel --context=4'.
  This seems safe from a backwards-compatibility standpoint (since no
  existing scripts should be using '-M'), but could potentially
  produce surprising results if existing scripts are modified in
  particular ways (e.g. changing 'grep -R4' to 'grep -RM4').  I'm
  open to suggestions on more graceful ways to handle this.

- I'm not really at all familiar with the internal workings of grep's
  actual text-search guts, so some of the identifiers I've chosen in
  the parts that had to touch those bits (or elsewhere, for that
  matter) may not be the best.

- Some of the un-sharing (moving global state into thread-private
  structs) is probably a bit heavy-handed; i.e. some things are now
  thread-private that could have remained global, but (at least for
  now) it was easier to just do it that way.

- Along similar lines, it would probably be nicer to have a way to
  just copy the result of `compile()' rather than re-compiling for
  every thread.

- Testing: I've extended a few of the tests in the testsuite to also
  run multithreaded, but the coverage isn't exactly extensive.

- I haven't (yet) completed the FSF's copyright-assignment process,
  though I have sent the initial email requesting the paperwork.

- Commit messages are in a different style than used in the grep git
  repo; I'll fix them up depending on how exactly they should be
  bundled together.

- Though I have attempted to follow the GNU coding style, my own
  usual coding style differs from it quite a bit, and there may as a
  result be some stylistic misfits here and there.  I'm happy to fix
  any of these anyone points out.


Any/all feedback welcome.


Thanks,
Zev Weiss





Information forwarded to bug-grep <at> gnu.org:
bug#20768; Package grep. (Tue, 09 Jun 2015 07:07:02 GMT) Full text and rfc822 format available.

Message #8 received at 20768 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zev Weiss <zev <at> bewilderbeest.net>, 20768 <at> debbugs.gnu.org
Subject: Re: bug#20768: RFC: Multithreaded grep
Date: Tue, 09 Jun 2015 00:05:57 -0700
Thanks very much for looking into this.  I'll take a look at it in more detail 
once the paperwork goes through.  I use grep -r a lot, and would appreciate the 
speedup.




Information forwarded to bug-grep <at> gnu.org:
bug#20768; Package grep. (Tue, 09 Jun 2015 10:01:03 GMT) Full text and rfc822 format available.

Message #11 received at 20768 <at> debbugs.gnu.org (full text, mbox):

From: Aaron Crane <grep <at> aaroncrane.co.uk>
To: Zev Weiss <zev <at> bewilderbeest.net>
Cc: 20768 <at> debbugs.gnu.org
Subject: Re: bug#20768: RFC: Multithreaded grep
Date: Tue, 9 Jun 2015 11:00:34 +0100
Zev Weiss <zev <at> bewilderbeest.net> wrote:
> At a high level: I've added a new flag, -M/--parallel[=N], that
> enables multithreaded operation.

Thanks, this looks really interesting.

I'd like to suggest changing the code to use -j/--jobs as the name for
the relevant option; that would match both GNU Make and GNU parallel.
(GNU parallel also allows -P/--max-procs as an alias, but -P already
has a meaning in GNU grep.)

-- 
Aaron Crane ** http://aaroncrane.co.uk/




Information forwarded to bug-grep <at> gnu.org:
bug#20768; Package grep. (Tue, 09 Jun 2015 10:27:02 GMT) Full text and rfc822 format available.

Message #14 received at 20768 <at> debbugs.gnu.org (full text, mbox):

From: Zev Weiss <zev <at> bewilderbeest.net>
To: Aaron Crane <grep <at> aaroncrane.co.uk>
Cc: 20768 <at> debbugs.gnu.org
Subject: Re: bug#20768: RFC: Multithreaded grep
Date: Tue, 9 Jun 2015 05:26:37 -0500
On Tue, Jun 09, 2015 at 11:00:34AM +0100, Aaron Crane wrote:
>Zev Weiss <zev <at> bewilderbeest.net> wrote:
>> At a high level: I've added a new flag, -M/--parallel[=N], that
>> enables multithreaded operation.
>
>Thanks, this looks really interesting.
>
>I'd like to suggest changing the code to use -j/--jobs as the name for
>the relevant option; that would match both GNU Make and GNU parallel.
>(GNU parallel also allows -P/--max-procs as an alias, but -P already
>has a meaning in GNU grep.)
>
>-- 
>Aaron Crane ** http://aaroncrane.co.uk/

Hmm -- I picked --parallel largely for consistency with the 
corresponding flag for coreutils' sort, which strikes me as a closer 
relative to grep than either make or parallel.  Also, the word "jobs" 
has (at least to me) a definite suggestion of multiple processes rather 
than threads.  sort doesn't have a matching short option though, so I 
went with -M to suggest "mulithreaded" (since, as you point out, -P is 
already in use).  Though I notice now that lower-case -p is still 
available; perhaps that might be better than -M.


Zev





Information forwarded to bug-grep <at> gnu.org:
bug#20768; Package grep. (Tue, 09 Jun 2015 11:05:02 GMT) Full text and rfc822 format available.

Message #17 received at 20768 <at> debbugs.gnu.org (full text, mbox):

From: Aaron Crane <grep <at> aaroncrane.co.uk>
To: Zev Weiss <zev <at> bewilderbeest.net>
Cc: 20768 <at> debbugs.gnu.org
Subject: Re: bug#20768: RFC: Multithreaded grep
Date: Tue, 9 Jun 2015 12:04:11 +0100
Zev Weiss <zev <at> bewilderbeest.net> wrote:
> Hmm -- I picked --parallel largely for consistency with the corresponding
> flag for coreutils' sort, which strikes me as a closer relative to grep than
> either make or parallel.

That's a good point; I wasn't aware of sort's --parallel option.
Though I also note that "sort --parallel=4" limits the number of
threads to 4, rather than increasing the number of threads from 1 to
4, so the comparison isn't exact.

> sort doesn't
> have a matching short option though, so I went with -M to suggest
> "mulithreaded" (since, as you point out, -P is already in use).  Though I
> notice now that lower-case -p is still available; perhaps that might be
> better than -M.

I'm a little unhappy about the idea of proliferating the world's set
of short options in this space, to be honest. If grep didn't already
have -P, I'd be happy enough with -P and either --parallel or
--max-procs, but I'm not terribly fond of the idea of introducing
either -M or -p.

-- 
Aaron Crane ** http://aaroncrane.co.uk/




Information forwarded to bug-grep <at> gnu.org:
bug#20768; Package grep. (Tue, 09 Jun 2015 19:42:02 GMT) Full text and rfc822 format available.

Message #20 received at 20768 <at> debbugs.gnu.org (full text, mbox):

From: Zev Weiss <zev <at> bewilderbeest.net>
To: Aaron Crane <grep <at> aaroncrane.co.uk>
Cc: 20768 <at> debbugs.gnu.org
Subject: Re: bug#20768: RFC: Multithreaded grep
Date: Tue, 9 Jun 2015 14:41:32 -0500
On Tue, Jun 09, 2015 at 12:04:11PM +0100, Aaron Crane wrote:
>Zev Weiss <zev <at> bewilderbeest.net> wrote:
>> Hmm -- I picked --parallel largely for consistency with the corresponding
>> flag for coreutils' sort, which strikes me as a closer relative to grep than
>> either make or parallel.
>
>That's a good point; I wasn't aware of sort's --parallel option.
>Though I also note that "sort --parallel=4" limits the number of
>threads to 4, rather than increasing the number of threads from 1 to
>4, so the comparison isn't exact.
>
>> sort doesn't
>> have a matching short option though, so I went with -M to suggest
>> "mulithreaded" (since, as you point out, -P is already in use).  Though I
>> notice now that lower-case -p is still available; perhaps that might be
>> better than -M.
>
>I'm a little unhappy about the idea of proliferating the world's set
>of short options in this space, to be honest. If grep didn't already
>have -P, I'd be happy enough with -P and either --parallel or
>--max-procs, but I'm not terribly fond of the idea of introducing
>either -M or -p.
>
>-- 
>Aaron Crane ** http://aaroncrane.co.uk/

True, I suppose that's a reasonable concern (especially given how many 
there are now).  My thought was that at least for me (and it sounds like 
perhaps Paul as well) this would be fairly likely to be a commonly used 
option, so I'd like a nice concise way of enabling it.  With sort 
there's no real downside to just enabling multithreading by default, so 
a longopt-only flag is fine.  With grep however (at least with my 
current implementation) there are tradeoffs with output ordering that 
may be undesirable (and which I don't see a good way around without 
introducing a bunch of potentially-complicated and performance-reducing 
per-file output buffering), so I kept it off by default.

There's also the question of the argument parsing mentioned in my 
original email -- as it stands now, '-M' would be the only short option 
with an optional argument, which has potential to be confusing.  
Thinking about it a bit more, I realize that what I really want out of 
the short flag is just a shorter way to say --parallel=NUMCPUS (and not 
have to remember how many CPUs the machine I'm on has), so perhaps 
another possibility on that front would be to leave the long option 
as-is but have the short flag (assuming there is one) not take an 
argument (though I suppose that could perhaps be seen as confusing in 
its own way too).


Zev





Information forwarded to bug-grep <at> gnu.org:
bug#20768; Package grep. (Thu, 11 Jun 2015 16:45:02 GMT) Full text and rfc822 format available.

Message #23 received at 20768 <at> debbugs.gnu.org (full text, mbox):

From: Zev Weiss <zev <at> bewilderbeest.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 20768 <at> debbugs.gnu.org
Subject: Re: bug#20768: RFC: Multithreaded grep
Date: Thu, 11 Jun 2015 11:44:28 -0500
On Tue, Jun 09, 2015 at 12:05:57AM -0700, Paul Eggert wrote:
>Thanks very much for looking into this.  I'll take a look at it in 
>more detail once the paperwork goes through.  I use grep -r a lot, and 
>would appreciate the speedup.

OK, I've received confirmation from Donald Robertson at the FSF that the 
assignment process is complete.  I've rebased that branch on github with 
a few minor style & bug fixes since my initial email, but it's still 
basically as described.

Zev





Information forwarded to bug-grep <at> gnu.org:
bug#20768; Package grep. (Wed, 01 Jul 2015 17:48:02 GMT) Full text and rfc822 format available.

Message #26 received at 20768 <at> debbugs.gnu.org (full text, mbox):

From: Zev Weiss <zev <at> bewilderbeest.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 20768 <at> debbugs.gnu.org
Subject: Re: bug#20768: RFC: Multithreaded grep
Date: Wed, 1 Jul 2015 12:47:29 -0500
On Thu, Jun 11, 2015 at 11:44:28AM -0500, Zev Weiss wrote:
>On Tue, Jun 09, 2015 at 12:05:57AM -0700, Paul Eggert wrote:
>>Thanks very much for looking into this.  I'll take a look at it in 
>>more detail once the paperwork goes through.  I use grep -r a lot, 
>>and would appreciate the speedup.
>
>OK, I've received confirmation from Donald Robertson at the FSF that 
>the assignment process is complete.  I've rebased that branch on 
>github with a few minor style & bug fixes since my initial email, but 
>it's still basically as described.
>
>Zev
>

Any further thoughts on this now that the paperwork's done?  I realize 
reviewing large patches is non-trivial, but if there's anything I can do 
to ease the process (e.g. changes to how the patches are organized) 
please let me know.


Thanks,
Zev





Information forwarded to bug-grep <at> gnu.org:
bug#20768; Package grep. (Wed, 01 Jul 2015 19:19:01 GMT) Full text and rfc822 format available.

Message #29 received at 20768 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zev Weiss <zev <at> bewilderbeest.net>
Cc: 20768 <at> debbugs.gnu.org
Subject: Re: bug#20768: RFC: Multithreaded grep
Date: Wed, 01 Jul 2015 12:18:11 -0700
Zev Weiss wrote:
>
> Any further thoughts on this now that the paperwork's done?

Not yet I'm afraid.  I do hope to get to grep tasks sometime this month....




Information forwarded to bug-grep <at> gnu.org:
bug#20768; Package grep. (Sun, 10 Apr 2016 22:10:01 GMT) Full text and rfc822 format available.

Message #32 received at 20768 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zev Weiss <zev <at> bewilderbeest.net>, Jim Meyering <jim <at> meyering.net>
Cc: Bjoern Jacke <bjoern <at> j3e.de>, 23234-done <at> debbugs.gnu.org,
 20768 <at> debbugs.gnu.org, 23234 <at> debbugs.gnu.org
Subject: Re: bug#23234: unexpected results with charset handling in GNU grep
 2.23
Date: Sun, 10 Apr 2016 15:09:17 -0700
On 04/10/2016 02:59 PM, Zev Weiss wrote:
> I still have my multithreading patch series 
> (https://github.com/zevweiss/grep/) awaiting review, which I'd hope to 
> get applied at some point, though I'd guess it's enough of a review 
> task that delaying an impending release for it isn't likely (the 
> mbtoupper()-removal patch made that series one patch shorter though, 
> since one was to deal with that function's thread-unsafety).  I've 
> been rebasing it periodically and running it on my own system in 
> /usr/local without any problems for a while now, for what that's worth.
>
> With current HEAD from savannah though, all check-very-expensive tests 
> pass for me on Debian stretch with gcc 5.3, glibc 2.22, and Linux 
> kernel 4.3.

Thanks for pinging us about this. Sorry, I kind of dropped the ball on 
this one. I will try to bump its priority. There are some other 
long-pending patches that also need review. I agree that these shouldn't 
delay the next release, but perhaps it could delay the release after 
that....




Added tag(s) patch. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Mon, 11 Apr 2016 04:45:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#20768; Package grep. (Wed, 04 Sep 2024 07:25:02 GMT) Full text and rfc822 format available.

Message #37 received at 20768 <at> debbugs.gnu.org (full text, mbox):

From: Dale Maggee <antisol <at> antisol.org>
To: 20768 <at> debbugs.gnu.org
Subject: Status
Date: Wed, 4 Sep 2024 12:50:56 +1000
I'm just curious as to the status of this near-decade-old patch? This 
would be a super-useful feature which I'd use on a near-daily basis.

It appears the project made the submitter jump through its FSF paperwork 
hoops, which he did, and then proceeded to do nothing with it? Was there 
some issue or blocker which didn't get mentioned on this bug report?




Information forwarded to bug-grep <at> gnu.org:
bug#20768; Package grep. (Wed, 04 Sep 2024 17:37:01 GMT) Full text and rfc822 format available.

Message #40 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Thomas Wolff <towo <at> towo.net>
To: bug-grep <at> gnu.org
Subject: Re: bug#20768: Status
Date: Wed, 4 Sep 2024 19:35:37 +0200
So what is bug #20768?

Am 04.09.2024 um 04:50 schrieb Dale Maggee:
> I'm just curious as to the status of this near-decade-old patch? This
> would be a super-useful feature which I'd use on a near-daily basis.
>
> It appears the project made the submitter jump through its FSF
> paperwork hoops, which he did, and then proceeded to do nothing with
> it? Was there some issue or blocker which didn't get mentioned on this
> bug report?
>
>
>





This bug report was last modified 283 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.