GNU bug report logs - #20768
RFC: Multithreaded grep

Previous Next

Package: grep;

Reported by: Zev Weiss <zev <at> bewilderbeest.net>

Date: Mon, 8 Jun 2015 05:33:01 UTC

Severity: normal

Tags: patch

Full log


View this message in rfc822 format

From: Zev Weiss <zev <at> bewilderbeest.net>
To: 20768 <at> debbugs.gnu.org
Subject: bug#20768: RFC: Multithreaded grep
Date: Sun, 7 Jun 2015 23:25:43 -0500
Hello,

After what looks like 16+ years as an entry in the TODO file (back to
the first commit in the git history), I decided to see if I couldn't
get grep going multithreaded.  I now have a working version, so I
figured I'd try to get some feedback on it in its current form and
hopefully ultimately get it included in GNU grep.

At a high level: I've added a new flag, -M/--parallel[=N], that
enables multithreaded operation.  The N worker threads (defaulting to
the number of available CPUs) operate in parallel at file granularity,
so it's only of use when operating on multiple files (perhaps via
-R/-r/--directories=recurse).  The main thread opens each file and
adds it to a global queue of files to be searched ('workqueue' in
src/grep.c), and the worker threads then pull items off the queue to
search through.

For some rough performance numbers, on a 3.5GHz 4-core/8-thread
Haswell running Linux 3.16 with a warm pagecache:

 $ time ./src/grep -r FOOB $DIR
 ...

 real    0m13.740s
 user    0m12.424s
 sys     0m2.376s


 $ time ./src/grep -M -r FOOB $DIR
 ...

 real    0m3.348s
 user    0m19.428s
 sys     0m3.480s

$DIR here is a 15GB tree (24299 directories, 305282 files) of assorted
code -- git repos, tarballs, etc.  [Side note: this workload (perhaps
grep in general, not sure) suffered a major performance regression
between v2.20 and v2.21, which I bisected to commit cd36abd4 -- commit
dfff75a4 then recovered some of the lost performance, but it's still
substantially slower than it had been.]

If preferred I can send all the patches directly to the mailing list
(or as one combined diff, whatever's convenient), but for now it's
currently viewable as a series of mostly fairly fine-grained git
commits in the 'multithread' branch at:

 https://github.com/zevweiss/grep

(A few of the small cleanup commits might also be of use independently
of the multithreading effort.)

Some notes, caveats and such:

- It doesn't presently build -- all my testing & development has
  involved running the final link command manually, because (despite
  hours of bashing my head against it) I've been unable to convince
  the bootstrap/autotools/gnulib/etc agglomeration to add -lpthread
  to the command line.  Any advice here would be appreciated.

- There's a slight quirk in command-line interpretation: since the
  '-M' short flag takes an optional numeric argument, 'grep -M4' will
  be interpreted as '--parallel=4', not '--parallel --context=4'.
  This seems safe from a backwards-compatibility standpoint (since no
  existing scripts should be using '-M'), but could potentially
  produce surprising results if existing scripts are modified in
  particular ways (e.g. changing 'grep -R4' to 'grep -RM4').  I'm
  open to suggestions on more graceful ways to handle this.

- I'm not really at all familiar with the internal workings of grep's
  actual text-search guts, so some of the identifiers I've chosen in
  the parts that had to touch those bits (or elsewhere, for that
  matter) may not be the best.

- Some of the un-sharing (moving global state into thread-private
  structs) is probably a bit heavy-handed; i.e. some things are now
  thread-private that could have remained global, but (at least for
  now) it was easier to just do it that way.

- Along similar lines, it would probably be nicer to have a way to
  just copy the result of `compile()' rather than re-compiling for
  every thread.

- Testing: I've extended a few of the tests in the testsuite to also
  run multithreaded, but the coverage isn't exactly extensive.

- I haven't (yet) completed the FSF's copyright-assignment process,
  though I have sent the initial email requesting the paperwork.

- Commit messages are in a different style than used in the grep git
  repo; I'll fix them up depending on how exactly they should be
  bundled together.

- Though I have attempted to follow the GNU coding style, my own
  usual coding style differs from it quite a bit, and there may as a
  result be some stylistic misfits here and there.  I'm happy to fix
  any of these anyone points out.


Any/all feedback welcome.


Thanks,
Zev Weiss





This bug report was last modified 283 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.