GNU bug report logs - #71094
[PATCH] Prefer to run find and grep in parallel in rgrep


Package: emacs;

Reported by: Spencer Baugh <sbaugh <at> janestreet.com>

Date: Tue, 21 May 2024 14:36:01 UTC

Severity: normal

Tags: patch

Done: Andrea Corallo <acorallo <at> gnu.org>

Bug is archived. No further changes may be made.



Message #23 received at 71094 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dmitry <at> gutov.dev>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: sbaugh <at> janestreet.com, 71094 <at> debbugs.gnu.org, rgm <at> gnu.org
Subject: Re: bug#71094: [PATCH] Prefer to run find and grep in parallel in rgrep
Date: Wed, 22 May 2024 17:22:56 +0300

On 22/05/2024 16:50, Eli Zaretskii wrote:
>> Date: Wed, 22 May 2024 15:34:06 +0300
>> Cc: 71094 <at> debbugs.gnu.org, rgm <at> gnu.org
>> From: Dmitry Gutov <dmitry <at> gutov.dev>
>>
>> On 22/05/2024 14:59, Eli Zaretskii wrote:
>>
>>> With how many files did you measure the 40% speedup?  Can you show the
>>> performance with much fewer and much more files than what you used?
>>
>> FWIW my test indicated that for a smaller project (such as Emacs) the
>> difference is fairly small - the new code is slightly better or the same.
>>
>> The directory where I saw significant improvement has 300K files.
> 
> That's what I thought.  So we are changing the decade-old defaults to
> favor huge directories, which is not necessarily the wisest thing to
> do.

I don't see any regression on small directories, though. And an 
improvement on big ones.

So the way I see it, we're expanding Emacs's applicability to a wider 
audience without any apparent drawbacks.

It might actually give us an improvement in smaller projects as well, if 
we decrease xargs's batch size (with -s or -n). But those are fairly 
fast already, so it's not critical.
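
As a rough sketch of what lowering the batch size does, the snippet below counts how many times xargs starts its command for a given -n limit. The command shapes in the comments are simplified (rgrep's real templates carry more options), and the function name and the 'sh -c' stand-in for grep are hypothetical, chosen only for illustration.

```shell
# Simplified command shapes (rgrep's real templates carry more options):
#   find . -type f -exec grep -nH PATTERN {} +          # grep started by find
#   find . -type f -print0 | xargs -0 grep -nH PATTERN  # grep started by xargs

# Count how many times xargs invokes its command for the files under DIR
# when each invocation receives at most MAX_ARGS file names (xargs -n).
count_invocations() {
    dir=$1
    max_args=$2
    # "sh -c 'echo run'" stands in for grep and prints one line per invocation.
    find "$dir" -type f -print0 |
        xargs -0 -n "$max_args" sh -c 'echo run' sh |
        wc -l | tr -d ' '
}

# Example: five files, at most two per batch.
demo_dir=$(mktemp -d)
for i in 1 2 3 4 5; do : > "$demo_dir/f$i"; done
count_invocations "$demo_dir" 2   # 3 invocations (2 + 2 + 1 files)
rm -rf "$demo_dir"
```

A smaller -n means more, shorter-lived grep processes and an earlier first invocation; whether that is a net win on small projects would still need measuring.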

>>> I
>>> suspect that the effect depends on that.  (It also depends on the
>>> system limit on the number of files and the length of the command line
>>> that xargs can use.)  The argument about 'find' waiting is no longer
>>> relevant with 'exec-plus', since in most cases there will be just one
>>> invocation of 'grep'.
>>
>> If there's just one invocation, wouldn't that mean that it will happen
>> at the end of the full directory scan? Rather than in parallel.
> 
> That's true, but what is your mental model of how the pipe with xargs
> works in practice?  How many invocations of grep will xargs do, and
> when will the first invocation happen?

In my mental model, xargs acts like an asynchronous queue with batch 
processing. The first invocation happens once the accumulated input 
reaches the configured maximum command-line length or maximum number of 
arguments. They are system-dependent by default.

For example, on my system 'xargs --show-limits' says

  Size of command buffer we are actually using: 131072

Whereas in the Emacs repository, "find ... -print0 | wc" reports 202928 
characters. Meaning, it fills only about 1.5 command buffers' worth of 
'grep' invocations. To see better parallelism there, we'll need to either 
lower the limit or test it in a project at least twice as big.
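
The arithmetic behind that estimate can be sketched as follows. It assumes the character count alone drives batching; real xargs also reserves room for the command itself and other overhead, so the actual number of invocations may be higher. The function names are hypothetical.

```shell
# Lower bound on xargs invocations: ceil(total_chars / buffer_size).
min_invocations() {
    total=$1
    limit=$2
    echo $(( (total + limit - 1) / limit ))
}

# How many command buffers the file list fills, to two decimals.
buffers_filled() {
    awk -v total="$1" -v limit="$2" 'BEGIN { printf "%.2f\n", total / limit }'
}

buffers_filled 202928 131072    # 1.55
min_invocations 202928 131072   # 2
```

So the second 'grep' can only start once the first buffer is full, which leaves little room for overlap in a tree of this size.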

So here is another example: a Linux kernel checkout (76K files). Also 
about 30% improvement: 1.40s vs 2.00s.
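
For reference, the two timings can be turned into a percentage like this. The sketch assumes 2.00s is the old 'find -exec ... +' run and 1.40s is the new xargs pipeline, which is how the figures above read; the function names are hypothetical.

```shell
# Percentage of wall-clock time saved going from OLD to NEW seconds.
time_saved_pct() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.0f\n", (old - new) / old * 100 }'
}

# The same comparison expressed as a throughput ratio (OLD / NEW).
speedup_ratio() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.2f\n", old / new }'
}

time_saved_pct 2.00 1.40   # 30
speedup_ratio 2.00 1.40    # 1.43
```

Note that a 30% reduction in run time corresponds to roughly 1.43x throughput, so the two ways of quoting the gain are not interchangeable.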




This bug report was last modified 326 days ago.
