GNU bug report logs - #64735
29.0.92; find invocations are ~15x slower because of ignores

Previous Next

Package: emacs;

Reported by: Spencer Baugh <sbaugh <at> janestreet.com>

Date: Wed, 19 Jul 2023 21:17:02 UTC

Severity: normal

Found in version 29.0.92

Full log


Message #551 received at 64735 <at> debbugs.gnu.org (full text, mbox):

From: Spencer Baugh <sbaugh <at> janestreet.com>
To: Dmitry Gutov <dmitry <at> gutov.dev>
Cc: 64735 <at> debbugs.gnu.org
Subject: Re: bug#64735: 29.0.92; find invocations are ~15x slower because of
 ignores
Date: Tue, 02 Sep 2025 12:23:10 -0400
[Message part 1 (text/plain, inline)]
Dmitry Gutov <dmitry <at> gutov.dev> writes:

> On 20/07/2023 00:16, Spencer Baugh wrote:
>> In Emacs alone, there are a few things we could do:
>> - we could mitigate the find bug by optimizing the regexp before we pass
>> it to find; this should basically remove all the overhead but makes the
>> find command uglier and harder to edit
>
> I like these two approaches.

Much delayed, but the attached patch implements this optimization
approach.  In my testing, it provides a substantial speed-up for rgrep.

[0001-Optimize-arguments-passed-to-find-in-rgrep.patch (text/x-patch, inline)]
From f0e790f59752108352e5e11919c90c15a8caa5a9 Mon Sep 17 00:00:00 2001
From: Spencer Baugh <sbaugh <at> janestreet.com>
Date: Fri, 24 May 2024 15:22:35 -0400
Subject: [PATCH] Optimize arguments passed to find in rgrep

Previously, Emacs passed each element of grep-find-ignored-directories
and grep-find-ignored-files to find as a separate -path or -name
argument.  In emacs -Q, this is 73 -path or -name arguments.  Likewise,
the globs passed in the FILES arguments to limit the search were each
passed as a separate argument.

Optimizing these globs to just three -regex arguments (FILES,
grep-find-ignored-directories, and grep-find-ignored-files)
substantially improves find performance.

On my system, in the emacs source tree, rgrep goes from taking 410ms to
180ms.  In a larger tree, it goes from 11.7s to 2.8s.

Ideally, GNU find would do this optimization internally, I requested
this here: https://savannah.gnu.org/bugs/index.php?58197

As part of doing this, dired-glob-regexp now takes an additional
argument to tell it not to anchor the glob to match an entire string -
this is required because -name globs are supposed to match only the
basename, but -regex is matched against the entire filename.

Also, rx-bracket-open and regexp-opt-bracket-open allow controlling
whether rx and regexp-opt use \(?: or \( to group regexes.  find (and
gnulib) only support the latter, unfortunately.

* lisp/dired.el (dired-glob-regexp): Add NOANCHOR argument.
* lisp/emacs-lisp/regexp-opt.el (regexp-opt-bracket-open): Add.
(regexp-opt, regexp-opt-group): Use regexp-opt-bracket-open.
* lisp/emacs-lisp/rx.el (rx-bracket-open): Add.
(rx--bracket, rx--translate-or): Use rx-bracket-open.
* lisp/progmodes/grep.el (grep-find-optimize-matching): Add,
defaulting to t. (bug#64735)
(grep--globs-to-rx-form, grep--globs-to-find-argument): Add.
(rgrep-default-command): Use grep--globs-to-find-argument.
---
 lisp/dired.el                 | 11 ++--
 lisp/emacs-lisp/regexp-opt.el | 12 ++++-
 lisp/emacs-lisp/rx.el         | 15 ++++--
 lisp/progmodes/grep.el        | 98 ++++++++++++++++++++++++++++-------
 4 files changed, 109 insertions(+), 27 deletions(-)

diff --git a/lisp/dired.el b/lisp/dired.el
index 996ca9c23bb..3da914e35cd 100644
--- a/lisp/dired.el
+++ b/lisp/dired.el
@@ -3586,8 +3586,11 @@ dired-buffers-for-dir-or-subdir
         (setq result (cons buf result)))))
     result))
 
-(defun dired-glob-regexp (pattern)
-  "Convert glob-pattern PATTERN to a regular expression."
+(defun dired-glob-regexp (pattern &optional noanchor)
+  "Convert glob-pattern PATTERN to a regular expression.
+
+If NOANCHOR is non-nil, the regular expression is not anchored to
+match an entire string."
   (let ((matched-in-pattern 0)  ;; How many chars of PATTERN we've handled.
 	regexp)
     (while (string-match "[[?*]" pattern matched-in-pattern)
@@ -3614,11 +3617,11 @@ dired-glob-regexp
 	      ((= next-op ?*)
 	       (setq regexp (concat regexp ".*"))
 	       (setq matched-in-pattern op-end)))))
-    (concat "\\`"
+    (concat (unless noanchor "\\`")
 	    regexp
 	    (regexp-quote
 	     (substring pattern matched-in-pattern))
-	    "\\'")))
+            (unless noanchor "\\'"))))
 
 (defun dired-advertise ()
   ;;"Advertise in variable `dired-buffers' that we dired `default-directory'."
diff --git a/lisp/emacs-lisp/regexp-opt.el b/lisp/emacs-lisp/regexp-opt.el
index 3edaca78e32..d24f6f9a6e5 100644
--- a/lisp/emacs-lisp/regexp-opt.el
+++ b/lisp/emacs-lisp/regexp-opt.el
@@ -83,6 +83,14 @@
 
 ;;; Code:
 
+(defvar regexp-opt-bracket-open "\\(?:"
+  "The opening string used to group a subexpression, followed by \"\\)\".
+
+`regexp-opt' will use this when it needs to group a subexpression together.
+
+Dynamically binding this is useful to produce regular expressions which
+are compatible with regexp engines which don't support \"\\(?:\".")
+
 ;;;###autoload
 (defun regexp-opt (strings &optional paren)
   "Return a regexp to match a string in the list STRINGS.
@@ -143,7 +151,7 @@ regexp-opt
                     (delete-dups (sort strings))
                     (or open t) (not open))
                  ;; No strings: return an unmatchable regexp.
-                 (concat (or open "\\(?:") regexp-unmatchable "\\)"))))
+                 (concat (or open regexp-opt-bracket-open) regexp-unmatchable "\\)"))))
       (cond ((eq paren 'words)
 	     (concat "\\<" re "\\>"))
 	    ((eq paren 'symbols)
@@ -185,7 +193,7 @@ regexp-opt-group
   ;; Also we delay the addition of grouping parenthesis as long as possible
   ;; until we're sure we need them, and try to remove one-character sequences
   ;; so we can use character sets rather than grouping parenthesis.
-  (let* ((open-group (cond ((stringp paren) paren) (paren "\\(?:") (t "")))
+  (let* ((open-group (cond ((stringp paren) paren) (paren regexp-opt-bracket-open) (t "")))
 	 (close-group (if paren "\\)" ""))
 	 (open-charset (if lax "" open-group))
 	 (close-charset (if lax "" close-group)))
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 58f95c7d89a..51f2e45f1ea 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -223,8 +223,16 @@ rx--enclose
   "Bracket REXP by LEFT-STR and RIGHT-STR."
   (append (list left-str) rexp (list right-str)))
 
+(defvar rx-bracket-open "\\(?:"
+  "The opening string used to group a subexpression, followed by \"\\)\".
+
+`rx' will use this when it needs to group a subexpression together.
+
+Dynamically binding this is useful to produce regular expressions which
+are compatible with regexp engines which don't support \"\\(?:\".")
+
 (defun rx--bracket (rexp)
-  (rx--enclose "\\(?:" rexp "\\)"))
+  (rx--enclose rx-bracket-open rexp "\\)"))
 
 (defun rx--sequence (left right)
   "Return the sequence (concatenation) of two translated items,
@@ -490,8 +498,9 @@ rx--translate-or
     (let ((args (mapcar #'rx--normalise-char-pattern body)))
       (if (rx--all-string-branches-p args)
           ;; All branches are strings: use `regexp-opt'.
-          (cons (list (regexp-opt (rx--collect-or-strings args) nil))
-                t)
+          (let ((regexp-opt-bracket-open rx-bracket-open))
+            (cons (list (regexp-opt (rx--collect-or-strings args) nil))
+                  t))
         (let ((form (rx--optimise-or-args args)))
           (if (eq (car-safe form) 'or)
               (let ((branches (cdr form)))
diff --git a/lisp/progmodes/grep.el b/lisp/progmodes/grep.el
index f91cd2cc400..3e6bf6b542e 100644
--- a/lisp/progmodes/grep.el
+++ b/lisp/progmodes/grep.el
@@ -270,6 +270,17 @@ grep-find-ignored-files
   :type '(choice (repeat :tag "Ignored file" string)
 		 (const :tag "No ignored files" nil)))
 
+(defcustom grep-find-optimize-matching t
+  "If non-nil, optimize find arguments when building `rgrep' commands.
+
+Instead of passing individual \"-path\" and \"-name\" arguments,
+construct a single \"-regex\" argument for each, which substantially
+improves performance.
+
+This affects ignores from `grep-find-ignored-directories' and
+`grep-find-ignored-files' and the globs passed as FILES to `rgrep'."
+  :type 'boolean)
+
 (defcustom grep-save-buffers 'ask
   "If non-nil, save buffers before running the grep commands.
 If `ask', ask before saving.  If a function, call it with no arguments
@@ -1490,28 +1501,85 @@ rgrep-find-ignored-directories
   "Return the list of ignored directories applicable to DIR."
   (grep--filter-list-by-dir grep-find-ignored-directories dir))
 
+(defun grep--globs-to-rx-form (globs)
+  "Return an `rx' form which matches any of GLOBS.
+
+Elements of GLOBS which are just literal strings or are of the form
+\"*string\" are grouped into a separate `or' form, so they'll be
+optimized by `rx' (which calls `regexp-opt')."
+  (require 'dired)
+  (declare-function dired-glob-regexp "dired" (pattern &optional noanchor))
+  (let (rx-forms)
+    (let (literals suffixes)
+      (dolist (glob globs)
+        (cond
+         ((not (string-match "[[?*]" glob 0))
+          ;; The glob is just a literal string.
+          (push glob literals))
+         ((and
+           (not (string-empty-p glob))
+           (not (string-match "[[?*]" glob 1))
+           (eq (aref glob 0) ?*))
+          ;; The glob is * followed by a literal string.  This handles the
+          ;; elements of `grep-find-ignored-files' which come from
+          ;; `completion-ignored-extensions'.
+          (push (substring glob 1) suffixes))
+         (t
+          ;; The glob is something else more complex.
+          (push `(regexp ,(dired-glob-regexp glob 'noanchor)) rx-forms))))
+      (when literals
+        (push `(or . ,literals) rx-forms))
+      (when suffixes
+        (push `(seq (* any) (or . ,suffixes)) rx-forms)))
+    `(or . ,rx-forms)))
+
+(defun grep--globs-to-find-argument (globs arg)
+  "Convert GLOBS into a series of arguments to \"find\".
+
+ARG is put in front of each glob and the arguments are concatenated
+into a string and escaped for the shell.
+
+If `grep-find-optimize-matching' is non-nil and ARG is a known valid
+find argument, then instead return a single \"-regex\" or \"-iregex\"
+argument which matches all of GLOBS."
+  (if-let* ((grep-find-optimize-matching)
+            (known-arg (assoc arg '(("-path" "-regex" 'path)
+                                    ("-name" "-regex" 'name)
+                                    ("-ipath" "-iregex" 'path)
+                                    ("-iname" "-iregex" 'name))))
+            ((> (length globs) 1)))
+      (let* ((globs-form (grep--globs-to-rx-form globs))
+             (form (if (eq (nth 2 known-arg) 'path)
+                       ;; Globs match the whole filename
+                       globs-form
+                     ;; Globs match the basename
+                     `(seq (* any) "/" ,globs-form string-end))))
+        (concat
+         (nth 1 known-arg) " "
+         (shell-quote-argument
+          ;; Make the produced regexp find-compatible.
+          (let ((rx-bracket-open "\\("))
+            (rx-to-string form 'nogroup))
+          grep-quoting-style)))
+    (concat arg " "
+            (mapconcat
+             (lambda (ignore) (shell-quote-argument ignore grep-quoting-style))
+             globs
+             (concat " -o " arg " ")))))
+
 (defun rgrep-default-command (regexp files dir)
   "Compute the command for \\[rgrep] to use by default."
   (require 'find-dired)                 ; for `find-name-arg'
   (let ((ignored-files-arg
          (when-let* ((ignored-files (grep-find-ignored-files dir)))
            (concat (shell-quote-argument "(" grep-quoting-style)
-                   ;; we should use shell-quote-argument here
-                   " -name "
-                   (mapconcat
-                    (lambda (ignore) (shell-quote-argument ignore grep-quoting-style))
-                    ignored-files
-                    " -o -name ")
+                   " " (grep--globs-to-find-argument ignored-files "-name")
                    " " (shell-quote-argument ")" grep-quoting-style)))))
     (grep-expand-template
      grep-find-template
      regexp
      (concat (shell-quote-argument "(" grep-quoting-style)
-             " " find-name-arg " "
-             (mapconcat
-              (lambda (x) (shell-quote-argument x grep-quoting-style))
-              (split-string files)
-              (concat " -o " find-name-arg " "))
+             " " (grep--globs-to-find-argument (split-string files) find-name-arg)
              " "
              (shell-quote-argument ")" grep-quoting-style)
              (when (and rgrep-find-ignores-in-<f> ignored-files-arg)
@@ -1521,13 +1589,7 @@ rgrep-default-command
       (when-let* ((ignored-dirs (rgrep-find-ignored-directories dir)))
         (concat "-type d "
                 (shell-quote-argument "(" grep-quoting-style)
-                ;; we should use shell-quote-argument here
-                " -path "
-                (mapconcat
-                 (lambda (d)
-                   (shell-quote-argument (concat "*/" d) grep-quoting-style))
-                 ignored-dirs
-                 " -o -path ")
+                " " (grep--globs-to-find-argument ignored-dirs "-path")
                 " "
                 (shell-quote-argument ")" grep-quoting-style)
                 " -prune -o "))
-- 
2.43.7


This bug report was last modified 16 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.