Package: emacs;
Reported by: Spencer Baugh <sbaugh <at> janestreet.com>
Date: Wed, 19 Jul 2023 21:17:02 UTC
Severity: normal
Found in version 29.0.92
Message #551 received at 64735 <at> debbugs.gnu.org (full text, mbox):
From: Spencer Baugh <sbaugh <at> janestreet.com> To: Dmitry Gutov <dmitry <at> gutov.dev> Cc: 64735 <at> debbugs.gnu.org Subject: Re: bug#64735: 29.0.92; find invocations are ~15x slower because of ignores Date: Tue, 02 Sep 2025 12:23:10 -0400
[Message part 1 (text/plain, inline)]
Dmitry Gutov <dmitry <at> gutov.dev> writes: > On 20/07/2023 00:16, Spencer Baugh wrote: >> In Emacs alone, there are a few things we could do: >> - we could mitigate the find bug by optimizing the regexp before we pass >> it to find; this should basically remove all the overhead but makes the >> find command uglier and harder to edit > > I like these two approaches. Much delayed, but the attached patch implements this optimization approach. In my testing, it provides a substantial speed-up for rgrep.
[0001-Optimize-arguments-passed-to-find-in-rgrep.patch (text/x-patch, inline)]
From f0e790f59752108352e5e11919c90c15a8caa5a9 Mon Sep 17 00:00:00 2001 From: Spencer Baugh <sbaugh <at> janestreet.com> Date: Fri, 24 May 2024 15:22:35 -0400 Subject: [PATCH] Optimize arguments passed to find in rgrep Previously, Emacs passed each element of grep-find-ignored-directories and grep-find-ignored-files to find as a separate -path or -name argument. In emacs -Q, this is 73 -path or -name arguments. Likewise, the globs passed in the FILES arguments to limit the search were each passed as a separate argument. Optimizing these globs to just three -regex arguments (FILES, grep-find-ignored-directories, and grep-find-ignored-files) substantially improves find performance. On my system, in the emacs source tree, rgrep goes from taking 410ms to 180ms. In a larger tree, it goes from 11.7s to 2.8s. Ideally, GNU find would do this optimization internally, I requested this here: https://savannah.gnu.org/bugs/index.php?58197 As part of doing this, dired-glob-regexp now takes an additional argument to tell it not to anchor the glob to match an entire string - this is required because -name globs are supposed to match only the basename, but -regex is matched against the entire filename. Also, rx-bracket-open and regexp-opt-bracket-open allow controlling whether rx and regexp-opt use \(?: or \( to group regexes. find (and gnulib) only support the latter, unfortunately. * lisp/dired.el (dired-glob-regexp): Add NOANCHOR argument. * lisp/emacs-lisp/regexp-opt.el (regexp-opt-bracket-open): Add. (regexp-opt, regexp-opt-group): Use regexp-opt-bracket-open. * lisp/emacs-lisp/rx.el (rx-bracket-open): Add. (rx--bracket, rx--translate-or): Use rx-bracket-open. * lisp/progmodes/grep.el (grep-find-optimize-matching): Add, defaulting to t. (bug#64735) (grep--globs-to-rx-form, grep--globs-to-find-argument): Add. (rgrep-default-command): Use grep--globs-to-find-argument. --- lisp/dired.el | 11 ++-- lisp/emacs-lisp/regexp-opt.el | 12 ++++- lisp/emacs-lisp/rx.el | 15 ++++-- lisp/progmodes/grep.el | 98 ++++++++++++++++++++++++++++------- 4 files changed, 109 insertions(+), 27 deletions(-) diff --git a/lisp/dired.el b/lisp/dired.el index 996ca9c23bb..3da914e35cd 100644 --- a/lisp/dired.el +++ b/lisp/dired.el @@ -3586,8 +3586,11 @@ dired-buffers-for-dir-or-subdir (setq result (cons buf result))))) result)) -(defun dired-glob-regexp (pattern) - "Convert glob-pattern PATTERN to a regular expression." +(defun dired-glob-regexp (pattern &optional noanchor) + "Convert glob-pattern PATTERN to a regular expression. + +If NOANCHOR is non-nil, the regular expression is not anchored to +match an entire string." (let ((matched-in-pattern 0) ;; How many chars of PATTERN we've handled. regexp) (while (string-match "[[?*]" pattern matched-in-pattern) @@ -3614,11 +3617,11 @@ dired-glob-regexp ((= next-op ?*) (setq regexp (concat regexp ".*")) (setq matched-in-pattern op-end))))) - (concat "\\`" + (concat (unless noanchor "\\`") regexp (regexp-quote (substring pattern matched-in-pattern)) - "\\'"))) + (unless noanchor "\\'")))) (defun dired-advertise () ;;"Advertise in variable `dired-buffers' that we dired `default-directory'." diff --git a/lisp/emacs-lisp/regexp-opt.el b/lisp/emacs-lisp/regexp-opt.el index 3edaca78e32..d24f6f9a6e5 100644 --- a/lisp/emacs-lisp/regexp-opt.el +++ b/lisp/emacs-lisp/regexp-opt.el @@ -83,6 +83,14 @@ ;;; Code: +(defvar regexp-opt-bracket-open "\\(?:" + "The opening string used to group a subexpression, followed by \"\\)\". + +`regexp-opt' will use this when it needs to group a subexpression together. + +Dynamically binding this is useful to produce regular expressions which +are compatible with regexp engines which don't support \"\\(?:\".") + ;;;###autoload (defun regexp-opt (strings &optional paren) "Return a regexp to match a string in the list STRINGS. @@ -143,7 +151,7 @@ regexp-opt (delete-dups (sort strings)) (or open t) (not open)) ;; No strings: return an unmatchable regexp. - (concat (or open "\\(?:") regexp-unmatchable "\\)")))) + (concat (or open regexp-opt-bracket-open) regexp-unmatchable "\\)")))) (cond ((eq paren 'words) (concat "\\<" re "\\>")) ((eq paren 'symbols) @@ -185,7 +193,7 @@ regexp-opt-group ;; Also we delay the addition of grouping parenthesis as long as possible ;; until we're sure we need them, and try to remove one-character sequences ;; so we can use character sets rather than grouping parenthesis. - (let* ((open-group (cond ((stringp paren) paren) (paren "\\(?:") (t ""))) + (let* ((open-group (cond ((stringp paren) paren) (paren regexp-opt-bracket-open) (t ""))) (close-group (if paren "\\)" "")) (open-charset (if lax "" open-group)) (close-charset (if lax "" close-group))) diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el index 58f95c7d89a..51f2e45f1ea 100644 --- a/lisp/emacs-lisp/rx.el +++ b/lisp/emacs-lisp/rx.el @@ -223,8 +223,16 @@ rx--enclose "Bracket REXP by LEFT-STR and RIGHT-STR." (append (list left-str) rexp (list right-str))) +(defvar rx-bracket-open "\\(?:" + "The opening string used to group a subexpression, followed by \"\\)\". + +`rx' will use this when it needs to group a subexpression together. + +Dynamically binding this is useful to produce regular expressions which +are compatible with regexp engines which don't support \"\\(?:\".") + (defun rx--bracket (rexp) - (rx--enclose "\\(?:" rexp "\\)")) + (rx--enclose rx-bracket-open rexp "\\)")) (defun rx--sequence (left right) "Return the sequence (concatenation) of two translated items, @@ -490,8 +498,9 @@ rx--translate-or (let ((args (mapcar #'rx--normalise-char-pattern body))) (if (rx--all-string-branches-p args) ;; All branches are strings: use `regexp-opt'. - (cons (list (regexp-opt (rx--collect-or-strings args) nil)) - t) + (let ((regexp-opt-bracket-open rx-bracket-open)) + (cons (list (regexp-opt (rx--collect-or-strings args) nil)) + t)) (let ((form (rx--optimise-or-args args))) (if (eq (car-safe form) 'or) (let ((branches (cdr form))) diff --git a/lisp/progmodes/grep.el b/lisp/progmodes/grep.el index f91cd2cc400..3e6bf6b542e 100644 --- a/lisp/progmodes/grep.el +++ b/lisp/progmodes/grep.el @@ -270,6 +270,17 @@ grep-find-ignored-files :type '(choice (repeat :tag "Ignored file" string) (const :tag "No ignored files" nil))) +(defcustom grep-find-optimize-matching t + "If non-nil, optimize find arguments when building `rgrep' commands. + +Instead of passing individual \"-path\" and \"-name\" arguments, +construct a single \"-regex\" argument for each, which substantially +improves performance. + +This affects ignores from `grep-find-ignored-directories' and +`grep-find-ignored-files' and the globs passed as FILES to `rgrep'." + :type 'boolean) + (defcustom grep-save-buffers 'ask "If non-nil, save buffers before running the grep commands. If `ask', ask before saving. If a function, call it with no arguments @@ -1490,28 +1501,85 @@ rgrep-find-ignored-directories "Return the list of ignored directories applicable to DIR." (grep--filter-list-by-dir grep-find-ignored-directories dir)) +(defun grep--globs-to-rx-form (globs) + "Return an `rx' form which matches any of GLOBS. + +Elements of GLOBS which are just literal strings or are of the form +\"*string\" are grouped into a separate `or' form, so they'll be +optimized by `rx' (which calls `regexp-opt')." + (require 'dired) + (declare-function dired-glob-regexp "dired" (pattern &optional noanchor)) + (let (rx-forms) + (let (literals suffixes) + (dolist (glob globs) + (cond + ((not (string-match "[[?*]" glob 0)) + ;; The glob is just a literal string. + (push glob literals)) + ((and + (not (string-empty-p glob)) + (not (string-match "[[?*]" glob 1)) + (eq (aref glob 0) ?*)) + ;; The glob is * followed by a literal string. This handles the + ;; elements of `grep-find-ignored-files' which come from + ;; `completion-ignored-extensions'. + (push (substring glob 1) suffixes)) + (t + ;; The glob is something else more complex. + (push `(regexp ,(dired-glob-regexp glob 'noanchor)) rx-forms)))) + (when literals + (push `(or . ,literals) rx-forms)) + (when suffixes + (push `(seq (* any) (or . ,suffixes)) rx-forms))) + `(or . ,rx-forms))) + +(defun grep--globs-to-find-argument (globs arg) + "Convert GLOBS into a series of arguments to \"find\". + +ARG is put in front of each glob and the arguments are concatenated +into a string and escaped for the shell. + +If `grep-find-optimize-matching' is non-nil and ARG is a known valid +find argument, then instead return a single \"-regex\" or \"-iregex\" +argument which matches all of GLOBS." + (if-let* ((grep-find-optimize-matching) + (known-arg (assoc arg '(("-path" "-regex" 'path) + ("-name" "-regex" 'name) + ("-ipath" "-iregex" 'path) + ("-iname" "-iregex" 'name)))) + ((> (length globs) 1))) + (let* ((globs-form (grep--globs-to-rx-form globs)) + (form (if (eq (nth 2 known-arg) 'path) + ;; Globs match the whole filename + globs-form + ;; Globs match the basename + `(seq (* any) "/" ,globs-form string-end)))) + (concat + (nth 1 known-arg) " " + (shell-quote-argument + ;; Make the produced regexp find-compatible. + (let ((rx-bracket-open "\\(")) + (rx-to-string form 'nogroup)) + grep-quoting-style))) + (concat arg " " + (mapconcat + (lambda (ignore) (shell-quote-argument ignore grep-quoting-style)) + globs + (concat " -o " arg " "))))) + (defun rgrep-default-command (regexp files dir) "Compute the command for \\[rgrep] to use by default." (require 'find-dired) ; for `find-name-arg' (let ((ignored-files-arg (when-let* ((ignored-files (grep-find-ignored-files dir))) (concat (shell-quote-argument "(" grep-quoting-style) - ;; we should use shell-quote-argument here - " -name " - (mapconcat - (lambda (ignore) (shell-quote-argument ignore grep-quoting-style)) - ignored-files - " -o -name ") + " " (grep--globs-to-find-argument ignored-files "-name") " " (shell-quote-argument ")" grep-quoting-style))))) (grep-expand-template grep-find-template regexp (concat (shell-quote-argument "(" grep-quoting-style) - " " find-name-arg " " - (mapconcat - (lambda (x) (shell-quote-argument x grep-quoting-style)) - (split-string files) - (concat " -o " find-name-arg " ")) + " " (grep--globs-to-find-argument (split-string files) find-name-arg) " " (shell-quote-argument ")" grep-quoting-style) (when (and rgrep-find-ignores-in-<f> ignored-files-arg) @@ -1521,13 +1589,7 @@ rgrep-default-command (when-let* ((ignored-dirs (rgrep-find-ignored-directories dir))) (concat "-type d " (shell-quote-argument "(" grep-quoting-style) - ;; we should use shell-quote-argument here - " -path " - (mapconcat - (lambda (d) - (shell-quote-argument (concat "*/" d) grep-quoting-style)) - ignored-dirs - " -o -path ") + " " (grep--globs-to-find-argument ignored-dirs "-path") " " (shell-quote-argument ")" grep-quoting-style) " -prune -o ")) -- 2.43.7
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.