GNU bug report logs - #63225
Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)

Previous Next

Package: emacs;

Reported by: Ihor Radchenko <yantar92 <at> posteo.net>

Date: Tue, 2 May 2023 07:35:02 UTC

Severity: normal

Tags: patch

Full log


Message #38 received at 63225 <at> debbugs.gnu.org (full text, mbox):

From: Ihor Radchenko <yantar92 <at> posteo.net>
To: Mattias EngdegÄrd <mattias.engdegard <at> gmail.com>
Cc: 63225 <at> debbugs.gnu.org
Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in
 search.c)
Date: Wed, 03 May 2023 09:36:01 +0000
Mattias EngdegÄrd <mattias.engdegard <at> gmail.com> writes:

>> I tried this, and it did not give any noticeable improvement.
>> Most likely, because the actual `cond' is
>> 
>>  (cond ((looking-at "foo) (<parser function calling many more regex searches>)) ...)
>
> I see, so it doesn't run through all top-level cases very often then? I thought that would be the common path (plain text).

You are indeed right. Top-level cases are ran very often. So, what I
said does not make much sense.

Yet, in my tests, I am unable to see any improvement when I consolidate
the regexps.

If I do (progn (set-regexp-cache-size 50) (org-element-parse-buffer) nil)

Without consolidation, but using `looking-at-p' as much as possible:

Profiler top
        ;; 4160  21% + org-element--current-element
        ;; 2100  10% + org-element--parse-elements
        ;; 1894   9% + org-element--parse-objects
        ;; 1422   7%   Automatic GC
        ;;  871   4% + org-element--headline-deferred
        ;;  806   4% + apply
        ;;  796   4% + org-element-create
        ;;  638   3% + org-element--list-struct

Perf top
    ;; 16.72%  emacs         emacs                                  [.] re_match_2_internal
    ;;  7.16%  emacs         emacs                                  [.] exec_byte_code
    ;;  4.08%  emacs         emacs                                  [.] funcall_subr
    ;;  4.06%  emacs         emacs                                  [.] re_search_2

With consolidation into a giant rx (or ...) with groups:

        ;; 4158  21% + org-element--current-element
        ;; 2163  11% + org-element--parse-objects
        ;; 1796   9% + org-element--parse-elements
        ;; 1276   6%   Automatic GC
        ;;  921   4% + org-element--headline-deferred
        ;;  833   4% + apply
        ;;  793   4% + org-element-create
        ;;  660   3% + org-element--list-struct

    ;; 16.44%  emacs         emacs                                  [.] re_match_2_internal
    ;;  7.03%  emacs         emacs                                  [.] exec_byte_code
    ;;  6.78%  emacs         emacs                                  [.] process_mark_stack
    ;;  4.05%  emacs         emacs                                  [.] re_search_2
    ;;  4.02%  emacs         emacs                                  [.] funcall_subr

The version with giant single rx form is actually slower overall (!),
making no difference at all in `org-element--current-element'.

> Perhaps you just don't see much improvement until the working set of regexps fits in the cache.

As you see, I now increased cache size to 50. No improvement. Same with
my observations on current master.

> The regexp compiler doesn't do much optimisation in order to keep the translation fast. It doesn't even convert "[a]" to "a".

I guess that it is another thing that could be improved if we were to
have compiled regexp objects. Compilation time would not matter as much.

Ideally, the compiler should do something similar to
what https://www.colm.net/open-source/ragel/ does.

>> Or, alternatively, the parsed regexps can be attached to string objects
>> internally. Then, regexp cache lookup will degenerate to looking into a
>> string object slot.
>
> That would work too but we really don't want to make our strings any fancier, they are already much too big and slow.

Then, what about making compiled regexp object similar to string, but
with plist slot replaced by compiled regexp slot? Maybe some other slots
removed (I am not very familiar with specific of internal string
representation)

AFAIU, compiled regexp read/write syntax can be uniquely represented
simply by a string. Something like #r"[a-z]+" (maybe even with special
handling for backslashes, like proposed in
https://yhetil.org/emacs-devel/4209edd83cfee7c84b2d75ebfcd38784fa21b23c.camel <at> crossproduct.net)

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>




This bug report was last modified 2 years and 37 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.