GNU bug report logs - #63225
Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)

Package: emacs;

Reported by: Ihor Radchenko <yantar92 <at> posteo.net>

Date: Tue, 2 May 2023 07:35:02 UTC

Severity: normal

Tags: patch

Message #59 received at 63225 <at> debbugs.gnu.org (full text, mbox):

From: Ihor Radchenko <yantar92 <at> posteo.net>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: 63225 <at> debbugs.gnu.org
Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in
 search.c)
Date: Fri, 05 May 2023 10:31:01 +0000

Mattias Engdegård <mattias.engdegard <at> gmail.com> writes:

>> What is the aim of instrumenting regexp engine in this scenario?
>> I already know that additional regexps will be tested by individual
>> `org-element-X-parser' functions.
>
> I got the impression that the 'spine' of the parser, the sequence of `looking-at` calls in `org-element--current-element`, would frequently be run through in its entirety which means that consolidating these would reduce the number of working regexps by about 20 (if I'm counting correctly).

Not exactly. The actual statistics is the following (of course, it is a
subject of the actual parsed file structure).

Below, I measured time spent in different branches of cond.
Note describes the cond type.

| Depth |  count |     Time, msec | Note                              | Avg time, μsec/count | Element type  |
|-------+--------+----------------+-----------------------------------+----------------------+---------------|
|     0 |  89592 |         31.094 | eq                                |           0.43315819 | item          |
|     1 |   1984 |           0.68 | eq                                |           0.45513710 | table row     |
|     2 | 206607 |          43.23 | eq                                |           0.24850257 | node-property |
|     3 |  72770 |         302.95 | looking-at-p, skip-chars          |            4.8545025 | headline      |
|     4 |  56000 |      39.190916 | memq                              |           0.69983779 | section       |
|     5 |   8231 |      26.129109 | looking-at-p, lookback            |            3.1744756 | planning      |
|     6 |  54852 |      503.97346 | looking-at-p, multiline, lookback |            9.1878776 | prop drawer   |
|     7 |  89510 |      78.514284 | bolp                              |           0.87715656 | paragraph     |
|     8 |  29610 |      79.589466 | looking-at-p                      |            2.6879252 | clock         |
|     9 |    231 |       1.644304 | eq                                |            7.1181991 | inlinetask    |
|    10 |      0 | tot: 1173 msec | eq                                |                  0/0 | affiliated    |
|-------+--------+----------------+-----------------------------------+----------------------+---------------|
|    11 |     30 |       0.060081 | looking-at-p                      |               2.0027 | latex env     |
|    12 |  45443 |      187.41703 | looking-at-p                      |            4.1242222 | drawer        |
|    13 |     21 |       0.255528 | looking-at-p                      |               12.168 | fixed width   |
|    14 |    967 |        6.67522 | looking-at                        |            6.9030196 | block         |
|    15 |     53 |       0.342144 | looking-at-p                      |            6.4555472 | call          |
|    16 |      0 |              0 | looking-at-p                      |                  0/0 | dynblock      |
|    17 |     29 |       0.361915 | looking-at-p                      |            12.479828 | keyword       |
|    18 |      0 |              0 | eq                                |                  0/0 | paragraph     |
|    19 |      0 |              0 | looking-at-p                      |                  0/0 | footnote def  |
|    20 |      0 |              0 | looking-at-p                      |                  0/0 | rule          |
|    21 |      0 |              0 | looking-at-p                      |                  0/0 | diary sexp    |
|-------+--------+----------------+-----------------------------------+----------------------+---------------|
|    22 |     66 |       0.752823 | looking-at-p, re-search-forward   |            11.406409 | table         |
|    23 |  41509 |      303.39472 | looking-at-p                      |            7.3091310 | item          |
|    24 |   5340 |      41.188231 | t                                 |            7.7131519 | paragraph     |
|       |        | tot: 1713 msec |                                   |                      |               |

If I try to group regexps into one giant rx form and then compare time
spend in different cond branches, I get the following.
(I grouped the regexps between horizontal rules in the above table)

I tried two different files: (1) notes.org that is heavy on headlines;
(2) org-manual that is heavy on actual text.

Grouping with rx gives no noticeable impact. 

| Depth |  Avg time, μs | Avg time, μs |   Avg time, μs | Avg time, μs |
|       | (notes+no rx) |   (notes+rx) | (manual+no rx) |  (manual+rx) |
|-------+---------------+--------------+----------------+--------------|
|     0 |    0.34576248 |   0.35948186 |     0.43996679 |   0.44675874 |
|     1 |    0.35749752 |   0.37239325 |     0.44559585 |   0.43868739 |
|     2 |    0.18958309 |   0.20197035 |     0.29960921 |   0.29960921 |
|     3 |     4.1282904 |    4.2407582 |      4.4482968 |    4.4711219 |
|     4 |    0.61503580 |   0.59914459 |     0.64377158 |   0.63460540 |
|     5 |    0.88028169 |   0.83916084 |      1.2820513 |    1.2820513 |
|     6 |     2.6515244 |    2.6348024 |      2.6795055 |    2.7579648 |
|     7 |     7.8175124 |    7.8262918 |      7.1999256 |    7.1996154 |
|     8 |    0.75458424 |   0.75368242 |     0.70958084 |   0.72455090 |
|     9 |     2.1446653 |    2.1466905 |            10. |          10. |
|    10 |     5.2813853 |    5.2813853 |      5.4761905 |    6.5476190 |
|    11 |          0./0 |         0./0 |           0./0 |         0./0 |
|    12 |            2. |    2.3333333 |           0./0 |         0./0 |
|    13 |     3.5030250 |    4.6581886 |      4.0623783 |    5.8718692 |
|    14 |     11.428571 |    10.952381 |      2.6970634 |    3.3307573 |
|    15 |     5.6508264 |    4.6177686 |      5.1308629 |    4.3741902 |
|    16 |     6.2264151 |    4.1509434 |           0./0 |         0./0 |
|    17 |          0./0 |         0./0 |           0./0 |         0./0 |
|    18 |     10.689655 |          12. |      5.7134386 |    3.7413831 |
|    19 |          0./0 |         0./0 |           0./0 |         0./0 |
|    20 |          0./0 |         0./0 |      2.8888889 |    2.9444444 |
|    21 |          0./0 |         0./0 |           0./0 |         0./0 |
|    22 |          0./0 |         0./0 |           0./0 |         0./0 |
|    23 |     10.746269 |    9.4029851 |      6.2695313 |    6.1328125 |
|    24 |     6.4371193 |    6.2419339 |      6.0138782 |    5.8558211 |
|    25 |     6.4154824 |    6.3855647 |      4.9707695 |    4.7727182 |
|    26 |          0./0 |         0./0 |           0./0 |         0./0 |

> Now if as you suggest the parsing is dominated by sequences of regexps in the branches, it prompts the questions: which branches, what regexps, why are there so many of them, and is there anything that can be done to reduce their number?

Oh. No. The parsing is dominated by `org-element--current-element'. I
can clearly see it because the profiler hits
`org-element--current-element', not the branches.

I just had no idea what to make of your suggestion about

    Run on a reduced dataset, and see if the sequence of regexps being
    exercised, and their frequencies, are consistent with what you
    expect.

Also, my testing showed that

(looking-at
  (rx
   (or
    (group-n 1 (regexp org-element--latex-begin-environment))
    (group-n 2 (regexp org-element-drawer-re))
    (group-n 3 (regexp "[ \t]*:\\( \\|$\\)"))
    (group-n 7 (regexp org-element-dynamic-block-open-re))
    (seq (group-n 4 (regexp "[ \t]*#\\+"))
         (or
          (seq "BEGIN_" (group-n 5 (1+ (not space))))
          (group-n 6 "CALL:")
          (group-n 8 (1+ (not space)) ":")
          ))
    (group-n 9 (regexp org-footnote-definition-re))
    (group-n 10 (regexp "[ \t]*-\\{5,\\}[ \t]*$"))
    (group-n 11 "%%("))))

is actually slightly slower overall compared to a series of `looking-at-p'.
AFAIU, because the `looking-at' needs to allocate match-data vector for
all these 11 groups, which leads to
;;  6.78%  emacs         emacs                                  [.] process_mark_stack
floating up in the perf top.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>

This bug report was last modified 2 years and 135 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #63225 Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)

GNU bug report logs - #63225
Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)