GNU bug report logs - #63225
Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)

Previous Next

Package: emacs;

Reported by: Ihor Radchenko <yantar92 <at> posteo.net>

Date: Tue, 2 May 2023 07:35:02 UTC

Severity: normal

Tags: patch

Full log


View this message in rfc822 format

From: Ihor Radchenko <yantar92 <at> posteo.net>
To: Mattias EngdegÄrd <mattias.engdegard <at> gmail.com>
Cc: 63225 <at> debbugs.gnu.org
Subject: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
Date: Mon, 08 May 2023 19:38:59 +0000
Mattias EngdegÄrd <mattias.engdegard <at> gmail.com> writes:

>> 		 (save-excursion
>> 		   (beginning-of-line 0)
>> 		   (not (looking-at-p "[[:blank:]]*$"))))
>
> I wonder if that last part isn't better written as
>
>   (save-excursion
>     (forward-line 0)   ; faster than beginning-of-line

Fair point. It is probably worth looking through Org sources and
replacing all those `beginning-of-line' and `end-of-line' calls. I doubt
that we even intend to use fields for real.

>     (skip-chars-forward "[:blank:]") ; faster than looking-at-p
>     (not (eolp)))   ; very cheap

Hmm. I feel confused.
Does this imply that simple regexps like

(looking-at-p (rx (seq bol (zero-or-more (any "\t ")) "#" (or " " eol))))

should better be implemented as the following?

(and (bolp)
     (skip-chars-forward " \t")
     (eq (char-after) ?#) (forward-char)
     (or (eolp) (eq (char-after) ?\s)))

(I now start thinking that it might be more efficient to create a bunch
or char tables and step over them to match "regexp", just like any
finite automaton would)

> But yes, I sort of understand what you are getting at (except the business with the MODE parameter which is still a bit mysterious to me).

MODE parameter is used because we constrain what kinds of syntax
elements are allowed inside other. For example, see
`org-element-object-restrictions'. And within
`org-element--current-element', MODE is used, for example, to constrain
planning/property drawer to be only the first child of a parent heading,
parsed earlier.

>> [now part of the giant rx]
>> 
>> (rx line-start (0+ (any ?\s ?\t))
>>      ":" (1+ (any ?- ?_ word)) ":"
>>      (0+ (any ?\s ?\t)) line-end)
>
> Any reason you don't capture the part between the colons here, so that you don't need to match it later on?

That would demand all the callers of `org-element-drawer-parser' to set
match data appropriately. Which is doable, but headache for maintenance.
We sometimes do call parsers explicitly not from inside
`org-element--current-element'.

>> But why? Aren't (in word ?_ ?-) and (or word ?_ ?-) not the same?
>
> "[-_[:word:]]" and "\\w\\|[_-]" indeed match the same thing but they don't generate the same regexp bytecode -- the former is faster. (In this case rx makes a literal translation to those strings but we should probably make it optimise to the faster regexp.)
>
> There is a regexp disassembler for the really curious but it doesn't come with Emacs.

I really hope that I did not need to do all these workarounds specific to
current implementation pitfalls of Emacs regexp compiler.

>>> Maybe, but you still cons each time. (And remember that the plist-get equality funarg is new in Emacs 29.)
>> 
>> Sure it does.
>> It is just one of the variable parts of Org syntax that might be
>> changed. There are ways to make this into constant, but it is a fragile
>> area of the code that I do not want to touch without a reason.
>> (Especially given that I am not familiar with org-list.el)
>
> So it's fine to use elisp constructs new in Emacs 29 in Org? Then the line
>
>  ;; Package-Requires: ((emacs "26.1"))
>
> in org.el should probably be updated, right?

Nope. It is just that the link I shared is for WIP branch I am
developing using Emacs master. I will ensure backwards compatibility
later. For example, by converting `plist-get' to `assoc'.

>> I hope we do. If only Emacs had a way to define `case-fold-search' right
>> within the regexp itself.
>
> I would like that too, but changing that isn't easy.

I am sure that it is easy.
For example, regexp command may optionally accept a vector with first
element being regexp and second element setting a flag for
`case-fold-search'.

Of course, it is just one, maybe not the best way to implement this.

My suggestion in https://debbugs.gnu.org/cgi/bugreport.cgi?bug=63225#56
is also compatible with this kind of approach.

> By the way, it seems that org-element-node-property-parser binds case-fold-search without actually using it. Bug?

It actually does nothing given that `org-property-re' does not contain
letters. I will remove it.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>




This bug report was last modified 2 years and 37 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.