Mattias Engdegård Fri, 16 May 2025 15:46:40 +0200 writes

> And my sincerest thanks to you for noticing this. Everyone who writes technical 
> texts knows how valuable people who actually work through examples are.

I encountered this problem while trying to write a simple CSS parser
using PEG. My first reaction was to look for regex examples in the
manual, and I’m glad I found it :p

> Your proposed solution,
> 
> > (rx "/*"
> >    (* (| (not "*")
> >          (: (1+ "*") (not (or "*" "/")))))
> >    (1+ "*") "/")
> 
> appears correct but Emacs's NFA engine will match a final run of stars twice. 
> Consider the text
> 
>   /*************************************/
> 
> The regexp will match all stars, encounter the final slash, backtrack and match 
> the stars again before matching the slash. A bit inelegant perhaps. More 
> seriously, the stack usage is such that it can't parse a 1 MB comment without 
> running out of stack space (on my machine). To be fair, he original regexp had 
> the same problem.
> 
> (And yes, non-greedy operators can be used for a simple solution but as the 
> footnote in the text says that's not the point here.)
> 
> 
> 2. Better alternative?
> 
> (rx "/*"
>     (* (not "*"))
>     (+ "*")
>     (* (not (in "*/"))
>        (* (not "*"))
>        (+ "*"))
>     "/")
> 
> is slightly more complicated but doesn't backtrack as much.
> It still produces unnecessary backtrack points between runs of stars; perhaps 
> the analysis to eliminate them is too hard for the compiler.

I'm not an expert in regular expressions, but it seems that cases like C
block comments are hard to handle without introducing
backtracking. Although it's unlikely to encounter such enormous and
specific comments in practice, the implementation I provided does
introduce some implicit performance issues.

During my search, I think I might have seen the improved version you
provided on Stack Overflow or somewhere else. It’s definitely better,
but also harder to understand.

> 2. But is it a good example?
> 
> The purpose was never parsing C comments but to provide an example of how rx 
> can help. Can we find something simpler?
> Here is regexp for a simple quoted string:
> 
> (rx ?\"
>     (* (or (not (or ?\\ ?\"))
>            (: ?\\ (or ?\\ ?\"))))
>     ?\")
> 
> Would that be a better example? The backslashes obscure things a bit.
> 
> Right now I'm leaning towards using the proposed fix.

Yes, I think this example is great, simpler examples are always
better. C comments are indeed a case where it’s hard to come up with a
solution right away. It would be better to present it as a demonstration
of how to think about and implement a regular expression, rather than
just a simple regex example.

But I’d still like to keep it in some places, since it can
actually be quite useful sometimes.

[ Of course, this isn’t really a problem nowadays — LLMs can quickly
lead us to solutions for some classic problems (though they occasionally
introduce some unexpected issues). ]

Regards