GNU bug report logs - #64128
regexp parser zero-width assertion bugs

Package: emacs;

Reported by: Mattias Engdegård <mattias.engdegard <at> gmail.com>

Date: Sat, 17 Jun 2023 12:21:02 UTC

Severity: normal


Message #41 received at 64128 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Eli Zaretskii <eliz <at> gnu.org>, Stefan Monnier <monnier <at> iro.umontreal.ca>,
 64128 <at> debbugs.gnu.org
Subject: Re: bug#64128: regexp parser zero-width assertion bugs
Date: Mon, 19 Jun 2023 21:52:40 +0200

On 19 June 2023 at 21:21, Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> If I understand things correctly, this would cause "\b*c" to be treated like "\b\*c".

Actually, it already works that way. What the patch does is prevent AB\b*C from being parsed as \(?:AB\b\)*C; it is parsed as AB\b\*C instead, which I think we can all agree is less wrong.

You can check the test cases in the patch:

  (should (equal (string-match "q\\b*!" "q*!") 0))
  (should (equal (string-match "q\\b*!" "!") nil))

In current Emacs, without the patch, these two calls return 2 and 0 respectively.
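
To see where the two pairs of results come from, both candidate parses can be written out explicitly so that nothing is left to the parser. This is only an illustrative sketch: the function name `bug64128-demo-parses' is made up for this message and is not part of the patch, and it assumes the standard syntax table, in which `q' is a word constituent and `*' and `!' are not.

```elisp
;; Hypothetical helper, not part of the patch: evaluate both explicit
;; readings of "q\\b*!" against a subject string.
(defun bug64128-demo-parses (string)
  "Return a list (LITERAL GROUPED) of match positions in STRING.
LITERAL treats the star as an ordinary character after the word
boundary.  GROUPED applies the star to the preceding character and
word boundary taken together as one group."
  ;; Assumption: use the standard syntax table so that the word
  ;; boundary does not depend on the current buffer's major mode.
  (with-syntax-table (standard-syntax-table)
    (list (string-match "q\\b\\*!" string)          ; q\b\*!
          (string-match "\\(?:q\\b\\)*!" string)))) ; \(?:q\b\)*!

(bug64128-demo-parses "q*!")   ; => (0 2)
(bug64128-demo-parses "!")     ; => (nil 0)
```

The first element of each result corresponds to the values the test cases expect, and the second to the values reported for current Emacs.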

> It's long been documented that the only reason "*" is ordinary at the start of a regular expression or subexpression is "historical compatibility", and it's also long been documented that you shouldn't take advantage of this and you should backslash-escape the "*" anyway. In contrast, for constructs like \b* there is not a historical compatibility reason, so there's not a good argument for treating "*" as an ordinary character after "\b".

Sure, we can turn \b and \B into group B assertions, but the patch was more conservative in nature.
We also have \` to consider -- I think we have to keep \`* meaning \`\* for compatibility, historical or not, because it's something we keep seeing in the wild.
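
For code that wants a literal star at the very start of the text, the escaped spelling does not depend on how the parser treats `*' after \`. A minimal sketch, with a function name invented purely for illustration:

```elisp
;; Hypothetical helper for illustration only.
(defun bug64128-demo-leading-star-p (string)
  "Return t if STRING begins with a literal star character."
  ;; "\\`\\*" spells the star out explicitly.  Per the discussion,
  ;; "\\`*" is meant to keep the same meaning, but the escaped form
  ;; does not rely on that compatibility rule.
  (and (string-match-p "\\`\\*" string) t))

(bug64128-demo-leading-star-p "*scratch*")  ; => t
(bug64128-demo-leading-star-p "a*")         ; => nil
```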

> Instead, \b should not be a special case before "*", and \b* should be equivalent to \(\b\)* and should match only the empty string. Similarly for the other zero-width backslash escapes. This is what I would expect from these constructs from the longstanding documentation.
> 
> If we instead added a rule to say that a construct that can only match the empty string causes a following "*" to be ordinary, then \b* and \(\b\)* would both be equivalent to \*. Although consistent, this would be confusing: it would compound the historical-compatibility mistake. Let's keep things simple instead.

Yes, I definitely would be confused by such semantics.

> Also, whatever change we make to the behavior should be documented in the manual and in etc/NEWS.

I will be happy to oblige, although in this case it really was just a bug fix.

What I really would like to see is the regexp parser somehow separated from the NFA bytecode generator, which would make both clearer. The parser could then be re-used for other purposes such as a different back-end (DFA construction) or a built-in xr-like converter.






This bug report was last modified 2 years and 2 days ago.
