GNU bug report logs - #64128
regexp parser zero-width assertion bugs


Package: emacs

Reported by: Mattias Engdegård <mattias.engdegard <at> gmail.com>

Date: Sat, 17 Jun 2023 12:21:02 UTC

Severity: normal



Message #38 received at 64128 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>,
 Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 64128 <at> debbugs.gnu.org
Subject: Re: bug#64128: regexp parser zero-width assertion bugs
Date: Mon, 19 Jun 2023 12:21:50 -0700

On 2023-06-19 11:34, Mattias Engdegård wrote:
> Here is a reduced patch that only fixes the really silly behaviour reported earlier, by making sure that `laststart` is reset correctly for all group A assertions. This should be uncontroversial.
> Maybe we should change group B assertions so that they work in the same way.

> -     operand.  Reset at the beginning of groups and alternatives.  */
> +     operand.  Reset at the beginning of groups and alternatives,
> +     and after zero-width assertions which should not be the target
> +     of any postfix repetition operators.  */

If I understand things correctly, this would cause "\b*c" to be treated 
like "\b\*c". If so, it's headed in the wrong direction.

It's long been documented that the only reason "*" is ordinary at the 
start of a regular expression or subexpression is "historical 
compatibility", and it's also long been documented that you shouldn't 
take advantage of this and you should backslash-escape the "*" anyway. 
In contrast, for constructs like \b* there is not a historical 
compatibility reason, so there's not a good argument for treating "*" as 
an ordinary character after "\b".

Instead, \b should not be a special case before "*", and \b* should be 
equivalent to \(\b\)* and should match only the empty string. Similarly 
for the other zero-width backslash escapes. This is what I would expect 
from these constructs from the longstanding documentation.
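
As a rough illustration of the two readings of "\b*c" discussed above, 
here is a toy classifier in Python. It is not the regex-emacs.c parser: 
the names `tokenize`, `star_roles`, `have_operand` and the two policy 
labels are hypothetical, `have_operand` is only a loose analogue of 
`laststart`, and the sketch ignores bracket expressions, "^", "$", 
"\(?:" and the three-character escapes.

```python
# Toy model of how an unescaped "*" is classified in Emacs-style regexp
# syntax.  Illustrative only; this is NOT how regex-emacs.c is written.

# A sample of two-character zero-width escapes (assumed set for the sketch).
ZERO_WIDTH = {"\\b", "\\B", "\\<", "\\>", "\\`", "\\'", "\\="}


def tokenize(pattern):
    """Split a pattern into tokens; a backslash plus the next char is one token."""
    tokens = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\" and i + 1 < len(pattern):
            tokens.append(pattern[i:i + 2])
            i += 2
        else:
            tokens.append(pattern[i])
            i += 1
    return tokens


def star_roles(pattern, policy):
    """Return 'operator' or 'literal' for each unescaped "*" in PATTERN.

    policy == "literal": a zero-width assertion clears the operand, so a
                         following "*" is an ordinary character
                         (the reading where \\b*c acts like \\b\\*c).
    policy == "repeat":  a zero-width assertion is an operand like any
                         other, so a following "*" repeats it
                         (the reading where \\b* acts like \\(\\b\\)*).
    """
    roles = []
    have_operand = False  # nothing to repeat at the start of the regexp
    for tok in tokenize(pattern):
        if tok == "*":
            roles.append("operator" if have_operand else "literal")
            have_operand = True  # a literal or applied "*" leaves an operand
        elif tok in ("\\(", "\\|"):
            have_operand = False  # start of a group or alternative
        elif tok in ZERO_WIDTH:
            have_operand = (policy == "repeat")
        else:
            have_operand = True  # ordinary character, "\)", escaped char, ...
    return roles


if __name__ == "__main__":
    for policy in ("literal", "repeat"):
        print(policy, star_roles("\\b*c", policy))
```

Under the "literal" policy the "*" in "\b*c" is classified as an 
ordinary character, while under the "repeat" policy it is an operator; 
in both cases a leading "*" stays ordinary and the "*" in "\(\b\)*" is 
an operator.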

If we instead added a rule to say that a construct that can only match 
the empty string causes a following "*" to be ordinary, then \b* and \(\b\)* 
would both be equivalent to \*. Although consistent, this would be 
confusing: it would compound the historical-compatibility mistake. Let's 
keep things simple instead.

Also, whatever change we make to the behavior should be documented in 
the manual and in etc/NEWS.




This bug report was last modified 2 years and 2 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.