GNU bug report logs - #64128
regexp parser zero-width assertion bugs

Package: emacs;

Reported by: Mattias Engdegård <mattias.engdegard <at> gmail.com>

Date: Sat, 17 Jun 2023 12:21:02 UTC

Severity: normal

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Emacs Bug Report <bug-gnu-emacs <at> gnu.org>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, Stefan Monnier <monnier <at> iro.umontreal.ca>
Subject: regexp parser zero-width assertion bugs
Date: Sat, 17 Jun 2023 14:20:27 +0200

[Message part 1 (text/plain, inline)]

In Emacs regexps, some but not all zero-width assertions have the special property that they are not treated as an element by an immediately following ?, * or +. For example,

  \b*

matches a literal asterisk at a word boundary -- the `*` becomes literal because it is treated as if there were nothing for it to act upon. Even stranger:

  xy\b*

is parsed as, in rx syntax, (* "xy" word-boundary) which is remarkable: the repetition operator encompasses several elements even though there are no brackets given. Demo:

(and (string-match "quack,\\b*" "quack,quack,quack,quaaaack!")
     (match-data))
=> (0 18)
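
For comparison, the same grouping can be written out explicitly in rx, which does not depend on how the string parser treats an operator after \b. This is only a sketch; the constant name is made up for illustration.

```elisp
(require 'rx)

;; Hypothetical name, used only for this illustration.
(defconst my-grouped-quack-regexp
  (rx (* "quack," word-boundary))
  "Zero or more repetitions of \"quack,\" followed by a word boundary.")

;; Three "quack," units match; the fourth attempt fails on "quaaaack!".
(and (string-match my-grouped-quack-regexp "quack,quack,quack,quaaaack!")
     (match-data))
;; => (0 18)
```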

Zero-width assertions that have the property:
^ (bol), $ (eol), \` (bos), \' (eos), \b (word-boundary), \B (not-word-boundary)

Zero-width assertions that do not have the property (and are treated as any other element):
\< (bow), \> (eow), \_< (symbol-start), \_> (symbol-end), \= (point)
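
One rough way to check which group an assertion falls into on a given Emacs build is sketched below. The helper name is hypothetical. It relies on a single observation: when the prefix is one element and the following * acts as a repetition operator, the regexp matches the empty string with zero repetitions, whereas a literal asterisk cannot. The anchors ^ and $ are left out here because their special meaning is itself context-dependent. Results may differ between Emacs versions, so none are listed.

```elisp
;; Hypothetical helper, not part of Emacs or of the attached patch.
(defun my-star-literal-after-p (prefix)
  "Return non-nil if `*' is matched literally when it follows PREFIX.
PREFIX is a regexp fragment consisting of a single element.  If the
asterisk repeats that element, the regexp matches the empty string
with zero repetitions; if the asterisk is literal, it cannot."
  (condition-case nil
      (not (string-match-p (concat prefix "*") ""))
    ;; Treat a rejected regexp as "not literal" rather than signalling.
    (invalid-regexp nil)))

;; Probe the backslash assertions listed above on the running Emacs.
(mapcar (lambda (a) (cons a (my-star-literal-after-p a)))
        '("\\`" "\\'" "\\b" "\\B" "\\<" "\\>" "\\_<" "\\_>" "\\="))
```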

Such regexp patterns should be very rare in practice, and any occurrence is almost certainly a mistake, but it would be nice if they behaved in a way that makes some kind of sense.

A modest improvement would be to make operators become literal after any zero-width assertion, so that

  \<*

becomes (: word-start "*") instead of (* word-start), and

  xy\b*

becomes (: "xy" word-boundary "*") instead of (* "xy" word-boundary).
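
As an illustration of what the proposed reading would match, the rx form above can be evaluated directly. This is a sketch with a made-up helper name; it exercises the rx form only and says nothing about how the attached patch is implemented.

```elisp
(require 'rx)

;; Hypothetical helper name; the rx form is the one given in the report.
(defun my-proposed-xy-match (string)
  "Match STRING against (: \"xy\" word-boundary \"*\") and return match data."
  (and (string-match (rx (: "xy" word-boundary "*")) string)
       (match-data)))

(my-proposed-xy-match "xy*")   ; => (0 3), the asterisk is matched literally
(my-proposed-xy-match "xyxy")  ; => nil, no literal asterisk in the text
```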

Suggested patch attached.

[regexp-zero-width-assertion-bug.diff (application/octet-stream, attachment)]

This bug report was last modified 2 years and 56 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
