GNU bug report logs - #64128: regexp parser zero-width assertion bugs
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
In Emacs regexps, some but not all zero-width assertions have the special property that they are not treated as an element by an immediately following ?, * or +. For example,
\b*
matches a literal asterisk at a word boundary -- the `*` becomes literal because it is treated as if there were nothing for it to act upon. Even stranger:
xy\b*
is parsed as (* "xy" word-boundary) in rx syntax, which is remarkable: the repetition operator encompasses several elements even though no brackets are given. Demo:
(and (string-match "quack,\\b*" "quack,quack,quack,quaaaack!")
     (match-data))
=> (0 18)
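To separate the two possible readings of the demo pattern, each can be spelled out explicitly so that the result does not depend on how the parser treats \b*. This is only a sketch: `my-match-span` and `my-demo-text` are hypothetical names, not part of Emacs, and the shy group \(?: ... \) writes out the grouping that the parser is described above as applying implicitly.

```elisp
;; Hypothetical helper: return the match data for REGEXP in STRING, or nil.
(defun my-match-span (regexp string)
  (and (string-match regexp string)
       (match-data)))

(defconst my-demo-text "quack,quack,quack,quaaaack!")

;; Explicit form of (* "quack," word-boundary): the whole group repeats.
(my-match-span "\\(?:quack,\\b\\)*" my-demo-text)
;; => (0 18)

;; Explicit form with a literal asterisk after the word boundary.
;; The text contains no asterisk, so there is no match.
(my-match-span "quack,\\b\\*" my-demo-text)
;; => nil
```

The first result coincides with the demo output above, which is consistent with the implicit-grouping parse described in this report.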
Zero-width assertions that have the property:
^ (bol), $ (eol), \` (bos), \' (eos), \b (word-boundary), \B (not-word-boundary)
Zero-width assertions that do not have the property (and are treated as any other element):
\< (bow), \> (eow), \_< (symbol-start), \_> (symbol-end), \= (point)
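The two lists above can be checked empirically. What follows is a rough sketch rather than a definitive test: the names `my-star-is-operator-p`, `my-assertions` and `my-star-report` are hypothetical, and only a subset of the assertions is probed. The idea is to match against a subject that contains no asterisk; a pattern of the form ASSERTION* can then only succeed if the `*` was parsed as an operator rather than as a literal character. The outcome depends on the Emacs version in use, so no expected output is shown.

```elisp
;; Hypothetical helper: non-nil if "*" placed after PREFIX is parsed as a
;; repetition operator.  The subject contains no asterisk, so the pattern
;; can only match if "*" is not a literal character.
(defun my-star-is-operator-p (prefix)
  (and (string-match-p (concat prefix "*") "ab cd") t))

;; A subset of the assertions listed above, written as Lisp strings.
(defconst my-assertions
  '("^" "\\`" "\\'" "\\b" "\\B" "\\<" "\\>" "\\_<" "\\_>"))

(defun my-star-report ()
  "Return an alist mapping each assertion to how a following \"*\" was parsed."
  (mapcar (lambda (a)
            (cons a (if (my-star-is-operator-p a) 'operator 'literal)))
          my-assertions))

;; Evaluate to see the classification in the running Emacs.
(my-star-report)
```

Matching against an asterisk-free subject avoids having to construct, for each assertion, a string in which the assertion holds immediately before a `*`.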
Such regexp patterns should be very rare in practice, and where they do occur they are almost certainly mistakes, but it would be nice if they behaved in a way that made some kind of sense.
A modest improvement would be to make operators become literal after any zero-width assertion, so that
\<*
becomes (: word-start "*") instead of (* word-start), and
xy\b*
becomes (: "xy" word-boundary "*") instead of (* "xy" word-boundary).
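For comparison, the proposed readings can be written directly in rx, which quotes string arguments so that the asterisk is unambiguously literal. This is a sketch for illustration only; the variable names are hypothetical.

```elisp
(require 'rx)

;; Proposed reading of \<* : word-start followed by a literal asterisk.
(defconst my-proposed-bow-star (rx word-start "*"))
;; my-proposed-bow-star => "\\<\\*"

;; Proposed reading of xy\b* : "xy", a word boundary, then a literal asterisk.
(defconst my-proposed-xy-b-star (rx "xy" word-boundary "*"))
;; my-proposed-xy-b-star => "xy\\b\\*"

;; With the asterisk literal, the second pattern needs an actual "*"
;; right after "xy" in the subject.
(string-match-p my-proposed-xy-b-star "xy*")
(string-match-p my-proposed-xy-b-star "xyxy")
```

Writing such patterns with rx, or with an explicit \* in string syntax, sidesteps the parsing question entirely.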
Suggested patch attached.
[regexp-zero-width-assertion-bug.diff (application/octet-stream, attachment)]
This bug report was last modified 2 years and 2 days ago.