GNU bug report logs - #37659: rx additions: anychar, unmatchable, unordered-or
Reported by: Mattias Engdegård <mattiase <at> acm.org>
Date: Tue, 8 Oct 2019 09:37:01 UTC
Severity: wishlist
Tags: fixed, patch
Fixed in version 27.1
Done: Mattias Engdegård <mattiase <at> acm.org>
Bug is archived. No further changes may be made.
Message #34 received at 37659 <at> debbugs.gnu.org (full text, mbox):
On 22 Oct 2019, at 19:33, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>> Thus, instead of 'unordered-or', define the operator in terms of long matches: 'or-max' (working name) would work like 'or' but guarantee a longest match, and only permit strings and 'or-max' forms as arguments.
>
> That's an odd restriction. I'm not sure it's a good idea to add an operator with such a restriction. That is, I know why the restriction is there (it's because of limitations in the Emacs regexp matcher), but it's not clear that users should have to know and understand these details.
The restriction is simple and easy to document. It is not necessary to know the underlying reason for it in order to use the construct effectively.
> Moreover, if greed is the longstanding tradition for regexp-opt, shouldn't plain "or" be greedy, to be consistent with other operators?
Yes, I very much favour switching to a DFA engine; is there another way? Even then a backtracking engine would be needed for backrefs and other messy cases. However, that's a completely different amount of work. (Meanwhile, we have 'posix-string-match' etc for those who want greed at any cost.)
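To illustrate the difference between the default ordered alternation and the POSIX longest-match functions mentioned above, here is a minimal sketch. The helper name `my-match-end' is hypothetical and exists only for this example.

```elisp
(defun my-match-end (matcher regexp string)
  "Return the end position of the match MATCHER finds for REGEXP in STRING.
MATCHER is a function such as `string-match' or `posix-string-match'.
Return nil if there is no match."
  (save-match-data
    (when (funcall matcher regexp string)
      (match-end 0))))

;; The backtracking matcher takes the first alternative that succeeds.
(my-match-end #'string-match "a\\|ab" "ab")        ; => 1

;; The POSIX variant keeps searching for the longest match.
(my-match-end #'posix-string-match "a\\|ab" "ab")  ; => 2
```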
The problem that I'm trying to solve here is: how do we make it easy to match one of multiple strings --- keywords, say --- in rx? Currently, the answer is something like (regexp (regexp-opt my-keywords)), which doesn't integrate well with rx user definitions. In addition, the output of one regexp-opt cannot be used as input to another.
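As a rough sketch of the current workaround and of the composition problem described above: the variable names below are hypothetical, and `rx-to-string' with a backquoted form is used here only as one way to splice a computed regexp into rx.

```elisp
(require 'rx)

;; Hypothetical keyword list for illustration.
(defvar my-keywords '("if" "else" "while"))

;; Workaround: build the alternation with `regexp-opt' and splice the
;; resulting string into an rx form via the `regexp' construct.
(defvar my-keyword-re
  (rx-to-string `(seq symbol-start
                      (regexp ,(regexp-opt my-keywords))
                      symbol-end)))

;; `regexp-opt' treats its arguments as literal strings, so feeding it
;; the output of another `regexp-opt' matches the regexp text itself,
;; not the strings that regexp was built from.
(defvar my-nested-re
  (regexp-opt (list (regexp-opt '("carrot" "tomato")) "beef")))

(string-match-p my-keyword-re "x else y")  ; => 2
(string-match-p my-nested-re "carrot")     ; => nil
(string-match-p my-nested-re "beef")       ; => 0
```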
'or-max' would allow a user to say
(rx-define veggies (or-max "carrot" "tomato" "cucumber"))
(rx-define meats (or-max "beef" "chicken" "pork"))
... (rx (or-max veggies meats)) ...
and get a regexp that is guaranteed to be greedy, well-optimised as if all strings were passed to 'regexp-opt' at once, and robust: a small change won't change the behaviour radically, and the user won't have to game or second-guess the engine in order to produce the desired result.
If, in the future, 'or' becomes greedy, then 'or-max' will just be a synonym.
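The intended effect can be approximated today without 'or-max', which is only a working name in this proposal. The sketch below assumes plain string lists standing in for the 'veggies' and 'meats' definitions and passes them to a single 'regexp-opt' call; all `my-' names are hypothetical.

```elisp
;; Assumption: plain lists stand in for the proposed `or-max' forms.
(defvar my-veggies '("carrot" "tomato" "cucumber"))
(defvar my-meats '("beef" "chicken" "pork"))

;; One `regexp-opt' call over all strings at once.
(defvar my-food-re (regexp-opt (append my-veggies my-meats)))

(defun my-match-text (regexp string)
  "Return the text of the first match of REGEXP in STRING, or nil."
  (save-match-data
    (when (string-match regexp string)
      (match-string 0 string))))

(my-match-text my-food-re "roast chicken")  ; => "chicken"

;; Ordered alternation stops at the first alternative that matches...
(my-match-text "in\\|int\\|integer" "integer")                   ; => "in"
;; ...while `regexp-opt' without KEEP-ORDER prefers the longest string.
(my-match-text (regexp-opt '("in" "int" "integer")) "integer")   ; => "integer"
```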
> If it's too much trouble to make plain "or" greedy, I suggest just documenting it as possibly being greedy and possibly not (that is, document it as being unordered, even if it happens to be ordered now). This will give us more opportunity for optimization later.
That would make rx strictly less useful than string regexps. That is why 'unordered-or' was a mistake: the unpredictability made it useless in many cases, and everyone would just have used regexp-opt (or skipped rx altogether).
It is desirable to have the same semantics for 'or' in rx and \| in string regexps; otherwise, translating and understanding become unnecessarily difficult.
We could say that 'or' and \| either match greedily or in left-to-right order. However, I'm not sure this solves any problem right now.
This bug report was last modified 5 years and 81 days ago.