GNU bug report logs -
#37659
rx additions: anychar, unmatchable, unordered-or
Reported by: Mattias Engdegård <mattiase <at> acm.org>
Date: Tue, 8 Oct 2019 09:37:01 UTC
Severity: wishlist
Tags: fixed, patch
Fixed in version 27.1
Done: Mattias Engdegård <mattiase <at> acm.org>
Bug is archived. No further changes may be made.
Message #43 received at 37659 <at> debbugs.gnu.org (full text, mbox):
On 24 Oct 2019, at 01:14, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>
>> how do we make it easy to match one of multiple strings --- keywords, say --- in rx?
>
> If that's the real problem, perhaps the name should be "or-tokens" or something like that, to help remind the reader of the limitations of the proposed operator: it's meant only for greedy tokenization and it isn't suited for regular expressions in general. A problem with the name "or-max" is that it implies a more-general functionality than the implementation really has.
'or-strings' then perhaps, since there is nothing really restricting it to 'tokens' (which is a bit hazardous terminology given that regexps are commonly used for tokenising). In particular, there is no delimiting; (or-max "IN" "OUT") will match the first part of "INSPECT", which may be unexpected of something ostensibly matching tokens.
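The lack of delimiting can be sketched as follows. This uses Python's `re` module purely as a stand-in, not rx itself; the helper `or_longest` is hypothetical and is not part of rx or of the attached patch. It emulates a longest-string preference by ordering the alternatives by descending length.

```python
import re

def or_longest(*strings):
    """Hypothetical helper: build an alternation that prefers longer strings.

    Python tries alternatives left to right, so putting the longest
    string first emulates a longest-match alternation for plain strings.
    """
    ordered = sorted(strings, key=len, reverse=True)
    return "|".join(re.escape(s) for s in ordered)

# Without delimiters, the alternation matches a prefix of a longer word.
undelimited = re.compile(or_longest("IN", "OUT"))
matched = undelimited.match("INSPECT").group(0)

# Token-like behaviour needs explicit delimiting, here word boundaries.
delimited = re.compile(r"\b(?:" + or_longest("IN", "OUT") + r")\b")
delimited_match = delimited.match("INSPECT")

print(matched)          # IN
print(delimited_match)  # None
```

The point is only that any delimiting has to be written explicitly around the alternation; an operator that takes bare strings does not supply it.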
On the other hand, 'or-strings' sort of precludes a future relaxation of the argument restriction.
> What happens if you apply or-tokens to arguments that aren't strings or other or-tokens? Does rx diagnose this? I hope it does.
Yes, of course. Working patch attached (it still uses the name 'or-max').
'or-max' isn't a vital addition; it just seemed to fill a gap, after experience with traditional regexp usage. It clearly shouldn't be added on a whim. I wanted to get it in place for 27.1, but such a version rush has rarely resulted in good design.
> I was thinking of something more-compatible: we could say that \| is left-to-right (for users who need compatibility with regexp "|"), and that 'or' is not necessarily left-to-right (to make room for future extensions that make 'or' greedy, or more efficient, or both).
Sorry, by '\|' I meant the string regexp operator; I take it you propose separate semantics for the rx '|' and 'or' operators? Maybe we should worry about that if we ever get near the point of replacing the engine. There are other concerns, such as how capture groups are set (even if two branches match equally long texts).
I honestly don't think much would break if '\|' (in string regexps) became greedy overnight, but there is plenty of room to confuse the user if we introduce subtle distinctions between what has hitherto been perceived as synonyms.
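The difference between left-to-right and longest-match alternation, and the capture-group question, can be sketched as follows. Python's `re` is again only a stand-in: it is assumed here to behave like the current string regexp '\|' in trying alternatives in order. Sorting by length is one illustrative way to get longest-match behaviour for plain strings, not what any engine replacement would necessarily do.

```python
import re

text = "INSPECT"

# Left-to-right alternation: the first alternative that matches wins,
# even though a later one would match more text.
leftmost = re.match(r"IN|INSPECT", text).group(0)

# Longest-match behaviour for plain strings, emulated by trying the
# longest alternative first.
alternatives = sorted(["IN", "INSPECT"], key=len, reverse=True)
longest = re.match("|".join(map(re.escape, alternatives)), text).group(0)

# Two branches that match equally long text: which capture group is set
# depends on the order in which the branches are tried.
groups = re.match(r"(ab)|(ab)", "ab").groups()

print(leftmost)  # IN
print(longest)   # INSPECT
print(groups)    # ('ab', None)
```

The last case is why changing the alternation semantics is not only a question of match length: with equally long branches, the choice still shows up in which groups are set.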
[0003-Add-the-rx-or-max-operator.patch (application/octet-stream, attachment)]
This bug report was last modified 5 years and 81 days ago.