GNU bug report logs - #58727
29.0.50; rx doc: Semantics of RX...

Package: emacs;

Reported by: Michael Heerdegen <michael_heerdegen <at> web.de>

Date: Sun, 23 Oct 2022 02:33:02 UTC

Severity: normal

Found in version 29.0.50

Done: Michael Heerdegen <michael_heerdegen <at> web.de>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 58727 in the body.
You can then email your comments to 58727 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#58727; Package emacs. (Sun, 23 Oct 2022 02:33:02 GMT)

Acknowledgement sent to Michael Heerdegen <michael_heerdegen <at> web.de>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sun, 23 Oct 2022 02:33:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Michael Heerdegen <michael_heerdegen <at> web.de>
To: bug-gnu-emacs <at> gnu.org
Subject: 29.0.50; rx doc: Semantics of RX...
Date: Sun, 23 Oct 2022 04:32:17 +0200

Hello,

please document the semantics of multiple RXs for the RX repetition
operators (and maybe grouping operators, too).

The resulting regexps are concatenating like with an implicit `seq'.
This is not trivial, though: in stringish regexps the repetition
operators are only unary, and different interpretations would make sense
for `rx' (implicit `seq', implicit `or').

The docstring of `rx' doesn't say anything about this.  The manual has
sentences like

| ‘(zero-or-more RX...)’
| ‘(0+ RX...)’
|      Match the RXs zero or more times.  Greedy by default.
|      Corresponding string regexp: ‘A*’ (greedy), ‘A*?’ (non-greedy)

but that suffers from the same problem: the semantics of A are not
clear.  Is A == (seq RX...)?

Oh, and maybe let's also make more clear that `rx' always cares about
implicit grouping when necessary.  For example, in
(info "(elisp) Rx Constructs") it's not obvious that, e.g., in

‘(seq RX...)’
‘(sequence RX...)’
‘(: RX...)’
‘(and RX...)’
     Match the RXs in sequence.  Without arguments, the expression
     matches the empty string.
     Corresponding string regexp: ‘AB...’ (subexpressions in sequence).

`rx' silently adds shy grouping to the result, and the corresponding string
regexp in this case is more precisely \(?:AB...\).  I think it is enough
to mention this implicit grouping feature once, but it is important to
spell it out.
  

TIA,

Michael.






Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#58727; Package emacs. (Sun, 23 Oct 2022 16:15:01 GMT)

Message #8 received at 58727 <at> debbugs.gnu.org:

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Michael Heerdegen <michael_heerdegen <at> web.de>
Cc: 58727 <at> debbugs.gnu.org
Subject: 29.0.50; rx doc: Semantics of RX...
Date: Sun, 23 Oct 2022 18:14:10 +0200

> The resulting regexps are concatenating like with an implicit `seq'.
> This is not trivial, though: in stringish regexps the repetition
> operators are only unary, and different interpretations would make sense
> for `rx' (implicit `seq', implicit `or').

The rule is implicit concatenation unless specified otherwise; maybe we could say that in the leading paragraph. (`or` is the only place where concatenation isn't done.)
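
As a rough illustration of that rule (a sketch, not text from the manual), a repetition form given several RXs behaves as if they were wrapped in an explicit `seq'.  The function name `my-rx-concat-demo' is made up for this example.

```elisp
;; Sketch only: `my-rx-concat-demo' is a made-up name, not part of rx.
(defun my-rx-concat-demo ()
  "Return match positions showing that repetition concatenates its RXs."
  (let ((case-fold-search nil)
        (implicit (rx bos (zero-or-more "a" "b") eos))
        (explicit (rx bos (zero-or-more (seq "a" "b")) eos)))
    (list (string-match-p implicit "abab")  ; "ab" repeated twice
          (string-match-p explicit "abab")  ; same with an explicit `seq'
          (string-match-p implicit "aabb")  ; would match under an implicit `or'
          (string-match-p implicit ""))))   ; zero repetitions

(my-rx-concat-demo)  ; => (0 0 nil 0)
```

The third element is the telling one: "aabb" is rejected, which it would not be if the arguments were combined as alternatives.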

Otherwise I think we should grant our readers some common sense. It's not a formal specification but meant for humans to understand, and I'm quite sure they do.

> Oh, and maybe let's also make more clear that `rx' always cares about
> implicit grouping when necessary.

No, there is no such thing in rx. The manual provides corresponding string-notation constructs for orientation only.
This is important -- rx forms are defined by their semantics, not by what strings they translate to.
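
One way to see the difference between semantics and string translation, as a hedged sketch: an `or' nested in a sequence keeps its meaning, whereas a hand-written string without brackets does not.  The name `my-rx-or-in-seq-demo' and the sample strings are assumptions made for this illustration.

```elisp
;; Sketch only: `my-rx-or-in-seq-demo' is a made-up name, not part of rx.
(defun my-rx-or-in-seq-demo ()
  "Compare an rx form with a naive, unbracketed string regexp."
  (let ((case-fold-search nil)
        (from-rx (rx bos "x" (or "ab" "cd") "y" eos))
        ;; Hand-written without brackets: parses as \`xab OR cdy\'.
        (naive "\\`xab\\|cdy\\'"))
    (list (string-match-p from-rx "xaby")   ; matches
          (string-match-p from-rx "xcdy")   ; matches
          (string-match-p from-rx "xab")    ; nil: the "y" is required
          (string-match-p naive "xab"))))   ; matches: wrong precedence

(my-rx-or-in-seq-demo)  ; => (0 0 nil 0)
```

The example deliberately says nothing about the exact string that `rx' produces; only the matching behaviour is checked.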





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#58727; Package emacs. (Mon, 24 Oct 2022 02:35:02 GMT)

Message #11 received at 58727 <at> debbugs.gnu.org:

From: Michael Heerdegen <michael_heerdegen <at> web.de>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: 58727 <at> debbugs.gnu.org
Subject: Re: 29.0.50; rx doc: Semantics of RX...
Date: Mon, 24 Oct 2022 04:34:09 +0200

Mattias Engdegård <mattias.engdegard <at> gmail.com> writes:

> > The resulting regexps are concatenating like with an implicit `seq'.
> > This is not trivial, though: in stringish regexps the repetition
> > operators are only unary, and different interpretations would make sense
> > for `rx' (implicit `seq', implicit `or').
>
> The rule is implicit concatenation unless specified otherwise; maybe
> we could say that in the leading paragraph. (`or` is the only place
> where concatenation isn't done.)

Yes, that would be good.


> > Oh, and maybe let's also make more clear that `rx' always cares
> > about implicit grouping when necessary.
>
> No, there is no such thing in rx.

I think you misunderstood what I meant: I meant the implicit shy
grouping added in the return value, as in

  (rx (or "ab" "cd")) ==> "\\(?:ab\\|cd\\)"
                           ^^^^^       ^^^
> The manual provides corresponding string-notation constructs for
> orientation only.  This is important -- rx forms are defined by their
> semantics, not by what strings they translate to.

Is this trivial however?  Is it clear that, even for people that see rx
more as a translator to stringish regexps, `rx' is that smart?

A sentence like "rx forms are defined by their semantics" would help to
make that clear I think.
Dunno, I'm just guessing that there is potential for misunderstanding here.

Telling about the implicit concatenation of RX... is the more important
point for me.


Thanks so far,

Michael. 




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#58727; Package emacs. (Mon, 24 Oct 2022 12:50:01 GMT)

Message #14 received at 58727 <at> debbugs.gnu.org:

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Michael Heerdegen <michael_heerdegen <at> web.de>
Cc: 58727 <at> debbugs.gnu.org
Subject: Re: 29.0.50; rx doc: Semantics of RX...
Date: Mon, 24 Oct 2022 14:49:25 +0200

On 24 Oct 2022, at 04:34, Michael Heerdegen <michael_heerdegen <at> web.de> wrote:

>> The rule is implicit concatenation unless specified otherwise; maybe
>> we could say that in the leading paragraph. (`or` is the only place
>> where concatenation isn't done.)
> 
> Yes, that would be good.

Now added.

> I meant the implicit shy
> grouping added in the return value

Yes, and this is simply not a problem in rx, nor on the abstract regexp level -- it's just a feature of the surface syntax of string regexps, which is not something that the rx docs are or should be preoccupied with.

(For that matter, 'shy grouping' is terrible terminology: it's obscure wording for something that most people simply call bracketing.)

>  (rx (or "ab" "cd")) ==> "\\(?:ab\\|cd\\)"
>                           ^^^^^       ^^^

This happens to be a cosmetic flaw in rx: in this case the brackets shouldn't be there at all, but getting rid of them is currently more trouble than it's worth. It does not affect matching performance. See it as an excess of packaging material which does not increase the shipping costs.
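
A small sketch of the "cosmetic" point, under the assumption that a handful of sample strings is representative: the bracketed string quoted above and the same regexp without brackets match at the same positions.  The name `my-rx-bracket-demo' is made up, and the check says nothing about performance.

```elisp
;; Sketch only: `my-rx-bracket-demo' is a made-up name, not part of rx.
(defun my-rx-bracket-demo ()
  "Return non-nil if both regexps match at the same places in the samples."
  (let ((case-fold-search nil)
        (samples '("ab" "cd" "xcd" "ac" ""))
        (bracketed "\\(?:ab\\|cd\\)")   ; the string quoted above
        (bare "ab\\|cd"))               ; same regexp without brackets
    (equal (mapcar (lambda (s) (string-match-p bracketed s)) samples)
           (mapcar (lambda (s) (string-match-p bare s)) samples))))

(my-rx-bracket-demo)  ; => t
```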

>> The manual provides corresponding string-notation constructs for
>> orientation only.  This is important -- rx forms are defined by their
>> semantics, not by what strings they translate to.
> 
> Is this trivial however?  Is it clear that, even for people that see rx
> more as a translator to stringish regexps, `rx' is that smart?

It's not that rx is smart, it's that it's not completely broken. Mentioning that rx adds brackets now and then is tantamount to saying that it's not buggy. 

We don't say that the byte-compiler emits jump instructions as needed, not just because it's superfluous information but also because such a statement suggests that it's not.

> A sentence like "rx forms are defined by their semantics" would help to
> make that clear I think.

Well, I added a phrase to that effect as well.

Thank you for your comments and suggestions!





Reply sent to Michael Heerdegen <michael_heerdegen <at> web.de>:
You have taken responsibility. (Tue, 25 Oct 2022 02:50:02 GMT)

Notification sent to Michael Heerdegen <michael_heerdegen <at> web.de>:
bug acknowledged by developer. (Tue, 25 Oct 2022 02:50:02 GMT)

Message #19 received at 58727-done <at> debbugs.gnu.org:

From: Michael Heerdegen <michael_heerdegen <at> web.de>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: 58727-done <at> debbugs.gnu.org
Subject: Re: 29.0.50; rx doc: Semantics of RX...
Date: Tue, 25 Oct 2022 04:49:20 +0200

Mattias Engdegård <mattias.engdegard <at> gmail.com> writes:

> > A sentence like "rx forms are defined by their semantics" would help
> > to make that clear I think.
>
> Well, I added a phrase to that effect as well.

Thanks - I hope it was not too much.

> Thank you for your comments and suggestions!

And thank you for the implementation of these!


Regards,

Michael.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 22 Nov 2022 12:24:04 GMT)

This bug report was last modified 2 years and 212 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.