GNU bug report logs - #34492
rx: ASCII-raw byte ranges comprise all of Unicode


Package: emacs

Reported by: Mattias Engdegård <mattiase <at> acm.org>

Date: Fri, 15 Feb 2019 18:25:02 UTC

Severity: normal

Done: Mattias Engdegård <mattiase <at> acm.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 34492 in the body.
You can then email your comments to 34492 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#34492; Package emacs. (Fri, 15 Feb 2019 18:25:02 GMT)

Acknowledgement sent to Mattias Engdegård <mattiase <at> acm.org>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Fri, 15 Feb 2019 18:25:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Mattias Engdegård <mattiase <at> acm.org>
To: bug-gnu-emacs <at> gnu.org
Subject: rx: ASCII-raw byte ranges comprise all of Unicode
Date: Fri, 15 Feb 2019 19:23:56 +0100
`rx' incorrectly considers character ranges between ASCII and raw bytes to cover all codes in-between, which includes all non-ASCII Unicode chars.
This causes (any "\000-\377" ?Å) to be simplified to (any "\000-\377"), which is not at all the same thing: [\000-\377] really means [\000-\177\200-\377] -- the transformation is normally made by the Emacs regexp engine. The two ranges are not contiguous on the character code level.

It's a sleeper bug that was awakened by my fixing bug#33205, so I'm to blame for not checking this.
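
The range problem described in this report can be illustrated with a small Python model of Emacs character codes. This is only a sketch and not the code in rx.el: the helper names `split_ascii_raw` and `condense` are hypothetical, and the model assumes the Emacs convention that raw bytes 0x80..0xFF are represented internally by character codes 0x3FFF80..0x3FFFFF, above all of Unicode.

```python
# Illustrative model of Emacs character codes; not the actual rx.el logic.
MAX_ASCII = 0x7F
RAW_BYTE_BASE = 0x3FFF00                 # raw byte b (0x80..0xFF) has code RAW_BYTE_BASE + b
RAW_BYTE_FIRST = RAW_BYTE_BASE + 0x80    # 0x3FFF80
RAW_BYTE_LAST = RAW_BYTE_BASE + 0xFF     # 0x3FFFFF


def split_ascii_raw(lo, hi):
    """Split a range that starts in ASCII and ends in raw bytes.

    Such a range means "these ASCII codes plus these raw bytes",
    not "every character code in between".
    """
    if lo <= MAX_ASCII and hi >= RAW_BYTE_FIRST:
        return [(lo, MAX_ASCII), (RAW_BYTE_FIRST, hi)]
    return [(lo, hi)]


def condense(ranges):
    """Merge overlapping or adjacent (lo, hi) ranges."""
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


A_RING = ord("Å")                        # U+00C5
byte_range = (0x00, RAW_BYTE_LAST)       # models "\000-\377"

# Naive condensation: the byte range swallows Å along with all of Unicode.
naive = condense([byte_range, (A_RING, A_RING)])

# Splitting first keeps Å as a separate alternative.
fixed = condense(split_ascii_raw(*byte_range) + [(A_RING, A_RING)])
```

In this model `naive` collapses to the single range 0..0x3FFFFF, while `fixed` keeps three disjoint ranges: ASCII, Å, and the raw bytes.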





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34492; Package emacs. (Fri, 15 Feb 2019 18:30:03 GMT)

Message #8 received at 34492 <at> debbugs.gnu.org:

From: Mattias Engdegård <mattiase <at> acm.org>
To: 34492 <at> debbugs.gnu.org
Subject: Re: bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise
 all of Unicode)
Date: Fri, 15 Feb 2019 19:29:28 +0100
Patch.

[0001-Prevent-over-eager-rx-character-range-condensation.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34492; Package emacs. (Sat, 16 Feb 2019 07:21:01 GMT)

Message #11 received at 34492 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 34492 <at> debbugs.gnu.org
Subject: Re: bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise
 all of Unicode)
Date: Sat, 16 Feb 2019 09:20:48 +0200
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Fri, 15 Feb 2019 19:29:28 +0100
> 
> Patch.

Thanks, this LGTM, but I think this should be in NEWS.  It's arguably
a bug, but only arguably, and it changes user-visible behavior.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34492; Package emacs. (Sat, 16 Feb 2019 08:09:02 GMT)

Message #14 received at 34492 <at> debbugs.gnu.org:

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 34492 <at> debbugs.gnu.org
Subject: Re: bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise
 all of Unicode)
Date: Sat, 16 Feb 2019 09:08:11 +0100
On 16 Feb 2019, at 08:20, Eli Zaretskii <eliz <at> gnu.org> wrote:
> 
> Thanks, this LGTM, but I think this should be in NEWS.  It's arguably
> a bug, but only arguably, and it changes user-visible behavior.

I'll be happy to write a NEWS item, but for what? The change of bug #33205, or this change, which is not visible unless the other change is already applied (and it hasn't made it into a release yet)?

If you mean the #33205 fix, it might result in something like the following:

** `rx' now handles raw bytes in character alternatives correctly when
given in a string.  Previously, `(any "\x80-\xff")' would match characters
U+0080...U+00FF.  Now the expression matches raw bytes in the 128...255 range,
as expected.

Is that what you had in mind? If so, in what subsection would it go?

* Changes in Specialized Modes and Packages
* Incompatible Lisp Changes
* Lisp Changes
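
The behaviour change described in the draft NEWS entry above can be sketched with a small Python model. It is an illustration, not the rx implementation: the function names are hypothetical, and it assumes the Emacs convention that a raw byte b in 0x80..0xFF has the internal character code 0x3FFF00 + b.

```python
# Illustrative model of the two readings of (any "\x80-\xff"); not rx.el code.
RAW_BYTE_BASE = 0x3FFF00   # assumed offset of raw-byte character codes


def endpoint_as_unicode(byte):
    """Old reading: the byte value is taken directly as a code point."""
    return byte


def endpoint_as_raw_byte(byte):
    """New reading: bytes 0x80..0xFF become raw-byte codes; ASCII is unchanged."""
    return byte + RAW_BYTE_BASE if byte >= 0x80 else byte


def in_range(code, bounds):
    """Return True if CODE lies within the inclusive (lo, hi) BOUNDS."""
    lo, hi = bounds
    return lo <= code <= hi


old_bounds = (endpoint_as_unicode(0x80), endpoint_as_unicode(0xFF))
new_bounds = (endpoint_as_raw_byte(0x80), endpoint_as_raw_byte(0xFF))

a_ring = ord("Å")                       # U+00C5, a Unicode character
raw_c5 = RAW_BYTE_BASE + 0xC5           # the raw byte 0xC5

old_matches_a_ring = in_range(a_ring, old_bounds)
new_matches_a_ring = in_range(a_ring, new_bounds)
new_matches_raw_c5 = in_range(raw_c5, new_bounds)
```

Under the old reading the range covers U+0080...U+00FF and therefore includes Å; under the new reading it covers only the raw bytes, so Å falls outside it while the raw byte 0xC5 falls inside.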





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34492; Package emacs. (Sat, 16 Feb 2019 10:16:02 GMT)

Message #17 received at 34492 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 34492 <at> debbugs.gnu.org
Subject: Re: bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise
 all of Unicode)
Date: Sat, 16 Feb 2019 12:14:57 +0200
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sat, 16 Feb 2019 09:08:11 +0100
> Cc: 34492 <at> debbugs.gnu.org
> 
> On 16 Feb 2019, at 08:20, Eli Zaretskii <eliz <at> gnu.org> wrote:
> > 
> > Thanks, this LGTM, but I think this should be in NEWS.  It's arguably
> > a bug, but only arguably, and it changes user-visible behavior.
> 
> I'll be happy to write a NEWS item, but for what? The change of bug #33205, or this change, which is not visible unless the other change is already applied (and it hasn't made it into a release yet)?

I mean both.

> If you mean the #33205 fix, it might result in something like the following:
> 
> ** `rx' now handles raw bytes in character alternatives correctly when
> given in a string.  Previously, `(any "\x80-\xff")' would match characters
> U+0080...U+00FF.  Now the expression matches raw bytes in the 128...255 range,
> as expected.
> 
> Is that what you had in mind?

Yes.

> If so, in what subsection would it go?

Either make a new section for rx under "Changes in Specialized Modes
and Packages", or put it under "Incompatible Lisp Changes".

Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34492; Package emacs. (Sat, 16 Feb 2019 11:06:02 GMT)

Message #20 received at 34492 <at> debbugs.gnu.org:

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 34492 <at> debbugs.gnu.org
Subject: Re: bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise
 all of Unicode)
Date: Sat, 16 Feb 2019 12:05:09 +0100
On 16 Feb 2019, at 11:14, Eli Zaretskii <eliz <at> gnu.org> wrote:
> 
> Either make a new section for rx under "Changes in Specialized Modes
> and Packages", or put it under "Incompatible Lisp Changes".

I picked the former; thanks for reviewing.
Since this is my first change to NEWS, I'm attaching the modified patch here for a final look.
[0001-Prevent-over-eager-rx-character-range-condensation.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#34492; Package emacs. (Sat, 16 Feb 2019 11:42:02 GMT)

Message #23 received at 34492 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 34492 <at> debbugs.gnu.org
Subject: Re: bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise
 all of Unicode)
Date: Sat, 16 Feb 2019 13:40:49 +0200
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sat, 16 Feb 2019 12:05:09 +0100
> Cc: 34492 <at> debbugs.gnu.org
> 
> +** rx
> +
> +---
> +*** rx now handles raw bytes in character alternatives correctly,
> +when given in a string.  Previously, `(any "\x80-\xff")' would match
> +characters U+0080...U+00FF.  Now the expression matches raw bytes in
> +the 128...255 range, as expected.

This is OK, but we use quoting 'like this' in NEWS.

Thanks.




Reply sent to Mattias Engdegård <mattiase <at> acm.org>:
You have taken responsibility. (Sat, 16 Feb 2019 11:47:02 GMT)

Notification sent to Mattias Engdegård <mattiase <at> acm.org>:
bug acknowledged by developer. (Sat, 16 Feb 2019 11:47:02 GMT)

Message #28 received at 34492-done <at> debbugs.gnu.org:

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 34492-done <at> debbugs.gnu.org
Subject: Re: bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise
 all of Unicode)
Date: Sat, 16 Feb 2019 12:46:16 +0100
On 16 Feb 2019, at 12:40, Eli Zaretskii <eliz <at> gnu.org> wrote:
> 
> This is OK, but we use quoting 'like this' in NEWS.

Thank you, pushed with that modification.





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 17 Mar 2019 11:24:06 GMT)




GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.