GNU bug report logs -
#70988
(read FUNCTION) uses Latin-1 [PATCH]
Previous Next
To reply to this bug, email your comments to 70988 AT debbugs.gnu.org.
There is no need to reopen the bug first.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Thu, 16 May 2024 18:14:01 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Mattias Engdegård <mattias.engdegard <at> gmail.com>
:
New bug report received and forwarded. Copy sent to
bug-gnu-emacs <at> gnu.org
.
(Thu, 16 May 2024 18:14:01 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
When `read` is called with a function as stream argument, the return values of that function are often interpreted as Latin-1 characters with only the 8 low bits used. Example:
(let* ((next '(?A #x12a nil))
(f (lambda (&rest args)
(if args
(push (car args) next)
(pop next)))))
(read f))
=> A* ; expected: AĪ
This is a result of `readchar` setting *multibyte to 0 on this code path.
The reader is not very consistent: inside string and character literals, the code seems to work as expected.
The fix is straightforward (attached).
[read-from-function.diff (application/octet-stream, attachment)]
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Thu, 16 May 2024 18:49:02 GMT)
Full text and
rfc822 format available.
Message #8 received at 70988 <at> debbugs.gnu.org (full text, mbox):
> Cc: Stefan Monnier <monnier <at> iro.umontreal.ca>
> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Thu, 16 May 2024 20:13:18 +0200
>
> When `read` is called with a function as stream argument, the return values of that function are often interpreted as Latin-1 characters with only the 8 low bits used. Example:
>
> (let* ((next '(?A #x12a nil))
> (f (lambda (&rest args)
> (if args
> (push (car args) next)
> (pop next)))))
> (read f))
> => A* ; expected: AĪ
>
> This is a result of `readchar` setting *multibyte to 0 on this code path.
When is this situation relevant? How many uses of
function-as-a-stream are there out there?
In general, I wouldn't touch these rare cases with a 3-mile pole. The
gain is generally very small (satisfaction from some abstract sense of
correctness aside), while the risk to break some code is usually high.
It is better to document this behavior and move on.
> The fix is straightforward (attached).
>
> diff --git a/src/lread.c b/src/lread.c
> index c92b2ede932..2626272c4e2 100644
> --- a/src/lread.c
> +++ b/src/lread.c
> @@ -422,6 +422,8 @@ readchar (Lisp_Object readcharfun, bool *multibyte)
> goto read_multibyte;
> }
>
> + if (multibyte)
> + *multibyte = 1;
> tem = call0 (readcharfun);
Is it an accident that the code does the same only _after_ the call to
readbyte?
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Thu, 16 May 2024 19:48:01 GMT)
Full text and
rfc822 format available.
Message #11 received at 70988 <at> debbugs.gnu.org (full text, mbox):
16 maj 2024 kl. 20.47 skrev Eli Zaretskii <eliz <at> gnu.org>:
> When is this situation relevant? How many uses of
> function-as-a-stream are there out there?
Not many is my guess, which is perhaps why it wasn't found before.
I'm doing some performance work on the reader, and quirks in the code like these become obvious.
> Is it an accident that the code does the same only _after_ the call to
> readbyte?
Yes, I have no reason to believe otherwise.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Thu, 16 May 2024 19:55:01 GMT)
Full text and
rfc822 format available.
Message #14 received at 70988 <at> debbugs.gnu.org (full text, mbox):
> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Thu, 16 May 2024 21:45:56 +0200
> Cc: 70988 <at> debbugs.gnu.org,
> monnier <at> iro.umontreal.ca
>
> > Is it an accident that the code does the same only _after_ the call to
> > readbyte?
>
> Yes, I have no reason to believe otherwise.
To me, it actually looks as done on purpose.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Fri, 17 May 2024 08:11:01 GMT)
Full text and
rfc822 format available.
Message #17 received at 70988 <at> debbugs.gnu.org (full text, mbox):
16 maj 2024 kl. 21.54 skrev Eli Zaretskii <eliz <at> gnu.org>:
>>> Is it an accident that the code does the same only _after_ the call to
>>> readbyte?
>>
>> Yes, I have no reason to believe otherwise.
>
> To me, it actually looks as done on purpose.
You could very well be right about that. What I meant is that the order doesn't matter at all.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Fri, 17 May 2024 10:49:01 GMT)
Full text and
rfc822 format available.
Message #20 received at 70988 <at> debbugs.gnu.org (full text, mbox):
> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Fri, 17 May 2024 10:08:58 +0200
> Cc: 70988 <at> debbugs.gnu.org,
> monnier <at> iro.umontreal.ca
>
> 16 maj 2024 kl. 21.54 skrev Eli Zaretskii <eliz <at> gnu.org>:
>
> >>> Is it an accident that the code does the same only _after_ the call to
> >>> readbyte?
> >>
> >> Yes, I have no reason to believe otherwise.
> >
> > To me, it actually looks as done on purpose.
>
> You could very well be right about that. What I meant is that the order doesn't matter at all.
Doesn't it affect what the readbyte call does?
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Fri, 17 May 2024 17:10:02 GMT)
Full text and
rfc822 format available.
Message #23 received at 70988 <at> debbugs.gnu.org (full text, mbox):
17 maj 2024 kl. 12.45 skrev Eli Zaretskii <eliz <at> gnu.org>:
>>>>> Is it an accident that the code does the same only _after_ the call to
>>>>> readbyte?
>>>>
>>>> Yes, I have no reason to believe otherwise.
>>>
>>> To me, it actually looks as done on purpose.
>>
>> You could very well be right about that. What I meant is that the order doesn't matter at all.
>
> Doesn't it affect what the readbyte call does?
No -- the `*multibyte = ...` assignment is just an extra return value, which indicates whether the returned values come from a unibyte or multibyte source. For any given source (READCHARFUN, in the terminology of lread.c), the characters will all be unibyte or multibyte, so this returned `multibyte` flag will typically only be used once by the caller and saved for future reference.
But you are right to question it because lread.c is a royal mess and many changes have not been made in a clean way. It is unclear whether it's worth returning the `multibyte` flag at all; it's only used in special cases.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Thu, 30 May 2024 15:45:01 GMT)
Full text and
rfc822 format available.
Message #26 received at 70988 <at> debbugs.gnu.org (full text, mbox):
After looking further into the Lisp reader/printer I found two more silent Latin-1 assumptions. In all three cases, I firmly believe the following to be true:
* The behaviour is not intended but just code accidents.
* They should hardly affect any user code at all.
* They are nevertheless clear bugs which should be fixed.
Further on this will have to wait until after Emacs 30 has been branched to avoid delaying that more important task.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Wed, 12 Feb 2025 14:54:02 GMT)
Full text and
rfc822 format available.
Message #29 received at 70988 <at> debbugs.gnu.org (full text, mbox):
Mattias Engdegård <mattias.engdegard <at> gmail.com> writes:
> After looking further into the Lisp reader/printer I found two more silent Latin-1 assumptions. In all three cases, I firmly believe the following to be true:
>
> * The behaviour is not intended but just code accidents.
> * They should hardly affect any user code at all.
> * They are nevertheless clear bugs which should be fixed.
>
> Further on this will have to wait until after Emacs 30 has been branched to avoid delaying that more important task.
FWIW, the proposed patch looks like a bug fix to me as well, so I think
we should install it.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Wed, 12 Feb 2025 16:43:02 GMT)
Full text and
rfc822 format available.
Message #32 received at 70988 <at> debbugs.gnu.org (full text, mbox):
"Stefan Kangas" <stefankangas <at> gmail.com> writes:
> Mattias Engdegård <mattias.engdegard <at> gmail.com> writes:
>
>> After looking further into the Lisp reader/printer I found two more
>> silent Latin-1 assumptions. In all three cases, I firmly believe the
>> following to be true:
>>
>> * The behaviour is not intended but just code accidents.
>> * They should hardly affect any user code at all.
>> * They are nevertheless clear bugs which should be fixed.
>>
>> Further on this will have to wait until after Emacs 30 has been branched to avoid delaying that more important task.
>
> FWIW, the proposed patch looks like a bug fix to me as well, so I think
> we should install it.
I think we should think about whether we want to force multibyte to true
for all functions, even those never returning non-ASCII chars. Also,
the code appears to use XFIXNUM on a Lisp_Object that might not be one.
IIUC, the difference is that all-ASCII strings would be unibyte strings
in some circumstances.
The alternative patch would look something like this:
From bbc65c9be7ccebf034f4d10f018a076ef1e8a4e9 Mon Sep 17 00:00:00 2001
From: Pip Cet <pipcet <at> protonmail.com>
Subject: [PATCH] Auto-detect multibyteness of readchar funs (bug#70988)
* src/lread.c (readchar): Set *MULTIBYTE if we detect a multibyte
character. Return -1 for non-characters rather than crashing.
---
src/lread.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/src/lread.c b/src/lread.c
index 6af95873bb8..c18c1be3cf5 100644
--- a/src/lread.c
+++ b/src/lread.c
@@ -398,9 +398,12 @@ readchar (Lisp_Object readcharfun, bool *multibyte)
tem = call0 (readcharfun);
- if (NILP (tem))
+ if (!CHARACTERP (tem))
return -1;
- return XFIXNUM (tem);
+ if (multibyte && !ASCII_CHAR_P (XFIXNAT (tem)))
+ *multibyte = true;
+
+ return XFIXNAT (tem);
read_multibyte:
if (unread_char >= 0)
--
2.48.1
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Wed, 12 Feb 2025 19:27:02 GMT)
Full text and
rfc822 format available.
Message #35 received at 70988 <at> debbugs.gnu.org (full text, mbox):
> Date: Wed, 12 Feb 2025 16:42:43 +0000
> From: Pip Cet <pipcet <at> protonmail.com>
> Cc: Mattias Engdegård <mattias.engdegard <at> gmail.com>, 70988 <at> debbugs.gnu.org, Eli Zaretskii <eliz <at> gnu.org>, monnier <at> iro.umontreal.ca
>
> The alternative patch would look something like this:
>
> >From bbc65c9be7ccebf034f4d10f018a076ef1e8a4e9 Mon Sep 17 00:00:00 2001
> From: Pip Cet <pipcet <at> protonmail.com>
> Subject: [PATCH] Auto-detect multibyteness of readchar funs (bug#70988)
>
> * src/lread.c (readchar): Set *MULTIBYTE if we detect a multibyte
> character. Return -1 for non-characters rather than crashing.
> ---
> src/lread.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/src/lread.c b/src/lread.c
> index 6af95873bb8..c18c1be3cf5 100644
> --- a/src/lread.c
> +++ b/src/lread.c
> @@ -398,9 +398,12 @@ readchar (Lisp_Object readcharfun, bool *multibyte)
>
> tem = call0 (readcharfun);
>
> - if (NILP (tem))
> + if (!CHARACTERP (tem))
> return -1;
> - return XFIXNUM (tem);
> + if (multibyte && !ASCII_CHAR_P (XFIXNAT (tem)))
> + *multibyte = true;
> +
> + return XFIXNAT (tem);
AFAIU, the proposed patch was just a bugfix, whereas the above also
changes behavior in backward-incompatible ways.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Wed, 12 Feb 2025 20:29:02 GMT)
Full text and
rfc822 format available.
Message #38 received at 70988 <at> debbugs.gnu.org (full text, mbox):
"Eli Zaretskii" <eliz <at> gnu.org> writes:
>> Date: Wed, 12 Feb 2025 16:42:43 +0000
>> From: Pip Cet <pipcet <at> protonmail.com>
>> Cc: Mattias Engdegård <mattias.engdegard <at> gmail.com>, 70988 <at> debbugs.gnu.org, Eli Zaretskii <eliz <at> gnu.org>, monnier <at> iro.umontreal.ca
>>
>> The alternative patch would look something like this:
>>
>> >From bbc65c9be7ccebf034f4d10f018a076ef1e8a4e9 Mon Sep 17 00:00:00 2001
>> From: Pip Cet <pipcet <at> protonmail.com>
>> Subject: [PATCH] Auto-detect multibyteness of readchar funs (bug#70988)
>>
>> * src/lread.c (readchar): Set *MULTIBYTE if we detect a multibyte
>> character. Return -1 for non-characters rather than crashing.
>> ---
>> src/lread.c | 7 +++++--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/src/lread.c b/src/lread.c
>> index 6af95873bb8..c18c1be3cf5 100644
>> --- a/src/lread.c
>> +++ b/src/lread.c
>> @@ -398,9 +398,12 @@ readchar (Lisp_Object readcharfun, bool *multibyte)
>>
>> tem = call0 (readcharfun);
>>
>> - if (NILP (tem))
>> + if (!CHARACTERP (tem))
>> return -1;
>> - return XFIXNUM (tem);
>> + if (multibyte && !ASCII_CHAR_P (XFIXNAT (tem)))
>> + *multibyte = true;
>> +
>> + return XFIXNAT (tem);
>
> AFAIU, the proposed patch was just a bugfix, whereas the above also
> changes behavior in backward-incompatible ways.
The other way around, I think: the first proposed patch changed the
behavior of readchar to always set the multibyte flag when a function
was used, resulting in the creation of symbols whose ASCII names are
multibyte strings. The previous behavior was never to set the multibyte
flag, which was correct for ASCII strings but not multibyte ones.
This patch retains the previous behavior for ASCII symbols, but sets the
multibyte flag for non-ASCII symbols, which seems the best we can do if
we're given a simple function.
If we want to change symbol names to always be multibyte strings, we can
do that, but then we probably want to do that or all streams.
It also fixes yet another XFIXNUM crash, but those (there are more in
lread.c, it seems) should be fixed independently. However, it does give
us the ability to extend the API so readcharfun could return a single
character string, unibyte or multibyte, to be handled appropriately.
Pip
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Thu, 13 Feb 2025 06:02:02 GMT)
Full text and
rfc822 format available.
Message #41 received at 70988 <at> debbugs.gnu.org (full text, mbox):
> Date: Wed, 12 Feb 2025 20:27:58 +0000
> From: Pip Cet <pipcet <at> protonmail.com>
> Cc: stefankangas <at> gmail.com, mattias.engdegard <at> gmail.com, 70988 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
>
> "Eli Zaretskii" <eliz <at> gnu.org> writes:
>
> >> --- a/src/lread.c
> >> +++ b/src/lread.c
> >> @@ -398,9 +398,12 @@ readchar (Lisp_Object readcharfun, bool *multibyte)
> >>
> >> tem = call0 (readcharfun);
> >>
> >> - if (NILP (tem))
> >> + if (!CHARACTERP (tem))
> >> return -1;
> >> - return XFIXNUM (tem);
> >> + if (multibyte && !ASCII_CHAR_P (XFIXNAT (tem)))
> >> + *multibyte = true;
> >> +
> >> + return XFIXNAT (tem);
> >
> > AFAIU, the proposed patch was just a bugfix, whereas the above also
> > changes behavior in backward-incompatible ways.
>
> The other way around, I think: the first proposed patch changed the
> behavior of readchar to always set the multibyte flag when a function
> was used, resulting in the creation of symbols whose ASCII names are
> multibyte strings. The previous behavior was never to set the multibyte
> flag, which was correct for ASCII strings but not multibyte ones.
>
> This patch retains the previous behavior for ASCII symbols, but sets the
> multibyte flag for non-ASCII symbols, which seems the best we can do if
> we're given a simple function.
I'm talking about the CHARACTERP test (why not FIXNUMP?), and the
addition of ASCII_CHAR_P test (why would we want an ASCII character
to never be considered multibyte?).
> If we want to change symbol names to always be multibyte strings, we can
> do that, but then we probably want to do that or all streams.
I don't understand why you are talking about symbols: AFAIU this code
is used in many other cases as well. But even for symbols: why change
the current behavior of making their names multibyte?
> It also fixes yet another XFIXNUM crash, but those (there are more in
> lread.c, it seems) should be fixed independently.
I'm okay with adding a FIXNUMP test (which happens in the debugging
builds anyway, so any violations probably never happen), but using
CHARACTERP changes behavior.
> However, it does give us the ability to extend the API so
> readcharfun could return a single character string, unibyte or
> multibyte, to be handled appropriately.
This is also a change in behavior.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Thu, 13 Feb 2025 10:10:02 GMT)
Full text and
rfc822 format available.
Message #44 received at 70988 <at> debbugs.gnu.org (full text, mbox):
"Eli Zaretskii" <eliz <at> gnu.org> writes:
>> Date: Wed, 12 Feb 2025 20:27:58 +0000
>> From: Pip Cet <pipcet <at> protonmail.com>
>> Cc: stefankangas <at> gmail.com, mattias.engdegard <at> gmail.com, 70988 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
>>
>> "Eli Zaretskii" <eliz <at> gnu.org> writes:
>>
>> >> --- a/src/lread.c
>> >> +++ b/src/lread.c
>> >> @@ -398,9 +398,12 @@ readchar (Lisp_Object readcharfun, bool *multibyte)
>> >>
>> >> tem = call0 (readcharfun);
>> >>
>> >> - if (NILP (tem))
>> >> + if (!CHARACTERP (tem))
>> >> return -1;
>> >> - return XFIXNUM (tem);
>> >> + if (multibyte && !ASCII_CHAR_P (XFIXNAT (tem)))
>> >> + *multibyte = true;
>> >> +
>> >> + return XFIXNAT (tem);
>> >
>> > AFAIU, the proposed patch was just a bugfix, whereas the above also
>> > changes behavior in backward-incompatible ways.
>>
>> The other way around, I think: the first proposed patch changed the
>> behavior of readchar to always set the multibyte flag when a function
>> was used, resulting in the creation of symbols whose ASCII names are
>> multibyte strings. The previous behavior was never to set the multibyte
>> flag, which was correct for ASCII strings but not multibyte ones.
>>
>> This patch retains the previous behavior for ASCII symbols, but sets the
>> multibyte flag for non-ASCII symbols, which seems the best we can do if
>> we're given a simple function.
>
> I'm talking about the CHARACTERP test (why not FIXNUMP?), and the
The function is supposed to return a character, not just any fixnum.
> addition of ASCII_CHAR_P test (why would we want an ASCII character
> to never be considered multibyte?).
It's the other way around, again: if there's a non-ASCII character, we
treat the stream as multibyte; if there are ONLY ASCII characters, we
treat it as unibyte.
>> If we want to change symbol names to always be multibyte strings, we can
>> do that, but then we probably want to do that or all streams.
>
> I don't understand why you are talking about symbols: AFAIU this code
> is used in many other cases as well. But even for symbols: why change
> the current behavior of making their names multibyte?
The current behavior is to make their names unibyte! The current
behavior is *changed* by the first patch, and *retained* by my patch.
>> It also fixes yet another XFIXNUM crash, but those (there are more in
>> lread.c, it seems) should be fixed independently.
>
> I'm okay with adding a FIXNUMP test (which happens in the debugging
> builds anyway, so any violations probably never happen), but using
> CHARACTERP changes behavior.
If you count "avoids further crashes" as "changes behavior", yes.
readcharfun is supposed to return a character or -1. Some callers
assume the return value is a valid character, and will crash otherwise.
I haven't checked all of them because there are many.
>> However, it does give us the ability to extend the API so
>> readcharfun could return a single character string, unibyte or
>> multibyte, to be handled appropriately.
>
> This is also a change in behavior.
Yes, of course, which is why it's a separate proposal and not part of
the patch.
Pip
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Thu, 13 Feb 2025 10:24:02 GMT)
Full text and
rfc822 format available.
Message #47 received at 70988 <at> debbugs.gnu.org (full text, mbox):
> Date: Thu, 13 Feb 2025 10:08:54 +0000
> From: Pip Cet <pipcet <at> protonmail.com>
> Cc: stefankangas <at> gmail.com, mattias.engdegard <at> gmail.com, 70988 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
>
> "Eli Zaretskii" <eliz <at> gnu.org> writes:
[...]
This style of "discussion" leads nowhere useful, so I'm bowing out of
it.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Thu, 13 Feb 2025 14:09:01 GMT)
Full text and
rfc822 format available.
Message #50 received at 70988 <at> debbugs.gnu.org (full text, mbox):
"Stefan Kangas" <stefankangas <at> gmail.com> writes:
> Mattias Engdegård <mattias.engdegard <at> gmail.com> writes:
>
>> After looking further into the Lisp reader/printer I found two more
>> silent Latin-1 assumptions. In all three cases, I firmly believe the
>> following to be true:
>>
>> * The behaviour is not intended but just code accidents.
>> * They should hardly affect any user code at all.
>> * They are nevertheless clear bugs which should be fixed.
>>
>> Further on this will have to wait until after Emacs 30 has been branched to avoid delaying that more important task.
>
> FWIW, the proposed patch looks like a bug fix to me as well, so I think
> we should install it.
Here's a patch which avoids a few crashes in lread.c. If we change the
behavior of readcharfuns' return values to always be treated as
multibyte, but fail to verify that its return value satisfies
CHARACTERP, as proposed, further problems will arise.
In the event that there is interest in fixing Emacs not to crash in
these circumstances, let me know and I can commit this.
Pip
From e9d48abeb199db7d76386639adadba5c7e45177c Mon Sep 17 00:00:00 2001
From: Pip Cet <pipcet <at> protonmail.com>
Subject: [PATCH] Avoid crashes in lread.c when invalid characters are read
* src/lread.c (readchar): Don't crash for non-fixnum return values.
(read_filtered_event): Don't crash for invalid symbol properties.
(Fread_char):
(Fread_char_exclusive):
(character_name_to_code): Check 'FIXNUMP' before using 'XFIXNUM'.
(read_char_escape): Use 'invalid_syntax' rather than crashing for
invalid characters.
---
src/lread.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/src/lread.c b/src/lread.c
index 6af95873bb8..5875b489c97 100644
--- a/src/lread.c
+++ b/src/lread.c
@@ -398,7 +398,7 @@ readchar (Lisp_Object readcharfun, bool *multibyte)
tem = call0 (readcharfun);
- if (NILP (tem))
+ if (!FIXNUMP (tem))
return -1;
return XFIXNUM (tem);
@@ -816,7 +816,7 @@ read_filtered_event (bool no_switch_frame, bool ascii_required,
tem1 = Fget (Fcar (tem), Qascii_character);
/* Merge this symbol's modifier bits
with the ASCII equivalent of its basic code. */
- if (!NILP (tem1))
+ if (FIXNUMP (tem1) && FIXNUMP (Fcar (Fcdr (tem))))
XSETFASTINT (val, XFIXNUM (tem1) | XFIXNUM (Fcar (Fcdr (tem))));
}
}
@@ -898,7 +898,7 @@ DEFUN ("read-char", Fread_char, Sread_char, 0, 3, 0,
}
val = read_filtered_event (1, 1, 1, ! NILP (inherit_input_method), seconds);
- return (NILP (val) ? Qnil
+ return (!FIXNUMP (val) ? Qnil
: make_fixnum (char_resolve_modifier_mask (XFIXNUM (val))));
}
@@ -976,7 +976,7 @@ DEFUN ("read-char-exclusive", Fread_char_exclusive, Sread_char_exclusive, 0, 3,
val = read_filtered_event (1, 1, 0, ! NILP (inherit_input_method), seconds);
- return (NILP (val) ? Qnil
+ return (!FIXNUMP (val) ? Qnil
: make_fixnum (char_resolve_modifier_mask (XFIXNUM (val))));
}
@@ -2820,7 +2820,7 @@ character_name_to_code (char const *name, ptrdiff_t name_len,
invalid_syntax_lisp (CALLN (Fformat, format, namestr), readcharfun);
}
- return XFIXNUM (code);
+ return FIXNUMP (code) ? XFIXNUM (code) : -1;
}
/* Bound on the length of a Unicode character name. As of
@@ -3058,7 +3058,8 @@ read_char_escape (Lisp_Object readcharfun, int next_char)
chr = c;
break;
}
- eassert (chr >= 0 && chr < (1 << CHARACTERBITS));
+ if (chr < 0 || chr > (1 << CHARACTERBITS))
+ invalid_syntax ("Invalid character", readcharfun);
/* Apply Control modifiers, using the rules:
\C-X = ascii_ctrl(nomod(X)) | mods(X) if nomod(X) is one of:
--
2.48.1
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Thu, 13 Feb 2025 16:43:02 GMT)
Full text and
rfc822 format available.
Message #53 received at 70988 <at> debbugs.gnu.org (full text, mbox):
> Date: Thu, 13 Feb 2025 14:08:02 +0000
> From: Pip Cet <pipcet <at> protonmail.com>
> Cc: Mattias Engdegård <mattias.engdegard <at> gmail.com>, 70988 <at> debbugs.gnu.org, Eli Zaretskii <eliz <at> gnu.org>, monnier <at> iro.umontreal.ca
>
> @@ -3058,7 +3058,8 @@ read_char_escape (Lisp_Object readcharfun, int next_char)
> chr = c;
> break;
> }
> - eassert (chr >= 0 && chr < (1 << CHARACTERBITS));
> + if (chr < 0 || chr > (1 << CHARACTERBITS))
> + invalid_syntax ("Invalid character", readcharfun);
Please leave the assertion in place here (in addition to the error
message), since an abort is easier to spot than an error message,
especially if someone catches errors at a higher level.
I'm okay with the rest of this patch, thanks.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Thu, 13 Feb 2025 17:12:02 GMT)
Full text and
rfc822 format available.
Message #56 received at 70988 <at> debbugs.gnu.org (full text, mbox):
"Eli Zaretskii" <eliz <at> gnu.org> writes:
>> Date: Thu, 13 Feb 2025 14:08:02 +0000
>> From: Pip Cet <pipcet <at> protonmail.com>
>> Cc: Mattias Engdegård <mattias.engdegard <at> gmail.com>, 70988 <at> debbugs.gnu.org, Eli Zaretskii <eliz <at> gnu.org>, monnier <at> iro.umontreal.ca
>>
>> @@ -3058,7 +3058,8 @@ read_char_escape (Lisp_Object readcharfun, int next_char)
>> chr = c;
>> break;
>> }
>> - eassert (chr >= 0 && chr < (1 << CHARACTERBITS));
>> + if (chr < 0 || chr > (1 << CHARACTERBITS))
>> + invalid_syntax ("Invalid character", readcharfun);
>
> Please leave the assertion in place here (in addition to the error
> message), since an abort is easier to spot than an error message,
> especially if someone catches errors at a higher level.
Just to clarify, you want
@@ -3058,7 +3058,8 @@ read_char_escape (Lisp_Object readcharfun, int next_char)
chr = c;
break;
}
eassert (chr >= 0 && chr < (1 << CHARACTERBITS));
+ if (chr < 0 || chr >= (1 << CHARACTERBITS))
+ invalid_syntax ("Invalid character", readcharfun);
?
Thanks!
Pip
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Thu, 13 Feb 2025 20:14:02 GMT)
Full text and
rfc822 format available.
Message #59 received at 70988 <at> debbugs.gnu.org (full text, mbox):
> Date: Thu, 13 Feb 2025 17:11:35 +0000
> From: Pip Cet <pipcet <at> protonmail.com>
> Cc: stefankangas <at> gmail.com, mattias.engdegard <at> gmail.com, 70988 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
>
> "Eli Zaretskii" <eliz <at> gnu.org> writes:
>
> > Please leave the assertion in place here (in addition to the error
> > message), since an abort is easier to spot than an error message,
> > especially if someone catches errors at a higher level.
>
> Just to clarify, you want
>
> @@ -3058,7 +3058,8 @@ read_char_escape (Lisp_Object readcharfun, int next_char)
> chr = c;
> break;
> }
> eassert (chr >= 0 && chr < (1 << CHARACTERBITS));
> + if (chr < 0 || chr >= (1 << CHARACTERBITS))
> + invalid_syntax ("Invalid character", readcharfun);
>
> ?
Yes, exactly. Apologies for not making that clear.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Sat, 05 Jul 2025 11:28:03 GMT)
Full text and
rfc822 format available.
Message #62 received at 70988 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Sorry about the delay. The bugs I was talking about earlier are:
1. (read FUNCTION) assumes latin-1, as discussed earlier in this bug.
The code in readchar() just forgets to set the multibyte flag for function sources.
2. (read UNIBYTE-STRING) assumes latin-1:
(read "\"a\xff\"") -> "aÿ"
For buffer and marker sources, readchar() does
if (! ASCII_CHAR_P (c))
c = BYTE8_TO_CHAR (c);
but this is missing for string sources.
3. (print UNIBYTE-SYM) assumes latin-1;
(prin1-to-string (make-symbol "a\xff")) -> "aÿ"
Here the reason is that print_object() calls `fetch_string_char_advance` instead of `fetch_string_char_as_multibyte_advance`.
The above three bugs are clear omissions and were never intended behaviour; a lot happened in the switch to multibyte and bugs were bound to appear in the cracks. There should be no downside from fixing them.
We may want to ask ourselves whether it's reasonable that read sources have a multibyteness, which affects how symbols are read but not string literals. I don't think it should affect either. However, I'm leaving this concern out of the immediate discussion.
I also have a patch that improves reader performance while cleaning up some parts of the code, but it can be applied before or after fixing the three bugs above.
[0001-Read-characters-from-functions-as-multibyte.patch (application/octet-stream, attachment)]
[0002-Print-non-ASCII-chars-in-unibyte-symbols-as-raw-byte.patch (application/octet-stream, attachment)]
[0003-Read-non-ASCII-chars-from-unibyte-string-sources-as-.patch (application/octet-stream, attachment)]
[Message part 5 (text/plain, inline)]
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Sat, 05 Jul 2025 15:06:05 GMT)
Full text and
rfc822 format available.
Message #65 received at 70988 <at> debbugs.gnu.org (full text, mbox):
> The above three bugs are clear omissions and were never intended
> behaviour; a lot happened in the switch to multibyte and bugs were
> bound to appear in the cracks. There should be no downside from
> fixing them.
The three patches look good to me (thanks especially for the tests).
Stefan
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#70988
; Package
emacs
.
(Mon, 07 Jul 2025 16:26:02 GMT)
Full text and
rfc822 format available.
Message #68 received at 70988 <at> debbugs.gnu.org (full text, mbox):
5 juli 2025 kl. 17.05 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:
> The three patches look good to me (thanks especially for the tests).
Thank you, pushed to master. The speed-up patch mentioned earlier will be applied shortly as well.
Reply sent
to
Mattias Engdegård <mattias.engdegard <at> gmail.com>
:
You have taken responsibility.
(Thu, 10 Jul 2025 09:12:04 GMT)
Full text and
rfc822 format available.
Notification sent
to
Mattias Engdegård <mattias.engdegard <at> gmail.com>
:
bug acknowledged by developer.
(Thu, 10 Jul 2025 09:12:05 GMT)
Full text and
rfc822 format available.
Message #73 received at 70988-done <at> debbugs.gnu.org (full text, mbox):
> The speed-up patch mentioned earlier will be applied shortly as well.
That patch is now in master too, and while there is more work to be done in that regard we are long past the scope of this bug which is hereby closed.
This bug report was last modified 9 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.