GNU bug report logs - #70988
(read FUNCTION) uses Latin-1 [PATCH]

Package: emacs;

Reported by: Mattias Engdegård <mattias.engdegard <at> gmail.com>

Date: Thu, 16 May 2024 18:14:01 UTC

Severity: normal

Tags: patch

Done: Mattias Engdegård <mattias.engdegard <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 70988 in the body.
You can then email your comments to 70988 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Thu, 16 May 2024 18:14:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Mattias Engdegård <mattias.engdegard <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 16 May 2024 18:14:01 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Emacs Bug Report <bug-gnu-emacs <at> gnu.org>
Cc: Stefan Monnier <monnier <at> iro.umontreal.ca>
Subject: (read FUNCTION) uses Latin-1 [PATCH]
Date: Thu, 16 May 2024 20:13:18 +0200

[Message part 1 (text/plain, inline)]

When `read` is called with a function as stream argument, the return values of that function are often interpreted as Latin-1 characters with only the 8 low bits used. Example:

(let* ((next '(?A #x12a nil))
       (f (lambda (&rest args)
            (if args
                (push (car args) next)
              (pop next)))))
  (read f))
=> A*   ; expected: AĪ

This is a result of `readchar` setting *multibyte to 0 on this code path.

The reader is not very consistent: inside string and character literals, the code seems to work as expected.

The fix is straightforward (attached).

[read-from-function.diff (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Thu, 16 May 2024 18:49:02 GMT) Full text and rfc822 format available.

Message #8 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: 70988 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Thu, 16 May 2024 21:47:52 +0300

> Cc: Stefan Monnier <monnier <at> iro.umontreal.ca>
> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Thu, 16 May 2024 20:13:18 +0200
> 
> When `read` is called with a function as stream argument, the return values of that function are often interpreted as Latin-1 characters with only the 8 low bits used. Example:
> 
> (let* ((next '(?A #x12a nil))
>        (f (lambda (&rest args)
>             (if args
>                 (push (car args) next)
>               (pop next)))))
>   (read f))
> => A*   ; expected: AĪ
> 
> This is a result of `readchar` setting *multibyte to 0 on this code path.

When is this situation relevant?  How many uses of
function-as-a-stream are there out there?

In general, I wouldn't touch these rare cases with a 3-mile pole.  The
gain is generally very small (satisfaction from some abstract sense of
correctness aside), while the risk to break some code is usually high.
It is better to document this behavior and move on.

> The fix is straightforward (attached).
> 
> diff --git a/src/lread.c b/src/lread.c
> index c92b2ede932..2626272c4e2 100644
> --- a/src/lread.c
> +++ b/src/lread.c
> @@ -422,6 +422,8 @@ readchar (Lisp_Object readcharfun, bool *multibyte)
>        goto read_multibyte;
>      }
>  
> +  if (multibyte)
> +    *multibyte = 1;
>    tem = call0 (readcharfun);

Is it an accident that the code does the same only _after_ the call to
readbyte?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Thu, 16 May 2024 19:48:01 GMT) Full text and rfc822 format available.

Message #11 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 70988 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Thu, 16 May 2024 21:45:56 +0200

16 maj 2024 kl. 20.47 skrev Eli Zaretskii <eliz <at> gnu.org>:

> When is this situation relevant?  How many uses of
> function-as-a-stream are there out there?

Not many is my guess, which is perhaps why it wasn't found before.
I'm doing some performance work on the reader, and quirks in the code like these become obvious.

> Is it an accident that the code does the same only _after_ the call to
> readbyte?

Yes, I have no reason to believe otherwise.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Thu, 16 May 2024 19:55:01 GMT) Full text and rfc822 format available.

Message #14 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: 70988 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Thu, 16 May 2024 22:54:01 +0300

> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Thu, 16 May 2024 21:45:56 +0200
> Cc: 70988 <at> debbugs.gnu.org,
>  monnier <at> iro.umontreal.ca
> 
> > Is it an accident that the code does the same only _after_ the call to
> > readbyte?
> 
> Yes, I have no reason to believe otherwise.

To me, it actually looks as done on purpose.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Fri, 17 May 2024 08:11:01 GMT) Full text and rfc822 format available.

Message #17 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 70988 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Fri, 17 May 2024 10:08:58 +0200

16 maj 2024 kl. 21.54 skrev Eli Zaretskii <eliz <at> gnu.org>:

>>> Is it an accident that the code does the same only _after_ the call to
>>> readbyte?
>> 
>> Yes, I have no reason to believe otherwise.
> 
> To me, it actually looks as done on purpose.

You could very well be right about that. What I meant is that the order doesn't matter at all.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Fri, 17 May 2024 10:49:01 GMT) Full text and rfc822 format available.

Message #20 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: 70988 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Fri, 17 May 2024 13:45:56 +0300

> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Fri, 17 May 2024 10:08:58 +0200
> Cc: 70988 <at> debbugs.gnu.org,
>  monnier <at> iro.umontreal.ca
> 
> 16 maj 2024 kl. 21.54 skrev Eli Zaretskii <eliz <at> gnu.org>:
> 
> >>> Is it an accident that the code does the same only _after_ the call to
> >>> readbyte?
> >> 
> >> Yes, I have no reason to believe otherwise.
> > 
> > To me, it actually looks as done on purpose.
> 
> You could very well be right about that. What I meant is that the order doesn't matter at all.

Doesn't it affect what the readbyte call does?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Fri, 17 May 2024 17:10:02 GMT) Full text and rfc822 format available.

Message #23 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 70988 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Fri, 17 May 2024 19:08:15 +0200

17 maj 2024 kl. 12.45 skrev Eli Zaretskii <eliz <at> gnu.org>:

>>>>> Is it an accident that the code does the same only _after_ the call to
>>>>> readbyte?
>>>> 
>>>> Yes, I have no reason to believe otherwise.
>>> 
>>> To me, it actually looks as done on purpose.
>> 
>> You could very well be right about that. What I meant is that the order doesn't matter at all.
> 
> Doesn't it affect what the readbyte call does?

No -- the `*multibyte = ...` assignment is just an extra return value, which indicates whether the returned values come from a unibyte or multibyte source. For any given source (READCHARFUN, in the terminology of lread.c), the characters will all be unibyte or multibyte, so this returned `multibyte` flag will typically only be used once by the caller and saved for future reference.

But you are right to question it because lread.c is a royal mess and many changes have not been made in a clean way. It is unclear whether it's worth returning the `multibyte` flag at all; it's only used in special cases.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Thu, 30 May 2024 15:45:01 GMT) Full text and rfc822 format available.

Message #26 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 70988 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Thu, 30 May 2024 17:43:19 +0200

After looking further into the Lisp reader/printer I found two more silent Latin-1 assumptions. In all three cases, I firmly believe the following to be true:

* The behaviour is not intended but just code accidents.
* They should hardly affect any user code at all.
* They are nevertheless clear bugs which should be fixed.

Further on this will have to wait until after Emacs 30 has been branched to avoid delaying that more important task.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Wed, 12 Feb 2025 14:54:02 GMT) Full text and rfc822 format available.

Message #29 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Kangas <stefankangas <at> gmail.com>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: 70988 <at> debbugs.gnu.org, Eli Zaretskii <eliz <at> gnu.org>,
 monnier <at> iro.umontreal.ca
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Wed, 12 Feb 2025 06:52:58 -0800

Mattias Engdegård <mattias.engdegard <at> gmail.com> writes:

> After looking further into the Lisp reader/printer I found two more silent Latin-1 assumptions. In all three cases, I firmly believe the following to be true:
>
> * The behaviour is not intended but just code accidents.
> * They should hardly affect any user code at all.
> * They are nevertheless clear bugs which should be fixed.
>
> Further on this will have to wait until after Emacs 30 has been branched to avoid delaying that more important task.

FWIW, the proposed patch looks like a bug fix to me as well, so I think
we should install it.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Wed, 12 Feb 2025 16:43:02 GMT) Full text and rfc822 format available.

Message #32 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Pip Cet <pipcet <at> protonmail.com>
To: Stefan Kangas <stefankangas <at> gmail.com>
Cc: 70988 <at> debbugs.gnu.org,
 Mattias Engdegård <mattias.engdegard <at> gmail.com>,
 Eli Zaretskii <eliz <at> gnu.org>, monnier <at> iro.umontreal.ca
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Wed, 12 Feb 2025 16:42:43 +0000

"Stefan Kangas" <stefankangas <at> gmail.com> writes:

> Mattias Engdegård <mattias.engdegard <at> gmail.com> writes:
>
>> After looking further into the Lisp reader/printer I found two more
>> silent Latin-1 assumptions. In all three cases, I firmly believe the
>> following to be true:
>>
>> * The behaviour is not intended but just code accidents.
>> * They should hardly affect any user code at all.
>> * They are nevertheless clear bugs which should be fixed.
>>
>> Further on this will have to wait until after Emacs 30 has been branched to avoid delaying that more important task.
>
> FWIW, the proposed patch looks like a bug fix to me as well, so I think
> we should install it.

I think we should think about whether we want to force multibyte to true
for all functions, even those never returning non-ASCII chars.  Also,
the code appears to use XFIXNUM on a Lisp_Object that might not be one.

IIUC, the difference is that all-ASCII strings would be unibyte strings
in some circumstances.

The alternative patch would look something like this:

From bbc65c9be7ccebf034f4d10f018a076ef1e8a4e9 Mon Sep 17 00:00:00 2001
From: Pip Cet <pipcet <at> protonmail.com>
Subject: [PATCH] Auto-detect multibyteness of readchar funs (bug#70988)

* src/lread.c (readchar): Set *MULTIBYTE if we detect a multibyte
character.  Return -1 for non-characters rather than crashing.
---
 src/lread.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/src/lread.c b/src/lread.c
index 6af95873bb8..c18c1be3cf5 100644
--- a/src/lread.c
+++ b/src/lread.c
@@ -398,9 +398,12 @@ readchar (Lisp_Object readcharfun, bool *multibyte)
 
   tem = call0 (readcharfun);
 
-  if (NILP (tem))
+  if (!CHARACTERP (tem))
     return -1;
-  return XFIXNUM (tem);
+  if (multibyte && !ASCII_CHAR_P (XFIXNAT (tem)))
+    *multibyte = true;
+
+  return XFIXNAT (tem);
 
  read_multibyte:
   if (unread_char >= 0)
-- 
2.48.1

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Wed, 12 Feb 2025 19:27:02 GMT) Full text and rfc822 format available.

Message #35 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Pip Cet <pipcet <at> protonmail.com>
Cc: 70988 <at> debbugs.gnu.org, mattias.engdegard <at> gmail.com, stefankangas <at> gmail.com,
 monnier <at> iro.umontreal.ca
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Wed, 12 Feb 2025 21:25:48 +0200

> Date: Wed, 12 Feb 2025 16:42:43 +0000
> From: Pip Cet <pipcet <at> protonmail.com>
> Cc: Mattias Engdegård <mattias.engdegard <at> gmail.com>, 70988 <at> debbugs.gnu.org, Eli Zaretskii <eliz <at> gnu.org>, monnier <at> iro.umontreal.ca
> 
> The alternative patch would look something like this:
> 
> >From bbc65c9be7ccebf034f4d10f018a076ef1e8a4e9 Mon Sep 17 00:00:00 2001
> From: Pip Cet <pipcet <at> protonmail.com>
> Subject: [PATCH] Auto-detect multibyteness of readchar funs (bug#70988)
> 
> * src/lread.c (readchar): Set *MULTIBYTE if we detect a multibyte
> character.  Return -1 for non-characters rather than crashing.
> ---
>  src/lread.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/src/lread.c b/src/lread.c
> index 6af95873bb8..c18c1be3cf5 100644
> --- a/src/lread.c
> +++ b/src/lread.c
> @@ -398,9 +398,12 @@ readchar (Lisp_Object readcharfun, bool *multibyte)
>  
>    tem = call0 (readcharfun);
>  
> -  if (NILP (tem))
> +  if (!CHARACTERP (tem))
>      return -1;
> -  return XFIXNUM (tem);
> +  if (multibyte && !ASCII_CHAR_P (XFIXNAT (tem)))
> +    *multibyte = true;
> +
> +  return XFIXNAT (tem);

AFAIU, the proposed patch was just a bugfix, whereas the above also
changes behavior in backward-incompatible ways.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Wed, 12 Feb 2025 20:29:02 GMT) Full text and rfc822 format available.

Message #38 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Pip Cet <pipcet <at> protonmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 70988 <at> debbugs.gnu.org, mattias.engdegard <at> gmail.com, stefankangas <at> gmail.com,
 monnier <at> iro.umontreal.ca
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Wed, 12 Feb 2025 20:27:58 +0000

"Eli Zaretskii" <eliz <at> gnu.org> writes:

>> Date: Wed, 12 Feb 2025 16:42:43 +0000
>> From: Pip Cet <pipcet <at> protonmail.com>
>> Cc: Mattias Engdegård <mattias.engdegard <at> gmail.com>, 70988 <at> debbugs.gnu.org, Eli Zaretskii <eliz <at> gnu.org>, monnier <at> iro.umontreal.ca
>>
>> The alternative patch would look something like this:
>>
>> >From bbc65c9be7ccebf034f4d10f018a076ef1e8a4e9 Mon Sep 17 00:00:00 2001
>> From: Pip Cet <pipcet <at> protonmail.com>
>> Subject: [PATCH] Auto-detect multibyteness of readchar funs (bug#70988)
>>
>> * src/lread.c (readchar): Set *MULTIBYTE if we detect a multibyte
>> character.  Return -1 for non-characters rather than crashing.
>> ---
>>  src/lread.c | 7 +++++--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/src/lread.c b/src/lread.c
>> index 6af95873bb8..c18c1be3cf5 100644
>> --- a/src/lread.c
>> +++ b/src/lread.c
>> @@ -398,9 +398,12 @@ readchar (Lisp_Object readcharfun, bool *multibyte)
>>
>>    tem = call0 (readcharfun);
>>
>> -  if (NILP (tem))
>> +  if (!CHARACTERP (tem))
>>      return -1;
>> -  return XFIXNUM (tem);
>> +  if (multibyte && !ASCII_CHAR_P (XFIXNAT (tem)))
>> +    *multibyte = true;
>> +
>> +  return XFIXNAT (tem);
>
> AFAIU, the proposed patch was just a bugfix, whereas the above also
> changes behavior in backward-incompatible ways.

The other way around, I think: the first proposed patch changed the
behavior of readchar to always set the multibyte flag when a function
was used, resulting in the creation of symbols whose ASCII names are
multibyte strings.  The previous behavior was never to set the multibyte
flag, which was correct for ASCII strings but not multibyte ones.

This patch retains the previous behavior for ASCII symbols, but sets the
multibyte flag for non-ASCII symbols, which seems the best we can do if
we're given a simple function.

If we want to change symbol names to always be multibyte strings, we can
do that, but then we probably want to do that or all streams.

It also fixes yet another XFIXNUM crash, but those (there are more in
lread.c, it seems) should be fixed independently.  However, it does give
us the ability to extend the API so readcharfun could return a single
character string, unibyte or multibyte, to be handled appropriately.

Pip

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Thu, 13 Feb 2025 06:02:02 GMT) Full text and rfc822 format available.

Message #41 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Pip Cet <pipcet <at> protonmail.com>
Cc: 70988 <at> debbugs.gnu.org, mattias.engdegard <at> gmail.com, stefankangas <at> gmail.com,
 monnier <at> iro.umontreal.ca
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Thu, 13 Feb 2025 08:00:57 +0200

> Date: Wed, 12 Feb 2025 20:27:58 +0000
> From: Pip Cet <pipcet <at> protonmail.com>
> Cc: stefankangas <at> gmail.com, mattias.engdegard <at> gmail.com, 70988 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
> 
> "Eli Zaretskii" <eliz <at> gnu.org> writes:
> 
> >> --- a/src/lread.c
> >> +++ b/src/lread.c
> >> @@ -398,9 +398,12 @@ readchar (Lisp_Object readcharfun, bool *multibyte)
> >>
> >>    tem = call0 (readcharfun);
> >>
> >> -  if (NILP (tem))
> >> +  if (!CHARACTERP (tem))
> >>      return -1;
> >> -  return XFIXNUM (tem);
> >> +  if (multibyte && !ASCII_CHAR_P (XFIXNAT (tem)))
> >> +    *multibyte = true;
> >> +
> >> +  return XFIXNAT (tem);
> >
> > AFAIU, the proposed patch was just a bugfix, whereas the above also
> > changes behavior in backward-incompatible ways.
> 
> The other way around, I think: the first proposed patch changed the
> behavior of readchar to always set the multibyte flag when a function
> was used, resulting in the creation of symbols whose ASCII names are
> multibyte strings.  The previous behavior was never to set the multibyte
> flag, which was correct for ASCII strings but not multibyte ones.
> 
> This patch retains the previous behavior for ASCII symbols, but sets the
> multibyte flag for non-ASCII symbols, which seems the best we can do if
> we're given a simple function.

I'm talking about the CHARACTERP test (why not FIXNUMP?), and the
addition of ASCII_CHAR_P test (why would we want an ASCII character
to never be considered multibyte?).

> If we want to change symbol names to always be multibyte strings, we can
> do that, but then we probably want to do that or all streams.

I don't understand why you are talking about symbols: AFAIU this code
is used in many other cases as well.  But even for symbols: why change
the current behavior of making their names multibyte?

> It also fixes yet another XFIXNUM crash, but those (there are more in
> lread.c, it seems) should be fixed independently.

I'm okay with adding a FIXNUMP test (which happens in the debugging
builds anyway, so any violations probably never happen), but using
CHARACTERP changes behavior.

> However, it does give us the ability to extend the API so
> readcharfun could return a single character string, unibyte or
> multibyte, to be handled appropriately.

This is also a change in behavior.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Thu, 13 Feb 2025 10:10:02 GMT) Full text and rfc822 format available.

Message #44 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Pip Cet <pipcet <at> protonmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 70988 <at> debbugs.gnu.org, mattias.engdegard <at> gmail.com, stefankangas <at> gmail.com,
 monnier <at> iro.umontreal.ca
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Thu, 13 Feb 2025 10:08:54 +0000

"Eli Zaretskii" <eliz <at> gnu.org> writes:

>> Date: Wed, 12 Feb 2025 20:27:58 +0000
>> From: Pip Cet <pipcet <at> protonmail.com>
>> Cc: stefankangas <at> gmail.com, mattias.engdegard <at> gmail.com, 70988 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
>>
>> "Eli Zaretskii" <eliz <at> gnu.org> writes:
>>
>> >> --- a/src/lread.c
>> >> +++ b/src/lread.c
>> >> @@ -398,9 +398,12 @@ readchar (Lisp_Object readcharfun, bool *multibyte)
>> >>
>> >>    tem = call0 (readcharfun);
>> >>
>> >> -  if (NILP (tem))
>> >> +  if (!CHARACTERP (tem))
>> >>      return -1;
>> >> -  return XFIXNUM (tem);
>> >> +  if (multibyte && !ASCII_CHAR_P (XFIXNAT (tem)))
>> >> +    *multibyte = true;
>> >> +
>> >> +  return XFIXNAT (tem);
>> >
>> > AFAIU, the proposed patch was just a bugfix, whereas the above also
>> > changes behavior in backward-incompatible ways.
>>
>> The other way around, I think: the first proposed patch changed the
>> behavior of readchar to always set the multibyte flag when a function
>> was used, resulting in the creation of symbols whose ASCII names are
>> multibyte strings.  The previous behavior was never to set the multibyte
>> flag, which was correct for ASCII strings but not multibyte ones.
>>
>> This patch retains the previous behavior for ASCII symbols, but sets the
>> multibyte flag for non-ASCII symbols, which seems the best we can do if
>> we're given a simple function.
>
> I'm talking about the CHARACTERP test (why not FIXNUMP?), and the

The function is supposed to return a character, not just any fixnum.

> addition of ASCII_CHAR_P test (why would we want an ASCII character
> to never be considered multibyte?).

It's the other way around, again: if there's a non-ASCII character, we
treat the stream as multibyte; if there are ONLY ASCII characters, we
treat it as unibyte.

>> If we want to change symbol names to always be multibyte strings, we can
>> do that, but then we probably want to do that or all streams.
>
> I don't understand why you are talking about symbols: AFAIU this code
> is used in many other cases as well.  But even for symbols: why change
> the current behavior of making their names multibyte?

The current behavior is to make their names unibyte!  The current
behavior is *changed* by the first patch, and *retained* by my patch.

>> It also fixes yet another XFIXNUM crash, but those (there are more in
>> lread.c, it seems) should be fixed independently.
>
> I'm okay with adding a FIXNUMP test (which happens in the debugging
> builds anyway, so any violations probably never happen), but using
> CHARACTERP changes behavior.

If you count "avoids further crashes" as "changes behavior", yes.

readcharfun is supposed to return a character or -1.  Some callers
assume the return value is a valid character, and will crash otherwise.
I haven't checked all of them because there are many.

>> However, it does give us the ability to extend the API so
>> readcharfun could return a single character string, unibyte or
>> multibyte, to be handled appropriately.
>
> This is also a change in behavior.

Yes, of course, which is why it's a separate proposal and not part of
the patch.

Pip

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Thu, 13 Feb 2025 10:24:02 GMT) Full text and rfc822 format available.

Message #47 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Pip Cet <pipcet <at> protonmail.com>
Cc: 70988 <at> debbugs.gnu.org, mattias.engdegard <at> gmail.com, stefankangas <at> gmail.com,
 monnier <at> iro.umontreal.ca
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Thu, 13 Feb 2025 12:23:21 +0200

> Date: Thu, 13 Feb 2025 10:08:54 +0000
> From: Pip Cet <pipcet <at> protonmail.com>
> Cc: stefankangas <at> gmail.com, mattias.engdegard <at> gmail.com, 70988 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
> 
> "Eli Zaretskii" <eliz <at> gnu.org> writes:

[...]

This style of "discussion" leads nowhere useful, so I'm bowing out of
it.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Thu, 13 Feb 2025 14:09:01 GMT) Full text and rfc822 format available.

Message #50 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Pip Cet <pipcet <at> protonmail.com>
To: Stefan Kangas <stefankangas <at> gmail.com>
Cc: 70988 <at> debbugs.gnu.org,
 Mattias Engdegård <mattias.engdegard <at> gmail.com>,
 Eli Zaretskii <eliz <at> gnu.org>, monnier <at> iro.umontreal.ca
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Thu, 13 Feb 2025 14:08:02 +0000

"Stefan Kangas" <stefankangas <at> gmail.com> writes:

> Mattias Engdegård <mattias.engdegard <at> gmail.com> writes:
>
>> After looking further into the Lisp reader/printer I found two more
>> silent Latin-1 assumptions. In all three cases, I firmly believe the
>> following to be true:
>>
>> * The behaviour is not intended but just code accidents.
>> * They should hardly affect any user code at all.
>> * They are nevertheless clear bugs which should be fixed.
>>
>> Further on this will have to wait until after Emacs 30 has been branched to avoid delaying that more important task.
>
> FWIW, the proposed patch looks like a bug fix to me as well, so I think
> we should install it.

Here's a patch which avoids a few crashes in lread.c.  If we change the
behavior of readcharfuns' return values to always be treated as
multibyte, but fail to verify that its return value satisfies
CHARACTERP, as proposed, further problems will arise.

In the event that there is interest in fixing Emacs not to crash in
these circumstances, let me know and I can commit this.

Pip

From e9d48abeb199db7d76386639adadba5c7e45177c Mon Sep 17 00:00:00 2001
From: Pip Cet <pipcet <at> protonmail.com>
Subject: [PATCH] Avoid crashes in lread.c when invalid characters are read

* src/lread.c (readchar): Don't crash for non-fixnum return values.
(read_filtered_event): Don't crash for invalid symbol properties.
(Fread_char):
(Fread_char_exclusive):
(character_name_to_code): Check 'FIXNUMP' before using 'XFIXNUM'.
(read_char_escape): Use 'invalid_syntax' rather than crashing for
invalid characters.
---
 src/lread.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/src/lread.c b/src/lread.c
index 6af95873bb8..5875b489c97 100644
--- a/src/lread.c
+++ b/src/lread.c
@@ -398,7 +398,7 @@ readchar (Lisp_Object readcharfun, bool *multibyte)
 
   tem = call0 (readcharfun);
 
-  if (NILP (tem))
+  if (!FIXNUMP (tem))
     return -1;
   return XFIXNUM (tem);
 
@@ -816,7 +816,7 @@ read_filtered_event (bool no_switch_frame, bool ascii_required,
 	      tem1 = Fget (Fcar (tem), Qascii_character);
 	      /* Merge this symbol's modifier bits
 		 with the ASCII equivalent of its basic code.  */
-	      if (!NILP (tem1))
+	      if (FIXNUMP (tem1) && FIXNUMP (Fcar (Fcdr (tem))))
 		XSETFASTINT (val, XFIXNUM (tem1) | XFIXNUM (Fcar (Fcdr (tem))));
 	    }
 	}
@@ -898,7 +898,7 @@ DEFUN ("read-char", Fread_char, Sread_char, 0, 3, 0,
     }
   val = read_filtered_event (1, 1, 1, ! NILP (inherit_input_method), seconds);
 
-  return (NILP (val) ? Qnil
+  return (!FIXNUMP (val) ? Qnil
 	  : make_fixnum (char_resolve_modifier_mask (XFIXNUM (val))));
 }
 
@@ -976,7 +976,7 @@ DEFUN ("read-char-exclusive", Fread_char_exclusive, Sread_char_exclusive, 0, 3,
 
   val = read_filtered_event (1, 1, 0, ! NILP (inherit_input_method), seconds);
 
-  return (NILP (val) ? Qnil
+  return (!FIXNUMP (val) ? Qnil
 	  : make_fixnum (char_resolve_modifier_mask (XFIXNUM (val))));
 }
 
@@ -2820,7 +2820,7 @@ character_name_to_code (char const *name, ptrdiff_t name_len,
       invalid_syntax_lisp (CALLN (Fformat, format, namestr), readcharfun);
     }
 
-  return XFIXNUM (code);
+  return FIXNUMP (code) ? XFIXNUM (code) : -1;
 }
 
 /* Bound on the length of a Unicode character name.  As of
@@ -3058,7 +3058,8 @@ read_char_escape (Lisp_Object readcharfun, int next_char)
       chr = c;
       break;
     }
-  eassert (chr >= 0 && chr < (1 << CHARACTERBITS));
+  if (chr < 0 || chr > (1 << CHARACTERBITS))
+    invalid_syntax ("Invalid character", readcharfun);
 
   /* Apply Control modifiers, using the rules:
      \C-X = ascii_ctrl(nomod(X)) | mods(X)  if nomod(X) is one of:
-- 
2.48.1

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Thu, 13 Feb 2025 16:43:02 GMT) Full text and rfc822 format available.

Message #53 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Pip Cet <pipcet <at> protonmail.com>
Cc: 70988 <at> debbugs.gnu.org, mattias.engdegard <at> gmail.com, stefankangas <at> gmail.com,
 monnier <at> iro.umontreal.ca
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Thu, 13 Feb 2025 18:42:13 +0200

> Date: Thu, 13 Feb 2025 14:08:02 +0000
> From: Pip Cet <pipcet <at> protonmail.com>
> Cc: Mattias Engdegård <mattias.engdegard <at> gmail.com>, 70988 <at> debbugs.gnu.org, Eli Zaretskii <eliz <at> gnu.org>, monnier <at> iro.umontreal.ca
> 
> @@ -3058,7 +3058,8 @@ read_char_escape (Lisp_Object readcharfun, int next_char)
>        chr = c;
>        break;
>      }
> -  eassert (chr >= 0 && chr < (1 << CHARACTERBITS));
> +  if (chr < 0 || chr > (1 << CHARACTERBITS))
> +    invalid_syntax ("Invalid character", readcharfun);

Please leave the assertion in place here (in addition to the error
message), since an abort is easier to spot than an error message,
especially if someone catches errors at a higher level.

I'm okay with the rest of this patch, thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Thu, 13 Feb 2025 17:12:02 GMT) Full text and rfc822 format available.

Message #56 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Pip Cet <pipcet <at> protonmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 70988 <at> debbugs.gnu.org, mattias.engdegard <at> gmail.com, stefankangas <at> gmail.com,
 monnier <at> iro.umontreal.ca
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Thu, 13 Feb 2025 17:11:35 +0000

"Eli Zaretskii" <eliz <at> gnu.org> writes:

>> Date: Thu, 13 Feb 2025 14:08:02 +0000
>> From: Pip Cet <pipcet <at> protonmail.com>
>> Cc: Mattias Engdegård <mattias.engdegard <at> gmail.com>, 70988 <at> debbugs.gnu.org, Eli Zaretskii <eliz <at> gnu.org>, monnier <at> iro.umontreal.ca
>>
>> @@ -3058,7 +3058,8 @@ read_char_escape (Lisp_Object readcharfun, int next_char)
>>        chr = c;
>>        break;
>>      }
>> -  eassert (chr >= 0 && chr < (1 << CHARACTERBITS));
>> +  if (chr < 0 || chr > (1 << CHARACTERBITS))
>> +    invalid_syntax ("Invalid character", readcharfun);
>
> Please leave the assertion in place here (in addition to the error
> message), since an abort is easier to spot than an error message,
> especially if someone catches errors at a higher level.

Just to clarify, you want

@@ -3058,7 +3058,8 @@ read_char_escape (Lisp_Object readcharfun, int next_char)
       chr = c;
       break;
     }
   eassert (chr >= 0 && chr < (1 << CHARACTERBITS));
+  if (chr < 0 || chr >= (1 << CHARACTERBITS))
+    invalid_syntax ("Invalid character", readcharfun);

?

Thanks!

Pip

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Thu, 13 Feb 2025 20:14:02 GMT) Full text and rfc822 format available.

Message #59 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Pip Cet <pipcet <at> protonmail.com>
Cc: 70988 <at> debbugs.gnu.org, mattias.engdegard <at> gmail.com, stefankangas <at> gmail.com,
 monnier <at> iro.umontreal.ca
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Thu, 13 Feb 2025 22:13:04 +0200

> Date: Thu, 13 Feb 2025 17:11:35 +0000
> From: Pip Cet <pipcet <at> protonmail.com>
> Cc: stefankangas <at> gmail.com, mattias.engdegard <at> gmail.com, 70988 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
> 
> "Eli Zaretskii" <eliz <at> gnu.org> writes:
> 
> > Please leave the assertion in place here (in addition to the error
> > message), since an abort is easier to spot than an error message,
> > especially if someone catches errors at a higher level.
> 
> Just to clarify, you want
> 
> @@ -3058,7 +3058,8 @@ read_char_escape (Lisp_Object readcharfun, int next_char)
>        chr = c;
>        break;
>      }
>    eassert (chr >= 0 && chr < (1 << CHARACTERBITS));
> +  if (chr < 0 || chr >= (1 << CHARACTERBITS))
> +    invalid_syntax ("Invalid character", readcharfun);
> 
> ?

Yes, exactly.  Apologies for not making that clear.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Sat, 05 Jul 2025 11:28:03 GMT) Full text and rfc822 format available.

Message #62 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 70988 <at> debbugs.gnu.org, Pip Cet <pipcet <at> protonmail.com>,
 Stefan Monnier <monnier <at> iro.umontreal.ca>
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Sat, 5 Jul 2025 13:27:26 +0200

[Message part 1 (text/plain, inline)]

Sorry about the delay. The bugs I was talking about earlier are:

1. (read FUNCTION) assumes latin-1, as discussed earlier in this bug.

   The code in readchar() just forgets to set the multibyte flag for function sources.

2. (read UNIBYTE-STRING) assumes latin-1:

   (read "\"a\xff\"") -> "aÿ"

   For buffer and marker sources, readchar() does

	  if (! ASCII_CHAR_P (c))
	    c = BYTE8_TO_CHAR (c);

   but this is missing for string sources.

3. (print UNIBYTE-SYM) assumes latin-1;

   (prin1-to-string (make-symbol "a\xff")) -> "aÿ"

   Here the reason is that print_object() calls `fetch_string_char_advance` instead of `fetch_string_char_as_multibyte_advance`.

The above three bugs are clear omissions and were never intended behaviour; a lot happened in the switch to multibyte and bugs were bound to appear in the cracks. There should be no downside from fixing them.

We may want to ask ourselves whether it's reasonable that read sources have a multibyteness, which affects how symbols are read but not string literals. I don't think it should affect either. However, I'm leaving this concern out of the immediate discussion.

I also have a patch that improves reader performance while cleaning up some parts of the code, but it can be applied before or after fixing the three bugs above.

[0001-Read-characters-from-functions-as-multibyte.patch (application/octet-stream, attachment)]

[0002-Print-non-ASCII-chars-in-unibyte-symbols-as-raw-byte.patch (application/octet-stream, attachment)]

[0003-Read-non-ASCII-chars-from-unibyte-string-sources-as-.patch (application/octet-stream, attachment)]

[Message part 5 (text/plain, inline)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Sat, 05 Jul 2025 15:06:05 GMT) Full text and rfc822 format available.

Message #65 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: 70988 <at> debbugs.gnu.org, Eli Zaretskii <eliz <at> gnu.org>,
 Pip Cet <pipcet <at> protonmail.com>
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Sat, 05 Jul 2025 11:05:02 -0400

> The above three bugs are clear omissions and were never intended
> behaviour; a lot happened in the switch to multibyte and bugs were
> bound to appear in the cracks. There should be no downside from
> fixing them.

The three patches look good to me (thanks especially for the tests).


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#70988; Package emacs. (Mon, 07 Jul 2025 16:26:02 GMT) Full text and rfc822 format available.

Message #68 received at 70988 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 70988 <at> debbugs.gnu.org, Eli Zaretskii <eliz <at> gnu.org>,
 Pip Cet <pipcet <at> protonmail.com>
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Mon, 7 Jul 2025 18:25:13 +0200

5 juli 2025 kl. 17.05 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:

> The three patches look good to me (thanks especially for the tests).

Thank you, pushed to master. The speed-up patch mentioned earlier will be applied shortly as well.

Reply sent to Mattias Engdegård <mattias.engdegard <at> gmail.com>:
You have taken responsibility. (Thu, 10 Jul 2025 09:12:04 GMT) Full text and rfc822 format available.

Notification sent to Mattias Engdegård <mattias.engdegard <at> gmail.com>:
bug acknowledged by developer. (Thu, 10 Jul 2025 09:12:05 GMT) Full text and rfc822 format available.

Message #73 received at 70988-done <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 70988-done <at> debbugs.gnu.org,
 Pip Cet <pipcet <at> protonmail.com>
Subject: Re: bug#70988: (read FUNCTION) uses Latin-1 [PATCH]
Date: Thu, 10 Jul 2025 11:11:43 +0200

> The speed-up patch mentioned earlier will be applied shortly as well.

That patch is now in master too, and while there is more work to be done in that regard we are long past the scope of this bug which is hereby closed.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 07 Aug 2025 11:24:13 GMT) Full text and rfc822 format available.

This bug report was last modified 36 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #70988 (read FUNCTION) uses Latin-1 [PATCH]

GNU bug report logs - #70988
(read FUNCTION) uses Latin-1 [PATCH]