GNU bug report logs - #79376
[PATCH] [WIP] Fix mm multibyte


Package: emacs;

Reported by: Manuel Giraud <manuel <at> ledu-giraud.fr>

Date: Wed, 3 Sep 2025 09:34:02 UTC

Severity: normal

Tags: patch

To reply to this bug, email your comments to 79376 AT debbugs.gnu.org.



Report forwarded to larsi <at> gnus.org, morioka <at> jaist.ac.jp, bug-gnu-emacs <at> gnu.org:
bug#79376; Package emacs. (Wed, 03 Sep 2025 09:34:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Manuel Giraud <manuel <at> ledu-giraud.fr>:
New bug report received and forwarded. Copy sent to larsi <at> gnus.org, morioka <at> jaist.ac.jp, bug-gnu-emacs <at> gnu.org. (Wed, 03 Sep 2025 09:34:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Manuel Giraud <manuel <at> ledu-giraud.fr>
To: bug-gnu-emacs <at> gnu.org
Subject: [PATCH] [WIP] Fix mm multibyte
Date: Wed, 03 Sep 2025 11:33:33 +0200
[Message part 1 (text/plain, inline)]
Tags: patch

Hi,

I'm trying to fix an issue in Gnus where some Atom sources (namely
planet.emacslife.com/atom.xml, here) are not rendered correctly.

This seems to be related to multibyte/unibyte buffers.  Here is a minimal
example to reproduce what I see:

--8<---------------cut here---------------start------------->8---
(defun my/gen-handle ()
  (with-current-buffer (get-buffer-create " foo")
    (erase-buffer)
    (insert "’…")
    (list (current-buffer) '("text/html"))))

(defun my/test ()
  (let ((handle (my/gen-handle)))
    (mm-with-part handle
      (buffer-string))))
--8<---------------cut here---------------end--------------->8---

When evaluating (my/test), see that the buffer string content does not
have the correct characters.

I get the behaviour I wanted with the attached patch but I don't know if
this is the way to handle this.

In GNU Emacs 31.0.50 (build 36, x86_64-unknown-openbsd7.7) of 2025-09-03
 built on computer
Repository revision: 6762ffca6b387df73b62db1adcec127317328604
Repository branch: mgi/mm-multibyte-wip
Windowing system distributor 'The X.Org Foundation', version 11.0.12101018
System Description: OpenBSD computer 7.7 GENERIC.MP#10 amd64

Configured using:
 'configure CPPFLAGS=-I/usr/local/include LDFLAGS=-L/usr/local/lib
 MAKEINFO=gmakeinfo --prefix=/home/manuel/emacs
 --bindir=/home/manuel/bin --with-x-toolkit=no
 --with-toolkit-scroll-bars=no --without-cairo --without-dbus
 --without-gconf --without-gsettings --without-compress-install'

[0001-WIP-Fix-mm-multibyte.patch (text/x-patch, attachment)]
[Message part 3 (text/plain, inline)]
-- 
Manuel Giraud

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79376; Package emacs. (Wed, 03 Sep 2025 12:58:02 GMT) Full text and rfc822 format available.

Message #8 received at 79376 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Manuel Giraud <manuel <at> ledu-giraud.fr>
Cc: larsi <at> gnus.org, morioka <at> jaist.ac.jp, 79376 <at> debbugs.gnu.org
Subject: Re: bug#79376: [PATCH] [WIP] Fix mm multibyte
Date: Wed, 03 Sep 2025 15:57:36 +0300
> Cc: Lars Magne Ingebrigtsen <larsi <at> gnus.org>,
>  MORIOKA Tomohiko <morioka <at> jaist.ac.jp>
> From: Manuel Giraud <manuel <at> ledu-giraud.fr>
> Date: Wed, 03 Sep 2025 11:33:33 +0200
> 
> I'm trying to fix an issue in Gnus where some Atom sources (namely
> planet.emacslife.com/atom.xml, here) are not rendered correctly.
> 
> This seems to be related to multibyte/unibyte buffers.  Here is a minimal
> example to reproduce what I see:
> 
> --8<---------------cut here---------------start------------->8---
> (defun my/gen-handle ()
>   (with-current-buffer (get-buffer-create " foo")
>     (erase-buffer)
>     (insert "’…")
>     (list (current-buffer) '("text/html"))))
> 
> (defun my/test ()
>   (let ((handle (my/gen-handle)))
>     (mm-with-part handle
>       (buffer-string))))
> --8<---------------cut here---------------end--------------->8---
> 
> When evaluating (my/test), see that the buffer string content does not
> have the correct characters.

Hmm...  I'm not familiar with this code, but the comment in
mm-with-part says:

  ;; The handle-buffer's content is a sequence of bytes, not a sequence of
  ;; chars, so the buffer should be unibyte.  It may happen that the
  ;; handle-buffer is multibyte for some reason, in which case now is a good
  ;; time to adjust it, since we know at this point that it should
  ;; be unibyte.

But your test case inserts a multibyte string into the buffer, so
aren't you violating what this macro expects and should handle?  And
also, is a call to buffer-string something that this macro's body is
useful for?
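
As a minimal sketch of what "adjusting" a multibyte buffer to unibyte can look like, assuming the adjustment amounts to something like `set-buffer-multibyte' with a nil argument (the thread does not show the actual mm-with-part code), the following exposes the internal bytes of the test string.  The helper name is hypothetical and not part of mm-decode.el.

```elisp
;; Hypothetical helper for illustration only; not part of mm-decode.el.
(defun my/bytes-after-unibyte-switch (text)
  "Insert TEXT in a multibyte temp buffer, switch it to unibyte, return contents."
  (with-temp-buffer
    (insert text)               ; temp buffers start out multibyte
    (set-buffer-multibyte nil)  ; keep the bytes, drop the character view
    (buffer-string)))

;; "\u2019\u2026" is the two-character string used in the test case.
(let ((bytes (my/bytes-after-unibyte-switch "\u2019\u2026")))
  (list (multibyte-string-p bytes)  ; nil: the result is a unibyte string
        (length bytes)))            ; 6: three UTF-8 bytes per character
```

Read this way, a body that calls buffer-string inside such a buffer gets six raw bytes rather than two characters, and the bytes would still need decoding before display.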

Apologies if I'm not making sense.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79376; Package emacs. (Wed, 03 Sep 2025 13:54:02 GMT) Full text and rfc822 format available.

Message #11 received at 79376 <at> debbugs.gnu.org (full text, mbox):

From: Manuel Giraud <manuel <at> ledu-giraud.fr>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: larsi <at> gnus.org, morioka <at> jaist.ac.jp, 79376 <at> debbugs.gnu.org
Subject: Re: bug#79376: [PATCH] [WIP] Fix mm multibyte
Date: Wed, 03 Sep 2025 15:53:14 +0200
Eli Zaretskii <eliz <at> gnu.org> writes:

>> Cc: Lars Magne Ingebrigtsen <larsi <at> gnus.org>,
>>  MORIOKA Tomohiko <morioka <at> jaist.ac.jp>
>> From: Manuel Giraud <manuel <at> ledu-giraud.fr>
>> Date: Wed, 03 Sep 2025 11:33:33 +0200
>> 
>> I'm trying to fix an issue in Gnus where some Atom sources (namely
>> planet.emacslife.com/atom.xml, here) are not rendered correctly.
>> 
>> This seems to be related to multibyte/unibyte buffers.  Here is a minimal
>> example to reproduce what I see:
>> 
>> --8<---------------cut here---------------start------------->8---
>> (defun my/gen-handle ()
>>   (with-current-buffer (get-buffer-create " foo")
>>     (erase-buffer)
>>     (insert "’…")
>>     (list (current-buffer) '("text/html"))))
>> 
>> (defun my/test ()
>>   (let ((handle (my/gen-handle)))
>>     (mm-with-part handle
>>       (buffer-string))))
>> --8<---------------cut here---------------end--------------->8---
>> 
>> When evaluating (my/test), see that the buffer string content does not
>> have the correct characters.
>
> Hmm...  I'm not familiar with this code, but the comment in
> mm-with-part says:
>
>   ;; The handle-buffer's content is a sequence of bytes, not a sequence of
>   ;; chars, so the buffer should be unibyte.  It may happen that the
>   ;; handle-buffer is multibyte for some reason, in which case now is a good
>   ;; time to adjust it, since we know at this point that it should
>   ;; be unibyte.
>
> But your test case inserts a multibyte string into the buffer, so
> aren't you violating what this macro expects and should handle?

Yes, I've seen this comment and I do think that I'm violating what is
expected here… but then so does `mm-shr' (in my example trying to read
"planet.emacslife.com/atom.xml").

> And also, is a call to buffer-string something that this macro's body
> is useful for?

In mm-decode.el:1903, there is the following code:

(decode-coding-string (buffer-string) coding)
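
A small sketch of that decoding step in isolation, assuming the part's bytes are UTF-8 (in mm-decode.el the value of `coding' is determined elsewhere, and the names below are hypothetical):

```elisp
;; Raw UTF-8 bytes of the two test characters, written with octal escapes
;; so that the literal is a unibyte string.
(defconst my/raw-bytes "\342\200\231\342\200\246")

;; Hypothetical wrapper; it only forwards to `decode-coding-string'.
(defun my/decode-part (bytes coding)
  "Decode the unibyte string BYTES with CODING and return a multibyte string."
  (decode-coding-string bytes coding))

;; Yields a two-character multibyte string: U+2019 followed by U+2026.
(my/decode-part my/raw-bytes 'utf-8)
```

This only gives the expected characters when the string handed to decode-coding-string really holds the original bytes, which is why the unibyteness of the buffer matters here.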

> Apologies if I'm not making sense.

No, I think you're perfectly on point.
-- 
Manuel Giraud




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79376; Package emacs. (Thu, 04 Sep 2025 09:48:02 GMT) Full text and rfc822 format available.

Message #14 received at 79376 <at> debbugs.gnu.org (full text, mbox):

From: Manuel Giraud <manuel <at> ledu-giraud.fr>
To: 79376 <at> debbugs.gnu.org
Cc: Lars Magne Ingebrigtsen <larsi <at> gnus.org>, Eli Zaretskii <eliz <at> gnu.org>,
 MORIOKA Tomohiko <morioka <at> jaist.ac.jp>
Subject: Re: bug#79376: [PATCH] [WIP] Fix mm multibyte
Date: Thu, 04 Sep 2025 11:47:48 +0200
[Message part 1 (text/plain, inline)]
Hi,

Hopefully, this new patch is a better fix.  AFAIU, with this, the
content of the temporary MIME buffer is preserved as unibyte (as it
should?) and its content is encoded from a possibly multibyte buffer.

FWIW, I no longer use `insert-buffer-substring', as it uses
`string-make-unibyte', which does not do TRT.

[0001-Do-preserve-MIME-buffer-as-unibyte.patch (text/x-patch, attachment)]
[Message part 3 (text/plain, inline)]
-- 
Manuel Giraud

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79376; Package emacs. (Sat, 13 Sep 2025 08:18:01 GMT) Full text and rfc822 format available.

Message #17 received at 79376 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Manuel Giraud <manuel <at> ledu-giraud.fr>
Cc: larsi <at> gnus.org, morioka <at> jaist.ac.jp, 79376 <at> debbugs.gnu.org
Subject: Re: bug#79376: [PATCH] [WIP] Fix mm multibyte
Date: Sat, 13 Sep 2025 11:17:21 +0300
> From: Manuel Giraud <manuel <at> ledu-giraud.fr>
> Cc: Lars Magne Ingebrigtsen <larsi <at> gnus.org>,  MORIOKA Tomohiko
>  <morioka <at> jaist.ac.jp>,
>     Eli Zaretskii <eliz <at> gnu.org>
> Date: Thu, 04 Sep 2025 11:47:48 +0200
> 
> Hopefully, this new patch is a better fix.  AFAIU, with this, the
> content of the temporary MIME buffer is preserved as unibyte (as it
> should?) and its content is encoded from a possibly multibyte buffer.

I'm still not convinced this is the correct fix, see below.

> FWIW, I no longer use `insert-buffer-substring', as it uses
> `string-make-unibyte', which does not do TRT.

How is that not TRT, can you tell the details?  (In any case, the doc
string of insert-buffer-substring is misleading, since the function
doesn't call string-make-unibyte, at least not directly.)

I feel that we should take a step back and examine your original
problem in more detail.  In your OP, you said "I'm trying to fix an
issue in Gnus where some Atom sources (namely
planet.emacslife.com/atom.xml, here) are not rendered correctly", but
never told the details.  Can we please see those details?

I'm asking because it is not clear to me that unconditionally making
the buffer returned by mm-copy-to-buffer unibyte is TRT.  And if it
must be unibyte, it isn't clear to me why inserting stuff there
the way the existing code base does is incorrect.

>  (defun mm-copy-to-buffer ()
>    "Copy the contents of the current buffer to a fresh buffer."
> -  (let ((obuf (current-buffer))
> -        (mb enable-multibyte-characters)
> -        beg)
> +  (let (content)
>      (goto-char (point-min))
>      (search-forward-regexp "^\n" nil 'move) ;; There might be no body.
> -    (setq beg (point))
> +    (setq content (buffer-substring (point) (point-max)))
>      (with-current-buffer
>            (generate-new-buffer " *mm*")
>        ;; Preserve the data's unibyteness (for url-insert-file-contents).
> -      (set-buffer-multibyte mb)
> -      (insert-buffer-substring obuf beg)
> +      (set-buffer-multibyte nil)
> +      (insert (encode-coding-string content 'undecided))
>        (current-buffer))))

The ELisp manual explicitly recommends against using 'undecided' when
encoding, so at the very least this needs to be rethought.  Also, your
change has the disadvantage of consing a string, where the original
code doesn't.  But these details should be considered once we have a
clear understanding of the problem which prompted you to make changes
there.
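
For illustration only, here is a sketch of the same copy with an explicit coding system passed in instead of 'undecided'.  The function name is hypothetical, and choosing utf-8 in the usage below is an assumption; the right coding system would have to come from the part being processed.  It still conses a string, so it does not address that second point.

```elisp
;; Hypothetical helper; not a proposed replacement for mm-copy-to-buffer.
(defun my/copy-text-as-bytes (content coding)
  "Return a fresh unibyte buffer holding CONTENT encoded with CODING."
  (with-current-buffer (generate-new-buffer " *my-bytes*")
    (set-buffer-multibyte nil)                      ; buffer holds raw bytes
    (insert (encode-coding-string content coding))  ; explicit, not 'undecided
    (current-buffer)))

;; Usage: the two test characters become six bytes under utf-8 (assumed).
(let ((buf (my/copy-text-as-bytes "\u2019\u2026" 'utf-8)))
  (prog1 (with-current-buffer buf (buffer-size))
    (kill-buffer buf)))
```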

Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79376; Package emacs. (Sat, 13 Sep 2025 10:28:01 GMT) Full text and rfc822 format available.

Message #20 received at 79376 <at> debbugs.gnu.org (full text, mbox):

From: Manuel Giraud <manuel <at> ledu-giraud.fr>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: larsi <at> gnus.org, morioka <at> jaist.ac.jp, 79376 <at> debbugs.gnu.org
Subject: Re: bug#79376: [PATCH] [WIP] Fix mm multibyte
Date: Sat, 13 Sep 2025 12:27:50 +0200
Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Manuel Giraud <manuel <at> ledu-giraud.fr>
>> Cc: Lars Magne Ingebrigtsen <larsi <at> gnus.org>,  MORIOKA Tomohiko
>>  <morioka <at> jaist.ac.jp>,
>>     Eli Zaretskii <eliz <at> gnu.org>
>> Date: Thu, 04 Sep 2025 11:47:48 +0200
>> 
>> Hopefully, this new patch is a better fix.  AFAIU, with this, the
>> content of the temporary MIME buffer is preserved as unibyte (as it
>> should?) and its content is encoded from a possibly multibyte buffer.
>
> I'm still not convinced this is the correct fix, see below.
>
>> FWIW, I no longer use `insert-buffer-substring', as it uses
>> `string-make-unibyte', which does not do TRT.
>
> How is that not TRT, can you tell the details?  (In any case, the doc
> string of insert-buffer-substring is misleading, since the function
> doesn't call string-make-unibyte, at least not directly.)

OK, my assumption was based on the docstring only, so never mind.

> I feel that we should take a step back and examine your original
> problem in more detail.  In your OP, you said "I'm trying to fix an
> issue in Gnus where some Atom sources (namely
> planet.emacslife.com/atom.xml, here) are not rendered correctly", but
> never told the details.  Can we please see those details?

Yes of course.  When I want to read an entry from
planet.emacslife.com/atom.xml, the article buffer contains, for example,
the following excerpt:

--8<---------------cut here---------------start------------->8---
Roman Numerals. On the one hand, its hard to understand why anyone cares
anymore. Some, like the late Rich Stevens considered them an anachronistic
barbarism and labeled his books Volume 1, 2, & rather than the more
conventional Volume I, II, &. Others continue to label volumes with the
conventional Roman numerals and, of course, theres all those buildings with
their erection date labeled, of course, with Roman numerals on their facade. 
--8<---------------cut here---------------end--------------->8---

I expect to see: "On the one hand, it’s hard to understand..." and
"books “Volume 1, 2, …” rather".  This is what I'm trying to fix here.
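
One observation about the excerpt, offered as a hypothesis rather than a diagnosis: the "&" that shows up where "…" is expected is what is left if only the low 8 bits of each character code survive somewhere along the way.  The arithmetic can be checked with a hypothetical helper:

```elisp
;; Hypothetical helper for the arithmetic only; it does not model any
;; particular Emacs function.
(defun my/low-byte (char)
  "Return the low 8 bits of the character code CHAR."
  (logand char #xff))

;; The four characters missing from the excerpt: ’ “ ” …
(mapcar (lambda (c) (format "U+%04X -> #x%02X" c (my/low-byte c)))
        '(#x2019 #x201C #x201D #x2026))
;; U+2026 maps to #x26, the ASCII code of "&"; the three quote characters
;; map to the control codes #x19, #x1C and #x1D.
```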

FWIW, I've opened the file which seems to have the content of an Atom
source (here: ~/News/atom/planet.emacslife.com.atom.xml.eld) and this
file is encoded in UTF-8 and such strings are displayed correctly.

> I'm asking because it is not clear to me that unconditionally making
> the buffer returned by mm-copy-to-buffer unibyte is TRT.  And if it
> must be unibyte, it isn't clear to me why inserting stuff there
> the way the existing code base does is incorrect.
>
>>  (defun mm-copy-to-buffer ()
>>    "Copy the contents of the current buffer to a fresh buffer."
>> -  (let ((obuf (current-buffer))
>> -        (mb enable-multibyte-characters)
>> -        beg)
>> +  (let (content)
>>      (goto-char (point-min))
>>      (search-forward-regexp "^\n" nil 'move) ;; There might be no body.
>> -    (setq beg (point))
>> +    (setq content (buffer-substring (point) (point-max)))
>>      (with-current-buffer
>>            (generate-new-buffer " *mm*")
>>        ;; Preserve the data's unibyteness (for url-insert-file-contents).
>> -      (set-buffer-multibyte mb)
>> -      (insert-buffer-substring obuf beg)
>> +      (set-buffer-multibyte nil)
>> +      (insert (encode-coding-string content 'undecided))
>>        (current-buffer))))
>
> The ELisp manual explicitly recommends against using 'undecided' when
> encoding, so at the very least this needs to be rethought.  

OK, I was not aware of this.

> Also, your change has the disadvantage of consing a string, where the
> original code doesn't.

Fair enough.

> But these details should be considered once we have a clear
> understanding of the problem which prompted you to make changes
> there.
>
> Thanks.
>
>
-- 
Manuel Giraud




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79376; Package emacs. (Sat, 13 Sep 2025 11:04:02 GMT) Full text and rfc822 format available.

Message #23 received at 79376 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Manuel Giraud <manuel <at> ledu-giraud.fr>
Cc: larsi <at> gnus.org, morioka <at> jaist.ac.jp, 79376 <at> debbugs.gnu.org
Subject: Re: bug#79376: [PATCH] [WIP] Fix mm multibyte
Date: Sat, 13 Sep 2025 14:02:50 +0300
> From: Manuel Giraud <manuel <at> ledu-giraud.fr>
> Cc: 79376 <at> debbugs.gnu.org,  larsi <at> gnus.org,  morioka <at> jaist.ac.jp
> Date: Sat, 13 Sep 2025 12:27:50 +0200
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> > I feel that we should take a step back and examine your original
> > problem in more detail.  In your OP, you said "I'm trying to fix an
> > issue in Gnus where some Atom sources (namely
> > planet.emacslife.com/atom.xml, here) are not rendered correctly", but
> > never told the details.  Can we please see those details?
> 
> Yes of course.  When I want to read an entry from
> planet.emacslife.com/atom.xml, the article buffer contains, for example,
> the following excerpt:
> 
> --8<---------------cut here---------------start------------->8---
> Roman Numerals. On the one hand, its hard to understand why anyone cares
> anymore. Some, like the late Rich Stevens considered them an anachronistic
> barbarism and labeled his books Volume 1, 2, & rather than the more
> conventional Volume I, II, &. Others continue to label volumes with the
> conventional Roman numerals and, of course, theres all those buildings with
> their erection date labeled, of course, with Roman numerals on their facade. 
> --8<---------------cut here---------------end--------------->8---
> 
> I expect to see: "On the one hand, it’s hard to understand..." and
> "books “Volume 1, 2, …” rather".  This is what I'm trying to fix here.
> 
> FWIW, I've opened the file which seems to have the content of an Atom
> source (here: ~/News/atom/planet.emacslife.com.atom.xml.eld) and this
> file is encoded in UTF-8 and such strings are displayed correctly.

Thanks, but this is not enough for me to understand the root cause(s).
Could you take me through the code involved in processing that text
until it gets to mm-copy-to-buffer, and tell what should be its
processing afterwards?

(If someone who knows the Gnus code reads this and has suggestions,
please feel free to chime in.  I'm only trying to help Manuel fix this
because no one else chimes in.)




This bug report was last modified today.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.