#74922 - 29.4; copy_string_contents doesn't always produce a valid utf-8

GNU bug report logs - #74922
29.4; copy_string_contents doesn't always produce a valid utf-8

Package: emacs;

Reported by: Evgeny Kurnevsky <kurnevsky <at> gmail.com>

Date: Tue, 17 Dec 2024 06:09:01 UTC

Severity: normal

Found in version 29.4

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

Message #23 received at 74922 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: kurnevsky <at> gmail.com
Cc: 74922 <at> debbugs.gnu.org
Subject: Re: bug#74922: Fwd: bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8
Date: Sat, 21 Dec 2024 14:09:24 +0200

> Cc: 74922 <at> debbugs.gnu.org
> Date: Tue, 17 Dec 2024 17:10:36 +0200
> From: Eli Zaretskii <eliz <at> gnu.org>
>
> > From: Evgeny Kurnevsky <kurnevsky <at> gmail.com>
> > Date: Tue, 17 Dec 2024 14:46:28 +0000
> > Cc: 74922 <at> debbugs.gnu.org
> >
> > It can definitely do it, but I guess in emacs-module-rs it's not done by default because of performance
> > implications - it might be quite costly to check every string in some cases, and it wasn't really clear if emacs
> > can pass an invalid string. So currently this case causes undefined behavior there which results in emacs
> > crash.
>
> What do Rust programs do when they are told to read random files?
> This is the same situation, basically.
>
> And what would the module do if copy_string_contents *did* signal an
> error?

I think I know what happened: you called copy_string_contents with a
unibyte string.  In that case, copy_string_contents will return you
the original string without doing anything.  The code in
copy_string_contents that signals an error relies on the fact that
encoding the input string yields nil if the input includes non-Unicode
characters.  But that cannot be established with unibyte strings,
because a unibyte string doesn't hold characters, it holds raw bytes.

What you should do is make sure the string passed to
copy_string_contents is a multibyte string.  If I do that, i.e.

  (switch-to-buffer "foo")
  (set-buffer-multibyte t)
  (insert-file-contents "/path/to/wg-private-pc.age")
  (setq str1 (buffer-string))

and then call copy_string_contents with the resulting string str1, I
get the result you expected.

You need to realize that copy_string_contents is a variant of
text-encoding routines: it encodes the input multibyte string in
UTF-8.  The encoding routines in Emacs always return unibyte strings
without doing anything, because a unibyte string is already encoded,
or at least is supposed to be encoded.

And before you ask: no, copy_string_contents cannot by itself signal an error if passed a unibyte string, because a unibyte string can legitimately be a valid UTF-8 string. So in this case, copy_string_contents relies on the caller to make sure the input is valid UTF-8.
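
The caller-side check described in the message above can be sketched in Rust, the language of emacs-module-rs. This is only an illustration under stated assumptions: the helper names `checked_string` and `lossy_string` are hypothetical and are not part of emacs-module-rs or of the Emacs module API. The sketch assumes the module has already obtained a byte buffer from copy_string_contents, including the terminating NUL byte that the function writes, and shows two ways to avoid treating unvalidated bytes as UTF-8.

```rust
use std::string::FromUtf8Error;

/// Hypothetical helper (not part of emacs-module-rs): convert a buffer
/// filled by copy_string_contents into a String, validating it first.
pub fn checked_string(mut buf: Vec<u8>) -> Result<String, FromUtf8Error> {
    // The buffer ends with a terminating NUL byte; drop it if present.
    if buf.last() == Some(&0) {
        buf.pop();
    }
    // Returns Err on invalid input instead of risking undefined behavior,
    // which is what String::from_utf8_unchecked would do.
    String::from_utf8(buf)
}

/// Hypothetical lossy alternative: keep going on invalid input by
/// replacing each invalid sequence with U+FFFD.
pub fn lossy_string(buf: &[u8]) -> String {
    // Strip the terminating NUL byte if the buffer has one.
    let trimmed = buf.strip_suffix(&[0u8]).unwrap_or(buf);
    String::from_utf8_lossy(trimmed).into_owned()
}
```

With a unibyte string holding raw bytes, `checked_string` would return an error the module can turn into a Lisp signal, while `lossy_string` would return text with replacement characters. Both cost one linear scan of the buffer, which is the performance trade-off raised in the quoted message.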

This bug report was last modified 189 days ago.
