GNU bug report logs - #74922
29.4; copy_string_contents doesn't always produce a valid utf-8
Reported by: Evgeny Kurnevsky <kurnevsky <at> gmail.com>
Date: Tue, 17 Dec 2024 06:09:01 UTC
Severity: normal
Found in version 29.4
Done: Eli Zaretskii <eliz <at> gnu.org>
Bug is archived. No further changes may be made.
Message #23 received at 74922 <at> debbugs.gnu.org:
> Cc: 74922 <at> debbugs.gnu.org
> Date: Tue, 17 Dec 2024 17:10:36 +0200
> From: Eli Zaretskii <eliz <at> gnu.org>
>
> > From: Evgeny Kurnevsky <kurnevsky <at> gmail.com>
> > Date: Tue, 17 Dec 2024 14:46:28 +0000
> > Cc: 74922 <at> debbugs.gnu.org
> >
> > It can definitely do it, but I guess in emacs-module-rs it's not done by default because of performance
> > implications - it might be quite costly to check every string in some cases, and it wasn't really clear if emacs
> > can pass an invalid string. So currently this case causes undefined behavior there which results in emacs
> > crash.
>
> What do Rust programs do when they are told to read random files?
> This is the same situation, basically.
>
> And what would the module do if copy_string_contents *did* signal an
> error?
I think I know what happened: you called copy_string_contents with a
unibyte string.  In that case, copy_string_contents gives you back the
original string's bytes unchanged, without encoding anything.  The code in
copy_string_contents that signals an error relies on the fact that
encoding the input string yields nil if the input includes non-Unicode
characters. But that cannot be established with unibyte strings,
because a unibyte string doesn't hold characters, it holds raw bytes.
What you should do is make sure the string passed to
copy_string_contents is a multibyte string.  If I do that, i.e.

  (switch-to-buffer "foo")
  (set-buffer-multibyte t)
  (insert-file-contents "/path/to/wg-private-pc.age")
  (setq str1 (buffer-string))

and then call copy_string_contents with the resulting string str1, I
get the result you expected.
You need to realize that copy_string_contents is a variant of
text-encoding routines: it encodes the input multibyte string in
UTF-8.  When given a unibyte string, the encoding routines in Emacs
always return it unchanged, because a unibyte string is already
encoded, or at least is supposed to be.
And before you ask: no, copy_string_contents cannot by itself signal
an error if passed a unibyte string, because a unibyte string can
legitimately be a valid UTF-8 string. So in this case,
copy_string_contents relies on the caller to make sure the input is
valid UTF-8.
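
To illustrate the last point from the module side, here is a minimal
Rust sketch of a caller that checks the bytes itself before treating
them as a string.  It assumes the bytes written by
copy_string_contents (without the trailing NUL) are already available
as a slice.  The helper names checked_module_string and
lossy_module_string are hypothetical; they are not part of
emacs-module-rs or of the Emacs module API.  Only the Rust standard
library is used.

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// Hypothetical helper: `bytes` stands for the buffer filled by
/// copy_string_contents, minus the trailing NUL.  Returns an error
/// instead of invoking undefined behavior when the bytes are not
/// valid UTF-8 (for example, when the input was a unibyte string
/// holding raw file contents).
fn checked_module_string(bytes: &[u8]) -> Result<&str, Utf8Error> {
    // Scans the whole slice once; this is the validation cost.
    std::str::from_utf8(bytes)
}

/// Hypothetical alternative: never fails, but replaces each invalid
/// sequence with U+FFFD, so the original bytes cannot be recovered.
fn lossy_module_string(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}
```

The first helper lets the module report the problem to its caller; the
second trades fidelity for never failing.  Whether to pay for the scan
on every string is a design choice for the binding, which is the
performance concern raised earlier in this thread.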