GNU bug report logs - #74922
29.4; copy_string_contents doesn't always produce a valid utf-8

Package: emacs;

Reported by: Evgeny Kurnevsky <kurnevsky <at> gmail.com>

Date: Tue, 17 Dec 2024 06:09:01 UTC

Severity: normal

Found in version 29.4

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: tracker <at> debbugs.gnu.org
Subject: bug#74922: closed (29.4; copy_string_contents doesn't always
 produce a valid utf-8)
Date: Sat, 04 Jan 2025 11:40:02 +0000

[Message part 1 (text/plain, inline)]

Your message dated Sat, 04 Jan 2025 13:39:25 +0200
with message-id <86o70merki.fsf <at> gnu.org>
and subject line Re: bug#74922: Fwd: bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8
has caused the debbugs.gnu.org bug report #74922,
regarding 29.4; copy_string_contents doesn't always produce a valid utf-8
to be marked as done.

(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)


-- 
74922: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=74922
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems

[Message part 2 (message/rfc822, inline)]

From: Evgeny Kurnevsky <kurnevsky <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: 29.4; copy_string_contents doesn't always produce a valid utf-8
Date: Tue, 17 Dec 2024 06:08:30 +0000

[Message part 3 (text/plain, inline)]

According to the documentation and the comment inside
module_copy_string_contents, it should always produce a valid UTF-8
string that can be used in dynamic modules, but that does not always
seem to be the case. I encountered an Emacs crash when using
emacs-module-rs, because it always expects strings to be valid UTF-8.
To reproduce, call:

  (some-function-from-dynamic-library
   (encode-coding-string (f-read-text "wg-private-pc.age") 'utf-8 t))

The file is
https://github.com/kurnevsky/nixfiles/raw/0b3de016dac551398627a55788b80d4809afcbf9/secrets/wg-private-pc.age

See https://github.com/ubolonton/emacs-module-rs/issues/58 for additional
details.
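
The failure mode described above, bytes being treated as UTF-8 without a
check, can be illustrated with a minimal Rust sketch that uses only the
standard library. The helper names checked_str and valid_prefix_len are
hypothetical and are not part of emacs-module-rs; the sketch only shows
what a validating conversion looks like, not how that crate is
implemented.

```rust
use std::str::Utf8Error;

/// Validate bytes received from the host before treating them as text.
/// Hypothetical helper for illustration, not an emacs-module-rs API.
pub fn checked_str(bytes: &[u8]) -> Result<&str, Utf8Error> {
    // Linear scan over the input; returns Err at the first byte
    // sequence that is not valid UTF-8.
    std::str::from_utf8(bytes)
}

/// Number of leading bytes that form valid UTF-8.
/// Useful for reporting where the invalid data starts.
pub fn valid_prefix_len(bytes: &[u8]) -> usize {
    match std::str::from_utf8(bytes) {
        Ok(s) => s.len(),
        Err(e) => e.valid_up_to(),
    }
}
```

Calling checked_str on the contents of an arbitrary binary file would
typically return Err, which a module can turn into an ordinary error
instead of relying on an unchecked conversion.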

[Message part 5 (message/rfc822, inline)]

From: Eli Zaretskii <eliz <at> gnu.org>
To: kurnevsky <at> gmail.com
Cc: 74922-done <at> debbugs.gnu.org
Subject: Re: bug#74922: Fwd: bug#74922: 29.4;
 copy_string_contents doesn't always produce a valid utf-8
Date: Sat, 04 Jan 2025 13:39:25 +0200

> Cc: 74922 <at> debbugs.gnu.org
> Date: Sat, 21 Dec 2024 14:09:24 +0200
> From: Eli Zaretskii <eliz <at> gnu.org>
> 
> > Cc: 74922 <at> debbugs.gnu.org
> > Date: Tue, 17 Dec 2024 17:10:36 +0200
> > From: Eli Zaretskii <eliz <at> gnu.org>
> > 
> > > From: Evgeny Kurnevsky <kurnevsky <at> gmail.com>
> > > Date: Tue, 17 Dec 2024 14:46:28 +0000
> > > Cc: 74922 <at> debbugs.gnu.org
> > > 
> > > It can definitely do it, but I guess in emacs-module-rs it's not done by default because of performance
> > > implications: it might be quite costly to check every string in some cases, and it wasn't really clear
> > > whether Emacs can pass an invalid string. So currently this case causes undefined behavior there, which
> > > results in an Emacs crash.
> > 
> > What do Rust programs do when they are told to read random files?
> > This is the same situation, basically.
> > 
> > And what would the module do if copy_string_contents *did* signal an
> > error?
> 
> I think I know what happened: you called copy_string_contents with a
> unibyte string.  In that case, copy_string_contents will return you
> the original string without doing anything.  The code in
> copy_string_contents that signals an error relies on the fact that
> encoding the input string yields nil if the input includes non-Unicode
> characters. But that cannot be established with unibyte strings,
> because a unibyte string doesn't hold characters, it holds raw bytes.
> 
> What you should do is make sure the string passed to
> copy_string_contents is a multibyte string.  If I do that, i.e.
> 
>   (switch-to-buffer "foo")
>   (set-buffer-multibyte t)
>   (insert-file-contents "/path/to/wg-private-pc.age")
>   (setq str1 (buffer-string))
> 
> and then call copy_string_contents with the resulting string str1, I
> get the result you expected.
> 
> You need to realize that copy_string_contents is a variant of the
> text-encoding routines: it encodes the input multibyte string in
> UTF-8.  The encoding routines in Emacs always return a unibyte input
> string unchanged, because a unibyte string is already encoded,
> or at least is supposed to be encoded.
> 
> And before you ask: no, copy_string_contents cannot by itself signal
> an error if passed a unibyte string, because a unibyte string can
> legitimately be a valid UTF-8 string. So in this case,
> copy_string_contents relies on the caller to make sure the input is
> valid UTF-8.

I believe the above explains the problem and the solution, so I'm now
closing this bug.
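
Since a unibyte string is passed through unchanged, a module that cannot
control its callers may still want a defensive fallback on its own side.
The following Rust sketch shows one option, under the assumption that
the copied buffer ends with a single NUL terminator. The helper name
decode_copied is hypothetical, and replacing invalid bytes with U+FFFD
is a choice made for this illustration, not something the thread
prescribes.

```rust
use std::borrow::Cow;

/// Decode a buffer filled by a copy_string_contents-style call.
/// Hypothetical helper; assumes at most one trailing NUL terminator.
pub fn decode_copied(buf: &[u8]) -> Cow<'_, str> {
    // Drop the trailing NUL if present, otherwise keep the buffer as is.
    let body = match buf.split_last() {
        Some((&0, rest)) => rest,
        _ => buf,
    };
    // Borrows when the bytes are valid UTF-8; otherwise allocates a copy
    // with each invalid sequence replaced by U+FFFD.
    String::from_utf8_lossy(body)
}
```

This trades exactness for safety: the module never sees an invalid
string, but raw bytes are not preserved. Callers that need the bytes
themselves should handle them as binary data rather than as text.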

This bug report was last modified 235 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
