Package: emacs
Reported by: Richard Hansen <rhansen <at> rhansen.org>
Date: Fri, 3 Jun 2022 06:21:02 UTC
Severity: minor
Tags: patch
Done: Stefan Kangas <stefankangas <at> gmail.com>
Bug is archived. No further changes may be made.
From: Eli Zaretskii <eliz <at> gnu.org>
To: Richard Hansen <rhansen <at> rhansen.org>
Cc: 55777 <at> debbugs.gnu.org
Subject: bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
Date: Sat, 04 Jun 2022 10:09:42 +0300

> Date: Fri, 3 Jun 2022 23:28:51 -0400
> Cc: 55777 <at> debbugs.gnu.org
> From: Richard Hansen <rhansen <at> rhansen.org>
>
> > If there was some situation where you needed these details for some
> > Lisp program, please describe that situation.
> I'm trying to understand some inconsistent behavior I'm observing
> while writing code to process binary data, and I found the existing
> documentation lacking.

You are digging into low-level details of how Emacs keeps strings in
memory, and the higher-level context of _why_ you need to understand
these details is left untold. In general, Lisp programs are well
advised to stay away from manipulating unibyte strings, and definitely
to refrain from comparing unibyte and multibyte strings -- because
these are supposed to be never needed in Lisp applications, and
because doing TRT with those requires non-trivial knowledge of the
Emacs internals. I see no reason to complicate the documentation for
the very rare occasions where these issues unfortunately leak to
higher-than-expected levels.

> ;; Unibyte vs. multibyte characters:
> (eq ?\xff ?\x3fffff) ; t (ok)
> (eq (aref "\x3fffff" 0) (aref "\xff" 0)) ; t (ok)
> (eq (aref "\x3fffff 😀" 0) (aref "\xff 😀" 0)) ; t (ok)
> (eq (aref "\xff" 0) (aref "\xff 😀" 0)) ; nil (expected t)
>
> ;; Unibyte vs. multibyte strings:
> (multibyte-string-p "\xff") ; nil (ok)
> (multibyte-string-p "\x3fffff") ; nil (ok???)
> (string= "\xff" (string-to-multibyte "\xff")) ; nil (expected t)
>
> ;; Char code vs. Unicode codepoint:
> (string= "😀\xff" "😀\x3fffff") ; t (ok)
> (string= "😀\N{U+ff}" "😀\xff") ; nil (ok)
> (string= "😀\N{U+ff}" "😀\x3fffff") ; nil (ok)
> (string= "😀ÿ" "😀\N{U+ff}") ; t (ok)
> (string= "😀ÿ" "😀\xff") ; nil (ok)
> (string= "😀ÿ" "😀\x3fffff") ; nil (ok)
> (eq ?\N{U+ff} ?\xff) ; t (expected nil)
> (eq ?\N{U+ff} ?\x3fffff) ; t (expected nil)
> (eq ?ÿ ?\xff) ; t (expected nil)
> (eq ?ÿ ?\x3fffff) ; t (expected nil)

If you still don't understand some of these, please feel free to ask
questions, and we will gladly answer them. But I see no reason to
change the documentation on that account.

> @@ -271,20 +271,19 @@ Converting Representations
>  @defun string-to-multibyte string
>  This function returns a multibyte string containing the same sequence
>  of characters as @var{string}. If @var{string} is a multibyte string,
> -it is returned unchanged. The function assumes that @var{string}
> -includes only @acronym{ASCII} characters and raw 8-bit bytes; the
> -latter are converted to their multibyte representation corresponding
> -to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
> -(@pxref{Text Representations, codepoints}).
> +it is returned unchanged. Otherwise, byte values are transformed to
> +their corresponding multibyte codepoints (@acronym{ASCII} characters
> +and characters in the @code{eight-bit} charset). @xref{Text
> +Representations, codepoints}.

This loses information, so I don't think we should make this change.
It might be trivially clear to you that a unibyte string can only
contain ASCII and raw bytes, but it isn't necessarily clear to
everyone.

>  @defun string-to-unibyte string
>  This function returns a unibyte string containing the same sequence of
> -characters as @var{string}. It signals an error if @var{string}
> -contains a non-@acronym{ASCII} character. If @var{string} is a
> -unibyte string, it is returned unchanged. Use this function for
> -@var{string} arguments that contain only @acronym{ASCII} and eight-bit
> -characters.
> +characters as @var{string}. If @var{string} is a unibyte string, it
> +is returned unchanged. Otherwise, @acronym{ASCII} characters and
> +characters in the @code{eight-bit} charset are converted to their
> +corresponding byte values. It signals an error if any other character
> +is encountered. @xref{Text Representations, codepoints}.

This basically rearranges the existing text, and adds just one
sentence:

  Otherwise, @acronym{ASCII} characters and characters in the
  @code{eight-bit} charset are converted to their corresponding byte
  values.

The cross-reference is identical to the one we already have a few
lines above this text, so it is redundant.

I've made a change to add the above sentence, and slightly rearranged
the text to be clearer and logically complete. Here's how this text
looks now on the emacs-28 branch (and will appear in Emacs 28.2 and
later):

  @defun string-to-unibyte string
  This function returns a unibyte string containing the same sequence of
  characters as @var{string}. If @var{string} is a unibyte string, it
  is returned unchanged. Otherwise, @acronym{ASCII} characters and
  characters in the @code{eight-bit} charset are converted to their
  corresponding byte values. Use this function for @var{string}
  arguments that contain only @acronym{ASCII} and eight-bit characters;
  the function signals an error if any other characters are encountered.
  @end defun

Thanks.
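
The conversions described in the two @defun entries can be tried out with a short round trip. The snippet below is only an illustrative sketch: the helper name `my-raw-byte-round-trip' is hypothetical and appears neither in the thread nor in Emacs. It assumes a string literal such as "\xff" is read as a unibyte string, as the quoted `multibyte-string-p' result above indicates.

```elisp
;; Hypothetical helper (an assumption for illustration, not part of
;; Emacs): convert a unibyte string to multibyte and back, and report
;; what each step produced.
(defun my-raw-byte-round-trip (raw)
  "Round-trip unibyte string RAW through the two conversion functions.
Return a list (MULTI-P MULTI-CHAR BACK-MULTI-P BACK-BYTE), describing
the first element of the multibyte and of the round-tripped string."
  (let* ((multi (string-to-multibyte raw))   ; raw byte -> eight-bit char
         (back (string-to-unibyte multi)))   ; eight-bit char -> raw byte
    (list (multibyte-string-p multi)
          (aref multi 0)
          (multibyte-string-p back)
          (aref back 0))))

(my-raw-byte-round-trip "\xff")
;; => (t 4194303 nil 255)   ; 4194303 is #x3FFFFF

;; A character that is neither ASCII nor in the eight-bit charset
;; (here U+00FF) cannot be converted; an error is signaled instead.
(condition-case nil
    (string-to-unibyte "\N{U+ff}")
  (error 'signaled))
;; => signaled
```

The first result shows the raw byte #xFF becoming the codepoint #x3FFFFF in the multibyte string and the byte 255 again after `string-to-unibyte'. The second form shows the error case the documentation mentions.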