Package: emacs
Reported by: Richard Hansen <rhansen <at> rhansen.org>
Date: Fri, 3 Jun 2022 06:21:02 UTC
Severity: minor
Tags: patch
Done: Stefan Kangas <stefankangas <at> gmail.com>
Bug is archived. No further changes may be made.
Message #26 received at 55777 <at> debbugs.gnu.org (full text, mbox):
From: Eli Zaretskii <eliz <at> gnu.org>
To: Richard Hansen <rhansen <at> rhansen.org>
Cc: 55777 <at> debbugs.gnu.org
Subject: Re: bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
Date: Mon, 06 Jun 2022 14:29:19 +0300

> Date: Sun, 5 Jun 2022 22:00:35 -0400
> Cc: 55777 <at> debbugs.gnu.org
> From: Richard Hansen <rhansen <at> rhansen.org>
>
> On 6/5/22 01:37, Eli Zaretskii wrote:
> > Could you please state what is confusing in the current wording?
>
> * "Raw 8-bit bytes" isn't really defined. It's mentioned earlier in
>   the chapter -- the term is even in a @dfn{} -- but there's no
>   definition there.

It is defined as best we could without confusing the readers:

   Occasionally, Emacs needs to hold and manipulate encoded text or
   binary non-text data in its buffers or strings. For example, when
   Emacs visits a file, it first reads the file’s text verbatim into a
   buffer, and only then converts it to the internal representation.
   Before the conversion, the buffer holds encoded text.

   Encoded text is not really text, as far as Emacs is concerned, but
   rather a sequence of raw 8-bit bytes. We call buffers and strings
   that hold encoded text “unibyte” buffers and strings, because Emacs
   treats them as a sequence of individual bytes. [...]

(The @dfn part is markup used whenever new terminology is first used,
it doesn't imply "definition".)

You are welcome to propose a better explanation, but one thing is a
non-starter: mentioning the numerical codes of those bytes, certainly
as part of their "definition". This is because their numerical codes
overlap Latin characters, and people were very confused about that
when we mentioned them in the documentation in the past. So now we
deliberately don't mention the values. The definition is effectively
"bytes that have no meaning as human-readable text".

> * The term "raw 8-bit bytes" is misleading. It suggests binary data
>   (bytes with values 0-255) but it's actually meant to only cover
>   128-255.

It indeed could potentially mislead. But not necessarily: it is
customary to use "eight-bit" to mean "with the 8th bit set".

Once again, you don't have to convince me that this area is confusing
and notoriously hard to document.
The challenge is to come up with something that is better than what we
have and yet doesn't trigger confusion which we already had in the
past.

> * The term "raw 8-bit bytes" is not used consistently. Sometimes "8"
>   is spelled out as "eight", sometimes "raw" comes after "8-bit",
>   and sometimes it refers to all byte values 0-255 (see the first
>   sentence under `@cindex unibyte text`).

I see no problem here, none at all. This is a manual, not a
mathematical treatise.

> * It's not clear whether "raw 8-bit bytes" is meant to refer to
>   bytes with values 128-255, or to the *characters* that map to
>   those byte values.

We specifically say they are NOT characters. From the above-cited
description:

   Encoded text is not really text, as far as Emacs is concerned, but
   rather a sequence of raw 8-bit bytes.

> * The following phrasing is weird: "The function assumes that
>   @var{string} includes ASCII characters and raw 8-bit bytes". The
>   purpose of "raw 8-bit bytes" is to cover non-ASCII byte values, so
>   by definition that assumption is always true.

No, it isn't true "by definition". We are trying to make it very
clear that we distinguish between "characters" and "raw bytes".
"Characters" are units of human-readable text, and each character has
a set of attributes that Emacs uses when processing text. Characters
have letter-case, general category, directionality, numerical value,
etc. By contrast, "raw bytes" don't have any such attributes: it is
meaningless to ask whether a given raw byte is upper- or lower-case,
or if its directionality is right-to-left, etc.

I hope you now better understand what the sentence above attempts to
say; it doesn't say things that are trivially true.

>   By saying "the
>   function assumes", the reader is left wondering about the cases
>   where that assumption is not true,

Those other cases are multibyte strings, of course.
We could add that in parentheses, e.g.:

   The function assumes that @var{string} includes ASCII characters
   and raw 8-bit bytes (as opposed to multibyte text).

> Maybe something like this:
>
>     By definition, unibyte strings contain only @acronym{ASCII}
>     characters (bytes with values 0-127) and raw 8-bit bytes
>     (bytes with values 128-255); the latter are converted to their
>     corresponding multibyte representations in the
>     @code{eight-bit} character set (@pxref{Text Representations,
>     codepoints}).

As I tried to explain above, using the numerical codes of the bytes is
a step backward: we've been there and done that, and found that people
get confused by that, because the byte codes overlap the Unicode
codepoints of Latin characters. Explaining the difference rigorously
is IME impossible without delving into the internal representation of
each one of them, since that is how Emacs _really_ distinguishes
between them. But having all that in the ELisp Reference manual is
completely unjustified (let alone not future-proof, since the internal
representation can change).

Another problem with the above text is that it implies ASCII
characters are bytes: we don't want to call them that, to maintain the
fundamental difference between characters and bytes.

Yet another problem there is that you can have a multibyte string that
is pure-ASCII, so "by definition" is also problematic.

Bottom line: I think the manual describes this reasonably well, and,
given the past experience, any change will have to be tangibly better
before we make it.

Thanks.
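
For readers following the thread, the character/raw-byte distinction
discussed in the message can be observed from Lisp without relying on
any internal code point values. What follows is only an illustrative
sketch, not text from the manual or from either correspondent: the
function name `demo-raw-byte-distinction' is hypothetical, and the
byte 200 is an arbitrary choice of a byte with the 8th bit set. It
uses only `unibyte-string', `string-to-multibyte',
`string-to-unibyte', `multibyte-string-p' and `char-charset'.

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-raw-byte-distinction ()
  "Return a list showing how a raw byte survives representation changes."
  (let* ((uni (unibyte-string ?a 200))       ; one ASCII character, one raw byte
         (multi (string-to-multibyte uni)))  ; same contents, multibyte form
    (list
     (multibyte-string-p uni)                ; the original string is unibyte
     (multibyte-string-p multi)              ; the converted string is multibyte
     (char-charset (aref multi 0))           ; charset of the ASCII character
     (char-charset (aref multi 1))           ; charset of the raw byte
     ;; Converting back recovers the original unibyte string.
     (equal (string-to-unibyte multi) uni))))

(demo-raw-byte-distinction)
;; => (nil t ascii eight-bit t)
```

The point of the sketch is that the raw byte is reported as belonging
to the `eight-bit' charset rather than to any charset of
human-readable characters, which is the distinction the message draws
without naming numerical codes.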