GNU bug report logs - #28179
Fix use of string-to-multibyte in ispell.el

Previous Next

Package: emacs;

Reported by: Reuben Thomas <rrt <at> sc3d.org>

Date: Tue, 22 Aug 2017 00:53:01 UTC

Severity: minor

Done: Reuben Thomas <rrt <at> sc3d.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 28179 in the body.
You can then email your comments to 28179 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#28179; Package emacs. (Tue, 22 Aug 2017 00:53:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Reuben Thomas <rrt <at> sc3d.org>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Tue, 22 Aug 2017 00:53:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Reuben Thomas <rrt <at> sc3d.org>
To: bug-emacs <bug-emacs <at> gnu.org>
Subject: Fix use of string-to-multibyte in ispell.el
Date: Tue, 22 Aug 2017 01:51:53 +0100
[Message part 1 (text/plain, inline)]
The attached patch removes a use of string-to-multibyte in ispell.el.

It turned out to be possible to simplify the surrounding code quite a
bit too.

-- 
https://rrt.sc3d.org
[0001-Avoid-using-string-to-multibyte-in-ispell.el.patch (text/x-patch, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#28179; Package emacs. (Tue, 22 Aug 2017 16:40:02 GMT) Full text and rfc822 format available.

Message #8 received at 28179 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Reuben Thomas <rrt <at> sc3d.org>
Cc: 28179 <at> debbugs.gnu.org
Subject: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
Date: Tue, 22 Aug 2017 19:38:49 +0300
> From: Reuben Thomas <rrt <at> sc3d.org>
> Date: Tue, 22 Aug 2017 01:51:53 +0100
> 
>  ;; Return a string decoded from Nth element of the current dictionary.
>  (defun ispell-get-decoded-string (n)
>    "Get the decoded string in slot N of the descriptor of the current dict."
>    (let* ((slot (or
>  		(assoc ispell-current-dictionary ispell-local-dictionary-alist)
>  		(assoc ispell-current-dictionary ispell-dictionary-alist)
> -		(error "No data for dictionary \"%s\", neither in `ispell-local-dictionary-alist' nor in `ispell-dictionary-alist'"
> -		       ispell-current-dictionary)))
> -	 (str (nth n slot)))
> -    (when (and (> (length str) 0)
> -	       (not (multibyte-string-p str)))
> -      (setq str (ispell-decode-string str))
> -      (or (multibyte-string-p str)
> -	  (setq str (string-to-multibyte str))))

Are you sure we don't need to ensure ispell-get-decoded-string always
returns a multibyte string?  What if decode-coding-string returns a
pure ASCII string, which is therefore unibyte?

IOW, it looks like you didn't replace the call to string-to-multibyte
with something equivalent, you simply removed that call.  If you
analyzed the code and concluded that this call is redundant, please
tell the details of your analysis.

Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#28179; Package emacs. (Tue, 22 Aug 2017 17:05:01 GMT) Full text and rfc822 format available.

Message #11 received at 28179 <at> debbugs.gnu.org (full text, mbox):

From: Reuben Thomas <rrt <at> sc3d.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 28179 <at> debbugs.gnu.org
Subject: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
Date: Tue, 22 Aug 2017 18:04:11 +0100
[Message part 1 (text/plain, inline)]
On 22/08/17 17:38, Eli Zaretskii wrote:
>> From: Reuben Thomas <rrt <at> sc3d.org>
>> Date: Tue, 22 Aug 2017 01:51:53 +0100
>>
>>  ;; Return a string decoded from Nth element of the current dictionary.
>>  (defun ispell-get-decoded-string (n)
>>    "Get the decoded string in slot N of the descriptor of the current dict."
>>    (let* ((slot (or
>>  		(assoc ispell-current-dictionary ispell-local-dictionary-alist)
>>  		(assoc ispell-current-dictionary ispell-dictionary-alist)
>> -		(error "No data for dictionary \"%s\", neither in `ispell-local-dictionary-alist' nor in `ispell-dictionary-alist'"
>> -		       ispell-current-dictionary)))
>> -	 (str (nth n slot)))
>> -    (when (and (> (length str) 0)
>> -	       (not (multibyte-string-p str)))
>> -      (setq str (ispell-decode-string str))
>> -      (or (multibyte-string-p str)
>> -	  (setq str (string-to-multibyte str))))
> Are you sure we don't need to ensure ispell-get-decoded-string always
> returns a multibyte string?  What if decode-coding-string returns a
> pure ASCII string, which is therefore unibyte?

This is multibyte too, no? The Emacs manual says:

> Rather, Emacs uses a variable-length internal representation of
> characters, that stores each character as a sequence of 1 to 5 8-bit
> bytes, depending on the magnitude of its codepoint(1).  For example,
> *any**
> **ASCII character takes up only 1 byte*, a Latin-1 character takes up 2
> bytes, etc.  We call this representation of text “multibyte”.

(My bold.)

The reason I am using decode-coding-string is because that is what the
obsolescence message in subr.el says to use.

If I've overlooked something, then it would be nice to know what I've
missed in the documentation, or whether the documentation could be improved.

-- 
https://rrt.sc3d.org

[Message part 2 (text/html, inline)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#28179; Package emacs. (Tue, 22 Aug 2017 17:24:01 GMT) Full text and rfc822 format available.

Message #14 received at 28179 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Reuben Thomas <rrt <at> sc3d.org>
Cc: 28179 <at> debbugs.gnu.org
Subject: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
Date: Tue, 22 Aug 2017 20:23:44 +0300
> Cc: 28179 <at> debbugs.gnu.org
> From: Reuben Thomas <rrt <at> sc3d.org>
> Date: Tue, 22 Aug 2017 18:04:11 +0100
> 
> Are you sure we don't need to ensure ispell-get-decoded-string always
> returns a multibyte string?  What if decode-coding-string returns a
> pure ASCII string, which is therefore unibyte?
> 
> This is multibyte too, no? The Emacs manual says:
> 
>  Rather, Emacs uses a variable-length internal representation of
>  characters, that stores each character as a sequence of 1 to 5 8-bit
>  bytes, depending on the magnitude of its codepoint(1). For example, any
>  ASCII character takes up only 1 byte, a Latin-1 character takes up 2
>  bytes, etc. We call this representation of text “multibyte”.

This is a misunderstanding, caused by the overloaded meaning of
"multibyte string".  The way I meant it, it has to do with the
internal flag marking a string either unibyte or multibyte.  Observe:

  (multibyte-string-p "abcd") => nil

but

  (multibyte-string-p (decode-coding-string "abcd" 'utf-8)) => t

although

  (string= "abcd" (decode-coding-string "abcd" 'utf-8)) => t

> The reason I am using decode-coding-string is because that is what the obsolescence message in subr.el
> says to use.

Yes, but the code already used decode-coding-string, in the function
ispell-decode-string, which you replaced with its body.  The call to
string-to-multibyte worked on the result of decoding, not instead of
the decoding.  So actually the call to string-to-multibyte was not
replaced, it was removed.

> If I've overlooked something, then it would be nice to know what I've missed in the documentation, or whether
> the documentation could be improved.

Is the issue more clear now?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#28179; Package emacs. (Wed, 23 Aug 2017 11:00:02 GMT) Full text and rfc822 format available.

Message #17 received at 28179 <at> debbugs.gnu.org (full text, mbox):

From: Reuben Thomas <rrt <at> sc3d.org>
To: Eli Zaretskii <eliz <at> gnu.org>, 28179 <at> debbugs.gnu.org
Subject: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
Date: Wed, 23 Aug 2017 11:59:41 +0100
On 22/08/17 18:23, Eli Zaretskii wrote:

>> Cc: 28179 <at> debbugs.gnu.org
>> From: Reuben Thomas <rrt <at> sc3d.org>
>> Date: Tue, 22 Aug 2017 18:04:11 +0100
>>
>> Are you sure we don't need to ensure ispell-get-decoded-string always
>> returns a multibyte string?  What if decode-coding-string returns a
>> pure ASCII string, which is therefore unibyte?
>>
>> This is multibyte too, no? The Emacs manual says:
>>
>>  Rather, Emacs uses a variable-length internal representation of
>>  characters, that stores each character as a sequence of 1 to 5 8-bit
>>  bytes, depending on the magnitude of its codepoint(1). For example, any
>>  ASCII character takes up only 1 byte, a Latin-1 character takes up 2
>>  bytes, etc. We call this representation of text “multibyte”.
> This is a misunderstanding, caused by the overloaded meaning of
> "multibyte string".  The way I meant it, it has to do with the
> internal flag marking a string either unibyte or multibyte.  Observe:
>
>   (multibyte-string-p "abcd") => nil
>
> but
>
>   (multibyte-string-p (decode-coding-string "abcd" 'utf-8)) => t

So here, running decode-coding-string on a plain ASCII string returns a
multibyte string.

> ispell-decode-string, which you replaced with its body. The call to
> string-to-multibyte worked on the result of decoding, not instead of
> the decoding.  So actually the call to string-to-multibyte was not
> replaced, it was removed.

Yes, that call seemed to be unnecessary.

> Is the issue more clear now? 

I now understand the two meanings of "multibyte", but I don't understand
how my patch is deficient. I tried even:

(multibyte-string-p (decode-coding-string "abcde" 'utf-8 t)) ; returns
t; also if I use 'us-ascii

So in fact even when the string isn't copied (as in my patch, where I
also use a third argument of t to decode-coding-string) it appears to be
changed to a multibyte string.

-- 
https://rrt.sc3d.org





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#28179; Package emacs. (Thu, 24 Aug 2017 17:00:01 GMT) Full text and rfc822 format available.

Message #20 received at 28179 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Reuben Thomas <rrt <at> sc3d.org>
Cc: 28179 <at> debbugs.gnu.org
Subject: Re: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
Date: Thu, 24 Aug 2017 19:59:08 +0300
> From: Reuben Thomas <rrt <at> sc3d.org>
> Date: Wed, 23 Aug 2017 11:59:41 +0100
> 
> I now understand the two meanings of "multibyte", but I don't understand
> how my patch is deficient.

I didn't say it was deficient, I asked whether you verified that
either (a) the result is always multibyte, or (b) that we don't need
to worry about it being multibyte if it is pure-ASCII.

> So in fact even when the string isn't copied (as in my patch, where I
> also use a third argument of t to decode-coding-string) it appears to be
> changed to a multibyte string.

Fine, if you are sure, go ahead and push.

Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#28179; Package emacs. (Thu, 24 Aug 2017 17:33:02 GMT) Full text and rfc822 format available.

Message #23 received at 28179 <at> debbugs.gnu.org (full text, mbox):

From: Noam Postavsky <npostavs <at> users.sourceforge.net>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 28179 <at> debbugs.gnu.org, Reuben Thomas <rrt <at> sc3d.org>
Subject: Re: bug#28179: Fwd: Re: bug#28179: Fix use of string-to-multibyte in
 ispell.el
Date: Thu, 24 Aug 2017 13:32:38 -0400
On Thu, Aug 24, 2017 at 12:59 PM, Eli Zaretskii <eliz <at> gnu.org> wrote:
>> From: Reuben Thomas <rrt <at> sc3d.org>
>> So in fact even when the string isn't copied (as in my patch, where I
>> also use a third argument of t to decode-coding-string) it appears to be
>> changed to a multibyte string.
>
> Fine, if you are sure, go ahead and push.

But please, think of the children^H^H^H^H^H^H^H^H readers (of your
patch)! Put this information in the commit message.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#28179; Package emacs. (Thu, 24 Aug 2017 17:46:01 GMT) Full text and rfc822 format available.

Message #26 received at 28179 <at> debbugs.gnu.org (full text, mbox):

From: Reuben Thomas <rrt <at> sc3d.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 28179 <at> debbugs.gnu.org
Subject: Re: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
Date: Thu, 24 Aug 2017 18:45:33 +0100
On 24/08/17 17:59, Eli Zaretskii wrote:
>> From: Reuben Thomas <rrt <at> sc3d.org>
>> Date: Wed, 23 Aug 2017 11:59:41 +0100
>>
>> I now understand the two meanings of "multibyte", but I don't understand
>> how my patch is deficient.
> I didn't say it was deficient,

Sorry, I was unclear. I meant, precisely, I don't see why you think my
patch's code returns a string that is not multibyte.

>  I asked whether you verified that
> either (a) the result is always multibyte

I believe I showed this is the case.

>
>> So in fact even when the string isn't copied (as in my patch, where I
>> also use a third argument of t to decode-coding-string) it appears to be
>> changed to a multibyte string.
> Fine, if you are sure, go ahead and push.
>

The reason I am asking again is because you first said:

> What if decode-coding-string returns a pure ASCII string, which is
> therefore unibyte?

and then later you said:

> The way I meant it, it has to do with the internal flag marking a
> string either unibyte or multibyte. Observe:
>   (multibyte-string-p "abcd") => nil
>
> but
>
>   (multibyte-string-p (decode-coding-string "abcd" 'utf-8)) => t

In other words:

1. As far as I can tell from the above (and my own confirmatory
experiments and reading of the documentation), a pure ASCII string can
be multibyte (it's a matter of the multibyte flag, not the number of
bytes used to store each character).

2. decode-coding-string always returns a multibyte string.

Since these two observations seemed to mean that you contradicted
yourself, I was checking whether in fact I had misunderstood (so that
for example one of my two observations above is wrong), or if your
original understanding was incomplete (so that in fact your question
about decode-coding-string is therefore misguided, because it can return
a pure ASCII unibyte string (in the coding sense) which is nonetheless a
multibyte string (in the sense that multibyte-string-p on it returns t).

Sorry about the miscommunication. In any case, I think the code is
correct, your original question was misguided, and I shall push, with,
as Noam requested in another message, an explanation of my  assumptions.
No need to reply further unless you think there really is a problem!

-- 

https://rrt.sc3d.org






Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#28179; Package emacs. (Thu, 24 Aug 2017 18:23:01 GMT) Full text and rfc822 format available.

Message #29 received at 28179 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Reuben Thomas <rrt <at> sc3d.org>
Cc: 28179 <at> debbugs.gnu.org
Subject: Re: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
Date: Thu, 24 Aug 2017 21:20:46 +0300
> Cc: 28179 <at> debbugs.gnu.org
> From: Reuben Thomas <rrt <at> sc3d.org>
> Date: Thu, 24 Aug 2017 18:45:33 +0100
> 
> The reason I am asking again is because you first said:
> 
> > What if decode-coding-string returns a pure ASCII string, which is
> > therefore unibyte?
> 
> and then later you said:
> 
> > The way I meant it, it has to do with the internal flag marking a
> > string either unibyte or multibyte. Observe:
> >   (multibyte-string-p "abcd") => nil
> >
> > but
> >
> >   (multibyte-string-p (decode-coding-string "abcd" 'utf-8)) => t

That example may be conclusive for UTF-8, but is it conclusive for
_any_ encoding?  I don't know.  E.g., what about the ISO-2022 based
encodings, where all the bytes are (AFAIR) pure ASCII?

> 1. As far as I can tell from the above (and my own confirmatory
> experiments and reading of the documentation), a pure ASCII string can
> be multibyte (it's a matter of the multibyte flag, not the number of
> bytes used to store each character).
> 
> 2. decode-coding-string always returns a multibyte string.

Can you show me why 2 is always correct?  It might be, I simply don't
know.  All I know is that in general relying on plain-ASCII strings to
be always multibyte in any given situation is risky, we were bitten by
that a few times.  But maybe it's not an issue in this case.  Which is
why I was asking you whether you have sufficient basis to believe this
to be so in this case.

> Since these two observations seemed to mean that you contradicted
> yourself, I was checking whether in fact I had misunderstood (so that
> for example one of my two observations above is wrong), or if your
> original understanding was incomplete (so that in fact your question
> about decode-coding-string is therefore misguided, because it can return
> a pure ASCII unibyte string (in the coding sense) which is nonetheless a
> multibyte string (in the sense that multibyte-string-p on it returns t).

I only used decode-coding-string because I remembered it as an easy
way of creating a multibyte ASCII string, when the coding-system is
UTF-8, that's all.  There was no contradiction in what I said, at
least not an intended one.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#28179; Package emacs. (Thu, 24 Aug 2017 18:51:01 GMT) Full text and rfc822 format available.

Message #32 received at 28179 <at> debbugs.gnu.org (full text, mbox):

From: Reuben Thomas <rrt <at> sc3d.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 28179 <at> debbugs.gnu.org
Subject: Re: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
Date: Thu, 24 Aug 2017 19:50:17 +0100
On 24 August 2017 at 19:20, Eli Zaretskii <eliz <at> gnu.org> wrote:
>> Cc: 28179 <at> debbugs.gnu.org
>> From: Reuben Thomas <rrt <at> sc3d.org>
>> Date: Thu, 24 Aug 2017 18:45:33 +0100
>>
>> The reason I am asking again is because you first said:
>>
>> > What if decode-coding-string returns a pure ASCII string, which is
>> > therefore unibyte?
>>
>> and then later you said:
>>
>> > The way I meant it, it has to do with the internal flag marking a
>> > string either unibyte or multibyte. Observe:
>> >   (multibyte-string-p "abcd") => nil
>> >
>> > but
>> >
>> >   (multibyte-string-p (decode-coding-string "abcd" 'utf-8)) => t
>
> That example may be conclusive for UTF-8, but is it conclusive for
> _any_ encoding?  I don't know.  E.g., what about the ISO-2022 based
> encodings, where all the bytes are (AFAIR) pure ASCII?

(multibyte-string-p (decode-coding-string "abcd" 'iso-2022-jp)) => t

I still don't understand what you're getting at: the bytes in "abcd"
are pure ASCII, whatever coding system one is decoding from.

> Can you show me why 2 is always correct?  It might be, I simply don't
> know.  All I know is that in general relying on plain-ASCII strings to
> be always multibyte in any given situation is risky, we were bitten by
> that a few times.  But maybe it's not an issue in this case.  Which is
> why I was asking you whether you have sufficient basis to believe this
> to be so in this case.

I don't know.

As I said before, the make-obsolete notice for string-to-multibyte
says "use `decode-coding-string'". If it is as tricky as you suggest
it might be, then the notice should be updated to point to more
detailed guidance.

The relevant commit is:

commit f74d496478cd57f252817bd7437fe1b7972ce01f
Author: Stefan Monnier <monnier <at> iro.umontreal.ca>
Date:   Mon Jan 30 13:02:18 2017 -0500

    * lisp/subr.el (string-make-unibyte, string-make-multibyte): Obsolete.

diff --git a/lisp/subr.el b/lisp/subr.el
index a6ba05c..a204577 100644
--- a/lisp/subr.el
+++ b/lisp/subr.el
@@ -1417,8 +1417,10 @@ posn-object-width-height
 ;; bug#23850
 (make-obsolete 'string-to-unibyte   "use `encode-coding-string'." "26.1")
 (make-obsolete 'string-as-unibyte   "use `encode-coding-string'." "26.1")
+(make-obsolete 'string-make-unibyte   "use `encode-coding-string'." "26.1")
 (make-obsolete 'string-to-multibyte "use `decode-coding-string'." "26.1")
 (make-obsolete 'string-as-multibyte "use `decode-coding-string'." "26.1")
+(make-obsolete 'string-make-multibyte "use `decode-coding-string'." "26.1")

I'm going to close this bug; if better documentation is needed, both
for the obsolescence of string-to-multibyte and for multibyte strings
in general, that's a new bug.

-- 
https://rrt.sc3d.org




Reply sent to Reuben Thomas <rrt <at> sc3d.org>:
You have taken responsibility. (Thu, 24 Aug 2017 18:53:01 GMT) Full text and rfc822 format available.

Notification sent to Reuben Thomas <rrt <at> sc3d.org>:
bug acknowledged by developer. (Thu, 24 Aug 2017 18:53:02 GMT) Full text and rfc822 format available.

Message #37 received at 28179-done <at> debbugs.gnu.org (full text, mbox):

From: Reuben Thomas <rrt <at> sc3d.org>
To: 28179-done <at> debbugs.gnu.org
Subject: Closing bug: patch is installed
Date: Thu, 24 Aug 2017 19:52:11 +0100
I've installed the patch, so closing the bug. There are some lingering
doubts over the correctness of the use of decode-coding-string, but
they're outside the scope of this bug.

-- 
https://rrt.sc3d.org





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#28179; Package emacs. (Thu, 24 Aug 2017 19:03:02 GMT) Full text and rfc822 format available.

Message #40 received at 28179 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Reuben Thomas <rrt <at> sc3d.org>
Cc: 28179 <at> debbugs.gnu.org
Subject: Re: Fwd: Re: bug#28179: Fix use of string-to-multibyte in ispell.el
Date: Thu, 24 Aug 2017 22:02:20 +0300
> From: Reuben Thomas <rrt <at> sc3d.org>
> Date: Thu, 24 Aug 2017 19:50:17 +0100
> Cc: 28179 <at> debbugs.gnu.org
> 
> >> >   (multibyte-string-p (decode-coding-string "abcd" 'utf-8)) => t
> >
> > That example may be conclusive for UTF-8, but is it conclusive for
> > _any_ encoding?  I don't know.  E.g., what about the ISO-2022 based
> > encodings, where all the bytes are (AFAIR) pure ASCII?
> 
> (multibyte-string-p (decode-coding-string "abcd" 'iso-2022-jp)) => t

That's not what I meant, but never mind.  I only replied to tell there
was no contradiction in my previous messages, and no confusion on my
part, that's all.

Thanks.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 22 Sep 2017 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 7 years and 267 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.