GNU bug report logs -
#42904
[PATCH] Non-Unicode frame title crashes Emacs on macOS
Previous Next
Reported by: Mattias Engdegård <mattiase <at> acm.org>
Date: Mon, 17 Aug 2020 14:13:02 UTC
Severity: normal
Tags: patch
Merged with 41184
Found in version 28.0.50
Done: Mattias Engdegård <mattiase <at> acm.org>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 42904 in the body.
You can then email your comments to 42904 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Mon, 17 Aug 2020 14:13:02 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Mattias Engdegård <mattiase <at> acm.org>
:
New bug report received and forwarded. Copy sent to
bug-gnu-emacs <at> gnu.org
.
(Mon, 17 Aug 2020 14:13:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Setting a frame title that contains non-Unicode characters causes a crash in the NS backend. (Other platforms may or may not deal with it appropriately -- if you have the opportunity to test, please report.)
Since the title is typically derived from the buffer name, this is easily reproduced by
(rename-buffer "n\351")
The crash occurs in ns_set_name_internal:
encoded_name = ENCODE_UTF_8 (name);
Here encoded_name is still "n\351" (a 2 byte unibyte string), because the \351 couldn't be encoded.
str = [NSString stringWithUTF8String: SSDATA (encoded_name)];
Now str is nil since "n\351" isn't valid UTF-8.
[[view window] setTitle: str];
Here we get an NS crash because nil isn't a valid setTitle: argument.
Proposed patch attached. I didn't find any obvious way to encode an Emacs string into valid UTF-8 (with bad parts replaced) so a new function was written. The corresponding Lisp function was marked internal because it's only there for test purposes, but it could of course be promoted to non-internal if someone wants it.
[0001-Fix-NS-crash-on-invalid-frame-title-string.patch (application/octet-stream, attachment)]
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Mon, 17 Aug 2020 14:55:01 GMT)
Full text and
rfc822 format available.
Message #8 received at 42904 <at> debbugs.gnu.org (full text, mbox):
Mattias Engdegård <mattiase <at> acm.org> writes:
> Setting a frame title that contains non-Unicode characters causes a crash in the NS backend. (Other platforms may or may not deal with it appropriately -- if you have the opportunity to test, please report.)
>
> Since the title is typically derived from the buffer name, this is easily reproduced by
>
> (rename-buffer "n\351")
Looks like this is related to bug#41184
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Mon, 17 Aug 2020 15:56:02 GMT)
Full text and
rfc822 format available.
Message #11 received at 42904 <at> debbugs.gnu.org (full text, mbox):
merge 42094 41184
17 aug. 2020 kl. 16.54 skrev Andrii Kolomoiets <andreyk.mad <at> gmail.com>:
> Looks like this is related to bug#41184
Indeed it's the same bug, thank you!
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Mon, 17 Aug 2020 15:56:02 GMT)
Full text and
rfc822 format available.
Message #14 received at 42904 <at> debbugs.gnu.org (full text, mbox):
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Mon, 17 Aug 2020 16:11:52 +0200
> Cc: Alan Third <alan <at> idiocy.org>
>
> Proposed patch attached. I didn't find any obvious way to encode an
> Emacs string into valid UTF-8 (with bad parts replaced) so a new
> function was written.
Is something wrong with encode_string_utf_8? It has arguments that
allow you to replace invalid bytes into the likes of u+FFFD. Or did I
misunderstand the problem you are facing?
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Mon, 17 Aug 2020 16:12:01 GMT)
Full text and
rfc822 format available.
Message #17 received at 42904 <at> debbugs.gnu.org (full text, mbox):
17 aug. 2020 kl. 17.55 skrev Eli Zaretskii <eliz <at> gnu.org>:
> Is something wrong with encode_string_utf_8? It has arguments that
> allow you to replace invalid bytes into the likes of u+FFFD. Or did I
> misunderstand the problem you are facing?
No, that's a valid question. I did try that function first, but it had too many quirks: doesn't accept a unibyte non-ASCII string, sometimes replaces valid characters, doesn't always output UTF-8... It was easier to write a new function which encapsulates the common usage case. In addition, the new function is short and simple enough that it can easily be verified to be correct; encode_string_utf_8 is big and complex.
In addition, it seems likely that the same problem exists elsewhere and it's useful to have a function which solves it right away.
Merged 41184 42904.
Request was from
Mattias Engdegård <mattiase <at> acm.org>
to
control <at> debbugs.gnu.org
.
(Mon, 17 Aug 2020 16:14:01 GMT)
Full text and
rfc822 format available.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Mon, 17 Aug 2020 17:07:02 GMT)
Full text and
rfc822 format available.
Message #22 received at 42904 <at> debbugs.gnu.org (full text, mbox):
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Mon, 17 Aug 2020 18:11:50 +0200
> Cc: 42904 <at> debbugs.gnu.org, alan <at> idiocy.org
>
> 17 aug. 2020 kl. 17.55 skrev Eli Zaretskii <eliz <at> gnu.org>:
>
> > Is something wrong with encode_string_utf_8? It has arguments that
> > allow you to replace invalid bytes into the likes of u+FFFD. Or did I
> > misunderstand the problem you are facing?
>
> No, that's a valid question. I did try that function first, but it had too many quirks: doesn't accept a unibyte non-ASCII string, sometimes replaces valid characters, doesn't always output UTF-8... It was easier to write a new function which encapsulates the common usage case. In addition, the new function is short and simple enough that it can easily be verified to be correct; encode_string_utf_8 is big and complex.
Well, it is always easier to special-case some use case, but we have
general APIs for a reason. In particular, having several similar but
subtly different functions is confusing and causes mistakes.
And you seem to be saying that encode_string_utf_8 doesn't work as
advertised, which means it should be fixed.
So I would prefer to use encode_string_utf_8 if reasonably practical.
Thanks.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Mon, 17 Aug 2020 18:49:01 GMT)
Full text and
rfc822 format available.
Message #25 received at 42904 <at> debbugs.gnu.org (full text, mbox):
17 aug. 2020 kl. 19.05 skrev Eli Zaretskii <eliz <at> gnu.org>:
> Well, it is always easier to special-case some use case, but we have
> general APIs for a reason. In particular, having several similar but
> subtly different functions is confusing and causes mistakes.
The new function is much simpler and easier to use than encode_string_utf_8 precisely for that reason: to avoid confusion and mistakes, both of which I got in spades when trying to use it.
> And you seem to be saying that encode_string_utf_8 doesn't work as
> advertised, which means it should be fixed.
Actually I don't know exactly how it is supposed to work so I wouldn't even say that. It's probably fine code but it's not for me, not in this case.
> So I would prefer to use encode_string_utf_8 if reasonably practical.
Well, it doesn't seem to be reasonably practical. In order to fix a bug, I prefer not having to fix some unrelated but complex code, especially when it is unclear how and if that code really can and/or should be 'fixed', and exactly what that would entail.
Now if, after the proposed patch has been applied, someone wants to refactor so that string_to_valid_utf_8 disappears or becomes implemented in terms of something else, then that's perfectly fine, as long as the bug remains fixed.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Mon, 17 Aug 2020 19:57:01 GMT)
Full text and
rfc822 format available.
Message #28 received at 42904 <at> debbugs.gnu.org (full text, mbox):
On Mon, Aug 17, 2020 at 08:48:08PM +0200, Mattias Engdegård wrote:
> 17 aug. 2020 kl. 19.05 skrev Eli Zaretskii <eliz <at> gnu.org>:
>
> > Well, it is always easier to special-case some use case, but we have
> > general APIs for a reason. In particular, having several similar but
> > subtly different functions is confusing and causes mistakes.
>
> The new function is much simpler and easier to use than
> encode_string_utf_8 precisely for that reason: to avoid confusion
> and mistakes, both of which I got in spades when trying to use it.
Sorry if this is a stupid question, but would using UTF-16 be easier?
This appears to work (although I'm sure it's not the right way to do this):
modified src/nsfns.m
@@ -405,11 +405,10 @@ Turn the input menu (an NSMenu) into a lisp list for tracking on lisp side.
NSString *str;
NSView *view = FRAME_NS_VIEW (f);
+ encoded_name = code_convert_string_norecord (name, Qutf_16le, 1);
- encoded_name = ENCODE_UTF_8 (name);
-
- str = [NSString stringWithUTF8String: SSDATA (encoded_name)];
-
+ str = [NSString stringWithCharacters: (const unichar *) SDATA (encoded_name)
+ length: SBYTES (encoded_name) / sizeof (unichar)];
/* Don't change the name if it's already NAME. */
if (! [[[view window] title] isEqualToString: str])
--
Alan Third
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Tue, 18 Aug 2020 08:08:02 GMT)
Full text and
rfc822 format available.
Message #31 received at 42904 <at> debbugs.gnu.org (full text, mbox):
17 aug. 2020 kl. 21.56 skrev Alan Third <alan <at> idiocy.org>:
> Sorry if this is a stupid question, but would using UTF-16 be easier?
> This appears to work (although I'm sure it's not the right way to do this):
A good question! It has the advantage of requiring no new code, but is slightly inferior in that raw bytes are not replaced with U+FFFD but with spaces; we should probably set :default-char to #xfffd for the utf-16 coding systems.
Unpaired surrogates are handled splendidly by accident: the conversion to UTF-16 preserves them, perhaps incorrectly so, but the NS libs display them as a distinctive and very suggestive glyph. Apparently [NSString stringWithCharacters:] doesn't perform any validation at all.
On the other hand, I think we still do need a subroutine for converting to UTF-8 for passing strings to system code where graceful handling of invalid encoding cannot be assumed, as there appears to be nothing in Emacs that can do this.
> + encoded_name = code_convert_string_norecord (name, Qutf_16le, 1);
Presumably this should be utf_16be on big-endian platforms. We still support PowerPC macOS, don't we?
> + str = [NSString stringWithCharacters: (const unichar *) SDATA (encoded_name)
Is SDATA guaranteed to be 16-bit aligned? Doesn't matter on x86 or PowerPC, but strictly speaking...
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Tue, 18 Aug 2020 08:44:02 GMT)
Full text and
rfc822 format available.
Message #34 received at 42904 <at> debbugs.gnu.org (full text, mbox):
On Tue, Aug 18, 2020 at 10:07:27AM +0200, Mattias Engdegård wrote:
> 17 aug. 2020 kl. 21.56 skrev Alan Third <alan <at> idiocy.org>:
>
> > + encoded_name = code_convert_string_norecord (name, Qutf_16le, 1);
>
> Presumably this should be utf_16be on big-endian platforms. We still support PowerPC macOS, don't we?
No, however I imagine we support GNUstep on big endian systems.
> > + str = [NSString stringWithCharacters: (const unichar *) SDATA (encoded_name)
>
> Is SDATA guaranteed to be 16-bit aligned? Doesn't matter on x86 or
> PowerPC, but strictly speaking...
I've no idea, I adapted the code from make_multibyte_string in
alloc.c, and one of it's callers (although I can't remember which
right now). I'm expecting Eli to appear and tell me this is the
entirely wrong way of doing this. ;)
Anyway, as I understand it the internal representation of NS strings
are UTF-16, so the conversion through UTF-8 seems a bit of a waste if
we can go direct.
--
Alan Third
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Tue, 18 Aug 2020 11:49:02 GMT)
Full text and
rfc822 format available.
Message #37 received at 42904 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
18 aug. 2020 kl. 10.43 skrev Alan Third <alan <at> idiocy.org>:
>> Presumably this should be utf_16be on big-endian platforms. We still support PowerPC macOS, don't we?
>
> No, however I imagine we support GNUstep on big endian systems.
Well then, it's easy to deal with.
>>> + str = [NSString stringWithCharacters: (const unichar *) SDATA (encoded_name)
>>
>> Is SDATA guaranteed to be 16-bit aligned? Doesn't matter on x86 or
>> PowerPC, but strictly speaking...
>
> I've no idea, I adapted the code from make_multibyte_string in
> alloc.c, and one of it's callers (although I can't remember which
> right now). I'm expecting Eli to appear and tell me this is the
> entirely wrong way of doing this. ;)
Data alignment trapping is optional on 64-bit ARM but I'd be surprised if macOS enabled it. It might be hazardous for all the GNUStep-on-MIPS workstations.
> Anyway, as I understand it the internal representation of NS strings
> are UTF-16, so the conversion through UTF-8 seems a bit of a waste if
> we can go direct.
Maybe, but the conversion to UTF-16 then has to be done on the Emacs side instead, probably less efficiently than in the NS libs. It's probably a wash.
Anyway, here is an alternative patch using your method. Tell us what you think.
[0001-UTF-16-Fix-NS-crash-on-invalid-frame-title-string-bug-42904.patch (application/octet-stream, attachment)]
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Tue, 18 Aug 2020 12:23:02 GMT)
Full text and
rfc822 format available.
Message #40 received at 42904 <at> debbugs.gnu.org (full text, mbox):
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Tue, 18 Aug 2020 13:48:10 +0200
> Cc: Eli Zaretskii <eliz <at> gnu.org>, 42904 <at> debbugs.gnu.org
>
> Anyway, here is an alternative patch using your method. Tell us what you think.
Thanks, but I don't think we should modify the :default-char attribute
of the UTF-* encodings as part of this change. It's a separate issue,
and is a backward-incompatible change of sorts. For instance, we
should consider what this will do to display on TTY frames that don't
support Unicode. So I think we should discuss this issue separately
before we make such a change.
Why is it a problem to display a space instead of invalid bytes in
this case?
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Tue, 18 Aug 2020 12:25:02 GMT)
Full text and
rfc822 format available.
Message #43 received at 42904 <at> debbugs.gnu.org (full text, mbox):
> Date: Tue, 18 Aug 2020 10:43:10 +0200 (CEST)
> From: Alan Third <alan <at> idiocy.org>
> Cc: Eli Zaretskii <eliz <at> gnu.org>, 42904 <at> debbugs.gnu.org
>
> > Is SDATA guaranteed to be 16-bit aligned? Doesn't matter on x86 or
> > PowerPC, but strictly speaking...
>
> I've no idea, I adapted the code from make_multibyte_string in
> alloc.c, and one of it's callers (although I can't remember which
> right now). I'm expecting Eli to appear and tell me this is the
> entirely wrong way of doing this. ;)
It isn't wrong (and there's no need to worry about alignment in this
case, AFAIK).
Thanks.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Tue, 18 Aug 2020 14:12:01 GMT)
Full text and
rfc822 format available.
Message #46 received at 42904 <at> debbugs.gnu.org (full text, mbox):
18 aug. 2020 kl. 14.24 skrev Eli Zaretskii <eliz <at> gnu.org>:
> It isn't wrong (and there's no need to worry about alignment in this
> case, AFAIK).
Do you mean that SDATA is guaranteed to be aligned, or that no NS platforms that Emacs runs on (or is likely to run on in the near future, such as macOS on arm64) trap on unaligned?
> Thanks, but I don't think we should modify the :default-char attribute
> of the UTF-* encodings as part of this change. It's a separate issue,
> and is a backward-incompatible change of sorts. For instance, we
> should consider what this will do to display on TTY frames that don't
> support Unicode. So I think we should discuss this issue separately
> before we make such a change.
Yes, we can certainly make it a separate change. All bug fixes are backward-incompatible in some respect; it is not reasonable to depend on non-Unicode characters being translated to spaces when converted to UTF-16 since that is neither documented nor reasonably expected behaviour.
> Why is it a problem to display a space instead of invalid bytes in
> this case?
A problem is not necessary for a change to be desirable. The Unicode replacement character clearly indicates that something could not be encoded correctly, and the exact position for it; it's universally recognised and valuable for users and developers alike. Space is the default value for :default-char, and that it isn't U+FFFD for UTF-16 (or other Unicode encodings) is a clear bug, since that is the correct character to use for that purpose.
My guess is that space was chosen as default because it's a character that occurs in all coding systems, but it is clearly wrong for UTF-16. 'us-ascii' uses '?' for :default-char, which is a better character in that repertoire.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Tue, 18 Aug 2020 14:41:02 GMT)
Full text and
rfc822 format available.
Message #49 received at 42904 <at> debbugs.gnu.org (full text, mbox):
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Tue, 18 Aug 2020 16:11:02 +0200
> Cc: Alan Third <alan <at> idiocy.org>, 42904 <at> debbugs.gnu.org
>
> 18 aug. 2020 kl. 14.24 skrev Eli Zaretskii <eliz <at> gnu.org>:
>
> > It isn't wrong (and there's no need to worry about alignment in this
> > case, AFAIK).
>
> Do you mean that SDATA is guaranteed to be aligned, or that no NS platforms that Emacs runs on (or is likely to run on in the near future, such as macOS on arm64) trap on unaligned?
Both, AFAIK.
> it is not reasonable to depend on non-Unicode characters being translated to spaces
We are not the only program which does that, but.
> > Why is it a problem to display a space instead of invalid bytes in
> > this case?
>
> A problem is not necessary for a change to be desirable. The Unicode replacement character clearly indicates that something could not be encoded correctly, and the exact position for it
But that character only makes sense when it can be displayed, because
otherwise no one will realize what was the problem.
Anyway, this discussion should be on emacs-devel, not as part of an
unrelated bug report.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Tue, 18 Aug 2020 15:22:02 GMT)
Full text and
rfc822 format available.
Message #52 received at 42904 <at> debbugs.gnu.org (full text, mbox):
18 aug. 2020 kl. 16.40 skrev Eli Zaretskii <eliz <at> gnu.org>:
>>> It isn't wrong (and there's no need to worry about alignment in this
>>> case, AFAIK).
>>
>> Do you mean that SDATA is guaranteed to be aligned, or that no NS platforms that Emacs runs on (or is likely to run on in the near future, such as macOS on arm64) trap on unaligned?
>
> Both, AFAIK.
Thank you. I'm still wary about making such assumptions but I suppose we commit worse sins.
> But that character only makes sense when it can be displayed, because
> otherwise no one will realize what was the problem.
Certainly. Given that TTYs aren't typically used for displaying UTF-16, and that the Unicode-capable terminals I've tried seem to show just fine (in case someone converts the UTF-16 back to UTF-8), I think we are reasonably safe.
In any case it's no different from not being able to show an any other character.
> Anyway, this discussion should be on emacs-devel, not as part of an
> unrelated bug report.
Not unrelated at all, but by all means.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Tue, 18 Aug 2020 17:29:01 GMT)
Full text and
rfc822 format available.
Message #55 received at 42904 <at> debbugs.gnu.org (full text, mbox):
On Tue, Aug 18, 2020 at 01:48:10PM +0200, Mattias Engdegård wrote:
> 18 aug. 2020 kl. 10.43 skrev Alan Third <alan <at> idiocy.org>:
> > Anyway, as I understand it the internal representation of NS strings
> > are UTF-16, so the conversion through UTF-8 seems a bit of a waste if
> > we can go direct.
>
> Maybe, but the conversion to UTF-16 then has to be done on the Emacs
> side instead, probably less efficiently than in the NS libs. It's
> probably a wash.
>
> Anyway, here is an alternative patch using your method. Tell us what
> you think.
Looks good to me. The only thought I have is that perhaps we should
consider extending NSString to handle these lisp strings rather than
making it a separate function? We could provide a method to convert to
a lisp string as well, although that's not as complex.
I believe using categories would do it without us having to create a
new EmacsString class or similar.
I don't know if this is worth it because I don't know if we really
need these clean conversions elsewhere, but the neatness of
newStr = [NSString withLispObject:str];
appeals. :)
--
Alan Third
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Thu, 20 Aug 2020 09:28:02 GMT)
Full text and
rfc822 format available.
Message #58 received at 42904 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
18 aug. 2020 kl. 19.28 skrev Alan Third <alan <at> idiocy.org>:
> Looks good to me. The only thought I have is that perhaps we should
> consider extending NSString to handle these lisp strings rather than
> making it a separate function? We could provide a method to convert to
> a lisp string as well, although that's not as complex.
>
> I believe using categories would do it without us having to create a
> new EmacsString class or similar.
Fun, I hadn't done that before! Of course we should.
As it happens I just enjoyed the HOPL paper about the history of Objective-C (https://dl.acm.org/doi/10.1145/3386332). An excellent read in general, and it has some history about how the categories came about.
Here is an updated patch: it is now self-contained and does not change anything outside the NS backend.
There is a minor imperfection: the incoming name string can actually be miscoded if it contains both non-ASCII characters and raw bytes. As an example, consider
(rename-buffer "aéb\300")
In xdisp.c:12497, the Lisp name string is created using make_string which decides that the above multibyte string should really be unibyte, and that confuses the converter. It is of no great consequence, but it makes the result look messier than it should have: "a��b��c" instead of "aéb�c".
[0001-Fix-NS-crash-on-invalid-frame-title-string-bug-42904.patch (application/octet-stream, attachment)]
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Thu, 20 Aug 2020 13:25:02 GMT)
Full text and
rfc822 format available.
Message #61 received at 42904 <at> debbugs.gnu.org (full text, mbox):
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Thu, 20 Aug 2020 11:27:01 +0200
> Cc: Eli Zaretskii <eliz <at> gnu.org>, 42904 <at> debbugs.gnu.org
>
> There is a minor imperfection: the incoming name string can actually be miscoded if it contains both non-ASCII characters and raw bytes. As an example, consider
>
> (rename-buffer "aéb\300")
>
> In xdisp.c:12497, the Lisp name string is created using make_string which decides that the above multibyte string should really be unibyte, and that confuses the converter. It is of no great consequence, but it makes the result look messier than it should have: "a��b��c" instead of "aéb�c".
What would you like xdisp.c to do instead in this case? If there's an
alternative way of dealing with such frame titles that is better in
some sense, we could either adopt it for all platforms, or only for
NS.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Thu, 20 Aug 2020 13:25:02 GMT)
Full text and
rfc822 format available.
Message #64 received at 42904 <at> debbugs.gnu.org (full text, mbox):
On Thu, Aug 20, 2020 at 11:27:01AM +0200, Mattias Engdegård wrote:
> 18 aug. 2020 kl. 19.28 skrev Alan Third <alan <at> idiocy.org>:
>
> > Looks good to me. The only thought I have is that perhaps we should
> > consider extending NSString to handle these lisp strings rather than
> > making it a separate function? We could provide a method to convert to
> > a lisp string as well, although that's not as complex.
> >
> > I believe using categories would do it without us having to create a
> > new EmacsString class or similar.
>
> Fun, I hadn't done that before! Of course we should.
>
> As it happens I just enjoyed the HOPL paper about the history of
> Objective-C (https://dl.acm.org/doi/10.1145/3386332). An excellent
> read in general, and it has some history about how the categories
> came about.
I haven't seen that before and am just reading through it now. Thanks.
> Here is an updated patch: it is now self-contained and does not
> change anything outside the NS backend.
It looks good to me. The only thing I'd like you to change is to move
the implementation down to the "Class implementations" part of
nsfns.m.
> There is a minor imperfection: the incoming name string can actually
> be miscoded if it contains both non-ASCII characters and raw bytes.
> As an example, consider
>
> (rename-buffer "aéb\300")
>
> In xdisp.c:12497, the Lisp name string is created using make_string
> which decides that the above multibyte string should really be
> unibyte, and that confuses the converter. It is of no great
> consequence, but it makes the result look messier than it should
> have: "a��b��c" instead of "aéb�c".
I think we can live with that, it's definitely better than a crash and
seems reasonable given that the input is junk.
Thanks!
--
Alan Third
Reply sent
to
Mattias Engdegård <mattiase <at> acm.org>
:
You have taken responsibility.
(Thu, 20 Aug 2020 17:45:03 GMT)
Full text and
rfc822 format available.
Notification sent
to
Mattias Engdegård <mattiase <at> acm.org>
:
bug acknowledged by developer.
(Thu, 20 Aug 2020 17:45:03 GMT)
Full text and
rfc822 format available.
Message #69 received at 42904-done <at> debbugs.gnu.org (full text, mbox):
20 aug. 2020 kl. 15.24 skrev Alan Third <alan <at> idiocy.org>:
> It looks good to me. The only thing I'd like you to change is to move
> the implementation down to the "Class implementations" part of
> nsfns.m.
Moved, and pushed. Thank you!
Reply sent
to
Mattias Engdegård <mattiase <at> acm.org>
:
You have taken responsibility.
(Thu, 20 Aug 2020 17:45:03 GMT)
Full text and
rfc822 format available.
Notification sent
to
Filipp Gunbin <fgunbin <at> fastmail.fm>
:
bug acknowledged by developer.
(Thu, 20 Aug 2020 17:45:03 GMT)
Full text and
rfc822 format available.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Thu, 20 Aug 2020 18:47:02 GMT)
Full text and
rfc822 format available.
Message #77 received at 42904 <at> debbugs.gnu.org (full text, mbox):
20 aug. 2020 kl. 15.24 skrev Eli Zaretskii <eliz <at> gnu.org>:
> What would you like xdisp.c to do instead in this case? If there's an
> alternative way of dealing with such frame titles that is better in
> some sense, we could either adopt it for all platforms, or only for
> NS.
Not sure how to deal with it, but maybe it's just a matter of settling on multibyte representation when building the title (as in mode_line_noprop_buf and so on)? I presume that the current ambiguity comes from when there were good reasons to build these strings in various unibyte encodings, but maybe it isn't motivated today?
If it is at all any trouble at all, just leave it as it is. On the other hand, perhaps we have found a way to simplify the code by accident.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Thu, 20 Aug 2020 19:14:02 GMT)
Full text and
rfc822 format available.
Message #80 received at 42904 <at> debbugs.gnu.org (full text, mbox):
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Thu, 20 Aug 2020 20:46:17 +0200
> Cc: alan <at> idiocy.org, 42904 <at> debbugs.gnu.org
>
> 20 aug. 2020 kl. 15.24 skrev Eli Zaretskii <eliz <at> gnu.org>:
>
> > What would you like xdisp.c to do instead in this case? If there's an
> > alternative way of dealing with such frame titles that is better in
> > some sense, we could either adopt it for all platforms, or only for
> > NS.
>
> Not sure how to deal with it, but maybe it's just a matter of settling on multibyte representation when building the title (as in mode_line_noprop_buf and so on)?
I don't think I understand. mode_line_noprop_buf gets the bytes, and
then we call make_string on it, so the result is the same as the one
you'd like to avoid. Or am I missing something?
By "settling on multibyte representation", do you mean that we should
convert raw bytes to their multibyte form? Or do you mean something
else?
> I presume that the current ambiguity comes from when there were good reasons to build these strings in various unibyte encodings, but maybe it isn't motivated today?
Again, what would you like to have instead? Would calling
str_as_multibyte do what you want?
The reason we build a unibyte string is that the presence of raw bytes
generally means a unibyte string is desired; it's a heuristic. It is
also the simplest thing to do in this case, and always works because
it doesn't change the byte sequence of the original string.
> If it is at all any trouble at all, just leave it as it is. On the other hand, perhaps we have found a way to simplify the code by accident.
See above: maybe str_as_multibyte is what you want?
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Fri, 21 Aug 2020 09:40:02 GMT)
Full text and
rfc822 format available.
Message #83 received at 42904 <at> debbugs.gnu.org (full text, mbox):
20 aug. 2020 kl. 21.13 skrev Eli Zaretskii <eliz <at> gnu.org>:
> I don't think I understand. mode_line_noprop_buf gets the bytes, and
> then we call make_string on it, so the result is the same as the one
> you'd like to avoid. Or am I missing something?
>
> By "settling on multibyte representation", do you mean that we should
> convert raw bytes to their multibyte form? Or do you mean something
> else?
No, I think we are talking about the same thing. Basically, it's about how the bytes end up in mode_line_noprop_buf in the first place, since currently the information of whether it should be interpreted as unibyte or multibyte gets lost as soon as data from the strings it is composed of (like the buffer name for %b, file name for %f etc) is added to it. Then make_string tries to restore that information by looking at the bytes, and it is not always accurate.
One way of doing this is to always make sure that the input strings (buffer name, file name, frame-title-format etc) are always in multibyte form. Another would be to convert to multibyte as those strings are used, presumably in decode_mode_spec. You know this code a lot better than I do, but the former may be slightly more workable (and efficient).
> Again, what would you like to have instead? Would calling
> str_as_multibyte do what you want?
No, I don't think so -- once the unibyte/multibyte bit is lost, it can only be restored imperfectly if all we have is the sequence of bytes. In mathematical terms, the function that maps an arbitrary string object to its bytes has no inverse. (Consider the unibyte string "\xc3\xa5" -- should the bytes {c3, a5} be recreated as that unibyte string, or as the multibyte string "å"?)
Again we are talking about trivialities here, but perhaps the same syndrome will arise in other contexts where it matters more. If we wrote Emacs from scratch we likely wouldn't have unibyte strings at all: they are only there for compatibility and various niche uses and performance hacks. I don't think it's unreasonable to start normalising strings to multibyte where it matters.
Thanks for your patience!
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Fri, 21 Aug 2020 13:27:02 GMT)
Full text and
rfc822 format available.
Message #86 received at 42904 <at> debbugs.gnu.org (full text, mbox):
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Fri, 21 Aug 2020 11:39:30 +0200
> Cc: alan <at> idiocy.org, 42904 <at> debbugs.gnu.org
>
> Basically, it's about how the bytes end up in mode_line_noprop_buf in the first place, since currently the information of whether it should be interpreted as unibyte or multibyte gets lost as soon as data from the strings it is composed of (like the buffer name for %b, file name for %f etc) is added to it. Then make_string tries to restore that information by looking at the bytes, and it is not always accurate.
make_string was written to work on byte sequences that don't begin as
the payload of a Lisp string. So it doesn't handle the information
you say is being lost, because it doesn't expect such information to
be available to begin with.
Which is basically just another way of saying "you want something
other than make_string" here.
> One way of doing this is to always make sure that the input strings (buffer name, file name, frame-title-format etc) are always in multibyte form.
That's what I thought I was suggesting.
> > Again, what would you like to have instead? Would calling
> > str_as_multibyte do what you want?
>
> No, I don't think so -- once the unibyte/multibyte bit is lost, it can only be restored imperfectly if all we have is the sequence of bytes.
That is true, but str_as_multibyte simply interprets any valid UTF-8
sequence as a character, and any invalid sequence as a raw bytes. I
thought this was precisely what you wanted for this use case, no?
> If we wrote Emacs from scratch we likely wouldn't have unibyte strings at all: they are only there for compatibility and various niche uses and performance hacks. I don't think it's unreasonable to start normalising strings to multibyte where it matters.
Emacs (as any other old editor) started with only unibyte strings, so
that's history for you. Some modern text-handling environments solve
this conundrum by not supporting raw bytes at all, but Emacs knows
better.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Fri, 21 Aug 2020 14:54:02 GMT)
Full text and
rfc822 format available.
Message #89 received at 42904 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
21 aug. 2020 kl. 15.26 skrev Eli Zaretskii <eliz <at> gnu.org>:
> That is true, but str_as_multibyte simply interprets any valid UTF-8
> sequence as a character, and any invalid sequence as a raw bytes. I
> thought this was precisely what you wanted for this use case, no?
Sorry, I read the comment for that function and got the impression that it would interpret raw bytes as Latin-1. Fortunately that wasn't true, and using it seems to be a clear improvement. Now a mixture of non-ASCII and raw bytes, like "a\377büc" results in the title "a��büc", which is one � too many but good enough.
What about the attached patch then? Only tested on macOS, admittedly.
[0001-Always-make-a-multibyte-string-for-the-frame-title-b.patch (application/octet-stream, attachment)]
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Fri, 21 Aug 2020 15:29:02 GMT)
Full text and
rfc822 format available.
Message #92 received at 42904 <at> debbugs.gnu.org (full text, mbox):
> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Fri, 21 Aug 2020 16:53:34 +0200
> Cc: alan <at> idiocy.org, 42904 <at> debbugs.gnu.org
>
> > That is true, but str_as_multibyte simply interprets any valid UTF-8
> > sequence as a character, and any invalid sequence as a raw bytes. I
> > thought this was precisely what you wanted for this use case, no?
>
> Sorry, I read the comment for that function and got the impression that it would interpret raw bytes as Latin-1.
That was a remnant from pre-Unicode Emacs; I've fixed the commentary
to accurately describe what happens now.
> Fortunately that wasn't true, and using it seems to be a clear improvement. Now a mixture of non-ASCII and raw bytes, like "a\377büc" results in the title "a��büc", which is one � too many but good enough.
>
> What about the attached patch then? Only tested on macOS, admittedly.
It looks OK, but someone should see what it does on X before we make
this change on all platforms. (On w32 frames, the display stops
before the first raw byte, but it also does that with the current
code.) If the effect on X is for the worse, we will have to condition
this by HAVE_NS.
> title = mode_line_noprop_buf + title_start;
> + /* Make sure any raw bytes in the title are properly
> + multibyte-encoded. */
It is better not to use "encoded" when talking about internal
representation. I'd say something like "represented by their
multibyte sequences" instead.
Thanks.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Fri, 21 Aug 2020 15:51:01 GMT)
Full text and
rfc822 format available.
Message #95 received at 42904 <at> debbugs.gnu.org (full text, mbox):
21 aug. 2020 kl. 17.27 skrev Eli Zaretskii <eliz <at> gnu.org>:
> That was a remnant from pre-Unicode Emacs; I've fixed the commentary
> to accurately describe what happens now.
Much appreciated.
> It looks OK, but someone should see what it does on X before we make
> this change on all platforms. (On w32 frames, the display stops
> before the first raw byte, but it also does that with the current
> code.) If the effect on X is for the worse, we will have to condition
> this by HAVE_NS.
I will be able to test on X in a few days.
> It is better not to use "encoded" when talking about internal
> representation. I'd say something like "represented by their
> multibyte sequences" instead.
Thank you, the comment will be amended.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#42904
; Package
emacs
.
(Sun, 23 Aug 2020 17:24:02 GMT)
Full text and
rfc822 format available.
Message #98 received at 42904 <at> debbugs.gnu.org (full text, mbox):
21 aug. 2020 kl. 17.50 skrev Mattias Engdegård <mattiase <at> acm.org>:
>> It looks OK, but someone should see what it does on X before we make
>> this change on all platforms. (On w32 frames, the display stops
>> before the first raw byte, but it also does that with the current
>> code.) If the effect on X is for the worse, we will have to condition
>> this by HAVE_NS.
>
> I will be able to test on X in a few days.
Now tested on X with no regression observed, thus pushed to master. Thank you!
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Mon, 21 Sep 2020 11:24:03 GMT)
Full text and
rfc822 format available.
This bug report was last modified 4 years and 269 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.