Package: emacs
Reported by: Phil <p.stephani2 <at> gmail.com>
Date: Thu, 11 Aug 2016 18:57:02 UTC
Severity: normal
Found in version 25.1
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
From: Eli Zaretskii <eliz <at> gnu.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: p.stephani2 <at> gmail.com, johnw <at> gnu.org, nicolas <at> petton.fr, 24206 <at> debbugs.gnu.org
Subject: bug#24206: 25.1; Curly quotes generate invalid strings, leading to a segfault
Date: Mon, 15 Aug 2016 19:09:40 +0300

> Cc: p.stephani2 <at> gmail.com, 24206 <at> debbugs.gnu.org, johnw <at> gnu.org,
>  nicolas <at> petton.fr
> From: Paul Eggert <eggert <at> cs.ucla.edu>
> Date: Sun, 14 Aug 2016 19:04:42 -0700
>
> Eli Zaretskii wrote:
> > Its multibyteness is entirely in Emacs's imagination.
>
> Sure, but Emacs should not substitute "\342\200\230" for "`". The point of
> text-quoting-style is to substitute quotes, not byte string encodings of quotes.

I'm not sure.  We never discussed what Emacs should do when
substitute-command-keys is called on a unibyte non-ASCII string which
requires quote substitution.  Other substitutions, including those
that produce ASCII quote characters, previously would leave the
unibyte string unibyte.  But with your changes, any substitution
converts the string into multibyte:

  (multibyte-string-p (substitute-command-keys "\200\\[goto-char]")) => t

I think this might be a subtle regression, because some code might
just find itself mixing multibyte and unibyte strings where previously
there were only unibyte strings.

> >> > More generally, Fsubstitute_command_keys is quite confused about unibyte
> >> > versus multibyte issues. It merges together a number of strings, and
> >> > assumes that they are all multibyte iff the original string is
> >> > multibyte, which is obviously not true in general.
> > Could you please point out the specific places where this is done?
>
> OK, here's a contrived example. Run this code in emacs-25:
>
> (progn
>    (setq km (make-keymap))
>    (define-key km "≠" 'global-set-key)
>    (substitute-command-keys "\200\\<km>\\[global-set-key]"))
>
> This should return a 2-character string equal to "\200≠".

I'm not sure your expectations are correct: as the original string is
unibyte, the output of "\200≠", which is multibyte, might not be what
the users expect.  They might expect "\200\342\211\240" instead.

> But in Emacs 25 it dumps core, at least on my platform (Fedora 23
> x86-64). And in Emacs 24 on my platform it returns a malformed
> string that prints as "\242\1340" but has length 2. I suppose we
> could make Emacs 24 dump core too, though I haven't tried hard to do
> that.

The errors are easily fixed, though.  Below I show 2 patches.  The
first one should go to master (after reverting yours), and IMO is also
safe enough for emacs-25.  But if it is deemed not safe enough for the
release, the second patch is safer.  The second patch doesn't produce
"\200≠" in your test case, but neither did Emacs 24, so this is not a
regression.

Comments?  Let's decide on what to do with emacs-25 first, since that
blocks the release, and then discuss master if needed.

Thanks.

--- src/doc.c~0	2016-06-20 08:49:44.000000000 +0300
+++ src/doc.c	2016-08-15 11:24:07.894579900 +0300
@@ -738,8 +738,9 @@ Otherwise, return a new string.  */)
   unsigned char const *start;
   ptrdiff_t length, length_byte;
   Lisp_Object name;
-  bool multibyte;
+  bool multibyte, pure_ascii;
   ptrdiff_t nchars;
+  Lisp_Object orig_string = Qnil;
 
   if (NILP (string))
     return Qnil;
@@ -752,6 +753,20 @@ Otherwise, return a new string.  */)
   enum text_quoting_style quoting_style = text_quoting_style ();
 
   multibyte = STRING_MULTIBYTE (string);
+  /* Pure-ASCII unibyte input strings should produce unibyte strings
+     if substitution doesn't yield non-ASCII bytes, otherwise they
+     should produce multibyte strings.  */
+  pure_ascii = SBYTES (string) == count_size_as_multibyte (SDATA (string),
+                                                           SCHARS (string));
+  /* If the input string is unibyte and includes non-ASCII characters,
+     make a multibyte copy, so as to be able to return the original
+     unibyte string if no substitution eventually happens.  */
+  if (!multibyte && !pure_ascii)
+    {
+      orig_string = string;
+      string = Fstring_make_multibyte (Fcopy_sequence (string));
+      multibyte = true;
+    }
   nchars = 0;
 
   /* KEYMAP is either nil (which means search all the active keymaps)
@@ -933,8 +948,8 @@ Otherwise, return a new string.  */)
 
         subst_string:
           start = SDATA (tem);
-          length = SCHARS (tem);
           length_byte = SBYTES (tem);
+          length = SCHARS (tem);
         subst:
           nonquotes_changed = true;
         subst_quote:
@@ -956,8 +971,8 @@ Otherwise, return a new string.  */)
                && quoting_style == CURVE_QUOTING_STYLE)
         {
           start = (unsigned char const *) (strp[0] == '`' ? uLSQM : uRSQM);
-          length = 1;
           length_byte = sizeof uLSQM - 1;
+          length = 1;
           idx = strp - SDATA (string) + 1;
           goto subst_quote;
         }
@@ -995,6 +1010,8 @@ Otherwise, return a new string.  */)
             }
         }
     }
+  else if (!NILP (orig_string))
+    tem = orig_string;
   else
     tem = string;
   xfree (buf);


--- src/doc.c~0	2016-06-20 08:49:44.000000000 +0300
+++ src/doc.c	2016-08-15 11:13:15.132137200 +0300
@@ -738,7 +738,7 @@ Otherwise, return a new string.  */)
   unsigned char const *start;
   ptrdiff_t length, length_byte;
   Lisp_Object name;
-  bool multibyte;
+  bool multibyte, pure_ascii;
   ptrdiff_t nchars;
 
   if (NILP (string))
@@ -752,6 +752,11 @@ Otherwise, return a new string.  */)
   enum text_quoting_style quoting_style = text_quoting_style ();
 
   multibyte = STRING_MULTIBYTE (string);
+  /* Pure-ASCII unibyte input strings should produce unibyte strings
+     if substitution doesn't yield non-ASCII bytes, otherwise they
+     should produce multibyte strings.  */
+  pure_ascii = SBYTES (string) == count_size_as_multibyte (SDATA (string),
+                                                           SCHARS (string));
   nchars = 0;
 
   /* KEYMAP is either nil (which means search all the active keymaps)
@@ -933,8 +938,11 @@ Otherwise, return a new string.  */)
 
         subst_string:
           start = SDATA (tem);
-          length = SCHARS (tem);
           length_byte = SBYTES (tem);
+          if (multibyte || pure_ascii)
+            length = SCHARS (tem);
+          else
+            length = length_byte;
         subst:
           nonquotes_changed = true;
         subst_quote:
@@ -956,8 +964,11 @@ Otherwise, return a new string.  */)
                && quoting_style == CURVE_QUOTING_STYLE)
         {
           start = (unsigned char const *) (strp[0] == '`' ? uLSQM : uRSQM);
-          length = 1;
           length_byte = sizeof uLSQM - 1;
+          if (multibyte || pure_ascii)
+            length = 1;
+          else
+            length = length_byte;
           idx = strp - SDATA (string) + 1;
           goto subst_quote;
         }
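
The distinction between `length` (characters) and `length_byte` (bytes) that both patches turn on can be sketched in standalone C. This is a minimal sketch, not code from src/doc.c: the helpers `utf8_char_count`, `is_pure_ascii`, and `substitution_length` are hypothetical names introduced only for illustration, and `is_pure_ascii` merely approximates the patches' comparison against `count_size_as_multibyte`, on the assumption that a unibyte string is pure ASCII exactly when no byte has its high bit set.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: count the characters in a valid UTF-8 byte
   sequence by skipping continuation bytes (those of the form 10xxxxxx).  */
size_t
utf8_char_count (const unsigned char *s, size_t nbytes)
{
  size_t nchars = 0;
  for (size_t i = 0; i < nbytes; i++)
    if ((s[i] & 0xC0) != 0x80)
      nchars++;
  return nchars;
}

/* Hypothetical stand-in for the pure_ascii test in the patches.
   Assumption: the string is pure ASCII iff no byte is >= 0x80.  */
bool
is_pure_ascii (const unsigned char *s, size_t nbytes)
{
  for (size_t i = 0; i < nbytes; i++)
    if (s[i] >= 0x80)
      return false;
  return true;
}

/* Mirrors the choice of `length` in the second patch: count characters
   when the input is multibyte or pure ASCII, otherwise count bytes.  */
size_t
substitution_length (bool multibyte, bool pure_ascii,
                     size_t nchars, size_t nbytes)
{
  return (multibyte || pure_ascii) ? nchars : nbytes;
}
```

For the three bytes of "\342\200\230" (the UTF-8 encoding of U+2018), `utf8_char_count` reports one character, so `substitution_length` yields 1 for multibyte or pure-ASCII input and 3 for unibyte non-ASCII input, which is the case the second patch handles by setting `length = length_byte`.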