GNU bug report logs - #24784
26.0.50; JSON strings with utf-16 escape codes

Previous Next

Package: emacs;

Reported by: Helmut Eller <eller.helmut <at> gmail.com>

Date: Mon, 24 Oct 2016 18:07:01 UTC

Severity: normal

Found in version 26.0.50

Done: Philipp Stephani <p.stephani2 <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 24784 in the body.
You can then email your comments to 24784 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#24784; Package emacs. (Mon, 24 Oct 2016 18:07:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Helmut Eller <eller.helmut <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Mon, 24 Oct 2016 18:07:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Helmut Eller <eller.helmut <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: 26.0.50; JSON strings with utf-16 escape codes
Date: Mon, 24 Oct 2016 20:06:18 +0200
json-read-from-string doesn't parse strings correctly if the the \u
syntax is used to write UTF-16 surrogates:

 (equal (json-read-from-string "\"\\uD834\\uDD1E\"") "\"\U0001D11E\"")
 => nil

The correct result t.  To quote RFC 7159[*]:

   To escape an extended character that is not in the Basic Multilingual
   Plane, the character is represented as a 12-character sequence,
   encoding the UTF-16 surrogate pair.  So, for example, a string
   containing only the G clef character (U+1D11E) may be represented as
   "\uD834\uDD1E".

[*] https://tools.ietf.org/html/rfc7159#section-7


In GNU Emacs 26.0.50.2 (x86_64-unknown-linux-gnu, GTK+ Version 3.14.5)
 of 2016-10-24 built on caladan
Repository revision: 26ccd19269c040ad5960a7567aa5fc88f142c709
Windowing system distributor 'The X.Org Foundation', version 11.0.11604000
System Description:	Debian GNU/Linux 8.5 (jessie)

Configured using:
 'configure --with-xpm=no --with-jpeg=no --with-gif=no --with-tiff=no'

Configured features:
PNG SOUND DBUS GSETTINGS NOTIFY GNUTLS LIBXML2 FREETYPE XFT ZLIB
TOOLKIT_SCROLL_BARS GTK3 X11

Important settings:
  value of $LANG: C.UTF-8
  locale-coding-system: utf-8-unix




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#24784; Package emacs. (Mon, 24 Oct 2016 19:58:02 GMT) Full text and rfc822 format available.

Message #8 received at 24784 <at> debbugs.gnu.org (full text, mbox):

From: Philipp Stephani <p.stephani2 <at> gmail.com>
To: Helmut Eller <eller.helmut <at> gmail.com>, 24784 <at> debbugs.gnu.org
Subject: Re: bug#24784: 26.0.50; JSON strings with utf-16 escape codes
Date: Mon, 24 Oct 2016 19:57:19 +0000
[Message part 1 (text/plain, inline)]
Helmut Eller <eller.helmut <at> gmail.com> schrieb am Mo., 24. Okt. 2016 um
20:58 Uhr:

>
> json-read-from-string doesn't parse strings correctly if the the \u
> syntax is used to write UTF-16 surrogates:
>
>  (equal (json-read-from-string "\"\\uD834\\uDD1E\"") "\"\U0001D11E\"")
>  => nil
>
> The correct result t.  To quote RFC 7159[*]:
>
>    To escape an extended character that is not in the Basic Multilingual
>    Plane, the character is represented as a 12-character sequence,
>    encoding the UTF-16 surrogate pair.  So, for example, a string
>    containing only the G clef character (U+1D11E) may be represented as
>    "\uD834\uDD1E".
>
> [*] https://tools.ietf.org/html/rfc7159#section-7
>
> Thanks for reporting, I've attached a patch.
[Message part 2 (text/html, inline)]
[0001-Fix-encoding-of-JSON-surrogate-pairs.txt (text/plain, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#24784; Package emacs. (Mon, 24 Oct 2016 23:20:01 GMT) Full text and rfc822 format available.

Message #11 received at 24784 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Philipp Stephani <p.stephani2 <at> gmail.com>,
 Helmut Eller <eller.helmut <at> gmail.com>, 24784 <at> debbugs.gnu.org
Subject: Re: bug#24784: 26.0.50; JSON strings with utf-16 escape codes
Date: Tue, 25 Oct 2016 02:19:18 +0300
Philipp,

Thanks. Some comments:

On 24.10.2016 22:57, Philipp Stephani wrote:

> +(defsubst json--decode-utf-16-surrogates (high low)

IIRC, there might be no actual benefit from making it a defsubst. If 
someone could benchmark it, I'd like to see the result.

> +     ;; Special-case UTF-16 surrogate pairs,
> +     ;; cf. https://tools.ietf.org/html/rfc7159#section-7
> +     ((looking-at
> +       (rx (group (any "Dd") (any "89ABab") (= 2 (any "0-9A-Fa-f")))
> +           "\\u" (group (any "Dd") (any "C-Fc-f") (= 2 (any "0-9A-Fa-f")))))
> +      (json-advance 10)
> +      (json--decode-utf-16-surrogates
> +       (string-to-number (match-string 1) 16)
> +       (string-to-number (match-string 2) 16)))

Shouldn't this go below the UTF-8 case, as the less-frequent one?

>  (ert-deftest test-json-encode-string ()
>    (should (equal (json-encode-string "foo") "\"foo\""))
>    (should (equal (json-encode-string "a\n\fb") "\"a\\n\\fb\""))
> -  (should (equal (json-encode-string "\nasdфыв\u001f\u007ffgh\t")
> -                 "\"\\nasdфыв\\u001f\u007ffgh\\t\"")))
> +  (should (equal (json-encode-string "\nasdфыв�\u001f\u007ffgh\t")
> +                 "\"\\nasdфыв�\\u001f\u007ffgh\\t\"")))

Why are we testing string encoding here?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#24784; Package emacs. (Wed, 26 Oct 2016 16:41:02 GMT) Full text and rfc822 format available.

Message #14 received at 24784 <at> debbugs.gnu.org (full text, mbox):

From: Helmut Eller <eller.helmut <at> gmail.com>
To: Dmitry Gutov <dgutov <at> yandex.ru>
Cc: Philipp Stephani <p.stephani2 <at> gmail.com>, 24784 <at> debbugs.gnu.org
Subject: Re: bug#24784: 26.0.50; JSON strings with utf-16 escape codes
Date: Wed, 26 Oct 2016 18:39:57 +0200
On Tue, Oct 25 2016, Dmitry Gutov wrote:

> On 24.10.2016 22:57, Philipp Stephani wrote:
>
>> +(defsubst json--decode-utf-16-surrogates (high low)
>
> IIRC, there might be no actual benefit from making it a defsubst. If
> someone could benchmark it, I'd like to see the result.

I guess it doesn't hurt but I also doubt that it makes a measurable
difference as utf-16 surrogates are rarely needed.

>
>> +     ;; Special-case UTF-16 surrogate pairs,
>> +     ;; cf. https://tools.ietf.org/html/rfc7159#section-7
>> +     ((looking-at
>> +       (rx (group (any "Dd") (any "89ABab") (= 2 (any "0-9A-Fa-f")))
>> +           "\\u" (group (any "Dd") (any "C-Fc-f") (= 2 (any "0-9A-Fa-f")))))
>> +      (json-advance 10)
>> +      (json--decode-utf-16-surrogates
>> +       (string-to-number (match-string 1) 16)
>> +       (string-to-number (match-string 2) 16)))
>
> Shouldn't this go below the UTF-8 case, as the less-frequent one?

There's also an opportunity to detect unpaired surrogates, e.g.:

(defun json-read-escaped-char ()
  "Read the JSON string escaped character at point."
  ;; Skip over the '\'
  (json-advance)
  (let* ((char (json-pop))
         (special (assq char json-special-chars)))
    (cond
     (special (cdr special))
     ((not (eq char ?u)) char)
     ((looking-at "[0-9A-Fa-f]\\{4\\}")
      (let* ((code (string-to-number (match-string 0) 16)))
        (json-advance 4)
        (cond ((<= #xD800 code #xDBFF) ; UTF-16 high surrogate
               (cond ((looking-at "\\\\u\\([Dd][C-Fc-f][0-9A-Fa-f]\\{2\\}\\)")
                      (let ((low (string-to-number (match-string 1) 16)))
                        (json-advance 6)
                        (json--decode-utf-16-surrogates code low)))
                     (t
                      ;; Expected low surrogate missing
                      (signal 'json-string-escape (list (point))))))
              ((<= #xDC00 code #xDFFF)
               ;; Unexpected low surrogate
               (signal 'json-string-escape (list (point))))
              (t
               code))))
     (t
      (signal 'json-string-escape (list (point)))))))




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#24784; Package emacs. (Wed, 26 Oct 2016 17:38:02 GMT) Full text and rfc822 format available.

Message #17 received at 24784 <at> debbugs.gnu.org (full text, mbox):

From: Drew Adams <drew.adams <at> oracle.com>
To: Helmut Eller <eller.helmut <at> gmail.com>, Dmitry Gutov <dgutov <at> yandex.ru>
Cc: Philipp Stephani <p.stephani2 <at> gmail.com>, 24784 <at> debbugs.gnu.org
Subject: RE: bug#24784: 26.0.50; JSON strings with utf-16 escape codes
Date: Wed, 26 Oct 2016 10:37:41 -0700 (PDT)
> >> +(defsubst json--decode-utf-16-surrogates (high low)
> >
> > IIRC, there might be no actual benefit from making it a defsubst.
> > If someone could benchmark it, I'd like to see the result.
> 
> I guess it doesn't hurt but I also doubt that it makes a measurable
> difference as utf-16 surrogates are rarely needed.

IMO, we should never, ever use defsubst nowadays.  Unless you can
come up with a REALLY good rationale.

Every time defsubst is used it throws a monkey wrench in the ability
of users to extend and modify Emacs behavior.  And nothing is really
gained.

Just one opinion.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#24784; Package emacs. (Sat, 31 Dec 2016 16:54:01 GMT) Full text and rfc822 format available.

Message #20 received at 24784 <at> debbugs.gnu.org (full text, mbox):

From: Philipp Stephani <p.stephani2 <at> gmail.com>
To: Dmitry Gutov <dgutov <at> yandex.ru>, Helmut Eller <eller.helmut <at> gmail.com>,
 24784 <at> debbugs.gnu.org
Subject: Re: bug#24784: 26.0.50; JSON strings with utf-16 escape codes
Date: Sat, 31 Dec 2016 16:53:27 +0000
[Message part 1 (text/plain, inline)]
Dmitry Gutov <dgutov <at> yandex.ru> schrieb am Di., 25. Okt. 2016 um 01:19 Uhr:

> Philipp,
>
> Thanks. Some comments:
>
> On 24.10.2016 22:57, Philipp Stephani wrote:
>
> > +(defsubst json--decode-utf-16-surrogates (high low)
>
> IIRC, there might be no actual benefit from making it a defsubst. If
> someone could benchmark it, I'd like to see the result.
>

Agreed; converted to defun. I've only used defsubst because some other
helper functions also used defsubst.


>
> > +     ;; Special-case UTF-16 surrogate pairs,
> > +     ;; cf. https://tools.ietf.org/html/rfc7159#section-7
> > +     ((looking-at
> > +       (rx (group (any "Dd") (any "89ABab") (= 2 (any "0-9A-Fa-f")))
> > +           "\\u" (group (any "Dd") (any "C-Fc-f") (= 2 (any
> "0-9A-Fa-f")))))
> > +      (json-advance 10)
> > +      (json--decode-utf-16-surrogates
> > +       (string-to-number (match-string 1) 16)
> > +       (string-to-number (match-string 2) 16)))
>
> Shouldn't this go below the UTF-8 case, as the less-frequent one?
>

No, the below case is more general and therefore has to come last.


>
> >  (ert-deftest test-json-encode-string ()
> >    (should (equal (json-encode-string "foo") "\"foo\""))
> >    (should (equal (json-encode-string "a\n\fb") "\"a\\n\\fb\""))
> > -  (should (equal (json-encode-string "\nasdфыв\u001f\u007ffgh\t")
> > -                 "\"\\nasdфыв\\u001f\u007ffgh\\t\"")))
> > +  (should (equal (json-encode-string "\nasdфыв𠄞\u001f\u007ffgh\t")
> > +                 "\"\\nasdфыв𠄞\\u001f\u007ffgh\\t\"")))
>
> Why are we testing string encoding here?
>

It's not 100% related to the patch, but I think it can be included for
symmetry reasons (testing encoding as well as decoding).
[Message part 2 (text/html, inline)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#24784; Package emacs. (Sun, 01 Jan 2017 00:46:02 GMT) Full text and rfc822 format available.

Message #23 received at 24784 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dgutov <at> yandex.ru>
To: Philipp Stephani <p.stephani2 <at> gmail.com>,
 Helmut Eller <eller.helmut <at> gmail.com>, 24784 <at> debbugs.gnu.org
Subject: Re: bug#24784: 26.0.50; JSON strings with utf-16 escape codes
Date: Sun, 1 Jan 2017 03:45:31 +0300
On 31.12.2016 19:53, Philipp Stephani wrote:

> Agreed; converted to defun. I've only used defsubst because some other
> helper functions also used defsubst.

Thanks. Those others can probably be changed as well.

> No, the below case is more general and therefore has to come last.

Makes sense.

> It's not 100% related to the patch, but I think it can be included for
> symmetry reasons (testing encoding as well as decoding).

Of course. These are testing utf-8 encoding, though, right? It would be 
better if you split them to a separate commit, I think.




Reply sent to Philipp Stephani <p.stephani2 <at> gmail.com>:
You have taken responsibility. (Sun, 01 Jan 2017 12:27:02 GMT) Full text and rfc822 format available.

Notification sent to Helmut Eller <eller.helmut <at> gmail.com>:
bug acknowledged by developer. (Sun, 01 Jan 2017 12:27:02 GMT) Full text and rfc822 format available.

Message #28 received at 24784-done <at> debbugs.gnu.org (full text, mbox):

From: Philipp Stephani <p.stephani2 <at> gmail.com>
To: Dmitry Gutov <dgutov <at> yandex.ru>, Helmut Eller <eller.helmut <at> gmail.com>,
 24784-done <at> debbugs.gnu.org
Subject: Re: bug#24784: 26.0.50; JSON strings with utf-16 escape codes
Date: Sun, 01 Jan 2017 12:26:07 +0000
[Message part 1 (text/plain, inline)]
Dmitry Gutov <dgutov <at> yandex.ru> schrieb am So., 1. Jan. 2017 um 01:45 Uhr:

> On 31.12.2016 19:53, Philipp Stephani wrote:
>
> > Agreed; converted to defun. I've only used defsubst because some other
> > helper functions also used defsubst.
>
> Thanks. Those others can probably be changed as well.
>

Yes, but rather not in this commit.


>
> > No, the below case is more general and therefore has to come last.
>
> Makes sense.
>

> > It's not 100% related to the patch, but I think it can be included for
> > symmetry reasons (testing encoding as well as decoding).
>
> Of course. These are testing utf-8 encoding, though, right? It would be
> better if you split them to a separate commit, I think.
>

OK, I've removed it from this patch and pushed it as 93be35e038.
[Message part 2 (text/html, inline)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 30 Jan 2017 12:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 8 years and 145 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.