GNU bug report logs -
#24784
26.0.50; JSON strings with utf-16 escape codes
Previous Next
Reported by: Helmut Eller <eller.helmut <at> gmail.com>
Date: Mon, 24 Oct 2016 18:07:01 UTC
Severity: normal
Found in version 26.0.50
Done: Philipp Stephani <p.stephani2 <at> gmail.com>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 24784 in the body.
You can then email your comments to 24784 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#24784
; Package
emacs
.
(Mon, 24 Oct 2016 18:07:01 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Helmut Eller <eller.helmut <at> gmail.com>
:
New bug report received and forwarded. Copy sent to
bug-gnu-emacs <at> gnu.org
.
(Mon, 24 Oct 2016 18:07:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
json-read-from-string doesn't parse strings correctly if the the \u
syntax is used to write UTF-16 surrogates:
(equal (json-read-from-string "\"\\uD834\\uDD1E\"") "\"\U0001D11E\"")
=> nil
The correct result t. To quote RFC 7159[*]:
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a 12-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
[*] https://tools.ietf.org/html/rfc7159#section-7
In GNU Emacs 26.0.50.2 (x86_64-unknown-linux-gnu, GTK+ Version 3.14.5)
of 2016-10-24 built on caladan
Repository revision: 26ccd19269c040ad5960a7567aa5fc88f142c709
Windowing system distributor 'The X.Org Foundation', version 11.0.11604000
System Description: Debian GNU/Linux 8.5 (jessie)
Configured using:
'configure --with-xpm=no --with-jpeg=no --with-gif=no --with-tiff=no'
Configured features:
PNG SOUND DBUS GSETTINGS NOTIFY GNUTLS LIBXML2 FREETYPE XFT ZLIB
TOOLKIT_SCROLL_BARS GTK3 X11
Important settings:
value of $LANG: C.UTF-8
locale-coding-system: utf-8-unix
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#24784
; Package
emacs
.
(Mon, 24 Oct 2016 19:58:02 GMT)
Full text and
rfc822 format available.
Message #8 received at 24784 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Helmut Eller <eller.helmut <at> gmail.com> schrieb am Mo., 24. Okt. 2016 um
20:58 Uhr:
>
> json-read-from-string doesn't parse strings correctly if the the \u
> syntax is used to write UTF-16 surrogates:
>
> (equal (json-read-from-string "\"\\uD834\\uDD1E\"") "\"\U0001D11E\"")
> => nil
>
> The correct result t. To quote RFC 7159[*]:
>
> To escape an extended character that is not in the Basic Multilingual
> Plane, the character is represented as a 12-character sequence,
> encoding the UTF-16 surrogate pair. So, for example, a string
> containing only the G clef character (U+1D11E) may be represented as
> "\uD834\uDD1E".
>
> [*] https://tools.ietf.org/html/rfc7159#section-7
>
> Thanks for reporting, I've attached a patch.
[Message part 2 (text/html, inline)]
[0001-Fix-encoding-of-JSON-surrogate-pairs.txt (text/plain, attachment)]
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#24784
; Package
emacs
.
(Mon, 24 Oct 2016 23:20:01 GMT)
Full text and
rfc822 format available.
Message #11 received at 24784 <at> debbugs.gnu.org (full text, mbox):
Philipp,
Thanks. Some comments:
On 24.10.2016 22:57, Philipp Stephani wrote:
> +(defsubst json--decode-utf-16-surrogates (high low)
IIRC, there might be no actual benefit from making it a defsubst. If
someone could benchmark it, I'd like to see the result.
> + ;; Special-case UTF-16 surrogate pairs,
> + ;; cf. https://tools.ietf.org/html/rfc7159#section-7
> + ((looking-at
> + (rx (group (any "Dd") (any "89ABab") (= 2 (any "0-9A-Fa-f")))
> + "\\u" (group (any "Dd") (any "C-Fc-f") (= 2 (any "0-9A-Fa-f")))))
> + (json-advance 10)
> + (json--decode-utf-16-surrogates
> + (string-to-number (match-string 1) 16)
> + (string-to-number (match-string 2) 16)))
Shouldn't this go below the UTF-8 case, as the less-frequent one?
> (ert-deftest test-json-encode-string ()
> (should (equal (json-encode-string "foo") "\"foo\""))
> (should (equal (json-encode-string "a\n\fb") "\"a\\n\\fb\""))
> - (should (equal (json-encode-string "\nasdфыв\u001f\u007ffgh\t")
> - "\"\\nasdфыв\\u001f\u007ffgh\\t\"")))
> + (should (equal (json-encode-string "\nasdфыв�\u001f\u007ffgh\t")
> + "\"\\nasdфыв�\\u001f\u007ffgh\\t\"")))
Why are we testing string encoding here?
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#24784
; Package
emacs
.
(Wed, 26 Oct 2016 16:41:02 GMT)
Full text and
rfc822 format available.
Message #14 received at 24784 <at> debbugs.gnu.org (full text, mbox):
On Tue, Oct 25 2016, Dmitry Gutov wrote:
> On 24.10.2016 22:57, Philipp Stephani wrote:
>
>> +(defsubst json--decode-utf-16-surrogates (high low)
>
> IIRC, there might be no actual benefit from making it a defsubst. If
> someone could benchmark it, I'd like to see the result.
I guess it doesn't hurt but I also doubt that it makes a measurable
difference as utf-16 surrogates are rarely needed.
>
>> + ;; Special-case UTF-16 surrogate pairs,
>> + ;; cf. https://tools.ietf.org/html/rfc7159#section-7
>> + ((looking-at
>> + (rx (group (any "Dd") (any "89ABab") (= 2 (any "0-9A-Fa-f")))
>> + "\\u" (group (any "Dd") (any "C-Fc-f") (= 2 (any "0-9A-Fa-f")))))
>> + (json-advance 10)
>> + (json--decode-utf-16-surrogates
>> + (string-to-number (match-string 1) 16)
>> + (string-to-number (match-string 2) 16)))
>
> Shouldn't this go below the UTF-8 case, as the less-frequent one?
There's also an opportunity to detect unpaired surrogates, e.g.:
(defun json-read-escaped-char ()
"Read the JSON string escaped character at point."
;; Skip over the '\'
(json-advance)
(let* ((char (json-pop))
(special (assq char json-special-chars)))
(cond
(special (cdr special))
((not (eq char ?u)) char)
((looking-at "[0-9A-Fa-f]\\{4\\}")
(let* ((code (string-to-number (match-string 0) 16)))
(json-advance 4)
(cond ((<= #xD800 code #xDBFF) ; UTF-16 high surrogate
(cond ((looking-at "\\\\u\\([Dd][C-Fc-f][0-9A-Fa-f]\\{2\\}\\)")
(let ((low (string-to-number (match-string 1) 16)))
(json-advance 6)
(json--decode-utf-16-surrogates code low)))
(t
;; Expected low surrogate missing
(signal 'json-string-escape (list (point))))))
((<= #xDC00 code #xDFFF)
;; Unexpected low surrogate
(signal 'json-string-escape (list (point))))
(t
code))))
(t
(signal 'json-string-escape (list (point)))))))
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#24784
; Package
emacs
.
(Wed, 26 Oct 2016 17:38:02 GMT)
Full text and
rfc822 format available.
Message #17 received at 24784 <at> debbugs.gnu.org (full text, mbox):
> >> +(defsubst json--decode-utf-16-surrogates (high low)
> >
> > IIRC, there might be no actual benefit from making it a defsubst.
> > If someone could benchmark it, I'd like to see the result.
>
> I guess it doesn't hurt but I also doubt that it makes a measurable
> difference as utf-16 surrogates are rarely needed.
IMO, we should never, ever use defsubst nowadays. Unless you can
come up with a REALLY good rationale.
Every time defsubst is used it throws a monkey wrench in the ability
of users to extend and modify Emacs behavior. And nothing is really
gained.
Just one opinion.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#24784
; Package
emacs
.
(Sat, 31 Dec 2016 16:54:01 GMT)
Full text and
rfc822 format available.
Message #20 received at 24784 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Dmitry Gutov <dgutov <at> yandex.ru> schrieb am Di., 25. Okt. 2016 um 01:19 Uhr:
> Philipp,
>
> Thanks. Some comments:
>
> On 24.10.2016 22:57, Philipp Stephani wrote:
>
> > +(defsubst json--decode-utf-16-surrogates (high low)
>
> IIRC, there might be no actual benefit from making it a defsubst. If
> someone could benchmark it, I'd like to see the result.
>
Agreed; converted to defun. I've only used defsubst because some other
helper functions also used defsubst.
>
> > + ;; Special-case UTF-16 surrogate pairs,
> > + ;; cf. https://tools.ietf.org/html/rfc7159#section-7
> > + ((looking-at
> > + (rx (group (any "Dd") (any "89ABab") (= 2 (any "0-9A-Fa-f")))
> > + "\\u" (group (any "Dd") (any "C-Fc-f") (= 2 (any
> "0-9A-Fa-f")))))
> > + (json-advance 10)
> > + (json--decode-utf-16-surrogates
> > + (string-to-number (match-string 1) 16)
> > + (string-to-number (match-string 2) 16)))
>
> Shouldn't this go below the UTF-8 case, as the less-frequent one?
>
No, the below case is more general and therefore has to come last.
>
> > (ert-deftest test-json-encode-string ()
> > (should (equal (json-encode-string "foo") "\"foo\""))
> > (should (equal (json-encode-string "a\n\fb") "\"a\\n\\fb\""))
> > - (should (equal (json-encode-string "\nasdфыв\u001f\u007ffgh\t")
> > - "\"\\nasdфыв\\u001f\u007ffgh\\t\"")))
> > + (should (equal (json-encode-string "\nasdфыв𠄞\u001f\u007ffgh\t")
> > + "\"\\nasdфыв𠄞\\u001f\u007ffgh\\t\"")))
>
> Why are we testing string encoding here?
>
It's not 100% related to the patch, but I think it can be included for
symmetry reasons (testing encoding as well as decoding).
[Message part 2 (text/html, inline)]
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#24784
; Package
emacs
.
(Sun, 01 Jan 2017 00:46:02 GMT)
Full text and
rfc822 format available.
Message #23 received at 24784 <at> debbugs.gnu.org (full text, mbox):
On 31.12.2016 19:53, Philipp Stephani wrote:
> Agreed; converted to defun. I've only used defsubst because some other
> helper functions also used defsubst.
Thanks. Those others can probably be changed as well.
> No, the below case is more general and therefore has to come last.
Makes sense.
> It's not 100% related to the patch, but I think it can be included for
> symmetry reasons (testing encoding as well as decoding).
Of course. These are testing utf-8 encoding, though, right? It would be
better if you split them to a separate commit, I think.
Reply sent
to
Philipp Stephani <p.stephani2 <at> gmail.com>
:
You have taken responsibility.
(Sun, 01 Jan 2017 12:27:02 GMT)
Full text and
rfc822 format available.
Notification sent
to
Helmut Eller <eller.helmut <at> gmail.com>
:
bug acknowledged by developer.
(Sun, 01 Jan 2017 12:27:02 GMT)
Full text and
rfc822 format available.
Message #28 received at 24784-done <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Dmitry Gutov <dgutov <at> yandex.ru> schrieb am So., 1. Jan. 2017 um 01:45 Uhr:
> On 31.12.2016 19:53, Philipp Stephani wrote:
>
> > Agreed; converted to defun. I've only used defsubst because some other
> > helper functions also used defsubst.
>
> Thanks. Those others can probably be changed as well.
>
Yes, but rather not in this commit.
>
> > No, the below case is more general and therefore has to come last.
>
> Makes sense.
>
> > It's not 100% related to the patch, but I think it can be included for
> > symmetry reasons (testing encoding as well as decoding).
>
> Of course. These are testing utf-8 encoding, though, right? It would be
> better if you split them to a separate commit, I think.
>
OK, I've removed it from this patch and pushed it as 93be35e038.
[Message part 2 (text/html, inline)]
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Mon, 30 Jan 2017 12:24:05 GMT)
Full text and
rfc822 format available.
This bug report was last modified 8 years and 145 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.