GNU bug report logs - #30076
[PATCH] web: Recognize JSON content type as text.


Package: guile

Reported by: Arun Isaac <arunisaac <at> systemreboot.net>

Date: Thu, 11 Jan 2018 05:33:01 UTC

Severity: normal

Tags: patch


From: Mark H Weaver <mhw <at> netris.org>
To: Arun Isaac <arunisaac <at> systemreboot.net>
Cc: 30076 <at> debbugs.gnu.org
Subject: bug#30076: [PATCH] web: Recognize JSON content type as text.
Date: Tue, 30 Jan 2018 22:31:04 -0500
Hi Arun,

Arun Isaac <arunisaac <at> systemreboot.net> writes:
> * module/web/response.scm (text-content-type?): Recognize JSON content
>   type as text.

While this would seem reasonable at first glance, it seems to me that
this will result in JSON texts with non-ASCII characters being
mishandled in many cases.

Within Guile, 'text-content-type?' is currently used in two places:

* 'decode-response-body' in (web client), and
* 'response-body-port' in (web response).

In both places, if 'text-content-type?' returns true, the encoding of
the response is assumed to be "ISO-8859-1" if not otherwise specified by
an explicit 'charset' parameter.  This is what RFC 2616 specifies for
text/plain, although RFC 6657 would change the default to US-ASCII, as
it was in RFC 2046, and maybe we should look into that.

However, things are quite different for the application/json MIME type,
as specified in RFCs 4627 and 7159.  Those RFCs specify that JSON text
"SHALL" (i.e. MUST) be encoded in Unicode (UTF-8, UTF-16 or UTF-32),
that the default encoding is UTF-8, and furthermore that no charset
parameter is defined for application/json.

So, we can expect at least some conforming implementations to omit the
'charset' parameter, and yet in that case we must assume that the
encoding is Unicode, and most definitely not ISO-8859-1.
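
To illustrate the failure mode, here is a minimal sketch in Python (not
Guile code; the sample JSON text and variable names are made up for
illustration). It decodes the same UTF-8 body once with the ISO-8859-1
default and once as UTF-8:

```python
import json

# A conforming JSON body: UTF-8 encoded, sent without a 'charset' parameter.
body = '{"name": "café"}'.encode("utf-8")

# Applying the ISO-8859-1 default used for text types garbles the "é":
# its two UTF-8 octets (C3 A9) are read as two separate characters.
wrong = json.loads(body.decode("iso-8859-1"))["name"]

# Decoding as UTF-8, the JSON default, recovers the original string.
right = json.loads(body.decode("utf-8"))["name"]

print(repr(wrong), repr(right))
```

Pure-ASCII JSON decodes identically either way, so the problem only
shows up once non-ASCII characters appear.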

RFC 4627 makes the additional interesting observation (in section 3,
"encoding") that since the first two characters of JSON text will always
be ASCII, and since UTF-8/UTF-16/UTF-32 are the only valid encodings for
JSON text, we can reliably determine the encoding by looking at the
pattern of nul bytes in the first four octets:

           00 00 00 xx  UTF-32BE
           00 xx 00 xx  UTF-16BE
           xx 00 00 00  UTF-32LE
           xx 00 xx 00  UTF-16LE
           xx xx xx xx  UTF-8

Given that any of the encodings above is possible, and that there is
no 'charset' parameter defined for "application/json", it seems to me
that we have no choice but to be prepared to auto-detect the encoding,
as described in RFC 4627 section 3, if the 'charset' parameter is
missing.
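
The detection itself is small. Below is a rough sketch in Python of the
nul-byte pattern table from RFC 4627 section 3 (again not Guile code;
the names 'detect_json_encoding' and 'decode_json_body' are
hypothetical, and the fallback for inputs shorter than four octets is
my assumption):

```python
from typing import Optional

# Nul-byte patterns of the first four octets (True means the octet is 00).
_PATTERNS = {
    (True, True, True, False): "utf-32-be",   # 00 00 00 xx
    (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
    (False, True, True, True): "utf-32-le",   # xx 00 00 00
    (False, True, False, True): "utf-16-le",  # xx 00 xx 00
}


def detect_json_encoding(data: bytes) -> str:
    """Guess the encoding of JSON text from its first four octets."""
    # Assumption: an RFC 4627 JSON text is an object or array, so it has
    # at least two characters; fewer than four octets can only be UTF-8.
    if len(data) < 4:
        return "utf-8"
    nuls = tuple(octet == 0 for octet in data[:4])
    # Any other pattern (xx xx xx xx in particular) is treated as UTF-8.
    return _PATTERNS.get(nuls, "utf-8")


def decode_json_body(data: bytes, charset: Optional[str] = None) -> str:
    """Decode a JSON body, auto-detecting only when no charset was given."""
    return data.decode(charset or detect_json_encoding(data))
```

For example, 'decode_json_body' applied to the UTF-16LE encoding of
'{"a": 1}' sees the pattern xx 00 xx 00 and decodes it as UTF-16LE.
Honoring an explicit 'charset' first is a design choice in this sketch,
not something the RFCs require.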

What do you think?

      Mark




This bug report was last modified 7 years and 134 days ago.


