GNU bug report logs -
#70007
[PATCH] native JSON encoder
On 27 March 2024 at 20:05, Eli Zaretskii <eliz <at> gnu.org> wrote:
>>> This rejects unibyte non-ASCII strings, AFAIU, in which case I suggest
>>> to think whether we really want that. E.g., why is it wrong to encode
>>> a string to UTF-8, and then send it to JSON?
>>
>> The way I see it, that would break the JSON abstraction: it transports strings of Unicode characters, not strings of bytes.
>
> What's the difference? AFAIU, JSON expects UTF-8 encoded strings, and
> whether that is used as a sequence of bytes or a sequence of
> characters is in the eyes of the beholder: the bytestream is the same,
> only the interpretation changes.
Well no -- JSON transports Unicode strings: the JSON serialiser takes a Unicode string as input and outputs a byte sequence; the JSON parser takes a byte sequence and returns a Unicode string (assuming we are just interested in strings).
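To make that contract concrete, here is a minimal sketch using Python's standard json module as a stand-in. It is not the Emacs encoder, and the helper names serialise and parse are hypothetical, chosen only for illustration.

```python
import json

def serialise(text: str) -> bytes:
    """Unicode string in, UTF-8 encoded JSON bytes out."""
    return json.dumps(text, ensure_ascii=False).encode("utf-8")

def parse(data: bytes) -> str:
    """UTF-8 encoded JSON bytes in, Unicode string out."""
    return json.loads(data)

wire = serialise("smörgåsbord")   # bytes on the wire
round_trip = parse(wire)          # back to a Unicode string

# Handing an already-encoded byte sequence to the serialiser is
# rejected here rather than silently passed through.
try:
    json.dumps("smörgåsbord".encode("utf-8"))
    rejected = False
except TypeError:
    rejected = True
```

In this stand-in the byte sequence exists only between the two functions; both ends of the API deal in Unicode strings.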
That the transport format uses UTF-8 is unrelated; if the user hands an encoded byte sequence to us then it seems more likely that it's a mistake. After all, it cannot have come from a received JSON message.
I think it was just another artefact of the old implementation. That code incorrectly used encode_string_utf_8 even on non-ASCII unibyte strings and trusted Jansson to validate the result. That resulted in a lot of wasted work and some strange strings getting accepted.
While it's theoretically possible that there are users with code relying on this behaviour, I can't find any evidence for it in the packages that I've looked at.
> I didn't suggest to decode the input string, not at all. I suggested
> to allow unibyte strings, and process them just like you process
> pure-ASCII strings, leaving it to the caller to make sure the string
> has only valid UTF-8 sequences.
Users of this raw-bytes-input feature (if they exist at all) previously had their input validated by Jansson. While mistakes would probably be detected at the other end, I'm not sure that passing the bytes through unvalidated is a good idea.
> Forcing callers to decode such
> strings is IMO too harsh and largely unjustified.
We usually force them to do so in most other contexts. To take a random example, `princ` doesn't work with encoded strings. But it's rarely a problem.
Let's see how testing goes. We'll find a solution no matter what, pass-through or separate slow-path validation, if it turns out that we really need to after all.
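For illustration only, the "separate slow-path validation" option could look roughly like the sketch below. It is written in Python rather than in the encoder's own code, and the function encode_string and its allow_bytes flag are hypothetical names, not anything from the patch.

```python
import json

def encode_string(value, allow_bytes=False):
    """Serialise a string to UTF-8 encoded JSON bytes.

    Unicode strings take the normal path. Byte strings are accepted
    only when allow_bytes is true, and only after a strict UTF-8
    check (the slow path).
    """
    if isinstance(value, str):
        return json.dumps(value, ensure_ascii=False).encode("utf-8")
    if isinstance(value, bytes) and allow_bytes:
        try:
            # Strict decoding doubles as validation of the input bytes.
            text = value.decode("utf-8")
        except UnicodeDecodeError as err:
            raise ValueError("byte string is not valid UTF-8") from err
        return json.dumps(text, ensure_ascii=False).encode("utf-8")
    raise TypeError("expected a Unicode string")
```

A pure pass-through variant would skip the decode step and leave validation to the caller, which is the trade-off discussed above.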
This bug report was last modified 249 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.