GNU bug report logs - #31679
26.1; detect-coding-string does not detect UTF-16


Package: emacs

Reported by: Benjamin Riefenstahl <b.riefenstahl <at> turtle-trading.net>

Date: Fri, 1 Jun 2018 20:29:01 UTC

Severity: minor

Tags: moreinfo

Found in version 26.1

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.



Message #17 received at 31679 <at> debbugs.gnu.org:

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 31679 <at> debbugs.gnu.org,
 Benjamin Riefenstahl <b.riefenstahl <at> turtle-trading.net>
Subject: Re: bug#31679: 26.1; detect-coding-string does not detect UTF-16
Date: Thu, 12 Aug 2021 15:51:28 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

>> My use case is that I am trying to paste types other than UTF8_STRING
>> from the X11 clipboard, and have them handled as automatically as
>> possible.  While official clipboard types probably have a documented
>> encoding (and I have code for those), applications like Firefox also put
>> private formats there.  And Firefox seems to like UTF-16: even the
>> text/html format it puts there is UTF-16.
>
> If you have a special application in mind, you could always write some
> simple enough code in Lisp to see if UTF-16 should be tried, then tell
> Emacs to try that explicitly.

I ran into the same issue when dealing with X selections -- but there
are even more peculiarities in that area (some selections add a
spurious nul to the end, and some don't), so you have to write a bit of
code around this: `decode-coding-string' by itself can't be expected to
handle or guess all these oddities (as you say).

>> I have tried to debug the C routines that implement this (see above),
>> but the code is somewhat hairy.  I guess I'll have another look to see
>> if I can understand it better.
>
> We could add code to detect_coding_system that looks at some short
> enough prefix of the text, sees whether there's a null byte there for
> each non-null byte, and tries UTF-16 if so.  Assuming that we want to
> improve the chances of having UTF-16 detected for a small penalty,
> that is.

I do think that, in general, it would be nice if detect_coding_system
did try a bit harder to guess at utf-16.  For instance, if (in the first
X bytes of the string) more than 90% of the byte pairs look like
non-nul/nul pairs, then it's pretty likely to be utf-16.  (And I think
that would be easy enough to implement?)
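
To make that concrete, here's roughly what I have in mind, as a Lisp
sketch (the name and the 1024-byte cutoff are made up; the real check
would live in the C detection code):

    (defun my-probably-utf-16 (data &optional limit)
      "Return `utf-16le' or `utf-16be' if unibyte DATA looks like UTF-16."
      ;; Look at the first LIMIT bytes (default 1024); if more than 90%
      ;; of the byte pairs are non-nul/nul (or nul/non-nul), guess
      ;; little-endian (or big-endian) UTF-16, else return nil.
      (let* ((len (min (or limit 1024) (length data)))
             (pairs (/ len 2))
             (le 0) (be 0))
        (dotimes (i pairs)
          (let ((b0 (aref data (* 2 i)))
                (b1 (aref data (1+ (* 2 i)))))
            (cond ((and (/= b0 0) (= b1 0)) (setq le (1+ le)))
                  ((and (= b0 0) (/= b1 0)) (setq be (1+ be))))))
        (when (> pairs 0)
          (cond ((> le (* 0.9 pairs)) 'utf-16le)
                ((> be (* 0.9 pairs)) 'utf-16be)))))

Of course this only catches mostly-Latin text -- CJK text in UTF-16 has
hardly any nul bytes -- so it's a guess, not a proof.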

On the other hand, as you point out, there's a performance penalty that
may not be worth it.

So...  uhm...  does anybody have an opinion here?  Try harder for utf-16
or just leave it as it is?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






