GNU bug report logs - #31679
26.1; detect-coding-string does not detect UTF-16


Package: emacs

Reported by: Benjamin Riefenstahl <b.riefenstahl <at> turtle-trading.net>

Date: Fri, 1 Jun 2018 20:29:01 UTC

Severity: minor

Tags: moreinfo

Found in version 26.1

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.



Message #17 received at 31679 <at> debbugs.gnu.org:

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 31679 <at> debbugs.gnu.org,
 Benjamin Riefenstahl <b.riefenstahl <at> turtle-trading.net>
Subject: Re: bug#31679: 26.1; detect-coding-string does not detect UTF-16
Date: Thu, 12 Aug 2021 15:51:28 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

>> My use case is that I am trying to paste types other than UTF8_STRING
>> from the X11 clipboard, and have them handled as automatically as
>> possible.  While official clipboard types probably have a documented
>> encoding (and I have code for those), applications like Firefox also put
>> private formats there.  And Firefox seems to like UTF-16: even the
>> text/html format it puts there is UTF-16.
>
> If you have a special application in mind, you could always write some
> simple enough code in Lisp to see if UTF-16 should be tried, then tell
> Emacs to try that explicitly.

I ran into the same issue when dealing with X selections -- but there
are even more peculiarities in that area (some selections add a
spurious nul to the end, and some don't), so you have to write a bit of
code around this: `decode-coding-string' by itself can't be expected to
handle or guess all these oddities (as you say).

>> I have tried to debug the C routines that implement this (see above),
>> but the code is somewhat hairy.  I guess I'll have another look to see
>> if I can understand it better.
>
> We could add code to detect_coding_system that looks at some short
> enough prefix of the text, sees whether there's a null byte there for
> each non-null byte, and tries UTF-16 if so.  Assuming that we want to
> improve the chances of having UTF-16 detected for a small penalty,
> that is.

I do think that, in general, it would be nice if detect_coding_system
did try a bit harder to guess at utf-16.  For instance, if (in the first
X bytes of the string) more than 90% of the byte pairs look like
non-nul/nul pairs, then it's pretty likely to be utf-16.  (And I think
that would be easy enough to implement?)
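
To make that concrete, here's roughly what I have in mind, as a Lisp
sketch (the name and the 1024-byte cutoff are made up; the real check
would live in the C detection code):

    (defun my-probably-utf-16 (data &optional limit)
      "Return `utf-16le' or `utf-16be' if unibyte DATA looks like UTF-16."
      ;; Look at the first LIMIT bytes (default 1024); if more than 90%
      ;; of the byte pairs are non-nul/nul (or nul/non-nul), guess
      ;; little-endian (or big-endian) UTF-16, else return nil.
      (let* ((len (min (or limit 1024) (length data)))
             (pairs (/ len 2))
             (le 0) (be 0))
        (dotimes (i pairs)
          (let ((b0 (aref data (* 2 i)))
                (b1 (aref data (1+ (* 2 i)))))
            (cond ((and (/= b0 0) (= b1 0)) (setq le (1+ le)))
                  ((and (= b0 0) (/= b1 0)) (setq be (1+ be))))))
        (when (> pairs 0)
          (cond ((> le (* 0.9 pairs)) 'utf-16le)
                ((> be (* 0.9 pairs)) 'utf-16be)))))

Of course this only catches mostly-Latin text -- CJK text in UTF-16 has
hardly any nul bytes -- so it's a guess, not a proof.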

On the other hand, as you point out, there's a performance penalty that
may not be worth it.

So...  uhm...  does anybody have an opinion here?  Try harder for utf-16
or just leave it as it is?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






