GNU bug report logs - #31679
26.1; detect-coding-string does not detect UTF-16

Package: emacs;

Reported by: Benjamin Riefenstahl <b.riefenstahl <at> turtle-trading.net>

Date: Fri, 1 Jun 2018 20:29:01 UTC

Severity: minor

Tags: moreinfo

Found in version 26.1

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#31679; Package emacs. (Fri, 01 Jun 2018 20:29:01 GMT)

Acknowledgement sent to Benjamin Riefenstahl <b.riefenstahl <at> turtle-trading.net>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Fri, 01 Jun 2018 20:29:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Benjamin Riefenstahl <b.riefenstahl <at> turtle-trading.net>
To: bug-gnu-emacs <at> gnu.org
Subject: 26.1; detect-coding-string does not detect UTF-16
Date: Fri, 01 Jun 2018 21:40:32 +0200
I have been trying this (in real life the strings are often longer, of
course):

  (detect-coding-string "h\0t\0m\0l\0")

And I was surprised that this does not detect UTF-16 but instead gives
(no-conversion).

The result of (coding-system-priority-list) is

   (utf-8 iso-2022-7bit iso-latin-1 iso-2022-7bit-lock iso-2022-8bit-ss2
    emacs-mule raw-text iso-2022-jp in-is13194-devanagari
    chinese-iso-8bit utf-8-auto utf-8-with-signature utf-16
    utf-16be-with-signature utf-16le-with-signature utf-16be utf-16le
    japanese-shift-jis chinese-big5 undecided)

Does this just not work, or am I doing something wrong?

Thanks,
benny


Recent messages:
For information about GNU Emacs and the GNU system, type C-h C-a.
next-line: End of buffer
(no-conversion)
Quit [2 times]
Type C-x 1 to delete the help window.
Mark set
delete-backward-char: Text is read-only

Configured features:
XPM JPEG TIFF GIF PNG RSVG IMAGEMAGICK SOUND GPM GSETTINGS NOTIFY
LIBSELINUX GNUTLS LIBXML2 FREETYPE M17N_FLT LIBOTF XFT ZLIB
TOOLKIT_SCROLL_BARS GTK2 X11 THREADS LCMS2

Important settings:
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  global-eldoc-mode: t
  eldoc-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Load-path shadows:
None found.

Features:
(shadow sort mail-extr emacsbug message rmc puny seq byte-opt gv
bytecomp byte-compile cconv dired dired-loaddefs format-spec rfc822 mml
mml-sec password-cache epa derived epg epg-config gnus-util rmail
rmail-loaddefs mm-decode mm-bodies mm-encode mail-parse rfc2231
mailabbrev gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums
mm-util mail-prsvr mail-utils cl-extra help-fns radix-tree help-mode
easymenu cl-loaddefs cl-lib term/xterm xterm time-date elec-pair
mule-util tooltip eldoc electric uniquify ediff-hook vc-hooks
lisp-float-type mwheel term/x-win x-win term/common-win x-dnd tool-bar
dnd fontset image regexp-opt fringe tabulated-list replace newcomment
text-mode elisp-mode lisp-mode prog-mode register page menu-bar
rfn-eshadow isearch timer select scroll-bar mouse jit-lock font-lock
syntax facemenu font-core term/tty-colors frame cl-generic cham georgian
utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean
japanese eucjp-ms cp51932 hebrew greek romanian slovak czech european
ethiopic indian cyrillic chinese composite charscript charprop
case-table epa-hook jka-cmpr-hook help simple abbrev obarray minibuffer
cl-preloaded nadvice loaddefs button faces cus-face macroexp files
text-properties overlay sha1 md5 base64 format env code-pages mule
custom widget hashtable-print-readable backquote inotify lcms2
dynamic-setting system-font-setting font-render-setting move-toolbar gtk
x-toolkit x multi-tty make-network-process emacs)

Memory information:
((conses 8 102532 5281)
 (symbols 24 20919 1)
 (miscs 20 38 212)
 (strings 16 29808 1314)
 (string-bytes 1 767826)
 (vectors 12 12354)
 (vector-slots 4 470678 7618)
 (floats 8 56 559)
 (intervals 28 260 1)
 (buffers 536 12)
 (heap 1024 30861 580))




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#31679; Package emacs. (Sat, 02 Jun 2018 07:43:01 GMT)

Message #8 received at 31679 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Benjamin Riefenstahl <b.riefenstahl <at> turtle-trading.net>
Cc: 31679 <at> debbugs.gnu.org
Subject: Re: bug#31679: 26.1; detect-coding-string does not detect UTF-16
Date: Sat, 02 Jun 2018 10:42:22 +0300
> From: Benjamin Riefenstahl <b.riefenstahl <at> turtle-trading.net>
> Date: Fri, 01 Jun 2018 21:40:32 +0200
> 
> I have been trying this (in real life the strings are often longer, of
> course):
> 
>   (detect-coding-string "h\0t\0m\0l\0")
> 
> And I was surprised that this does not detect UTF-16 but instead gives
> (no-conversion).

First, you should lose the trailing null (or add one more), since
UTF-16 strings must, by definition, have an even number of bytes.

Next, you should disable null byte detection by binding
inhibit-null-byte-detection to a non-nil value, because otherwise
Emacs's guesswork will prefer no-conversion, assuming this is binary
data.

If you do that, you get

  (let ((inhibit-null-byte-detection t))
    (detect-coding-string "h\0t\0m\0l"))
  => (undecided)

Why?  Because it is perfectly valid for a plain-ASCII string to include
null bytes, so Emacs prefers to guess ASCII.

As another example, try this:

  (prefer-coding-system 'utf-16)
  (let ((inhibit-null-byte-detection t))
    (detect-coding-string (encode-coding-string "áçðë" 'utf-16-be) t))
  => utf-16

but

  (let ((inhibit-null-byte-detection t))
    (detect-coding-string
      (substring (encode-coding-string "áçðë" 'utf-16-be) 2) t))
  => iso-latin-1

So even when UTF-16 is the most preferred encoding, just removing the
BOM is enough to let Emacs prefer something other than UTF-16.

Moral: detecting an encoding in Emacs is based on heuristic
_guesswork_, which is heavily biased to what is deemed to be the most
frequent use cases.  And UTF-16 is quite infrequent, at least on Posix
hosts.

IOW, detecting encoding in Emacs is not as reliable as you seem to
expect.  If you _know_ the text is in UTF-16, just tell Emacs to use
that, don't let it guess.
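
A minimal sketch of the "just tell Emacs" advice, assuming the byte order is already known and the data carries no BOM.  `utf-16le' and `utf-16be' are the signature-less coding systems from the priority list quoted in the report; the sample strings are only illustrations.

```elisp
;; Little-endian UTF-16 without a BOM, as in the original report:
;; each ASCII character is followed by a NUL byte.
(decode-coding-string "h\0t\0m\0l\0" 'utf-16le)
;; => "html"

;; The same text in big-endian order: the NUL byte comes first.
(decode-coding-string "\0h\0t\0m\0l" 'utf-16be)
;; => "html"
```

Naming the coding system explicitly sidesteps both the null-byte check and the priority list, so no guesswork is involved.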




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#31679; Package emacs. (Sat, 02 Jun 2018 13:56:01 GMT)

Message #11 received at 31679 <at> debbugs.gnu.org:

From: Benjamin Riefenstahl <b.riefenstahl <at> turtle-trading.net>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 31679 <at> debbugs.gnu.org
Subject: Re: bug#31679: 26.1; detect-coding-string does not detect UTF-16
Date: Sat, 02 Jun 2018 15:55:49 +0200
Hi Eli,


>> From: Benjamin Riefenstahl <b.riefenstahl <at> turtle-trading.net>
>>   (detect-coding-string "h\0t\0m\0l\0")
>> 
>> And I was surprised that this does not detect UTF-16 but instead gives
>> (no-conversion).

Eli Zaretskii writes:
> First, you should lose the trailing null (or add one more), since
> UTF-16 strings must, by definition, have an even number of bytes.

Actually this string *has* 8 bytes, the last '\0' completes the 'l' to
form the two-byte character.

> Next, you should disable null byte detection by binding
> inhibit-null-byte-detection to a non-nil value, because otherwise
> Emacs's guesswork will prefer no-conversion, assuming this is binary
> data.

O.k. that is a good tip. 

> Why?  Because it is perfectly valid for a plain-ASCII string to include
> null bytes, so Emacs prefers to guess ASCII.

While NUL is a valid ASCII character according to the standard,
practically nobody uses it as a character.  So for a heuristic in this
context, it would be a bad decision to treat it just as another
character.

And indeed NUL bytes are treated as a strong indication of binary data,
it seems.  I tried to debug this.  The C routine detect_coding_utf_16
tries to distinguish between binary and UTF-16, but it is not called for
the string above.  That routine is called OTOH, when I add a non-ASCII
character as in "h\0t\0m\0l\0ü\0", but even then it decides that the
string is not UTF-16 (?).

> Moral: detecting an encoding in Emacs is based on heuristic
> _guesswork_, which is heavily biased to what is deemed to be the most
> frequent use cases.  And UTF-16 is quite infrequent, at least on Posix
> hosts.
>
> IOW, detecting encoding in Emacs is not as reliable as you seem to
> expect.  If you _know_ the text is in UTF-16, just tell Emacs to use
> that, don't let it guess.

My use-case is that I am trying to paste types other than UTF8_STRING
from the X11 clipboard, and have them handled as automatically as
possible.  While official clipboard types probably have a documented
encoding (and I have code for those), applications like Firefox also put
private formats there.  And Firefox seems to like UTF-16, even the
text/html format it puts there is UTF-16.

I have tried to debug the C routines that implement this (see above), but the
code is somewhat hairy.  I guess I'll have another look to see if I can
understand it better.


Thanks so far,
benny




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#31679; Package emacs. (Sat, 02 Jun 2018 14:25:01 GMT)

Message #14 received at 31679 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Benjamin Riefenstahl <b.riefenstahl <at> turtle-trading.net>
Cc: 31679 <at> debbugs.gnu.org
Subject: Re: bug#31679: 26.1; detect-coding-string does not detect UTF-16
Date: Sat, 02 Jun 2018 17:24:19 +0300
> From: Benjamin Riefenstahl <b.riefenstahl <at> turtle-trading.net>
> Cc: 31679 <at> debbugs.gnu.org
> Date: Sat, 02 Jun 2018 15:55:49 +0200
> 
> > First, you should lose the trailing null (or add one more), since
> > UTF-16 strings must, by definition, have an even number of bytes.
> 
> Actually this string *has* 8 bytes, the last '\0' completes the 'l' to
> form the two-byte character.

Oops.  I guess I modified the string while playing with the example
and ended up with one more null.

> > Why?  Because it is perfectly valid for a plain-ASCII string to include
> > null bytes, so Emacs prefers to guess ASCII.
> 
> While NUL is a valid ASCII character according to the standard,
> practically nobody uses it as a character.  So for a heuristic in this
> context, it would be a bad decision to treat it just as another
> character.

That's because you _know_ this is supposed to be human-readable text,
made of non-null characters.  But Emacs doesn't.

> And indeed NUL bytes are treated as a strong indication of binary data,
> it seems.  I tried to debug this.  The C routine detect_coding_utf_16
> tries to distinguish between binary and UTF-16, but it is not called for
> the string above.  That routine is called OTOH, when I add a non-ASCII
> character as in "h\0t\0m\0l\0ü\0", but even then it decides that the
> string is not UTF-16 (?).

Don't forget that decoding is supposed to be fast, because it's
something Emacs does each time it visits a file or accepts input from
a subprocess.  So it tries not to go through all the possible
encodings, but instead bails out as soon as it thinks it has found a
good guess.

> > Moral: detecting an encoding in Emacs is based on heuristic
> > _guesswork_, which is heavily biased to what is deemed to be the most
> > frequent use cases.  And UTF-16 is quite infrequent, at least on Posix
> > hosts.
> >
> > IOW, detecting encoding in Emacs is not as reliable as you seem to
> > expect.  If you _know_ the text is in UTF-16, just tell Emacs to use
> > that, don't let it guess.
> 
> My use-case is that I am trying to paste types other than UTF8_STRING
> from the X11 clipboard, and have them handled as automatically as
> possible.  While official clipboard types probably have a documented
> encoding (and I have code for those), applications like Firefox also put
> private formats there.  And Firefox seems to like UTF-16, even the
> text/html format it puts there is UTF-16.

If you have a special application in mind, you could always write some
simple enough code in Lisp to see if UTF-16 should be tried, then tell
Emacs to try that explicitly.

> I have tried to debug the C routines that implement this (s.a.), but the
> code is somewhat hairy.  I guess I'll have another look to see if I can
> understand it better.

We could add code to detect_coding_system that looks at some short
enough prefix of the text and sees whether there's a null byte there
for each non-null byte, and try UTF-16 if so.  Assuming that we want
to improve the chances of having UTF-16 detected for a small penalty,
that is.

Thanks.
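
The "simple enough code in Lisp" idea could be sketched roughly as below.  This is only an illustration, not anything that exists in Emacs: the `my-' function names and the 64-byte default prefix are assumptions.  It applies the strict form of the check described above (every non-null byte in the prefix is followed by a null byte), so it only recognizes little-endian text whose characters are all below U+0100.

```elisp
(defun my-utf-16le-ascii-prefix-p (bytes &optional limit)
  "Return non-nil if the prefix of BYTES alternates non-NUL and NUL bytes.
BYTES is a unibyte string.  At most LIMIT bytes (default 64) are examined."
  (let* ((n (min (length bytes) (or limit 64)))
         (n (- n (% n 2)))              ; only look at complete byte pairs
         (i 0)
         (ok (> n 0)))                  ; an empty prefix proves nothing
    (while (and ok (< i n))
      (setq ok (and (/= (aref bytes i) 0)
                    (= (aref bytes (1+ i)) 0))
            i (+ i 2)))
    ok))

(defun my-decode-maybe-utf-16le (bytes)
  "Decode BYTES as UTF-16LE if the prefix check passes, else let Emacs guess."
  (if (my-utf-16le-ascii-prefix-p bytes)
      (decode-coding-string bytes 'utf-16le)
    (decode-coding-string bytes (detect-coding-string bytes t))))

(my-decode-maybe-utf-16le "h\0t\0m\0l\0")
;; => "html"
```

Limiting the scan to a short prefix keeps the extra cost small, which is the trade-off mentioned above.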




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#31679; Package emacs. (Thu, 12 Aug 2021 13:52:01 GMT)

Message #17 received at 31679 <at> debbugs.gnu.org:

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 31679 <at> debbugs.gnu.org,
 Benjamin Riefenstahl <b.riefenstahl <at> turtle-trading.net>
Subject: Re: bug#31679: 26.1; detect-coding-string does not detect UTF-16
Date: Thu, 12 Aug 2021 15:51:28 +0200
Eli Zaretskii <eliz <at> gnu.org> writes:

>> My use-case is that I am trying to paste types other than UTF8_STRING
>> from the X11 clipboard, and have them handled as automatically as
>> possible.  While official clipboard types probably have a documented
>> encoding (and I have code for those), applications like Firefox also put
>> private formats there.  And Firefox seems to like UTF-16, even the
>> text/html format it puts there is UTF-16.
>
> If you have a special application in mind, you could always write some
> simple enough code in Lisp to see if UTF-16 should be tried, then tell
> Emacs to try that explicitly.

I ran into the same issue when dealing with X selections -- but there
are even more peculiarities in that area (some selections add a spurious
nul to the end, and some don't), so you have to write a bit of code
around this: `decode-coding-string' in itself can't be expected to deal
with or guess all these oddities (as you say).

>> I have tried to debug the C routines that implement this (s.a.), but the
>> code is somewhat hairy.  I guess I'll have another look to see if I can
>> understand it better.
>
> We could add code to detect_coding_system that looks at some short
> enough prefix of the text and sees whether there's a null byte there
> for each non-null byte, and try UTF-16 if so.  Assuming that we want
> to improve the chances of having UTF-16 detected for a small penalty,
> that is.

I do think that, in general, it would be nice if detect_coding_system
did try a bit harder to guess at utf-16.  For instance, if (in the first
X bytes of the string) more than 90% of the byte pairs look like
non-nul/nul pairs, then it's pretty likely to be utf-16.  (And I think
that would be easy enough to implement?)

On the other hand, as you point out, there's a performance penalty that
may not be worth it.

So...  uhm...  does anybody have an opinion here?  Try harder for utf-16
or just leave it as it is?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no
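
For illustration, the 90% idea from this message can be prototyped in Lisp rather than in detect_coding_system itself.  The function name, the 256-byte default window and the extra big-endian branch are assumptions made for this sketch; nothing like it was added to Emacs as part of this bug.

```elisp
(defun my-guess-utf-16 (bytes &optional limit threshold)
  "Guess whether unibyte string BYTES looks like BOM-less UTF-16.
Examine at most LIMIT bytes (default 256).  Return `utf-16le' if more
than THRESHOLD (default 0.9) of the byte pairs are non-NUL followed by
NUL, `utf-16be' if that share are NUL followed by non-NUL, else nil."
  (let* ((n (min (length bytes) (or limit 256)))
         (pairs (/ n 2))                ; an odd trailing byte is ignored
         (threshold (or threshold 0.9))
         (le 0)
         (be 0)
         (i 0))
    (while (< i (* 2 pairs))
      (let ((a (aref bytes i))
            (b (aref bytes (1+ i))))
        (cond ((and (/= a 0) (= b 0)) (setq le (1+ le)))
              ((and (= a 0) (/= b 0)) (setq be (1+ be)))))
      (setq i (+ i 2)))
    (cond ((zerop pairs) nil)
          ((> le (* threshold pairs)) 'utf-16le)
          ((> be (* threshold pairs)) 'utf-16be))))

(my-guess-utf-16 "h\0t\0m\0l\0")
;; => utf-16le
```

Because only complete byte pairs are counted, a single spurious nul at the end of a selection does not change the guess, though the caller would still need to trim it before decoding.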




Added tag(s) moreinfo. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Thu, 12 Aug 2021 13:53:01 GMT)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#31679; Package emacs. (Thu, 09 Sep 2021 15:24:02 GMT)

Message #22 received at 31679 <at> debbugs.gnu.org:

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 31679 <at> debbugs.gnu.org,
 Benjamin Riefenstahl <b.riefenstahl <at> turtle-trading.net>
Subject: Re: bug#31679: 26.1; detect-coding-string does not detect UTF-16
Date: Thu, 09 Sep 2021 17:23:04 +0200
Lars Ingebrigtsen <larsi <at> gnus.org> writes:

> On the other hand, as you point out, there's a performance penalty that
> may not be worth it.
>
> So...  uhm...  does anybody have an opinion here?  Try harder for utf-16
> or just leave it as it is?

Nobody had an opinion in a month, so I'm closing this bug report.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




bug closed, send any further explanations to 31679 <at> debbugs.gnu.org and Benjamin Riefenstahl <b.riefenstahl <at> turtle-trading.net> Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Thu, 09 Sep 2021 15:24:02 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 08 Oct 2021 11:24:04 GMT)

This bug report was last modified 3 years and 248 days ago.