From unknown Fri Jun 13 11:32:54 2025
Subject: bug#31679: 26.1; detect-coding-string does not detect UTF-16
To: 31679@debbugs.gnu.org
From: Benjamin Riefenstahl
Date: Fri, 01 Jun 2018 21:40:32 +0200
Message-ID: <87efhq47nz.fsf@justinian.i-did-not-set--mail-host-address--so-tickle-me>

I have been trying this (in real life the strings are often longer, of
course):

  (detect-coding-string "h\0t\0m\0l\0")

And I was surprised that this does not detect UTF-16 but instead gives
(no-conversion).

The result of (coding-system-priority-list) is

  (utf-8 iso-2022-7bit iso-latin-1 iso-2022-7bit-lock iso-2022-8bit-ss2
   emacs-mule raw-text iso-2022-jp in-is13194-devanagari
   chinese-iso-8bit utf-8-auto utf-8-with-signature utf-16
   utf-16be-with-signature utf-16le-with-signature utf-16be utf-16le
   japanese-shift-jis chinese-big5 undecided)

Does this just not work, or am I doing something wrong?

Thanks,
benny


Recent messages:
For information about GNU Emacs and the GNU system, type C-h C-a.
next-line: End of buffer
(no-conversion)
Quit [2 times]
Type C-x 1 to delete the help window.
Mark set
delete-backward-char: Text is read-only

Configured features:
XPM JPEG TIFF GIF PNG RSVG IMAGEMAGICK SOUND GPM GSETTINGS NOTIFY
LIBSELINUX GNUTLS LIBXML2 FREETYPE M17N_FLT LIBOTF XFT ZLIB
TOOLKIT_SCROLL_BARS GTK2 X11 THREADS LCMS2

Important settings:
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  global-eldoc-mode: t
  eldoc-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Load-path shadows:
None found.

Features:
(shadow sort mail-extr emacsbug message rmc puny seq byte-opt gv
bytecomp byte-compile cconv dired dired-loaddefs format-spec rfc822 mml
mml-sec password-cache epa derived epg epg-config gnus-util rmail
rmail-loaddefs mm-decode mm-bodies mm-encode mail-parse rfc2231
mailabbrev gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums
mm-util mail-prsvr mail-utils cl-extra help-fns radix-tree help-mode
easymenu cl-loaddefs cl-lib term/xterm xterm time-date elec-pair
mule-util tooltip eldoc electric uniquify ediff-hook vc-hooks
lisp-float-type mwheel term/x-win x-win term/common-win x-dnd tool-bar
dnd fontset image regexp-opt fringe tabulated-list replace newcomment
text-mode elisp-mode lisp-mode prog-mode register page menu-bar
rfn-eshadow isearch timer select scroll-bar mouse jit-lock font-lock
syntax facemenu font-core term/tty-colors frame cl-generic cham georgian
utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean
japanese eucjp-ms cp51932 hebrew greek romanian slovak czech european
ethiopic indian cyrillic chinese composite charscript charprop
case-table epa-hook jka-cmpr-hook help simple abbrev obarray minibuffer
cl-preloaded nadvice loaddefs button faces cus-face macroexp files
text-properties overlay sha1 md5 base64 format env code-pages mule
custom widget hashtable-print-readable backquote inotify lcms2
dynamic-setting system-font-setting font-render-setting move-toolbar gtk
x-toolkit x multi-tty make-network-process emacs)

Memory information:
((conses 8 102532 5281)
 (symbols 24 20919 1)
 (miscs 20 38 212)
 (strings 16 29808 1314)
 (string-bytes 1 767826)
 (vectors 12 12354)
 (vector-slots 4 470678 7618)
 (floats 8 56 559)
 (intervals 28 260 1)
 (buffers 536 12)
 (heap 1024 30861 580))

From unknown Fri Jun 13 11:32:54 2025
Subject: bug#31679: 26.1; detect-coding-string does not detect UTF-16
To: Benjamin Riefenstahl
Cc: 31679@debbugs.gnu.org
Date: Sat, 02 Jun 2018 10:42:22 +0300
Message-Id: <83zi0deish.fsf@gnu.org>
From: Eli Zaretskii
In-reply-to: <87efhq47nz.fsf@justinian.i-did-not-set--mail-host-address--so-tickle-me>

> From: Benjamin Riefenstahl
> Date: Fri, 01 Jun 2018 21:40:32 +0200
>
> I have been trying this (in real life the strings are often longer, of
> course):
>
>   (detect-coding-string "h\0t\0m\0l\0")
>
> And I was surprised that this does not detect UTF-16 but instead gives
> (no-conversion).

First, you should lose the trailing null (or add one more), since
UTF-16 strings must, by definition, have an even number of bytes.

Next, you should disable null byte detection by binding
inhibit-null-byte-detection to a non-nil value, because otherwise
Emacs's guesswork will prefer no-conversion, assuming this is binary
data.

If you do that, you get

  (let ((inhibit-null-byte-detection t))
    (detect-coding-string "h\0t\0m\0l"))

    => (undecided)

Why?  Because it is perfectly valid for a plain-ASCII string to include
null bytes, so Emacs prefers to guess ASCII.

As another example, try this:

  (prefer-coding-system 'utf-16)
  (let ((inhibit-null-byte-detection t))
    (detect-coding-string (encode-coding-string "áçðë" 'utf-16-be) t))

    => utf-16

but

  (let ((inhibit-null-byte-detection t))
    (detect-coding-string
      (substring (encode-coding-string "áçðë" 'utf-16-be) 2) t))

    => iso-latin-1

So even when UTF-16 is the most preferred encoding, just removing the
BOM is enough to let Emacs prefer something other than UTF-16.

Moral: detecting an encoding in Emacs is based on heuristic
_guesswork_, which is heavily biased to what is deemed to be the most
frequent use cases.  And UTF-16 is quite infrequent, at least on Posix
hosts.

IOW, detecting encoding in Emacs is not as reliable as you seem to
expect.  If you _know_ the text is in UTF-16, just tell Emacs to use
that, don't let it guess.

From unknown Fri Jun 13 11:32:54 2025
Subject: bug#31679: 26.1; detect-coding-string does not detect UTF-16
To: Eli Zaretskii
Cc: 31679@debbugs.gnu.org
From: Benjamin Riefenstahl
Date: Sat, 02 Jun 2018 15:55:49 +0200
In-Reply-To: <83zi0deish.fsf@gnu.org>
Message-ID: <874lilgumy.fsf@blei.turtle-trading.net>

Hi Eli,

>> From: Benjamin Riefenstahl
>>   (detect-coding-string "h\0t\0m\0l\0")
>>
>> And I was surprised that this does not detect UTF-16 but instead gives
>> (no-conversion).

Eli Zaretskii writes:
> First, you should lose the trailing null (or add one more), since
> UTF-16 strings must, by definition, have an even number of bytes.

Actually this string *has* 8 bytes, the last '\0' completes the 'l' to
form the two-byte character.

> Next, you should disable null byte detection by binding
> inhibit-null-byte-detection to a non-nil value, because otherwise
> Emacs's guesswork will prefer no-conversion, assuming this is binary
> data.

O.k. that is a good tip.

> Why? because it is perfectly valid for a plain-ASCII string to include
> null bytes, so Emacs prefers to guess ASCII.

While NUL is a valid ASCII character according to the standard,
practically nobody uses it as a character.  So for a heuristic in this
context, it would be a bad decision to treat it just as another
character.

And indeed NUL bytes are treated as a strong indication of binary data,
it seems.  I tried to debug this.  The C routine detect_coding_utf_16
tries to distinguish between binary and UTF-16, but it is not called for
the string above.  That routine is called OTOH, when I add a non-ASCII
character as in "h\0t\0m\0l\0ü\0", but even then it decides that the
string is not UTF-16 (?).

> Moral: detecting an encoding in Emacs is based on heuristic
> _guesswork_, which is heavily biased to what is deemed to be the most
> frequent use cases.  And UTF-16 is quite infrequent, at least on Posix
> hosts.
>
> IOW, detecting encoding in Emacs is not as reliable as you seem to
> expect.  If you _know_ the text is in UTF-16, just tell Emacs to use
> that, don't let it guess.

My use-case is that I am trying to paste types other than UTF8_STRING
from the X11 clipboard, and have them handled as automatically as
possible.  While official clipboard types probably have a documented
encoding (and I have code for those), applications like Firefox also put
private formats there.  And Firefox seems to like UTF-16, even the
text/html format it puts there is UTF-16.

I have tried to debug the C routines that implement this (s.a.), but the
code is somewhat hairy.  I guess I'll have another look to see if I can
understand it better.

Thanks so far,
benny

From unknown Fri Jun 13 11:32:54 2025
Subject: bug#31679: 26.1; detect-coding-string does not detect UTF-16
To: Benjamin Riefenstahl
Cc: 31679@debbugs.gnu.org
Date: Sat, 02 Jun 2018 17:24:19 +0300
Message-Id: <836031e06k.fsf@gnu.org>
From: Eli Zaretskii
In-reply-to: <874lilgumy.fsf@blei.turtle-trading.net>

> From: Benjamin Riefenstahl
> Cc: 31679@debbugs.gnu.org
> Date: Sat, 02 Jun 2018 15:55:49 +0200
>
> > First, you should lose the trailing null (or add one more), since
> > UTF-16 strings must, by definition, have an even number of bytes.
>
> Actually this string *has* 8 bytes, the last '\0' completes the 'l' to
> form the two-byte character.

Oops.  I guess I modified the string while playing with the example
and ended up with one more null.

> > Why? because it is perfectly valid for a plain-ASCII string to include
> > null bytes, so Emacs prefers to guess ASCII.
>
> While NUL is a valid ASCII character according to the standard,
> practically nobody uses it as a character.  So for a heuristic in this
> context, it would be a bad decision to treat it just as another
> character.

That's because you _know_ this is supposed to be human-readable text,
made of non-null characters.  But Emacs doesn't.

> And indeed NUL bytes are treated as a strong indication of binary data,
> it seems.  I tried to debug this.  The C routine detect_coding_utf_16
> tries to distinguish between binary and UTF-16, but it is not called for
> the string above.  That routine is called OTOH, when I add a non-ASCII
> character as in "h\0t\0m\0l\0ü\0", but even then it decides that the
> string is not UTF-16 (?).

Don't forget that decoding is supposed to be fast, because it's
something Emacs does each time it visits a file or accepts input from
a subprocess.  So it tries not to go through all the possible
encodings, but instead bails out as soon as it thinks it has found a
good guess.

> > Moral: detecting an encoding in Emacs is based on heuristic
> > _guesswork_, which is heavily biased to what is deemed to be the most
> > frequent use cases.  And UTF-16 is quite infrequent, at least on Posix
> > hosts.
> >
> > IOW, detecting encoding in Emacs is not as reliable as you seem to
> > expect.  If you _know_ the text is in UTF-16, just tell Emacs to use
> > that, don't let it guess.
>
> My use-case is that I am trying to paste types other than UTF8_STRING
> from the X11 clipboard, and have them handled as automatically as
> possible.  While official clipboard types probably have a documented
> encoding (and I have code for those), applications like Firefox also put
> private formats there.  And Firefox seems to like UTF-16, even the
> text/html format it puts there is UTF-16.

If you have a special application in mind, you could always write some
simple enough code in Lisp to see if UTF-16 should be tried, then tell
Emacs to try that explicitly.

> I have tried to debug the C routines that implement this (s.a.), but the
> code is somewhat hairy.  I guess I'll have another look to see if I can
> understand it better.

We could add code to detect_coding_system that looks at some short
enough prefix of the text and sees whether there's a null byte there
for each non-null byte, and try UTF-16 if so.  Assuming that we want
to improve the chances of having UTF-16 detected for a small penalty,
that is.

Thanks.
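
[Editorial sketch, not part of any message in this thread.]  The prefix
check described in the message above could look roughly like the
following.  This is standalone C written for illustration only, not code
from Emacs's coding.c.  The name guess_utf16, the enum, the prefix_len
parameter, and the 90% threshold are all assumptions introduced here.
The check only recognizes BOM-less UTF-16 text whose characters mostly
fit in a single byte (ASCII or Latin-1 range), which is the case the
"h\0t\0m\0l\0" example exercises.

```c
#include <stddef.h>

/* Possible verdicts of the hypothetical check below.  */
enum utf16_guess
{
  UTF16_UNLIKELY = 0,
  UTF16_LE_LIKELY,
  UTF16_BE_LIKELY
};

/* Hypothetical helper (an assumption of this sketch, not an Emacs
   function): look at no more than PREFIX_LEN bytes of BUF, which holds
   LEN bytes, and count byte pairs of the form non-null/null
   (little-endian) and null/non-null (big-endian).  If at least 90% of
   the pairs have one of those shapes, report the matching byte order.
   The 90% threshold is an arbitrary choice made for this example.  */
enum utf16_guess
guess_utf16 (const unsigned char *buf, size_t len, size_t prefix_len)
{
  size_t n = len < prefix_len ? len : prefix_len;
  size_t pairs = n / 2;		/* A trailing odd byte is ignored.  */
  size_t le = 0, be = 0;

  if (pairs == 0)
    return UTF16_UNLIKELY;

  for (size_t i = 0; i < pairs; i++)
    {
      unsigned char b0 = buf[2 * i];
      unsigned char b1 = buf[2 * i + 1];

      if (b0 != 0 && b1 == 0)
	le++;
      else if (b0 == 0 && b1 != 0)
	be++;
    }

  /* Integer form of "count / pairs >= 0.9".  */
  if (le * 10 >= pairs * 9)
    return UTF16_LE_LIKELY;
  if (be * 10 >= pairs * 9)
    return UTF16_BE_LIKELY;
  return UTF16_UNLIKELY;
}
```

Under these assumptions, the 8-byte string from the original report is
classified as likely little-endian UTF-16, while plain "html" is not.
Bounding the scan by prefix_len keeps the extra cost constant regardless
of input size, which is the trade-off the message above raises.  Text
dominated by characters outside the Latin-1 range would not match and
would need a different test.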
From unknown Fri Jun 13 11:32:54 2025
Subject: bug#31679: 26.1; detect-coding-string does not detect UTF-16
To: Eli Zaretskii
Cc: 31679@debbugs.gnu.org, Benjamin Riefenstahl
From: Lars Ingebrigtsen
Date: Thu, 12 Aug 2021 15:51:28 +0200
In-Reply-To: <836031e06k.fsf@gnu.org>
Message-ID: <87tujukbz3.fsf@gnus.org>

Eli Zaretskii writes:

>> My use-case is that I am trying to paste types other than UTF8_STRING
>> from the X11 clipboard, and have them handled as automatically as
>> possible.  While official clipboard types probably have a documented
>> encoding (and I have code for those), applications like Firefox also put
>> private formats there.  And Firefox seems to like UTF-16, even the
>> text/html format it puts there is UTF-16.
>
> If you have a special application in mind, you could always write some
> simple enough code in Lisp to see if UTF-16 should be tried, then tell
> Emacs to try that explicitly.

I ran into the same issue when dealing with X selections -- but there
are even more peculiarities in that area (some selections add a spurious
nul to the end, and some don't), so you have to write a bit of code
around this: `decode-coding-string' in itself can't be expected to deal
with or guess all these oddities (as you say).

>> I have tried to debug the C routines that implement this (s.a.), but the
>> code is somewhat hairy.  I guess I'll have another look to see if I can
>> understand it better.
>
> We could add code to detect_coding_system that looks at some short
> enough prefix of the text and sees whether there's a null byte there
> for each non-null byte, and try UTF-16 if so.  Assuming that we want
> to improve the chances of having UTF-16 detected for a small penalty,
> that is.

I do think that, in general, it would be nice if detect_coding_system
did try a bit harder to guess at utf-16.  For instance, if (in the first
X bytes of the string) more than 90% of the byte pairs look like
non-nul/nul pairs, then it's pretty likely to be utf-16.  (And I think
that would be easy enough to implement?)

On the other hand, as you point out, there's a performance penalty that
may not be worth it.

So... uhm... does anybody have an opinion here?  Try harder for utf-16
or just leave it as it is?

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

From debbugs-submit-bounces@debbugs.gnu.org Thu Aug 12 09:52:02 2021
Date: Thu, 12 Aug 2021 15:51:51 +0200
Message-Id: <87r1eykbyg.fsf@gnus.org>
To: control@debbugs.gnu.org
From: Lars Ingebrigtsen
Subject: control message for bug #31679

tags 31679 + moreinfo
quit

From unknown Fri Jun 13 11:32:54 2025
Subject: bug#31679: 26.1; detect-coding-string does not detect UTF-16
To: Eli Zaretskii
Cc: 31679@debbugs.gnu.org, Benjamin Riefenstahl
From: Lars Ingebrigtsen
Date: Thu, 09 Sep 2021 17:23:04 +0200
In-Reply-To: <87tujukbz3.fsf@gnus.org>
Message-ID: <87sfydsrhj.fsf@gnus.org>

Lars Ingebrigtsen writes:

> On the other hand, as you point out, there's a performance penalty that
> may not be worth it.
>
> So... uhm... does anybody have an opinion here?  Try harder for utf-16
> or just leave it as it is?

Nobody had an opinion in a month, so I'm closing this bug report.

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

From debbugs-submit-bounces@debbugs.gnu.org Thu Sep 09 11:23:25 2021
Date: Thu, 09 Sep 2021 17:23:09 +0200
Message-Id: <87r1dxsrhe.fsf@gnus.org>
To: control@debbugs.gnu.org
From: Lars Ingebrigtsen
Subject: control message for bug #31679

close 31679
quit