GNU bug report logs - #35766
emacs saves utf-16 le xml files as utf-16 be

Package: emacs;

Reported by: J S <jszabo_98 <at> hotmail.com>

Date: Thu, 16 May 2019 17:58:01 UTC

Severity: normal

Merged with 8282, 8283

Fixed in version 27.1

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.


Message #26 received at 35766 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Noam Postavsky <npostavs <at> gmail.com>
Cc: jszabo_98 <at> hotmail.com, 35766 <at> debbugs.gnu.org
Subject: Re: bug#35766: emacs saves utf-16 le xml files as utf-16 be
Date: Fri, 17 May 2019 18:34:48 +0300
> From: Noam Postavsky <npostavs <at> gmail.com>
> Cc: Eli Zaretskii <eliz <at> gnu.org>,  "35766\@debbugs.gnu.org" <35766 <at> debbugs.gnu.org>
> Date: Fri, 17 May 2019 07:48:30 -0400
> 
>     UTF-16LE    1014    [RFC2781]   [RFC2781]   csUTF16LE

Ouch, I was looking at the wrong column in that document.

The problem is that our detection of encoding of XML files is based on
the assumption that the header is in ASCII-compatible encoding, which
UTF-16 isn't.  So regexp search for the XML header fails, and the
detection fails with it.
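
To illustrate the failure described above, here is a minimal sketch. It is not part of the patch: `my-xml-decl-visible-p' is a hypothetical helper, and its regexp is only a simplified stand-in for the real header search. It shows that an ASCII regexp finds the XML declaration in UTF-8 bytes but not in UTF-16 bytes, where every ASCII character is separated from the next by a NUL byte.

```elisp
;; Illustration only: `my-xml-decl-visible-p' is a hypothetical helper,
;; not a function that exists in Emacs.
(defun my-xml-decl-visible-p (bytes)
  "Return t if an ASCII regexp finds an XML declaration at the start of BYTES."
  (and (string-match-p "\\`<\\?xml" bytes) t))

(let ((header "<?xml version=\"1.0\" encoding=\"utf-16\"?>"))
  (list
   ;; UTF-8 is ASCII-compatible, so the bytes still read "<?xml ...".
   (my-xml-decl-visible-p (encode-coding-string header 'utf-8))
   ;; UTF-16LE yields "<" NUL "?" NUL "x" NUL ..., so the regexp fails.
   (my-xml-decl-visible-p (encode-coding-string header 'utf-16le))))
;; => (t nil)
```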

The patch below makes us at least recognize UTF-16 with BOM, and also
stop the encoding from frightening the user when she specifies UTF-16
with BOM at buffer-save time.  But by default, saving a buffer with
UTF-16BE or UTF-16LE still produces a file without BOM, and that
cannot be detected by our encoding-detection machinery, leaving it to
the user to use "C-x RET c" or "C-x RET r".
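
The BOM check can be sketched on raw bytes as follows. This is an illustration under stated assumptions, not the code in the patch: `my-utf-16-bom-coding-system' is a hypothetical helper that works on a unibyte string, whereas the patch inspects the first two characters of the buffer.

```elisp
;; Illustration only: `my-utf-16-bom-coding-system' is a hypothetical
;; helper, not a function that exists in Emacs.
(defun my-utf-16-bom-coding-system (bytes)
  "Return a UTF-16 coding system if unibyte string BYTES starts with a BOM.
Return nil when no UTF-16 BOM is present."
  (let ((bom (and (>= (length bytes) 2)
                  (list (aref bytes 0) (aref bytes 1)))))
    (cond
     ((equal bom '(#xFE #xFF)) 'utf-16be-with-signature)
     ((equal bom '(#xFF #xFE)) 'utf-16le-with-signature))))

;; With a BOM the byte order can be read off the first two bytes.
(my-utf-16-bom-coding-system
 (encode-coding-string "<?xml?>" 'utf-16le-with-signature))
;; => utf-16le-with-signature

;; Without a BOM there is nothing to recognize.
(my-utf-16-bom-coding-system
 (encode-coding-string "<?xml?>" 'utf-16le))
;; => nil
```

The nil result in the second call corresponds to the BOM-less case, where the user has to name the encoding explicitly.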

Perhaps we should by default produce encoding with BOM when XML header
specifies UTF-16?

diff --git a/lisp/international/mule-cmds.el b/lisp/international/mule-cmds.el
index dfa9e4e..a248ef8 100644
--- a/lisp/international/mule-cmds.el
+++ b/lisp/international/mule-cmds.el
@@ -1029,7 +1029,11 @@ select-safe-coding-system
 		 ;; This check perhaps isn't ideal, but is probably
 		 ;; the best thing to do.
 		 (not (auto-coding-alist-lookup (or file buffer-file-name "")))
-		 (not (coding-system-equal coding-system auto-cs)))
+		 (not (coding-system-equal coding-system auto-cs))
+                 (or (equal (coding-system-type auto-cs) 'charset)
+                     (not (coding-system-equal (coding-system-type auto-cs)
+                                               (coding-system-type
+                                                coding-system)))))
 	    (unless (yes-or-no-p
 		     (format "Selected encoding %s disagrees with \
 %s specified by file contents.  Really save (else edit coding cookies \
diff --git a/lisp/international/mule.el b/lisp/international/mule.el
index b5414de..fcdcd3c 100644
--- a/lisp/international/mule.el
+++ b/lisp/international/mule.el
@@ -2587,9 +2587,14 @@ xml-find-file-coding-system
       (let ((detected
              (with-coding-priority '(utf-8)
                (coding-system-base
-                (detect-coding-region (point-min) (point-max) t)))))
-        ;; Pure ASCII always comes back as undecided.
+                (detect-coding-region (point-min) (point-max) t))))
+            (bom (list (char-after 1) (char-after 2))))
         (cond
+         ((equal bom '(#xFE #xFF))
+          'utf-16be-with-signature)
+         ((equal bom '(#xFF #xFE))
+          'utf-16le-with-signature)
+         ;; Pure ASCII always comes back as undecided.
          ((memq detected '(utf-8 undecided))
           'utf-8)
          ((eq detected 'utf-16le-with-signature) 'utf-16le-with-signature)




This bug report was last modified 6 years and 61 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.