GNU bug report logs - #20623
XML and HTML files with encoding/charset="utf-8" declaration lose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save

Package: emacs

Reported by: Simon Ledergerber <sledergerber <at> gmx.net>

Date: Thu, 21 May 2015 18:53:02 UTC

Severity: normal

Found in version 26.1

Fixed in version 26.2

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

Message #52 received at 20623 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: rgm <at> gnu.org, a.s <at> realize.ch, 20623 <at> debbugs.gnu.org, sledergerber <at> gmx.net
Subject: Re: bug#20623: XML and HTML files with
 encoding/charset="utf-8" declaration lose BOM;
 Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Sun, 10 Dec 2017 21:17:00 +0200

> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Cc: rgm <at> gnu.org,  a.s <at> realize.ch,  sledergerber <at> gmx.net,  20623 <at> debbugs.gnu.org
> Date: Mon, 04 Dec 2017 16:08:14 -0500
> 
> > Isn't it better to fix this in sgml-xml-auto-coding-function?  That's
> > where the root cause is, AFAIU.
> 
> I'd expect the same problem would affect all other uses.

Not sure what you meant by "all other uses".  Could you please
elaborate?

> > And I don't understand the comment about latin-1-mac: I don't think we
> > have such problems in Emacs.  The -with-signature variety is
> > different, because it is not about EOL format.
> 
> You might be right, but I don't know where/how this is handled.

I would like to propose the following alternative patch, which accepts
utf-8-with-signature and utf-8-hfs as variants of utf-8 for the
purposes of encoding of XML files.  Comments?  Do we want a similar
treatment for UTF-16?  (That doesn't seem to be required by the bug
report, and UTF-16 in XML files is non-standard anyway.  But what
about HTML?)

diff --git a/lisp/international/mule.el b/lisp/international/mule.el
index 857fa80..5ff1acf 100644
--- a/lisp/international/mule.el
+++ b/lisp/international/mule.el
@@ -2493,7 +2493,17 @@ sgml-xml-auto-coding-function
 	    (let* ((match (match-string 1))
 		   (sym (intern (downcase match))))
 	      (if (coding-system-p sym)
-		  sym
+                  ;; If the encoding tag is UTF-8 and the buffer's
+                  ;; encoding is one of the variants of UTF-8, use the
+                  ;; buffer's encoding.  This allows, e.g., saving an
+                  ;; XML file as UTF-8 with BOM when the tag says UTF-8.
+                  (if (and (coding-system-equal 'utf-8
+                                                (coding-system-type sym))
+                           (coding-system-equal sym
+                                                (coding-system-type
+                                                 buffer-file-coding-system)))
+                      buffer-file-coding-system
+		    sym)
 		(message "Warning: unknown coding system \"%s\"" match)
 		nil))
           ;; Files without an encoding tag should be UTF-8. But users
@@ -2506,7 +2516,8 @@ sgml-xml-auto-coding-function
                    (coding-system-base
                     (detect-coding-region (point-min) size t)))))
             ;; Pure ASCII always comes back as undecided.
-            (if (memq detected '(utf-8 undecided))
+            (if (memq detected
+                      '(utf-8 utf-8-with-signature utf-8-hfs undecided))
                 'utf-8
               (warn "File contents detected as %s.
   Consider adding an encoding attribute to the xml declaration,
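
As an aside on the first hunk above: the check relies on
coding-system-type, which reports the same type, utf-8, for plain
utf-8 and for variants such as utf-8-with-signature.  A minimal sketch
of that selection rule follows.  It assumes a simplified comparison
with `eq' rather than the patch's `coding-system-equal', and the `my-'
helpers are hypothetical names used only for illustration; they are
not part of mule.el.

```elisp
;; Sketch only: `my-utf-8-variant-p' and `my-xml-pick-coding' are
;; hypothetical helpers, not functions defined by Emacs.

(defun my-utf-8-variant-p (coding-system)
  "Return non-nil if CODING-SYSTEM is a known coding system of type utf-8."
  (and (coding-system-p coding-system)
       (eq (coding-system-type coding-system) 'utf-8)))

(defun my-xml-pick-coding (declared current)
  "Choose the coding system to save with.
DECLARED is the coding system named by the XML encoding tag and
CURRENT is the buffer's present coding system.  If both are of
type utf-8, keep CURRENT so that a BOM variant survives the save;
otherwise use DECLARED."
  (if (and (my-utf-8-variant-p declared)
           (my-utf-8-variant-p current))
      current
    declared))

;; The tag says utf-8 and the buffer is utf-8 with BOM: keep the buffer's.
(my-xml-pick-coding 'utf-8 'utf-8-with-signature)

;; The tag says utf-8 but the buffer is not a UTF-8 variant: use the tag's.
(my-xml-pick-coding 'utf-8 'latin-1)
```

Evaluating the two calls in *scratch* shows which coding system each
case would select; the real function additionally has to handle
unknown tags and files with no encoding tag at all.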



