From unknown Sat Aug 16 16:07:01 2025 X-Loop: help-debbugs@gnu.org Subject: bug#61005: 28.1.91; Encoding not detected in HTML files inside archives Resent-From: Benjamin Riefenstahl Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Sun, 22 Jan 2023 13:15:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 61005 X-GNU-PR-Package: emacs X-GNU-PR-Keywords: To: 61005@debbugs.gnu.org X-Debbugs-Original-To: bug-gnu-emacs@gnu.org Received: via spool by submit@debbugs.gnu.org id=B.167439324329525 (code B ref -1); Sun, 22 Jan 2023 13:15:01 +0000 Received: (at submit) by debbugs.gnu.org; 22 Jan 2023 13:14:03 +0000 Received: from localhost ([127.0.0.1]:50906 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pJaAd-0007g9-3r for submit@debbugs.gnu.org; Sun, 22 Jan 2023 08:14:03 -0500 Received: from lists.gnu.org ([209.51.188.17]:36538) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pJaAY-0007fb-PU for submit@debbugs.gnu.org; Sun, 22 Jan 2023 08:14:01 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1pJaAY-000134-JO for bug-gnu-emacs@gnu.org; Sun, 22 Jan 2023 08:13:58 -0500 Received: from odoacer.turtle-trading.net ([93.241.193.16]) by eggs.gnu.org with esmtps (TLS1.2:DHE_RSA_AES_128_CBC_SHA1:128) (Exim 4.90_1) (envelope-from ) id 1pJaAV-0001lJ-Om for bug-gnu-emacs@gnu.org; Sun, 22 Jan 2023 08:13:58 -0500 Received: from zenobia.turtle-trading.net ([192.168.2.111]) by odoacer.turtle-trading.net with esmtp (Exim 4.80) (envelope-from ) id 1pJaAQ-00077S-M1; Sun, 22 Jan 2023 14:13:50 +0100 Received: from benny by zenobia.turtle-trading.net with local (Exim 4.94.2) (envelope-from ) id 1pJaAQ-0009AD-Dq; Sun, 22 Jan 2023 14:13:50 +0100 From: Benjamin Riefenstahl Date: Sun, 22 Jan 2023 14:13:50 +0100 Message-ID: <87bkmqempd.fsf@turtle-trading.net> 
MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" Received-SPF: none client-ip=93.241.193.16; envelope-from=benny@turtle-trading.net; helo=odoacer.turtle-trading.net X-Spam_score_int: -18 X-Spam_score: -1.9 X-Spam_bar: - X-Spam_report: (-1.9 / 5.0 requ) BAYES_00=-1.9, HTML_MESSAGE=0.001, SPF_HELO_NONE=0.001, SPF_NONE=0.001, T_HTML_ATTACH=0.01, T_OBFU_HTML_ATT_MALW=0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-Spam-Score: -2.3 (--) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---)

--=-=-=
Content-Type: text/plain
Content-Disposition: inline

Problem
----

* Given an HTML file with charset "windows-1255".

* Opening the file from disk detects the encoding correctly.

* Opening a ZIP archive with the same file inside and then opening the
  HTML archive member does not detect the encoding; instead, the coding
  system for saving is the default according to M-x
  describe-coding-system.

Attached are two files, test.html and test.zip.  Call "emacs -Q
test.html test.zip" and press RET on the archive member to reproduce.

--=-=-=
Content-Type: text/html; charset=windows-1255
Content-Disposition: attachment; filename=test.html
Content-Transfer-Encoding: quoted-printable

=F9=C8=D1=EC=E5=C9=ED

=F9=C8=D1=EC=E5=C9=ED

--=-=-=
Content-Type: application/zip
Content-Disposition: attachment; filename=test.zip
Content-Transfer-Encoding: base64

UEsDBBQAAAAIAPGdMVauwGXsbwAAAKIAAAAJABwAdGVzdC5odG1sVVQJAAM138Zj9d7GY3V4CwAB
BOgDAAAE6AMAALNRdPF3DokMcFXIKMnNseOygVAKCjYZqYkpIAaQmZtakqiQnJFYVJxaYqtUnpmX
kl9erGtoZGqqZGejD5KFKizJLMlJtVP4eeLim6cn3yrY6EMEQMbpw8yzScpPqYSqzzBEVgzkgVVC
FAD5YKcAAFBLAQIeAxQAAAAIAPGdMVauwGXsbwAAAKIAAAAJABgAAAAAAAEAAACkgQAAAAB0ZXN0
Lmh0bWxVVAUAAzXfxmN1eAsAAQToAwAABOgDAABQSwUGAAAAAAEAAQBPAAAAsgAAAAAA
--=-=-=
Content-Type: text/plain
Content-Disposition: inline

Solution
----

The problem seems to be the function
sgml-html-meta-auto-coding-function.  It is missing a condition similar
to the one added to code in sgml-xml-auto-coding-function with commit
#df7ed10e in 2018.

modified   lisp/international/mule.el
@@ -2539,6 +2539,10 @@ sgml-html-meta-auto-coding-function
                   (bfcs-type
                    (coding-system-type buffer-file-coding-system)))
               (if (and enable-multibyte-characters
+                       ;; 'charset' will signal an error in
+                       ;; coding-system-equal, since it isn't a
+                       ;; coding-system.  So test that up front.
+                       (not (equal sym-type 'charset))
                        (coding-system-equal 'utf-8 sym-type)
                        (coding-system-equal 'utf-8 bfcs-type))
                   buffer-file-coding-system

I will send this as a patch as soon as I have a bug number to mention in
the commit message.

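For illustration only, the guard in the hunk above can be sketched as a
standalone helper.  This is a rough sketch, not code from mule.el: it
assumes (as the patch comment states) that `coding-system-type' returns
the symbol `charset' for windows-1255 and that `charset' is not itself a
coding system.  The names `my-utf-8-type-p' and `my-unguarded-compare'
are hypothetical.

```elisp
(require 'mule-util)                    ; defines `coding-system-equal'

(defun my-utf-8-type-p (coding-system)
  "Return non-nil if the type of CODING-SYSTEM compares equal to `utf-8'.
Hypothetical helper for illustration only.  It tests for the
`charset' type up front, as the proposed patch does, so that symbol
is never handed to `coding-system-equal'."
  (let ((type (coding-system-type coding-system)))
    (and (not (eq type 'charset))
         (coding-system-equal 'utf-8 type))))

(defun my-unguarded-compare (coding-system)
  "Compare the type of CODING-SYSTEM with `utf-8' without the guard.
Return the symbol `signaled' if the comparison raises an error,
otherwise the result of `coding-system-equal'."
  (condition-case nil
      (coding-system-equal 'utf-8 (coding-system-type coding-system))
    (error 'signaled)))
```

Evaluating (my-utf-8-type-p 'windows-1255) and
(my-unguarded-compare 'windows-1255) side by side shows the difference
the extra test makes.  The sketch only covers the `charset' case named
in the patch; other coding-system types are not considered here.
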
---- In GNU Emacs 28.1.91 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.24.24, cairo version 1.16.0) of 2022-08-29 built on arrian Repository revision: f4168b8143008b787a11366462c928d761e90dd0 Repository branch: emacs-28 Windowing system distributor 'The X.Org Foundation', version 11.0.12011000 System Description: Debian GNU/Linux 11 (bullseye) Configured features: ACL CAIRO DBUS FREETYPE GIF GLIB GMP GNUTLS GPM GSETTINGS HARFBUZZ JPEG JSON LCMS2 LIBOTF LIBSELINUX LIBXML2 M17N_FLT MODULES NOTIFY INOTIFY PDUMPER PNG RSVG SECCOMP SOUND THREADS TIFF TOOLKIT_SCROLL_BARS X11 XDBE XIM XPM GTK3 ZLIB Important settings: value of $LANG: en_US.UTF-8 locale-coding-system: utf-8-unix Major mode: Dired by date Minor modes in effect: shell-dirtrack-mode: t desktop-save-mode: t display-time-mode: t xclip-mode: t xterm-mouse-mode: t delete-selection-mode: t cua-mode: t display-battery-mode: t tooltip-mode: t global-eldoc-mode: t show-paren-mode: t electric-indent-mode: t mouse-wheel-mode: t menu-bar-mode: t file-name-shadow-mode: t global-font-lock-mode: t font-lock-mode: t blink-cursor-mode: t auto-composition-mode: t auto-encryption-mode: t auto-compression-mode: t buffer-read-only: t column-number-mode: t line-number-mode: t transient-mark-mode: t Load-path shadows: ~/Projects/ttf-mode/arc-mode-compat hides ~/emacs/arc-mode-compat /home/benny/.emacs.d/elpa/transient-20210723.1601/transient hides /usr/local/share/emacs/28.1.91/lisp/transient /home/benny/.emacs.d/elpa/dictionary-20201001.1727/dictionary hides /usr/local/share/emacs/28.1.91/lisp/net/dictionary Features: (shadow sort mail-extr emacsbug message rmc puny rfc822 mml mml-sec epa epg rfc6068 epg-config gnus-util rmail rmail-loaddefs mm-decode mm-bodies mm-encode mailabbrev gmm-utils mailheader arc-mode archive-mode benny-images dirtrack shell pcomplete misearch multi-isearch thai-util thai-word lao-util enriched view tabify benny-auto-insert ttf-glyphs rng-xsd xsd-regexp rng-cmpct rng-nxml rng-valid rng-loc rng-uri 
rng-parse nxml-parse rng-match rng-dt rng-util rng-pttrn nxml-ns nxml-mode nxml-outln nxml-rap sgml-mode facemenu dom nxml-util nxml-enc xmltok mule-util jka-compr dired-aux time-date bug-reference imenu desktop frameset highline benny-calendar-cfg ange-ftp generic-x autoinsert cc-mode cc-fonts cc-guess cc-menus cc-styles cc-align cc-cmds cc-engine cc-vars cc-defs ps-print ps-print-loaddefs ps-def lpr advice cl-extra help-mode dired dired-loaddefs derived benny-x-clipboard disp-table time server protbuf xclip term/xterm xterm xt-mouse cal-china lunar solar cal-dst cal-bahai cal-islam cal-hebrew holidays hol-loaddefs vc-git diff-mode easy-mmode vc-dispatcher vc-fossil diary-lib diary-loaddefs cal-menu calendar cal-loaddefs delsel grep compile text-property-search comint ansi-color ring cua-base cus-load format-spec battery dbus xml sendmail mail-utils .loaddefs benny-tools autoload radix-tree lisp-mnt mail-parse rfc2231 rfc2047 rfc2045 mm-util ietf-drums mail-prsvr edmacro kmacro info package browse-url url url-proxy url-privacy url-expand url-methods url-history url-cookie url-domsuf url-util mailcap url-handlers url-parse auth-source cl-seq eieio eieio-core cl-macs eieio-loaddefs password-cache json subr-x map url-vars seq byte-opt gv bytecomp byte-compile cconv cl-loaddefs cl-lib iso-transl tooltip eldoc paren electric uniquify ediff-hook vc-hooks lisp-float-type elisp-mode mwheel term/x-win x-win term/common-win x-dnd tool-bar dnd fontset image regexp-opt fringe tabulated-list replace newcomment text-mode lisp-mode prog-mode register page tab-bar menu-bar rfn-eshadow isearch easymenu timer select scroll-bar mouse jit-lock font-lock syntax font-core term/tty-colors frame minibuffer cl-generic cham georgian utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms cp51932 hebrew greek romanian slovak czech european ethiopic indian cyrillic chinese composite emoji-zwj charscript charprop case-table epa-hook jka-cmpr-hook help simple abbrev 
obarray cl-preloaded nadvice button loaddefs faces cus-face macroexp files window text-properties overlay sha1 md5 base64 format env code-pages mule custom widget hashtable-print-readable backquote threads dbusbind inotify lcms2 dynamic-setting system-font-setting font-render-setting cairo move-toolbar gtk x-toolkit x multi-tty make-network-process emacs) Memory information: ((conses 16 273770 13520) (symbols 48 18619 1) (strings 32 66582 2920) (string-bytes 1 2318045) (vectors 16 39996) (vector-slots 8 1131973 174560) (floats 8 762 66) (intervals 56 1039 60) (buffers 992 50)) --=-=-=-- From unknown Sat Aug 16 16:07:01 2025 X-Loop: help-debbugs@gnu.org Subject: bug#61005: 28.1.91; Encoding not detected in HTML files inside archives Resent-From: Benjamin Riefenstahl Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Sun, 22 Jan 2023 13:25:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 61005 X-GNU-PR-Package: emacs X-GNU-PR-Keywords: To: 61005@debbugs.gnu.org Received: via spool by 61005-submit@debbugs.gnu.org id=B61005.167439385730508 (code B ref 61005); Sun, 22 Jan 2023 13:25:02 +0000 Received: (at 61005) by debbugs.gnu.org; 22 Jan 2023 13:24:17 +0000 Received: from localhost ([127.0.0.1]:50920 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pJaKX-0007vz-8I for submit@debbugs.gnu.org; Sun, 22 Jan 2023 08:24:17 -0500 Received: from odoacer.turtle-trading.net ([93.241.193.16]:49764) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pJaKU-0007vk-JU for 61005@debbugs.gnu.org; Sun, 22 Jan 2023 08:24:15 -0500 Received: from zenobia.turtle-trading.net ([192.168.2.111]) by odoacer.turtle-trading.net with esmtp (Exim 4.80) (envelope-from ) id 1pJaKO-00077x-2X; Sun, 22 Jan 2023 14:24:08 +0100 Received: from benny by zenobia.turtle-trading.net with local (Exim 4.94.2) (envelope-from ) id 1pJaKN-0009wO-QW; Sun, 22 Jan 2023 14:24:07 +0100 From: Benjamin 
Riefenstahl References: <87bkmqempd.fsf@turtle-trading.net> Date: Sun, 22 Jan 2023 14:24:07 +0100 In-Reply-To: <87bkmqempd.fsf@turtle-trading.net> (Benjamin Riefenstahl's message of "Sun, 22 Jan 2023 14:13:50 +0100") Message-ID: <877cxeem88.fsf@turtle-trading.net> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2.50 (gnu/linux) MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" X-Spam-Score: 0.0 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-)

--=-=-=
Content-Type: text/plain

The promised patch.  This is against master.

Also a small test-suite for sgml-html-meta-auto-coding-function, if you
want that.  If you care, I could also add one for
sgml-xml-auto-coding-function.

--=-=-=
Content-Type: text/x-diff
Content-Disposition: attachment; filename=0001-Fix-decoding-HTML-files-from-archives.patch

>From 95b63baf1bf411422c61b76470abb1aa681f2db2 Mon Sep 17 00:00:00 2001
From: Benjamin Riefenstahl
Date: Tue, 17 Jan 2023 20:08:15 +0200
Subject: [PATCH 1/2] Fix decoding HTML files from archives

* lisp/international/mule.el (sgml-html-meta-auto-coding-function):
Avoid signaling an error from coding-system-equal when the HTML meta
tag specifies an encoding whose type is 'charset'.  (Bug#61005)

This is the same fix as in #df7ed10e for sgml-xml-auto-coding-function.
---
 lisp/international/mule.el | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/lisp/international/mule.el b/lisp/international/mule.el
index 4f6addea387..9480213be9a 100644
--- a/lisp/international/mule.el
+++ b/lisp/international/mule.el
@@ -2539,6 +2539,10 @@ sgml-html-meta-auto-coding-function
                   (bfcs-type
                    (coding-system-type buffer-file-coding-system)))
               (if (and enable-multibyte-characters
+                       ;; 'charset' will signal an error in
+                       ;; coding-system-equal, since it isn't a
+                       ;; coding-system.  So test that up front.
+                       (not (equal sym-type 'charset))
                        (coding-system-equal 'utf-8 sym-type)
                        (coding-system-equal 'utf-8 bfcs-type))
                   buffer-file-coding-system
-- 
2.30.2

--=-=-=
Content-Type: text/x-diff
Content-Disposition: attachment; filename=0002-Add-test-suite-for-sgml-html-meta-auto-coding-functi.patch

>From 29996e07c23c9716f731dde224c8ca47e321e697 Mon Sep 17 00:00:00 2001
From: Benjamin Riefenstahl
Date: Tue, 17 Jan 2023 20:13:39 +0200
Subject: [PATCH 2/2] Add test suite for sgml-html-meta-auto-coding-function

* test/lisp/international/mule-tests.el (sgml-html-meta-pre)
(sgml-html-meta-post, sgml-html-meta-run, sgml-html-meta-utf-8)
(sgml-html-meta-windows-hebrew, sgml-html-meta-none)
(sgml-html-meta-unknown-coding, sgml-html-meta-no-pre)
(sgml-html-meta-no-post-less-than-10lines)
(sgml-html-meta-no-post-10lines, sgml-html-meta-utf-8-with-bom): Add.
---
 test/lisp/international/mule-tests.el | 66 +++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/test/lisp/international/mule-tests.el b/test/lisp/international/mule-tests.el
index 4f70b275848..6e23d8c5421 100644
--- a/test/lisp/international/mule-tests.el
+++ b/test/lisp/international/mule-tests.el
@@ -70,6 +70,72 @@ mule-hz
   ;; The chinese-hz encoding is not ASCII compatible.
   (should-not (coding-system-get 'chinese-hz :ascii-compatible-p)))
 
+;;; Testing `sgml-html-meta-auto-coding-function'.
+
+(defconst sgml-html-meta-pre "<!doctype html><html><head>"
+  "The beginning of a minimal HTML document.")
+
+(defconst sgml-html-meta-post "</head></html>"
+  "The end of a minimal HTML document.")
+
+(defun sgml-html-meta-run (coding-system)
+  "Run `sgml-html-meta-auto-coding-function' on a minimal HTML.
+When CODING-SYSTEM is not nil, insert it, wrapped in a '<meta>'
+element.  When CODING-SYSTEM contains HTML meta characters or
+white space, insert it as-is, without additional formatting.  Use
+the variables `sgml-html-meta-pre' and `sgml-html-meta-post' to
+provide HTML fragments.  Some tests override those variables."
+  (with-temp-buffer
+    (insert sgml-html-meta-pre
+            (cond ((not coding-system)
+                   "")
+                  ((string-match "[<>'\"\n ]" coding-system)
+                   coding-system)
+                  (t
+                   (format "<meta charset='%s'>" coding-system)))
+            sgml-html-meta-post)
+    (goto-char (point-min))
+    (sgml-html-meta-auto-coding-function (- (point-max) (point-min)))))
+
+(ert-deftest sgml-html-meta-utf-8 ()
+  "Baseline: UTF-8."
+  (should (eq 'utf-8 (sgml-html-meta-run "utf-8"))))
+
+(ert-deftest sgml-html-meta-windows-hebrew ()
+  "A non-Unicode charset."
+  (should (eq 'windows-1255 (sgml-html-meta-run "windows-1255"))))
+
+(ert-deftest sgml-html-meta-none ()
+  (should (eq nil (sgml-html-meta-run nil))))
+
+(ert-deftest sgml-html-meta-unknown-coding ()
+  (should (eq nil (sgml-html-meta-run "XXX"))))
+
+(ert-deftest sgml-html-meta-no-pre ()
+  "Without the prefix, so not HTML."
+  (let ((sgml-html-meta-pre ""))
+    (should (eq nil (sgml-html-meta-run "utf-8")))))
+
+(ert-deftest sgml-html-meta-no-post-less-than-10lines ()
+  "No '</head>', detect charset in the first 10 lines."
+  (let ((sgml-html-meta-post ""))
+    (should (eq 'utf-8 (sgml-html-meta-run
+                        (concat "\n\n\n\n\n\n\n\n\n"
+                                "<meta charset='utf-8'>"))))))
+
+(ert-deftest sgml-html-meta-no-post-10lines ()
+  "No '</head>', do not detect charset after the first 10 lines."
+  (let ((sgml-html-meta-post ""))
+    (should (eq nil (sgml-html-meta-run
+                     (concat "\n\n\n\n\n\n\n\n\n\n"
+                             "<meta charset='utf-8'>"))))))
+
+(ert-deftest sgml-html-meta-utf-8-with-bom ()
+  "Requesting 'UTF-8' does not override `utf-8-with-signature'.
+Check fix for Bug#20623."
+  (let ((buffer-file-coding-system 'utf-8-with-signature))
+    (should (eq 'utf-8-with-signature (sgml-html-meta-run "utf-8")))))
+
 ;; Stop "Local Variables" above causing confusion when visiting this file.
-- 
2.30.2

--=-=-=--

From unknown Sat Aug 16 16:07:01 2025 MIME-Version: 1.0 X-Mailer: MIME-tools 5.505 (Entity 5.505) X-Loop: help-debbugs@gnu.org From: help-debbugs@gnu.org (GNU bug Tracking System) To: Benjamin Riefenstahl Subject: bug#61005: closed (Re: bug#61005: 28.1.91; Encoding not detected in HTML files inside archives) Message-ID: References: <83leluk6dw.fsf@gnu.org> <87bkmqempd.fsf@turtle-trading.net> X-Gnu-PR-Message: they-closed 61005 X-Gnu-PR-Package: emacs Reply-To: 61005@debbugs.gnu.org Date: Sun, 22 Jan 2023 14:11:02 +0000 Content-Type: multipart/mixed; boundary="----------=_1674396662-2342-1"

This is a multi-part message in MIME format...

------------=_1674396662-2342-1
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Your bug report

#61005: 28.1.91; Encoding not detected in HTML files inside archives

which was filed against the emacs package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 61005@debbugs.gnu.org.

--=20 61005: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=3D61005 GNU Bug Tracking System Contact help-debbugs@gnu.org with problems ------------=_1674396662-2342-1 Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit Received: (at 61005-done) by debbugs.gnu.org; 22 Jan 2023 14:10:12 +0000 Received: from localhost ([127.0.0.1]:50959 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pJb2y-0000ak-GW for submit@debbugs.gnu.org; Sun, 22 Jan 2023 09:10:12 -0500 Received: from eggs.gnu.org ([209.51.188.92]:42232) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pJb2u-0000aR-EF for 61005-done@debbugs.gnu.org; Sun, 22 Jan 2023 09:10:10 -0500 Received: from fencepost.gnu.org ([2001:470:142:3::e]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1pJb2U-0008D8-O4; Sun, 22 Jan 2023 09:09:55 -0500 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org; s=fencepost-gnu-org; h=References:Subject:In-Reply-To:To:From:Date: mime-version; bh=3O/NexPgajED4XU6HlmZ/Uvz0C9WP77ekPPDufzHueY=; b=J02wPSDnsqSo 9AZnXs5sgYNSxxh4Qh8WkZQJ/lPN41DoOlALId87yrgKsx3pHuXDny0si2fN1WX9QH9s4J8E1cxcs k08EhIq+31vsjj99XWR/f8CVGsK52kMEtDcxzrSI2fVgvXVcF96iomr44/XnWYDSUnqvrrsaAKVOl MQMsoUiEd5QROr33IQ8Fu5irrVg0DTIQenwAqHVVVjtbbN6QrQLcJNxE2NXzeiya+e+5reY62P0gR Vq0OxBwEG/UGSb5b+AFsURXQvnAEdbs/5HwS2/alSbBqi3dIJ7hFL9MSp5oAL833zRG9eGr1tjwRf ew96tseVLBy7WLnG4jaGtA==; Received: from [87.69.77.57] (helo=home-c4e4a596f7) by fencepost.gnu.org with esmtpsa (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1pJb2T-0008NF-Q6; Sun, 22 Jan 2023 09:09:42 -0500 Date: Sun, 22 Jan 2023 16:09:47 +0200 Message-Id: <83leluk6dw.fsf@gnu.org> From: Eli Zaretskii To: Benjamin Riefenstahl In-Reply-To: <877cxeem88.fsf@turtle-trading.net> (message from Benjamin Riefenstahl on Sun, 22 Jan 2023 14:24:07 +0100) Subject: Re: bug#61005: 28.1.91; Encoding not 
detected in HTML files inside archives References: <87bkmqempd.fsf@turtle-trading.net> <877cxeem88.fsf@turtle-trading.net> X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 61005-done Cc: 61005-done@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---)

> From: Benjamin Riefenstahl
> Date: Sun, 22 Jan 2023 14:24:07 +0100
>
> The promised patch.  This is against master.
>
> Also a small test-suite for sgml-html-meta-auto-coding-function, if you
> want that.  If you care, I could also add one for
> sgml-xml-auto-coding-function.

Thanks, I installed this on the emacs-29 branch, and I'm closing the
bug.

------------=_1674396662-2342-1--