GNU bug report logs - #30789
26.0.91; xml-parse-region works but libxml-parse-html-region doesn't


Package: emacs;

Reported by: Katsumi Yamaoka <yamaoka <at> jpl.org>

Date: Mon, 12 Mar 2018 23:40:02 UTC

Severity: wishlist

Tags: wontfix

Found in version 26.0.91

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.


From: Katsumi Yamaoka <yamaoka <at> jpl.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>, 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
Cc: 30789 <at> debbugs.gnu.org
Subject: bug#30789: 26.0.91; xml-parse-region works but libxml-parse-html-region doesn't
Date: Tue, 13 Mar 2018 11:28:45 +0900
[Message part 1 (text/plain, inline)]
On Tue, 13 Mar 2018 01:44:22 +0100, Lars Ingebrigtsen wrote:
> libxml is more strict about correctness of the input than most other
> HTML parsers.  I don't think there's anything we can do about this
> problematic input other than ponder whether Emacs should use a different
> HTML parser, which I think sounds unlikely.  :-)

I see.  I agree that libxml should not be modified.  Jidanni, if you
often get such broken mails, how about trying the following patch
locally?  I'm not sure whether it causes other problems, but it
at least fixes the mail in question.

[Message part 2 (text/x-patch, inline)]
--- mm-decode.el~	2018-02-28 02:01:37.897607000 +0000
+++ mm-decode.el	2018-03-13 02:23:04.321753900 +0000
@@ -1810,6 +1810,11 @@
       (when (and (or coding
 		     (setq coding (mm-charset-to-coding-system charset nil t)))
 		 (not (eq coding 'ascii)))
+	;; Remove extra bytes in utf-8 encoded data.
+	(when (eq coding 'utf-8)
+	  (goto-char (point-min))
+	  (while (re-search-forward "[\x00-\x7f]+\\([\x80-\xbf]\\)" nil t)
+	    (replace-match "\\1")))
 	(insert (prog1
 		    (decode-coding-string (buffer-string) coding)
 		  (erase-buffer)
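
To illustrate what the regular expression in the patch above does, here is
a minimal Python sketch (a transcription for illustration, not the Emacs
Lisp code itself).  It assumes the same pattern applied to raw bytes; the
function name `strip_ascii_before_continuation` and the sample inputs are
hypothetical and are not taken from the mail in question.  A run of ASCII
bytes immediately followed by a UTF-8 continuation byte (0x80-0xBF) is
dropped, so a multibyte character split by, say, a stray line break is
rejoined, but legitimate ASCII text preceding an orphaned continuation byte
would be deleted too.

```python
import re

# Same pattern as in the patch, transcribed for Python bytes:
# one or more ASCII bytes followed by a UTF-8 continuation byte.
PATTERN = re.compile(rb"[\x00-\x7f]+([\x80-\xbf])")


def strip_ascii_before_continuation(data: bytes) -> bytes:
    """Drop each ASCII run that directly precedes a continuation byte."""
    return PATTERN.sub(rb"\1", data)


# Hypothetical broken input: the 3-byte character U+4E2D split by CRLF.
broken = b"<p>\xe4\r\n\xb8\xad</p>"
repaired = strip_ascii_before_continuation(broken)
print(repaired.decode("utf-8"))

# Possible side effect: ASCII text before an orphaned continuation byte
# is removed as well.
print(strip_ascii_before_continuation(b"abc\x80"))
```

The second call hints at why the patch may cause other problems: the
pattern cannot tell inserted junk from real ASCII content.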

This bug report was last modified 7 years and 39 days ago.


