GNU bug report logs - #30789
26.0.91; xml-parse-region works but libxml-parse-html-region doesn't


Package: emacs

Reported by: Katsumi Yamaoka <yamaoka <at> jpl.org>

Date: Mon, 12 Mar 2018 23:40:02 UTC

Severity: wishlist

Tags: wontfix

Found in version 26.0.91

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 30789 in the body.
You can then email your comments to 30789 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#30789; Package emacs. (Mon, 12 Mar 2018 23:40:02 GMT)

Acknowledgement sent to Katsumi Yamaoka <yamaoka <at> jpl.org>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Mon, 12 Mar 2018 23:40:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Katsumi Yamaoka <yamaoka <at> jpl.org>
To: bug-gnu-emacs <at> gnu.org
Cc: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
Subject: 26.0.91; xml-parse-region works but libxml-parse-html-region doesn't
Date: Tue, 13 Mar 2018 08:38:09 +0900
Hi,

Jidanni mailed me an example HTML mail that contains broken
encoded text, as follows:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    .......公告辦理現金救助及低利貸款\343\200
 \202因2月
 低溫危害農作物為延遲性損害,.......
  </body>
</html>

This is a part of the contents.  The original is encoded as utf-8
and 8-bit (attached to this mail).  Here "\343\200\n \202" is the
encoded form of "。", i.e., "\343\200\202", but split in the middle
of its byte sequence.  It seems that some mail software did this
while breaking a long encoded line.
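
A minimal Python sketch of the breakage described above, assuming only the byte values quoted in the report. It is an illustration, not Emacs code: the sample string and the helper names `octal_escape` and `strict_decode_fails` are made up for this example.

```python
# U+3002 IDEOGRAPHIC FULL STOP is three bytes in UTF-8: 0xE3 0x80 0x82,
# written \343\200\202 in the octal notation used in the report.
full_stop = "。".encode("utf-8")

# Hypothetical sample text; simulate a line break plus a space being
# inserted between the second and third byte of the character.
intact = "貸款。因2月".encode("utf-8")
broken = intact.replace(full_stop, full_stop[:2] + b"\n " + full_stop[2:])


def octal_escape(data: bytes) -> str:
    """Render every byte as a backslash followed by three octal digits."""
    return "".join(f"\\{b:03o}" for b in data)


def strict_decode_fails(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder rejects the data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


print(octal_escape(full_stop))
print(strict_decode_fails(intact), strict_decode_fails(broken))
```

The intact bytes decode cleanly, while the split sequence is rejected by a strict decoder because a newline appears where a continuation byte is required.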

When I read the mail using Gnus + shr, all the text after the broken
point is cut off.  That is what libxml-parse-html-region does,
whereas xml-parse-region doesn't cut it.  Moreover, a web browser,
to which I send the html data using the `K H' command, shows all
the text (though the broken character is shown as is).

This is not necessarily a libxml bug anyway, but I hope it works
like xml-parse.

Thanks.

In GNU Emacs 26.0.91 (build 1, x86_64-unknown-cygwin, GTK+ Version 3.22.28)
 of 2018-03-12 built on localhost
Windowing system distributor 'The Cygwin/X Project', version 11.0.11906000
[example-html-mail.gz (application/x-gunzip, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#30789; Package emacs. (Tue, 13 Mar 2018 00:45:02 GMT)

Message #8 received at 30789 <at> debbugs.gnu.org:

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Katsumi Yamaoka <yamaoka <at> jpl.org>
Cc: 30789 <at> debbugs.gnu.org,
 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
Subject: Re: bug#30789: 26.0.91;
 xml-parse-region works but libxml-parse-html-region doesn't
Date: Tue, 13 Mar 2018 01:44:22 +0100
Katsumi Yamaoka <yamaoka <at> jpl.org> writes:

> When I read the mail using Gnus + shr, the text after the broken
> point is all cut off.  That is what libxml-parse-html-region does,
> whereas xml-parse-region doesn't cut it.  Moreover a web browser,
> to which I send the html data using the `K H' command, shows all
> the text (the broken character is shown as is, though).
>
> This is not necessarily a libxml bug anyway, but I hope it works
> like xml-parse.

libxml is more strict about correctness of the input than most other
HTML parsers.  I don't think there's anything we can do about this
problematic input other than ponder whether Emacs should use a different
HTML parser, which I think sounds unlikely.  :-)

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#30789; Package emacs. (Tue, 13 Mar 2018 02:29:02 GMT)

Message #11 received at 30789 <at> debbugs.gnu.org:

From: Katsumi Yamaoka <yamaoka <at> jpl.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>,
 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
Cc: 30789 <at> debbugs.gnu.org
Subject: Re: bug#30789: 26.0.91;
 xml-parse-region works but libxml-parse-html-region doesn't
Date: Tue, 13 Mar 2018 11:28:45 +0900
On Tue, 13 Mar 2018 01:44:22 +0100, Lars Ingebrigtsen wrote:
> libxml is more strict about correctness of the input than most other
> HTML parsers.  I don't think there's anything we can do about this
> problematic input other than ponder whether Emacs should use a different
> HTML parser, which I think sounds unlikely.  :-)

I see.  I agree that libxml should not be modified.  Jidanni, how
about trying the following patch yourself if you often get such
broken mails?  I'm not quite sure whether it causes other problems,
but it fixes at least the mail in question.

[Message part 2 (text/x-patch, inline)]
--- mm-decode.el~	2018-02-28 02:01:37.897607000 +0000
+++ mm-decode.el	2018-03-13 02:23:04.321753900 +0000
@@ -1810,6 +1810,11 @@
       (when (and (or coding
 		     (setq coding (mm-charset-to-coding-system charset nil t)))
 		 (not (eq coding 'ascii)))
+	;; Remove extra bytes in utf-8 encoded data.
+	(when (eq coding 'utf-8)
+	  (goto-char (point-min))
+	  (while (re-search-forward "[\x00-\x7f]+\\([\x80-\xbf]\\)" nil t)
+	    (replace-match "\\1")))
 	(insert (prog1
 		    (decode-coding-string (buffer-string) coding)
 		  (erase-buffer)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#30789; Package emacs. (Tue, 13 Mar 2018 02:57:02 GMT)

Message #14 received at 30789 <at> debbugs.gnu.org:

From: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: Katsumi Yamaoka <yamaoka <at> jpl.org>, 30789 <at> debbugs.gnu.org
Subject: Re: bug#30789: 26.0.91;
 xml-parse-region works but libxml-parse-html-region doesn't
Date: Tue, 13 Mar 2018 10:55:58 +0800
Expecting perfect input is OK for compilers, but not for browsers
https://blog.codinghorror.com/its-a-malformed-world/




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#30789; Package emacs. (Tue, 13 Mar 2018 03:30:02 GMT)

Message #17 received at 30789 <at> debbugs.gnu.org:

From: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
To: Katsumi Yamaoka <yamaoka <at> jpl.org>
Cc: Lars Ingebrigtsen <larsi <at> gnus.org>, 30789 <at> debbugs.gnu.org
Subject: Re: bug#30789: 26.0.91;
 xml-parse-region works but libxml-parse-html-region doesn't
Date: Tue, 13 Mar 2018 11:29:50 +0800
Thank you for the patch but the real answer is to do what all other
browsers do... show as much as possible.

There is no browser out there that would dream of dying on the slightest
mistake.
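
The "show as much as possible" policy can be sketched in Python using only the standard library. This is an analogy under stated assumptions, not a description of how any browser or libxml2 recovers from errors: the sample bytes are hand-made, `TextCollector` and `visible_text` are hypothetical names, and undecodable bytes are simply replaced with U+FFFD before parsing.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the character data found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(raw: bytes) -> str:
    # errors="replace" substitutes U+FFFD for bytes that are not valid
    # UTF-8 instead of raising, so the rest of the input is still seen.
    text = raw.decode("utf-8", errors="replace")
    collector = TextCollector()
    collector.feed(text)
    collector.close()
    return "".join(collector.chunks)


# Hand-made sample: a three-byte character split by a newline and a space.
raw = b"<html><body>before\xe3\x80\n \x82after</body></html>"
result = visible_text(raw)
print(repr(result))
```

The text on both sides of the damaged character survives, and the damage itself is shown as replacement characters rather than ending the output.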

Anyway if you guys are really going to use XML::LibXML::Parser (?) then
maybe loosen up some of

       recover
           /parser, html, reader/

           recover from errors; possible values are 0, 1, and 2

           A true value turns on recovery mode which allows one to parse broken XML or HTML data. The recovery mode allows the parser to return the successfully parsed
           portion of the input document. This is useful for almost well-formed documents, where for example a closing tag is missing somewhere. Still, XML::LibXML will
           only parse until the first fatal (non-recoverable) error occurs, reporting recoverable parsing errors as warnings. To suppress even these warnings, use
           recover=>2.

           Note that validation is switched off automatically in recovery mode.

       validation
           /parser, reader/

           validate with the DTD; possible values are 0 and 1


      ERROR REPORTING
       XML::LibXML throws exceptions during parsing, validation or XPath processing (and some other occasions). These errors can be caught by using eval blocks. The error
       is stored in $@. There are two implementations: the old one throws $@ which is just a message string, in the new one $@ is an object from the class
       XML::LibXML::Error; this class overrides the operator "" so that when printed, the object flattens to the usual error message.

       XML::LibXML throws errors as they occur. This is a very common misunderstanding in the use of XML::LibXML. If the eval is omitted, XML::LibXML will always halt your
       script by "croaking" (see Carp man page for details).

       Also note that an increasing number of functions throw errors if bad data is passed as arguments. If you cannot assure valid data passed to XML::LibXML you should
       eval these functions.

       Note: since version 1.59, get_last_error() is no longer available in XML::LibXML for thread-safety reasons.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#30789; Package emacs. (Tue, 13 Mar 2018 03:32:02 GMT)

Message #20 received at 30789 <at> debbugs.gnu.org:

From: Katsumi Yamaoka <yamaoka <at> jpl.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>, 積丹尼 Dan
 Jacobson <jidanni <at> jidanni.org>
Cc: 30789 <at> debbugs.gnu.org
Subject: Re: bug#30789: 26.0.91;
 xml-parse-region works but libxml-parse-html-region doesn't
Date: Tue, 13 Mar 2018 12:31:09 +0900
On Tue, 13 Mar 2018 11:28:45 +0900, Katsumi Yamaoka wrote:
> +	;; Remove extra bytes in utf-8 encoded data.
> +	(when (eq coding 'utf-8)
> +	  (goto-char (point-min))
> +	  (while (re-search-forward "[\x00-\x7f]+\\([\x80-\xbf]\\)" nil t)
> +	    (replace-match "\\1")))

Corrected:
[Message part 2 (text/x-patch, inline)]
--- mm-decode.el~	2018-02-28 02:01:37.897607000 +0000
+++ mm-decode.el	2018-03-13 03:27:56.885844100 +0000
@@ -1810,6 +1810,13 @@
       (when (and (or coding
 		     (setq coding (mm-charset-to-coding-system charset nil t)))
 		 (not (eq coding 'ascii)))
+	;; Remove extra bytes in utf-8 encoded data.
+	(when (eq coding 'utf-8)
+	  (goto-char (point-min))
+	  (while (re-search-forward
+		  "\\([\xc2-\xf7][\x80-\xbf]?\\)[\x00-\x7f]+\\([\x80-\xbf]\\)"
+		  nil t)
+	    (replace-match "\\1\\2")))
 	(insert (prog1
 		    (decode-coding-string (buffer-string) coding)
 		  (erase-buffer)
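
The difference between the first and the corrected regexp can be checked outside Emacs. The Python sketch below transliterates both patterns to byte regexps and applies them to three hand-made samples. It assumes that Python's `re` on bytes behaves like the Emacs regexps on the undecoded buffer for these inputs; the sample data and function names are made up for illustration and do not come from the attached mail.

```python
import re

# Transliteration of the first patch: any ASCII run directly followed by
# a UTF-8 continuation byte is deleted, keeping only the continuation byte.
FIRST = re.compile(rb"[\x00-\x7f]+([\x80-\xbf])")

# Transliteration of the corrected patch: the ASCII run must additionally
# be preceded by a lead byte and at most one continuation byte.
CORRECTED = re.compile(rb"([\xc2-\xf7][\x80-\xbf]?)[\x00-\x7f]+([\x80-\xbf])")


def apply_first(data: bytes) -> bytes:
    return FIRST.sub(rb"\1", data)


def apply_corrected(data: bytes) -> bytes:
    return CORRECTED.sub(rb"\1\2", data)


split_char = b"\xe3\x80\n \x82"       # the case from the report
stray_byte = b"hello\x80 world"       # continuation byte with no lead byte
four_byte = b"\xf0\x9f\x98\n \x80"    # four-byte character split after byte 3

for sample in (split_char, stray_byte, four_byte):
    print(sample, apply_first(sample), apply_corrected(sample))
```

In this sketch both patterns repair the split character from the report. The first pattern also deletes the ordinary text "hello" in front of a stray continuation byte, which the corrected pattern leaves alone. The corrected pattern in turn does not match a four-byte character split after its third byte, so that case stays broken.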

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#30789; Package emacs. (Tue, 13 Mar 2018 11:29:02 GMT)

Message #23 received at 30789 <at> debbugs.gnu.org:

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
Cc: Katsumi Yamaoka <yamaoka <at> jpl.org>, 30789 <at> debbugs.gnu.org
Subject: Re: bug#30789: 26.0.91;
 xml-parse-region works but libxml-parse-html-region doesn't
Date: Tue, 13 Mar 2018 12:28:47 +0100
積丹尼 Dan Jacobson <jidanni <at> jidanni.org> writes:

> There is no browser out there that would dream of dying on the slightest
> mistake.

I agree, and you should report these problems to the libxml2
maintainers.

> Anyway if you guys are really going to use XML::LibXML::Parser (?) then
> maybe loosen up some of

Our calls are as loose as they get, if I recall correctly.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#30789; Package emacs. (Tue, 13 Mar 2018 20:28:02 GMT)

Message #26 received at 30789 <at> debbugs.gnu.org:

From: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: Katsumi Yamaoka <yamaoka <at> jpl.org>, 30789 <at> debbugs.gnu.org
Subject: Re: bug#30789: 26.0.91;
 xml-parse-region works but libxml-parse-html-region doesn't
Date: Wed, 14 Mar 2018 04:27:08 +0800
>>>>> "LI" == Lars Ingebrigtsen <larsi <at> gnus.org> writes:

LI> I agree, and you should report these problems to the libxml2
LI> maintainers.

I would not want to ruin my reputation by letting them know I was
inputting unvalidated XML and expecting whatever results.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#30789; Package emacs. (Tue, 13 Mar 2018 22:31:01 GMT)

Message #29 received at 30789 <at> debbugs.gnu.org:

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
Cc: Katsumi Yamaoka <yamaoka <at> jpl.org>, 30789 <at> debbugs.gnu.org
Subject: Re: bug#30789: 26.0.91;
 xml-parse-region works but libxml-parse-html-region doesn't
Date: Tue, 13 Mar 2018 23:30:31 +0100
積丹尼 Dan Jacobson <jidanni <at> jidanni.org> writes:

> LI> I agree, and you should report these problems to the libxml2
> LI> maintainers.
>
> I would not want to ruin my reputation by letting them know I was
> inputting unvalidated XML and expecting whatever results.

:-)





Added tag(s) wontfix. Request was from Glenn Morris <rgm <at> gnu.org> to control <at> debbugs.gnu.org. (Thu, 15 Mar 2018 18:59:01 GMT)

bug closed, send any further explanations to 30789 <at> debbugs.gnu.org and Katsumi Yamaoka <yamaoka <at> jpl.org> Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Fri, 13 Apr 2018 22:41:01 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 12 May 2018 11:24:07 GMT)
