GNU bug report logs - #30789
26.0.91; xml-parse-region works but libxml-parse-html-region doesn't
Reported by: Katsumi Yamaoka <yamaoka <at> jpl.org>
Date: Mon, 12 Mar 2018 23:40:02 UTC
Severity: wishlist
Tags: wontfix
Found in version 26.0.91
Done: Lars Ingebrigtsen <larsi <at> gnus.org>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 30789 in the body.
You can then email your comments to 30789 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#30789; Package emacs. (Mon, 12 Mar 2018 23:40:02 GMT)
Acknowledgement sent to Katsumi Yamaoka <yamaoka <at> jpl.org>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Mon, 12 Mar 2018 23:40:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Hi,
Jidanni mailed me an example HTML mail that contains broken
encoded text, as follows:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
.......公告辦理現金救助及低利貸款\343\200
\202因2月
低溫危害農作物為延遲性損害,.......
</body>
</html>
This is part of the contents. The original is encoded in utf-8 and
8-bit (attached to this mail). Here "\343\200\n \202" is the encoded
form of "。", i.e., "\343\200\202", but broken in the middle of its
bytes. It seems that some careless mail software did this because of
a long encoded line.
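For readers who want to check the byte arithmetic, here is a minimal Python sketch (not part of the original report). It assumes only what the paragraph above states: that the octal escapes \343\200\202 are the UTF-8 bytes of "。", and that the broken form has a newline and a space inserted after the second byte. The helper name decodes_strictly is made up for this illustration.

```python
# The intact character: U+3002 IDEOGRAPHIC FULL STOP.
full_stop = "。".encode("utf-8")

# The octal escapes quoted in the report, \343\200\202, as raw bytes.
octal_bytes = bytes([0o343, 0o200, 0o202])

# The broken form: a newline and a space inserted after the second byte.
broken = bytes([0o343, 0o200]) + b"\n " + bytes([0o202])


def decodes_strictly(data: bytes) -> bool:
    """Return True if data is valid UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(full_stop == octal_bytes)     # True
print(decodes_strictly(full_stop))  # True
print(decodes_strictly(broken))     # False
```

The lead byte 0xE3 announces a three-byte sequence, so a strict decoder rejects the input as soon as it sees the newline where a continuation byte should be.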
When I read the mail using Gnus + shr, the text after the broken
point is all cut off. That is what libxml-parse-html-region does,
whereas xml-parse-region doesn't cut it. Moreover a web browser,
to which I send the html data using the `K H' command, shows all
the text (the broken character is shown as is, though).
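The "show everything, broken character included" behaviour described above can be imitated at the decoding level. The following Python sketch is only an analogy, not a description of what any browser, shr, or libxml does internally; the sample body is a hypothetical stand-in built from a fragment of the text quoted earlier.

```python
# Hypothetical stand-in for the mail body: text, the broken character, more text.
body = (
    "現金救助及低利貸款".encode("utf-8")
    + b"\xe3\x80\n \x82"  # the bytes of "。" split by a newline and a space
    + "因2月低溫".encode("utf-8")
)

# errors="replace" substitutes U+FFFD for undecodable bytes and carries on,
# so everything after the broken point survives.
text = body.decode("utf-8", errors="replace")

print(text.startswith("現金救助及低利貸款"))  # True
print(text.endswith("因2月低溫"))             # True
print("\ufffd" in text)                        # True
```

A lenient decoder loses only the damaged character; the complaint in this report is that the text after the damage is lost as well.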
This is not necessarily a libxml bug anyway, but I hope it works
like xml-parse.
Thanks.
In GNU Emacs 26.0.91 (build 1, x86_64-unknown-cygwin, GTK+ Version 3.22.28)
of 2018-03-12 built on localhost
Windowing system distributor 'The Cygwin/X Project', version 11.0.11906000
[example-html-mail.gz (application/x-gunzip, attachment)]
Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#30789; Package emacs. (Tue, 13 Mar 2018 00:45:02 GMT)
Message #8 received at 30789 <at> debbugs.gnu.org (full text, mbox):
Katsumi Yamaoka <yamaoka <at> jpl.org> writes:
> When I read the mail using Gnus + shr, the text after the broken
> point is all cut off. That is what libxml-parse-html-region does,
> whereas xml-parse-region doesn't cut it. Moreover a web browser,
> to which I send the html data using the `K H' command, shows all
> the text (the broken character is shown as is, though).
>
> This is not necessarily a libxml bug anyway, but I hope it works
> like xml-parse.
libxml is more strict about correctness of the input than most other
HTML parsers. I don't think there's anything we can do about this
problematic input other than ponder whether Emacs should use a different
HTML parser, which I think sounds unlikely. :-)
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#30789; Package emacs. (Tue, 13 Mar 2018 02:29:02 GMT)
Message #11 received at 30789 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Tue, 13 Mar 2018 01:44:22 +0100, Lars Ingebrigtsen wrote:
> libxml is more strict about correctness of the input than most other
> HTML parsers. I don't think there's anything we can do about this
> problematic input other than ponder whether Emacs should use a different
> HTML parser, which I think sounds unlikely. :-)
I see. I agree that libxml should not be modified. Jidanni, how about
applying the following patch locally if you often get such broken mails?
I'm not sure that it won't cause other problems, but it does fix at
least the mail in question.
[Message part 2 (text/x-patch, inline)]
--- mm-decode.el~ 2018-02-28 02:01:37.897607000 +0000
+++ mm-decode.el 2018-03-13 02:23:04.321753900 +0000
@@ -1810,6 +1810,11 @@
 (when (and (or coding
                (setq coding (mm-charset-to-coding-system charset nil t)))
            (not (eq coding 'ascii)))
+  ;; Remove extra bytes in utf-8 encoded data.
+  (when (eq coding 'utf-8)
+    (goto-char (point-min))
+    (while (re-search-forward "[\x00-\x7f]+\\([\x80-\xbf]\\)" nil t)
+      (replace-match "\\1")))
   (insert (prog1
               (decode-coding-string (buffer-string) coding)
             (erase-buffer)
Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#30789; Package emacs. (Tue, 13 Mar 2018 02:57:02 GMT)
Message #14 received at 30789 <at> debbugs.gnu.org (full text, mbox):
Expecting perfect input is OK for compilers, but not for browsers
https://blog.codinghorror.com/its-a-malformed-world/
Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#30789; Package emacs. (Tue, 13 Mar 2018 03:30:02 GMT)
Message #17 received at 30789 <at> debbugs.gnu.org (full text, mbox):
Thank you for the patch but the real answer is to do what all other
browsers do... show as much as possible.
There is no browser out there that would dream of dying on the slightest
mistake.
Anyway if you guys are really going to use XML::LibXML::Parser (?) then
maybe loosen up some of

    recover
        /parser, html, reader/

        recover from errors; possible values are 0, 1, and 2

        A true value turns on recovery mode which allows one to parse broken
        XML or HTML data. The recovery mode allows the parser to return the
        successfully parsed portion of the input document. This is useful for
        almost well-formed documents, where for example a closing tag is
        missing somewhere. Still, XML::LibXML will only parse until the first
        fatal (non-recoverable) error occurs, reporting recoverable parsing
        errors as warnings. To suppress even these warnings, use recover=>2.

        Note that validation is switched off automatically in recovery mode.

    validation
        /parser, reader/

        validate with the DTD; possible values are 0 and 1

ERROR REPORTING
    XML::LibXML throws exceptions during parsing, validation or XPath
    processing (and some other occasions). These errors can be caught by
    using eval blocks. The error is stored in $@. There are two
    implementations: the old one throws $@ which is just a message string,
    in the new one $@ is an object from the class XML::LibXML::Error; this
    class overrides the operator "" so that when printed, the object
    flattens to the usual error message.

    XML::LibXML throws errors as they occur. This is a very common
    misunderstanding in the use of XML::LibXML. If the eval is omitted,
    XML::LibXML will always halt your script by "croaking" (see Carp man
    page for details).

    Also note that an increasing number of functions throw errors if bad
    data is passed as arguments. If you cannot assure valid data passed to
    XML::LibXML you should eval these functions.

    Note: since version 1.59, get_last_error() is no longer available in
    XML::LibXML for thread-safety reasons.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#30789; Package emacs. (Tue, 13 Mar 2018 03:32:02 GMT)
Message #20 received at 30789 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Tue, 13 Mar 2018 11:28:45 +0900, Katsumi Yamaoka wrote:
> + ;; Remove extra bytes in utf-8 encoded data.
> + (when (eq coding 'utf-8)
> + (goto-char (point-min))
> + (while (re-search-forward "[\x00-\x7f]+\\([\x80-\xbf]\\)" nil t)
> + (replace-match "\\1")))
Corrected:
[Message part 2 (text/x-patch, inline)]
--- mm-decode.el~ 2018-02-28 02:01:37.897607000 +0000
+++ mm-decode.el 2018-03-13 03:27:56.885844100 +0000
@@ -1810,6 +1810,13 @@
 (when (and (or coding
                (setq coding (mm-charset-to-coding-system charset nil t)))
            (not (eq coding 'ascii)))
+  ;; Remove extra bytes in utf-8 encoded data.
+  (when (eq coding 'utf-8)
+    (goto-char (point-min))
+    (while (re-search-forward
+            "\\([\xc2-\xf7][\x80-\xbf]?\\)[\x00-\x7f]+\\([\x80-\xbf]\\)"
+            nil t)
+      (replace-match "\\1\\2")))
   (insert (prog1
               (decode-coding-string (buffer-string) coding)
             (erase-buffer)
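To see how the two patterns differ, here is a rough Python port of both regular expressions, applied to raw bytes. This is a sketch for illustration only: it uses Python's re.sub rather than the Emacs search-and-replace loop, the function names are made up, and the thread itself does not say why the first pattern was corrected. The sketch shows one observable difference, namely that the first pattern deletes any ASCII run that precedes a stray continuation byte.

```python
import re

# Byte-level ports of the two patterns from the patches (Python `re` syntax).
FIRST = re.compile(rb"[\x00-\x7f]+([\x80-\xbf])")
CORRECTED = re.compile(rb"([\xc2-\xf7][\x80-\xbf]?)[\x00-\x7f]+([\x80-\xbf])")


def repair_first(data: bytes) -> bytes:
    """Drop any ASCII run that is directly followed by a continuation byte."""
    return FIRST.sub(rb"\1", data)


def repair_corrected(data: bytes) -> bytes:
    """Drop the ASCII run only when it sits inside a multibyte sequence."""
    return CORRECTED.sub(rb"\1\2", data)


# The case from the report: "。" split by a newline and a space.
broken = "貸款".encode("utf-8") + b"\xe3\x80\n \x82" + "因2月".encode("utf-8")
print(repair_corrected(broken).decode("utf-8"))  # 貸款。因2月

# A stray continuation byte after ordinary ASCII text (hypothetical input).
stray = b"abc\x82def"
print(repair_first(stray))      # b'\x82def'
print(repair_corrected(stray))  # b'abc\x82def'
```

Both patterns repair the mail in question, but only the corrected one requires a UTF-8 lead byte before the ASCII run, so it leaves unrelated ASCII text alone.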
Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#30789; Package emacs. (Tue, 13 Mar 2018 11:29:02 GMT)
Message #23 received at 30789 <at> debbugs.gnu.org (full text, mbox):
積丹尼 Dan Jacobson <jidanni <at> jidanni.org> writes:
> There is no browser out there that would dream of dying on the slightest
> mistake.
I agree, and you should report these problems to the libxml2
maintainers.
> Anyway if you guys are really going to use XML::LibXML::Parser (?) then
> maybe loosen up some of
Our calls are as loose as they get, if I recall correctly.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#30789; Package emacs. (Tue, 13 Mar 2018 20:28:02 GMT)
Message #26 received at 30789 <at> debbugs.gnu.org (full text, mbox):
>>>>> "LI" == Lars Ingebrigtsen <larsi <at> gnus.org> writes:
LI> I agree, and you should report these problems to the libxml2
LI> maintainers.
I would not want to ruin my reputation by letting them know I was
inputting unvalidated XML and expecting whatever results.
Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#30789; Package emacs. (Tue, 13 Mar 2018 22:31:01 GMT)
Message #29 received at 30789 <at> debbugs.gnu.org (full text, mbox):
積丹尼 Dan Jacobson <jidanni <at> jidanni.org> writes:
> LI> I agree, and you should report these problems to the libxml2
> LI> maintainers.
>
> I would not want to ruin my reputation by letting them know I was
> inputting unvalidated XML and expecting whatever results.
:-)
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
Added tag(s) wontfix. Request was from Glenn Morris <rgm <at> gnu.org> to control <at> debbugs.gnu.org. (Thu, 15 Mar 2018 18:59:01 GMT)
bug closed, send any further explanations to 30789 <at> debbugs.gnu.org and Katsumi Yamaoka <yamaoka <at> jpl.org>.
Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Fri, 13 Apr 2018 22:41:01 GMT)
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 12 May 2018 11:24:07 GMT)
This bug report was last modified 7 years and 38 days ago.