GNU bug report logs - #37009
EWW Gets Confused on Invalid HTML

Package: emacs

Reported by: Nick Daly <nick.m.daly <at> gmail.com>

Date: Mon, 12 Aug 2019 04:20:01 UTC

Severity: minor

Tags: fixed

Merged with 37397

Found in version 26.2

Fixed in version 27.1

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 37009 <at> debbugs.gnu.org, Noam Postavsky <npostavs <at> gmail.com>, nick.m.daly <at> gmail.com
Subject: bug#37009: EWW Gets Confused on Invalid HTML
Date: Tue, 13 Aug 2019 11:45:22 -0700

Eli Zaretskii <eliz <at> gnu.org> writes:

>> I'm not sure how feasible it will be to fix this at all.  Eww relies on
>> libxml for parsing, and it's not as flexible as a typical web browser:
>> 
>>     (with-temp-buffer
>>       (insert "<html>
>>       <body>abc <- xyz<body>
>>     </html>")
>>       (libxml-parse-html-region (point-min) (point-max)))
>> 
>>     ;=> (html nil (body nil "abc\n"))
>
> Maybe we should report this to libxml developers and hear their
> opinion?

It would be nice if libxml2 added the standard workarounds that most
browsers use to handle invalid HTML.

But it's not that difficult to add some pre-processing to handle the
most common cases ourselves.

For instance, if what follows the < isn't a letter (or an exclamation
point), then it should probably be &lt; instead.  That would have fixed
the problem in this case, and is something I think shr should do.
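
A minimal sketch of that kind of pre-processing, for illustration only.
The function name `my-escape-stray-lt' is hypothetical and is not part
of shr or eww.  The sketch also keeps a "<" that is followed by "/",
which is an added assumption so that closing tags such as </body>
survive; the rule as stated above mentions only letters and the
exclamation point.

    ;; Hypothetical helper, not an existing shr/eww function.
    (defun my-escape-stray-lt (start end)
      "Replace each \"<\" between START and END that cannot open a tag.
    A \"<\" is left alone when it is followed by an ASCII letter, \"!\"
    or \"/\" (the \"/\" case is an assumption of this sketch).  Any
    other \"<\", including one at the very end of the region, becomes
    \"&lt;\"."
      (save-excursion
        (save-restriction
          (narrow-to-region start end)
          (goto-char (point-min))
          ;; Match "<" followed by a character that cannot start a tag
          ;; name, or "<" at the end of the narrowed region.
          (while (re-search-forward "<\\([^a-zA-Z!/]\\|\\'\\)" nil t)
            ;; Replace only the "<" itself, then resume searching right
            ;; after the inserted entity.
            (goto-char (match-beginning 0))
            (delete-char 1)
            (insert "&lt;")))))

    ;; Usage on the input from the report:
    (with-temp-buffer
      (insert "<html>\n<body>abc <- xyz<body>\n</html>")
      (my-escape-stray-lt (point-min) (point-max))
      (buffer-string))
    ;=> "<html>\n<body>abc &lt;- xyz<body>\n</html>"

Running this over the region before handing it to
libxml-parse-html-region would be one way to apply the rule; whether
shr should do it at that point or elsewhere is left open here.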

But you can go pretty far down the rabbit hole in being lenient with
invalid HTML, and I think it's probably best not to go any further down
that road.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




This bug report was last modified 5 years and 311 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.