GNU bug report logs -
#37009
EWW Gets Confused on Invalid HTML
Reported by: Nick Daly <nick.m.daly <at> gmail.com>
Date: Mon, 12 Aug 2019 04:20:01 UTC
Severity: minor
Tags: fixed
Merged with 37397
Found in version 26.2
Fixed in version 27.1
Done: Lars Ingebrigtsen <larsi <at> gnus.org>
Bug is archived. No further changes may be made.
Eli Zaretskii <eliz <at> gnu.org> writes:
>> I'm not sure how feasible it will be to fix this at all. Eww relies on
>> libxml for parsing, and it's not as flexible as a typical web browser:
>>
>> (with-temp-buffer
>>   (insert "<html>
>> <body>abc <- xyz<body>
>> </html>")
>>   (libxml-parse-html-region (point-min) (point-max)))
>>
>> ;=> (html nil (body nil "abc\n"))
>
> Maybe we should report this to libxml developers and hear their
> opinion?
If libxml2 would add the standard work-arounds that most browsers use to
handle invalid HTML, that would be nice.
But it's not that difficult to add some pre-processing to handle the
most common cases ourselves.
For instance, if what follows the < isn't a letter (or an exclamation
point), then it should probably be &lt; instead. That would have fixed
the problem in this case, and is something I think shr should do.
But you can go pretty far down the rabbit hole in being lenient with
invalid HTML, and I think it's probably best not to go any further down
that road.
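
The pre-processing rule suggested above (treat a < as literal text
unless a letter or an exclamation point follows it) could be sketched
roughly as below. This is only an illustration under stated assumptions,
not the change that was actually committed: the function name
my-escape-stray-lt is hypothetical, and the sketch additionally lets "/"
through so that closing tags such as </body> are left alone, which the
message does not mention.

```elisp
;; Hypothetical helper for illustration; not part of shr or eww.
(defun my-escape-stray-lt (start end)
  "Replace each `<' between START and END that cannot start a tag with `&lt;'."
  (save-excursion
    (save-restriction
      (narrow-to-region start end)
      (goto-char (point-min))
      ;; Match a `<' followed by anything other than an ASCII letter,
      ;; `!' (comments, doctype) or `/' (closing tags; an assumption
      ;; added here, not stated in the message).
      (while (re-search-forward "<[^A-Za-z!/]" nil t)
        ;; Rewrite only the `<' itself and keep the character after it.
        (goto-char (match-beginning 0))
        (delete-char 1)
        (insert "&lt;")))))

;; Usage on the input quoted earlier in this message:
(with-temp-buffer
  (insert "<html>\n<body>abc <- xyz<body>\n</html>")
  (my-escape-stray-lt (point-min) (point-max))
  (buffer-string))
;; => "<html>\n<body>abc &lt;- xyz<body>\n</html>"
```

A pass like this would presumably run on the buffer before
libxml-parse-html-region is called. It only covers this one class of
invalid input, in line with the suggestion not to go much further.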
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
This bug report was last modified 5 years and 311 days ago.