GNU bug report logs - #24831
shr mangling messages
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 24831 in the body.
You can then email your comments to 24831 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-gnu-emacs <at> gnu.org: bug#24831; Package emacs. (Mon, 31 Oct 2016 02:47:02 GMT)
Acknowledgement sent to 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>: New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Mon, 31 Oct 2016 02:47:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Gentlemen, the "shr" program is mangling messages.
It could remove vital words, causing arguments:
"I did include the address!" "No you didn't." "Yes I did. Your mail
reader probably cut it out."
We're talking data loss here. It may still be on the disk, but not shown
to the user.
True, the HTML might not be perfect, but at least Chromium, Firefox,
etc. show it fine.
>>>>> "KY" == Katsumi Yamaoka <yamaoka <at> jpl.org> writes:
KY> Emacs-w3m renders it as:
KY> http://w
KY> Hi, you have a new email from Catherineme
KY> [25]
KY> View your inbox at http://www.travel-buddies.com/Inbox.aspx
KY> © Travel Buddies 2015 | All rights reserved
Hmmm, w3m -dump on the attachment shows the first URL in full.
KY> However shr renders it as:
KY> Travel Buddies
KY> © Travel Buddies 2015 | All rights reserved
KY> http://www.travel-buddies.com/
KY> *
KY> The "Hi, you have a new mail" message is missing. The return
KY> value of `libxml-parse-html-region' contains the message as
KY> (h1 nil (span nil "Hi, you have a new email from") "Catherineme")
KY> (p nil "View your inbox at "
KY> (a ((href . "http://www.travel-buddies.com/Inbox.aspx"))
KY> "http://www.travel-buddies.com/Inbox.aspx"))
KY> regardless of whether all style specs are removed[1] or not
KY> (the three nil portions above are replaced with style specs if
KY> they are not removed). So I believe style specs are not the
KY> cause of a meaningful message going undisplayed in an html mail.
KY> In that case, making shr display images does not help.
KY> I think there's something wrong in shr.el, and what you should
KY> do would be to send a bug report to the Emacs bug team, i.e.,
KY> M-x report-emacs-bug, with the sample html part (I'm not so
KY> familiar with recent shr, sorry). Note that a mail containing
KY> html part might be rejected by the server, so putting it in your
KY> web site separately would be better.
KY> [1] I tested it by modifying mm-shr so as to remove style specs.
OK I'll send the message,
[SHRcutOFFmessage.gz (application/gzip, attachment)]
[Message part 3 (text/plain, inline)]
here in this bug report about In GNU Emacs 24.5.1 (i686-pc-linux-gnu,
GTK+ Version 3.21.5) of 2016-09-06 on x86-csail-01, modified by Debian.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#24831; Package emacs. (Tue, 01 Nov 2016 01:40:02 GMT)
Message #8 received at 24831 <at> debbugs.gnu.org (full text, mbox):
On Mon, 31 Oct 2016 10:45:58 +0800, Dan Jacobson wrote:
> Gentlemen, the "shr" program is mangling messages.
I found the cause of the problem that shr does not display the
"Hi, you have a new email..."
statement contained in the example message. That is, the message
has a table in which the td element is omitted or lost. Here is
a simplified html form (try `M-x shr-render-region RET' on it):
--8<---------------cut here---------------start------------->8---
<html>
<body>
<table>
  <tr>
    <!--td-->
      <table>
        <tr>
          <td>
            Hi, you have a new email
          </td>
        </tr>
      </table>
    <!--/td-->
  </tr>
</table>
</body>
</html>
--8<---------------cut here---------------end--------------->8---
> True, the HTML might not be perfect, but at least Chromium, Firefox,
> etc. show it fine.
Yes, what is bad is the html message, but shr should show it.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#24831; Package emacs. (Tue, 01 Nov 2016 10:00:02 GMT)
Message #11 received at 24831 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Tue, 01 Nov 2016 10:39:12 +0900, Katsumi Yamaoka wrote:
> I found the cause of the problem that shr does not display the
> "Hi, you have a new email..."
> statement contained in the example message. That is, the message
> has a table in which the td element is omitted or lost.
I tried fixing it. A patch is below. But I feel it somewhat
awkward, so I hope Lars or someone will review it. My patch
simply adds the missing td tag as follows:
(table nil (tr nil contents))
↓
(table nil (tr nil (td nil contents)))
Thanks.
[Message part 2 (text/x-patch, inline)]
--- shr.el~ 2016-11-01 02:35:57.788777000 +0000
+++ shr.el 2016-11-01 09:51:32.251984400 +0000
@@ -1759,6 +1759,7 @@
 ;; we then render everything again with the new widths, and finally
 ;; insert all these boxes into the main buffer.
 (defun shr-tag-table-1 (dom)
+  (shr-add-missing-td dom)
   (setq dom (or (dom-child-by-tag dom 'tbody) dom))
   (let* ((shr-inhibit-images t)
          (shr-table-depth (1+ shr-table-depth))
@@ -1787,6 +1788,19 @@
     ;; Then render the table again with these new "hard" widths.
     (shr-insert-table (shr-make-table dom sketch-widths t) sketch-widths)))
 
+(defun shr-add-missing-td (dom)
+  "Add missing td tag to table."
+  (let (tr td)
+    (dolist (elem (dom-children dom))
+      (when (eq (car-safe elem) 'tr)
+        (setq tr elem
+              td nil
+              elem (cddr elem))
+        (while (and (not td) elem)
+          (setq td (eq (car-safe (pop elem)) 'td)))
+        (unless td
+          (setcdr (cdr tr) (list (cons 'td (cons nil (cddr tr))))))))))
+
 (defun shr-table-body (dom)
   (let ((tbodies (seq-filter (lambda (child)
                                (eq (dom-tag child) 'tbody))
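
For illustration, the td-wrapping idea in this message can also be sketched without modifying the DOM in place. This is only a sketch under stated assumptions: it assumes Emacs 25 or later (for dom.el and seq.el), `my-wrap-bare-tr-children' is a hypothetical name that does not exist in shr.el, and unlike the patch above it also accepts th as an existing cell.

```elisp
(require 'dom)
(require 'seq)

(defun my-wrap-bare-tr-children (table)
  "Return a copy of TABLE in which each cell-less tr gets one td wrapper.
TABLE is a list DOM such as (table nil (tr nil ...)).
Hypothetical helper for illustration; TABLE itself is not modified."
  (cons (dom-tag table)
        (cons (dom-attributes table)
              (mapcar
               (lambda (child)
                 (if (and (eq (car-safe child) 'tr)
                          (not (seq-some
                                (lambda (c) (memq (car-safe c) '(td th)))
                                (dom-children child))))
                     ;; No td/th among the children: wrap them all in a td.
                     (list 'tr (dom-attributes child)
                           (cons 'td (cons nil (dom-children child))))
                   ;; Rows that already have a cell, and non-tr children,
                   ;; are returned unchanged.
                   child))
               (dom-children table)))))

(my-wrap-bare-tr-children '(table nil (tr nil "contents")))
;; → (table nil (tr nil (td nil "contents")))
```

Like the patch, this only looks at tr elements that are direct children of the table node it is given.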
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#24831; Package emacs. (Tue, 01 Nov 2016 10:09:01 GMT)
Message #14 received at 24831 <at> debbugs.gnu.org (full text, mbox):
Katsumi Yamaoka <yamaoka <at> jpl.org> writes:
> I tried fixing it. A patch is below. But I feel it somewhat
> awkward, so I hope Lars or someone will review it. My patch
> simply adds the missing td tag as follows:
>
> (table nil (tr nil contents))
> ↓
> (table nil (tr nil (td nil contents)))
I'm not sure I think it's worth trying to work around invalid HTML to
this extent.
In addition, other browsers do not correct "missing" TDs in this way:
Instead they typically render non-TD/TH nodes before the table, which I
think might be a better idea.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#24831; Package emacs. (Tue, 01 Nov 2016 10:16:01 GMT)
Message #17 received at 24831 <at> debbugs.gnu.org (full text, mbox):
Lars Ingebrigtsen <larsi <at> gnus.org> writes:
> I'm not sure I think it's worth trying to work around invalid HTML to
> this extent.
Besides, there's often lots of empty space text nodes interspersed,
aren't there?
<table>
<tr>
<td>
...
will have a node with "\n " before the TD node, I think? Those text
nodes are supposed to be ignored.
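
A minimal sketch of what ignoring those text nodes could look like, assuming Emacs 25 or later (for dom.el and seq.el). The names `my-blank-text-node-p' and `my-significant-children' are hypothetical and are not part of shr.el.

```elisp
(require 'dom)
(require 'seq)

(defun my-blank-text-node-p (node)
  "Return non-nil if NODE is a string made up only of whitespace."
  (and (stringp node)
       (not (string-match-p "[^ \t\n\r]" node))))

(defun my-significant-children (node)
  "Return the children of NODE, dropping whitespace-only text nodes."
  (seq-remove #'my-blank-text-node-p (dom-children node)))

(my-significant-children '(tr nil "\n  " (td nil "cell") "\n"))
;; → ((td nil "cell"))
```

Any check for "stray" content inside a tr would presumably need a filter of this kind first, so that ordinary indentation is not mistaken for missing text.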
I'd prefer just to close this bug with a WONTFIX.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#24831; Package emacs. (Tue, 01 Nov 2016 11:23:01 GMT)
Message #20 received at 24831 <at> debbugs.gnu.org (full text, mbox):
Another idea would be first run it through a validator.
If valid, proceed as before.
If invalid, just spew out all the text nodes of the whole document,
separated by spaces.
Anything is better than vital sentences going missing.
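
The fallback half of this idea (dumping every text node, separated by spaces) can be sketched as follows. It covers only the dump, not the validation step. The sketch assumes a list DOM of the shape returned by `libxml-parse-html-region', and the function names `my-dom-text-nodes' and `my-dom-plain-text' are hypothetical.

```elisp
(require 'subr-x)  ; for string-trim and string-empty-p

(defun my-dom-text-nodes (node)
  "Return a list of the trimmed, non-empty text nodes under NODE."
  (cond ((stringp node)
         (let ((text (string-trim node)))
           (unless (string-empty-p text) (list text))))
        ((and (consp node)
              ;; Skip subtrees whose text is not meant for display.
              (not (memq (car node) '(comment style script))))
         ;; Children follow the tag and the attribute alist.
         (apply #'append (mapcar #'my-dom-text-nodes (cddr node))))))

(defun my-dom-plain-text (dom)
  "Return all displayable text under DOM as one space-separated string."
  (mapconcat #'identity (my-dom-text-nodes dom) " "))

(my-dom-plain-text
 '(table nil
         (tr nil "\n"
             (comment nil "td")
             (table nil (tr nil (td nil "\nHi, you have a new email\n"))))))
;; → "Hi, you have a new email"
```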
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#24831; Package emacs. (Tue, 01 Nov 2016 11:25:02 GMT)
Message #23 received at 24831 <at> debbugs.gnu.org (full text, mbox):
At least print a warning,
*** Invalid HTML detected, some text might be missing ***
in red, which stays at the top of the message. (Not in the fleeting minibuffer.)
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#24831; Package emacs. (Tue, 01 Nov 2016 17:18:02 GMT)
Message #26 received at 24831 <at> debbugs.gnu.org (full text, mbox):
[[[ To any NSA and FBI agents reading my email: please consider ]]]
[[[ whether defending the US Constitution against all enemies, ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]
> Another idea would be first run it through a validator.
> If valid, proceed as before.
> If invalid, just spew out all the text nodes of the whole document,
> separated by spaces.
Do we have a validator in Emacs Lisp? Or would we run one as a child?
What program is available?
--
Dr Richard Stallman
President, Free Software Foundation (gnu.org, fsf.org)
Internet Hall-of-Famer (internethalloffame.org)
Skype: No way! See stallman.org/skype.html.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#24831; Package emacs. (Tue, 01 Nov 2016 18:47:01 GMT)
Message #29 received at 24831 <at> debbugs.gnu.org (full text, mbox):
Lars Ingebrigtsen <larsi <at> gnus.org> writes:
> In addition, other browsers do not correct "missing" TDs in this way:
> Instead they typically render non-TD/TH nodes before the table, which I
> think might be a better idea.
And thinking about it a bit more, I think that would perhaps be the most
likely solution for shr, too. That is, `shr-tag-table' could, at the
end there, go through and find all non-blank non-td/th elements and
insert them at the end.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#24831; Package emacs. (Wed, 02 Nov 2016 09:51:02 GMT)
Message #32 received at 24831 <at> debbugs.gnu.org (full text, mbox):
On Tue, 01 Nov 2016 19:43:23 +0100, Lars Ingebrigtsen wrote:
> Lars Ingebrigtsen <larsi <at> gnus.org> writes:
>> In addition, other browsers do not correct "missing" TDs in this way:
>> Instead they typically render non-TD/TH nodes before the table, which I
>> think might be a better idea.
> And thinking about it a bit more, I think that would perhaps be the most
> likely solution for shr, too. That is, `shr-tag-table' could, at the
> end there, go through and find all non-blank non-td/th elements and
> insert them at the end.
Thanks. I'm trying it and have not succeeded yet, though I think I
understand what I should do. The function for it should gather
only those extra elements that are part of a table tag whose
parent (or the parent's parent, ...) table has no TD/TH tag.
It's a good brain teaser. :)
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#24831; Package emacs. (Fri, 04 Nov 2016 07:20:01 GMT)
Message #38 received at 24831 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Wed, 02 Nov 2016 18:49:58 +0900, Katsumi Yamaoka wrote:
> On Tue, 01 Nov 2016 19:43:23 +0100, Lars Ingebrigtsen wrote:
>> And thinking about it a bit more, I think that would perhaps be the most
>> likely solution for shr, too. That is, `shr-tag-table' could, at the
>> end there, go through and find all non-blank non-td/th elements and
>> insert them at the end.
> Thanks. I'm trying it but not succeeded yet though,...
I did it. A patch is below. Bad things in this version I know
at least are:
・It does not support styles -- font, color, etc.
・No way to exclude text existing outside of <html>...</html>.
There are no such problems in the first version I posted. ;-)
[Message part 2 (text/x-patch, inline)]
--- shr.el~ 2016-11-01 02:35:57.788777000 +0000
+++ shr.el 2016-11-04 07:17:19.789855000 +0000
@@ -1897,11 +1897,48 @@
     (when (zerop shr-table-depth)
       (save-excursion
         (shr-expand-alignments start (point)))
+      ;; Insert also non-td/th strings excluding comments and styles.
+      (save-restriction
+        (narrow-to-region (point) (point))
+        (insert (mapconcat #'identity
+                           (shr-collect-extra-strings-in-table dom)
+                           "\n"))
+        (shr-fill-lines (point-min) (point-max)))
       (dolist (elem (dom-by-tag dom 'object))
         (shr-tag-object elem))
       (dolist (elem (dom-by-tag dom 'img))
         (shr-tag-img elem)))))
 
+(defun shr-collect-extra-strings-in-table (dom &optional flags)
+  "Return extra strings in DOM of which the root is a table clause.
+FLAGS is a cons of two flags that control whether to collect strings."
+  ;; If and only if the cdr is not set, the car will be set to t when
+  ;; a <td> or a <th> clause is found in the children of DOM, and reset
+  ;; to nil when a <table> clause is found in the children of DOM.
+  ;; The cdr will be set to t when a <table> clause is found if the car
+  ;; is not set then, and will never be reset.
+  ;; This function collects strings if the car of FLAGS is not set.
+  (unless flags (setq flags (cons nil nil)))
+  (cl-loop for child in (dom-children dom)
+           if (stringp child)
+             when (and (not (car flags))
+                       (string-match "\\(?:[^\t\n\r ]+[\t\n\r ]+\\)*[^\t\n\r ]+"
+                                     child))
+               collect (match-string 0 child)
+             end
+           else
+             unless (let ((tag (dom-tag child)))
+                      (or (memq tag '(comment style))
+                          (progn
+                            (cond ((memq tag '(td th))
+                                   (unless (cdr flags) (setcar flags t)))
+                                  ((eq tag 'table)
+                                   (if (car flags)
+                                       (unless (cdr flags) (setcar flags nil))
+                                     (setcdr flags t))))
+                            nil)))
+               append (shr-collect-extra-strings-in-table child flags)))
+
 (defun shr-insert-table (table widths)
   (let* ((collapse (equal (cdr (assq 'border-collapse shr-stylesheet))
                           "collapse"))
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#24831; Package emacs. (Fri, 04 Nov 2016 07:20:02 GMT)
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#24831; Package emacs. (Fri, 04 Nov 2016 08:55:01 GMT)
Message #44 received at 24831 <at> debbugs.gnu.org (full text, mbox):
Katsumi Yamaoka <yamaoka <at> jpl.org> writes:
> I did it. A patch is below.
Great! Looks good to me.
> Bad things in this version I know at least are:
>
> ・It does not support styles -- font, color, etc.
I don't think that matters very much. The HTML is invalid.
> ・No way to exclude text existing outside of <html>...</html>.
Hm... I don't quite follow...
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#24831; Package emacs. (Fri, 04 Nov 2016 10:29:02 GMT)
Message #47 received at 24831 <at> debbugs.gnu.org (full text, mbox):
On Fri, 04 Nov 2016 09:51:52 +0100, Lars Ingebrigtsen wrote:
> Katsumi Yamaoka <yamaoka <at> jpl.org> writes:
>> I did it. A patch is below.
> Great! Looks good to me.
Thanks! I'll commit it to master.
[...]
>> ・No way to exclude text existing outside of <html>...</html>.
> Hm... I don't quite follow...
I found it in some mails from amazon.co.jp, but not so many and
not so annoying. Here it is:
<html>
...
</html> --MuLtIpArT_BoUnDaRy--
Well, is this a reasonable operation?
(with-temp-buffer
(insert "<html><body>Foo</body></html>Bar")
(libxml-parse-html-region (point-min) (point-max)))
=> (html nil (body nil "Foo") (html nil (p nil "Bar")))
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#24831; Package emacs. (Fri, 04 Nov 2016 11:20:02 GMT)
Message #53 received at 24831 <at> debbugs.gnu.org (full text, mbox):
Katsumi Yamaoka <yamaoka <at> jpl.org> writes:
> I found it in some mails from amazon.co.jp, but not so many and
> not so annoying. Here it is:
>
> <html>
> ...
> </html> --MuLtIpArT_BoUnDaRy--
Oh, right...
> Well, is this a reasonable operation?
>
> (with-temp-buffer
> (insert "<html><body>Foo</body></html>Bar")
> (libxml-parse-html-region (point-min) (point-max)))
> => (html nil (body nil "Foo") (html nil (p nil "Bar")))
Yes, it's two <html> elements after each other. In HTML, the <html>
start (and end) tags are optional.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#24831; Package emacs. (Fri, 04 Nov 2016 18:19:01 GMT)
Message #56 received at 24831 <at> debbugs.gnu.org (full text, mbox):
On Tue, 01 Nov 2016 13:16:52 -0400 Richard Stallman <rms <at> gnu.org> wrote:
>> Another idea would be first run it through a validator.
>> If valid, proceed as before.
>> If invalid, just spew out all the text nodes of the whole document,
>> separated by spaces.
RS> Do we have a validator in Emacs Lisp? Or would we run one as a child?
RS> What program is available?
IMHO validation is not a workable solution, both because of complexity
and because real-world HTML authors are incredibly skilled at writing
broken HTML that somehow renders in the browsers they support.
Ted
Reply sent to Katsumi Yamaoka <yamaoka <at> jpl.org>: You have taken responsibility. (Sun, 06 Nov 2016 23:33:02 GMT)
Notification sent to 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>: bug acknowledged by developer. (Sun, 06 Nov 2016 23:33:02 GMT)
Message #61 received at 24831-done <at> debbugs.gnu.org (full text, mbox):
On Fri, 04 Nov 2016 12:17:18 +0100, Lars Ingebrigtsen wrote:
> Katsumi Yamaoka <yamaoka <at> jpl.org> writes:
>> Well, is this a reasonable operation?
>>
>> (with-temp-buffer
>> (insert "<html><body>Foo</body></html>Bar")
>> (libxml-parse-html-region (point-min) (point-max)))
>> => (html nil (body nil "Foo") (html nil (p nil "Bar")))
> Yes, it's two <html> elements after each other. In HTML, the <html>
> start (and end) tags are optional.
I see. But I'm sorry for my confusion; that extra text appearing
is not due to my change. So, I'm closing this bug. Thanks.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#24831; Package emacs. (Sun, 06 Nov 2016 23:33:03 GMT)
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 05 Dec 2016 12:24:03 GMT)