GNU bug report logs - #4950
`xml-parse-file' returns incorrect results strings after `>' before `<' when CR\LF TAB+

Package: emacs;

Reported by: MON KEY <monkey <at> sandpframing.com>

Date: Tue, 17 Nov 2009 22:20:03 UTC

Severity: normal

Tags: notabug

Done: Chong Yidong <cyd <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 4950 in the body.
You can then email your comments to 4950 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#4950; Package emacs. (Tue, 17 Nov 2009 22:20:03 GMT)

Acknowledgement sent to MON KEY <monkey <at> sandpframing.com>:
New bug report received and forwarded. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Tue, 17 Nov 2009 22:20:04 GMT)

Message #5 received at submit <at> emacsbugs.donarmstrong.com:

From: MON KEY <monkey <at> sandpframing.com>
To: bug-gnu-emacs <at> gnu.org
Subject: `xml-parse-file' returns incorrect results strings after `>' before 
	`<' when CR\LF TAB+
Date: Tue, 17 Nov 2009 17:12:37 -0500

`xml-parse-file' returns incorrect results strings after `>' before
`<' when CR\LF TAB+

`xml-parse-file' fails to return correct results when there is a C-j
(i.e. CR\LF) followed by \t+ (i.e. TAB+) after a tag's trailing `>' and
before the next tag's leading `<'. IOW the following:

,----
| <ELEMENT attr1="a1" attr2="a2" attr3="a3" attr4="a4" attr5="a5">CR\LF
| TAB TAB TAB <NEXT-NODE>
`----

Returns (:NOTE with my pp-ing to help clarify the problem):

,----
| (ELEMENT nil
|          ((attr1 . "a1")
|           (attr2 . "a2")
|           (attr3 . "a3")
|           (attr4 . "a4")
|           (attr5 . "a5") "
|             " ;; <-i.e. (mapconcat #'char-to-string '(32 10 9 9 9) "")
|           (NEXT-NODE nil (...
`----
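
The result can be reproduced without a file on disk. The following is a
minimal sketch, assuming `xml-parse-region' from xml.el is run on a temporary
buffer; the helper `my-xml-parse-string', the variable `my-xml-sample' and the
sample markup are made up for illustration and are not part of xml.el or of
the original report:

```elisp
(require 'xml)

(defun my-xml-parse-string (xml-text)
  "Parse XML-TEXT with `xml-parse-region' and return the top-level nodes.
Hypothetical helper for illustration; not part of xml.el."
  (with-temp-buffer
    (insert xml-text)
    (xml-parse-region (point-min) (point-max))))

;; Assumed sample: a start tag followed by a newline and three tabs
;; before the next start tag.
(defvar my-xml-sample
  "<ELEMENT attr1=\"a1\">\n\t\t\t<NEXT-NODE>text</NEXT-NODE></ELEMENT>")

;; The character data between `>' and `<' is kept as a string child of
;; ELEMENT, placed ahead of the NEXT-NODE child.
(car (xml-node-children (car (my-xml-parse-string my-xml-sample))))
```

The first child should be the whitespace string itself, and the NEXT-NODE
element should follow it as the second child.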

Is it fair/safe to assume that where these types of sequences occur they are
not part of the XML and can be removed with a regexp? E.g.:

,----
| (while (search-forward-regexp "\"\)\n[\[:blank:]]+\"\)" nil t)
|        (replace-match ""))
`----

or perhaps:

,----
| (defun cln-xml<-parsed (fname &optional insertp intrp)
|   "Strip non-sensical strings created by xml-parse-file because of
| CR\LF TAB+ following tags/elements.
| FNAME is an XML filename path to parse and clean.
| When INSERTP is non-nil or called-interactively insert pretty printed lisp
| representation of XML file at point. Does not move point."
|   (interactive "fXML file to parse: \ni\np")
|   (let (get-xml)
|     (setq get-xml
|           (with-temp-buffer
|             (prin1 (xml-parse-file fname) (current-buffer))
|             (goto-char (point-min))
|             (while (search-forward-regexp
|                     "\\( \"\n[\[:blank:]]+\\)\"\\(\\(\\()\\)\\|\\( (\\)\\)\\)" nil t)
|                    ;;^^1^^^^^^^^^^^^^^^^^^^^^^^^^2^^3^^^^^^^^^^^^4^^^^^^^^^^^^
|             (replace-match "\\2"))
|             (pp-buffer)
|             (buffer-substring-no-properties (point-min) (point-max))))
|     (if (or insertp intrp)
|         (save-excursion
|           (newline)
|           (princ get-xml (current-buffer)))
|         get-xml)))
`----

:SEE-ALSO
(URL `http://lists.gnu.org/archive/html/bug-gnu-emacs/2001-11/msg00052.html')

s_P




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#4950; Package emacs. (Sun, 01 Jul 2012 11:28:01 GMT)

Message #8 received at submit <at> debbugs.gnu.org:

From: Chong Yidong <cyd <at> gnu.org>
To: MON KEY <monkey <at> sandpframing.com>
Cc: 4950 <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org
Subject: Re: bug#4950: `xml-parse-file' returns incorrect results strings
	after `>' before `<' when CR\LF TAB+
Date: Sun, 01 Jul 2012 19:22:33 +0800

MON KEY <monkey <at> sandpframing.com> writes:

> <ELEMENT attr1="a1" attr2="a2" attr3="a3" attr4="a4" attr5="a5">CR\LF
> TAB TAB TAB <NEXT-NODE>
>
> Returns (:NOTE with my pp-ing to help clarify the problem):
>
> (ELEMENT nil
>          ((attr1 . "a1")
>           (attr2 . "a2")
>           (attr3 . "a3")
>           (attr4 . "a4")
>           (attr5 . "a5") "
>             " ;; <-i.e. (mapconcat #'char-to-string '(32 10 9 9 9) "")
>           (NEXT-NODE nil (...
>
> Is it fair/safe to assume that where these types of sequences occur
> they are not part of the XML and can be removed with a regexp?

No.

XML 1.0 Recommendation, Section 2.10 White Space Handling:

"An XML processor MUST always pass all characters in a document that are
not markup through to the application."
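
Since the parser has to hand the whitespace over, any decision to discard it
belongs to the application. A minimal sketch of doing that on the parsed tree
rather than with a regexp over its printed form is given here; the function
name `my-xml-strip-blank-text' is hypothetical, and the sketch assumes the
node layout used by xml.el, (NAME ATTRIBUTES CHILD...):

```elisp
(require 'xml)

(defun my-xml-strip-blank-text (node)
  "Return a copy of NODE with whitespace-only text children removed.
NODE is a node as returned by xml.el: (NAME ATTRIBUTES CHILD...).
Hypothetical helper for illustration; not part of xml.el."
  (if (stringp node)
      node
    (append
     (list (xml-node-name node) (xml-node-attributes node))
     (delq nil
           (mapcar (lambda (child)
                     (cond
                      ;; Element child: clean it recursively.
                      ((not (stringp child))
                       (my-xml-strip-blank-text child))
                      ;; Text made only of spaces, tabs, CR or LF: drop it.
                      ((string-match-p "\\`[ \t\r\n]*\\'" child) nil)
                      ;; Any other text is kept unchanged.
                      (t child)))
                   (xml-node-children node))))))

;; Usage on a hand-written node shaped like the one in the report:
(my-xml-strip-blank-text
 '(ELEMENT ((attr1 . "a1")) " \n\t\t\t" (NEXT-NODE nil "text")))
```

This is only appropriate when the document format treats such whitespace as
insignificant. In mixed content, or under xml:space="preserve", the dropped
strings may carry meaning.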




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#4950; Package emacs. (Sun, 01 Jul 2012 11:28:02 GMT)

Added tag(s) notabug. Request was from Chong Yidong <cyd <at> gnu.org> to control <at> debbugs.gnu.org. (Sun, 01 Jul 2012 11:30:02 GMT)

bug closed, send any further explanations to 4950 <at> debbugs.gnu.org and MON KEY <monkey <at> sandpframing.com> Request was from Chong Yidong <cyd <at> gnu.org> to control <at> debbugs.gnu.org. (Sun, 01 Jul 2012 11:30:03 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 30 Jul 2012 11:24:05 GMT)

This bug report was last modified 13 years and 20 days ago.

