GNU bug report logs -
#75998
[guile-lib] html->sxml does not decode entities in attributes
Reported by: Tomas Volf <~@wolfsden.cz>
Date: Sat, 1 Feb 2025 20:11:01 UTC
Severity: normal
Done: Tomas Volf <~@wolfsden.cz>
Bug is archived. No further changes may be made.
Forwarded to oleg@okmij.org
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 75998 in the body.
You can then email your comments to 75998 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-guile <at> gnu.org: bug#75998; Package guile. (Sat, 01 Feb 2025 20:11:02 GMT)
Acknowledgement sent to Tomas Volf <~@wolfsden.cz>: New bug report received and forwarded. Copy sent to bug-guile <at> gnu.org. (Sat, 01 Feb 2025 20:11:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hello,
I think I found a bug in the htmlprag module in guile-lib. When parsing
attributes, the values are not properly decoded:
--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,use (htmlprag)
scheme@(guile-user)> (html->sxml "<hr aaa=\"bbb&quot;ccc&apos;ddd\" />")
$1 = (*TOP* (hr (@ (aaa "bbb&quot;ccc&apos;ddd"))))
scheme@(guile-user)> (html->sxml "<a href=\"a&amp;b\" />")
$2 = (*TOP* (a (@ (href "a&amp;b"))))
--8<---------------cut here---------------end--------------->8---
I think that $1 should be "bbb\"ccc'ddd" and $2 should be "a&b".
The annoying part is that this cannot really be changed now, because
people (me included) already have workarounds in place, and
automatically decoding now would lead to double decoding.
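For illustration, a workaround of this kind might look like the sketch below. It is not code from guile-lib: `decode-entities` and `decode-attributes` are hypothetical helper names, only the five predefined entities and numeric character references are handled, and the pass runs on an SXML tree that html->sxml has already produced. The last two expressions show the double-decoding hazard.

```scheme
(use-modules (ice-9 match)
             (ice-9 regex))

;; Hypothetical table: only the five predefined XML entities.
(define %named-entities
  '(("amp" . "&") ("lt" . "<") ("gt" . ">") ("quot" . "\"") ("apos" . "'")))

;; Matches references such as &name;, &#123; and &#x7B;.
(define %entity-rx
  (make-regexp "&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);"))

(define (decode-match m)
  ;; Turn one regexp match into its replacement string.
  (let ((name (match:substring m 1)))
    (cond
     ((assoc-ref %named-entities name) => identity)
     ((string-prefix-ci? "#x" name)
      (string (integer->char (string->number (substring name 2) 16))))
     ((string-prefix? "#" name)
      (string (integer->char (string->number (substring name 1) 10))))
     ;; Unknown entity: leave the text untouched.
     (else (match:substring m 0)))))

(define (decode-entities str)
  ;; Single pass, so "&amp;quot;" becomes "&quot;" rather than "\"".
  (regexp-substitute/global #f %entity-rx str 'pre decode-match 'post))

(define (decode-attributes tree)
  ;; Decode every string attribute value in an SXML tree.
  (match tree
    (('@ attrs ...)
     (cons '@
           (map (match-lambda
                  ((name (? string? value))
                   (list name (decode-entities value)))
                  (other other))
                attrs)))
    ((? list?) (map decode-attributes tree))
    (_ tree)))

(decode-attributes '(*TOP* (a (@ (href "a&amp;b")))))
;; => (*TOP* (a (@ (href "a&b"))))

;; Double decoding: a second pass changes the value again.
(decode-entities "&amp;quot;")                    ; => "&quot;"
(decode-entities (decode-entities "&amp;quot;"))  ; => "\""
```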
I see a few ways forward:
1. Document the current behavior and keep it as it is.
2. Add argument #:decode-attributes, defaulting to #f, to the relevant
procedures, so that people can opt into the fixed behavior.
3. Introduce parameter %decode-attributes, so that people can opt into
the fixed behavior.
I am sure there are also other approaches possible.
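Options 2 and 3 could also be combined, roughly as in the sketch below. Everything here is hypothetical: `parse-with-options` is a made-up wrapper, its `parse` and `decode` arguments stand in for html->sxml and for some attribute-decoding pass, and `fake-parse` and `fake-decode` exist only so the sketch runs without guile-lib. The real procedures would presumably take the keyword directly.

```scheme
;; Option 3: a parameter holding the default; #f keeps today's behavior.
(define %decode-attributes? (make-parameter #f))

;; Option 2: a keyword argument whose default comes from the parameter.
;; PARSE stands in for html->sxml and DECODE for a decoding pass.
(define* (parse-with-options parse decode input
                             #:key (decode-attributes? (%decode-attributes?)))
  (let ((tree (parse input)))
    (if decode-attributes? (decode tree) tree)))

;; Stand-ins for demonstration only.
(define (fake-parse s) (list '*TOP* s))
(define (fake-decode tree) (list 'decoded tree))

;; Default: the tree is returned unchanged.
(parse-with-options fake-parse fake-decode "x")
;; Opt in for one call.
(parse-with-options fake-parse fake-decode "x" #:decode-attributes? #t)
;; Opt in for a dynamic extent.
(parameterize ((%decode-attributes? #t))
  (parse-with-options fake-parse fake-decode "x"))
```

With a default of #f, existing callers that already decode attributes themselves would see no change.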
Have a nice day,
Tomas
--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
Information forwarded to bug-guile <at> gnu.org: bug#75998; Package guile. (Sun, 02 Feb 2025 06:48:02 GMT)
Message #8 received at 75998 <at> debbugs.gnu.org (full text, mbox):
On Sat, Feb 01, 2025 at 09:10:04PM +0100, Tomas Volf wrote:
>
> Hello,
>
> I think I found a bug in the htmlprag module in guile-lib. When parsing
> attributes, the values are not properly decoded:
>
> --8<---------------cut here---------------start------------->8---
> scheme@(guile-user)> ,use (htmlprag)
> scheme@(guile-user)> (html->sxml "<hr aaa=\"bbb&quot;ccc&apos;ddd\" />")
> $1 = (*TOP* (hr (@ (aaa "bbb&quot;ccc&apos;ddd"))))
> scheme@(guile-user)> (html->sxml "<a href=\"a&amp;b\" />")
> $2 = (*TOP* (a (@ (href "a&amp;b"))))
> --8<---------------cut here---------------end--------------->8---
>
> I think that $1 should be "bbb\"ccc'ddd" and $2 should be "a&b".
Ouch. Have you contacted Oleg Kiselyov about it? He's usually pretty
responsive and very friendly.
> The annoying part is that this cannot really be changed now, because
> people (me included) already have workarounds in place, and
> automatically decoding now would lead to double decoding.
>
> I see few ways forward:
>
> 1. Document the current behavior and keep it as it is.
> 2. Add argument #:decode-attributes, defaulting to #f, to the relevant
> procedures, so that people can opt into the fixed behavior.
> 3. Introduce parameter %decode-attributes, so that people can opt into
> the fixed behavior.
>
> I am sure there are also other approaches possible.
If it were me, I'd take 2.
Cheers
--
tomás
Information forwarded to bug-guile <at> gnu.org: bug#75998; Package guile. (Sun, 02 Feb 2025 09:58:02 GMT)
Message #11 received at 75998 <at> debbugs.gnu.org (full text, mbox):
<tomas <at> tuxteam.de> writes:
> On Sat, Feb 01, 2025 at 09:10:04PM +0100, Tomas Volf wrote:
>>
>> Hello,
>>
>> I think I found a bug in the htmlprag module in guile-lib. When parsing
>> attributes, the values are not properly decoded:
>>
>> --8<---------------cut here---------------start------------->8---
>> scheme@(guile-user)> ,use (htmlprag)
>> scheme@(guile-user)> (html->sxml "<hr aaa=\"bbb&quot;ccc&apos;ddd\" />")
>> $1 = (*TOP* (hr (@ (aaa "bbb&quot;ccc&apos;ddd"))))
>> scheme@(guile-user)> (html->sxml "<a href=\"a&amp;b\" />")
>> $2 = (*TOP* (a (@ (href "a&amp;b"))))
>> --8<---------------cut here---------------end--------------->8---
>>
>> I think that $1 should be "bbb\"ccc'ddd" and $2 should be "a&b".
>
> Ouch. Have you contacted Oleg Kiselyov about it? He's usually pretty
> responsive and very friendly.
I did not. I did not find a "how to report bugs" section on guile-lib's
website, and on the (htmlprag) documentation section Oleg Kiselyov is
mentioned only in one sentence as a "Thanks".
I think I have managed to find his email in one Haskell paper of his, so
I will CC him on the bug report, as suggested.
Thanks,
Tomas
--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
Set bug forwarded-to-address to 'oleg <at> okmij.org'. Request was from Tomas Volf <~@wolfsden.cz> to control <at> debbugs.gnu.org. (Sun, 02 Feb 2025 09:59:02 GMT)
Information forwarded to bug-guile <at> gnu.org: bug#75998; Package guile. (Sun, 02 Feb 2025 21:49:01 GMT)
Message #16 received at 75998 <at> debbugs.gnu.org (full text, mbox):
Hello Thomas,
> I did not. I did not find a "how to report bugs" section on
> guile-lib's website
HACKING
INSTALL
NEWS
README http://git.savannah.nongnu.org/cgit/guile-lib.git/tree/README
all do mention, in their header [HACKING as an example]:
Guile-Lib - HACKING
===========================================
Please send Guile-Lib bug reports to
guile-devel <at> gnu.org
I'd recommend closing this bug report here, saying 'not a guile bug', and
reposting on guile-devel.
> and on the (htmlprag) documentation section Oleg Kiselyov is
> mentioned only in one sentence as a "Thanks". I think I have managed
> to find his email in one Haskell paper of his, so I will CC him on
> the bug report, as suggested.
Note that the version in guile-lib has been patched
'recently', see commit 84c420769, which I pushed on behalf of Maxim Cournoyer
<maxim.cournoyer <at> gmail.com>, who's the actual guile-lib maintainer.
David
Information forwarded to bug-guile <at> gnu.org: bug#75998; Package guile. (Mon, 03 Feb 2025 14:32:01 GMT)
Message #19 received at 75998 <at> debbugs.gnu.org (full text, mbox):
Hi Tomas,
Thank you for reporting this issue.
Tomas Volf <~@wolfsden.cz> writes:
> <tomas <at> tuxteam.de> writes:
>
>> On Sat, Feb 01, 2025 at 09:10:04PM +0100, Tomas Volf wrote:
>>>
>>> Hello,
>>>
>>> I think I found a bug in the htmlprag module in guile-lib. When parsing
>>> attributes, the values are not properly decoded:
>>>
>>> --8<---------------cut here---------------start------------->8---
>>> scheme@(guile-user)> ,use (htmlprag)
>>> scheme@(guile-user)> (html->sxml "<hr aaa=\"bbb&quot;ccc&apos;ddd\" />")
>>> $1 = (*TOP* (hr (@ (aaa "bbb&quot;ccc&apos;ddd"))))
>>> scheme@(guile-user)> (html->sxml "<a href=\"a&amp;b\" />")
>>> $2 = (*TOP* (a (@ (href "a&amp;b"))))
>>> --8<---------------cut here---------------end--------------->8---
>>>
>>> I think that $1 should be "bbb\"ccc'ddd" and $2 should be "a&b".
>>
>> Ouch. Have you contacted Oleg Kiselyov about it? He's usually pretty
>> responsive and very friendly.
>
> I did not. I did not find a "how to report bugs" section on guile-lib's
> website, and on the (htmlprag) documentation section Oleg Kiselyov is
> mentioned only in one sentence as a "Thanks".
>
> I think I have managed to find his email in one Haskell paper of his, so
> I will CC him on the bug report, as suggested.
And also for contacting Oleg. I hope they can provide us with their
opinion on whether this is an actual bug or was designed that way. To
me, it's not clear whether html->sxml should alter the raw value of
attributes in any way. Users may have different use cases requiring them
to apply different transformations themselves? If we hard-code a decoding
scheme ourselves, then we force that choice onto users, no?
--
Thanks,
Maxim
Information forwarded to bug-guile <at> gnu.org: bug#75998; Package guile. (Tue, 04 Feb 2025 20:56:02 GMT)
Message #22 received at 75998 <at> debbugs.gnu.org (full text, mbox):
David Pirotte <david <at> altosw.be> writes:
> HACKING
> INSTALL
> NEWS
> README http://git.savannah.nongnu.org/cgit/guile-lib.git/tree/README
>
> all do mention, in their header [HACKING as an example]:
>
> Guile-Lib - HACKING
> ===========================================
>
> Please send Guile-Lib bug reports to
> guile-devel <at> gnu.org
>
> I'd recommend to close this bug report here saying 'not a guile bug' and
> repost on guile-devel.
Ah, I see. I admit I was checking only the website, and then I asked on
IRC. Will re-post on guile-devel as instructed.
Tomas
--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
bug closed, send any further explanations to 75998 <at> debbugs.gnu.org and Tomas Volf <~@wolfsden.cz>. Request was from Tomas Volf <~@wolfsden.cz> to control <at> debbugs.gnu.org. (Tue, 04 Feb 2025 20:57:02 GMT)
Information forwarded to bug-guile <at> gnu.org: bug#75998; Package guile. (Tue, 04 Feb 2025 21:16:02 GMT)
Message #27 received at 75998 <at> debbugs.gnu.org (full text, mbox):
Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
> Hi Tomas,
>
> Thank you for reporting this issue.
>
> Tomas Volf <~@wolfsden.cz> writes:
>
>> <tomas <at> tuxteam.de> writes:
>>
>>> On Sat, Feb 01, 2025 at 09:10:04PM +0100, Tomas Volf wrote:
>>>>
>>>> Hello,
>>>>
>>>> I think I found a bug in the htmlprag module in guile-lib. When parsing
>>>> attributes, the values are not properly decoded:
>>>>
>>>> --8<---------------cut here---------------start------------->8---
>>>> scheme@(guile-user)> ,use (htmlprag)
>>>> scheme@(guile-user)> (html->sxml "<hr aaa=\"bbb&quot;ccc&apos;ddd\" />")
>>>> $1 = (*TOP* (hr (@ (aaa "bbb&quot;ccc&apos;ddd"))))
>>>> scheme@(guile-user)> (html->sxml "<a href=\"a&amp;b\" />")
>>>> $2 = (*TOP* (a (@ (href "a&amp;b"))))
>>>> --8<---------------cut here---------------end--------------->8---
>>>>
>>>> I think that $1 should be "bbb\"ccc'ddd" and $2 should be "a&b".
>>>
>>> Ouch. Have you contacted Oleg Kiselyov about it? He's usually pretty
>>> responsive and very friendly.
>>
>> I did not. I did not find a "how to report bugs" section on guile-lib's
>> website, and on the (htmlprag) documentation section Oleg Kiselyov is
>> mentioned only in one sentence as a "Thanks".
>>
>> I think I have managed to find his email in one Haskell paper of his, so
>> I will CC him on the bug report, as suggested.
>
> And also for contacting Oleg. I hope they can provide us with their
> opinion on whether this is an actual bug or was designed that way. To
> me, it's not clear whether html->sxml should alter the raw value of
> attributes in any way.
It already modifies the raw value for regular HTML text:
--8<---------------cut here---------------start------------->8---
scheme@(htmlprag)> (html->sxml "a&amp;b")
$10 = (*TOP* "a&b")
scheme@(htmlprag)> (sxml->html '(*TOP* "a&b"))
$13 = "a&amp;b"
--8<---------------cut here---------------end--------------->8---
I now noticed this also affects encoding:
--8<---------------cut here---------------start------------->8---
scheme@(htmlprag)> (sxml->html '(*TOP* (a (@ (href "a&b")))))
$12 = "<a href=\"a&b\"></a>"
--8<---------------cut here---------------end--------------->8---
I am not sure why attributes should be special here.
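Encoding an attribute value on output could mean escaping the characters that are significant inside a double-quoted attribute. The sketch below is an assumption about what such a pass might do, not htmlprag code, and `encode-attribute-value` is a hypothetical name.

```scheme
;; Hypothetical helper: escape &, <, > and " for use inside a
;; double-quoted HTML attribute value.
(define (encode-attribute-value str)
  (string-concatenate
   (map (lambda (c)
          (case c
            ((#\&) "&amp;")
            ((#\<) "&lt;")
            ((#\>) "&gt;")
            ((#\") "&quot;")
            (else (string c))))
        (string->list str))))

(encode-attribute-value "a&b")   ; => "a&amp;b"
```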
For what it is worth, (sxml simple) itself decodes even attributes:
--8<---------------cut here---------------start------------->8---
scheme@(htmlprag)> (xml->sxml "<a href=\"a&amp;b\"></a>")
$11 = (*TOP* (a (@ (href "a&b"))))
--8<---------------cut here---------------end--------------->8---
For comparison, Firefox seems to decode the attributes as well even in
HTML. That is actually how I discovered this issue, links I extracted
from <a href=".."> using html->sxml were not working until I ran a
decoding pass on them.
> Users may have different use cases requiring to apply different
> transformation themselves?
I agree in the abstract, but do you have any specific use case in mind
where you would want to use the raw content of attributes (especially
since you already cannot get the raw content of text nodes)?
> If we hard-code a decoding scheme ourselves, then force that choice
> onto users, no?
I agree we cannot hard-code or change it now due to compatibility
concerns, but adding #:decode-attributes to html->sxml,
#:encode-attributes to sxml->html and possibly %deencode-attributes?
parameter, in the spirit of %strict-tokenizer? would seem reasonable.
Tomas
--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
Information forwarded to bug-guile <at> gnu.org: bug#75998; Package guile. (Thu, 06 Feb 2025 14:39:01 GMT)
Message #30 received at 75998 <at> debbugs.gnu.org (full text, mbox):
Hi Tomas,
[...]
> It already modifies the raw value for regular HTML text:
>
> scheme@(htmlprag)> (html->sxml "a&amp;b")
> $10 = (*TOP* "a&b")
> scheme@(htmlprag)> (sxml->html '(*TOP* "a&b"))
> $13 = "a&amp;b"
>
>
> I now noticed this also affects encoding:
>
> scheme@(htmlprag)> (sxml->html '(*TOP* (a (@ (href "a&b")))))
> $12 = "<a href=\"a&b\"></a>"
>
>
> I am not sure why attributes should be special here.
>
> For what it is worth, (sxml simple) itself decodes even attributes:
>
> scheme@(htmlprag)> (xml->sxml "<a href=\"a&amp;b\"></a>")
> $11 = (*TOP* (a (@ (href "a&b"))))
>
> For comparison, Firefox seems to decode the attributes as well even in
> HTML. That is actually how I discovered this issue, links I extracted
> from <a href=".."> using html->sxml were not working until I ran a
> decoding pass on them.
Good points. Thanks for these.
>> Users may have different use cases requiring to apply different
>> transformation themselves?
>
> I agree in the abstract, but do you have any specific use case in mind
> when you would want to use the raw content of attributes (especially
> since you already cannot get raw content of text nodes).
>> If we hard-code a decoding scheme ourselves, then force that choice
>> onto users, no?
>
> I agree we cannot hard-code or change it now due to compatibility
> concerns, but adding #:decode-attributes to html->sxml,
> #:encode-attributes to sxml->html and possibly %deencode-attributes?
> parameter, in the spirit of %strict-tokenizer? would seem reasonable.
I see this situation and %strict-tokenizer as a bit different; the
htmlprag module was designed to be lenient, so being lenient could not
really be considered a bug :-). But this here could well be considered
a bug. So perhaps something we could do is fix this correctly, and bump
at least the minor digit in our version (we're still in an unstable 0
version (last one was 0.2.8.1), so technically we don't promise
stability yet (perhaps never, as this guile-lib project aims to be a lab
for components that could later be included in Guile)). But we should
communicate this change well in the NEWS file.
--
Thanks,
Maxim
Information forwarded to bug-guile <at> gnu.org: bug#75998; Package guile. (Thu, 06 Feb 2025 22:36:02 GMT)
Message #33 received at 75998 <at> debbugs.gnu.org (full text, mbox):
Hi Maxim,
Thomas,
> But this here could well be considered a bug. So perhaps something
> we could do is fix this correctly, and bump at least the minor digit
> in our version (we're still in an unstable 0 version (last one was
> 0.2.8.1), so technically we don't promise stability yet (perhaps
> never, as this guile-lib project aims to be a lab for components that
> could later be included in Guile). But we should communicate this
> change well in the NEWS file.
1+ for
a proper fix
bump the version to 0.3.0
well written NEWS entry(ies)
clearly state that the htmlprag module was fixed, in a
way that users who locally applied their own
workaround to the fixed problem/bug will have to review
their code and adapt to this new version ...
David
Information forwarded to bug-guile <at> gnu.org: bug#75998; Package guile. (Fri, 07 Feb 2025 12:48:02 GMT)
Message #36 received at 75998 <at> debbugs.gnu.org (full text, mbox):
Hi,
David Pirotte <david <at> altosw.be> writes:
> Hi Maxim,
> Thomas,
>
>> But this here could well be considered a bug. So perhaps something
>> we could do is fix this correctly, and bump at least the minor digit
>> in our version (we're still in an unstable 0 version (last one was
>> 0.2.8.1), so technically we don't promise stability yet (perhaps
>> never, as this guile-lib project aims to be a lab for components that
>> could later be included in Guile). But we should communicate this
>> change well in the NEWS file.
>
> 1+ for
>
> a proper fix
> bump the version to 0.3.0
> well written NEWS entry(ies)
> clearly state that the htmlprag module was fixed, in a
> way that users who locally applied their own work
> around to the fixed problem/bug will have to review
> their code and adpat to this new version ...
Thanks for weighing in.
Tomas, is it a fix you'd be interested in contributing? Otherwise, I'll
get to it but my hands are rather full at the moment :-).
--
Thanks,
Maxim
Information forwarded to bug-guile <at> gnu.org: bug#75998; Package guile. (Sun, 09 Feb 2025 11:52:02 GMT)
Message #39 received at 75998 <at> debbugs.gnu.org (full text, mbox):
Hi,
Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
> Tomas, is it a fix you'd be interested in contributing? Otherwise, I'll
> get to it but my hands are rather full at the moment :-).
To quote myself from the other thread:
> Probably not. I have spent 20 minutes staring into the file and do not
> really have any idea where to start (ok, probably somewhere around
> `scan-attr'). So I cannot really promise I will be able to work on this
> (at least not soon), since I assume it will take me long time to figure
> out.
So I do not have any immediate plans to start working on this. :/
Tomas
--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
Information forwarded to bug-guile <at> gnu.org: bug#75998; Package guile. (Sat, 15 Feb 2025 15:30:01 GMT)
Message #42 received at 75998 <at> debbugs.gnu.org (full text, mbox):
Hi Tomas,
Tomas Volf <~@wolfsden.cz> writes:
> Hi,
>
> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
>
>> Tomas, is it a fix you'd be interested in contributing? Otherwise, I'll
>> get to it but my hands are rather full at the moment :-).
>
> To quote myself from the other thread:
>
>> Probably not. I have spent 20 minutes staring into the file and do not
>> really have any idea where to start (ok, probably somewhere around
>> `scan-attr'). So I cannot really promise I will be able to work on this
>> (at least not soon), since I assume it will take me long time to figure
>> out.
>
> So I do not have any immediate plans to start working on this. :/
OK, no worries. I'll look into it when I have a good chunk of time
ahead.
--
Thanks,
Maxim
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 16 Mar 2025 11:24:35 GMT)