GNU bug report logs - #75998
[guile-lib] html->sxml does not decode entities in attributes

Package: guile;

Reported by: Tomas Volf <~@wolfsden.cz>

Date: Sat, 1 Feb 2025 20:11:01 UTC

Severity: normal

Done: Tomas Volf <~@wolfsden.cz>

Bug is archived. No further changes may be made.

Forwarded to oleg@okmij.org

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Tomas Volf <~@wolfsden.cz>
Cc: 75998 <at> debbugs.gnu.org, tomas <at> tuxteam.de
Subject: bug#75998: [guile-lib] html->sxml does not decode entities in attributes
Date: Thu, 06 Feb 2025 23:37:56 +0900

Hi Tomas,

[...]

> It already modifies the raw value for regular HTML text:
>
> scheme@(htmlprag)> (html->sxml "a&amp;b")
> $10 = (*TOP* "a&b")
> scheme@(htmlprag)> (sxml->html '(*TOP* "a&b"))
> $13 = "a&amp;b"
>
>
> I now noticed this also affects encoding:
>
> scheme@(htmlprag)> (sxml->html '(*TOP* (a (@ (href "a&b")))))
> $12 = "<a href=\"a&b\"></a>"
>
>
> I am not sure why attributes should be special here.
>
> For what it is worth, (sxml simple) itself decodes even attributes:
>
> scheme@(htmlprag)> (xml->sxml "<a href=\"a&amp;b\"></a>")
> $11 = (*TOP* (a (@ (href "a&b"))))
>
> For comparison, Firefox seems to decode the attributes as well even in
> HTML.  That is actually how I discovered this issue, links I extracted
> from <a href=".."> using html->sxml were not working until I ran a
> decoding pass on them.

Good points.  Thanks for these.

>> Users may have different use cases requiring them to apply different
>> transformations themselves?
>
> I agree in the abstract, but do you have any specific use case in mind
> where you would want to use the raw content of attributes (especially
> since you already cannot get the raw content of text nodes)?

>> If we hard-code a decoding scheme ourselves, then we force that choice
>> onto users, no?
>
> I agree we cannot hard-code or change it now due to compatibility
> concerns, but adding #:decode-attributes to html->sxml,
> #:encode-attributes to sxml->html and possibly %deencode-attributes?
> parameter, in the spirit of %strict-tokenizer? would seem reasonable.

I see this situation and %strict-tokenizer as a bit different; the
htmlprag module was designed to be lenient, so being lenient could not
really be considered a bug :-).  But this here could well be considered
a bug.  So perhaps something we could do is fix this correctly and bump
at least the minor digit in our version.  We're still in an unstable 0
version (the last one was 0.2.8.1), so technically we don't promise
stability yet (perhaps never, as this guile-lib project aims to be a lab
for components that could later be included in Guile).  But we should
communicate this change well in the NEWS file.
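
Until such a fix lands, the "decoding pass" over attribute values
mentioned in the quoted report could be sketched roughly as below.  This
is only an illustration under stated assumptions, not htmlprag's
implementation: decode-entities, decode-attributes and %named-entities
are hypothetical names, the entity table is deliberately tiny, and
unknown entities are left untouched.

```scheme
(use-modules (ice-9 match)
             (ice-9 regex))

;; Hypothetical, deliberately small entity table (not from htmlprag).
(define %named-entities
  '(("amp" . "&") ("lt" . "<") ("gt" . ">") ("quot" . "\"") ("apos" . "'")))

;; Matches &name;, &#123; and &#x7B; style references.
(define entity-rx
  (make-regexp "&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);"))

(define (code-point->string n)
  ;; Return #f for values that are not valid Unicode scalar values.
  (and (or (<= 0 n #xD7FF) (<= #xE000 n #x10FFFF))
       (string (integer->char n))))

(define (decode-entity name)
  ;; NAME is the text between "&" and ";".  Return the decoded string,
  ;; or #f when the reference is unknown or invalid.
  (cond
   ((string-prefix-ci? "#x" name)
    (code-point->string (string->number (substring name 2) 16)))
   ((string-prefix? "#" name)
    (code-point->string (string->number (substring name 1) 10)))
   (else (assoc-ref %named-entities name))))

(define (decode-entities str)
  ;; Replace every recognized reference in STR; keep the rest verbatim.
  (regexp-substitute/global
   #f entity-rx str
   'pre
   (lambda (m)
     (or (decode-entity (match:substring m 1))
         (match:substring m 0)))
   'post))

(define (decode-attributes sxml)
  ;; Walk SXML and decode string values inside (@ ...) attribute lists.
  ;; Text nodes are left alone, since html->sxml already decodes them.
  (match sxml
    (('@ attrs ...)
     (cons '@
           (map (match-lambda
                  ((name (? string? value))
                   (list name (decode-entities value)))
                  (other other))
                attrs)))
    ((items ...) (map decode-attributes items))
    (atom atom)))

(decode-attributes '(*TOP* (a (@ (href "a&amp;b")) "text")))
;; => (*TOP* (a (@ (href "a&b")) "text"))
```

Running it as a separate pass over the html->sxml result leaves existing
callers unaffected, which sidesteps the compatibility question while the
proper fix is discussed.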

-- 
Thanks,
Maxim




This bug report was last modified 95 days ago.
