GNU bug report logs - #63125
30.0.50; [BUG] last argument of libxml2-parse-html-region has no effect?

Package: emacs;

Reported by: Ruijie Yu <ruijie <at> netyu.xyz>

Date: Thu, 27 Apr 2023 16:34:02 UTC

Severity: normal

Found in version 30.0.50

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 63125 in the body.
You can then email your comments to 63125 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#63125; Package emacs. (Thu, 27 Apr 2023 16:34:02 GMT)

Acknowledgement sent to Ruijie Yu <ruijie <at> netyu.xyz>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 27 Apr 2023 16:34:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Ruijie Yu <ruijie <at> netyu.xyz>
To: bug-gnu-emacs <at> gnu.org
Subject: 30.0.50; [BUG] last argument of libxml2-parse-html-region has no
 effect?
Date: Fri, 28 Apr 2023 00:19:22 +0800
[I know I'm running a one-month-old master.  I will try to reproduce
this issue again within a day with an up-to-date master, and with -Q
as well, unless someone else does it first.]

I'm trying out the function `libxml-parse-html-region' as recommended
by a thread in help-gnu-emacs.  However, I discovered that the last
argument of this function does not help me normalize a relative URL.

Reproducer:

Visit the attached toy html file.  I imagine that it is hosted at
"https://example.com/good/day".

Run this snippet:

    (pp (libxml-parse-html-region
         (point-min) (point-max)
         "https://example.com/good/day"))

Compare it with this snippet:

    (pp (libxml-parse-html-region
         (point-min) (point-max)))

What I get is this result for both snippets (which is shown twice, once
"pretty-printed", and once returned as a string):

--8<---------------cut here---------------start------------->8---
(html nil
      (body nil "\n    "
            (a
             ((href . "/hello"))
             "1")
            "\n    "
            (a
             ((href . "../world"))
             "2")
            "\n    "
            (a
             ((href . "good"))
             "3")
            "\n    "
            (a
             ((href . "morning/or/night"))
             "4")
            "\n  "))
--8<---------------cut here---------------end--------------->8---

Notice that the href values are not normalized: they are copied
verbatim from the original HTML file.

If I understand the docstring correctly, the last argument of
`libxml-parse-html-region', when specified as a URL string, should be
used as the "base point" for resolving relative paths found within the
HTML document.  But the <a href=xxx> paths are not resolved at the
moment.

---

In GNU Emacs 30.0.50 (build 1, x86_64-pc-linux-gnu, GTK+ Version
 3.24.37, cairo version 1.17.8) of 2023-03-25 built on ruijie
Repository revision: db7e95531ac36ae842787b6c5f2859d0642c78cc
Repository branch: makepkg
System Description: Arch Linux

Configured using:
 'configure --prefix=/usr --sysconfdir=/etc --libexecdir=/usr/lib
 --localstatedir=/var --mandir=/usr/share/man --with-gameuser=:games
 --with-modules --without-libotf --without-m17n-flt --without-gconf
 --enable-link-time-optimization --with-native-compilation=yes
 --with-xinput2 --with-pgtk --without-xaw3d --with-sound=alsa
 --with-tree-sitter '--program-transform-name=s/\([ec]tags\)/\1.emacs/'
 'CFLAGS=-march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions
 -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security
 -fstack-clash-protection -fcf-protection'
 LDFLAGS=-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now'

Configured features:
ACL CAIRO DBUS FREETYPE GIF GLIB GMP GNUTLS GPM GSETTINGS HARFBUZZ JPEG
JSON LCMS2 LIBSYSTEMD LIBXML2 MODULES NATIVE_COMP NOTIFY INOTIFY PDUMPER
PGTK PNG RSVG SECCOMP SOUND SQLITE3 THREADS TIFF TOOLKIT_SCROLL_BARS
TREE_SITTER WEBP XIM GTK3 ZLIB

Important settings:
  value of $LANG: en_US.UTF-8
  value of $XMODIFIERS: @im=fcitx
  locale-coding-system: utf-8-unix

-- 
Best,


RY

[Please note that this mail might go to spam due to some
misconfiguration in my mail server -- still investigating.]




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#63125; Package emacs. (Thu, 27 Apr 2023 17:09:01 GMT)

Message #8 received at 63125 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Ruijie Yu <ruijie <at> netyu.xyz>
Cc: 63125 <at> debbugs.gnu.org
Subject: Re: bug#63125: 30.0.50;
 [BUG] last argument of libxml2-parse-html-region has no effect?
Date: Thu, 27 Apr 2023 20:08:14 +0300
> Date: Fri, 28 Apr 2023 00:19:22 +0800
> From:  Ruijie Yu via "Bug reports for GNU Emacs,
>  the Swiss army knife of text editors" <bug-gnu-emacs <at> gnu.org>
> 
> I'm trying out the function `libxml2-parse-html-region' as recommended
> by a thread in help-gnu-emacs.  However, I discovered that the last
> argument of this function does not help me normalize a relative url.
> 
> Reproducer:
> 
> Visit the attached toy html file.  I imagine that it is hosted at
> "https://example.com/good/day".
> 
> Run this snippet:
> 
>     (pp (libxml-parse-html-region
>          (point-min) (point-max)
>          "https://example.com/good/day"))
> 
> Compare it with this snippet:
> 
>     (pp (libxml-parse-html-region
>          (point-min) (point-max)))
> 
> What I get is this result for both snippets (which is shown twice, once
> "pretty-printed", and once returned as a string):
> 
> --8<---------------cut here---------------start------------->8---
> (html nil
>       (body nil "\n    "
>             (a
>              ((href . "/hello"))
>              "1")
>             "\n    "
>             (a
>              ((href . "../world"))
>              "2")
>             "\n    "
>             (a
>              ((href . "good"))
>              "3")
>             "\n    "
>             (a
>              ((href . "morning/or/night"))
>              "4")
>             "\n  "))
> --8<---------------cut here---------------end--------------->8---
> 
> Notice, that the href values are not normalized: they are copied
> verbatim from the original html file.
> 
> If I understand the docstring correctly, the last argument of
> `libxml2-parse-html-region', when specified as a url string, should be
> used as the "base point" of resolving relative paths found within the
> html document.  But the <a href=xxx> paths are not resolved at the
> moment.

If you look at xml.c, you will see that we just call a libxml function
passing it this URL.  So if anything isn't as expected, the answer is
in libxml, not in Emacs.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#63125; Package emacs. (Fri, 28 Apr 2023 01:34:01 GMT)

Message #11 received at 63125 <at> debbugs.gnu.org:

From: Ruijie Yu <ruijie <at> netyu.xyz>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 63125 <at> debbugs.gnu.org
Subject: Re: bug#63125: 30.0.50; [BUG] last argument of
 libxml2-parse-html-region has no effect?
Date: Fri, 28 Apr 2023 09:30:30 +0800
Eli Zaretskii <eliz <at> gnu.org> writes:

>> Date: Fri, 28 Apr 2023 00:19:22 +0800
>> From:  Ruijie Yu via "Bug reports for GNU Emacs,
>>  the Swiss army knife of text editors" <bug-gnu-emacs <at> gnu.org>
>> 
>> I'm trying out the function `libxml2-parse-html-region' as recommended
>> by a thread in help-gnu-emacs.  However, I discovered that the last
>> argument of this function does not help me normalize a relative url.
>> 
>> Reproducer:
>> 
>> Visit the attached toy html file.  I imagine that it is hosted at
>> "https://example.com/good/day".
>> 
>> Run this snippet:
>> 
>>     (pp (libxml-parse-html-region
>>          (point-min) (point-max)
>>          "https://example.com/good/day"))
>> 
>> Compare it with this snippet:
>> 
>>     (pp (libxml-parse-html-region
>>          (point-min) (point-max)))
>> 
>> What I get is this result for both snippets (which is shown twice, once
>> "pretty-printed", and once returned as a string):
>> 
>> --8<---------------cut here---------------start------------->8---
>> (html nil
>>       (body nil "\n    "
>>             (a
>>              ((href . "/hello"))
>>              "1")
>>             "\n    "
>>             (a
>>              ((href . "../world"))
>>              "2")
>>             "\n    "
>>             (a
>>              ((href . "good"))
>>              "3")
>>             "\n    "
>>             (a
>>              ((href . "morning/or/night"))
>>              "4")
>>             "\n  "))
>> --8<---------------cut here---------------end--------------->8---
>> 
>> Notice, that the href values are not normalized: they are copied
>> verbatim from the original html file.
>> 
>> If I understand the docstring correctly, the last argument of
>> `libxml2-parse-html-region', when specified as a url string, should be
>> used as the "base point" of resolving relative paths found within the
>> html document.  But the <a href=xxx> paths are not resolved at the
>> moment.
>
> If you look at xml.c, you will see that we just call a libxml function
> passing it this URL.  So if anything isn't as expected, the answer is
> in libxml, not in Emacs.

Thank you for pointing that out.  I will take a look at its source in a
day or two.  I am also upgrading it from 2.10.3-2 to 2.10.4-2, and will
see if that changes anything.

If I end up deciding that it is a libxml2 bug, I'll file a bug there and
link to this bug.

For completeness, here attached is the toy html file that I forgot to
attach in my initial report.

[hello.html (text/html, attachment)]
-- 
Best,


RY

[Please note that this mail might go to spam due to some
misconfiguration in my mail server -- still investigating.]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#63125; Package emacs. (Fri, 28 Apr 2023 10:24:02 GMT)

Message #14 received at 63125 <at> debbugs.gnu.org:

From: Ruijie Yu <ruijie <at> netyu.xyz>
To: Ruijie Yu <ruijie <at> netyu.xyz>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 63125 <at> debbugs.gnu.org
Subject: Re: bug#63125: 30.0.50; [BUG] last argument of
 libxml2-parse-html-region has no effect?
Date: Fri, 28 Apr 2023 18:18:21 +0800
Ruijie Yu <ruijie <at> netyu.xyz> writes:
>>
>> If you look at xml.c, you will see that we just call a libxml function
>> passing it this URL.  So if anything isn't as expected, the answer is
>> in libxml, not in Emacs.
>
> Thank you for pointing that out.  I will take a look at its source in a
> day or two.  I am also upgrading it from 2.10.3-2 to 2.10.4-2, and will
> see if that changes anything.

No difference -- as expected.

> If I end up deciding that it is a libxml2 bug, I'll file a bug there and
> link to this bug.

I have filed an issue [1] in libxml2.  We'll see what they say about it.

FTR, [2] is the documentation of libxml2's htmlReadMemory()
function, though it does not say much.

[1]: https://gitlab.gnome.org/GNOME/libxml2/-/issues/525
[2]: https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-HTMLparser.html#htmlReadMemory

-- 
Best,


RY

[Please note that this mail might go to spam due to some
misconfiguration in my mail server -- still investigating.]




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#63125; Package emacs. (Fri, 28 Apr 2023 10:52:02 GMT)

Message #17 received at 63125 <at> debbugs.gnu.org:

From: Ruijie Yu <ruijie <at> netyu.xyz>
To: Ruijie Yu <ruijie <at> netyu.xyz>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 63125 <at> debbugs.gnu.org
Subject: Re: bug#63125: 30.0.50; [BUG] last argument of
 libxml-parse-html-region has no effect?
Date: Fri, 28 Apr 2023 18:40:35 +0800
Ruijie Yu <ruijie <at> netyu.xyz> writes:

> Ruijie Yu <ruijie <at> netyu.xyz> writes:
>>>
>>> If you look at xml.c, you will see that we just call a libxml function
>>> passing it this URL.  So if anything isn't as expected, the answer is
>>> in libxml, not in Emacs.
>>
>> Thank you for pointing that out.  I will take a look at its source in a
>> day or two.  I am also upgrading it from 2.10.3-2 to 2.10.4-2, and will
>> see if that changes anything.
>
> No difference -- as expected.
>
>> If I end up deciding that it is a libxml2 bug, I'll file a bug there and
>> link to this bug.
>
> I have filed an issue [1] in libxml2.  We'll see what they say about it.
>
> FTR, [2] is the documentation of the libxml2's htmlReadMemory()
> function -- though it does not say much.
>
> [1]: https://gitlab.gnome.org/GNOME/libxml2/-/issues/525
> [2]: https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-HTMLparser.html#htmlReadMemory.

I just got a response from one of libxml2's maintainers.

It seems that the docstring for `libxml-parse-html-region' is wrong:
this argument has never served the purpose of resolving relative URLs.
It was only used for error messages.  So I suggest that we modify the
docstring of this function and `libxml-parse-xml-region' to reflect this
fact.

I also don't know if, based on this new information, you want to mark
this parameter obsolete.  I see no immediate need, though.

Should I send a patch for the documentation change, or will you do it?

-- 
Best,


RY

[Please note that this mail might go to spam due to some
misconfiguration in my mail server -- still investigating.]




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#63125; Package emacs. (Fri, 28 Apr 2023 11:32:02 GMT)

Message #20 received at 63125 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Ruijie Yu <ruijie <at> netyu.xyz>
Cc: 63125 <at> debbugs.gnu.org
Subject: Re: bug#63125: 30.0.50; [BUG] last argument of
 libxml-parse-html-region has no effect?
Date: Fri, 28 Apr 2023 14:31:28 +0300
> From: Ruijie Yu <ruijie <at> netyu.xyz>
> Cc: Eli Zaretskii <eliz <at> gnu.org>, 63125 <at> debbugs.gnu.org
> Date: Fri, 28 Apr 2023 18:40:35 +0800
> 
> > I have filed an issue [1] in libxml2.  We'll see what they say about it.
> >
> > FTR, [2] is the documentation of the libxml2's htmlReadMemory()
> > function -- though it does not say much.
> >
> > [1]: https://gitlab.gnome.org/GNOME/libxml2/-/issues/525
> > [2]: https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-HTMLparser.html#htmlReadMemory.
> 
> I just got a response from one of libxml2's maintainers.
> 
> It seems that the docstring for `libxml-parse-html-region' is wrong:
> this argument has never served the purpose of resolving relative URLs.
> It was only used for error messages.  So I suggest that we modify the
> docstring of this function and `libxml-parse-xml-region' to reflect this
> fact.

The response doesn't say much.  What is this "base URL" argument used
for, and why is it named "base URL"?  What does "used for error
messages" mean?  And where is the up-to-date and accurate documentation
of this function, which explains what this argument is for?

Without knowing all that, we cannot fix our documentation, let alone
code.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#63125; Package emacs. (Sat, 29 Apr 2023 01:22:02 GMT)

Message #23 received at 63125 <at> debbugs.gnu.org:

From: Ruijie Yu <ruijie <at> netyu.xyz>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Lars Ingebrigtsen <larsi <at> gnus.org>, 63125 <at> debbugs.gnu.org
Subject: Re: bug#63125: 30.0.50; [BUG] last argument of
 libxml-parse-html-region has no effect?
Date: Sat, 29 Apr 2023 08:58:03 +0800
Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Ruijie Yu <ruijie <at> netyu.xyz>
>> Cc: Eli Zaretskii <eliz <at> gnu.org>, 63125 <at> debbugs.gnu.org
>> Date: Fri, 28 Apr 2023 18:40:35 +0800
>> 
>> > I have filed an issue [1] in libxml2.  We'll see what they say about it.
>> >
>> > FTR, [2] is the documentation of the libxml2's htmlReadMemory()
>> > function -- though it does not say much.
>> >
>> > [1]: https://gitlab.gnome.org/GNOME/libxml2/-/issues/525
>> > [2]:
>> > https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-HTMLparser.html#htmlReadMemory.
>> 
>> I just got a response from one of libxml2's maintainers.
>> 
>> It seems that the docstring for `libxml-parse-html-region' is wrong:
>> this argument has never served the purpose of resolving relative URLs.
>> It was only used for error messages.  So I suggest that we modify the
>> docstring of this function and `libxml-parse-xml-region' to reflect this
>> fact.
>
> The response doesn't say much.  What is this "base URL" argument used
> for, and why is it named "bas URL"?  What does it mean "used for error
> messages"?  And where is the up-to-date and accurate documentation of
> this function, which explains what is this argument for?
>
> Without knowing all that, we cannot fix our documentation, let alone
> code.

The "base-url" is an argument to the Elisp function
`libxml-parse-html-region'.  I added Lars to the CC, who originally
introduced this function according to git-blame, and who may have a
better idea.

The following are my impressions, but I'm happy to pass any
questions you still have to the libxml2 devs if you want (or you can
comment there directly in the linked issue on GNOME's GitLab instance).

-----

As you pointed out, the arguments of the Elisp function are passed,
with minimal transformations, to the libxml2 function
`htmlReadMemory()'.  This C function takes an argument `url',
which is the string `base-url', or the empty string if `base-url' is nil.

According to Nick (the libxml2 maintainer) and my interpretation, the
`url' parameter of the libxml2 function is simply stored inside the
`url' field of a `xmlDoc' struct, to be used when an error message needs
to be displayed.  So, the `url' parameter practically does nothing for
us, since we disable all libxml2-level warnings and errors in calling
`htmlReadMemory()'.
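
Because `base-url' does not affect the parsed tree, code that needs
absolute links would have to expand them itself after parsing.  The
following is only a minimal sketch of that idea and was not proposed in
this thread: `my/dom-absolutize-hrefs' is a hypothetical helper name, it
only looks at the href attribute of <a> elements, and it assumes that
`url-expand-file-name' (url-expand.el) and the dom.el accessors behave
as their docstrings describe.

```elisp
(require 'dom)
(require 'url-expand)

;; Hypothetical helper, not part of Emacs.
(defun my/dom-absolutize-hrefs (dom base-url)
  "Expand the href of every <a> element in DOM against BASE-URL.
DOM is a tree as returned by `libxml-parse-html-region'.
The tree is modified in place and returned."
  (dolist (node (dom-by-tag dom 'a))
    (let ((href (dom-attr node 'href)))
      (when (stringp href)
        ;; `url-expand-file-name' resolves a relative reference
        ;; against a default (base) URL.
        (dom-set-attribute node 'href
                           (url-expand-file-name href base-url)))))
  dom)

;; Possible usage in a buffer visiting the toy HTML file:
;; (pp (my/dom-absolutize-hrefs
;;      (libxml-parse-html-region (point-min) (point-max))
;;      "https://example.com/good/day"))
```

Doing the expansion in Lisp keeps the parser call unchanged, so it does
not depend on what libxml2 does with its `url' parameter.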

I posted this URL [1] in the issue, assuming that it is the
documentation, and Nick did not comment on it.  So this is
probably the up-to-date, albeit not very elaborate, documentation for
the function.

[1]: https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-HTMLparser.html#htmlReadMemory

-- 
Best,


RY

[Please note that this mail might go to spam due to some
misconfiguration in my mail server -- still investigating.]




Reply sent to Eli Zaretskii <eliz <at> gnu.org>:
You have taken responsibility. (Sat, 29 Apr 2023 06:40:02 GMT)

Notification sent to Ruijie Yu <ruijie <at> netyu.xyz>:
bug acknowledged by developer. (Sat, 29 Apr 2023 06:40:02 GMT)

Message #28 received at 63125-done <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Ruijie Yu <ruijie <at> netyu.xyz>
Cc: larsi <at> gnus.org, 63125-done <at> debbugs.gnu.org
Subject: Re: bug#63125: 30.0.50; [BUG] last argument of
 libxml-parse-html-region has no effect?
Date: Sat, 29 Apr 2023 09:40:19 +0300
> From: Ruijie Yu <ruijie <at> netyu.xyz>
> Cc: 63125 <at> debbugs.gnu.org, Lars Ingebrigtsen <larsi <at> gnus.org>
> Date: Sat, 29 Apr 2023 08:58:03 +0800
> 
> > The response doesn't say much.  What is this "base URL" argument used
> > for, and why is it named "bas URL"?  What does it mean "used for error
> > messages"?  And where is the up-to-date and accurate documentation of
> > this function, which explains what is this argument for?
> >
> > Without knowing all that, we cannot fix our documentation, let alone
> > code.
> 
> The "base-url" is an argument to the Elisp function
> `libxml-parse-html-region'.  I added Lars to the CC, who originally
> introduced this function according to git-blame, and who may have a
> better idea.
> 
> The following portion are my impressions, but I'm happy to pass any
> questions you still have to the libxml2 devs if you want (or you can
> comment there directly in the linked issue on gnome's gitlab instance).
> 
> -----
> 
> As you pointed out, these arguments of the Elisp function are passed
> with minimal transformations and sent to the libxml2 function
> `htmlReadMemory()' function.  This C function takes an argument `url',
> which is the string `base-url' or empty string if `base-url' is nil.
> 
> According to Nick (the libxml2 maintainer) and my interpretation, the
> `url' parameter of the libxml2 function is simply stored inside the
> `url' field of a `xmlDoc' struct, to be used when an error message needs
> to be displayed.  So, the `url' parameter practically does nothing for
> us, since we disable all libxml2-level warnings and errors in calling
> `htmlReadMemory()'.
> 
> I put this url [1] to the issue assuming that it is the documentation,
> and Nick doesn't have any comment regarding the url.  So this is
> probably the up-to-date, albeit not very elaborate, documentation for
> the function.
> 
> [1]: https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-HTMLparser.html#htmlReadMemory

Thanks.  So I've now updated our documentation with this information,
and I'm therefore closing the bug.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 27 May 2023 11:24:07 GMT)

This bug report was last modified 2 years and 23 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.