GNU bug report logs - #52338
Crawler bots are downloading substitutes
Reported by: Leo Famulari <leo <at> famulari.name>
Date: Mon, 6 Dec 2021 21:22:01 UTC
Severity: normal
Done: Mathieu Othacehe <othacehe <at> gnu.org>
Bug is archived. No further changes may be made.
Report forwarded to bug-guix <at> gnu.org: bug#52338; Package guix. (Mon, 06 Dec 2021 21:22:01 GMT)
Acknowledgement sent to Leo Famulari <leo <at> famulari.name>: New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Mon, 06 Dec 2021 21:22:01 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
I noticed that some bots are downloading substitutes from
ci.guix.gnu.org.
We should add a robots.txt file to reduce this waste.
Specifically, I see bots from Bing and Semrush:
https://www.bing.com/bingbot.htm
https://www.semrush.com/bot.html
Information forwarded to bug-guix <at> gnu.org: bug#52338; Package guix. (Mon, 06 Dec 2021 22:19:01 GMT)
Message #8 received at 52338 <at> debbugs.gnu.org (full text, mbox):
I tested that `guix system build` does succeed with this change, but I
would like a review of whether the resulting Nginx configuration is
correct, and whether this is the correct path to disallow. It generates an
Nginx location block like this:
------
location /robots.txt {
    add_header Content-Type text/plain;
    return 200 "User-agent: *
Disallow: /nar
";
}
------
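As a quick sanity check (not part of the patch), the generated robots.txt body can be fed to Python's standard-library robots.txt parser to confirm which paths it blocks; the store-item URL below is illustrative:

```python
# Parse the generated robots.txt with Python's stdlib parser and
# confirm which paths a generic ("*") crawler may fetch.
import urllib.robotparser

robots_txt = "User-agent: *\nDisallow: /nar\n"

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# Substitute downloads under /nar are disallowed...
assert not parser.can_fetch("*", "https://ci.guix.gnu.org/nar/gzip/abc123-foo")
# ...while the rest of the site (e.g. the front page) stays crawlable.
assert parser.can_fetch("*", "https://ci.guix.gnu.org/")
```

Well-behaved crawlers should interpret the rules the same way, though (as noted later in this thread) not all bots honour robots.txt at all.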
* hydra/nginx/berlin.scm (berlin-locations): Add a robots.txt Nginx location.
---
hydra/nginx/berlin.scm | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/hydra/nginx/berlin.scm b/hydra/nginx/berlin.scm
index 1f4b0be..3bb2129 100644
--- a/hydra/nginx/berlin.scm
+++ b/hydra/nginx/berlin.scm
@@ -174,7 +174,14 @@ PUBLISH-URL."
    (nginx-location-configuration
     (uri "/berlin.guixsd.org-export.pub")
     (body
-     (list "root /var/www/guix;"))))))
+     (list "root /var/www/guix;")))
+
+   (nginx-location-configuration
+    (uri "/robots.txt")
+    (body
+     (list
+      "add_header Content-Type text/plain;"
+      "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))
 
 (define guix.gnu.org-redirect-locations
   (list
--
2.34.0
Information forwarded to bug-guix <at> gnu.org: bug#52338; Package guix. (Thu, 09 Dec 2021 13:28:02 GMT)
Message #11 received at 52338 <at> debbugs.gnu.org (full text, mbox):
Hello Leo,
> + (nginx-location-configuration
> + (uri "/robots.txt")
> + (body
> + (list
> + "add_header Content-Type text/plain;"
> + "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))
Nice, the bots are also accessing the Cuirass web interface, do you
think it would be possible to extend this snippet to prevent it?
Thanks,
Mathieu
Information forwarded to bug-guix <at> gnu.org: bug#52338; Package guix. (Thu, 09 Dec 2021 16:36:02 GMT)
Message #14 received at submit <at> debbugs.gnu.org (full text, mbox):
Mathieu Othacehe wrote:
> Hello Leo,
>
>> + (nginx-location-configuration
>> + (uri "/robots.txt")
It's a micro-optimisation, but it can't hurt to generate ‘location
= /robots.txt’ instead of ‘location /robots.txt’ here.
>> + (body
>> + (list
>> + "add_header Content-Type text/plain;"
>> + "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))
Use \r\n instead of \n, even if \n happens to work.
There are many ‘buggy’ crawlers out there. It's in their own
interest to be fussy whilst claiming to respect robots.txt. The
less you deviate from the most basic norm imaginable, the better.
I tested whether embedding raw \r\n bytes in nginx.conf strings
like this works, and it seems to, even though a human would
probably not do so.
> Nice, the bots are also accessing the Cuirass web interface, do
> you
> think it would be possible to extend this snippet to prevent it?
You can replace ‘/nar/’ with ‘/’ to disallow everything:
Disallow: /
If we want crawlers to index only the front page (so people can
search for ‘Guix CI’, I guess), that's possible:
Disallow: /
Allow: /$
Don't confuse ‘$’ with ‘supports regexps’. Buggy bots might fall
back to ‘Disallow: /’.
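That fallback is easy to demonstrate: Python's standard-library robots.txt parser implements only the basic norm, with no support for the ‘$’ end-anchor extension, so under these rules it treats ‘Allow: /$’ as a literal path that never matches and denies everything, front page included (an illustrative sketch, not from the thread):

```python
# A parser without '$' end-anchor support never matches 'Allow: /$'
# and so falls back to 'Disallow: /' for every path.
import urllib.robotparser

robots_txt = "User-agent: *\nDisallow: /\nAllow: /$\n"

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# Even the front page is blocked for such a parser:
assert not parser.can_fetch("*", "https://ci.guix.gnu.org/")
assert not parser.can_fetch("*", "https://ci.guix.gnu.org/jobset/guix")
```

Crawlers that do implement the extension (Google-style longest-match semantics) would read the same file as "front page only", which is exactly the ambiguity being warned about.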
This is where it gets ugly: nginx doesn't support escaping ‘$’ in
strings. At all. It's insane.
geo $dollar { default "$"; }  # stackoverflow.com/questions/57466554

server {
  location = /robots.txt {
    return 200 "User-agent: *\r\nDisallow: /\r\nAllow: /$dollar\r\n";
  }
}
*Obviously.*
An alternative to that is to serve a real on-disc robots.txt.
Kind regards,
T G-R
Information forwarded to bug-guix <at> gnu.org: bug#52338; Package guix. (Fri, 10 Dec 2021 16:23:01 GMT)
Message #20 received at submit <at> debbugs.gnu.org (full text, mbox):
On Thu, Dec 09, 2021 at 04:42:24PM +0100, Tobias Geerinckx-Rice wrote:
[...]
> An alternative to that is to serve a real on-disc robots.txt.
Alright, I leave it up to you. I just want to prevent bots from
downloading substitutes. I don't really have opinions about any of the
details.
Information forwarded to bug-guix <at> gnu.org: bug#52338; Package guix. (Fri, 10 Dec 2021 16:47:01 GMT)
Message #26 received at submit <at> debbugs.gnu.org (full text, mbox):
Leo Famulari wrote:
> Alright, I leave it up to you.
Dammit.
Kind regards,
T G-R
Information forwarded to bug-guix <at> gnu.org: bug#52338; Package guix. (Fri, 10 Dec 2021 21:22:01 GMT)
Message #32 received at 52338 <at> debbugs.gnu.org (full text, mbox):
Hi Leo,
Leo Famulari <leo <at> famulari.name> writes:
> I noticed that some bots are downloading substitutes from
> ci.guix.gnu.org.
>
> We should add a robots.txt file to reduce this waste.
>
> Specifically, I see bots from Bing and Semrush:
>
> https://www.bing.com/bingbot.htm
> https://www.semrush.com/bot.html
For what it's worth: during the years that I administered Hydra, I found
that many bots disregarded the robots.txt file that was in place there.
In practice, I found that I needed to periodically scan the access logs
for bots and forcefully block their requests in order to keep Hydra from
becoming overloaded with expensive queries from bots.
Regards,
Mark
Information forwarded to bug-guix <at> gnu.org: bug#52338; Package guix. (Fri, 10 Dec 2021 23:04:01 GMT)
Message #35 received at submit <at> debbugs.gnu.org (full text, mbox):
All,
Mark H Weaver wrote:
> For what it's worth: during the years that I administered Hydra, I found
> that many bots disregarded the robots.txt file that was in place there.
> In practice, I found that I needed to periodically scan the access logs
> for bots and forcefully block their requests in order to keep Hydra from
> becoming overloaded with expensive queries from bots.
Very good point.
IME (which is a few years old at this point) at least the
highlighted BingBot & SemrushThing always respected my robots.txt,
but it's definitely a concern. I'll leave this bug open to remind
us of that in a few weeks or so…
If it does become a problem, we (I) might add some basic
User-Agent sniffing to either slow down or outright block
non-Guile downloaders. Whitelisting any legitimate ones, of
course. I think that's less hassle than dealing with dynamic IP
blocks whilst being equally effective here.
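The sniffing idea can be sketched in a few lines (a hypothetical helper, not an actual deployment; the agent names and patterns are illustrative assumptions, not a list anyone in the thread proposed):

```python
import re

# Sketch of the User-Agent policy described above: allow Guile/Guix
# downloaders, block known crawlers, and merely throttle unknown agents
# rather than rejecting them outright. All patterns are assumptions.
ALLOWED = re.compile(r"(?i)guile|guix")
BLOCKED = re.compile(r"(?i)bingbot|semrush")

def classify(user_agent: str) -> str:
    """Return 'allow', 'block', or 'throttle' for a User-Agent string."""
    if ALLOWED.search(user_agent):
        return "allow"
    if BLOCKED.search(user_agent):
        return "block"
    return "throttle"  # unknown agents: slow down rather than reject
```

In practice this logic would live in the Nginx configuration (e.g. a `map` on `$http_user_agent`) rather than application code, but the decision table is the same.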
Thanks (again) for taking care of Hydra, Mark, and thank you Leo
for keeping an eye on Cuirass :-)
T G-R
Information forwarded to bug-guix <at> gnu.org: bug#52338; Package guix. (Sat, 11 Dec 2021 09:47:01 GMT)
Message #41 received at submit <at> debbugs.gnu.org (full text, mbox):
Hey,
The Cuirass web interface logs were quite silent this morning and I
suspected an issue somewhere. I then realized that you had updated the
Nginx conf and the bots were no longer knocking at our door, which is
great!
Thanks to both of you,
Mathieu
Reply sent to Mathieu Othacehe <othacehe <at> gnu.org>: You have taken responsibility. (Sun, 19 Dec 2021 16:54:02 GMT)
Notification sent to Leo Famulari <leo <at> famulari.name>: bug acknowledged by developer. (Sun, 19 Dec 2021 16:54:02 GMT)
Message #55 received at 52338-done <at> debbugs.gnu.org (full text, mbox):
> Thanks to both of you,
And closing!
Mathieu
Bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 17 Jan 2022 12:24:07 GMT)
This bug report was last modified 3 years and 206 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.