GNU bug report logs - #52338
Crawler bots are downloading substitutes


Package: guix;

Reported by: Leo Famulari <leo <at> famulari.name>

Date: Mon, 6 Dec 2021 21:22:01 UTC

Severity: normal

Done: Mathieu Othacehe <othacehe <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it by sending
a message to control <at> debbugs.gnu.org with unarchive 52338 in the body.
You can then email your comments to 52338 <at> debbugs.gnu.org in the normal way.



Report forwarded to bug-guix <at> gnu.org:
bug#52338; Package guix. (Mon, 06 Dec 2021 21:22:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Leo Famulari <leo <at> famulari.name>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Mon, 06 Dec 2021 21:22:01 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Leo Famulari <leo <at> famulari.name>
To: bug-guix <at> gnu.org
Subject: Crawler bots are downloading substitutes
Date: Mon, 6 Dec 2021 16:20:55 -0500
I noticed that some bots are downloading substitutes from
ci.guix.gnu.org.

We should add a robots.txt file to reduce this waste.

Specifically, I see bots from Bing and Semrush:

https://www.bing.com/bingbot.htm
https://www.semrush.com/bot.html
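
A minimal robots.txt along these lines should be enough (assuming all
substitutes are served under /nar/):

------
# Ask crawlers to skip substitute downloads.
User-agent: *
Disallow: /nar/
------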




Information forwarded to bug-guix <at> gnu.org:
bug#52338; Package guix. (Mon, 06 Dec 2021 22:19:01 GMT) Full text and rfc822 format available.

Message #8 received at 52338 <at> debbugs.gnu.org (full text, mbox):

From: Leo Famulari <leo <at> famulari.name>
To: 52338 <at> debbugs.gnu.org
Subject: [maintenance] hydra: berlin: Create robots.txt.
Date: Mon,  6 Dec 2021 17:18:10 -0500
I tested that `guix system build` succeeds with this change, but I
would like a review of whether the resulting Nginx configuration is
correct, and whether this is the correct path to disallow. It generates an
Nginx location block like this:

------
      location /robots.txt {
        add_header  Content-Type  text/plain;
        return 200 "User-agent: *
Disallow: /nar/
";
      }
------

* hydra/nginx/berlin.scm (berlin-locations): Add a robots.txt Nginx location.
---
 hydra/nginx/berlin.scm | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/hydra/nginx/berlin.scm b/hydra/nginx/berlin.scm
index 1f4b0be..3bb2129 100644
--- a/hydra/nginx/berlin.scm
+++ b/hydra/nginx/berlin.scm
@@ -174,7 +174,14 @@ PUBLISH-URL."
            (nginx-location-configuration
             (uri "/berlin.guixsd.org-export.pub")
             (body
-             (list "root /var/www/guix;"))))))
+             (list "root /var/www/guix;")))
+
+           (nginx-location-configuration
+             (uri "/robots.txt")
+             (body
+               (list
+                 "add_header  Content-Type  text/plain;"
+                 "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))
 
 (define guix.gnu.org-redirect-locations
   (list
-- 
2.34.0





Information forwarded to bug-guix <at> gnu.org:
bug#52338; Package guix. (Thu, 09 Dec 2021 13:28:02 GMT) Full text and rfc822 format available.

Message #11 received at 52338 <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: Leo Famulari <leo <at> famulari.name>
Cc: 52338 <at> debbugs.gnu.org
Subject: Re: bug#52338: Crawler bots are downloading substitutes
Date: Thu, 09 Dec 2021 14:27:38 +0100
Hello Leo,

> +           (nginx-location-configuration
> +             (uri "/robots.txt")
> +             (body
> +               (list
> +                 "add_header  Content-Type  text/plain;"
> +                 "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))

Nice! The bots are also accessing the Cuirass web interface; do you
think it would be possible to extend this snippet to prevent that as well?

Thanks,

Mathieu




Information forwarded to bug-guix <at> gnu.org:
bug#52338; Package guix. (Thu, 09 Dec 2021 16:36:02 GMT) Full text and rfc822 format available.

Message #14 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Tobias Geerinckx-Rice <me <at> tobias.gr>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: 52338 <at> debbugs.gnu.org, bug-guix <at> gnu.org, Leo Famulari <leo <at> famulari.name>
Subject: Re: bug#52338: Crawler bots are downloading substitutes
Date: Thu, 09 Dec 2021 16:42:24 +0100
[Message part 1 (text/plain, inline)]
Mathieu Othacehe wrote:
> Hello Leo,
>
>> +           (nginx-location-configuration
>> +             (uri "/robots.txt")

It's a micro-optimisation, but it can't hurt to generate
‘location = /robots.txt’ instead of ‘location /robots.txt’ here.

>> +             (body
>> +               (list
>> +                 "add_header  Content-Type  text/plain;"
>> +                 "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))

Use \r\n instead of \n, even if \n happens to work.

There are many ‘buggy’ crawlers out there that claim to respect
robots.txt whilst parsing it fussily.  The less you deviate from the
most basic norm imaginable, the better.

I tested whether embedding raw \r\n bytes in nginx.conf strings like
this works, and it seems to, even though a human editing the file by
hand would probably not write it that way.
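
Concretely, combining the two tweaks above, the rendered nginx.conf
block would become something like:

------
location = /robots.txt {
    add_header  Content-Type  text/plain;
    return 200 "User-agent: *\r\nDisallow: /nar/\r\n";
}
------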

> Nice, the bots are also accessing the Cuirass web interface, do 
> you
> think it would be possible to extend this snippet to prevent it?

You can replace ‘/nar/’ with ‘/’ to disallow everything:

 Disallow: /

If we want crawlers to index only the front page (so people can 
search for ‘Guix CI’, I guess), that's possible:

 Disallow: /
 Allow: /$

Don't take the trailing ‘$’ to mean robots.txt ‘supports regexps’:
it's a de-facto end-of-URL extension, and buggy bots that don't
understand it might fall back to plain ‘Disallow: /’.

This is where it gets ugly: nginx doesn't support escaping ‘$’ in 
strings.  At all.  It's insane.

[Message part 2 (text/plain, inline)]
 geo $dollar { default "$"; }  # stackoverflow.com/questions/57466554
 server {
   location = /robots.txt {
     return 200 "User-agent: *\r\nDisallow: /\r\nAllow: /$dollar\r\n";
   }
 }
[Message part 3 (text/plain, inline)]
*Obviously.*

An alternative to that is to serve a real on-disc robots.txt.
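
For example (a sketch; /var/www/guix is the root already used for
other static files in berlin.scm):

------
location = /robots.txt {
    # A plain file on disc: no nginx string-escaping contortions at all.
    root /var/www/guix;
}
------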

Kind regards,

T G-R
[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#52338; Package guix. (Fri, 10 Dec 2021 16:23:01 GMT) Full text and rfc822 format available.

Message #20 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Leo Famulari <leo <at> famulari.name>
To: Tobias Geerinckx-Rice <me <at> tobias.gr>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 52338 <at> debbugs.gnu.org,
 bug-guix <at> gnu.org
Subject: Re: bug#52338: Crawler bots are downloading substitutes
Date: Fri, 10 Dec 2021 11:22:15 -0500
[Message part 1 (text/plain, inline)]
On Thu, Dec 09, 2021 at 04:42:24PM +0100, Tobias Geerinckx-Rice wrote:
[...]
> An alternative to that is to serve a real on-disc robots.txt.

Alright, I leave it up to you. I just want to prevent bots from
downloading substitutes. I don't really have opinions about any of the
details.
[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#52338; Package guix. (Fri, 10 Dec 2021 16:47:01 GMT) Full text and rfc822 format available.

Message #26 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Tobias Geerinckx-Rice <me <at> tobias.gr>
To: Leo Famulari <leo <at> famulari.name>
Cc: Mathieu Othacehe <othacehe <at> gnu.org>, 52338 <at> debbugs.gnu.org,
 bug-guix <at> gnu.org
Subject: Re: bug#52338: Crawler bots are downloading substitutes
Date: Fri, 10 Dec 2021 17:47:09 +0100
[Message part 1 (text/plain, inline)]
Leo Famulari wrote:
> Alright, I leave it up to you.

Dammit.

Kind regards,

T G-R
[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#52338; Package guix. (Fri, 10 Dec 2021 21:22:01 GMT) Full text and rfc822 format available.

Message #32 received at 52338 <at> debbugs.gnu.org (full text, mbox):

From: Mark H Weaver <mhw <at> netris.org>
To: Leo Famulari <leo <at> famulari.name>, 52338 <at> debbugs.gnu.org
Subject: Re: bug#52338: Crawler bots are downloading substitutes
Date: Fri, 10 Dec 2021 16:21:11 -0500
Hi Leo,

Leo Famulari <leo <at> famulari.name> writes:

> I noticed that some bots are downloading substitutes from
> ci.guix.gnu.org.
>
> We should add a robots.txt file to reduce this waste.
>
> Specifically, I see bots from Bing and Semrush:
>
> https://www.bing.com/bingbot.htm
> https://www.semrush.com/bot.html

For what it's worth: during the years that I administered Hydra, I found
that many bots disregarded the robots.txt file that was in place there.
In practice, I found that I needed to periodically scan the access logs
for bots and forcefully block their requests in order to keep Hydra from
becoming overloaded with expensive queries from bots.

     Regards,
       Mark




Information forwarded to bug-guix <at> gnu.org:
bug#52338; Package guix. (Fri, 10 Dec 2021 23:04:01 GMT) Full text and rfc822 format available.

Message #35 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Tobias Geerinckx-Rice <me <at> tobias.gr>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 52338 <at> debbugs.gnu.org, bug-guix <at> gnu.org, Leo Famulari <leo <at> famulari.name>
Subject: Re: bug#52338: Crawler bots are downloading substitutes
Date: Fri, 10 Dec 2021 23:52:51 +0100
[Message part 1 (text/plain, inline)]
All,

Mark H Weaver wrote:
> For what it's worth: during the years that I administered Hydra, I
> found that many bots disregarded the robots.txt file that was in
> place there.  In practice, I found that I needed to periodically
> scan the access logs for bots and forcefully block their requests
> in order to keep Hydra from becoming overloaded with expensive
> queries from bots.

Very good point.

IME (which is a few years old at this point) at least the 
highlighted BingBot & SemrushThing always respected my robots.txt, 
but it's definitely a concern.  I'll leave this bug open to remind 
us of that in a few weeks or so…

If it does become a problem, we (I) might add some basic 
User-Agent sniffing to either slow down or outright block 
non-Guile downloaders.  Whitelisting any legitimate ones, of 
course.  I think that's less hassle than dealing with dynamic IP 
blocks whilst being equally effective here.
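
To sketch what I mean (hypothetical nginx configuration; the map
directive lives at the http level, and the User-Agent patterns are
illustrative only):

------
# http context: classify requests by User-Agent.
map $http_user_agent $blocked_ua {
    default                   0;
    "~*GNU Guile"             0;  # hypothetical UA of the Guile/Guix client
    ~*(bingbot|semrushbot)    1;  # the crawlers seen in our logs
}

server {
    # ...
    if ($blocked_ua) {
        return 403;  # or feed a limit_req zone to slow them down instead
    }
}
------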

Thanks (again) for taking care of Hydra, Mark, and thank you Leo 
for keeping an eye on Cuirass :-)

T G-R
[signature.asc (application/pgp-signature, inline)]

Information forwarded to bug-guix <at> gnu.org:
bug#52338; Package guix. (Sat, 11 Dec 2021 09:47:01 GMT) Full text and rfc822 format available.

Message #41 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: Tobias Geerinckx-Rice <me <at> tobias.gr>
Cc: 52338 <at> debbugs.gnu.org, bug-guix <at> gnu.org, Leo Famulari <leo <at> famulari.name>
Subject: Re: bug#52338: Crawler bots are downloading substitutes
Date: Sat, 11 Dec 2021 10:46:37 +0100
Hey,

The Cuirass web interface logs were quite silent this morning, and I
suspected an issue somewhere. Then I realized that you had updated the
Nginx conf and the bots were no longer knocking at our door, which is
great!

Thanks to both of you,

Mathieu




Reply sent to Mathieu Othacehe <othacehe <at> gnu.org>:
You have taken responsibility. (Sun, 19 Dec 2021 16:54:02 GMT) Full text and rfc822 format available.

Notification sent to Leo Famulari <leo <at> famulari.name>:
bug acknowledged by developer. (Sun, 19 Dec 2021 16:54:02 GMT) Full text and rfc822 format available.

Message #55 received at 52338-done <at> debbugs.gnu.org (full text, mbox):

From: Mathieu Othacehe <othacehe <at> gnu.org>
To: 52338-done <at> debbugs.gnu.org
Subject: Re: bug#52338: Crawler bots are downloading substitutes
Date: Sun, 19 Dec 2021 17:53:27 +0100
> Thanks to both of you,

And closing!

Mathieu




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 17 Jan 2022 12:24:07 GMT) Full text and rfc822 format available.



