GNU bug report logs - #52338
Crawler bots are downloading substitutes

Previous Next

Package: guix;

Reported by: Leo Famulari <leo <at> famulari.name>

Date: Mon, 6 Dec 2021 21:22:01 UTC

Severity: normal

Done: Mathieu Othacehe <othacehe <at> gnu.org>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: Mark H Weaver <mhw <at> netris.org>
To: Leo Famulari <leo <at> famulari.name>, 52338 <at> debbugs.gnu.org
Subject: bug#52338: Crawler bots are downloading substitutes
Date: Fri, 10 Dec 2021 16:21:11 -0500
Hi Leo,

Leo Famulari <leo <at> famulari.name> writes:

> I noticed that some bots are downloading substitutes from
> ci.guix.gnu.org.
>
> We should add a robots.txt file to reduce this waste.
>
> Specifically, I see bots from Bing and Semrush:
>
> https://www.bing.com/bingbot.htm
> https://www.semrush.com/bot.html

For what it's worth: during the years that I administered Hydra, I found
that many bots disregarded the robots.txt file that was in place there.
In practice, I found that I needed to periodically scan the access logs
for bots and forcefully block their requests in order to keep Hydra from
becoming overloaded with expensive queries from bots.

     Regards,
       Mark




This bug report was last modified 3 years and 210 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.