GNU bug report logs - #52338
Crawler bots are downloading substitutes

Package: guix

Reported by: Leo Famulari <leo <at> famulari.name>

Date: Mon, 6 Dec 2021 21:22:01 UTC

Severity: normal

Done: Mathieu Othacehe <othacehe <at> gnu.org>

Bug is archived. No further changes may be made.

Message #35 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Tobias Geerinckx-Rice <me <at> tobias.gr>
To: Mark H Weaver <mhw <at> netris.org>
Cc: 52338 <at> debbugs.gnu.org, bug-guix <at> gnu.org, Leo Famulari <leo <at> famulari.name>
Subject: Re: bug#52338: Crawler bots are downloading substitutes
Date: Fri, 10 Dec 2021 23:52:51 +0100
All,

Mark H Weaver writes:
> For what it's worth: during the years that I administered Hydra, I found
> that many bots disregarded the robots.txt file that was in place there.
> In practice, I found that I needed to periodically scan the access logs
> for bots and forcefully block their requests in order to keep Hydra from
> becoming overloaded with expensive queries from bots.

Very good point.

IME (which is a few years old at this point) at least the 
highlighted BingBot & SemrushThing always respected my robots.txt, 
but it's definitely a concern.  I'll leave this bug open to remind 
us of that in a few weeks or so…

If it does become a problem, we (I) might add some basic 
User-Agent sniffing to either slow down or outright block 
non-Guile downloaders.  Whitelisting any legitimate ones, of 
course.  I think that's less hassle than dealing with dynamic IP 
blocks whilst being equally effective here.
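
As a rough illustration of that kind of User-Agent sniffing, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `classify` helper, the idea that Guile-based clients can be recognised by "Guile" in their User-Agent, and the whitelist and crawler patterns are all hypothetical, not the actual server configuration.

```python
import re

# Assumption: Guile-based downloaders carry "Guile" in their User-Agent.
GUILE_UA = re.compile(r"guile", re.IGNORECASE)

# Hypothetical whitelist of other legitimate downloaders.
WHITELIST = [
    re.compile(r"^curl/", re.IGNORECASE),
    re.compile(r"^Wget/", re.IGNORECASE),
]

# Hypothetical crawler markers, loosely based on the bots named in this thread.
CRAWLERS = [
    re.compile(r"bingbot", re.IGNORECASE),
    re.compile(r"semrush", re.IGNORECASE),
]


def classify(user_agent):
    """Return "allow", "block" or "throttle" for a User-Agent string."""
    ua = user_agent or ""  # treat a missing header like an empty one
    if GUILE_UA.search(ua):
        return "allow"
    if any(pattern.search(ua) for pattern in WHITELIST):
        return "allow"
    if any(pattern.search(ua) for pattern in CRAWLERS):
        return "block"
    # Unknown clients are slowed down rather than rejected outright.
    return "throttle"
```

In this sketch unknown clients are only throttled, so a legitimate downloader missing from the whitelist is slowed down instead of locked out. The User-Agent header is trivially spoofed, so this would only deter crawlers that identify themselves honestly.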

Thanks (again) for taking care of Hydra, Mark, and thank you Leo 
for keeping an eye on Cuirass :-)

T G-R

This bug report was last modified 3 years and 208 days ago.
