GNU bug report logs -
#52338
Crawler bots are downloading substitutes
Reported by: Leo Famulari <leo <at> famulari.name>
Date: Mon, 6 Dec 2021 21:22:01 UTC
Severity: normal
Done: Mathieu Othacehe <othacehe <at> gnu.org>
Message #35 received at submit <at> debbugs.gnu.org (full text, mbox):
All,
Mark H Weaver wrote:
> For what it's worth: during the years that I administered Hydra,
> I found that many bots disregarded the robots.txt file that was
> in place there. In practice, I found that I needed to
> periodically scan the access logs for bots and forcefully block
> their requests in order to keep Hydra from becoming overloaded
> with expensive queries from bots.
Very good point.
IME (which is a few years old at this point) at least the
highlighted BingBot & SemrushThing always respected my robots.txt,
but it's definitely a concern. I'll leave this bug open to remind
us of that in a few weeks or so…
If it does become a problem, we (I) might add some basic
User-Agent sniffing to either slow down or outright block
non-Guile downloaders, whitelisting any legitimate ones, of
course. I think that's less hassle than dealing with dynamic IP
blocks whilst being equally effective here.
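For illustration, a minimal sketch of what such User-Agent
filtering could look like if the substitute server sits behind
nginx. All names here (the server name, the "GNU Guile" agent
pattern, the whitelist entries) are assumptions for the sketch,
not a description of the actual Cuirass/Hydra deployment:

```nginx
# Classify clients by User-Agent. The assumption here is that the
# Guix client identifies itself with a Guile-based User-Agent;
# everything else is blocked unless explicitly whitelisted.
map $http_user_agent $blocked_agent {
    default          1;   # unknown agents: block
    "~GNU Guile"     0;   # assumed pattern for Guix's downloader
    "~SomeGoodBot"   0;   # hypothetical whitelist entry
    ""               1;   # empty User-Agent: block
}

server {
    listen 80;
    server_name substitutes.example.org;   # illustrative name

    location /nar/ {
        if ($blocked_agent) {
            return 403;
        }
        # ... proxy to the substitute server here ...
    }
}
```

A `limit_req` zone keyed on the same map variable could slow
offenders down instead of rejecting them outright, if that turns
out to be gentler on false positives.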
Thanks (again) for taking care of Hydra, Mark, and thank you Leo
for keeping an eye on Cuirass :-)
T G-R