GNU bug report logs - #52338
Crawler bots are downloading substitutes


Package: guix;

Reported by: Leo Famulari <leo <at> famulari.name>

Date: Mon, 6 Dec 2021 21:22:01 UTC

Severity: normal

Done: Mathieu Othacehe <othacehe <at> gnu.org>



From: Tobias Geerinckx-Rice <me <at> tobias.gr>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: 52338 <at> debbugs.gnu.org, leo <at> famulari.name
Subject: bug#52338: Crawler bots are downloading substitutes
Date: Thu, 09 Dec 2021 16:42:24 +0100
Mathieu Othacehe wrote:
> Hello Leo,
>
>> +           (nginx-location-configuration
>> +             (uri "/robots.txt")

It's a micro-optimisation, but it can't hurt to generate 
‘location = /robots.txt’ instead of ‘location /robots.txt’ here.

>> +             (body
>> +               (list
>> +                 "add_header  Content-Type  text/plain;"
>> +                 "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))

Use \r\n instead of \n, even if \n happens to work.

There are many ‘buggy’ crawlers out there.  It's in their own 
interest to be fussy about parsing robots.txt whilst claiming to 
respect it.  The less you deviate from the most basic norm 
imaginable, the better.

I tested whether embedding raw \r\n bytes in nginx.conf strings 
like this works, and it seems to, even though a human would 
probably not write them that way.
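
For clarity, here's roughly what the generated location block could 
look like with both tweaks applied, written the way a human would 
(\r\n escapes rather than raw newline bytes); the Scheme above may 
of course serialise whitespace differently:

 location = /robots.txt {
   add_header  Content-Type  text/plain;
   return 200 "User-agent: *\r\nDisallow: /nar/\r\n";
 }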

> Nice, the bots are also accessing the Cuirass web interface, do you
> think it would be possible to extend this snippet to prevent it?

You can replace ‘/nar/’ with ‘/’ to disallow everything:

 Disallow: /

If we want crawlers to index only the front page (so people can 
search for ‘Guix CI’, I guess), that's possible:

 Disallow: /
 Allow: /$

Don't read that ‘$’ as ‘supports regexps’; it's only an end-of-URL 
anchor, and buggy bots that don't understand it might fall back to 
‘Disallow: /’.

This is where it gets ugly: nginx doesn't support escaping ‘$’ in 
strings.  At all.  It's insane.

 geo $dollar { default "$"; }  # stackoverflow.com/questions/57466554
 server {
   location = /robots.txt {
     return 200
     "User-agent: *\r\nDisallow: /\r\nAllow: /$dollar\r\n";
   }
 }
*Obviously.*

An alternative to that is to serve a real on-disc robots.txt.
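
A minimal sketch of that, assuming a hand-written file at a purely 
hypothetical path like /srv/www/robots.txt:

 location = /robots.txt {
   # hypothetical path; point this at wherever the real file lives
   alias /srv/www/robots.txt;
 }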

Kind regards,

T G-R
