Blocking 'good' bots in nginx with multiple conditions for certain off-limits URL's where humans can go

Posted by Glenn Plas on Server Fault See other posts from Server Fault or by Glenn Plas
Published on 2012-04-25T12:15:04Z Indexed on 2012/11/17 5:03 UTC
Read the original article Hit count: 455

Filed under:
|

After 2 days of searching/trying/failing I decided to post this here, I haven't found any example of someone doing the same nor what I tried seems to be working OK. I'm trying to send a 403 to bots not respecting the robots.txt file (even after downloading it several times). Specifically Googlebot. It will support the following robots.txt definition.

User-agent: *
Disallow: /*/*/page/

The intent is to allow Google to browse whatever they can find on the site but return a 403 for the following type of request. Googlebot seems to keep on nesting these links eternally adding paging block after block:

my_domain.com:80 - 66.x.67.x - - [25/Apr/2012:11:13:54 +0200] "GET /2011/06/
page/3/?/page/2//page/3//page/2//page/3//page/2//page/2//page/4//page/4//pag
e/1/&wpmp_switcher=desktop HTTP/1.1" 403 135 "-" "Mozilla/5.0 (compatible; G
ooglebot/2.1; +http://www.google.com/bot.html)"

It's a wordpress site btw. I don't want those pages to show up, even though after the robots.txt info got through, they stopped for a while only to begin crawling again later. It just never stops .... I do want real people to see this. As you can see, google get a 403 but when I try this myself in a browser I get a 404 back. I want browsers to pass.

root@my_domain:# nginx -V
nginx version: nginx/1.2.0

I tried different approaches, using a map and plain old nono if's and they both act the same: (under http section)

map $http_user_agent $is_bot {
default 0;
~crawl|Googlebot|Slurp|spider|bingbot|tracker|click|parser|spider 1;
}

(under the server section)

location ~ /(\d+)/(\d+)/page/ {
if ($is_bot) {
return 403; # Please respect the robots.txt file !
}
}

I recently had to polish up my Apache skills for a client where I did about the same thing like this :

# Block real Engines , not respecting robots.txt but allowing correct calls to pass
# Google
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Googlebot/2\.[01];\ \+http://www\.google\.com/bot\.html\)$ [NC,OR]
# Bing
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ bingbot/2\.[01];\ \+http://www\.bing\.com/bingbot\.htm\)$ [NC,OR]
# msnbot
RewriteCond %{HTTP_USER_AGENT} ^msnbot-media/1\.[01]\ \(\+http://search\.msn\.com/msnbot\.htm\)$ [NC,OR]
# Slurp
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Yahoo!\ Slurp;\ http://help\.yahoo\.com/help/us/ysearch/slurp\)$ [NC]

# block all page searches, the rest may pass
RewriteCond %{REQUEST_URI} ^(/[0-9]{4}/[0-9]{2}/page/) [OR]

# or with the wpmp_switcher=mobile parameter set
RewriteCond %{QUERY_STRING} wpmp_switcher=mobile

# ISSUE 403 / SERVE ERRORDOCUMENT
RewriteRule .* - [F,L]
# End if match

This does a bit more than I asked nginx to do but it's about the same principle, I'm having a hard time figuring this out for nginx.

So my question would be, why would nginx serve my browser a 404 ? Why isn't it passing, The regex isn't matching for my UA:

"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.30 Safari/536.5"

There are tons of example to block based on UA alone, and that's easy. It also looks like the matchin location is final, e.g. it's not 'falling' through for regular user, I'm pretty certain that this has some correlation with the 404 I get in the browser.

As a cherry on top of things, I also want google to disregard the parameter wpmp_switcher=mobile , wpmp_switcher=desktop is fine but I just don't want the same content being crawled multiple times.

Even though I ended up adding wpmp_switcher=mobile via the google webmaster tools pages (requiring me to sign up ....). that also stopped for a while but today they are back spidering the mobile sections.

So in short, I need to find a way for nginx to enforce the robots.txt definitions. Can someone shell out a few minutes of their lives and push me in the right direction please ?

I really appreciate ANY response that makes me think harder ;-)

© Server Fault or respective owner

Related posts about nginx

Related posts about crawler