Detecting 'stealth' web-crawlers

Posted by Jacco on Stack Overflow
Published on 2008-10-24T11:46:52Z

Filed under:

web-development

What options are there to detect web-crawlers that do not want to be detected?

(I know that listing detection techniques will allow the smart stealth-crawler programmer to make a better spider, but I do not think that we will ever be able to block smart stealth-crawlers anyway, only the ones that make mistakes.)

I'm not talking about the nice crawlers such as Googlebot and Yahoo! Slurp. I consider a bot nice if it:

identifies itself as a bot in the user agent string
reads robots.txt (and obeys it)

I'm talking about the bad crawlers, hiding behind common user agents, using my bandwidth and never giving me anything in return.

There are some trapdoors that can be constructed. Updated list (thanks Chris, gs):

Add a directory that is listed only in robots.txt (marked as Disallow)
Add invisible links (possibly marked as rel="nofollow"?)
- style="display: none;" on the link or its parent container
- placed underneath another element with a higher z-index
Detect who doesn't understand CaPiTaLiSaTioN
Detect who tries to post replies but always fails the CAPTCHA
Detect GET requests to POST-only resources
Detect the interval between requests
Detect the order of pages requested
Detect who (consistently) requests HTTPS resources over HTTP
Detect who does not request image files (this, combined with a list of user agents of known image-capable browsers, works surprisingly well)
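
Three of the traps in this list (the disallowed directory, GET requests to POST-only resources, and the request interval) could be checked against a request log roughly like this. It is only a minimal sketch: the tuple layout, the `/trap/` prefix, the `/comment/submit` path, and the 0.5 second threshold are all assumptions made for illustration, not values from any real site.

```python
from collections import defaultdict

# Hypothetical values: adjust to your own site layout.
TRAP_PREFIX = "/trap/"            # directory listed only as Disallow in robots.txt
POST_ONLY = {"/comment/submit"}   # resources that normal clients only POST to
MIN_INTERVAL = 0.5                # seconds; anything faster looks automated


def find_trap_hits(requests):
    """Return {ip: set of trap names}.

    `requests` is an iterable of (ip, timestamp_seconds, method, path)
    tuples, assumed to be sorted by timestamp.
    """
    hits = defaultdict(set)
    last_seen = {}  # ip -> timestamp of that client's previous request
    for ip, ts, method, path in requests:
        if path.startswith(TRAP_PREFIX):
            hits[ip].add("disallowed-directory")
        if method == "GET" and path in POST_ONLY:
            hits[ip].add("get-on-post-only")
        if ip in last_seen and ts - last_seen[ip] < MIN_INTERVAL:
            hits[ip].add("too-fast")
        last_seen[ip] = ts
    return dict(hits)
```

A single hit is weak evidence (an impatient human can trip the interval check), so it is probably safer to act only on clients that trigger more than one trap.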

Some traps would be triggered by both 'good' and 'bad' bots. You could combine those with a whitelist:

It triggers a trap
It requested robots.txt?
It does not trigger another trap because it obeyed robots.txt
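
One possible reading of this whitelist check, as a small sketch: a client that trips a trap shared by good and bad bots is still treated as nice if it fetched robots.txt and stayed out of the disallowed directory. The trap names and the returned labels are hypothetical, chosen only for this example.

```python
def classify(triggered_traps, requested_robots_txt):
    """Classify one client from the set of trap names it triggered.

    Assumption: "disallowed-directory" is the trap that only a client
    ignoring robots.txt would trigger.
    """
    if not triggered_traps:
        return "no-signal"
    obeyed_robots = (
        requested_robots_txt
        and "disallowed-directory" not in triggered_traps
    )
    if obeyed_robots:
        return "probably-nice-bot"
    return "suspect"
```

For example, `classify({"no-images"}, True)` returns `"probably-nice-bot"`, while the same trap without a robots.txt request returns `"suspect"`.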

One other important thing here is:
Please consider blind people using screen readers: give people a way to contact you, or let them solve a (non-image) CAPTCHA to continue browsing.

What methods are there to automatically detect the web crawlers trying to mask themselves as normal human visitors?

Update
The question is not: How do I catch every crawler. The question is: How can I maximize the chance of detecting a crawler.

Some spiders are really good, and actually parse and understand HTML, XHTML, CSS, JavaScript, VBScript, etc.
I have no illusions: I won't be able to beat them.

You would, however, be surprised how stupid some crawlers are. The best example of stupidity (in my opinion) is casting all URLs to lower case before requesting them.
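
The lower-casing mistake could be spotted with something like the sketch below, assuming the site publishes at least some mixed-case URLs and that the server compares paths case-sensitively. The function name and the `known_paths` set are hypothetical.

```python
def is_lowercased_request(path, known_paths):
    """Return True if `path` is not a real URL on the site but equals
    the lower-cased form of a real mixed-case URL."""
    if path in known_paths:
        return False  # exact match: an ordinary, valid request
    return any(
        real != real.lower() and real.lower() == path
        for real in known_paths
    )
```

A browser following a real link never produces such a request, so a hit here is a fairly strong signal, although a person retyping a URL by hand could trigger it too.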

And then there is a whole bunch of crawlers that are just 'not good enough' to avoid the various trapdoors.
