CAPTCHA blocking for my scraping script?

Posted by Surabhil Sergy on Programmers See other posts from Programmers or by Surabhil Sergy
Published on 2014-08-16T05:20:03Z Indexed on 2014/08/19 16:29 UTC
Read the original article Hit count: 317

I am working on a scraping project which involves getting web data and parsing them for further use. I have been working using PHP and CURL to make scraping scripts which crawls web data and I make use of either PHP Dom or Simple HTML DOM Parser library for these kinds of projects.

On a recent project I encountered some challenges; initially I found the target website blocked my server IP such that the server could not make any successful requests to the site. Understanding these issues as common I bought a set of private proxies and tried to make request calls using them.

Though this could get successful response, I noticed the script is getting some kind of blocks after 2-3 continuous requests. On printing and checking the response I could see a pop-up asking for CAPTCHA validation. I could not see any captcha characters to be entered and it also shows an error “input error: invalid referrer”. On examining the source I could see some Google recaptcha scripts within. I’m stuck at this point and I m not able to execute my script.

My script is used for gathering data and it needs to go through a large number of pages periodically over the site. But in the current scenario I am not able to proceed with my script. I could see there are some options to overcome these captcha issues and scraping these kinds of sites too are common.

I have been checking my script performance and responses over last two months. I could see during first month I was able to execute very large number of requests from a single IP and I was able to get results. Later I get an IP block and used private proxies which could get me some results. Later I am facing now with the captcha trouble.

I would appreciate any help or suggestions in this regard.

(Often in this kind of questions I used to get a first comment as, ‘Have you asked for prior permission from the target?’ .I haven’t ,but I know there are many sites doing so to get the details out of sites and target sites may not often give access to them. I respect the legality and scraping etiquettes but I would like to know at what point I stuck and how could I overcome that! )

I could provide any supporting information if needed.

© Programmers or respective owner

Related posts about php

Related posts about programming-practices