Search Results

Search found 346 results on 14 pages for 'scraping'.

Page 1/14

  • PHP Screen Scraping Class

    - by BRADINO
    After some positive feedback I have decided to continue developing the PHP Screen Scraping class. This post will serve as the permanent home for the class. Download PHP Screen Scraping Class. Updates: 2009-07-30 - Added setHeader() function

    Read the article

  • CAPTCHA blocking for my scraping script?

    - by Surabhil Sergy
    I am working on a scraping project which involves getting web data and parsing it for further use. I have been working with PHP and cURL to make scraping scripts that crawl web data, and I use either PHP DOM or the Simple HTML DOM Parser library for these kinds of projects. On a recent project I encountered some challenges: initially the target website blocked my server IP, so the server could not make any successful requests to the site. Understanding this is a common issue, I bought a set of private proxies and tried to make the requests through them. Though this got a successful response, I noticed the script gets blocked after 2-3 consecutive requests. On printing and checking the response I could see a pop-up asking for CAPTCHA validation. I could not see any CAPTCHA characters to enter, and it also shows the error "input error: invalid referrer". On examining the source I could see some Google reCAPTCHA scripts within. I'm stuck at this point and not able to execute my script.

    My script is used for gathering data and it needs to go through a large number of pages on the site periodically, but in the current scenario I am not able to proceed. I can see there are options for overcoming these CAPTCHA issues, and scraping these kinds of sites is common. I have been checking my script's performance and responses over the last two months. During the first month I was able to execute a very large number of requests from a single IP and get results. Later I got an IP block and used private proxies, which got me some results. Now I am facing the CAPTCHA trouble. I would appreciate any help or suggestions in this regard.

    (Often on this kind of question the first comment is, "Have you asked for prior permission from the target?" I haven't, but I know there are many sites doing this to get details out of other sites, and target sites may not often give access to them. I respect the legality and scraping etiquette, but I would like to know at what point I am stuck and how I could overcome it!) I can provide any supporting information if needed.
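
    A minimal sketch of the kind of request the script is making, assuming a cURL-based fetch; the proxy address, referrer value, cookie path and target URL are placeholders, and sending a plausible Referer header is only a guess at what the "invalid referrer" check expects:

        <?php
        // Hedged sketch: fetch a page through one of the private proxies,
        // sending a Referer and a browser-like User-Agent. All values are placeholders.
        $ch = curl_init('http://target-site.example/listing?page=1');
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_PROXY, '203.0.113.10:8080');             // one of the purchased proxies
        curl_setopt($ch, CURLOPT_REFERER, 'http://target-site.example/'); // guess at the expected referrer
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)');
        curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/scrape_cookies.txt');  // reuse cookies between requests
        curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/scrape_cookies.txt');
        $html = curl_exec($ch);
        curl_close($ch);
        sleep(rand(3, 8)); // pause before the next request so bursts of 2-3 calls don't trip the block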

    Read the article

  • Screen Scraping - how to get AJAX based filtered data

    - by Muhammad Akhtar
    Hi, I am working on screen scraping. It is easy when the filtering is in the query string, but the problem is with AJAX-based filtering. For example, here is a sample URL: when you open this page, enter a hotel name and click Go, the AJAX filter runs and shows the results accordingly, or if you click Next Page it shows the next records, again via AJAX. Please suggest how to handle these kinds of issues when working on screen scraping. Thanks a lot
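
    One common approach is to skip the rendered page and replicate the AJAX request itself; this is a hedged sketch, and the endpoint (/HotelSearch.aspx/Filter), parameter names and JSON shape are assumptions about what the browser's network tab would reveal, not anything from the original post:

        <?php
        // Hedged sketch: post the same parameters the page's AJAX call sends.
        // The URL, field names and response format below are hypothetical.
        $ch = curl_init('http://example-hotels.com/HotelSearch.aspx/Filter');
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_POST           => true,
            CURLOPT_POSTFIELDS     => json_encode(array('hotelName' => 'Hilton', 'page' => 2)),
            CURLOPT_HTTPHEADER     => array('Content-Type: application/json'),
        ));
        $response = curl_exec($ch);
        curl_close($ch);
        $rows = json_decode($response, true); // parse whatever the endpoint actually returns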

    Read the article

  • C# Network Programming - HttpWebRequest Scraping

    - by masterguru
    Hi, I am building a web scraping application. It should scrape a complex web site with concurrent HttpWebRequests from a single host to a single target web server, and it should run on Windows Server 2008. A single HttpWebRequest for data can take from 1 to 4 minutes to complete (because of long-running DB operations). I need at least 100 parallel requests to the target web server, but I have noticed that when I use more than 2-3 long-running requests I get big performance issues (request timeouts/hanging). How many concurrent requests can I have in this scenario from a single host to a single target web server? Can I use thread pools in the application to run parallel HttpWebRequests to the server? Will I have any issues with the default outbound HTTP connection/request limits? What about request timeouts when I reach the outbound connection limits? What would be the best setup for my scenario? Any help would be appreciated. Thanks

    Read the article

  • Facebook fan page photos scraping

    - by Daan Poron
    Hi, we want to add a Facebook photo competition to our fan page. The idea is that people can upload photos and others can like them; the person with the most likes on their photo wins a prize. Now I was wondering if anyone has a good idea on how to get a snapshot of all the photos at a given moment, so that when we want to stop the contest we get an overview of the number of likes for every participant. Some good website scraping tools? Maybe a useful Facebook app? Some other alternatives? Greets, Daan

    Read the article

  • Screen Scraping Twitter

    - by BRADINO
    I got an email today asking for help to scrape Twitter, in particular to be able to log in. So I am going to show everyone, NOT to encourage anyone to violate Twitter's terms of use, but as an educational blog post about how PHP and cURL can be used to post variables and store cookies. Again, I am using the cScrape class I wrote, which you can download.

    Step 1: First go to twitter.com and look at the source code of the login form to get the form field names and the form post location. You will see that the form posts to https://twitter.com/sessions and that the username and password fields are session[username_or_email] and session[password] respectively.

    Step 2: Now you are ready to log in. Using the fetch function in the Scrape class, you create an associative array to contain the form values you want to post. The other thing you will need to do is uncomment the lines for CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR. Cookies will be required to stay logged in and scrape around, and the paths to the cookie files need to be writable by your app. You will also need to uncomment the line about CURLOPT_FOLLOWLOCATION.

        $data = array('session[username_or_email]' => "bradino", 'session[password]' => "secret");
        $scrape->fetch('https://twitter.com/sessions', $data);

    Step 1.5: Oops, that didn't work. All I got back was 403 Forbidden: "The server understood the request, but is refusing to fulfill it." Ahhh, I see another variable called authenticity_token; I bet Twitter was looking for that. So let's back up and first hit twitter.com to get the authenticity_token variable, and then make the login post request with that variable included in our array of parameters.

        $scrape->fetch('https://twitter.com');
        $data = array('session[username_or_email]' => "bradino", 'session[password]' => "secret");
        $data['authenticity_token'] = $scrape->fetchBetween('name="authenticity_token" type="hidden" value="', '"', $scrape->result);
        $scrape->fetch('https://twitter.com/sessions', $data);
        echo $scrape->result;

    So that's basically it. Now you are logged in and can scrape around and request other pages as you normally would. Sorry it wasn't a longer post. I really do enjoy this kind of stuff, so if anyone has a request, hit me up.

    Errors?
    1) Make sure that you are properly parsing the token variable.
    2) Make sure that you uncommented the lines about CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR; those options need to be enabled, and be sure the path set is writable by your application.
    3) Make sure that the path to the cookie file is writable and that it is getting data written to it.
    4) If you get a message about being redirected, you need to uncomment the line about CURLOPT_FOLLOWLOCATION so that option is enabled (true).
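
    For reference, this is roughly what the uncommented cookie and redirect options look like in a cURL setup; a hedged sketch only, since the exact handle variable inside cScrape isn't shown here and the cookie path is a placeholder:

        // Assumed to sit where the class configures its cURL handle ($ch here is an assumption).
        curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/twitter_cookies.txt');  // read cookies from this file
        curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/twitter_cookies.txt');   // write cookies back when the handle closes
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);                    // follow the redirect sent after login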

    Read the article

  • Improving performance for web scraping code

    - by Pankaj Upadhyay
    I have a website in which the code scrapes other websites to get accurate data. While the code works, there is a noticeable lag in performance because the code first downloads the HTML stream from various sites (sometimes nine websites), extracts the relevant part and then renders the HTML page. What should I do to get optimal performance? Should I change from shared hosting (GoDaddy) to my own server, or does it have nothing to do with my hosting and do I just need to make changes to my code?
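
    Since most of the lag is nine sequential downloads, fetching the source sites in parallel is one code-level change worth trying before switching hosts; a minimal sketch using PHP's curl_multi interface, assuming the scraper is PHP-based (the question doesn't say) and using placeholder URLs:

        <?php
        // Hedged sketch: download several source pages concurrently instead of one after another.
        $urls = array('http://source-one.example/', 'http://source-two.example/', 'http://source-three.example/');
        $mh = curl_multi_init();
        $handles = array();
        foreach ($urls as $url) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_TIMEOUT, 10);   // don't let one slow site stall the whole page
            curl_multi_add_handle($mh, $ch);
            $handles[$url] = $ch;
        }
        $running = null;
        do {
            curl_multi_exec($mh, $running);
            curl_multi_select($mh);                  // wait for activity instead of busy-looping
        } while ($running > 0);
        $pages = array();
        foreach ($handles as $url => $ch) {
            $pages[$url] = curl_multi_getcontent($ch);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);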

    Read the article

  • Web scraping etiquette

    - by Ash
    I'm considering writing a simple web scraping application to extract information from a website that does not seem to specifically prohibit this. I've checked for other alternatives (e.g. RSS, web service) to get this information, but there are none available at this stage. I've also developed/maintained a few websites myself, so I realize that if web scraping is done naively/greedily it can slow things down for other users and generally become a nuisance. So, what etiquette is involved in terms of:

    - Number of requests per second/minute/hour.
    - HTTP User-Agent content.
    - HTTP Referer content.
    - HTTP cache settings.
    - Buffer size for larger files/resources.
    - Legalities and licensing issues.
    - Good tools or design approaches to use.
    - Robots.txt: is this relevant for web scraping or just for crawlers/spiders?
    - Compression such as gzip in requests.

    Update: Found this relevant question on Meta: Etiquette of Screen Scraping StackOverflow. Jeff Atwood's answer has some helpful recommendations. Other related StackOverflow questions: Options for HTML scraping; Legalities of screen scraping.
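
    As a rough illustration of several of those points together (rate limiting, an honest User-Agent with contact details, and gzip), here is a hedged PHP sketch; the delay, UA string and URLs are placeholder choices rather than established standards:

        <?php
        // Hedged sketch of a "polite" fetch loop: identify yourself, accept gzip, pause between requests.
        $urls = array('http://example.org/page/1', 'http://example.org/page/2');
        foreach ($urls as $url) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_USERAGENT, 'MyScraper/1.0 (+mailto:ops@example.org)'); // honest UA with contact
            curl_setopt($ch, CURLOPT_ENCODING, 'gzip');   // ask for compressed responses to save bandwidth
            curl_setopt($ch, CURLOPT_TIMEOUT, 15);
            $html = curl_exec($ch);
            curl_close($ch);
            // ... parse $html ...
            sleep(2); // arbitrary pause; err on the side of fewer requests per minute
        }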

    Read the article

  • methods for preventing large scale data scraping from REST api

    - by Simon Kenyon Shepard
    I know the immediate answer to this is going to be that there is no 100% reliable method of doing this, but I'd like to create a question that details the different possibilities, the difficulty of implementing them and their success rates. I would like to range from simple software-level IP/request-rate analysis to high-end, sophisticated software/hardware tools, e.g. neural networks, with the goal of predicting and preventing bogus requests and attempts to scrape the service. Many thanks.
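
    At the simple end of that range, the IP/request-rate analysis can be a per-IP counter in front of the API; a hedged sketch using APCu as the counter store (the threshold and the choice of APCu are assumptions, and a shared store such as Redis would be needed across multiple servers):

        <?php
        // Hedged sketch: reject clients that exceed N requests per minute from one IP.
        function allow_request($ip, $limit = 60) {
            $key = 'rate:' . $ip . ':' . date('YmdHi');   // one bucket per IP per minute
            if (!apcu_exists($key)) {
                apcu_store($key, 0, 120);                 // bucket lives slightly longer than the minute
            }
            return apcu_inc($key) <= $limit;
        }

        if (!allow_request($_SERVER['REMOTE_ADDR'])) {
            http_response_code(429);                      // Too Many Requests
            exit('Rate limit exceeded');
        }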

    Read the article

  • Screen scraping over SSL with .NET

    - by Even Mien
    What solutions exist for screen scraping a site over SSL for use with .NET? My use case is that I need to log in to a partner website (https), navigate through a dynamic hierarchy, and download a zipped file of reports. I certainly could use other screen scrapers if there are no good viable options in .NET, either through the framework or OSS.

    Read the article

  • screen scraping

    - by sam
    Hello folks, I am screen scraping a website which is in Danish. I am unable to scrape certain characters, like må. Any idea how to solve this? Thanks
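
    This usually turns out to be a character-encoding mismatch rather than a scraping problem; a hedged sketch of normalizing the fetched page to UTF-8, assuming the Danish site serves ISO-8859-1 (check its Content-Type header or meta charset before hard-coding that):

        <?php
        // Hedged sketch: convert the raw response to UTF-8 before parsing it.
        $raw = file_get_contents('http://example.dk/side.html');   // placeholder URL
        $encoding = mb_detect_encoding($raw, array('UTF-8', 'ISO-8859-1', 'Windows-1252'), true);
        $html = mb_convert_encoding($raw, 'UTF-8', $encoding ?: 'ISO-8859-1');
        // Characters like "må" should now survive intact in $html.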

    Read the article

  • Scraping data from Flash (Games)

    - by awegawef
    I saw this video, and I am really curious how it was performed. Does anyone have any ideas? My intuition is that he scraped pixels from the screen (one per 'box'), and then fed that into some program to determine the next move. Is scraping pixel-by-pixel the way to do this, or is there a better way? I am looking to do something similar with either Java or Python. Thanks

    Read the article

  • HTML Scraping in PHP

    - by tsellon
    I've been doing some html scraping in PHP using regular expressions. This works, but the result is finicky and fragile. Has anyone used any packages that provide a more robust solution? A config driven solution would be ideal, but I'm not picky.
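
    The usual move away from regex-based scraping in PHP is the built-in DOM extension, which tolerates messy markup and lets you query with XPath; a small sketch, with the URL and the XPath expression as placeholder assumptions:

        <?php
        // Hedged sketch: parse HTML with DOMDocument and query it with XPath instead of regexes.
        $html = file_get_contents('http://example.com/listing.html');   // placeholder URL
        $doc = new DOMDocument();
        libxml_use_internal_errors(true);        // real-world HTML is rarely valid; silence the warnings
        $doc->loadHTML($html);
        libxml_clear_errors();
        $xpath = new DOMXPath($doc);
        foreach ($xpath->query("//table[@class='results']//tr/td[1]") as $cell) {  // placeholder XPath
            echo trim($cell->textContent), "\n";
        }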

    Read the article

  • Scraping &#151 character (long dash) error in Nokogiri

    - by DavidP6
    I am having trouble scraping a certain long dash that is encoded as &#151; on the Time magazine site. It looks like this: —. It works fine when the dash is encoded as &mdash;, but when the problem dash is scraped it is returned as unknown characters. I am using Nokogiri and am wondering if I have to use some sort of special encoding? The page says it is encoded with UTF-8.

    Read the article

  • page posting issue when working in Screen Scraping

    - by Muhammad Akhtar
    Hi, I am working on screen scraping and have done it successfully on three websites, but I have an issue with the last website. Here is my URL: when I hit it with my parameter, it shows the result on the next page, simply posting to another page and showing the result fine there (here is my test). However, when I hit it from my application, since there I don't have an option to post, it only fetches the HTML of the requested page, which is obviously my above-mentioned HTML test link that actually has the parameter in the URL to get the result. How can I handle this situation? Please give me a hint. Thanks. Here is my C# code; I am using HtmlAgilityPack:

        // Load the page with HtmlAgilityPack (GET only)
        String url;
        HtmlWeb hw = new HtmlWeb();
        HtmlDocument doc;
        url = "http://mysampleURL";
        doc = hw.Load(url);

    Read the article

  • Screen Scraping

    - by Sambo
    Hi, I'm trying to implement a screen scraping scenario on my website and have the following set up so far. What I'm ultimately trying to do is replace all links in the $results variable that have "ResultsDetails.aspx?" with "results-scrape-details/" and then output it again. Can anyone point me in the right direction?

        <?php
        $url = "http://mysite:90/Testing/label/stuff/ResultsIndex.aspx";
        $raw = file_get_contents($url);
        $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
        $content = str_replace($newlines, "", html_entity_decode($raw));
        $start = strpos($content,"<div id='pageBack'");
        $end = strpos($content,'</body>',$start) + 6;
        $results = substr($content,$start,$end-$start);
        $pattern = 'ResultsDetails.aspx?';
        $replacement = 'results-scrape-details/';
        preg_replace($pattern, $replacement, $results);
        echo $results;
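
    Two things stand out in that snippet and are the likely fix, offered here as a hedged suggestion: preg_replace expects a delimited regular expression and returns the new string rather than modifying its argument, so for a plain-text substitution str_replace with the result assigned back is simpler:

        <?php
        // Hedged fix: plain string substitution, with the return value captured.
        $results = str_replace('ResultsDetails.aspx?', 'results-scrape-details/', $results);
        echo $results;

        // Or, if a regex is really wanted, the pattern needs delimiters and escaping,
        // and the result still has to be assigned:
        $results = preg_replace('/ResultsDetails\.aspx\?/', 'results-scrape-details/', $results);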

    Read the article

  • What's the requests/second standard for scraping websites?

    - by feydr
    This was the closest question to my question and it wasn't really answered very well, IMO: http://stackoverflow.com/questions/2022030/web-scraping-etiquette I'm looking for the answer to #1: how many requests/second should you be doing to scrape? Right now I pull from a queue of links. Every site that gets scraped has its own thread and sleeps for 1 second in between requests. I ask for gzip compression to save bandwidth. Are there standards for this? Surely all the big search engines have some set of guidelines they follow in regard to this.

    Read the article

  • Screen Scraping HTML with C#

    - by WildBill
    I have been given the task at work of screen scraping one of our legacy web apps to extract certain data from the code. The data is formatted and "should" be displayed exactly the same every time. I am just not sure how to go about doing this. It's a full HTML file with header and footer navigations, but in the middle of all this is the data I need. I need to extract the Company Name value, Contact Name, Telephone, email address, etc. Here is an example of what the code looks like:

        ...html above here
        <br /><br />
        <table cellpadding="0" cellspacing="12" border="0">
          <tr>
            <td valign="top" align="center">
              <!-- Company Info -->
              <table cellpadding="0" cellspacing="0" border="0">
                <tr>
                  <td class="black">
                    <table cellspacing="1" cellpadding="0" border="0" width="370">
                      <tr>
                        <th>ABC INDUSTRIES</th>
                      </tr>
                      <tr>
                        <td class="search">
                          <table cellpadding="5" cellspacing="0" border="0" width="100%">
                            <tr>
                              <td>
                                <table cellpadding="1" cellspacing="0" border="0" width="100%">
                                  <tr>
                                    <td align="center" colspan="2"><hr></td>
                                  </tr>
                                  <tr>
                                    <td align="right" nowrap><b><font color="FF0000">Contact Person&nbsp;<img src="/images/icon_contact.gif" align="absmiddle">&nbsp;:</font></b></td>
                                    <td align="left" width="100%">&nbsp;Joe Smith</td>
                                  </tr>
                                  <tr>
                                    <td align="right" nowrap><b><font color="FF0000">Phone Number&nbsp;<img src="/images/icon_phone.gif" align="absmiddle">&nbsp;:</font></b></td>
                                    <td align="left" width="100%">&nbsp;555-555-5555</td>
                                  </tr>
                                  <tr>
                                    <td align="right" nowrap><b><font color="FF0000">E-mail Address&nbsp;<img src="/images/icon_email.gif" align="absmiddle">&nbsp;:</font></b></td>
                                    <td align="left" width="100%">&nbsp;<a HREF="mailto:[email protected]">[email protected]</a></td>
                                  </tr>
        more...

    There is more code on the screen in a different table structure that I also need to pull.

    Read the article

  • How to work around a site forbidding me to scrape their images with PHP

    - by Petruza
    I'm scraping a site, searching for JPGs to download. Scraping the site's HTML pages works fine, but when I try getting the JPGs with cURL, copy(), fopen(), etc., I get a 403 Forbidden status. I know that's because the site owners don't want their images scraped, so I understand a good answer would be "just don't do it, because they don't want you to". Ok, but let's say it's ok and I try to work around this: how could this be achieved? If I get the same URL with a browser, I can open the image perfectly, so it's not that my IP is banned or anything, and I'm testing the scraper one file at a time, so it's not blocking me because I make too many requests too often. From my understanding, it could be that the site is checking for some cookies that confirm that I'm using a browser and browsing their site before I download a JPG, or maybe PHP is using some user agent for the requests that the server can detect and filter out. Anyway, any ideas?
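
    Both of those guesses point at request headers; a hedged sketch of fetching one image with cURL while sending a browser-like User-Agent, a Referer from the page that links the image, and the cookies collected while scraping the HTML (the file paths and URLs are placeholders):

        <?php
        // Hedged sketch: request the JPG the way a browser would after viewing the page.
        $imageUrl = 'http://example.com/photos/1234.jpg';              // placeholder
        $pageUrl  = 'http://example.com/gallery/1234.html';            // page that links to the image
        $ch = curl_init($imageUrl);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)');
        curl_setopt($ch, CURLOPT_REFERER, $pageUrl);                   // many image hosts check this
        curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/site_cookies.txt'); // reuse cookies from the HTML scrape
        $jpg = curl_exec($ch);
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
        if ($status === 200 && $jpg !== false) {
            file_put_contents('/tmp/1234.jpg', $jpg);
        }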

    Read the article

  • HTML Agility Pack Screen Scraping XPATH isn't returning data

    - by Matthias Welsh
    I'm attempting to write a screen scraper for Digikey that will allow our company to keep accurate track of pricing, part availability and product replacements when a part is discontinued. There seems to be a discrepancy between the XPath that I'm seeing in Chrome DevTools as well as Firebug on Firefox and what my C# program is seeing. The code I'm currently using is pretty quick and dirty...

        //This function retrieves data from the digikey
        private static List<string> ExtractProductInfo(HtmlDocument doc)
        {
            List<HtmlNode> m_unparsedProductInfoNodes = new List<HtmlNode>();
            List<string> m_unparsedProductInfo = new List<string>();

            //Base Node for part info
            string m_baseNode = @"//html[1]/body[1]/div[2]";

            //Write part info to list
            m_unparsedProductInfoNodes.Add(doc.DocumentNode.SelectSingleNode(m_baseNode + @"/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]"));
            //More lines of similar form will go here for more info

            //this retrieves digikey PN
            foreach(HtmlNode node in m_unparsedProductInfoNodes)
            {
                m_unparsedProductInfo.Add(node.InnerText);
            }
            return m_unparsedProductInfo;
        }

    Although the path I'm using appears to be "correct", I keep getting NULL when I look at the list "m_unparsedProductInfoNodes". Any idea what's going on here? I'll also add that if I do a SelectNodes on the baseNode it only returns a div... not sure what that indicates, but it doesn't seem right.

    Read the article
