Search Results

Search found 346 results on 14 pages for 'scraping'.

Page 4/14 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >

rcurl web scraping timeout exits program

- by user1742368

I am using a loop and rcurl scrape data from multiple pages which seems to work fine at certain times but fails when there is a timeout due to the server not responding. I am using a timeout=30 which traps the timeout error however the program stops after the timeout. i would like the progrm to continue to the next page when the timeout occurrs but cant figureout how to do this? url = getCurlHandle(cookiefile = "", verbose = TRUE) Here is the statement I am using that causes the timeout. I am happy to share the code if there is interest. webpage = getURLContent(url, followlocation=TRUE, curl = curl,.opts=list( verbose = TRUE, timeout=90, maxredirs = 2)) woodwardjj

Read the article
json service from data scraping with php

- by fredz0003

I am trying to figure out what is the best way to make this work, I am new to php. I was able to make my script work to find specific data on my htm file with the following script tested on my local server. <?php include ('simple_html_dom.php'); //create DOM from URL or local file $html = file_get_html ('Lotto Texas.htm'); //find td class name currLotWinnum and store in variable winNumbers foreach($html ->find('td.currLotWinnum') as $winNumbers) //print winNumbers echo "<b>The winning numbers are</b><br>"; echo $winNumbers -> innertext . '<br>'; ?> Need some light here, ultimately I would like to create a web service to return json format and access that data from my iOS application using NSJSONSerialization class.

Read the article
Scraping html WITHOUT uniquie identifiers using python

- by Nicholas Law

I would like to design an algorithm using python that scrapes thousands of pages like this one and this one, gathers all the data and inserts it into a MySQL database. The script will be run on a weekly or bi-weekly basis to update the database of any new information added to each individual page. Ideally I would like a scraper that is easy to work with for table structured data but also data that does not have unique identifiers (ie. id and classes attributes). Which scraper add-on should I use? BeautifulSoup, Scrapy or Mechanize? Are there any particular tutorials/books I should be looking at for this desired result? In the long-run I will be implementing a mobile app that works with all this data through querying the database.

Read the article
Scraping non-absolute URL

- by cooldude

I am trying to scrape www.weather.bm. I want all 10 radar images, but I can only get one (the image updates regularly) and it's not a absolute image url. I was hoping I could use the image as a image slideshow like the link but dont know how. Also, how can I remove images/Radarlegend.png? I just need the radar images. Here is my code: include('simple_html_dom.php'); $html = file_get_html('http://www.weather.bm/radarMobile.asp'); foreach($html->find('img') as $element) echo $element->src . '<br>' My output is: <div id="main"> images/Radar/CurrentRadarAnimation_100km_sri/100km_sri-radar-2011-01-04-1556.jpg<br>images/Radarlegend.png<br></div> </div>

Read the article
Perl scraping script not recognising certain characters

- by user1849286

I have a script that works fine locally but on the server fails. It displays the non-breaking space symbol   as ? when printing to standard output. In the parsing of the page, if I try to get rid of non breaking space symbol with s/ //g nothing happens, neither getting rid of the question mark s/?//g It seems to stick no matter what. Bizzarely, this is not an issue when running the script locally. Additionally, question marks within a diamond symbol are inserted everywhere (on both the server script and the local script) instead of apostrophes, although at least that is not causing the parsing of the page to break on the local page. Confused, pls help.

Read the article
How to insert scraping data to mysql

- by user1887288

i am fetching data from other websites can any one tell me how to insert fetch data to mysql database Below code i am using to fetch results coming $urls = $_POST["urls"]; require_once('simple_html_dom.php'); $useragent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)'; foreach ($urls as $url) { $curl = curl_init(); curl_setopt($curl, CURLOPT_URL, $url); curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 20); curl_setopt($curl, CURLOPT_USERAGENT, $useragent); $str = curl_exec($curl); curl_close($curl); $html= str_get_html($str); foreach($html->find('span.price') as $e) echo $e->innertext . '<br>'; }

Read the article
PHP equivalent of PyQuery or Nokogiri?

- by Swaroop C H

Basically, I want to do some HTML screen scraping, but figuring out if it is possible in PHP. In Python, I would use PyQuery. In Ruby, I would use Nokogiri.

Read the article
how to create scrap page in asp.net with c# coding?

- by sribharanidharan

hai how to create the scrap or inbox like in orkut scraping in asp.net with c# coding? any one guide me i'm strucked in my project!!!!!!!!

Read the article
How many iMacros can run at the same time?

- by user292311

We're using iMacros to fill web forms. Does anyone know how many instances of iMacros can be run at the same time on a PC? If I need to automatically fill web forms for screen scraping, is there a better tool if I need "tons" of instances to run simultaneously? Thanks.

Read the article
scrape data from a website and post it on the blog (wordpress)

- by Pennf0lio

This could be in DocType But I'm looking for a software or just a plugin for wordpress. I wanted to fetch those data from a website and automatically post it on my blog (Wordpress powered). It doesn't have rss or api to get those data, so I need to manually copy and paste it one-by-one and post it on wordpress. Do you know an alternative options on my process? or you know a software or a plugin that does the job? Thanks!

Read the article
How to detect Javascript pop-up notifications in WatiN?

- by Ian P

I have a, what seems to be, rather common scenario I'm trying to work through. I have a site that accepts input through two different text fields. If the input is malformed or invalid, I receive a Javascript pop-up notification. I will not always receive one, but I should in the event of (like I said earlier) malformed data, or when a search result couldn't be found. How can I detect this in WatiN? A quick Google search produced results that show how to click through them, but I'm curious as to whether or not I can detect when I get one? In case anyone is wondering, I'm using WatiN to do some screen scraping for me, rather than integration testing :) Thanks in advance! Ian

Read the article
Scrape HTML tables from a given URL into CSV

- by dreeves

I seek a tool that can be run on the command line like so: tablescrape 'http://someURL.foo.com' [n] If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list. If n is specified or if there's only one table, it should parse the table and spit it to stdout as CSV or TSV. Potential additional features: To be really fancy you could parse a table within a table, but for my purposes -- fetching data from wikipedia pages and the like -- that's overkill. The Perl module HTML::TableExtract can do this and may be good place to start for writing the tool I have in mind. An option to asciify any unicode. An option to apply an arbitrary regex substitution for fixing weirdnesses in the parsed table. Related questions: http://stackoverflow.com/questions/259091/how-can-i-scrape-an-html-table-to-csv http://stackoverflow.com/questions/1403087/how-can-i-convert-an-html-table-to-csv http://stackoverflow.com/questions/2861/options-for-html-scraping

Read the article
Screen scrape a web page that uses javaScript and frames

- by Mello

Hi, I want to scrape data from www.marktplaats.nl . I want to analyze the scraped description, price, date and views in Excel/Access. I tried to scrape data with Ruby (nokogiri, scrapi) but nothing worked. (on other sites it worked well) The main problem is that for example selectorgadget and the add-on firebug (Firefox) don’t find any css I can use to scrape the page. On other sites I can extract the css with selectorgadget or firebug and use it with nokogiri or scrapi. Due to lack of experience it is difficult to identify the problem and therefore searching for a solution isn’t easy. Can you tell me where to start solving this problem and where I maybe can find more info about a similar scraping process? Thanks in advance!

Read the article
Automating WebTrends analysis

- by tridium

Every week I access server logs processed by WebTrends (for about 7 profiles) and copy ad clickthrough and visitor information into Excel spreadsheets. A lot of it is just accessing certain sections and finding the right title and then copying the unique visitor information. I tried using WebTrends' built-in query tool but that is really poorly done (only uses a drag-and-drop system instead of text-based) and it has a maximum number of parameters and maximum length of queries to query with. As far as I know, the tools in WebTrends are not suitable to my purpose of automating the entire web metrics gathering process. I've gotten access to the raw server logs, but it seems redundant to parse that given that they are already being processed by WebTrends. To me it seems very scriptable, but how would I go about doing that? Is screen-scraping an option?

Read the article
Python GUI Scraper hanging issues.

- by bball

I wrote a scraper using python a while back, and it worked fine in the command line. I have made a GUI for the application now, but I am having trouble with one issue. When I attempt to update text inside the gui (e.g. 'fetching URL 12/50'), I am unable seeing as the function within the scraper is grabbing 100+ links. Also when going from one scraping function, to a function that should update the gui, to another function, the gui update function seems to be skipped over while the next scrape function is run. An example would be: scrapeLinksA() #takes 20 seconds updateInfo("LinksA done") scrapeLinksB() #takes another 20 seconds in the above example, updateInfo is never executed, unless I end the program with a KeyboardInterrupt. I'm thinking my solution is threading, but I'm not sure. What can I do to fix this? I am using: PyQt4 urllib2 BeautifulSoup

Read the article
Python Scraper for Javascript?

- by Diego

Hey all, Can anyone direct me to a good Python screen scraping library for javascript code (hopefully one with good documentation/tutorials)? I'd like to see what options are out there, but most of all the easiest to learn with fastest results... wondering if anyone had experience. I've heard some stuff about spidermonkey, but maybe there are better ones out there? Specifically, I use BeautifulSoup and Mechanize to get to here, but need a way to open the javascript popup, submit data, and download/parse the results in the javascript popup. <a href="javascript:openFindItem(12510109)" onclick="s_objectID="javascript:openFindItem(12510109)_1";return this.s_oc?this.s_oc(e):true">Find Item</a> I'd like to implement this with Google App engine and Django. Thanks!

Read the article
xvfb on a machine with a display, can an application run 'in the background?'

- by marfarma

I'm setting up to cron a web scraping job, using xvfb, firefox, and watir on my Mac OS X. In testing the script so far, firefox pops up visibly on the local desktop, the watir script executes, and then firefox exits (I quit firefox in my script). I'd like to set the xvfb DISPLAY such that firefox will run, but won't be seen on the local desktop, running 'in the background' so to speak. Nothing I've been able to find online discusses such a possibility - nor explains that it's not possible. Is it possible? If so, what do I need to do to make it work?

Read the article
How to get InnerText of IFrame from another site?

- by Eclipsed4utoo

I am trying to do some screen-scraping of a website. The content that I want to get is inside of an IFrame. How do I get the InnerText or HTML that is being displayed inside of the IFrame? I am using .Net 4.0 and C#. I want to be able to do this from a WinForm. I tried this, but can't find where to get the actual data from... void PageCompleted(object sender, WebBrowserDocumentCompletedEventArgs e) { WebBrowser b = sender as WebBrowser; string response = b.DocumentText; HtmlElement element = b.Document.GetElementById("profileFrame"); if (element != null) { // do something with the data } } I've tried searching through the element but couldn't find any of the HTML. Is this possible?

Read the article
Is there a good tutorial for figuring out what a website is doing so your program can do the same th

- by brian d foy

Is there a good guide or tutorial for people who need to programmatically interact with dynamic websites? There's been a rash of Perl questions about that lately, and I haven't found a good resource to point people toward. I'm asking not because I need one but because I don't want to waste my time writing it if it already exists. Although I'm most interested in Perl, the extra tools and techniques are mostly the same. Typically, I see see these problems in people's questions: Handling, setting, and saving cookies Finding and interacting with forms Handling JavaScript inside your user-agent especially things like onLoad, onSumbit, and Ajax Using HTTP sniffer tools Using Web developer plugins in interactive browsers Interacting with DOM, screen scraping, etc. If there's no good tutorial, I'll add it to my list of things to do (unless someone else wants to do it :). Along the way, if you don't have a suggestion for an existing tutorial, please suggest the things that you think should be in a new one, including links, your favorite tools, and your own user-agent development experiences. I don't care about the particular language you use.

Read the article
How to protect/monitor your site from crawling by malicious user

- by deathy

Situation: Site with content protected by username/password (not all controlled since they can be trial/test users) a normal search engine can't get at it because of username/password restrictions a malicious user can still login and pass the session cookie to a "wget -r" or something else. The question would be what is the best solution to monitor such activity and respond to it (considering the site policy is no-crawling/scraping allowed) I can think of some options: Set up some traffic monitoring solution to limit the number of requests for a given user/IP. Related to the first point: Automatically block some user-agents (Evil :)) Set up a hidden link that when accessed logs out the user and disables his account. (Presumably this would not be accessed by a normal user since he wouldn't see it to click it, but a bot will crawl all links.) For point 1. do you know of a good already-implemented solution? Any experiences with it? One problem would be that some false positives might show up for very active but human users. For point 3: do you think this is really evil? Or do you see any possible problems with it? Also accepting other suggestions.

Read the article
Groovy htmlunit getFirstByXPath returning null

- by StartingGroovy

I have had a few issues with HtmlUnit returning nulls lately and am looking for guidance. each of my results for grabbing the first row of a website have returned null. I am wondering if someone can A) explain why they might be returning null B) explain better ways (if there are some) to go about getting the information Here is my current code (URL is in the source): client = new WebClient(BrowserVersion.FIREFOX_3) client.javaScriptEnabled = false def url = "http://www.hidemyass.com/proxy-list/" page = client.getPage(url) IpAddress = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[2]").getValue() println "IP Address is: $data" //returns null //Port_Number is an Image Country = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[4][@class='country']/@rel").getValue() println "Country abbreviation is: $Country" //differentiate speed and connection by name of gif? Type = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[7]").getValue() println "Proxy type is: $Type" Anonymity = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[8]").getValue() println "Anonymity Level is: $Anonymity" client.closeAllWindows() Right now all of my XPaths return null and .getValue() obviously doesn't work on null. I also have questions as to what I should do about the PORT since it is an image? Is there a better alternative than downloading it and attempting to solve it by OCR? Side Note There is no significance in this site, I was just looking for a site that I could practice scraping on (the last one I ran into issues of fragment identities and couldn't get an answer to: HtmlUnit getByXpath returns null and HtmlUnit and Fragment Identities )

Read the article
Screenscraping and reverse engineering health based web tool

- by ArbInv

Hi There is a publicly available free tool which has been built to help people understand the impact of various risk factors on their health / life expectancy. I am interested in understanding the data that sits behind the tool. To get this out it would require putting in a range of different socio-demographic factors and analyzing the resulting outputs. This would need to be done across many thousand different individual profiles. The tool was probably built on some standard BI platorm. I have no interest in how the tool was built but do want to get to the data within it. The site has a Terms of Use Agreement which includes: Not copying, distribute, adapt, create derivative works of, translate, or otherwise modify the said tool Not decompile, disassemble, reverse assemble, or otherwise reverse engineer the tool. The said institution retains all rights, title and interest in and to the Tool, and any and all modifications thereof, including all copyright, copyright registrations, trade secrets, trademarks, goodwill and confidential and proprietary information related thereto. Would i be in effect breaking the law if i were to point a screen scraping tool which downloaded the data that sits behind the tool in question?? Any advice welcomed? THANKS

Read the article
Perl - WWW::Mechanize Cookie Session Id is being reset with every get(), how to make it stop?

- by Phill Pafford

So I'm scraping a site that I have access to via HTTPS, I can login and start the process but each time I hit a new page (URL) the cookie Session Id changes. How do I keep the logged in Cookie Session Id? #!/usr/bin/perl -w use strict; use warnings; use WWW::Mechanize; use HTTP::Cookies; use LWP::Debug qw(+); use HTTP::Request; use LWP::UserAgent; use HTTP::Request::Common; my $un = 'username'; my $pw = 'password'; my $url = 'https://subdomain.url.com/index.do'; my $agent = WWW::Mechanize->new(cookie_jar => {}, autocheck => 0); $agent->{onerror}=\&WWW::Mechanize::_warn; $agent->agent('Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.3) Gecko/20100407 Ubuntu/9.10 (karmic) Firefox/3.6.3'); $agent->get($url); $agent->form_name('form'); $agent->field(username => $un); $agent->field(password => $pw); $agent->click("Log In"); print "After Login Cookie: "; print $agent->cookie_jar->as_string(); print "\n\n"; my $searchURL='https://subdomain.url.com/search.do'; $agent->get($searchURL); print "After Search Cookie: "; print $agent->cookie_jar->as_string(); print "\n"; The output: After Login Cookie: Set-Cookie3: JSESSIONID=367C6D; path="/thepath"; domain=subdomina.url.com; path_spec; secure; discard; version=0 After Search Cookie: Set-Cookie3: JSESSIONID=855402; path="/thepath"; domain=subdomain.com.com; path_spec; secure; discard; version=0 Also I think the site requires a CERT (Well in the browser it does), would this be the correct way to add it? $ENV{HTTPS_CERT_FILE} = 'SUBDOMAIN.URL.COM'; ## Insert this after the use HTTP::Request... Also for the CERT In using the first option in this list, is this correct? X.509 Certificate (PEM) X.509 Certificate with chain (PEM) X.509 Certificate (DER) X.509 Certificate (PKCS#7) X.509 Certificate with chain (PKCS#7)

Read the article
View Generated Source (After AJAX/JavaScript) in C#

- by Michael La Voie

Is there a way to view the generated source of a web page (the code after all AJAX calls and JavaScript DOM manipulations have taken place) from a C# application without opening up a browser from the code? Viewing the initial page using a WebRequest or WebClient object works ok, but if the page makes extensive use of JavaScript to alter the DOM on page load, then these don't provide an accurate picture of the page. I have tried using Selenium and Watin UI testing frameworks and they work perfectly, supplying the generated source as it appears after all JavaScript manipulations are completed. Unfortunately, they do this by opening up an actual web browser, which is very slow. I've implemented a selenium server which offloads this work to another machine, but there is still a substantial delay. Is there a .Net library that will load and parse a page (like a browser) and spit out the generated code? Clearly, Google and Yahoo aren't opening up browsers for every page they want to spider (of course they may have more resources than me...). Is there such a library or am I out of luck unless I'm willing to dissect the source code of an open source browser? SOLUTION Well, thank you everyone for you're help. I have a working solution that is about 10X faster then Selenium. Woo! Thanks to this old article from beansoftware I was able to use the System.Windows.Forms.WebBrwoswer control to download the page and parse it, then give em the generated source. Even though the control is in Windows.Forms, you can still run it from Asp.Net (which is what I'm doing), just remember to add System.Window.Forms to your project references. There are two notable things about the code. First, the WebBrowser control is called in a new thread. This is because it must run on a single threaded apartment. Second, the GeneratedSource variable is set in two places. This is not due to an intelligent design decision :) I'm still working on it and will update this answer when I'm done. wb_DocumentCompleted() is called multiple times. First when the initial HTML is downloaded, then again when the first round of JavaScript completes. Unfortunately, the site I'm scraping has 3 different loading stages. 1) Load initial HTML 2) Do first round of JavaScript DOM manipulation 3) pause for half a second then do a second round of JS DOM manipulation. For some reason, the second round isn't cause by the wb_DocumentCompleted() function, but it is always caught when wb.ReadyState == Complete. So why not remove it from wb_DocumentCompleted()? I'm still not sure why it isn't caught there and that's where the beadsoftware article recommended putting it. I'm going to keep looking into it. I just wanted to publish this code so anyone who's interested can use it. Enjoy! using System.Threading; using System.Windows.Forms; public class WebProcessor { private string GeneratedSource{ get; set; } private string URL { get; set; } public string GetGeneratedHTML(string url) { URL = url; Thread t = new Thread(new ThreadStart(WebBrowserThread)); t.SetApartmentState(ApartmentState.STA); t.Start(); t.Join(); return GeneratedSource; } private void WebBrowserThread() { WebBrowser wb = new WebBrowser(); wb.Navigate(URL); wb.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler( wb_DocumentCompleted); while (wb.ReadyState != WebBrowserReadyState.Complete) Application.DoEvents(); //Added this line, because the final HTML takes a while to show up GeneratedSource= wb.Document.Body.InnerHtml; wb.Dispose(); } private void wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e) { WebBrowser wb = (WebBrowser)sender; GeneratedSource= wb.Document.Body.InnerHtml; } }

Read the article
Why am I getting a new session ID on every page fetch in my Perl WWW::Mechanize script?

- by Phill Pafford

So I'm scraping a site that I have access to via HTTPS, I can login and start the process but each time I hit a new page (URL) the cookie Session Id changes. How do I keep the logged in Cookie Session Id? #!/usr/bin/perl -w use strict; use warnings; use WWW::Mechanize; use HTTP::Cookies; use LWP::Debug qw(+); use HTTP::Request; use LWP::UserAgent; use HTTP::Request::Common; my $un = 'username'; my $pw = 'password'; my $url = 'https://subdomain.url.com/index.do'; my $agent = WWW::Mechanize->new(cookie_jar => {}, autocheck => 0); $agent->{onerror}=\&WWW::Mechanize::_warn; $agent->agent('Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.3) Gecko/20100407 Ubuntu/9.10 (karmic) Firefox/3.6.3'); $agent->get($url); $agent->form_name('form'); $agent->field(username => $un); $agent->field(password => $pw); $agent->click("Log In"); print "After Login Cookie: "; print $agent->cookie_jar->as_string(); print "\n\n"; my $searchURL='https://subdomain.url.com/search.do'; $agent->get($searchURL); print "After Search Cookie: "; print $agent->cookie_jar->as_string(); print "\n"; The output: After Login Cookie: Set-Cookie3: JSESSIONID=367C6D; path="/thepath"; domain=subdomina.url.com; path_spec; secure; discard; version=0 After Search Cookie: Set-Cookie3: JSESSIONID=855402; path="/thepath"; domain=subdomain.com.com; path_spec; secure; discard; version=0 Also I think the site requires a CERT (Well in the browser it does), would this be the correct way to add it? $ENV{HTTPS_CERT_FILE} = 'SUBDOMAIN.URL.COM'; ## Insert this after the use HTTP::Request... Also for the CERT In using the first option in this list, is this correct? X.509 Certificate (PEM) X.509 Certificate with chain (PEM) X.509 Certificate (DER) X.509 Certificate (PKCS#7) X.509 Certificate with chain (PKCS#7)

Read the article

Search Results

Search found 346 results on 14 pages for 'scraping'.

Page 4/14 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >

- by user1742368

- by fredz0003

- by Nicholas Law

- by cooldude

- by user1849286

- by user1887288

- by Swaroop C H

- by sribharanidharan

- by user292311

- by Pennf0lio

- by Ian P

- by dreeves

- by Mello

- by tridium

- by bball

- by Diego

- by marfarma

- by Eclipsed4utoo

- by brian d foy

- by deathy

- by StartingGroovy

- by ArbInv

- by Phill Pafford

- by Michael La Voie

- by Phill Pafford

< Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >