Search Results

Search found 346 results on 14 pages for 'scraping'.

Page 6/14 | < Previous Page | 2 3 4 5 6 7 8 9 10 11 12 13  | Next Page >

  • Is selling a "website screen scraper" illegal?

    - by Yatendra Goel
    I have coded a "website screen scraper" and want to sell it commercially. I know that the webmaster of the target website has restricted its webpages from being scraped: the site's robots.txt file says its pages must not be scraped. So my question is whether selling that screen scraper, or using it, is a crime in legal terms. I know this question is related to law, but I thought the software experts on SO might also have an answer to it.

    Read the article

  • Calling UIGetScreenImage() on manually-spawned thread prints "_NSAutoreleaseNoPool():" message to log

    - by jtrim
    This is the body of the selector that is specified in NSThread +detachNewThreadSelector:(SEL)aSelector toTarget:(id)aTarget withObject:(id)anArgument:

        NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
        while (doIt) {
            if (doItForSure) {
                NSLog(@"checking");
                doItForSure = NO;
                (void)gettimeofday(&start, NULL);
                /* do some stuff */
                // the next line prints "_NSAutoreleaseNoPool():" message to the log
                CGImageRef screenImage = UIGetScreenImage();
                /* do some other stuff */
                (void)gettimeofday(&end, NULL);
                elapsed = ((double)(end.tv_sec) + (double)(end.tv_usec) / 1000000) - ((double)(start.tv_sec) + (double)(start.tv_usec) / 1000000);
                NSLog(@"Time elapsed: %e", elapsed);
                [pool drain];
            }
        }
        [pool release];

    Even with the autorelease pool present, I get this printed to the log when I call UIGetScreenImage():

        2010-05-03 11:39:04.588 ProjectName[763:5903] *** _NSAutoreleaseNoPool(): Object 0x15a2e0 of class NSCFNumber autoreleased with no pool in place - just leaking

    Has anyone else seen this with UIGetScreenImage() on a separate thread?

    Read the article

  • How to export scrubyt extractor?

    - by robintw
    I've written a scrubyt extractor based on the 'learning' technique - that is, specifying the current text on the page and getting it to work out the XPath expressions itself. However, I now want to export the extractor so that it can be used even when the page has changed. The documentation for scrubyt seems to be all over the place now, but from what I can find I should be able to put the line extractor.export(__FILE__) and it should work. It doesn't - I just get an error saying that there is the wrong number of arguments for export, it should have 0. I've tried it without any arguments and it still fails. I would ask on the scrubyt forum, but it seems like no-one's been there for ages! Any ideas what to do here?

    Read the article

  • What movie website allows people to scrape it?

    - by Sergio Tapia
    I want to make a C# library to scrape movie information and return it to the application, but someone told me that doing so is against the TOS. RottenTomatoes seems to have no problem with it from what I've read on their licensing page, but I'm not quite sure. Where could I acquire movie information legally and without cost? It's for an open source application hosted here: LINK

    Read the article

  • What must I learn to write a PHP grabber (parser)?

    - by butteff
    What must I learn to write a PHP website grabber (parser)? It just needs to collect various pieces of information from other websites, such as a weather forecast, a wiki "on this day" entry, some news, and other useful and interesting "every day" information. Also, what should I read to write an m3u player in PHP? Sorry for my bad English.

    Read the article

  • How to open URLs in Rails?

    - by yuval
    I'm trying to read in the HTML of a certain website. Trying

        @something = open("http://www.google.com/")

    fails with the following error:

        Errno::ENOENT in testController#show
        No such file or directory - http://www.google.com/

    Going to http://www.google.com/, I obviously see the site. What am I doing wrong? Thanks!

    Read the article

  • I want to scrape a site using GAE and post the results into a Google Entity

    - by cozza
    I want to scrape this URL:

        https://www.xstreetsl.com/modules.php?searchSubmitImage_x=0&searchSubmitImage_y=0&SearchLocale=0&name=Marketplace&SearchKeyword=business&searchSubmitImage.x=0&searchSubmitImage.y=0&SearchLocale=0&SearchPriceMin=&SearchPriceMax=&SearchRatingMin=&SearchRatingMax=&sort=&dir=asc

    then go into each of the links, extract various pieces of information (e.g. permissions, prims, etc.), and post the results into an Entity on Google App Engine. What is the best way to go about it? Chris
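
    One possible shape for this on App Engine's Python runtime is sketched below. It is only a sketch: the Listing model, its field names, and the regular expressions for pulling detail links and values are assumptions for illustration, not xstreetsl.com's real markup, so they would need to be adjusted against the actual pages.

        from google.appengine.api import urlfetch
        from google.appengine.ext import db
        import re

        class Listing(db.Model):
            # hypothetical entity; rename fields to whatever you actually extract
            url = db.StringProperty()
            permissions = db.StringProperty()
            prims = db.IntegerProperty()

        SEARCH_URL = ("https://www.xstreetsl.com/modules.php?searchSubmitImage_x=0&searchSubmitImage_y=0"
                      "&SearchLocale=0&name=Marketplace&SearchKeyword=business&searchSubmitImage.x=0"
                      "&searchSubmitImage.y=0&SearchLocale=0&SearchPriceMin=&SearchPriceMax="
                      "&SearchRatingMin=&SearchRatingMax=&sort=&dir=asc")

        def scrape_and_store():
            result = urlfetch.fetch(SEARCH_URL, deadline=10)
            # assumption: detail pages are linked via modules.php?name=Marketplace&...ItemID=...
            detail_links = re.findall(r'href="(/modules\.php\?name=Marketplace[^"]*ItemID=\d+)"',
                                      result.content)
            for link in detail_links:
                page = urlfetch.fetch("https://www.xstreetsl.com" + link, deadline=10).content
                perms = re.search(r'Permissions:\s*</t[dh]>\s*<td[^>]*>([^<]+)', page)  # made-up pattern
                prims = re.search(r'Prims:\s*</t[dh]>\s*<td[^>]*>(\d+)', page)          # made-up pattern
                Listing(url=link,
                        permissions=perms.group(1).strip() if perms else None,
                        prims=int(prims.group(1)) if prims else None).put()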

    Read the article

  • Saving HttpResponse/Request to file system

    - by chrisjlong
    Here is my scenario. The user fills out this large page, which is dynamically created based on DB values. Those values can change. When the user fills out the page and hits submit, we want to save a copy of the page as HTML on the server; this way, if the text or wording changes, when they go back to view their posted information it is historically accurate. So I basically need to do this:

        protected void buttonSave_Click(object sender, EventArgs e)
        {
            //collect information into an object to save it in the db
            bool result = BusinessLogic.Save(myBusinessObject);
            if (result)
                //!!! Here is where I need to save this page as an html file on my server's IFS!!!!
            else
                //whatever
            Response.Redirect("~/SomeOtherPage.aspx");
        }

    Any help is greatly appreciated. Also, I CANNOT just request the data from the URL, because query string parameters are a big no-no in this case. The key to pull the database info up (at its highest level) is all in session, so I can't just request a URL and save it. Thanks!

    Read the article

  • Nokogiri Doc Element Not Returning Correctly

    - by TenJack
    I am trying to scrape a wiktionary entry:

        uri = URI.parse("http://en.wiktionary.org/wiki/" + CGI.escape('abjure'))
        doc = Nokogiri::HTML(open(uri, 'User-Agent' => 'ruby'))

    but the doc shows no elements for this word. The other words work fine, and this word used to work. I have no idea what changed. Anyone see anything wrong with this?

    Read the article

  • Programmatically grabbing text from a web page that is dynamically generated

    - by bstullkid
    There is a website I am trying to pull information from in Perl; however, the section of the page I need is being generated using JavaScript, so all you see in the source is

        <div id="results"></div>

    I need to somehow pull out the contents of that div and save it to a file using Perl/proxies/whatever. For example, the information I want to save would be document.getElementById('results').innerHTML; I am not sure if this is possible, or if anyone has any ideas or a way to do this. I was using a lynx source dump for other pages, but since I can't straightforwardly screen-scrape this page I came here to ask about it! If anyone is interested, the page is http://downloadcenter.trendmicro.com/index.php?clk=left_nav&clkval=pattern_file&regs=NABU and the info I am trying to get is the row about the ConsumerOPR

    Read the article

  • Yahoo Web Scrapes: What are the limits?

    - by bvandrunen
    We are using a web scraper and have set it up with a sleep function that uses a random delay (so that the time between scrapes isn't constant), but we are still getting blocked by Yahoo after 20-30 requests. Does anyone know if there is a limit (e.g. 20 requests per minute, 200 an hour)? Right now our average delay between requests is around 3-6 seconds. Thanks for any help.
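
    Yahoo does not publish a hard number for this, so the usual approach is to slow down and back off whenever a block is detected rather than tune to a fixed quota. A rough sketch of that pattern follows; the 3-8 second window, the 999 status code, and the "unusual traffic" check are assumptions and heuristics, not documented limits.

        import random
        import time
        import urllib2

        def polite_fetch(url, min_delay=3.0, max_delay=8.0, max_retries=5):
            """Fetch url with a randomized delay, backing off when we appear to be blocked."""
            delay = random.uniform(min_delay, max_delay)
            for attempt in range(max_retries):
                time.sleep(delay)
                try:
                    html = urllib2.urlopen(url).read()
                except urllib2.HTTPError, e:
                    if e.code == 999:            # status Yahoo is said to use for rate limiting (assumption)
                        delay *= 4               # back off hard and retry
                        continue
                    raise
                if 'unusual traffic' in html.lower():   # block-page heuristic (assumption)
                    delay *= 4
                    continue
                return html
            raise RuntimeError("still blocked after %d attempts" % max_retries)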

    Read the article

  • Reading Ontology with Jena, feeding it with RDF triples, and producing correct RDF string output.

    - by JonB
    Hi, I have an ontology, which I read in with Jena to help me scrape some RDFa triples from a website. I don't currently store these triples in a Jena model, but that is fairly straightforward to do; it's next on my to-do list. The area I am struggling with, though, is getting Jena to output correct RDF for the ontology I have. The ontology uses OWL and RDFS definitions, but when I pass some example triples into the model, they don't appear correctly, almost as if it doesn't know anything about the ontology. The output is, however, still valid RDF; it's just not coming out in the form I was hoping for. Am I correct in thinking that Jena should be able to produce well-written RDF (not just valid RDF) for the triples I have collected, based on the ontology, or does this stretch beyond what it is capable of? Many thanks for any input.

    Read the article

  • Help converting code using httplib2 to use urllib2

    - by ThinkCode
    What am I trying to do? Visit a site, retrieve a cookie, and visit the next page by sending in the cookie info. It all works, but httplib2 is giving me one too many problems with a SOCKS proxy on one site.

        http = httplib2.Http()
        main_url = 'http://mywebsite.com/get.aspx?id='+ id +'&rows=25'
        response, content = http.request(main_url, 'GET', headers=headers)
        main_cookie = response['set-cookie']
        referer = 'http://google.com'
        headers = {'Content-type': 'application/x-www-form-urlencoded',
                   'Cookie': main_cookie,
                   'User-Agent' : USER_AGENT,
                   'Referer' : referer}

    How do I do the same exact thing using urllib2 (retrieving the cookie and passing it to the next page on the same site)? Thank you.
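
    A sketch of the urllib2 equivalent using cookielib, which keeps the Set-Cookie/Cookie round-trip automatic instead of copying the header by hand. The URLs, record_id, and USER_AGENT below are placeholders standing in for the values from the snippet above.

        import cookielib
        import urllib2

        USER_AGENT = 'Mozilla/5.0'   # whatever you were sending before
        record_id = '123'            # placeholder for the id value in your code

        cookie_jar = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

        # First request: the jar captures any Set-Cookie headers for us.
        main_url = 'http://mywebsite.com/get.aspx?id=' + record_id + '&rows=25'
        request = urllib2.Request(main_url, headers={'User-Agent': USER_AGENT})
        content = opener.open(request).read()

        # Second request on the same site: the jar sends the cookie back automatically.
        next_request = urllib2.Request(
            'http://mywebsite.com/nextpage.aspx',            # placeholder for the next page
            headers={'Content-type': 'application/x-www-form-urlencoded',
                     'User-Agent': USER_AGENT,
                     'Referer': 'http://google.com'})
        next_content = opener.open(next_request).read()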

    Read the article

  • How can I get all content within a <td> tag using the HTML Agility Pack?

    - by Bob Dylan
    So I'm writing an application that will do a little screen scraping. I'm using the HTML Agility Pack to load an entire HTML page into an instance of HtmlDocument called doc. Now I want to parse that doc, looking for this:

        <table border="0" cellspacing="3">
          <tr><td>First rows stuff</td></tr>
          <tr>
            <td>
              The data I want is in here <br />
              and it's separated by these annoying <br />'s.
              No id's, classes, or even a single <p> tag. </p>
              Just a bunch of <br /> tags.
            </td>
          </tr>
        </table>

    So I just need to get the data within the 2nd row. How can I do this? Should I use a regex or something else?

    Read the article

  • Voting on Hacker News stories programmatically?

    - by igorgue
    I decided to write an app like http://michaelgrinich.com/hackernews/ but for Android devices. My idea is to use a web application backend (because I'd rather code in Python for the web than completely in Java for Android devices). What I have implemented right now is something like this:

        $ curl -i http://localhost:8080/stories.json?page=1\&stories=1
        HTTP/1.0 200 OK
        Date: Sun, 25 Apr 2010 07:59:37 GMT
        Server: WSGIServer/0.1 Python/2.6.5
        Content-Length: 296
        Content-Type: application/json

        [{"title": "Don\u2019t talk to aliens, warns Stephen Hawking", "url": "http://www.timesonline.co.uk/tol/news/science/space/article7107207.ece?", "unix_time": 1272175177, "comments": 15, "score": 38, "user": "chaostheory", "position": 1, "human_time": "Sun Apr 25 01:59:37 2010", "id": "1292241"}]

    The next step (and the final one, I think) is voting. My design is to do something like this:

        $ curl -i http://localhost:8080/stories/1?vote=up -u username:password

    to vote up, and

        $ curl -i http://localhost:8080/stories/1?vote=down -u username:password

    to vote down. I have no idea how to do it, though... I was planning to use Twill, but the login link is always different, e.g.: http://news.ycombinator.com/x?fnid=7u89ccHKln Later the Android app will consume this API. Any experience with programmatically browsing Hacker News?
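
    One way to cope with the changing fnid is to never hard-code it: keep a cookie session, load the page that contains the form or vote link, and scrape the current one-time URL out of the HTML on every request. A rough urllib2/cookielib sketch of that idea follows; the login endpoint, form field names, and the vote-link pattern are assumptions about Hacker News's markup at the time and need to be verified against the live pages.

        import cookielib
        import re
        import urllib
        import urllib2

        HN = 'http://news.ycombinator.com'
        jar = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

        def login(username, password):
            # Fetch a login page first so we get a fresh fnid to post back (URL and field names are assumptions).
            page = opener.open(HN + '/newslogin').read()
            fnid = re.search(r'name="fnid" value="([^"]+)"', page).group(1)
            data = urllib.urlencode({'fnid': fnid, 'u': username, 'p': password})
            opener.open(HN + '/y', data).read()

        def vote(item_id, direction='up'):
            # The vote arrow for a logged-in user carries its own one-time URL; scrape it from the item page.
            page = opener.open(HN + '/item?id=' + item_id).read()
            match = re.search(r'id="%s_%s"[^>]*href="([^"]+)"' % (direction, item_id), page)
            if not match:
                raise ValueError('vote link not found (already voted, or the markup changed)')
            opener.open(HN + '/' + match.group(1).replace('&amp;', '&')).read()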

    Read the article

  • Why Shouldn't I Programmatically Submit Username/Password to Facebook/Twitter/Amazon/etc?

    - by viatropos
    I wish there was a central, fully customizable, open source, universal login system that allowed you to login and manage all of your online accounts (maybe there is?)... I just found RPXNow today after starting to build a Sinatra app to login to Google, Facebook, Twitter, Amazon, OpenID, and EventBrite, and it looks like it might save some time. But I keep wondering, not being an authentication guru, why couldn't I just have a sleek login page saying "Enter username and password, and check your login service", and then in the background either scrape the login page from say EventBrite and programmatically submit the form with Mechanize, or use an API if there was one? It would be so much cleaner and such a better user experience if they didn't have to go through popups and redirects and they could use any previously existing accounts. My question is: What are the reasons why I shouldn't do something like that? I don't know much about the serious details of cookies/sessions/security, so if you could be descriptive or point me to some helpful links that would be awesome. Thanks!

    Read the article

  • Automated download of website content using ASP.net

    - by Yaaqov
    Using ASP.net, what methods can I use to do the following:

    1. Open up a connection to a given URL to read HTML content.
    2. Parse the given URL for hyperlinks, and place them in an array.
    3. Loop through each hyperlink (only 1 level down), opening each one, saving the HTML contents in a table, and move to the next hyperlink until done.

    If ASP.net is not up to the task, other languages or free scripts/toolkits would be acceptable. Thanks.
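
    Since other languages are acceptable, here is a minimal sketch of the three steps in Python using only the standard library. It keeps results in a list of (url, html) pairs rather than a database table, and the link extraction is a plain href parse, so treat it as a starting point rather than a finished crawler.

        import urllib2
        import urlparse
        from HTMLParser import HTMLParser

        class LinkCollector(HTMLParser):
            """Collect href attributes from <a> tags."""
            def __init__(self):
                HTMLParser.__init__(self)
                self.links = []
            def handle_starttag(self, tag, attrs):
                if tag == 'a':
                    for name, value in attrs:
                        if name == 'href' and value:
                            self.links.append(value)

        def crawl_one_level(start_url):
            html = urllib2.urlopen(start_url).read()          # step 1: read the HTML content
            parser = LinkCollector()
            parser.feed(html)                                 # step 2: collect the hyperlinks
            pages = []
            for href in parser.links:                         # step 3: fetch each link, one level deep
                url = urlparse.urljoin(start_url, href)
                try:
                    pages.append((url, urllib2.urlopen(url).read()))
                except urllib2.URLError:
                    pass                                      # skip links that fail to load
            return pages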

    Read the article

  • form submitting with mechanize and Python

    - by MATELIN Alexis
    I'm trying to scrape a website that requires submitting two forms: a first one to log in, and a second one to specify my search. I'm using Python and the mechanize package. No problem with the first one, but I just can't figure out how to get past the second one. Here is the part of my code related to the form mentioned above:

        agemin=18
        agemax=25
        by='region'
        country='France'
        region=2
        newcustomers=1
        browser.select_form(nr=0)
        browser['age[min]']=agemin
        browser['age[max]']=agemax
        browser['country']=country
        browser['region']=region
        browser['by']=by
        browser['new-customers']=newcustomers
        response=browser.submit()
        content=response.read()

    but when I submit the variable 'age[min]', for example, I get the following error message:

        TypeError: object of type 'int' has no len()

    To give you some more information, here is what I get with 'print br.form':

        <POST http://www.adopteunmec.com/qsearch/ajax_quick application/x-www-form-urlencoded
        <SelectControl(age[min]=[, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, *30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])>
        <SelectControl(age[max]=[, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, *45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])>
        <SelectControl(by=[*region, distance])>
        <SelectControl(country=[*fr, be, ch, ca])>
        <SelectControl(region=[*1, 2, 3, 4, 5, 6, 7, 8, 22, 23, 9, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 11])>
        <SelectControl(distance[min]=[*, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800, 810, 820, 830, 840, 850, 860, 870, 880, 890, 900, 910, 920, 930, 940, 950, 960, 970, 980, 990, 1000])>
        <SelectControl(distance[max]=[, 0, 10, 20, 30, 40, 50, 60, 70, *80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800, 810, 820, 830, 840, 850, 860, 870, 880, 890, 900, 910, 920, 930, 940, 950, 960, 970, 980, 990, 1000])>
        <CheckboxControl(new=[*1])>>

    My guess is that the form needs an object (like a list) containing all the variables to accept it; that's why it refuses the variables submitted one by one. Thank you in advance for any help! Alexis
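
    The form dump above suggests what is going on: every control in that form is a SelectControl or CheckboxControl, and mechanize expects those to be assigned a list of item names (strings), not bare integers; note also that the checkbox appears as new, not new-customers, and that the country items are fr/be/ch/ca rather than 'France'. A sketch of the assignment with that in mind, using the same values as the code above:

        browser.select_form(nr=0)

        # SELECT and CHECKBOX controls take a list of string item names, not ints.
        browser['age[min]'] = ['18']
        browser['age[max]'] = ['25']
        browser['by'] = ['region']
        browser['country'] = ['fr']     # the control's items are fr/be/ch/ca, not 'France'
        browser['region'] = ['2']
        browser['new'] = ['1']          # the checkbox shows up as 'new' in print br.form

        response = browser.submit()
        content = response.read()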

    Read the article

  • How can I get all content within <table></table> tags using a regex?

    - by Bob Dylan
    So I'm writing an application that will do a little screen scraping. All the pages (about 1000 or so) contain this:

        <table border="0" cellspacing="3">
          <tr><td>First rows stuff</td></tr>
          <tr>
            <td>
              The data I want is in here <br />
              and it's separated by these annoying <br />'s.
              No id's, classes, or even a single <p> tag.
              Just a bunch of <br /> tags.
            </td>
          </tr>
        </table>

    So I just need to get the data within the 2nd row out. How can I do this? Should I use a regex or something else?
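
    If the markup really is that consistent across all ~1000 pages, a regex can work; here is a hedged sketch in Python (the same idea ports to most regex flavors) that grabs the contents of the second <tr>. Keep in mind it breaks the moment a page nests tables or changes the row order, which is why an HTML parser is usually the safer choice.

        import re

        def second_row_contents(html):
            """Return the inner HTML of the 2nd <tr> of the target table, or None."""
            table_match = re.search(
                r'<table border="0" cellspacing="3">(.*?)</table>', html, re.DOTALL)
            if not table_match:
                return None
            rows = re.findall(r'<tr>(.*?)</tr>', table_match.group(1), re.DOTALL)
            if len(rows) < 2:
                return None
            # Strip the inner <td> wrapper; the <br />-separated pieces stay in the string.
            cell = re.search(r'<td>(.*?)</td>', rows[1], re.DOTALL)
            return cell.group(1).strip() if cell else rows[1].strip()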

    Read the article

  • How to handle redirects while parsing HTML? - Python

    - by RadiantHex
    Hi folks, I'm trying to submit a few forms through a Python script; I'm using the mechanize library. This is so I can implement a temporary API. The problem is that after submission a blank page is returned informing me that the request is being processed, and after a few seconds the page is redirected to the final page. I understand this might sound a bit generic, but I'm not sure what is going on. :) Any ideas?
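
    If that "processing..." page is a meta-refresh holding page rather than a real HTTP 3xx (urllib2 and mechanize follow those automatically), one generic approach is to detect the refresh tag yourself, honor its delay, and then fetch the target URL. A rough sketch, assuming the holding page uses a <meta http-equiv="refresh" ...> tag; if the redirect is driven by JavaScript instead, this won't catch it.

        import re
        import time
        import urllib2
        import urlparse

        META_REFRESH = re.compile(
            r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]+content=["\']?\s*(\d+)\s*;\s*url=([^"\'>]+)',
            re.IGNORECASE)

        def fetch_following_refresh(url, max_hops=5):
            """Fetch url, following meta-refresh 'processing...' pages up to max_hops times."""
            for _ in range(max_hops):
                html = urllib2.urlopen(url).read()
                match = META_REFRESH.search(html)
                if not match:
                    return html                      # this is the final page
                delay, target = int(match.group(1)), match.group(2)
                time.sleep(min(delay, 10))           # honor the page's delay, capped
                url = urlparse.urljoin(url, target)  # refresh URLs are often relative
            return html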

    Read the article

  • Nokogiri find only inbound links

    - by astropanic
    I have an HTML document located at http://somedomain.com/somedir/example.html The document contains four links:

        http://otherdomain.com/other.html
        http://somedomain.com/other.html
        /only.html
        test.html

    How can I get the full URLs for the links in the current domain? I mean I should get:

        http://somedomain.com/other.html
        http://somedomain.com/only.html
        http://somedomain.com/somedir/test.html

    The first link should be ignored because it doesn't match my domain.
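
    The underlying logic is just "resolve each href against the page URL, then keep the ones whose host matches". Sketched here in Python's urlparse purely for illustration; the same resolve-then-compare-host approach maps onto Ruby's URI.join plus a host check over the hrefs Nokogiri gives you.

        import urlparse

        def inbound_links(page_url, hrefs):
            """Resolve hrefs against page_url and keep only links on the same host."""
            page_host = urlparse.urlparse(page_url).netloc
            resolved = [urlparse.urljoin(page_url, href) for href in hrefs]
            return [url for url in resolved if urlparse.urlparse(url).netloc == page_host]

        # Example with the four links from the question:
        links = inbound_links("http://somedomain.com/somedir/example.html",
                              ["http://otherdomain.com/other.html",
                               "http://somedomain.com/other.html",
                               "/only.html",
                               "test.html"])
        # -> ['http://somedomain.com/other.html',
        #     'http://somedomain.com/only.html',
        #     'http://somedomain.com/somedir/test.html']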

    Read the article
