Search Results

Search found 82 results on 4 pages for 'scraper'.

Page 1/4 | 1 2 3 4  | Next Page >

  • Is selling a "website screen scraper" illegal?

    - by Yatendra Goel
    I have coded a "website screen scraper" and want to sell it commercially. I know that the webmaster of the target website has restricted its webpages from being scraped: the site's robots.txt file says that its webpages must not be scraped. So my question is whether, in legal terms, selling that screen scraper is a crime, or whether using it is. I know this question is about law, but I thought the software experts on SO might also have an answer to it.

    Read the article

  • Facebook-like on-demand meta content scraper

    - by Tobias
    Have you ever noticed that FB scrapes the link you post on Facebook (status, message, etc.) live, right after you paste it into the link field, and displays various metadata: a thumbnail of the image, various images from the linked page, or a video thumbnail for a video link (like YouTube)? Any ideas how one would copy this function? I'm thinking about a couple of Gearman workers, or even better just JavaScript that does an XHR request and parses the content based on regexes or something similar... Any ideas? Any links? Has someone already tried to do the same and wrapped it in a nice class? Anything? :) Thanks!
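
    One way to reproduce the behaviour (this is a rough sketch, not Facebook's actual implementation) is to fetch the pasted URL server-side and read its Open Graph meta tags, falling back to the page title and any images on the page; here in Python with requests and BeautifulSoup, with the example URL as a placeholder:

        import requests
        from bs4 import BeautifulSoup

        def link_preview(url):
            # Fetch the page and pull out the metadata a share preview would need.
            html = requests.get(url, timeout=10).text
            soup = BeautifulSoup(html, "html.parser")

            def og(prop):
                tag = soup.find("meta", property=prop)   # e.g. <meta property="og:title" content="...">
                return tag.get("content") if tag else None

            title = og("og:title") or (soup.title.string if soup.title else url)
            image = og("og:image")
            if image is None:                            # fall back to images found in the page
                candidates = [img.get("src") for img in soup.find_all("img") if img.get("src")]
                image = candidates[0] if candidates else None
            return {"title": title, "description": og("og:description"), "image": image}

        print(link_preview("https://example.com/some/article"))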

    Read the article

  • Content farms, scraper sites, aggregators: real-world examples? [closed]

    - by Marco Demaio
    Content farms, scrapers, aggregators: real-world examples? Could you please clarify this for me: efreedom.com is a scraper site, not a content farm, because it simply copies and pastes content from Stack Overflow? ehow.com and squidoo.com are content farms? They don't copy and paste content, they just generate fresh new user-generated content, but too much of it and too quickly. expert-exchange.com is NOT a content farm or a scraper site, right?! It's simply that many people (and me too) hate it (they also wrote to Matt Cutts) because it shows up high in Google while providing a useless question with no answer. There are also many sites that act as 'content aggregators in the form of specialized directories' (let's call them CASD); I don't know how else to define them. Do they have a specific definition? Anyway, are these CASD sites content farms, scraper sites, or something else? Basically these CASD sites search for all sites of the same type, i.e. “restaurant websites”; they copy and paste the content found on “Restaurant A” and create on their aggregator site a new page called “Restaurant A”, then they do the same for all websites of the same type, thus creating a sort of directory of restaurants. Later on the CASD also sends an email to the owner of “Restaurant A” (usually the email is on the website) with a username and password to let him modify/update his own page on the CASD site. Later on the CASD might ask the owner of “Restaurant A” for money because they bring him traffic, otherwise they remove his page from the aggregator. Someone could call these simply directories, but I think a directory is different, because it is something you add your site to by filling in a form, not something that steals content from your existing site without specific acceptance from the site's owner. I also really wonder how Google will sort out all these messy sites packed with content that show up more and more, and everywhere, in search results.

    Read the article

  • Python GUI Scraper hanging issues.

    - by bball
    I wrote a scraper using Python a while back, and it worked fine on the command line. I have now made a GUI for the application, but I am having trouble with one issue. When I attempt to update text inside the GUI (e.g. 'fetching URL 12/50'), I am unable to, since the function within the scraper is grabbing 100+ links. Also, when going from one scraping function, to a function that should update the GUI, to another function, the GUI update function seems to be skipped over while the next scrape function runs. An example would be:

        scrapeLinksA()              # takes 20 seconds
        updateInfo("LinksA done")
        scrapeLinksB()              # takes another 20 seconds

    In the above example, updateInfo is never executed unless I end the program with a KeyboardInterrupt. I'm thinking my solution is threading, but I'm not sure. What can I do to fix this? I am using: PyQt4, urllib2, BeautifulSoup.
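
    The usual fix is indeed threading: run the scraping in a worker thread and send status text back to the GUI thread with a signal, so the event loop stays free to repaint. A minimal PyQt4 sketch, assuming updateInfo is a method on your window and the scrape functions are roughly as in the question (stubbed here with sleeps):

        import time
        from PyQt4 import QtCore

        def scrapeLinksA(): time.sleep(20)   # stand-ins for the question's scraping work
        def scrapeLinksB(): time.sleep(20)

        class ScrapeWorker(QtCore.QThread):
            progress = QtCore.pyqtSignal(str)    # status text destined for the GUI

            def run(self):                       # executes off the GUI thread
                scrapeLinksA()
                self.progress.emit("LinksA done")
                scrapeLinksB()
                self.progress.emit("LinksB done")

        # In the window class, instead of calling the scrape functions directly:
        #     self.worker = ScrapeWorker()
        #     self.worker.progress.connect(self.updateInfo)   # slot runs on the GUI thread
        #     self.worker.start()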

    Read the article

  • screen scraper templates for various websites

    - by intuited
    I'm looking specifically for a convenient way to locally archive posts from this and other similar sites. I'd like to separate the question itself from the answers, or maybe crop the question and store it, keeping the page title. Obviously I don't need to store the menu or the various other pieces of site interface chrome. The best way to do this would seem to be to associate an XSLT template with a match on the URL and use that template to pull out the relevant information and format it. My two-part question: Is there a tool specifically built for this task, i.e. something that takes a URL, checks it against a map of path-matching expressions to templates, and outputs the result of applying the template to that resource? xmlto seems to be most of the way there, and could probably just be called from a script that does the pattern matching, but something already integrated would be more convenient. Is such a URL_pattern-to-XSLT_template map publicly available somewhere? Question 2.5: Is it legal to do this with sites like this one that have public licenses on their content?
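
    For illustration, the dispatch step described above (match the URL against a map of patterns, then apply the matching XSLT) is small to sketch by hand; here in Python with lxml, where the pattern and the question.xsl stylesheet are placeholders you would supply:

        import re
        from lxml import etree, html

        # Hypothetical map from URL patterns to XSLT stylesheets.
        TEMPLATE_MAP = [
            (re.compile(r"stackoverflow\.com/questions/"), "question.xsl"),
        ]

        def archive(url, page_source):
            # Apply the first matching stylesheet to the page, or return None.
            for pattern, xsl_path in TEMPLATE_MAP:
                if pattern.search(url):
                    transform = etree.XSLT(etree.parse(xsl_path))
                    return str(transform(html.fromstring(page_source)))
            return None   # no template registered for this URL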

    Read the article

  • A good web data extraction/screen scraper program?

    - by Taylor
    I need to capture product data from a site on a regular basis and wondered if anyone knows of a good software program? I've trialed Mozenda, but it's a monthly subscription and pricey in the long term. Obviously something that's free would be best, but I don't mind paying either. I just need a decent program that's reliable and doesn't require much programming knowledge.

    Read the article

  • Scrapy cannot find div on this website [on hold]

    - by Jaspal Singh Rathour
    I am very new at this and have been trying to get my head around my first selector; can somebody help? I am trying to extract data from the page http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false, specifically all the info under div class="listing clearfix shelfListing", but I can't seem to figure out how to format response.xpath(). I have managed to launch the Scrapy console, but no matter what I type into response.xpath() I can't seem to select the right node. I know it works, because when I type response.xpath('//div[@class="container"]') I get a response, but I don't know how to navigate to the listing clearfix shelfListing div. I am hoping that once I get this bit I can continue working my way through the spider. Thank you in advance! PS: I wonder if it is not possible to scan this site; is it possible for the owners to block spiders?
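
    Two things usually trip this up. First, @class="listing clearfix shelfListing" only matches the exact attribute string, so a contains() test (or a CSS selector) is the usual workaround; a sketch, assuming the class names quoted in the question:

        # In the Scrapy shell, after fetching the page:
        listings = response.xpath('//div[contains(@class, "shelfListing")]')
        # or, equivalently, with a CSS selector:
        listings = response.css('div.listing.clearfix.shelfListing')

    Second, if those divs are filled in by JavaScript after the page loads (the #/shelf/... fragment suggests a client-side view), they will not appear in the HTML that Scrapy downloads at all, and the data would have to come from the underlying XHR endpoint instead.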

    Read the article

  • Building a simple Reddit scraper

    - by Bazant Fundator
    Let's say that I would like to make a collection of images from Reddit for my own amusement. I have run the code on my development environment and it hasn't gone past the first page of posts (anything beyond requires the after string from the JSON). Additionally, when I turn on the validation, the whole loop breaks if an item doesn't pass it, not just the current iteration. I would be glad if you helped me understand the mistakes I made.

        class Link
          include Mongoid::Document
          include Mongoid::Timestamps

          field :author, type: String
          field :url, type: String
          validates_uniqueness_of :url          # no duplicates
          validates :url, uniqueness: true
        end

        def fetch(count, after)
          count_s = count.to_s                  # convert count to string
          link = "http://reddit.com/r/aww/.json?count=" + count_s + "&after=" + after # so it can be used there
          res = HTTParty.get(link)              # GET req. to the reddit server
          json = JSON.parse(res.body)           # parse the response
          if json['kind'] == "Listing" then     # check if the retrieved item is a Listing
            for i in 1...(count) do             # for each list item
              datum = json['data']['children'][i]['data']                  # i-th element properties
              if datum['domain'].in?(["imgur.com", "i.imgur.com"]) then    # fetch only imgur links
                Link.create!(author: datum['author'], url: datum['url'])   # save to db
              end
            end
            count += 25
            fetch(count, json['data']['after']) # if it retrieved the right kind of object, move on to the next page
          end
        end

        fetch(25, " ")   # run it

    Read the article

  • How to protect SHTML pages from crawlers/spiders/scrapers?

    - by Adam Lynch
    I have A LOT of SHTML pages I want to protect from crawlers, spiders & scrapers. I understand the limitations of SSIs. An implementation of the following can be suggested in conjunction with any technology/technologies you wish: the idea is that if you request too many pages too fast, you're added to a blacklist for 24 hrs and shown a captcha instead of content on every page you request. If you enter the captcha correctly, you're removed from the blacklist. There is a whitelist, so GoogleBot, etc. will never get blocked. Which is the best/easiest way to implement this idea? Server = IIS. Cleaning the old tuples out of a DB every 24 hrs is easily done, so no need to explain that.
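
    For illustration only, the core bookkeeping behind that idea is small; a Python sketch of it (on IIS the same logic would live in a handler or module, and the window, threshold and whitelist values below are made-up placeholders):

        import time
        from collections import defaultdict, deque

        WINDOW = 60                 # seconds
        THRESHOLD = 120             # pages per WINDOW before blacklisting (assumed value)
        BAN_SECONDS = 24 * 3600     # 24 hr blacklist, as in the question
        BOT_WHITELIST = ("Googlebot", "bingbot")   # verify these by reverse DNS in practice

        hits = defaultdict(deque)   # ip -> recent request timestamps
        banned_until = {}           # ip -> unix time the ban expires

        def show_captcha_instead_of_content(ip, user_agent):
            # True if this request should get the captcha page rather than content.
            now = time.time()
            if any(bot in user_agent for bot in BOT_WHITELIST):
                return False
            if banned_until.get(ip, 0) > now:
                return True
            q = hits[ip]
            q.append(now)
            while q and q[0] < now - WINDOW:
                q.popleft()                       # drop requests older than the window
            if len(q) > THRESHOLD:
                banned_until[ip] = now + BAN_SECONDS
                return True
            return False

    Solving the captcha would then simply delete the ip from banned_until (or the corresponding DB row).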

    Read the article

  • Python Scraper for Javascript?

    - by Diego
    Hey all, can anyone direct me to a good Python screen-scraping library for JavaScript code (hopefully one with good documentation/tutorials)? I'd like to see what options are out there, but most of all the easiest to learn with the fastest results... wondering if anyone has experience. I've heard some stuff about SpiderMonkey, but maybe there are better ones out there? Specifically, I use BeautifulSoup and Mechanize to get to here, but I need a way to open the JavaScript popup, submit data, and download/parse the results in the popup. <a href="javascript:openFindItem(12510109)" onclick="s_objectID=&quot;javascript:openFindItem(12510109)_1&quot;;return this.s_oc?this.s_oc(e):true">Find Item</a> I'd like to implement this with Google App Engine and Django. Thanks!

    Read the article

  • Is this Anti-Scraping technique viable with Crawl-Delay?

    - by skibulk
    I want to prevent web scrapers from abusively scraping the 1,000,000 pages on my website. I'd like to do this by returning a "503 Service Unavailable" error code for users that access an abnormal number of pages per minute. I don't want search engine spiders to ever receive the error. My inclination is to set a robots.txt crawl-delay that will ensure spiders access a number of pages per minute under my 503 threshold. Is this an appropriate solution? Do all major search engines support the directive? Could it negatively affect SEO? Are there any other solutions or recommendations?

    Read the article

  • Web Scraper via Web Service API?

    - by 001
    How would I go about doing the following... I want to build a web service for my application to grab a piece of data from an external website that requires the user to log in. The website has no public API, hence the reason for the scraper. Is there a library to perform the following functions, or what do I do? Automate filling in the login form and clicking, automate the submit button, check which URL the user has landed on and redirect the user to that URL, and grab data from a label. EDIT: What I'm asking is: is there a web service, library, etc. to make it easier to perform these screen-scraping/automation functions?
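
    Without knowing the target site, a rough sketch of those steps with Python's requests and BeautifulSoup (every URL, form field and element id below is a placeholder, not the real site's):

        import requests
        from bs4 import BeautifulSoup

        session = requests.Session()            # keeps the login cookies between requests

        # 1-2. Fill in and submit the login form.
        resp = session.post("https://example.com/login",
                            data={"username": "user", "password": "secret"},
                            allow_redirects=True)

        # 3. See which URL the login landed on.
        print(resp.url)

        # 4. Grab data from a label on a page behind the login.
        page = session.get("https://example.com/account")
        soup = BeautifulSoup(page.text, "html.parser")
        label = soup.find("span", id="account-balance")
        print(label.get_text(strip=True) if label is not None else "label not found")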

    Read the article

  • Web scraping: how to get scraper implementation from text link?

    - by isme
    I'm building a Java web media-scraping application for extracting content from a variety of popular websites: YouTube, Facebook, RapidShare, and so on. The application will include a search capability to find content URLs, but it should also allow the user to paste a URL into the application if they already know where the media is. Youtube Downloader already does this for a variety of video sites. When the program is supplied with a URL, it decides which kind of scraper to use to get the content; for example, a YouTube watch link returns a YoutubeScraper, a Facebook fan page link returns a FacebookScraper, and so on. Should I use the factory pattern to do this? My idea is that the factory has one public method. It takes a String argument representing a link, and returns a suitable implementation of the Scraper interface. I guess the Factory would hold a list of Scraper implementations and would match the link against each Scraper until it finds a suitable one. If there is no suitable one, it throws an Exception instead.
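
    That is the classic factory shape; a compact sketch of the structure (shown in Python rather than Java purely for brevity, with illustrative class and method names, not the application's actual API):

        class Scraper:
            # Interface: knows whether it can handle a URL and how to scrape it.
            def matches(self, url): raise NotImplementedError
            def scrape(self, url): raise NotImplementedError

        class YoutubeScraper(Scraper):
            def matches(self, url): return "youtube.com/watch" in url
            def scrape(self, url): return "video metadata for " + url   # placeholder

        class ScraperFactory:
            def __init__(self, scrapers):
                self.scrapers = scrapers                    # ordered list of Scraper implementations

            def scraper_for(self, url):
                for scraper in self.scrapers:
                    if scraper.matches(url):
                        return scraper
                raise ValueError("no scraper registered for " + url)

        factory = ScraperFactory([YoutubeScraper()])
        scraper = factory.scraper_for("https://www.youtube.com/watch?v=abc")
        print(scraper.scrape("https://www.youtube.com/watch?v=abc"))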

    Read the article

  • BeautifulSoup and Mechanize to get an AJAX call result

    - by nabizan
    Hi, I'm building a scraper using Python 2.5 and BeautifulSoup, but I've stumbled upon a problem... Part of the web page is generated after the user clicks on some button, which starts an AJAX request by calling a specific JavaScript function with the proper parameters. Is there a way to simulate the user interaction and get this result? I came across the mechanize module, but it seems to me that this is mostly used to work with forms... I would appreciate any links or some code samples. Thanks.
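
    One common route, since Mechanize won't run the JavaScript, is to watch the browser's network traffic, note the URL the button's AJAX call actually requests, and fetch that URL directly; a small Python 2.5-style sketch (the endpoint and parameters are invented, yours would come from the page's JavaScript):

        import urllib2
        from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3, for Python 2.x

        # The URL the JavaScript function really calls, found via the browser's network tools.
        url = "http://example.com/ajax/fragment?item=123"
        request = urllib2.Request(url, headers={"X-Requested-With": "XMLHttpRequest"})
        html = urllib2.urlopen(request).read()

        soup = BeautifulSoup(html)                # parse the returned fragment as usual
        print soup.prettify()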

    Read the article

  • How to rebuild Safari Web Clip functionality in PHP

    - by Mayko
    Hi there, is there a way to rebuild Mac OS X Snow Leopard's Dashboard widget 'Web Clip' on a PHP website? Something like a crawler or scraper. I thought about using file_get_contents to get the page content into the page, but how do I select a section of the external page? And does this work with session/login content as well? I'm happy for any kind of suggestions! Cheers

    Read the article

  • Getting all pdf files from a domain (for example *.adomain.com)

    - by Zack
    I need to download all the PDF files from a certain domain. There are about 6000 PDFs on that domain and most of them don't have an HTML link (either they have removed the link or they never put one in the first place). I know there are about 6000 files because I'm googling: filetype:pdf site:*.adomain.com However, Google lists only the first 1000 results. I believe there are two ways to achieve this: a) Use Google. However, how can I get all 6000 results from Google? Maybe a scraper? (I tried Scroogle, no luck.) b) Skip Google and search directly on the domain for PDF files. How do I do that when most of them are not linked?

    Read the article

  • Convert a (nested) HTML unordered list of links to a PHP array of links

    - by Klark
    Hi, I have a regular, nested HTML unordered list of links, and I'd like to scrape it with PHP and convert it to an array. The original list looks something like this:

        <ul>
          <li><a href="http://someurl.com">First item</a>
            <ul>
              <li><a href="http://someotherurl.com/">Child of First Item</a></li>
              <li><a href="http://someotherurl.com/">Second Child of First Item</a></li>
            </ul>
          </li>
          <li><a href="http://bogusurl.com">Second item</a></li>
          <li><a href="http://bogusurl.com">Third item</a></li>
          <li><a href="http://bogusurl.com">Fourth item</a></li>
        </ul>

    Any of the items can have children. (The actual screen scraping is not a problem, I can do that.) I'd like to turn this into a PHP array of just the links, while keeping the hierarchical nature of the list. Any ideas? I've looked at using htmlsimpledom and phpQuery, which both use jQuery-like syntax, but I can't seem to get the syntax right. Thanks.

    Read the article

  • Anyone have a good solution for scraping the HTML source of a page with content (in this case, HTML tables) generated with Javascript?

    - by phpwns
    Anyone have a good solution for scraping the HTML source of a page with content (in this case, HTML tables) generated with Javascript? An embarrassingly simple, though workable, solution using Crowbar:

        <?php
        function get_html($url) // $url must be urlencode(d)
        {
            $context = stream_context_create(array(
                'http' => array('timeout' => 120) // HTTP timeout in seconds
            ));
            $html = substr(file_get_contents('http://127.0.0.1:10000/?url=' . $url . '&delay=3000&view=browser', 0, $context), 730, -32);
            // substr removes HTML added by the Crowbar web service, returning only the $url HTML
            return $html;
        }
        ?>

    The advantage of using Crowbar is that the tables will be rendered (and accessible) thanks to its headless Mozilla-based browser. The problem, of course, is being dependent on an external web service, especially given that SIMILE seems to undergo regular server maintenance. :( A pure PHP solution would be nice, but any functional (and reliable) alternatives would be great.

    Read the article

  • How can I block abusive bots from accessing my Heroku app?

    - by aem
    My Heroku (Bamboo) app has been getting a bunch of hits from a scraper identifying itself as GSLFBot. Googling that name produces various results from people who've concluded that it doesn't respect robots.txt (e.g., http://www.0sw.com/archives/96). I'm considering updating my app to keep a list of banned user-agents, serving all requests from those user-agents a 400 or similar, and adding GSLFBot to that list. Is that an effective technique, and if not, what should I do instead? (As a side note, it seems weird for an abusive scraper to use a distinctive user-agent.)
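
    For a Heroku Bamboo app the natural place for that check is a middleware in the app's own stack (e.g. Rack for a Rails app); the same idea, expressed as a short WSGI sketch in Python purely to illustrate the shape, with the ban list as the only real data:

        BANNED_USER_AGENTS = ("GSLFBot",)     # extend as needed

        def block_abusive_bots(app):
            # WSGI middleware: answer banned user-agents with 400 before the app runs.
            def middleware(environ, start_response):
                ua = environ.get("HTTP_USER_AGENT", "")
                if any(banned in ua for banned in BANNED_USER_AGENTS):
                    start_response("400 Bad Request", [("Content-Type", "text/plain")])
                    return [b"Bad Request"]
                return app(environ, start_response)
            return middleware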

    Read the article
