Search Results

Search found 346 results on 14 pages for 'scraping'.

Page 2/14

  • Python web scraping involving HTML tags with attributes

    - by rohanbk
    I'm trying to make a web scraper that will parse a web page of publications and extract the authors. The skeletal structure of the web page is the following:

        <html>
        <body>
        <div id="container">
        <div id="contents">
        <table>
        <tbody>
        <tr>
        <td class="author">####I want whatever is located here ###</td>
        </tr>
        </tbody>
        </table>
        </div>
        </div>
        </body>
        </html>

    I've been trying to use BeautifulSoup and lxml thus far to accomplish this task, but I'm not sure how to handle the two div tags and the td tag because they have attributes. In addition, I'm not sure whether I should rely more on BeautifulSoup or lxml, or on a combination of both. What should I do? At the moment, my code looks like this:

        import re
        import urllib2,sys
        import lxml
        from lxml import etree
        from lxml.html.soupparser import fromstring
        from lxml.etree import tostring
        from lxml.cssselect import CSSSelector
        from BeautifulSoup import BeautifulSoup, NavigableString

        address='http://www.example.com/'
        html = urllib2.urlopen(address).read()
        soup = BeautifulSoup(html)
        html=soup.prettify()
        html=html.replace('&nbsp', '&#160')
        html=html.replace('&iacute','&#237')
        root=fromstring(html)

    I realize that a lot of the import statements may be redundant, but I just copied whatever I currently had in my source file. EDIT: I suppose that I didn't make this quite clear, but I have multiple tags in the page that I want to scrape.
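
    The attributes on the div and td tags need no special handling: it is enough to match on the tag name plus its class. A minimal sketch using only Python 3's standard library (not the asker's BeautifulSoup/lxml code); `AuthorExtractor`, `extract_authors` and the sample markup are made up for illustration:

```python
from html.parser import HTMLParser


class AuthorExtractor(HTMLParser):
    """Collect the text of every <td class="author"> cell."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self._in_author = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "author" in classes:
            self._in_author = True
            self._parts = []

    def handle_data(self, data):
        if self._in_author:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_author:
            self.authors.append("".join(self._parts).strip())
            self._in_author = False


def extract_authors(html):
    """Return the text of all author cells, in document order."""
    parser = AuthorExtractor()
    parser.feed(html)
    parser.close()
    return parser.authors


# Hypothetical page following the skeleton in the question.
SAMPLE = """
<html><body><div id="container"><div id="contents"><table><tbody>
<tr><td class="author">A. Author, B. Writer</td></tr>
<tr><td class="author">C. Scholar</td></tr>
</tbody></table></div></div></body></html>
"""

print(extract_authors(SAMPLE))
```

    The same selection should be possible with the libraries already imported: in BeautifulSoup 3, `soup.findAll('td', {'class': 'author'})`, and in lxml, `root.xpath('//td[@class="author"]')`.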

  • Problem with Eastern European characters when scraping data from the European Parliament's website

    - by Thomas Jensen
    Dear experts, I am trying to scrape a lot of data from the European Parliament website for a research project. The first step is to create a list of all parliamentarians; however, due to the many Eastern European names and the accents they use, I get a lot of missing entries. Here is an example of what is giving me trouble (notice the accent at the end of the family name):

        ANDRIKIENĖ, Laima Liucija
        Group of the European People's Party (Christian Democrats)

    So far I have been using pyparsing and the following code:

        # parser_names
        name = Word(alphanums + alphas8bit)
        begin, end = map(Suppress, "<")
        names = begin + ZeroOrMore(name) + "," + ZeroOrMore(name) + end
        for name in names.searchString(page):
            print(name)

    However, this does not catch the name from the HTML above. Any advice on how to proceed? Best, Thomas
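
    A likely cause is that `alphas8bit` only covers the Latin-1 accented letters, while a letter such as Ė lies outside that range. One hedged alternative is to decode the page to text first and match with a Unicode-aware regular expression. The sketch below assumes Python 3, a UTF-8 page and names of the form "FAMILY, Given Names"; `find_names`, the sample markup and the second name are hypothetical:

```python
import re

# In Python 3, \w matches Unicode letters for str patterns, so
# [^\W\d_] means "any letter, accented or not".
LETTER = r"[^\W\d_]"
NAME_RE = re.compile(
    rf"({LETTER}+(?:[ '\-]{LETTER}+)*),\s+({LETTER}+(?:[ \-]{LETTER}+)*)"
)


def find_names(page_text):
    """Return (family_name, given_names) pairs found in decoded page text."""
    return NAME_RE.findall(page_text)


# Hypothetical raw bytes as they might arrive from the server.
SAMPLE_BYTES = "<li>ANDRIKIENĖ, Laima Liucija</li><li>NOVÁK, Jiří</li>".encode("utf-8")

# Decode before matching; matching on undecoded bytes loses the accented letters.
for family, given in find_names(SAMPLE_BYTES.decode("utf-8")):
    print(family, "|", given)
```

    If the page is not UTF-8, the encoding passed to `decode` has to match whatever the server declares.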

  • Scraping a page from a secure URL which is possibly using a session ID

    - by VN44CA
    How can I scrape a page like this? https://www.procom.ca/JobList.aspx?keywords=&Cities=&reference=&JobType=0 It is secure and possibly requires a referrer. I can't get anything using wget or httplib2. If you go through this page, you get a list, and it works in a browser but not from the command line: https://www.procom.ca/jobsearch.aspx I am interested in command-line fetching. Thanks.
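
    If the site hands out a session cookie on the search page and checks the referrer on the list page, a command-line fetch has to imitate both steps. The following is only a sketch under that assumption, using Python 3's standard library; whether this particular site behaves this way is not confirmed, and the function names and user-agent string are invented:

```python
import urllib.request
from http.cookiejar import CookieJar

SEARCH_URL = "https://www.procom.ca/jobsearch.aspx"
LIST_URL = "https://www.procom.ca/JobList.aspx?keywords=&Cities=&reference=&JobType=0"


def build_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Assumption: the server may reject the default Python user agent.
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (compatible; example-fetcher)")]
    return opener, jar


def build_list_request():
    """Build the job-list request, naming the search page as the referrer."""
    return urllib.request.Request(LIST_URL, headers={"Referer": SEARCH_URL})


def fetch_job_list():
    """Visit the search page first, then fetch the list with the same cookies."""
    opener, _jar = build_session()
    # Step 1: let the server set its session cookie.
    opener.open(SEARCH_URL, timeout=30).close()
    # Step 2: request the list with those cookies and a Referer header.
    with opener.open(build_list_request(), timeout=30) as response:
        return response.read()
```

    Calling `fetch_job_list()` performs the two requests in order; nothing runs on import.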

  • Screen scraping an application window and interacting with the mouse and keyboard

    - by ccook
    The other day I found myself addicted to a flash game and frustrated by the thing at the same time. In a moment of frustration with the game I thought I would make a 'bot' to beat it for me. Well, I really wouldn't, but it made me realize: I don't know how to interact with another application in a way to do this. Which brings me to the question: how would one take screenshots of another running application and interact with it with the keyboard and mouse? Ideally the solution would be in a managed language like C#. When I did the background reading, the net was drowning in articles on scraping HTML. There were not many articles on actually screen scraping an application. Diverse answers are appreciated as I’m really looking at surveying what’s out there.

  • Scraping Google docs (can't use API)

    - by Andy Waite
    I'm building an iPhone app which needs a piece of metadata from a user's Google Spreadsheet. Unfortunately the metadata I need is not exposed by the API, so I will need to scrape it from the document's HTML source (it would not be present in any of the exported variants). Is there any way to include authentication parameters in a call such as: http://spreadsheets.google.com/ccc?key=abc123&username=...&password=...

  • Scraping ASP.NET site with Ruby

    - by JillianK
    I would like to scrape the search results of this ASP.NET site using Ruby and preferably just using Hpricot (I cannot open an instance of Firefox): http://www.ngosinfo.gov.pk/SearchResults.aspx?name=&foa=0 However, I am having trouble figuring out how to go through each page of results. Basically, I need to simulate clicking on links like these: <a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$Pager1$2','')" class="blue_11" id="ctl00_ContentPlaceHolder1_Pager1">2</a> <a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$Pager1$3','')" class="blue_11" id="ctl00_ContentPlaceHolder1_Pager1">3</a> etc. I tried using Net::HTTP to handle the post, but while that received the correct HTML, there were no search results (I'm probably not doing that correctly). In addition, the URL of the page does not contain any parameters indicating the page, so it is not possible to force the results that way. Any help would be greatly appreciated.
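
    In ASP.NET Web Forms, `__doPostBack(target, argument)` normally copies its two arguments into the hidden fields `__EVENTTARGET` and `__EVENTARGUMENT` and submits the page's form, together with hidden state such as `__VIEWSTATE`. A POST that leaves out those hidden fields may well come back without results. The sketch below (Python rather than Ruby, standard library only) builds such a POST body from a previously fetched page; `HiddenFieldParser`, `build_postback_body` and the sample form are illustrative, and the idea should carry over to Net::HTTP:

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""


def build_postback_body(page_html, event_target, event_argument=""):
    """Return the form-encoded body that simulates a __doPostBack click."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    fields = dict(parser.fields)  # echo back __VIEWSTATE and similar state
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields).encode("ascii")


# Hypothetical, heavily shortened form as it might appear in the results page.
SAMPLE_PAGE = """
<form method="post" action="SearchResults.aspx?name=&amp;foa=0">
<input type="hidden" name="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" value="" />
<input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
</form>
"""

body = build_postback_body(SAMPLE_PAGE, "ctl00$ContentPlaceHolder1$Pager1$2")
decoded = parse_qs(body.decode("ascii"), keep_blank_values=True)
print(decoded["__EVENTTARGET"])
```

    The body would then be posted to the same URL the page came from, with any cookies from the first response, and the hidden fields re-read from each new page before requesting the next one.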

  • PHP Magento Screen Scraping

    - by Grant unwin
    I am trying to scrape a supplier's Magento site in an effort to save some time, because there are around 2000 products I need to gather info for. I'm totally OK with writing a screen scraper for pretty much anything, but I've encountered a major problem. I'm using file_get_contents to gather the HTML of the product page. The problem is: you need to be logged in to view the product page. It's a standard Magento login, so how can I get round this in my screen scraper? I don't require a full script, just advice on a method. Thanks

  • scrape a user's entire tweets

    - by whitman
    I'd like to pull all of a user's tweets. I could do this the hard way (manually scraping Twitter) or the easy way: using their API. The problem with the easy (API) way is that I seem to be limited to the 200 most recent tweets. What's a simple way to get all tweets? Thanks

  • Scrape zipcode table for different urls based on county

    - by Dr.Venkman
    I used lxml and ran into a wall, as my new computer won't install lxml and the code doesn't work. I know this is simple; maybe someone can help with a Beautiful Soup script. This is my code:

        import codecs
        import lxml as lh
        from selenium import webdriver
        import time
        import re

        results = []
        citys = ['amador']
        states = ['CA']
        for state in states:
            for city in citys:
                browser = webdriver.Firefox()
                link2 = 'http://www.getzips.com/cgi-bin/ziplook.exe?What=3&County='+ city +'&State=' + state + '&Submit=Look+It+Up'
                browser.get(link2)
                bcontent = browser.page_source
                zipcode = bcontent[bcontent.find('<td width="15%"'):bcontent.find('<p>')+0]
                if len(zipcode) > 0:
                    print zipcode
                else:
                    print 'none'
                browser.quit()

    Thanks for the help
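
    Since the lookup is a plain GET with query parameters, a browser driver and lxml may not be needed at all. A rough sketch in Python 3 with only the standard library: it builds the same URL per county and state and pulls five-digit values out of table cells. The assumption that each zip code sits alone in a `<td>` cell is a guess from the question's `<td width="15%"` search, and the sample rows are made up:

```python
import re
from urllib.parse import urlencode

BASE_URL = "http://www.getzips.com/cgi-bin/ziplook.exe"

# Assumption: each zip code is the sole content of a table cell.
ZIP_CELL_RE = re.compile(r"<td\b[^>]*>\s*(\d{5})\s*</td>", re.IGNORECASE)


def build_lookup_url(county, state):
    """Build the county lookup URL used in the question."""
    query = urlencode(
        {"What": 3, "County": county, "State": state, "Submit": "Look It Up"}
    )
    return f"{BASE_URL}?{query}"


def extract_zipcodes(html):
    """Return all five-digit cell values found in the page, in order."""
    return ZIP_CELL_RE.findall(html)


# Hypothetical fragment of a result page, for illustration only.
SAMPLE_ROWS = (
    '<tr><td width="15%">95601</td><td>Amador City</td></tr>'
    '<tr><td width="15%">95640</td><td>Ione</td></tr>'
)

print(build_lookup_url("amador", "CA"))
print(extract_zipcodes(SAMPLE_ROWS))
```

    Fetching the page itself could then be done with `urllib.request.urlopen(build_lookup_url(county, state))` inside the two loops.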

  • How does Cell Minute Tracker work?

    - by embedded
    It's been a mystery how Cell Minute Tracker manages to fetch AT&T users' data. Maybe someone here has the long-awaited answer. I'm really curious whether they got permission to scrape users' cellular reports, and how they can fire off multiple requests to the AT&T site without being banned. I'm waiting for someone who could shed some light on this mystery. Thanks. Link: http://www.uquery.com/apps/311637771-cell-minute-tracker-for-att

  • How to scrape user's data without being banned by the server?

    - by embedded
    I'm developing a site which monitors users' data. It uses cURL from PHP. It first gets authorized using a cookie and then parses the required data. My problem is that it needs to fire multiple requests to the server (for all registered users), and this may get me banned by the remote server. I would like to know if there is something I could do to prevent being banned. (This activity is legal: the users have provided their login information.) Thanks
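
    One common precaution is to space the requests out so the remote server never sees a burst. This does not guarantee anything about a particular server's ban policy; it is only a sketch of the throttling idea, written in Python rather than PHP, and the `Throttle` class is invented for illustration:

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage sketch: call wait() before each per-user request.
# throttle = Throttle(5.0)
# for user in users:          # `users` and `fetch_report` are hypothetical
#     throttle.wait()
#     fetch_report(user)
```

    In PHP the same effect can be had by calling `sleep()` or `usleep()` between the cURL requests.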

  • Screen Scraping When All You Have Is A Hammer

    I had decided to create a list of what videos were already available on the Learning Pages of Silverlight.net. When I clicked on the page for the entire list, however, I was quite daunted by the sheer number. I opened the source for the page, and found that there was an easy screen scraping [...]

  • Alternative, more efficient scraping method for a noncoder, than Google doc's importxml and xpath?

    - by binarybunny
    I've searched throughout the net for a simple solution, but it seems everyone has their own unique method (coding language) of achieving this. I'm only just beginning to learn Linux, and my coding skills are thoroughly lacking (non-existent). I love the simplicity of using importxml and xpath, but copying and pasting values after reaching the spreadsheet limit of 50 is getting old. Now that I've seen the light, I would really just like to know of a simple, yet scalable solution to get more data into more spreadsheets/databases. Before I really start getting my hands dirty, I would love to know some of the ways you guys go about accomplishing this?

  • Screen-scraping a site with an ASP.NET form login in C#?

    - by Ajit
    Hi friends, I've created a web application in ASP.NET. In it, I've tried to get some data (site scraping) from a secure page of a web site. I've used the HttpWebRequest class for this, but I haven't been able to access the secure page yet. Every time, the login page is scraped instead of the secure page. I have the site's user ID and password, but I don't know which language the site was developed in. Please advise: what should I do?

  • Screen scraping software that will traverse pages

    - by nilbus
    We're creating a mashup site that pulls information from many sources all over the web. Many of these sites don't provide RSS feeds or APIs to access the information they provide. This leaves us with screen scraping as our method for collecting the data. There are many scripting tools out there written in different scripting languages for screen scraping that require you to write scraping scripts in the language the scraper was written in. Scrapy, scrAPI, and scrubyt are a few written in Ruby and Python. There are other web-based tools I've seen, like Dapper, that create XML or RSS feeds based on a web page. It has a beautiful web-based interface that requires no scripting skills to use. This would be a great tool if it were able to traverse multiple pages to gather data from hundreds of pages of results. We need something that will scrape information from paginated web sites, much like scrubyt, but with a user interface that a non-programmer could use. We'll script up our own solution if we need to, probably using scrubyt, but if there's a better solution out there, we want to use it. Does anything like this exist?

  • Web scraping: how to get scraper implementation from text link?

    - by isme
    I'm building a Java web media-scraping application for extracting content from a variety of popular websites: YouTube, Facebook, RapidShare, and so on. The application will include a search capability to find content URLs, but should also allow the user to paste a URL into the application if they already know where the media is. Youtube Downloader already does this for a variety of video sites. When the program is supplied with a URL, it decides which kind of scraper to use to get the content; for example, a YouTube watch link returns a YoutubeScraper, a Facebook fan page link returns a FacebookScraper, and so on. Should I use the factory pattern to do this? My idea is that the factory has one public method. It takes a String argument representing a link, and returns a suitable implementation of the Scraper interface. I guess the factory would hold a list of Scraper implementations, and would match the link against each Scraper until it finds a suitable one. If there is no suitable one, it throws an exception instead.
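
    The design described above can be sketched in a few lines. The version below is in Python rather than Java and is only one way to arrange it: each scraper class declares which hosts it handles, and the factory walks its list until one accepts the link. The `hosts` attribute, `can_handle` and `create` are names chosen for this example; the scraper class names follow the question:

```python
from urllib.parse import urlparse


class Scraper:
    """Base interface: subclasses declare the hosts they can handle."""

    hosts = ()

    def __init__(self, url):
        self.url = url

    @classmethod
    def can_handle(cls, url):
        host = (urlparse(url).hostname or "").lower()
        return any(host == h or host.endswith("." + h) for h in cls.hosts)


class YoutubeScraper(Scraper):
    hosts = ("youtube.com", "youtu.be")


class FacebookScraper(Scraper):
    hosts = ("facebook.com",)


class ScraperFactory:
    """Hold a list of scraper classes and pick the first one that matches."""

    def __init__(self, scraper_classes):
        self._scraper_classes = list(scraper_classes)

    def create(self, url):
        for scraper_class in self._scraper_classes:
            if scraper_class.can_handle(url):
                return scraper_class(url)
        raise ValueError(f"no scraper available for {url!r}")


factory = ScraperFactory([YoutubeScraper, FacebookScraper])
scraper = factory.create("http://www.youtube.com/watch?v=abc123")
print(type(scraper).__name__)
```

    Letting each scraper decide whether it matches keeps the factory free of site-specific rules, so adding a site means adding one class to the list.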

  • How to use regular expressions to pull a substring? (screen scraping)

    - by Diego
    Hey guys, I'm really trying to understand regular expressions while scraping a site. I've been using them in my code enough to pull the following, but am stuck here. I need to quickly grab this: http://www.example.com/online/store/TitleDetail?detail&sku=123456789 from this: ('<a href="javascript:if(handleDoubleClick(this.id)){window.location=\'http://www.example.com/online/store/TitleDetail?detail&sku=123456789\';}" id="getTitleDetails_123456789">\r\n\t\t\t \tcheck store inventory\r\n\t\t\t </a>', 1) This is where I got confused. Any ideas?
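
    The tuple shown looks like the printed form of a Python string, so the backslashes before the quotes are probably only display escapes. A small sketch that anchors on `window.location=` and captures everything up to the closing quote, tolerating a literal backslash in case one really is present; `extract_url` is a name made up for this example:

```python
import re

# Match window.location= followed by an optionally backslash-escaped quote,
# then capture up to the next quote or backslash.
LINK_RE = re.compile(r"window\.location=\\?'([^'\\]+)")


def extract_url(html):
    """Return the URL assigned to window.location, or None if absent."""
    match = LINK_RE.search(html)
    return match.group(1) if match else None


html = (
    "<a href=\"javascript:if(handleDoubleClick(this.id))"
    "{window.location='http://www.example.com/online/store/TitleDetail?detail&sku=123456789';}\" "
    "id=\"getTitleDetails_123456789\">check store inventory</a>"
)

print(extract_url(html))
```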

  • Screen scraping C application without using OCR or DOM?

    - by Mrgreen
    We have a legacy system that is essentially a glorified telnet interface. We cannot use an alternative telnet client program to connect to the system, since there are special features built into the client software they have provided us. I want to be able to screen scrape from this program; however, that's proving very difficult. I have tried using WindowSpy and Spy++ to check the window text, and it comes up blank. It's a custom C program written by the vendor (they have even disabled selecting text). I'm really looking for a free option and something I may perhaps be able to use in conjunction with a scripting language. It seems the only ways to grab text are directly from the Windows GDI or from memory, but that seems a little extreme. Can anyone recommend any software/DLLs that might be able to accomplish this? I'd be extremely appreciative.
