Search Results

Search found 346 results on 14 pages for 'scraping'.


  • R: extracting "clean" UTF-8 text from a web page scraped with RCurl

    - by SlowLearner
    Using R, I am trying to scrape a web page and save the text, which is in Japanese, to a file. Ultimately this needs to be scaled to tackle hundreds of pages on a daily basis. I already have a workable solution in Perl, but I am trying to migrate the script to R to reduce the cognitive load of switching between multiple languages. So far I am not succeeding. Related questions seem to be this one on saving csv files and this one on writing Hebrew to an HTML file. However, I haven't been successful in cobbling together a solution based on the answers there. The pages are from Yahoo! Japan Finance and my Perl code looks like this:

        use strict;
        use HTML::Tree;
        use LWP::Simple;
        #use Encode;
        use utf8;

        binmode STDOUT, ":utf8";

        my @arr_links = ();
        $arr_links[1] = "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203";
        $arr_links[2] = "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201";

        foreach my $link (@arr_links){
            $link =~ s/"//gi;
            print("$link\n");
            my $content = get($link);
            my $tree = HTML::Tree->new();
            $tree->parse($content);
            my $bar = $tree->as_text;
            open OUTFILE, ">>:utf8", join("","c:/", substr($link, -4),"_perl.txt") || die;
            print OUTFILE $bar;
        }

    This Perl script produces a CSV file that looks like the screenshot below, with proper kanji and kana that can be mined and manipulated offline. My R code, such as it is, looks like the following. The R script is not an exact duplicate of the Perl solution just given, as it doesn't strip out the HTML and leave the text (this answer suggests an approach using R but it doesn't work for me in this case) and it doesn't have the loop and so on, but the intent is the same:

        require(RCurl)
        require(XML)

        links <- list()
        links[1] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7203"
        links[2] <- "http://stocks.finance.yahoo.co.jp/stocks/detail/?code=7201"

        txt <- getURL(links, .encoding = "UTF-8")
        Encoding(txt) <- "bytes"
        write.table(txt, "c:/geturl_r.txt", quote = FALSE, row.names = FALSE, sep = "\t", fileEncoding = "UTF-8")

    This R script generates the output shown in the screenshot below. Basically rubbish. I assume that there is some combination of HTML, text and file encoding that will allow me to generate in R a similar result to that of the Perl solution, but I cannot find it. The header of the HTML page I'm trying to scrape says the charset is utf-8, and I have set the encoding in the getURL call and in the write.table function to utf-8, but this alone isn't enough. The question: how can I scrape the above web page using R and save the text as CSV in "well-formed" Japanese text rather than something that looks like line noise? Edit: I have added a further screenshot to show what happens when I omit the Encoding step. I get what look like Unicode codes, but not the graphical representation of the characters. So it may be some kind of locale-related issue, but in the exact same locale the Perl script does provide useful output. So this is still puzzling.

    Read the article

  • Is there a way to programmatically extract the feed of a podcast from the iTunes page?

    - by J. Pablo Fernández
    From an iTunes page, like http://itunes.apple.com/us/podcast/this-week-in-tech-mp3-edition/id73329404, is there a way to extract the corresponding feed address? In this case it would be http://leoville.tv/podcasts/twit.xml. I know that if you open it in iTunes you can extract it manually, but I want to do it programmatically. There's a link to the website of the podcast, but it may not be accurate. In this case it points to a website with 20 podcasts on it.
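
    One possible route, sketched below in Java: Apple's iTunes lookup endpoint returns JSON for a given numeric id, and for podcasts the record generally includes a feedUrl field. The endpoint, the field name, and the regex-based JSON handling here are assumptions for illustration, not something confirmed by the question; a real program would use a JSON library.

        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class PodcastFeedLookup {
            public static void main(String[] args) throws Exception {
                String pageUrl = "http://itunes.apple.com/us/podcast/this-week-in-tech-mp3-edition/id73329404";

                // Pull the numeric id off the end of the iTunes page URL.
                Matcher idMatcher = Pattern.compile("/id(\\d+)").matcher(pageUrl);
                if (!idMatcher.find()) {
                    throw new IllegalArgumentException("No iTunes id found in " + pageUrl);
                }
                String id = idMatcher.group(1);

                // Ask the iTunes lookup endpoint for the record behind that id (returns JSON).
                HttpClient client = HttpClient.newHttpClient();
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create("https://itunes.apple.com/lookup?id=" + id))
                        .build();
                String json = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

                // Crude extraction of the "feedUrl" value; deliberately naive for the sketch.
                Matcher feedMatcher = Pattern.compile("\"feedUrl\"\\s*:\\s*\"([^\"]+)\"").matcher(json);
                System.out.println(feedMatcher.find() ? feedMatcher.group(1) : "feedUrl not present in response");
            }
        }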

    Read the article

  • C# WebClient - View source question

    - by Jim
    I'm using a C# WebClient to post login details to a page and read all the results. The page I am trying to load includes Flash (which, in the browser, translates into HTML). I'm guessing it's Flash to avoid being picked up by search engines? The Flash I am interested in is just text (not an image/video etc.), and when I "View Selection Source" in Firefox I do actually see the text, within HTML, that I want to see. (Interestingly, when I view the source for the whole page I do not see that text. Could this be related?) Currently, after I have posted my login details and loaded the HTML back, I see the page which does NOT show the Flash HTML (as if I had viewed source for the whole page). Thanks in advance, Jim. PS: I should point out that the POST is actually working; my login is successful.

    Read the article

  • CasperJS Load next page in loop

    - by SquiresSquire
    I've been working on a script which collates the scores for a list of users from a website. One problem, though: I'm trying to load the next page inside the while loop, but the function is not being loaded...

        this.thenOpen("http://www.url.com?ul=" + currentName + "&sortdir=desc&sort=lastfound", function (id) {
            return function () {
                this.capture("Screenshots/" + json.username[id] + ".png");
                if (!casper.exists(x("//*[contains(text(), 'That username does not exist in the system')]"))) {
                    if (casper.exists(x('//*[@id="ctl00_ContentBody_ResultsPanel"]/table[2]'))) {
                        this.thenEvaluate(tgsagc.tagNextLink);
                        tgsagc.cacheCount = 0;
                        tgsagc.continue = true;
                        this.echo("------------ " + json.username[id] + " ------------");
                        while (tgsagc.continue) {
                            this.then(function () {
                                this.evaluate(tgsagc.tagNextLink);
                                var findDates, pageNumber;
                                pageNumber = this.evaluate(tgsagc.pageNumber);
                                findDates = this.evaluate(tgsagc.getFindDates);
                                this.echo("Found " + findDates.length + " on page " + pageNumber);
                                tgsagc.checkFinds(findDates);
                                this.echo(tgsagc.cacheCount + " Caches for " + json.username[id]);
                                this.echo("Continue? " + tgsagc["continue"]);
                                return this.click("#tgsagc-link-next");
                            });
                        }
                        leaderboard[json.username[id]] = tgsagc.cacheCount;
                        console.log("Final Count: " + leaderboard[json.username[id]]);
                        console.log(JSON.stringify(leaderboard));
                    } else {
                        this.echo("------------ " + json.username[id] + " ------------");
                        this.echo("0 Caches Found");
                        leaderboard[json.username[id]] = 0;
                        console.log(JSON.stringify(leaderboard));
                    }
                } else {
                    this.echo("------------ " + json.username[id] + " ------------");
                    this.echo("No User found with that Username");
                    leaderboard[json.username[id]] = null;
                    console.log(JSON.stringify(leaderboard));
                }

    Read the article

  • How can I use Perl to grab text from a web page that is dynamically generated with JavaScript?

    - by bstullkid
    There is a website I am trying to pull information from in Perl; however, the section of the page I need is being generated with JavaScript, so all you see in the source is:

        <div id="results"></div>

    I need to somehow pull out the contents of that div and save it to a file using Perl/proxies/whatever. For example, the information I want to save would be document.getElementById('results').innerHTML; I am not sure if this is possible, or if anyone has any ideas or a way to do this. I was using a lynx source dump for other pages, but since I can't straightforwardly screen-scrape this page I came here to ask about it! If anyone is interested, the page is http://downloadcenter.trendmicro.com/index.php?clk=left_nav&clkval=pattern_file&regs=NABU and the info I am trying to get is the row about the ConsumerOPR.
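
    Two common routes are to request whatever URL the page's JavaScript itself fetches, or to let a headless browser run the script and then read the resulting DOM. Below is a rough sketch of the second route using the HtmlUnit headless browser from Java rather than Perl; treat the class and option names as assumptions to verify against the HtmlUnit version actually installed.

        import com.gargoylesoftware.htmlunit.WebClient;
        import com.gargoylesoftware.htmlunit.html.HtmlPage;

        public class DynamicDivDump {
            public static void main(String[] args) throws Exception {
                try (WebClient webClient = new WebClient()) {
                    // Let the page's JavaScript run and ignore script errors in third-party code.
                    webClient.getOptions().setJavaScriptEnabled(true);
                    webClient.getOptions().setThrowExceptionOnScriptError(false);

                    HtmlPage page = webClient.getPage(
                            "http://downloadcenter.trendmicro.com/index.php?clk=left_nav&clkval=pattern_file&regs=NABU");

                    // Give background XHR calls a few seconds to populate the page.
                    webClient.waitForBackgroundJavaScript(5000);

                    // Read the element the scripts filled in, roughly document.getElementById('results').innerHTML.
                    System.out.println(page.getElementById("results").asXml());
                }
            }
        }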

    Read the article

  • Add a script tag to HTML with JSoup, preserving path tags

    - by Bhuvnesh Pratap
    I am adding a new script tag to my DOM with JSoup like this:

        rootElement.after("<script src='<?=base_url()?>js/read.js'></script>")

    but what I end up getting is this:

        <script src="&lt;?=base_url()?&gt;js/read.js"></script>

    I intend to preserve the "<?=base_url()?>" string and end up with something like:

        <script src="<?=base_url()?>js/read.js"></script>

    How could I do this?
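
    One blunt workaround, assuming the document ends up serialized to a string before it is written out: insert a placeholder token that JSoup has no reason to escape, then swap the PHP short tag back in after calling outerHtml(). This is ordinary string substitution, not a JSoup feature; a minimal self-contained sketch follows, and the placeholder only needs to be a string that cannot otherwise occur in the document.

        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;
        import org.jsoup.nodes.Element;

        public class PreservePhpTag {
            public static void main(String[] args) {
                Document doc = Jsoup.parse("<html><head></head><body><div id=\"root\"></div></body></html>");
                Element rootElement = doc.getElementById("root");

                // Use a neutral placeholder so JSoup has nothing to entity-escape,
                // then restore the PHP short tag in the serialized output.
                rootElement.after("<script src=\"@@BASE_URL@@js/read.js\"></script>");
                String html = doc.outerHtml().replace("@@BASE_URL@@", "<?=base_url()?>");

                System.out.println(html); // contains <script src="<?=base_url()?>js/read.js"></script>
            }
        }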

    Read the article

  • Programmatically log in to a website and redirect the user to the logged-in page?

    - by Santhosh
    Hi, right now I have all the employees of my company log in to an external website using the company id, username and a password. We are trying to integrate it into an intranet portal which should provide seamless access to this website without requiring the user to enter these credentials. Is there any way of doing this programmatically (.NET C#)? Very similar to screen scraping: can I simulate the appropriate POST action and then redirect the user to the logged-in page? Any help is appreciated. Thanks.
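
    The question asks for .NET, but the shape of the answer is language-independent: replay the login form's POST server-side, keep the session cookies, and only then send the user on. A minimal sketch of that flow in Java follows; the URLs and form field names are invented for illustration, since the real site's details aren't given.

        import java.net.CookieManager;
        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;

        public class SeamlessLogin {
            public static void main(String[] args) throws Exception {
                // Hypothetical endpoint and field names; the real form would be inspected in the browser.
                String loginUrl = "https://external.example.com/login";
                String form = "companyId=ACME&username=jdoe&password=secret";

                // A cookie manager keeps the session cookie the site sets after a successful POST.
                HttpClient client = HttpClient.newBuilder()
                        .cookieHandler(new CookieManager())
                        .build();

                HttpRequest login = HttpRequest.newBuilder()
                        .uri(URI.create(loginUrl))
                        .header("Content-Type", "application/x-www-form-urlencoded")
                        .POST(HttpRequest.BodyPublishers.ofString(form))
                        .build();

                HttpResponse<String> response = client.send(login, HttpResponse.BodyHandlers.ofString());
                System.out.println("Login status: " + response.statusCode());

                // With the session established, follow-up requests can reach the logged-in page.
                HttpRequest landing = HttpRequest.newBuilder()
                        .uri(URI.create("https://external.example.com/home"))
                        .build();
                System.out.println(client.send(landing, HttpResponse.BodyHandlers.ofString()).statusCode());
            }
        }

    Note that cookies obtained by the portal's server do not automatically appear in the employee's browser, so in practice the portal either proxies the external site or arranges for the POST to happen from the browser itself.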

    Read the article

  • How to grab data from a website?

    - by Doug
    So, often, I check my accounts for different numbers. For example, my affiliate accounts: I check for a cash increase. I want to program a script that can log in to all these websites, grab the money value for me and display it on one page. How can I program this?
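
    A rough sketch of the usual pattern, shown here in Java with JSoup: POST the login form, carry the returned session cookies into the next request, then select the element that holds the figure. The URLs, form field names and CSS selector below are placeholders, since every affiliate site differs.

        import java.util.Map;
        import org.jsoup.Connection;
        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;

        public class BalanceCheck {
            public static void main(String[] args) throws Exception {
                // 1. Log in and keep the session cookies. Field names are placeholders.
                Connection.Response login = Jsoup.connect("https://affiliate.example.com/login")
                        .data("email", "me@example.com")
                        .data("password", "secret")
                        .method(Connection.Method.POST)
                        .execute();
                Map<String, String> cookies = login.cookies();

                // 2. Fetch the dashboard with those cookies and pull out the money value.
                Document dashboard = Jsoup.connect("https://affiliate.example.com/dashboard")
                        .cookies(cookies)
                        .get();
                String balance = dashboard.select("#balance").text(); // selector is a guess
                System.out.println("Balance: " + balance);
            }
        }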

    Read the article

  • Extract anything that looks like links from a large amount of data in Python

    - by Riz
    Hi, I have around 5 GB of HTML data which I want to process to find links to a set of websites and perform some additional filtering. Right now I use a simple regexp for each site and iterate over them, searching for matches. In my case links can be outside of "a" tags and be not well formed in many ways (like a "\n" in the middle of a link), so I try to grab as many "links" as I can and check them later in other scripts (so no BeautifulSoup/lxml/etc.). The problem is that my script is pretty slow, so I am thinking about ways to speed it up. I am writing a set of tests to check different approaches, but hope to get some advice :) Right now I am thinking about getting all links without filtering first (maybe using a C module or a standalone app, which doesn't use regexp but a simple search to get the start and end of every link) and then using regexp to match the ones I need.
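
    For what it's worth, the cheap first pass the question describes can be as simple as streaming the files and cutting out anything that starts like a URL, deferring validation to a later script. A small sketch of that pass follows, written in Java rather than Python; the pattern is only an assumption about what "looks like a link" here, and because it scans line by line it would still miss links broken across a newline, which would need a scan that spans line boundaries.

        import java.io.BufferedReader;
        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class LinkHarvester {
            // Deliberately permissive: grab from "http" up to whitespace or an obvious HTML delimiter,
            // and clean the candidates up in a later filtering step.
            private static final Pattern LINK = Pattern.compile("https?://[^\\s\"'<>]+");

            public static void main(String[] args) throws IOException {
                try (BufferedReader reader = Files.newBufferedReader(Path.of("dump.html"))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        Matcher m = LINK.matcher(line);
                        while (m.find()) {
                            System.out.println(m.group());
                        }
                    }
                }
            }
        }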

    Read the article

  • Programmatically grabbing text from a web page that is dynamically generated

    - by bstullkid
    There is a website I am trying to pull information from in Perl; however, the section of the page I need is being generated with JavaScript, so all you see in the source is:

        <div id="results"></div>

    I need to somehow pull out the contents of that div and save it to a file using Perl/proxies/whatever. For example, the information I want to save would be document.getElementById('results').innerHTML; I am not sure if this is possible, or if anyone has any ideas or a way to do this. I was using a lynx source dump for other pages, but since I can't straightforwardly screen-scrape this page I came here to ask about it!

    Read the article

  • How to extract data from a website using Java?

    - by giri
    Hi, I am familiar with the Java programming language. I would like to extract the data from a website and store it in my database running on my machine. Is that possible in Java? If so, which API should I use? For example, there are a number of schools listed on a website; how can I extract that data and store it in my database using Java?
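
    A minimal sketch of one common way to do this in Java: JSoup to fetch the page and select the rows, plain JDBC to insert them. The URL, CSS selectors and table schema below are made up for illustration and would be replaced with the real site's structure.

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.PreparedStatement;
        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;
        import org.jsoup.nodes.Element;

        public class SchoolImporter {
            public static void main(String[] args) throws Exception {
                // Fetch and parse the listing page (URL is a placeholder).
                Document doc = Jsoup.connect("https://example.com/schools").get();

                try (Connection db = DriverManager.getConnection(
                        "jdbc:mysql://localhost/mydb", "user", "password");
                     PreparedStatement insert = db.prepareStatement(
                        "INSERT INTO schools (name, city) VALUES (?, ?)")) {

                    // Selector assumes one table row per school; adjust to the real markup.
                    for (Element row : doc.select("table.schools tr")) {
                        insert.setString(1, row.select("td.name").text());
                        insert.setString(2, row.select("td.city").text());
                        insert.executeUpdate();
                    }
                }
            }
        }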

    Read the article

  • Getting the text that will be displayed to the user from HTML

    - by gordatron
    Bit of a random one: I am wanting to have a play with some NLP stuff and I would like to get all the text that will be displayed to the user in a browser from HTML. My ideal output would not have any tags in it and would only have full stops (and any other punctuation used) and newline characters, though I can tolerate a fairly reasonable amount of failure in this (random other stuff ending up in the output). If there was a way of inserting a newline or full stop in situations where the content was likely not to continue on, then that would be considered an added bonus; e.g. items in a ul or option tag could be separated by full stops (or, to be honest, just ignored). I am working in Java, but would be interested in seeing any code that does this. I can (and will, if required) come up with something to do this myself; I just wondered if there was anything out there like this already, as it would probably be better than what I come up with in an afternoon ;-). An example of the code I might write, if I do end up doing this, would be to use a SAX parser to find content in p tags, strip it of any span or strong etc. tags, and add a full stop if I hit a div or another p without having had a full stop. Any pointers or suggestions very welcome.
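
    Since the asker is in Java, a parser such as JSoup already gets most of the way there. The sketch below drops script/style content, walks a set of block-level tags, and emits each block's own text followed by a full stop and newline, which roughly matches the "add a break where the content probably doesn't continue" idea; the choice of which tags count as boundaries is an assumption.

        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;
        import org.jsoup.nodes.Element;

        public class VisibleTextExtractor {
            public static void main(String[] args) throws Exception {
                Document doc = Jsoup.connect("https://example.com/article").get();

                // Script and style content never reaches the reader, so drop it first.
                doc.select("script, style").remove();

                StringBuilder out = new StringBuilder();
                // Treat these tags as "the content probably doesn't continue" boundaries.
                for (Element block : doc.select("p, li, h1, h2, h3, h4, div, option")) {
                    String text = block.ownText().trim();  // own text only, so nested blocks aren't repeated
                    if (!text.isEmpty()) {
                        out.append(text);
                        if (!text.endsWith(".")) {
                            out.append('.');
                        }
                        out.append('\n');
                    }
                }
                System.out.println(out);
            }
        }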

    Read the article

  • Web Scraper via Web Service API?

    - by 001
    How would I go about doing the following: I want to build a web service for my application to grab a piece of data from an external website that requires the user to log in. The website has no public API, hence the reason for the scraper. Is there a library to perform the following functions, or what do I do?

    - Automate filling in the form, and automate clicking the submit button
    - Check which URL the user has landed on, and redirect the user to that URL
    - Grab data from a label

    EDIT: What I'm asking is: is there a web service, library, etc. to make it easier to perform these screen-scraping/automation functions?
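
    There is no single ready-made web service for this, but each step maps onto an ordinary HTTP-client call. A sketch in Java with JSoup follows; the login URL, field names and final selector are placeholders, and the "which URL did the user land on" check simply reads the response URL after redirects.

        import org.jsoup.Connection;
        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;

        public class ScrapeAfterLogin {
            public static void main(String[] args) throws Exception {
                // 1 + 2: fill in the form fields and "click" submit by replaying the POST.
                Connection.Response response = Jsoup.connect("https://example.com/login")
                        .data("username", "user")
                        .data("password", "secret")
                        .followRedirects(true)
                        .method(Connection.Method.POST)
                        .execute();

                // 3: check which URL we ended up on after the redirects.
                System.out.println("Landed on: " + response.url());

                // 4: grab the data from the label on the landed page (selector is a placeholder).
                Document page = response.parse();
                System.out.println("Label value: " + page.select("span.balance-label").text());
            }
        }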

    Read the article

  • What must I learn to parse dynamic HTML sites with PHP?

    - by butteff
    What must I learn to write a PHP website grabber (parser)? It must collect information from other websites, such as the weather forecast, the Wikipedia "on this day" page, some news and other useful and interesting "every day" information! Also, what must I read to write an m3u player in PHP? Sorry for my bad English.

    Read the article

  • UTF-8 conversion doesn't always work

    - by Marco Piccinni
    I searched through other questions before typing here and I didn't find anything similar. I have to scrape different UTF-8 web pages which contain text like "Oggi è una bellissima giornata"; the problem is with the character "è". I extract this text with JTidy and an XPath query expression, and I convert it with:

        byte[] content = filteredEncodedString.getBytes("utf-8");
        String result = new String(content, "utf-8");

    where filteredEncodedString contains the text "Oggi è una bellissima giornata". This procedure works on most web pages analyzed so far, but in some cases it doesn't produce a UTF-8 string. The page encoding is always the same, and the text is similar. Any ideas about the problem? Thanks, Marco
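
    As written, the getBytes/new String round trip is a no-op: it encodes a Java String to UTF-8 bytes and immediately decodes them back, so it cannot repair text that was mis-decoded further upstream. The sketch below shows where the charset usually has to be pinned down instead, namely when the raw bytes are handed to JTidy; the setter names are given from memory and should be checked against the JTidy version in use.

        import java.io.InputStream;
        import java.net.URL;
        import org.w3c.tidy.Tidy;

        public class TidyUtf8 {
            public static void main(String[] args) throws Exception {
                try (InputStream in = new URL("http://example.com/page.html").openStream()) {
                    Tidy tidy = new Tidy();
                    // Tell JTidy how the incoming bytes are encoded instead of letting it guess,
                    // and keep UTF-8 on the way out. (Setter names assumed; verify for your version.)
                    tidy.setInputEncoding("UTF-8");
                    tidy.setOutputEncoding("UTF-8");
                    tidy.setQuiet(true);

                    org.w3c.dom.Document dom = tidy.parseDOM(in, null);
                    // Run the usual XPath extraction against 'dom'; the String it yields is already
                    // proper Unicode, so no further getBytes/new String conversion is needed.
                    System.out.println(dom.getDocumentElement().getTagName());
                }
            }
        }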

    Read the article
