Search Results

Search found 261 results on 11 pages for 'crawler'.

Page 1/11 | 1 2 3 4 5 6 7 8 9 10 11 | Next Page >

What techniques can be used to detect so called "black holes" (a spider trap) when creating a web crawler?

- by Tom

When creating a web crawler, you have to design somekind of system that gathers links and add them to a queue. Some, if not most, of these links will be dynamic, which appear to be different, but do not add any value as they are specifically created to fool crawlers. An example: We tell our crawler to crawl the domain evil.com by entering an initial lookup URL. Lets assume we let it crawl the front page initially, evil.com/index The returned HTML will contain several "unique" links: evil.com/somePageOne evil.com/somePageTwo evil.com/somePageThree The crawler will add these to the buffer of uncrawled URLs. When somePageOne is being crawled, the crawler receives more URLs: evil.com/someSubPageOne evil.com/someSubPageTwo These appear to be unique, and so they are. They are unique in the sense that the returned content is different from previous pages and that the URL is new to the crawler, however it appears that this is only because the developer has made a "loop trap" or "black hole". The crawler will add this new sub page, and the sub page will have another sub page, which will also be added. This process can go on infinitely. The content of each page is unique, but totally useless (it is randomly generated text, or text pulled from a random source). Our crawler will keep finding new pages, which we actually are not interested in. These loop traps are very difficult to find, and if your crawler does not have anything to prevent them in place, it will get stuck on a certain domain for infinity. My question is, what techniques can be used to detect so called black holes? One of the most common answers I have heard is the introduction of a limit on the amount of pages to be crawled. However, I cannot see how this can be a reliable technique when you do not know what kind of site is to be crawled. A legit site, like Wikipedia, can have hundreds of thousands of pages. Such limit could return a false positive for these kind of sites. Any feedback is appreciated. Thanks.

Read the article
Web Crawler for Learnign Topics on Wikipedia

- by Chris Okyen

When I want to learn a vast topic on wikipedia, I don't know where to start. For instance say I want to learn about Binary Stars, I then have to know other things linked on that pages and linked pages on all the linked pages and so on for the specified number of levels. I want to write a web crawler like HTTracker or something similiar, that will display a heiarchy of the links on a certain page and the links on those linked pages.I wish to use as much prewritten code as possible. Here is an example: Pretending we are bending the rules by grabing links from only the first sentence of each pages The example archives and "processes" two levels deep The page is Ternary operation The First Level In mathematics a ternary operation is an N-ary operation The Second Level Under Mathmatics: Mathematics (from Greek µ???µa máthema, “knowledge, study, learning”) is the abstract study of topics encompassing quantity, structure, space, change and others; it has no generally accepted definition. Under N-ary In logic,mathematics, and computer science, the arity i/'ær?ti/ of a function or operation is the number of arguments or operands that the function takes Under Operation In its simplest meaning in mathematics and logic, an operation is an action or procedure which produces a new value from one or more input values ------------------------------------------------------------------------- I need some way to determine what oder to approach all these wiki pages to learn the concept ( in this case ternary operations )... Following along with this exmpakle, one way to show the path to read would a printout flowout like so: This shows that the first sentence of the Mathematics page doesn't link to the first sentence of pages linked on ternary page two levels deep. (Please tell me how I should explain this ) --- In otherwords, the child node of the top pages first sentence, ternary_operation, does not have any child nodes that reference the children of the top pages other children nodes- N-ary and operation. Thus it is safe to read this first. Since N-ary has a link to operations we shoudl read the operation page second and finally read the N-ary page last. Again, I wish to use as much prewritten code as possible, and was wondering what language to use and what would be the simpliest way to go about doing this if there isn't already somethign out there? Thank You!

Read the article
Asp.net Crawler Webresponse Operation Timed out.

- by Leon

Hi I have built a simple threadpool based web crawler within my web application. Its job is to crawl its own application space and build a Lucene index of every valid web page and their meta content. Here's the problem. When I run the crawler from a debug server instance of Visual Studio Express, and provide the starting instance as the IIS url, it works fine. However, when I do not provide the IIS instance and it takes its own url to start the crawl process(ie. crawling its own domain space), I get hit by operation timed out exception on the Webresponse statement. Could someone please guide me into what I should or should not be doing here? Here is my code for fetching the page. It is executed in the multithreaded environment. private static string GetWebText(string url) { string htmlText = ""; HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url); request.UserAgent = "My Crawler"; using (WebResponse response = request.GetResponse()) { using (Stream stream = response.GetResponseStream()) { using (StreamReader reader = new StreamReader(stream)) { htmlText = reader.ReadToEnd(); } } } return htmlText; } And the following is my stacktrace: at System.Net.HttpWebRequest.GetResponse() at CSharpCrawler.Crawler.GetWebText(String url) in c:\myAppDev\myApp\site\App_Code\CrawlerLibs\Crawler.cs:line 366 at CSharpCrawler.Crawler.CrawlPage(String url, List1 threadCityList) in c:\myAppDev\myApp\site\App_Code\CrawlerLibs\Crawler.cs:line 105 at CSharpCrawler.Crawler.CrawlSiteBuildIndex(String hostUrl, String urlToBeginSearchFrom, List1 threadCityList) in c:\myAppDev\myApp\site\App_Code\CrawlerLibs\Crawler.cs:line 89 at crawler_Default.threadedCrawlSiteBuildIndex(Object threadedCrawlerObj) in c:\myAppDev\myApp\site\crawler\Default.aspx.cs:line 108 at System.Threading.QueueUserWorkItemCallback.WaitCallback_Context(Object state) at System.Threading.ExecutionContext.runTryCode(Object userData) at System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode code, CleanupCode backoutCode, Object userData) at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx) at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem() at System.Threading.ThreadPoolWorkQueue.Dispatch() at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback() Thanks and cheers, Leon.

Read the article
Command-line HTTP crawler for Windows?

- by Pekka

Would somebody have a recommendation for a web site crawler that can be invoked and equipped with settings from the command line? This would need to run in a Windows environment. Saving the data, following stylesheet links etc. is not an issue. I only need the crawler to start with a page, parse it, and follow all the links on the same domain so that in the end, all pages on the site have been requested once. Background: I'm setting up a web site that gets frequently uploaded from an office location. Combining data from various sources, it has several levels of caching. I don't want the first user to visit the site after a fresh upload to have to wait until the page has been generated and saved in the cache.

Read the article
Site crawler/spider that tosses results into mysql

- by ian.evans

It's been suggested that we use mysql for our site's search as it'd be running on the same server that hosts our web server (nginx) and our db (mysql). Since not all of our pages are created from the database, it's been suggested that we have a crawler that can crawl the site, and toss the page url and data into mysql and have sphinx index on that. Does anyone know of an open source spider that has a mysql storing option out of the box. Thanks.

Read the article
Web crawler update strategy

- by superb

I want to crawl useful resource (like background picture .. ) from certain websites. It is not a hard job, especially with the help of some wonderful projects like scrapy. The problem here is I not only just want crawl this site ONE TIME. I also want to keep my crawl long running and crawl the updated resource. So I want to know is there any good strategy for a web crawler to get updated pages? Here's a coarse algorithm I've thought of. I divided the crawl process into rounds. Each round URL repository will give crawler a certain number (like , 10000) of URLs to crawl. And then next round. The detailed steps are: crawler add start URLs to URL repository crawler ask URL repository for at most N URL to crawl crawler fetch the URLs, and update certain information in URL repository, like the page content, the fetch time and whether the content has been changed. just go back to step 2 To further specify that, I still need to solve following question: How to decide the "refresh-ness" of a web page, which indicates the probability that this web page has been updated ? Since that is an open question, hopefully it will brought some fruitful discussion here.

Read the article
can a crawler be written entirely in javascript?

- by goss

I was wondering - can a crawler be written entirely in javascript? That way, the crawler is only called when a user needs the information and everything is run from the individual user's computer

Read the article
How to write a crawler?

- by Jason

Hi All, I have had thoughts of trying to write a simple crawler that might crawl and produce a list of its findings for our NPO's websites and content. Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does it send back its findings and still keep crawling? How does it know what it finds, etc,etc. Thanks! -Jason

Read the article
Remove subdomain from Google Crawler

- by Walter White

Hi all, I recently removed a sub-domain from my domain so I just have 1 website to manage. However, if I do a google search, my old domain is still there, I removed the sub-domain well over a week ago and if you try to access the domain directly, you will get an error saying the website can not be found (the records have been deleted). What is the easiest way to remove that sub-domain from google searches since they no longer exist? Shouldn't google see that the domain no longer exists and delete those entries? Walter

Read the article
Website crawler/spider to get site map

- by ack__

I need to retrieve a whole website map, in a format like : http://example.org/ http://example.org/product/ http://example.org/service/ http://example.org/about/ http://example.org/product/viewproduct/ I need it to be linked-based (no file or dir brute-force), like : parse homepage - retrieve all links - explore them - retrieve links, ... And I also need the ability to detect if a page is a "template" to not retrieve all of the "child-pages". For example if the following links are found : http://example.org/product/viewproduct?id=1 http://example.org/product/viewproduct?id=2 http://example.org/product/viewproduct?id=3 I need to get only once the http://example.org/product/viewproduct I've looked into HTTtracks, wget (with spider-option), but nothing conclusive so far. The soft/tool should be downloadable, and I prefer if it runs on Linux. It can be written in any language. Thanks

Read the article
Crawler do not create custom crawled properties

- by user173739

These days i have faced with very strange problem. I have development environment with MOSS 2007 SP 2 and WS 2008, i have search configured and everything works great. I have started to configuring staging environment (MOSS 2007 SP2 with June CU) and create new farm and new SSP. I have deployed my changes with package (wsp) and manually create site collections, sub webs, pages and so on. When fill crawl finishes, i see in Crawl log that all my pages have been successfully crawled and when i use some test tools to query search, my pages have been found. In crawl log there is few errors like http://mysite/sites/de/pages "The crawler could not communicate with the server. Check that the server is available and that the firewall access is configured correctly..", but all pages in this Page library were indexed. The problem is that i use custom managed properties (mapped to custom crawled properties) in search queries, but crawler didn't create crawled properties for all my new site columns. For example for site column IsAccent the crawler didn't create cralwed property ows_isAccesnt. I'm sure that i have created pages for specific content type and all my crawl categories have checked "Automatically discover new properties when a crawl takes place ". In site settings - Searchable columns i haven't got any column selected as Nocrowl. I tried to export my managed and crawled properties from dev environment to stage evironment but all my managed properties were empty, after that i recreated SSP...the result was the same... I checked specific page with tools like Sharepoint Manager 2007 and U2U Caml Query Builder 2007 that content type is correct, and i can see values of my custom site collumns.... Using U2U Caml Query Builder 2007 agains some Page library in Result tab i can see ows_IsAccent (my site collumn is IsAccent) and others site columns, but i can't find them in Crawled properties. Any idias?

Read the article
HTTP crawler in Erlang

- by ctp

I'm coding on a simple HTTP crawler but I have an issue running the code at the bottom. I'm requesting 50 URLs and get the content of 20+ back. I've generated few files with 150kB size each to test the crawler. So I think the 20+ responses are limited by the bandwidth? BUT: how to tell the Erlang snippet not to quit until the last file is not fetched? The test data server is online, so plz try the code out and any hints are welcome :) -module(crawler). -define(BASE_URL, "http://46.4.117.69/"). -export([start/0, send_reqs/0, do_send_req/1]). start() -> ibrowse:start(), proc_lib:spawn(?MODULE, send_reqs, []). to_url(Id) -> ?BASE_URL ++ integer_to_list(Id). fetch_ids() -> lists:seq(1, 50). send_reqs() -> spawn_workers(fetch_ids()). spawn_workers(Ids) -> lists:foreach(fun do_spawn/1, Ids). do_spawn(Id) -> proc_lib:spawn_link(?MODULE, do_send_req, [Id]). do_send_req(Id) -> io:format("Requesting ID ~p ... ~n", [Id]), Result = (catch ibrowse:send_req(to_url(Id), [], get, [], [], 10000)), case Result of {ok, Status, _H, B} -> io:format("OK -- ID: ~2..0w -- Status: ~p -- Content length: ~p~n", [Id, Status, length(B)]); Err -> io:format("ERROR -- ID: ~p -- Error: ~p~n", [Id, Err]) end. That's the output: Requesting ID 1 ... Requesting ID 2 ... Requesting ID 3 ... Requesting ID 4 ... Requesting ID 5 ... Requesting ID 6 ... Requesting ID 7 ... Requesting ID 8 ... Requesting ID 9 ... Requesting ID 10 ... Requesting ID 11 ... Requesting ID 12 ... Requesting ID 13 ... Requesting ID 14 ... Requesting ID 15 ... Requesting ID 16 ... Requesting ID 17 ... Requesting ID 18 ... Requesting ID 19 ... Requesting ID 20 ... Requesting ID 21 ... Requesting ID 22 ... Requesting ID 23 ... Requesting ID 24 ... Requesting ID 25 ... Requesting ID 26 ... Requesting ID 27 ... Requesting ID 28 ... Requesting ID 29 ... Requesting ID 30 ... Requesting ID 31 ... Requesting ID 32 ... Requesting ID 33 ... Requesting ID 34 ... Requesting ID 35 ... Requesting ID 36 ... Requesting ID 37 ... Requesting ID 38 ... Requesting ID 39 ... Requesting ID 40 ... Requesting ID 41 ... Requesting ID 42 ... Requesting ID 43 ... Requesting ID 44 ... Requesting ID 45 ... Requesting ID 46 ... Requesting ID 47 ... Requesting ID 48 ... Requesting ID 49 ... Requesting ID 50 ... OK -- ID: 49 -- Status: "200" -- Content length: 150000 OK -- ID: 47 -- Status: "200" -- Content length: 150000 OK -- ID: 50 -- Status: "200" -- Content length: 150000 OK -- ID: 17 -- Status: "200" -- Content length: 150000 OK -- ID: 48 -- Status: "200" -- Content length: 150000 OK -- ID: 45 -- Status: "200" -- Content length: 150000 OK -- ID: 46 -- Status: "200" -- Content length: 150000 OK -- ID: 10 -- Status: "200" -- Content length: 150000 OK -- ID: 09 -- Status: "200" -- Content length: 150000 OK -- ID: 19 -- Status: "200" -- Content length: 150000 OK -- ID: 13 -- Status: "200" -- Content length: 150000 OK -- ID: 21 -- Status: "200" -- Content length: 150000 OK -- ID: 16 -- Status: "200" -- Content length: 150000 OK -- ID: 27 -- Status: "200" -- Content length: 150000 OK -- ID: 03 -- Status: "200" -- Content length: 150000 OK -- ID: 23 -- Status: "200" -- Content length: 150000 OK -- ID: 29 -- Status: "200" -- Content length: 150000 OK -- ID: 14 -- Status: "200" -- Content length: 150000 OK -- ID: 18 -- Status: "200" -- Content length: 150000 OK -- ID: 01 -- Status: "200" -- Content length: 150000 OK -- ID: 30 -- Status: "200" -- Content length: 150000 OK -- ID: 40 -- Status: "200" -- Content length: 150000 OK -- ID: 05 -- Status: "200" -- Content length: 150000 Update: thanks stemm for the hint with the wait_workers. I've combined your and mine code but same behaviour :( -module(crawler). -define(BASE_URL, "http://46.4.117.69/"). -export([start/0, send_reqs/0, do_send_req/2]). start() -> ibrowse:start(), proc_lib:spawn(?MODULE, send_reqs, []). to_url(Id) -> ?BASE_URL ++ integer_to_list(Id). fetch_ids() -> lists:seq(1, 50). send_reqs() -> spawn_workers(fetch_ids()). spawn_workers(Ids) -> %% collect reference to each worker Refs = [ do_spawn(Id) || Id <- Ids ], %% wait for response from each worker wait_workers(Refs). wait_workers(Refs) -> lists:foreach(fun receive_by_ref/1, Refs). receive_by_ref(Ref) -> %% receive message only from worker with specific reference receive {Ref, done} -> done end. do_spawn(Id) -> Ref = make_ref(), proc_lib:spawn_link(?MODULE, do_send_req, [Id, {self(), Ref}]), Ref. do_send_req(Id, {Pid, Ref}) -> io:format("Requesting ID ~p ... ~n", [Id]), Result = (catch ibrowse:send_req(to_url(Id), [], get, [], [], 10000)), case Result of {ok, Status, _H, B} -> io:format("OK -- ID: ~2..0w -- Status: ~p -- Content length: ~p~n", [Id, Status, length(B)]), %% send message that work is done Pid ! {Ref, done}; Err -> io:format("ERROR -- ID: ~p -- Error: ~p~n", [Id, Err]), %% repeat request if there was error while fetching a page, do_send_req(Id, {Pid, Ref}) %% or - if you don't want to repeat request, put there: %% Pid ! {Ref, done} end. Running the crawler forks fine for a handful of files, but then the code even doesnt fetch the entire files (file size each 150000 bytes) - he crawler fetches some files partially, see the following web server log :( 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /10 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /1 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /3 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /8 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /39 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /7 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /6 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /2 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /5 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /50 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /9 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /44 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /38 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /47 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /49 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /43 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /37 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /46 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /48 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:00 +0200] "GET /36 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:01 +0200] "GET /42 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:01 +0200] "GET /41 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:01 +0200] "GET /45 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:01 +0200] "GET /17 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:01 +0200] "GET /35 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:01 +0200] "GET /16 HTTP/1.1" 200 150000 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:01 +0200] "GET /15 HTTP/1.1" 200 17020 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:01 +0200] "GET /21 HTTP/1.1" 200 120360 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:01 +0200] "GET /40 HTTP/1.1" 200 117600 "-" "-" 82.114.62.14 - - [13/Sep/2012:15:17:01 +0200] "GET /34 HTTP/1.1" 200 60660 "-" "-" Any hints are welcome. I have no clue what's going wrong there :(

Read the article
If an visitors IP address contains "google" or a similar keyword, does this mean they were a crawler?

- by Roscoe

Hi, I have a huge list of IP addresses recorded from various visitors to a website. A huge amount of the visitors, in some months over 70%, came from IP addresses that contained keywords such as google, yahoo, bot, crawler, etc. Does this mean that those users were infact search engine crawlers? If so, why are their so many crawlers in my visitor records in comparison to genuine human visitors? (and if not what's the explanation?) Thanks in advance.

Read the article
Building an automatic web crawler

- by Sakin

I am building a web application crawler that's meant not only to find all the links or pages in a web application, but also perform all the allowed actions in the app (such as pushing buttons, filling forms, notice changes in the DOM even if they did not trigger a request etc.) Basically, this is a kind of "browser simulator". I find WebKit a good option to implement my crawler, since it has all the needed technology (Javascript engine, parsers, DOM manipulation, etc.) but it seems kind of an overkill being a fully featured browser. Is there any toolkit you know that can provide the above functionality?

Read the article
What is a good Java crawler library?

- by DrDee

Hi, I am about to develop a crawler in Java but don't feel like reinventing the wheel. A quick Google search gives a whole bunch of Java libraries to build a web crawler. Besides that Nutch is of course a very robust package but seems a bit too advanced for my needs. I only need to crawl a handful websites a week containing a couple of 1000 pages each. Which open source Java library would you recommend considering: speed multithreading (or even distributed) extending it with new functionality active maintained and documentation?

Read the article
What is a good Java web crawler library?

- by DrDee

Hi, I am about to develop a crawler in Java but don't feel like reinventing the wheel. A quick Google search gives a whole bunch of Java libraries to build a web crawler. Besides that Nutch is of course a very robust package but seems a bit too advanced for my needs. I only need to crawl a handful websites a week containing a couple of 1000 pages each. Which open source Java library would you recommend considering: speed multithreading (or even distributed) extending it with new functionality active maintained and documentation?

Read the article
Web Crawler C# .Net

- by sora0419

I'm not sure if this is actually called the web crawler, but this is what I'm trying to do. I'm building a program in visual studio 2010 using C# .Net. I want to find all the urls that has the same first part. Say I have a homepage: www.mywebsite.com, and there are several subpage: /tab1, /tab2, /tab3, etc. Is there a way to get a list of all urls that begins with www.mywebsite.com? So by providing www.mywebsite.com, the program returns www.mywebsite.com/tab1, www.mywebsite.com/tab2, www.mywebsite.com/tab3, etc. ps. I do not know how many total sub pages there are. --edit at 12:04pm-- sorry for the lack of explanation. I want to know how to write a crawler in C# that do the above task. All I know is the main url www.mywebsite.com, and the goal is to find all its sub pages. -- edit at 12:16pm-- Also, there is no links on the main page, the html is basically blank. I just know that the subpages exist, but have no way to link to it except for providing the exact urls.

Read the article
Writing Crawler for Screen Scrapping

- by Muhammad Akhtar

I want to write crawler for screen scrapping What I want is, I want to get price of particular hotel from a website, like here is website e.g. In the above URL, there is list of hotels and its price. I want to get the price of the beaufort Please Advise how to accomplish this. Thanks

Read the article
web crawler needed

- by nightcoder1

does anybody know where i can get a free web crawler that actually works with minimal coding by me. ive googled it and can only find really old ones that dont work or openwebspider which doesnt seem to work. ideally id like to store just the web addresses and which links that page contains any suggestions? thanks

Read the article
Web crawler that can interpret javascript

- by user320662

Hi, I want to write a web crawler that can interpret JavaScript. Basically its a program in Java or PHP that takes a URL as input and outputs the DOM tree which is similar to the output in Firebug HTML window. The best example is Kayak.com where you can not see the resulting DOM displayed on the browser when you 'view source' but can save the resulting HTML though firebug.

Read the article
Writing a PHP web crawler using cron

- by Horse

Hi all I have written myself a web crawler using simplehtmldom, and have got the crawl process working quite nicely. It crawls the start page, adds all links into a database table, sets a session pointer, and meta refreshes the page to carry onto the next page. That keeps going until it runs out of links That works fine however obviously the crawl time for larger websites is pretty tedious. I wanted to be able to speed things up a bit though, and possibly make it a cron job. Any ideas on making it as quick and efficient as possible other than setting the memory limit / execution time higher?

Read the article
Is there an automated way to take site inventory?

- by leeand00

Is there a way to take site inventory using a crawler program that checks either the sources of images for specific servers that serve ads, or, that the crawler looks at a page for specific (html5?) tags like <aside> or some other tag to count the inventory of ad spaces available on a site? The crawler might additionally look at the size of the ads to categorize them into different classifications of ads. Also, what would a crawler like this be called?

Read the article
Copy a website and preserve the file & folder structure

- by DrStalker

I have an old web site running on an ancient version of Oracle Portal that we need to convert to a flat-html structure. Due to damage to the server we are not able to access the administrative interface, and even if we could there is no export functionality that can work with modern software versions. It would be enough to crawl the website and have all the pages & images saved to a folder, but the file structure needs to be preserved; that is, if a page is located at http://www.oldserver.com/foo/bar/baz/mypage.html then it needs to be saved to /foo/bar/baz/mypage.html so that the various Javascript bits will continue to function. None of the web crawlers I've found have been able to do this; they all want to rename the pages (page01.html, page02.html etc) and break the folder structure. Is there any crawler out there that will recreate the site structure as it appears to a user accessing the site? It doesn't need to redo any of teh content of the pages; once rehosted the pages will all have the same names they did originally so links will continue to work.

Read the article
MS Bing web crawler out of control causing our site to go down

- by akaDanPaul

Here is a weird one that I am not sure what to do. Today our companies e-commerce site went down. I tailed the production log and saw that we were receiving a ton of request from this range of IP's 157.55.98.0/157.55.100.0. I googled around and come to find out that it is a MSN Web Crawler. So essentially MS web crawler overloaded our site causing it not to respond. Even though in our robots.txt file we have the following; Crawl-delay: 10 So what I did was just banned the IP range in iptables. But what I am not sure to do from here is how to follow up. I can't find anywhere to contact Bing about this issue, I don't want to keep those IPs blocked because I am sure eventually we will get de-indexed from Bing. And it doesn't really seem like this has happened to anyone else before. Any Suggestions? Update, My Server / Web Stats Our web server is using Nginx, Rails 3, and 5 Unicorn workers. We have 4gb of memory and 2 virtual cores. We have been running this setup for over 9 months now and never had an issue, 95% of the time our system is under very little load. On average we receive 800,000 page views a month and this never comes close to bringing / slowing down our web server. Taking a look at the logs we were receiving anywhere from 5 up to 40 request / second from this IP range. In all my years of web development I have never seen a crawler hit a website so many times. Is this new with Bing?

Read the article
Appengine Apps Vs Google bot web crawler

- by sandeep koduri

i built an appengine web app cricket.hover.in. The web app consists of about 15k url's linked in it, But even after a long time of my launch, no pages are indexed on google. Any base link place on my root site hover.in are being indexed with in minutes. but i placed the same link home page of root site a long back. but its of no use. can any one analyse , if there is any issue with cricket.hover.in or if bots have any issues with Google app engine actually tested the url using labs app of webmaster tools of google there the return is fine and html is clear. but when tested the same (cricket.hover.in) at the following urls its showing different results of failure www.dnsqueries.com/en/googlebot_simulator.php www.smart-it-consulting.com/internet/google/googlebot-spoofer/ but if i test some of my php or word press links at the above url's the results are good and fine. please help me with this.

Read the article

Search Results

Search found 261 results on 11 pages for 'crawler'.

Page 1/11 | 1 2 3 4 5 6 7 8 9 10 11 | Next Page >

- by Tom

- by Chris Okyen

- by Leon

- by Pekka

- by ian.evans

- by superb

- by goss

- by Jason

- by Walter White

- by ack__

- by user173739

- by ctp

- by Roscoe

- by Sakin

- by DrDee

- by DrDee

- by sora0419

- by Muhammad Akhtar

- by nightcoder1

- by user320662

- by Horse

- by leeand00

- by DrStalker

- by akaDanPaul

- by sandeep koduri

1 2 3 4 5 6 7 8 9 10 11 | Next Page >