crawling - Page 3 - Developer IT

Problem with crawling oracle portal with SharePoint Server 2007 Search

- by John Hansen

We get a "No Index Attribute" error when we try to index Oracle Portal with the SharePoint Server 2007 Search crawler. The content source is added successfully. The error messages appear in the crawler log.

Read the article

Robots Crawling Across Namespace?

- by Codex73

I migrated a site from one domain to another and placed a permanent redirect on the old account. My stats logs are capturing this:

    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) /libro_metaboforte_chap5.php/members/members/file_chap6.php

I placed the following in robots.txt, which wasn't present at the time of migration.

robots.txt contents:

    User-agent: *
    Allow: /
    Disallow: /members/
    Disallow: /includes/

.htaccess file contents:

    DirectoryIndex index.php index.html
    Options +FollowSymlinks
    RewriteEngine On
    # Turn on the rewriting engine
    RewriteBase /
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_URI} !^/store/?$
    RewriteCond %{QUERY_STRING} !.
    RewriteRule ^.+/?$ index.php [QSA,L]
    RewriteCond %{QUERY_STRING} ^curlang=([a-z]*)$
    RewriteRule ^.+/?$ index.php? [QSA,L]

I will continue to log incoming bot requests. My .htaccess does rewrite, and I just added the robots file. The funny part is that the bot is stepping into doubled directories... I don't know whether the problem was the missing robots.txt or the rewrites in the existing .htaccess.
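A quick way to see what the robots.txt rules above actually cover is to evaluate them offline. The sketch below uses Python's standard `urllib.robotparser`; the helper name `is_allowed` is made up for illustration. Parsers differ on rule precedence (some take the first matching rule, while Google documents longest-match precedence), so this sketch lists the `Disallow` lines before `Allow: /`, an ordering on which both interpretations agree.

```python
from urllib.robotparser import RobotFileParser

# Same rules as in the question, with the Disallow lines listed first
# so that first-match and longest-match parsers give the same answer.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Disallow: /includes/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in (
        "/members/file_chap6.php",
        "/libro_metaboforte_chap5.php/members/members/file_chap6.php",
        "/index.php",
    ):
        print(path, is_allowed(ROBOTS_TXT, "Googlebot", path))
```

Note that the logged URL starts with `/libro_metaboforte_chap5.php/`, not `/members/`, so a prefix rule such as `Disallow: /members/` does not match it. That suggests the doubled directories are more likely coming from relative links combined with the rewrite rules than from the missing robots.txt.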

Read the article

Crawling within a pdf

- by Saubhagya

Hi, I'm developing a tool that searches a given site for a keyword entered by the user. My problem is that it searches only HTML/web pages, not the PDF/MS Word files found on the site. Can anyone suggest an API or tool, or provide code, that can search the text of an online PDF, MS Word, or text file?

Read the article

saving / mirroring / crawling web pages that use javascript to generate content

- by Nick Nolan

I want to download web pages that use JavaScript to output their data. Wget can do everything else except run JavaScript. Even something like: firefox -remote "saveURL(www.mozilla.org, myfile.html)" would be great (unfortunately that kind of command does not exist).

Read the article

Logic behind crawling webpages the way Screaming Frog does? [on hold]

- by sree

I would like to know which parameters should be considered when developing a crawler like Screaming Frog. I am looking for information on the dos and don'ts of webpage crawling. What problems can a crawler cause for the pages it crawls, such as increased load time, or anything else that affects a page during crawling? What rules does a crawler need to follow? Basically, any information that makes the crawler well-behaved and accurate. Just point me in the right direction. I hope my requirement is clear this time. :)

Read the article

need help in site classification

- by goh

Hi guys, I have to crawl the contents of several blogs. The problem is that I need to classify whether the blogs' authors are from a specific school and are talking about the school's affairs. What is the best approach to the crawling, and how should I go about the classification?

Read the article

How to enable indexing of pages with dynamic data?

- by mithunb

I have a site where certain URLs point to pages with permanent data and others point to dynamic pages. Google indexes both regularly. By the time a user finds one of the dynamic-content URLs, the data on the page has already changed and the user does not find what he was looking for. Further, the dynamic pages contain links to the permanent URLs (which I want Google or any crawler to index). Google's crawler controls (Webmaster Tools) cannot be made to read URLs from a page without indexing it. Are there solutions, for example in crawling strategy or system architecture?

Read the article

Is there a way to disallow only crawling in https in robots.txt?

- by David Wilkins

I just realized that Bingbot is crawling my company's website over https. Bing already crawls the site over http, so this seems frivolous. Is there a way to specify Disallow: / for https only? According to Wikipedia, each protocol has its own robots.txt. And according to Google's Robots.txt Specification, the robots.txt applies to http AND https. I don't want to Disallow: / for Bing totally, just over https.
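Crawlers generally look up robots.txt per scheme, host, and port, so one possible approach is to serve different robots.txt content depending on whether the request arrived over HTTPS. The following is only a sketch of that idea: `robots_url`, `robots_body`, and the `ROBOTS_BY_SCHEME` table are hypothetical names, and wiring `robots_body` into a real web server is left out.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical per-scheme policies: allow everything over http,
# disallow everything over https.
ROBOTS_BY_SCHEME = {
    "http": "User-agent: *\nDisallow:\n",
    "https": "User-agent: *\nDisallow: /\n",
}


def robots_url(page_url: str) -> str:
    """Return the robots.txt location a crawler would consult for page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def robots_body(scheme: str) -> str:
    """Return the robots.txt body to serve for a request made over scheme."""
    return ROBOTS_BY_SCHEME[scheme.lower()]


if __name__ == "__main__":
    print(robots_url("https://example.com/products/item?id=1"))
    print(robots_body("https"))
```

Whether this is preferable to redirecting one scheme to the other, or to declaring a canonical URL, depends on the site; the sketch only shows that the two schemes can be given separate rules.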

Read the article

scrapy - python question

- by tom smith

Hi. This may not be the correct place to post, but I'm going to try anyway! I've got a couple of test Python parsing scripts that I created. They work well enough for me to test what I'm working on. However, I recently came across Scrapy, the Python framework used for web scraping. My app runs as a distributed process across a testbed of multiple servers. I'm trying to understand Scrapy to see if it provides benefits over what I'm doing. So, if possible, I'd really like to talk with a few people who are grounded in, or who use, Scrapy. Thanks -tom

Read the article

Legality, terms of service for performing a web crawl

- by Berlin Brown

I was going to crawl a site for some research I was collecting, but apparently the terms of service are quite clear on the topic. Is it illegal to not "follow" the terms of service? And what can the site normally do? Here is an example clause in the TOS. Also, what about sites that don't provide this particular clause? Restrictions: "use any robot, spider, site search application, or other automated device, process or means to access, retrieve, scrape, or index the site" It is just research? Edit: "OK, from the standpoint of designing an efficient crawler, should I provide some form of natural language engine to read terms of service and then abide by them?"

Read the article

What techniques can be used to detect so called "black holes" (a spider trap) when creating a web crawler?

- by Tom

When creating a web crawler, you have to design some kind of system that gathers links and adds them to a queue. Some, if not most, of these links will be dynamic: they appear to be different but do not add any value, because they are specifically created to fool crawlers.

An example: we tell our crawler to crawl the domain evil.com by entering an initial lookup URL. Let's assume we let it crawl the front page initially, evil.com/index. The returned HTML will contain several "unique" links:

    evil.com/somePageOne
    evil.com/somePageTwo
    evil.com/somePageThree

The crawler will add these to the buffer of uncrawled URLs. When somePageOne is being crawled, the crawler receives more URLs:

    evil.com/someSubPageOne
    evil.com/someSubPageTwo

These appear to be unique, and so they are: the returned content is different from previous pages and the URL is new to the crawler. However, this is only because the developer has made a "loop trap" or "black hole". The crawler will add this new sub page, the sub page will have another sub page, which will also be added, and this process can go on infinitely. The content of each page is unique but totally useless (it is randomly generated text, or text pulled from a random source). Our crawler will keep finding new pages that we are not actually interested in.

These loop traps are very difficult to find, and if your crawler has nothing in place to prevent them, it will get stuck on a certain domain indefinitely.

My question is: what techniques can be used to detect so-called black holes? One of the most common answers I have heard is to introduce a limit on the number of pages to be crawled. However, I cannot see how this can be a reliable technique when you do not know what kind of site is to be crawled. A legitimate site, like Wikipedia, can have hundreds of thousands of pages, and such a limit could return a false positive for these kinds of sites. Any feedback is appreciated. Thanks.
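One way to bound the damage from such traps, without a flat page cap, is to filter URLs before they enter the queue. The sketch below is only an illustration of that idea: the class name `TrapGuard` and its thresholds are invented here, and the heuristics (link depth, path length, repeated path segments) cannot recognise randomly generated content. They only stop the crawl from descending forever.

```python
from collections import Counter
from urllib.parse import urlsplit


class TrapGuard:
    """Heuristic filter deciding whether a discovered URL should be queued."""

    def __init__(self, max_depth: int = 8, max_repeats: int = 2,
                 max_path_len: int = 200) -> None:
        self.max_depth = max_depth        # link hops from the seed URL
        self.max_repeats = max_repeats    # times one path segment may recur
        self.max_path_len = max_path_len  # characters in the URL path

    def should_enqueue(self, url: str, depth: int) -> bool:
        # Pages reached through very long link chains are rarely useful.
        if depth > self.max_depth:
            return False
        path = urlsplit(url).path
        # Traps often produce ever-growing paths.
        if len(path) > self.max_path_len:
            return False
        # Loops such as /a/b/a/b/a/b repeat the same segments.
        segments = [s for s in path.split("/") if s]
        if segments and max(Counter(segments).values()) > self.max_repeats:
            return False
        return True


if __name__ == "__main__":
    guard = TrapGuard()
    print(guard.should_enqueue("http://evil.com/somePageOne", depth=1))
    print(guard.should_enqueue("http://evil.com/someSubPageTwo", depth=9))
    print(guard.should_enqueue("http://evil.com/a/a/a/b", depth=2))
```

A depth limit treats a large legitimate site more fairly than a page limit, since most of its pages are reachable within a few hops of the front page, while a trap's pages sit ever deeper.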

Read the article

how to write a script that logs into an application and checks a page

- by josh

Is it possible to write a script that will log in to an application using a username and password? The username/password are not passed in through the URL. The basic steps I am looking for are:

1. Visit the URL
2. Enter the username/password
3. Click a button
4. Click a link
5. Get the raw HTML to make sure it does not contain a 500 error

Is that possible in any language? Please point me to some examples as well.

Read the article

how to crawl a file hosting website with scrapy in python?

- by Veryel Hua

Can anyone help me figure out how to crawl a file hosting website like filefactory.com? I don't want to download all the hosted files, just index all available files with Scrapy. I have read the tutorial and the docs on Scrapy's spider class. If I only give the website's main page as the start URL, I won't crawl the whole site, because crawling depends on links and the start page does not seem to point to any file pages. That is the problem as I see it, and any help would be appreciated!

Read the article

Why is Google Webmaster Tools crawling invalid URLS and showing 500 errors?

- by Amos Kane

Google Webmaster Tools is reporting 12k+ 500 errors. Eeek! None of the URLs are valid: they all contain www.youtube.com. First, why is Google crawling these URLs if they don't exist? I supplied a sitemap, and they are of course not in the sitemap. I don't have a robots.txt blocking anything. I've checked for invalid redirects (none), and checked for unclosed tags or something that would throw www.youtube.com into the URL by accident (none). In every 'linked from', the referring URL is also a bad URL with www.youtube.com in it. Google's tools report no malware, and I can't check the server logs because the host won't give me access. Really stuck!! Any ideas appreciated!

Read the article

Torchlight II Drops Today; New Classes and Miles of Atmospheric Dungeon Crawling Await

- by Jason Fitzpatrick

Torchlight II, sequel to the extremely popular Torchlight action-RPG, is available for sale today. With four new classes and a massively expanded world, you’ll have plenty to explore. The new release features extra classes, extra companion creatures, in-game weather systems, and of course updated graphics and a massively expanded game universe. Trumping all these additions, however, is LAN/internet co-op multiplayer, by far the feature most requested and anticipated by Torchlight fans. Check out the trailer video above to take a peek at the game, read more about it at the Torchlight II site, and then hit up the link below to grab a copy on Steam. You can pre-order it any time, but it won’t be officially available for download until 2PM EST today. Torchlight II is Windows-only, $19.99 for a single copy or $59.99 for a friend 4-pack (which includes a copy of Torchlight I).

Read the article

No description for any page on the website is available in Google despite robots.txt allowing crawling

- by Abhijit

I seem to have the weirdest issue with search engine optimization. I have asked the IT folks at my university and people on the Joomla forums, and I have been trying to sort this out using Google Webmaster Tools for more than 2 months, to little avail. I want to know if I have some blatantly wrong configuration somewhere that is causing search engines to be unable to index this site. I noticed a similar issue with another website I searched for online (ECEGSA - The University of British Columbia at gsa.ece.ubc.ca), which makes me believe this might be a concern that other people are looking for an answer to.

Here are the details. The website in question is http://gsa.ece.umd.edu/. It runs on Joomla 2.5.x (latest). The site has been up since around mid-December 2013, and I noticed right from the start that it was not being indexed correctly on Google. Specifically, I see the following message when I search for the website on Google: A description for this result is not available because of this site's robots.txt – learn more.

From December until around March I used the default Joomla robots.txt file, which is:

    User-agent: *
    Disallow: /administrator/
    Disallow: /cache/
    Disallow: /cli/
    Disallow: /components/
    Disallow: /images/
    Disallow: /includes/
    Disallow: /installation/
    Disallow: /language/
    Disallow: /libraries/
    Disallow: /logs/
    Disallow: /media/
    Disallow: /modules/
    Disallow: /plugins/
    Disallow: /templates/
    Disallow: /tmp/

Nothing there should stop Google from crawling my website. Even more confusingly, when I go to Google Webmaster Tools and try many of the site's links under the "Blocked URLs" tab, they all show up as "Allowed". I then tried adding a sitemap and referencing it in the robots.txt file. That did not help: same search result, same behavior in the "Blocked URLs" tab. In addition, the "Sitemaps" tab now reports the error "URL is robotted out" for several links. I tried those exact links in "Blocked URLs" and they are allowed! I then tried deleting the robots.txt file. No use. Same exact problem. Here is an example screenshot from Google's Webmaster Tools:

At this point I cannot give a rational explanation for why this is happening, and neither can anyone in the IT department here. No one on the Joomla forums seems to understand what is going on. Based on what I have explained, does it seem that I have set something incorrectly in robots.txt, in .htaccess, or somewhere else?

Read the article

Does Fetch as Googlebot still support their ajax-crawling proposal?

- by Gunchars

I spent half a day implementing the server side html generation for modal pages based on their proposal (link), but it seems like the Fetch as Googlebot functionality in Webmaster tools completely ignores the URL fragment. I've verified that the _escaped_fragment_ functionality is working on my server (example), but when I submit a URL like /#!/recipes, the Googlebot just fetches /. There aren't any recent confirmations that it's working and, honestly, it wouldn't surprise me if they just silently dropped the functionality without even editing the docs.
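To check the server side independently of Fetch as Googlebot, it can help to compute the URL a crawler following the AJAX-crawling proposal would request for a given `#!` URL and fetch that directly. The sketch below is an approximation under stated assumptions: `to_escaped_fragment` is a made-up helper name, and the set of characters left unescaped is a conservative guess that may not match the proposal's exact escaping rules.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_escaped_fragment(url: str) -> str:
    """Map a hash-bang URL to its _escaped_fragment_ form.

    URLs whose fragment does not start with "!" are returned unchanged.
    """
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url
    # Assumed safe set; the proposal's exact escaping may differ.
    value = quote(parts.fragment[1:], safe="/=:,;@!$'()*")
    param = "_escaped_fragment_=" + value
    query = parts.query + "&" + param if parts.query else param
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


if __name__ == "__main__":
    print(to_escaped_fragment("http://example.com/#!/recipes"))
```

If the server returns the expected snapshot for the rewritten URL, the server side is working and any remaining problem lies with the fetching tool.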

Read the article

Google is still crawling and indexing my old, dummy, test pages which now are 404 not found

- by Ace

I set up my site with sample pages and data (lorem ipsum, etc.) and Google crawled these pages. I have since deleted all of them and added real content, but in Webmaster Tools I still get a lot of 404 errors from Google trying to crawl the old pages. I have set them to "mark as resolved", but some pages still come back as 404. Furthermore, a lot of these sample pages are still listed when I search for my site on Google. How do I remove them? I think these irrelevant pages are hurting my ranking. I actually wanted to erase all these pages and have my site indexed as a new one, but I read that this is not possible. (I have submitted a sitemap and used "Fetch as Google.")

Read the article

web spidering/crawling, can i do it or just search engines?

- by bboyreason

I already had a question answered about web scraping with wget, but as I read a little more, I realize I may be looking for a web-crawling program, particularly the part about web crawlers being able to get specific data like links or, in my case, products. All of the products on my site follow this naming convention: website.com/uniqueAlphaNumericID.html. As far as I know, no dynamic content generation is being used, and there is only one page per item in the above format. Should I just be thinking about: wget website.com | grep *.html or should I be looking into spiders/crawlers?
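The `wget website.com | grep *.html` pipeline will not work as written: by default wget saves the page to a file instead of writing it to standard output, and `*.html` is a shell glob, not a grep pattern. For a naming convention as regular as the one described, a small link extractor may be enough before reaching for a full crawler. The sketch below uses only the Python standard library; `ProductLinkParser`, `product_links`, and the regular expression are illustrative assumptions about what a product URL looks like.

```python
import re
from html.parser import HTMLParser

# Assumed product URL shape: a final path segment made only of letters
# and digits, followed by ".html".
PRODUCT_HREF = re.compile(r"(?:^|/)[A-Za-z0-9]+\.html$")


class ProductLinkParser(HTMLParser):
    """Collect href values of <a> tags that look like product pages."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and PRODUCT_HREF.search(href):
            self.links.append(href)


def product_links(html: str) -> list[str]:
    """Return product-like links found in an HTML document."""
    parser = ProductLinkParser()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    sample = (
        '<a href="/abc123.html">Item</a>'
        '<a href="/about/">About</a>'
        '<a href="http://website.com/Z9.html">Other item</a>'
    )
    print(product_links(sample))
```

Feeding each fetched page through `product_links` and queuing the results is already a minimal crawler; a dedicated framework becomes worthwhile once politeness delays, retries, and deduplication are needed.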

Read the article

Where can i learn about search engine crawling and SEO?

- by acidzombie24

I have already asked "What should I know about search engine crawling?" Now I would like to know where I can learn about search engines and search engine optimization. Instead of reading dozens of articles that mostly say the same thing, I would like to read one book or resource and find everything I need to know.

Read the article

Scalable web-hosting for a youtube-like service (no, not porn) [closed]

- by Crawling Pasta Hellion

Possible Duplicate: How to find web hosting that meets my requirements?

My business partner and I are looking for a European web-hosting service (we are situated in Europe). The service needs to have:

- international servers, at the very least one server per continent;
- a high amount of bandwidth;
- high scalability, since we expect to start off small, but as our user base grows, so will everything else (again, no porn or phallic jokes);
- moderate to excellent customer service;
- of course, little downtime per annum;
- pricing that is affordable at first and fair as we grow.

I think that is all. Any input is greatly appreciated. Thank you in advance.

Read the article

Google indexing and ranking a custom domain served by Google App Engine

- by Hugues

I have a website at "http://www.plugimmo.com", which is a custom domain served by Google App Engine at http://plugimmo.appspot.com. For a while I have tried to improve its Google indexing and ranking, with no success. The problem is that searching Google for the keywords in the title of my home page does not return my website at all, not even in the first 1,000 results. When I check Google's cached version (cache:www.plugimmo.com), it says the cached version is that of 20-Aug-12 for "plugimmo.appspot.com". There appear to be several issues:

1 - The cached version is really old. I have made a lot of changes since 20-Aug-12, and I have seen Googlebot crawling my site several times.
2 - The cached version is for "plugimmo.appspot.com".
3 - In Google Webmaster Tools, the number of pages indexed for www.plugimmo.com is 0, which can't be right given the number of changes I have made since then.

My questions are therefore the following: Why is the cached version so old, although I have seen Googlebot crawling the site many times since 20-Aug-12? Is there a problem with indexing a custom domain served by Google App Engine? Why does Google Webmaster Tools show 0 pages indexed, although new pages have been crawled and no indexing errors have been reported?

Also, the site was developed with Google Web Toolkit, and I have followed the guidelines for crawling Ajax sites. When crawled by a robot, the home page is redirected to http://www.plugimmo.com/HomeSnapshot.html. Thanks a lot for your help! Hugues

Read the article

google changing crawl speed: doesn't seem to work. Why?

- by Olivier Pons

Three days ago I changed Google's crawl rate for my website. Here it is: This means: 2 requests per second. I got a message in Google Webmaster Tools saying that the speed change has been taken into account: But after more than three days, nothing has happened: still one request every ten seconds. See here: My web server is very fast and can handle up to twenty simultaneous connections. And my website is brand new, which means Google is almost the only one crawling it. After more than 30000 successful requests (= no 404), I think there's something going on... or maybe this is just a bug? Has anyone ever had this problem?

Search Results

Search found 241 results on 10 pages for 'crawling'.
