Search Results

Search found 446 results on 18 pages for 'crawl'.

Page 10/18

How to make a jar file run on startup and keep running when you log out?

- by RanZilber

I have no idea where to start looking. I've been reading about daemons and didn't understand the concept. More details: I've been writing a crawler which never stops and crawls RSS feeds on the internet. The crawler is written in Java, so it's a jar right now. I'm an administrator on a machine running Ubuntu 11.04. There is some chance the machine will crash, so I'd like the crawler to run every time the machine starts up. Furthermore, I'd like it to keep running even when I'm logged out. I'm not sure this is possible, but most of the time I'm logged out, and I still want it to crawl. Any ideas? Can someone point me in the right direction? Just looking for the simplest solution.

Ubuntu slows down even after cpu-intensive process is ended

- by Matt2

After a Skype video call, or the use of VirtualBox, Ubuntu slows down to a crawl, even after the process has ended. Running htop reveals that processes that used little CPU before are now all using about 30% CPU (namely Compiz, Firefox, Python, and Skype, but I'm sure there are others), to the point where all my cores are at 99%. All I can do from here is restart. Any idea why this is happening? I'm running Ubuntu 12.04 64-bit on 3.7 GiB of memory, Intel® Core™ i3 CPU M 330 @ 2.13GHz × 4, VESA: M92 graphics driver. Not sure why I'm running VESA, I installed fglrx, but I suppose that's a different question. Thanks in advance!

Site returning 404 header to google, not sure why

- by Damon

A Drupal site that works fine for regular users returns a 404 Not Found error when I try to use the W3C validator on it; it is also not being indexed by Google at all (which is the main issue, but I suspect there is a connection). It is an https:// site with an .htaccess rule to redirect any http:// request to https://. I had it running in Google Webmaster Tools and thought it was fine, but it turns out I had not added the https domain. After adding the https domain, it's also returning the header as:

    HTTP/1.1 404 Not Found
    Date: Mon, 15 Oct 2012 19:37:43 GMT
    Server: Apache
    Expires: Sun, 19 Nov 1978 05:00:00 GMT
    Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0

Robots.txt just has:

    User-agent: *
    Crawl-delay: 10
    # Files
    Disallow: /cron.php

How can I check what the issue is here?
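One thing that can be ruled out offline is the robots.txt itself. The sketch below feeds the rules quoted in the question to Python's standard `urllib.robotparser` and asks whether a crawler may fetch a given path. The host `https://www.example.com` and the user agent name are placeholders (the question does not give the real site), and this only checks the robots rules, not why the server sends a 404 status.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the question's robots.txt.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
# Files
Disallow: /cron.php
"""


def load_rules(robots_txt):
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Placeholder host: the real domain is not given in the question.
home_allowed = rules.can_fetch("Googlebot", "https://www.example.com/")
cron_allowed = rules.can_fetch("Googlebot", "https://www.example.com/cron.php")
delay = rules.crawl_delay("Googlebot")
```

If `home_allowed` is true, the robots.txt is not what keeps the front page out of the index, which points back at the 404 status line the server is sending.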

Restricting A Directory Through .htaccess

- by Whitechapel

I'm trying to put all of my FTP accounts into a folder at /public_html/ftp and password-protect it so search bots can't crawl their private files. I'm also trying to redirect all site traffic from non-www to www. I keep getting 500 errors when accessing the site, and I need to point www.vivalanation.com/ftp to www.vivalanation.com/ftp/, because /ftp just errors out; you need the trailing slash. Here is my .htaccess in the /public_html/ftp folder:

    RewriteEngine on
    RewriteBase /
    RewriteCond %{HTTP_HOST} !^www\. [NC]
    RewriteRule ^(.*)$ http://www.%{HTTP_HOST}/$1 [R=301,L]
    AuthName "FTP Access"
    AuthType Basic
    AuthUserFile /home1/vivalst/.htpasswds/public_html/ftp/passwd
    Require valid-user

I created a passwd file in /.htpasswds/public_html/ftp. And here is my basic .htaccess in the root of /public_html/:

    RewriteEngine on
    RewriteBase /
    RewriteCond %{HTTP_HOST} !^www\. [NC]
    RewriteRule ^(.*)$ http://www.%{HTTP_HOST}/$1 [R=301,L]

If C-Panel Indexing Manager sets a folder to "No Indexing" can it be crawled by a webcrawler?

- by Graham

People are able to view directories / folders on my site right now. So, they could go to mysite.com/images and see the full index. To prevent this, C-Panel offers an option to set a directory / folder to "No Indexing" under the "Index Manager." Will this option allow webcrawlers to crawl / index the images? Or, is there a simpler alternative to block access to all folders directly while still having it SEO friendly? My old server restricted direct access to folders by default. But, the new one does not. Any ideas on this? Thanks!

Hide from google while developing

- by user210757

I will be building a (WordPress) web site. While I am developing, other team members will be pushing content. I'd like to have it hidden from Google while under development. It will be hosted on GoDaddy. I have thought of not pointing the domain name to it until it is live and using "preview DNS", or buying a static IP during development. Or hosting the dev site in a sub-directory ("/dev/") until ready and then moving it up a level. In the dev directory I'd add an .htaccess or robots.txt rule so it is not crawled. Is any of this a bad idea? Will Google penalize for any of this, for example by indexing the site by IP and then associating that with the domain later on? Any better ideas?

SEO: Influence search results per device class (mobile/desktop)

- by user32224

We're currently building a new responsive website and, while working on the site map, figured that we don't want to show certain sections on mobile devices. This can be easily done by hiding the navigation parts using CSS/media queries. However, the trouble is that the hidden pages would still show up in search engines' search results. If a user happens to click on one of these links she might see a badly formatted page, as we'd use desktop/tablet-only code to show images and video. Is there any way to "influence" search engines to exclude certain pages if the search is done on a mobile device? Do search engines crawl pages once, or twice with a device-specific view? Could we set a noindex meta tag for a specific device class?

Exclude pages from search results based on device class (mobile/desktop)

- by user32224

We're currently building a new responsive website. While working on the site map, we figured that we don't want to show certain sections on mobile devices. This can be easily done by hiding the navigation parts using CSS/media queries. However, the trouble is that the hidden pages would still show up in search engine results. If a user happens to click on one of these links she might see a badly formatted page, as we'd use desktop/tablet-only code to show images and video. Is there any way to influence the search engines to exclude certain pages if the search is done on a mobile device? Do search engines crawl pages once, or twice with a device-specific view? Could we set a noindex meta tag for a specific device class?

How to write a crawler?

- by Jason

Hi All, I have been thinking of writing a simple crawler that would crawl our NPO's websites and content and produce a list of its findings. Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does it send back its findings and still keep crawling? How does it know what it finds, etc.? Thanks! -Jason

Need help with site classification

- by goh

Hi guys, I have to crawl the contents of several blogs. The problem is that I need to classify whether the blogs' authors are from a specific school and are talking about the school's stuff. May I know the best approach to doing the crawling, or how I should go about the classification?

Oracle Secure Enterprise Search (SES) intranet crawling problem

- by vipin k.

I am using Oracle Secure Enterprise Search (SES) and its crawler to crawl an intranet site, but I am getting this error:

    EQG-30008: http://site-name/: Not found

I have added the logon user name and password and also added the proxy settings. Anybody who has worked on SES crawling, please take a look.

C#: Parsing HTML for general use?

- by Wardy

What is the best way to take a string of HTML and turn it into something useful? Essentially, if I take a URL and fetch the HTML from that URL in .NET, I get a response, but it comes in the form of a file, a stream, or a string. What if I want an actual document, or something I can crawl like an XmlDocument object? I have some thoughts and an already implemented solution for this, but I am interested to see what the community thinks.

Multiple Sitemap: entries in robots.txt?

- by user306942

I have been searching around using Google but I can't find an answer to this question. A robots.txt file can contain the following line:

    Sitemap: http://www.mysite.com/sitemapindex.xml

But is it possible to specify MULTIPLE sitemap index files in the robots.txt and have the search engines recognize that and crawl ALL of the sitemaps referenced in each sitemap index file? For example, will this work:

    Sitemap: http://www.mysite.com/sitemapindex1.xml
    Sitemap: http://www.mysite.com/sitemapindex2.xml
    Sitemap: http://www.mysite.com/sitemapindex3.xml
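The sitemaps protocol does allow more than one Sitemap line per robots.txt; whether a particular search engine then crawls everything referenced is up to that engine. As a small illustration of how a parser sees such a file, the sketch below uses Python's standard `urllib.robotparser` (its `site_maps()` method needs Python 3.8 or later). The `User-agent` and `Disallow` lines are assumptions added to make the sample file complete; the three Sitemap URLs come from the question.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt: the Sitemap lines are from the question,
# the User-agent/Disallow lines are illustrative filler.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://www.mysite.com/sitemapindex1.xml
Sitemap: http://www.mysite.com/sitemapindex2.xml
Sitemap: http://www.mysite.com/sitemapindex3.xml
"""


def list_sitemaps(robots_txt):
    """Return every Sitemap URL declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


sitemaps = list_sitemaps(ROBOTS_TXT)
```

Here `sitemaps` holds all three URLs in file order, so a well-behaved parser does not stop at the first entry.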

Backlink-reporting website crawler?

- by Stewart

What tools are there out there to crawl a website and report, for each page, a list of pages within the website that link to it?
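Independent of any particular tool, the report itself is just the site's link graph inverted. The sketch below assumes a crawl has already produced a mapping from each page to the links found on it; the function name `build_backlink_report` and the sample paths are hypothetical.

```python
from collections import defaultdict


def build_backlink_report(outgoing_links):
    """Invert a {page: [links on that page]} mapping into
    {page: [internal pages that link to it]}.

    Links to pages outside the crawled set and self-links are ignored.
    """
    incoming = defaultdict(set)
    for source, targets in outgoing_links.items():
        for target in targets:
            if target in outgoing_links and target != source:
                incoming[target].add(source)
    # Every crawled page appears in the report, even with no backlinks.
    return {page: sorted(incoming[page]) for page in outgoing_links}


# Hypothetical crawl result for a three-page site.
crawled = {
    "/": ["/about", "/contact"],
    "/about": ["/", "/contact"],
    "/contact": ["/", "http://other.example/"],
}
report = build_backlink_report(crawled)
```

With this input, `report["/contact"]` lists `/` and `/about`, and the external link is dropped because it is not part of the crawled site.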

Website content crawling

- by klork

We have a Business Listings directory hosted on IIS 6 Windows 2003. Our competitors crawl and steal our content and customers. We have tried IP blocking using honeypot URLs and log parsing without much success. Is anyone aware of a network device or a proxy server that I can run in front of my web server to minimize this issue? All suggestions are highly appreciated.

Nutch crawling with seed URLs in a range

- by user365345

Some sites have a URL pattern like www..com/id=1 to www..com/id=1000. How can I crawl such a site using Nutch? Is there any way to provide seeds for fetching in a range?
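One workaround that does not depend on any range feature in the crawler is to generate the seed list up front, one URL per line in a plain-text seed file. A minimal sketch follows; the host `www.example.com` is a placeholder because the real host is elided in the question, and the function names are hypothetical.

```python
def seed_urls(template, first, last):
    """Expand a URL template over an inclusive id range."""
    return [template.format(id=i) for i in range(first, last + 1)]


def seed_file_text(urls):
    """Render URLs as seed-file content, one URL per line."""
    return "\n".join(urls) + "\n"


# Placeholder host: the original question elides the real one.
urls = seed_urls("http://www.example.com/id={id}", 1, 1000)
text = seed_file_text(urls)
```

Writing `text` to a file in the seed directory would then give the crawler all 1000 starting points explicitly.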

file_get_contents VS CURL, what has better performance?

- by ahmed

I am using PHP to build a web crawler that will crawl millions of URLs. Which is better for me in terms of performance: file_get_contents or cURL? Thanks

How to prevent all crawlers except good ones (Google, Bing, Yahoo) from accessing website content?

- by tranhuyhung

I only want to let Google, Bing, and Yahoo crawl my website to build their indexes. I do not want competing websites to use crawling services to steal my website's content. What should I do?

Is it possible to extract all PDFs from a site?

- by deming

Given a URL like www.mysampleurl.com, is it possible to crawl through the site and extract links to all PDFs that might exist? I've gotten the impression that Python is good for this kind of thing, but is it feasible? How would one go about implementing something like this? Also, assume that the site does not let you visit something like www.mysampleurl.com/files/
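The per-page part of this is small in Python. The sketch below uses only the standard library to pull PDF links out of one page's HTML; a full crawler would repeat it for every internal page it discovers, and it can only find PDFs that some page actually links to. The class and function names and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page's own URL.
        absolute = urljoin(self.base_url, href)
        # Look at the path only, so query strings do not hide the extension.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_links.append(absolute)


def find_pdf_links(html, base_url):
    """Return the PDF links found in one HTML document."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_links


# Hypothetical page content for illustration.
sample_html = (
    '<a href="/docs/a.pdf">A</a>'
    '<a href="page.html">B</a>'
    '<a href="files/B.PDF?x=1">C</a>'
)
links = find_pdf_links(sample_html, "http://www.mysampleurl.com/index.html")
```

Because links are discovered from pages rather than directory listings, this approach still works when the site blocks direct access to a folder such as /files/.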

Configure HTTP Post data input to Nutch before crawling a site

- by user365345

I have to crawl a site which lists items based on user input submitted through HTTP POST. How do I configure the HTTP POST submission details in Nutch? I found help on how to do HttpPostAuthentication, but nothing on how to submit POST data other than a username and password.

Mod_rewrite - How to tell Google to dynamically delete pages from their index after 7 days

- by Sattvic

Search engines like to crawl and index webpages or URLs, but what if your webpages/URLs have expired content and you do not want them to be indexed after so many days? Can you put an expiration in the URL and have mod_rewrite 301 redirect pages after a given expiration date? Or maybe a cron job to add a 301 redirect header to all expired pages?

A good open source web crawler for indexing Specific website for specific contents?

- by Peeyush

Hello. Please suggest a good open source web crawler written in C++, Java, or PHP. I just need to crawl/index some specific websites for specific content (images, text, videos). I know that there are already a lot of questions and answers about this topic on this website, but I am a little confused after reading all of them, so I am sorry if I am repeating the same question. Thanks in advance

Investment advice data dump analysis

- by portoalet

For my year-end pet project, I'd like to analyze investment advice and its correlation with stock market performance. The problem is, where do I get a (free) dump of investment advice data? Something like the stackoverflow.com data dump would be nice. Or maybe it's easier to do distributed crawling of public finance webpages for investment advice? By investment advice I mean buy/sell advice for stocks/forex, issued by an institution or investment advisor.

How to do 404 link testing through Selenium RC for a complete website?

- by user1726460

How can I check all of a website's links (mostly links that redirect to a 404 page) using Selenium RC? Previously I tried to do this using Xenu and Web Link Validator, but in their results most of the links show a 500 Internal Server Error, and the pages reported with a 500 Internal Server Error don't actually exist on the website. So how can we crawl through the website using Selenium RC?

How can I find unused CSS in an Ajax app?

- by Haroldo

I've been searching and I can't find any Firefox add-ons or JavaScript for finding unused CSS in Ajax apps. Dust-Me Selectors can do a site crawl, but I'm looking for something that examines loaded-in content. I'd like something where I can press 'record' and then make a load of clicks which will check off the used selectors. I'm hoping to find an existing tool rather than try to write my own with jQuery!
