Search Results

Search found 446 results on 18 pages for 'crawl'.

Page 10/18

  • How to make a jar file run on startup & and when you log out?

    - by RanZilber
    I have no idea where to start looking. I've been reading about daemons and didn't understand the concept. More details: I've written a crawler that never stops and crawls RSS feeds on the internet. The crawler is written in Java, so right now it's a jar. I'm an administrator on a machine that runs Ubuntu 11.04. There is some chance the machine will crash, so I'd like the crawler to run every time the machine starts up. Furthermore, I'd like it to keep running even when I'm logged out. I'm not sure this is possible, but most of the time I'm logged out, and I still want it to crawl. Any ideas? Can someone point me in the right direction? Just looking for the simplest solution.

  • Ubuntu slows down even after cpu-intensive process is ended

    - by Matt2
    After a Skype video call, or the use of VirtualBox, Ubuntu slows down to a crawl, even after the process has ended. Running htop reveals that processes that used little CPU before are now all using about 30% CPU (namely Compiz, Firefox, Python, and Skype, but I'm sure there are others), to the point where all my cores are at 99%. All I can do from here is restart. Any idea why this is happening? I'm running Ubuntu 12.04 64-bit on 3.7 GiB of memory, Intel® Core™ i3 CPU M 330 @ 2.13GHz × 4, VESA: M92 graphics driver. Not sure why I'm running VESA; I installed fglrx, but I suppose that's a different question. Thanks in advance!

  • Restricting A Directory Through .htaccess

    - by Whitechapel
    I'm trying to put all of my FTP accounts into a folder at /public_html/ftp and password-protect it so search bots can't crawl their private files. I'm also trying to redirect all site traffic from non-www to www. I keep getting 500 errors when accessing the site, and I need to redirect www.vivalanation.com/ftp to www.vivalanation.com/ftp/, because /ftp without the trailing slash just errors out. Here is my .htaccess in the /public_html/ftp folder:

        RewriteEngine on
        RewriteBase /
        RewriteCond %{HTTP_HOST} !^www\. [NC]
        RewriteRule ^(.*)$ http://www.%{HTTP_HOST}/$1 [R=301,L]
        AuthName "FTP Access"
        AuthType Basic
        AuthUserFile /home1/vivalst/.htpasswds/public_html/ftp/passwd
        Require valid-user

    I created a passwd file in /.htpasswds/public_html/ftp. And here is my basic .htaccess in the root of /public_html/:

        RewriteEngine on
        RewriteBase /
        RewriteCond %{HTTP_HOST} !^www\. [NC]
        RewriteRule ^(.*)$ http://www.%{HTTP_HOST}/$1 [R=301,L]

  • Site returning 404 header to google, not sure why

    - by Damon
    A Drupal site that works fine for regular users returns a 404 Not Found error when I try to use the W3C validator on it; it is also not being indexed by Google at all (which is the main issue, but I suspect there is a connection). It is an https:// site with an .htaccess rule to redirect any http:// request to https://. I had it running in Google Webmaster Tools and thought it was fine, but it turns out I had not added the https domain. After adding the https domain, it also reports the header as:

        HTTP/1.1 404 Not Found
        Date: Mon, 15 Oct 2012 19:37:43 GMT
        Server: Apache
        Expires: Sun, 19 Nov 1978 05:00:00 GMT
        Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0

    Robots.txt just has:

        User-agent: *
        Crawl-delay: 10
        # Files
        Disallow: /cron.php

    How can I check what the issue is here?
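
One way to rule out robots.txt as the cause is to feed the rules quoted in the question to Python's standard `urllib.robotparser` and ask what a crawler would be allowed to fetch. This is only a sketch: the `example.com` URLs and the "Googlebot" user-agent string are placeholders, and it says nothing about why the server answers 404.

```python
from urllib.robotparser import RobotFileParser

# robots.txt content as quoted in the question.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
# Files
Disallow: /cron.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# example.com is a placeholder host; only the path matters for the rules.
home_allowed = parser.can_fetch("Googlebot", "https://example.com/")
cron_allowed = parser.can_fetch("Googlebot", "https://example.com/cron.php")
delay = parser.crawl_delay("Googlebot")

print(home_allowed, cron_allowed, delay)
```

Under these rules the home page is not blocked and only /cron.php is disallowed, so the robots.txt shown would not by itself explain the missing index entries.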

  • If C-Panel Indexing Manager sets a folder to "No Indexing" can it be crawled by a webcrawler?

    - by Graham
    People are able to view directories / folders on my site right now. So, they could go to mysite.com/images and see the full index. To prevent this, C-Panel offers an option to set a directory / folder to "No Indexing" under the "Index Manager." Will this option allow webcrawlers to crawl / index the images? Or, is there a simpler alternative to block access to all folders directly while still having it SEO friendly? My old server restricted direct access to folders by default. But, the new one does not. Any ideas on this? Thanks!

  • Hide from google while developing

    - by user210757
    I will be building a WordPress web site. While I am developing, other team members will be pushing content. I'd like to have it hidden from Google while under development. It will be hosted on GoDaddy. I have thought of not pointing the domain name to it until it goes live and using "preview DNS", or buying a static IP during development. Another option is hosting the dev site in a sub-directory ("/dev/") until ready and then moving it up a level. In the dev directory I'd add an .htaccess or robots.txt rule to block crawling. Is any of this a bad idea? Will Google penalize any of this, for example by indexing the site by IP and then associating that with the domain later on? Any better ideas?

  • Exclude pages from search results based on device class (mobile/desktop)

    - by user32224
    We're currently building a new responsive website. While working on the site map, we figured that we don't want to show certain sections on mobile devices. This can be easily done by hiding the navigation parts using CSS/media queries. However, the trouble is that the hidden pages would still show up in search engine results. If a user happens to click on one of these links, she might see a badly formatted page, as we'd use desktop/tablet-only code to show images and video. Is there any way to influence the search engines to exclude certain pages if the search is done on a mobile device? Do search engines crawl pages once, or twice with a device-specific view? Could we set a noindex meta tag for a specific device class?

  • SEO: Influence search results per device class (mobile/desktop)

    - by user32224
    We're currently building a new responsive website, and while working on the site map we figured that we don't want to show certain sections on mobile devices. This can be easily done by hiding the navigation parts using CSS/media queries. However, the trouble is that the hidden pages would still show up in search engines' results. If a user happens to click on one of these links, she might see a badly formatted page, as we'd use desktop/tablet-only code to show images and video. Is there any way to "influence" search engines to exclude certain pages if the search is done on a mobile device? Do search engines crawl pages once, or twice with a device-specific view? Could we set a noindex meta tag for a specific device class?

  • How to write a crawler?

    - by Jason
    Hi all, I have been thinking of writing a simple crawler that would crawl our NPO's websites and content and produce a list of its findings. Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does it send back its findings and still keep crawling? How does it know what it finds, and so on? Thanks! -Jason
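
The questions above (where to start, how to keep going, how to report findings) map onto a common pattern: a seed URL, a queue of pages still to visit, a set of pages already seen, and a results table filled in along the way. The sketch below is a minimal illustration of that pattern, not a production crawler. The `fetch` callable, the `crawl` and `LinkCollector` names, and the in-memory `PAGES` site are all assumptions made for the example; a real crawler would download pages over HTTP and respect robots.txt.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl that stays on the seed URL's host.

    fetch is a hypothetical callable: URL -> HTML string, or None on failure.
    Returns a dict mapping each visited URL to the absolute links found on it.
    """
    host = urlparse(seed_url).netloc
    frontier = deque([seed_url])   # pages waiting to be visited
    seen = {seed_url}              # pages already queued, to avoid loops
    findings = {}
    while frontier and len(findings) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        collector = LinkCollector()
        collector.feed(html)
        # Resolve relative links and drop #fragments.
        links = [urldefrag(urljoin(url, href)).url for href in collector.links]
        findings[url] = links
        for link in links:
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                frontier.append(link)
    return findings


# Made-up in-memory "site" so the sketch runs without network access.
PAGES = {
    "http://example.org/": '<a href="/about">About</a> <a href="/news">News</a>',
    "http://example.org/about": '<a href="/">Home</a> <a href="http://other.example/">Partner</a>',
    "http://example.org/news": '<a href="/about#team">Team</a>',
}

findings = crawl("http://example.org/", PAGES.get)
for page, links in findings.items():
    print(page, "->", links)
```

Here the findings are simply collected in a dictionary and returned at the end; writing each entry to a file or database inside the loop would let the crawler report as it goes.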

  • need help in site classification

    - by goh
    Hi guys, I have to crawl the contents of several blogs. The problem is that I need to classify whether the blogs' authors are from a specific school and whether they are talking about the school's stuff. May I know what's the best approach for doing the crawling, or how I should go about the classification?

  • C# Parsing html for general use?

    - by Wardy
    What is the best way to take a string of HTML and turn it into something useful? Essentially, if I take a URL and fetch the HTML from that URL in .NET, I get a response, but it comes in the form of a file, a stream, or a string. What if I want an actual document, or something I can crawl like an XmlDocument object? I have some thoughts and an already implemented solution for this, but I am interested to see what the community thinks.

  • Website content crawling

    - by klork
    We have a Business Listings directory hosted on IIS 6 Windows 2003. Our competitors crawl and steal our content and customers. We have tried IP blocking using honeypot URLs and log parsing without much success. Is anyone aware of a network device or a proxy server that I can run in front of my web server to minimize this issue? All suggestions are highly appreciated.

  • Multiple SiteMap: entries in robots.txt?

    - by user306942
    I have been searching around using Google but I can't find an answer to this question. A robots.txt file can contain the following line:

        Sitemap: http://www.mysite.com/sitemapindex.xml

    but is it possible to specify MULTIPLE sitemap index files in the robots.txt and have the search engines recognize that and crawl ALL of the sitemaps referenced in each sitemap index file? For example, will this work:

        Sitemap: http://www.mysite.com/sitemapindex1.xml
        Sitemap: http://www.mysite.com/sitemapindex2.xml
        Sitemap: http://www.mysite.com/sitemapindex3.xml
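
As a quick sanity check, the three lines from the question can be fed to Python's standard `urllib.robotparser`, whose `site_maps()` method (available from Python 3.8) returns every `Sitemap:` entry it found. This only shows how one parser reads the file; how a given search engine treats the entries is up to that engine.

```python
from urllib.robotparser import RobotFileParser

# The Sitemap lines exactly as proposed in the question.
ROBOTS_TXT = """\
Sitemap: http://www.mysite.com/sitemapindex1.xml
Sitemap: http://www.mysite.com/sitemapindex2.xml
Sitemap: http://www.mysite.com/sitemapindex3.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if there were none.
sitemaps = parser.site_maps()
for url in sitemaps:
    print(url)
```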

  • is it possible to extract all PDFs from a site

    - by deming
    Given a URL like www.mysampleurl.com, is it possible to crawl through the site and extract links for all PDFs that might exist? I've gotten the impression that Python is good for this kind of stuff, but is this feasible to do? How would one go about implementing something like this? Also, assume that the site does not let you visit something like www.mysampleurl.com/files/
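
The per-page step of such a crawl can be sketched with Python's standard library alone: parse a page's HTML, resolve each link against the page URL, and keep the ones whose path ends in `.pdf`. The `PdfLinkParser` and `extract_pdf_links` names and the sample HTML are made up for illustration. Only PDFs that are linked from some reachable page can be found this way, which matches the constraint that directory listings such as /files/ are not available.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collects absolute URLs of <a href> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)
        # Compare only the path so query strings do not hide the extension.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_links.append(absolute)


def extract_pdf_links(html, base_url):
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_links


# Made-up page content standing in for a downloaded page.
SAMPLE_HTML = """
<a href="/files/report.pdf">Report</a>
<a href="docs/Guide.PDF?download=1">Guide</a>
<a href="/about.html">About</a>
"""

pdf_links = extract_pdf_links(SAMPLE_HTML, "http://www.mysampleurl.com/")
print(pdf_links)
```

Running this function on every page visited by a site crawl, and merging the results, would give the full list of linked PDFs.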

  • Investment advice data dump analysis

    - by portoalet
    For my year-end pet project, I'd like to analyze investment advice and its correlation to stock market performance. The problem is, where do I get a (free) dump of investment advice data? Something like the stackoverflow.com data dump would be nice. Or maybe it's easier to do distributed crawling and crawl the public finance webpages for investment advice? Investment advice here means buy/sell advice for stocks/forex, issued by an institution or investment advisor.

  • How to do 404 link testing through selenium rc for complete website?

    - by user1726460
    How can I verify all of a website's links (mostly links that redirect to a 404 page) using Selenium RC? Previously I tried to do this using Xenu and Web Link Validator, but in their results most of the links show a 500 Internal Server Error, and the pages they report as 500 Internal Server Error don't actually exist on the website. So what is the approach for crawling through the website using Selenium RC?

  • A good open source web crawler for indexing Specific website for specific contents?

    - by Peeyush
    Hello, please suggest a good open source web crawler written in C++, Java, or PHP. I just need to crawl/index some specific websites for specific content (images, text, videos). I know that there are already a lot of questions and answers about this topic on this website, but I am a little confused after reading all of them. So I am sorry if I am repeating the same question again. Thanks in advance

  • how can i find unused css in ajax app?

    - by Haroldo
    I've been searching and I can't find any Firefox add-ons or JavaScript for finding unused CSS in Ajax apps. Dust-Me Selectors can do a site crawl, but I'm looking for something that examines loaded-in content. I'd like something where I can press 'record' and then make a load of clicks, which will check off the used selectors. I'm hoping to find an existing tool rather than try to write my own with jQuery!
