Search Results

Search found 149 results on 6 pages for 'crawlers'.

Stop bots from crawling old links with extensions

- by Jared

I've recently switched to MVC3, which is extension-less for the URLs, but Google and Bing have a wealth of links that they are crawling which no longer exist. So I'm trying to find out if there is a way to format robots.txt (or some other method) to tell Google/Bing that any link that ends in an extension isn't a valid link. Is this possible? On pages that I'm concerned a user may have saved as a favorite, I'm displaying a 404 page that lists the links to take once they are redirected to the new page (I decided not to just redirect them, as I don't want to maintain these forever). For Google/Bing's sake I do have the canonical tag in the header.

    User-agent: *
    Allow: /
    Disallow: /*.*

EDIT: I just added the 3rd line (in the text above) and it APPEARS to do what I want: allow a path, but disallow a file. Can anyone confirm this?
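One way to reason about the rules above is to simulate the wildcard matching that Google and Bing document for robots.txt (`*` matches any run of characters, a trailing `$` anchors the end, and the longest matching pattern wins, with Allow winning ties). The sketch below is only an approximation of that behaviour under those assumptions; `pattern_to_regex`, `is_allowed` and `RULES` are hypothetical names, not part of any crawler's API.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into an anchored regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' becomes '.*'; every other character is matched literally.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(path, rules):
    """Apply longest-match precedence; Allow wins when lengths tie."""
    best_len, allowed = -1, True  # no matching rule means allowed
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed

# The two rules from the question.
RULES = [("Allow", "/"), ("Disallow", "/*.*")]

for path in ("/products/list", "/old/page.aspx", "/v1.2/page"):
    print(path, is_allowed(path, RULES))
```

Under these assumptions the extension-less path stays allowed and the `.aspx` path is blocked, but so is any path with a dot anywhere in it (such as `/v1.2/page`), which may or may not be what is wanted.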

Is it worthwhile to block malicious crawlers via iptables?

- by EarthMind

I periodically check my server logs and I notice a lot of crawlers searching for the location of phpmyadmin, zencart, roundcube, administrator sections and other sensitive data. Then there are also crawlers under the name "Morfeus Fucking Scanner" or "Morfeus Strikes Again" searching for vulnerabilities in my PHP scripts, and crawlers that perform strange (XSS?) GET requests such as:

    GET /static/)self.html(selector?jQuery(
    GET /static/]||!jQuery.support.htmlSerialize&&[1,
    GET /static/);display=elem.css(
    GET /static/.*.
    GET /static/);jQuery.removeData(elem,

Until now I've always been storing these IPs manually to block them using iptables. But as these requests are only performed a maximum number of times from the same IP, I have my doubts whether blocking them provides any security advantage. I'd like to know if it does anyone any good to block these crawlers in the firewall, and if so, whether there's a (not too complex) way of doing this automatically. And if it's wasted effort, maybe because these requests come from new IPs after a while, can anyone elaborate on this and maybe suggest more efficient ways of denying/restricting malicious crawler access?

FYI: I'm also already blocking w00tw00t.at.ISC.SANS.DFind:) crawls using these instructions: http://spamcleaner.org/en/misc/w00tw00t.html
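The "automatic" part of the question can be sketched as a small log scan: count suspicious requests per IP and emit an iptables command for each IP over a threshold. This is a minimal illustration, assuming Apache combined-format log lines; the `SUSPICIOUS` pattern, the threshold and the function names are assumptions made for the example, and existing tools such as fail2ban cover the same ground more robustly.

```python
import re
from collections import Counter

# Leading part of an Apache combined-format line: ip, two fields, [time], "METHOD path
LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>[A-Z]+) (?P<path>\S+)')

# Hypothetical signatures: probed admin tools, plus characters that rarely
# appear in legitimate paths but show up in the odd jQuery-fragment requests.
SUSPICIOUS = re.compile(r"phpmyadmin|roundcube|zencart|w00tw00t|[()\[\]|;]", re.IGNORECASE)

def offenders(lines, threshold=3):
    """Return the sorted IPs with at least `threshold` suspicious requests."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and SUSPICIOUS.search(match.group("path")):
            hits[match.group("ip")] += 1
    return sorted(ip for ip, count in hits.items() if count >= threshold)

def iptables_commands(ips):
    """Build (but do not run) one DROP rule per offending IP."""
    return [f"iptables -A INPUT -s {ip} -j DROP" for ip in ips]
```

The commands are only returned as strings so they can be reviewed before anything touches the firewall; whether blocking is worthwhile when the source IPs rotate is exactly the open question above.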

The Know How Series - Understanding Search Engine Crawlers

While most internet users use a lot of search engines, hardly a handful really know how a search engine works. If you are an online marketer or your business relies heavily on the internet, it is imperative that you understand search engines and web crawlers. Search engines provide data at the flick of a button or at a single click.

Temporarily Utilizing 304 Header on Apache for Crawlers

- by Volomike

I have a client who has a hosting arrangement with 400 customer sites all hosted through SuPHP in CGI mode on Apache. The sysop is now gone and the client is calling on me to roll out a new PHP thing. Trouble is, server load is very high right now and we have found that it's due to the crawlers. We had one customer in particular who complained of slow websites, and we engaged a 304 header plugin in his site against most crawlers, and his site perked right up. We'd like to lower that load by issuing a global 304 header to all the crawlers, letting human visitors through. I have a long list of user agent keywords to trap for. What's the best way to temporarily engage that global 304 header, while allowing human visitors to get right on through? I mean, I could roll out 400 .htaccess file changes, but it would be ideal to make this change in one central Apache config and have it automatically affect all the sites at once.

Detecting 'stealth' web-crawlers

- by Jacco

What options are there to detect web-crawlers that do not want to be detected? (I know that listing detection techniques will allow the smart stealth-crawler programmer to make a better spider, but I do not think that we will ever be able to block smart stealth-crawlers anyway, only the ones that make mistakes.)

I'm not talking about the nice crawlers such as googlebot and Yahoo! Slurp. I consider a bot nice if it:

- identifies itself as a bot in the user agent string
- reads robots.txt (and obeys it)

I'm talking about the bad crawlers, hiding behind common user agents, using my bandwidth and never giving me anything in return.

There are some trapdoors that can be constructed (updated list, thanks Chris, gs):

- Adding a directory only listed (marked as disallow) in the robots.txt
- Adding invisible links (possibly marked as rel="nofollow"?):
  - style="display: none;" on the link or parent container
  - placed underneath another element with a higher z-index
- detect who doesn't understand CaPiTaLiSaTioN
- detect who tries to post replies but always fails the Captcha
- detect GET requests to POST-only resources
- detect the interval between requests
- detect the order of pages requested
- detect who (consistently) requests https resources over http
- detect who does not request image files (this, in combination with a list of user-agents of known image-capable browsers, works surprisingly nicely)

Some traps would be triggered by both 'good' and 'bad' bots. You could combine those with a whitelist:

1. It triggers a trap
2. It requests robots.txt?
3. It does not trigger another trap because it obeyed robots.txt

One other important thing here is: please consider blind people using screen readers. Give people a way to contact you, or to solve a (non-image) Captcha to continue browsing.

What methods are there to automatically detect the web crawlers trying to mask themselves as normal human visitors?

Update: The question is not "How do I catch every crawler?" The question is "How can I maximize the chance of detecting a crawler?"

Some spiders are really good, and actually parse and understand HTML, XHTML, CSS, JavaScript, VB script etc. I have no illusions: I won't be able to beat them. You would, however, be surprised how stupid some crawlers are, with the best example of stupidity (in my opinion) being: cast all URLs to lower case before requesting them. And then there is a whole bunch of crawlers that are just 'not good enough' to avoid the various trapdoors.
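The whitelist idea above (a trap fired, but the client fetched robots.txt and stayed out of the disallowed areas) can be sketched as a small classifier. Which traps count as "hard" (only a misbehaving client trips them) and which as "soft" (well-behaved bots trip them too) is an assumption made for this example, as are all the names.

```python
# Hypothetical grouping of the trapdoors listed above.
HARD_TRAPS = {"disallowed_dir", "hidden_link", "lowercased_url", "get_on_post_only"}
SOFT_TRAPS = {"no_images", "regular_interval"}  # good bots trigger these as well

def classify(fetched_robots_txt, traps):
    """Classify one client from the set of trap names it has triggered."""
    if traps & HARD_TRAPS:
        # It ignored robots.txt or followed bait a human would never see.
        return "bad bot"
    if traps & SOFT_TRAPS:
        # Bot-like behaviour, but it may be a polite crawler.
        return "good bot" if fetched_robots_txt else "suspect"
    return "probably human"

print(classify(True, {"no_images"}))        # polite crawler
print(classify(False, {"disallowed_dir"}))  # walked into the honeypot
```

A "suspect" result is a reason to look closer (or present a non-image Captcha), not proof; as the post says, the goal is to maximize the chance of detection, not to catch everything.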

AWStats: Visits from IP address vs Crawlers

- by user3651934

I use AWStats in cPanel to see stats for my website. Under the Hosts section I see one IP address that has visited 150 pages. I am not sure whether one person would have visited 150 pages using a browser. But if these 150 pages were visited by a software application, shouldn't it be listed under the Robots/Spiders section? So how do I determine whether I should block a certain IP address that has visited several hundred pages of my website? Thanks

Discover How Affiliates Marketers Can Optimize Their Blog For Search Engine Web Crawlers - #1

Search Engine Optimization is a FREE way of internet marketing that all internet marketers should take full advantage of. It is all about building links to your websites or blogs. There are 2 ways of doing it.

Do web crawlers/spiders index azure web sites?

- by Clay Shannon

For somebody who wants their web site to be as discoverable as possible (and who doesn't?), are Microsoft's Azure web sites (azurewebsites.net) a feasible domain to host sites? I have a site that is both on azurewebsites.net and hosted under a completely different name by discountasp.net. Both of these sites are exactly the same, except for the URL; whenever I update the code, I republish the site to both places. So obviously, they both have the same H1 and H2 elements. Searching for the content of my H1 tag, I find my .com site listed #3 on Google and #2 on both Bing and Yahoo; OTOH, my azurewebsites.net site doesn't show up on the first page at all, in any of them. This makes me wonder if azurewebsites.net should only be used for Web API hosting and suchlike, not for generic/commercial "public" sites. Are my conclusions valid?

Directing crawlers to content in language per language sub-domain

- by Noam

I have a multilingual website with many pages (40M). The site has UGC, and each translation is actually only for the titles. Each sub-domain points to the same content with different titles per language. As far as I understand, each sub-domain should be indexed by search engines, meaning they will actually need to crawl 40M x supported-languages. So I thought it might be best to direct each sub-domain's crawler to pages that are fully in that language (titles + UGC). Is there a way to do this? Should search engines understand this on their own?

What Are Search Engine Crawlers?

People don't know how they get relevant results for their search queries on a search engine. Most of them believe that these websites were submitted to the search engine. A few others think that there is some software tool that searches for the relevant websites. Robots and spiders are the software tools that keep searching the web to find new pages.

Robots.txt practices with .htaccess redirections (inherits)

- by Jayhal

I have a question regarding how to write robots.txt files for many domains and subdomains with redirects in place. We have a hosting account that has primary and add-on domains. All of our domains and subdomains, including the primary domain, are redirected via .htaccess 301s to their own subdirectories in the primary domain's root directory. I'm confused about how I would write the robots.txt for certain directories. First, I wanted to confirm I am right in understanding that, for domains and subdomains, crawlers will look to the directory that acts as that URL's root directory for the crawling rules (robots.txt). Also, that a directory will not be affected by a robots.txt present in its parent directory if the directory has its own domain/subdomain, and that URL is the one being accessed by crawlers. (I am pretty sure, but I wanted to confirm I didn't have a fundamentally flawed understanding of robots.txt.) In the original root directory on the account (where the primary domain was directed before the .htaccess was put in place), what should the robots.txt contain? When crawlers look to crawl our primary domain, will they look to the original root directory for the robots.txt, or will they reference the file contained in the new subdirectory where all the primary domain's site files are located? If so, what should the root's robots.txt include, if anything at all? Would I be right to include a simple 'disallow: /' for all agents, and then include more specific robots.txt files in each subdirectory with more specific instructions? Would that affect the crawling of the directory where the primary domain is now redirected? Any help is greatly appreciated. Thanks!

How do web crawlers affect site statistics?

- by LM

What are ways in which web crawlers (both from search engines and non-search engines) could affect site statistics (e.g., when doing AB-testing different page variations)? And what are ways to take care of these problems? For example: Do a lot of people writing web crawlers often delete their cookies and mask their IPs, so that web crawlers often show up as different users each time they crawl the site? What are heuristics to use to recognize that something is a bot? (I'm guessing any sophisticated enough bot can be indistinguishable from a real user, if it wants to -- is this correct?)

Is there a list of known web crawlers?

- by J. Pablo Fernández

I'm trying to get accurate download numbers for some files on a web server. I look at the user agents and some are clearly bots or web crawlers, but for many I'm not sure; they may or may not be web crawlers, and they are causing many downloads, so it's important for me to know. Is there a list somewhere of known web crawlers with some documentation like user agent, IPs, behavior, etc.? I'm not interested in the official ones, like Google's, Yahoo's, or Microsoft's. Those are generally well behaved and self-identified.
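As a first pass before consulting any external list, download counts can be filtered with a keyword heuristic on the user agent string. This is a rough sketch only: the keyword list in `BOT_HINTS` is an assumption, it catches just the clients that reveal themselves, and the ambiguous agents the question is really about would still need a maintained list or behavioural checks.

```python
import re

# Hypothetical substrings that commonly appear in self-identifying bots and tools.
BOT_HINTS = re.compile(r"bot|crawl|spider|slurp|wget|curl|libwww", re.IGNORECASE)

def looks_like_bot(user_agent):
    """True for empty user agents or ones containing a known hint."""
    return not user_agent.strip() or bool(BOT_HINTS.search(user_agent))

def human_downloads(user_agents):
    """Count the requests whose user agent does not look like a bot."""
    return sum(1 for ua in user_agents if not looks_like_bot(ua))
```

For example, passing the list of user agents recorded for one file to `human_downloads` gives a lower-noise download figure, at the cost of miscounting any bot that borrows a browser's user agent.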

Age verification forms and crawlers

- by user333763

I have created a website about a beer brand and had to include an age verification page. The verification script is written in PHP and uses sessions to store the verification variable. The script works in such a way that, no matter from which link you try to enter the website, it takes you to the verification page first. The verification is very simple. There are 2 buttons: "I'm under 21" and "I'm over 21". If you click the latter, you can browse the website. After some time I discovered that web crawlers are not able to get past the verification page. I checked the website in Google Webmaster Tools and the only text content scanned was from the verification page. I read somewhere that crawlers are not able to submit form buttons; is that true? Considering the fact that age verification pages are useless anyway, maybe I should just leave it as a starting page but not forbid going around it, e.g. via links to the subpages?

How to fix Google 404 not found Crawl Errors?

- by Freeme

I was checking Google Webmaster Tools for my blog site to see if there's any indication of why my blog traffic decreased to half in one day, and I saw 43 Not Found crawl errors and 5 Not Found errors in the Sitemap. The 5 Not Found errors in the Sitemap were the links to categories. I guess I renamed categories; that's why Google can't find the links. As for the 43 other Not Found errors, I see blog post titles that contain (' .), e.g. McDonald's, O.N.E. They weren't found by the Google crawler. Blog posts with /CachedYou at the end and blog posts with /www.example.com attached at the end weren't found by the Google crawler either. My question is: how do I correct those Not Found errors? Thanks

Redirect Google crawler to different robots.txt via .htaccess

- by user3474818

I have googled for the answer all day and still couldn't find one. I have a virtual subdomain www.static.example.com which is a mirror site of www.example.com. This means I have just one root folder for the subdomain and the domain as well. I want to redirect crawlers to a different robots.txt file, robots_static.txt, when they see .static in the URL, in which I will forbid indexing via the disallow command. I want to do this because I have duplicated content in Google search results. The subdomain is showing the exact same content as the main domain. Does anyone know how I could make crawlers see robots_static.txt instead of robots.txt? What I have managed to find so far is this:

    RewriteCond %{HTTP_HOST} ^www.static.*$ [NC]
    RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*robots\.txt.*\ HTTP/ [NC]
    RewriteRule ^robots\.txt /robots_static.txt [NC,L]

but when I check in Webmaster Tools, it still sees robots.txt as my robots file instead of robots_static.txt, so it crawls and indexes everything twice. What did I do wrong? Thanks

EDIT: This is my .htaccess file

    ##
    # @package Joomla
    # @copyright Copyright (C) 2005 - 2013 Open Source Matters. All rights reserved.
    # @license GNU General Public License version 2 or later; see LICENSE.txt
    ##

    ##
    # READ THIS COMPLETELY IF YOU CHOOSE TO USE THIS FILE!
    #
    # The line just below this section: 'Options +FollowSymLinks' may cause problems
    # with some server configurations. It is required for use of mod_rewrite, but may already
    # be set by your server administrator in a way that dissallows changing it in
    # your .htaccess file. If using it causes your server to error out, comment it out (add # to
    # beginning of line), reload your site in your browser and test your sef url's. If they work,
    # it has been set by your server administrator and you do not need it set here.
    ##

    ## Can be commented out if causes errors, see notes above.
    Options +FollowSymLinks

    ## Mod_rewrite in use.

    RewriteEngine On

    RewriteEngine On
    RewriteCond %{HTTP_HOST} !^www\.
    RewriteRule ^(.*)$ http://www.%{HTTP_HOST}/$1 [R=301,L]

    RewriteCond %{HTTP_HOST} ^www.static.*$ [NC]
    RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*robots\.txt.*\ HTTP/ [NC]
    RewriteRule ^robots\.txt /robots_static.txt [NC,L]

    ## Begin - Rewrite rules to block out some common exploits.
    # If you experience problems on your site block out the operations listed below
    # This attempts to block the most common type of exploit `attempts` to Joomla!
    #
    # Block out any script trying to base64_encode data within the URL.
    RewriteCond %{QUERY_STRING} base64_encode[^(]*\([^)]*\) [OR]
    # Block out any script that includes a <script> tag in URL.
    RewriteCond %{QUERY_STRING} (<|%3C)([^s]*s)+cript.*(>|%3E) [NC,OR]
    # Block out any script trying to set a PHP GLOBALS variable via URL.
    RewriteCond %{QUERY_STRING} GLOBALS(=|\[|\%[0-9A-Z]{0,2}) [OR]
    # Block out any script trying to modify a _REQUEST variable via URL.
    RewriteCond %{QUERY_STRING} _REQUEST(=|\[|\%[0-9A-Z]{0,2})
    # Return 403 Forbidden header and show the content of the root homepage
    RewriteRule .* index.php [F]
    #
    ## End - Rewrite rules to block out some common exploits.

    ## Begin - Custom redirects
    #
    # If you need to redirect some pages, or set a canonical non-www to
    # www redirect (or vice versa), place that code here. Ensure those
    # redirects use the correct RewriteRule syntax and the [R=301,L] flags.
    #
    ## End - Custom redirects

    ##
    # Uncomment following line if your webserver's URL
    # is not directly related to physical file paths.
    # Update Your Joomla! Directory (just / for root).
    ##

    # RewriteBase /

    RewriteCond %{THE_REQUEST} ^GET.*index\.php [NC]
    RewriteCond %{THE_REQUEST} !/system/.*
    RewriteRule (.*?)index\.php/*(.*) /$1$2 [R=301,L]
    RewriteCond %{THE_REQUEST} ^GET

    ## Begin - Joomla! core SEF Section.
    #
    RewriteRule .* - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization}]
    #
    # If the requested path and file is not /index.php and the request
    # has not already been internally rewritten to the index.php script
    RewriteCond %{REQUEST_URI} !^/index\.php
    # and the request is for something within the component folder,
    # or for the site root, or for an extensionless URL, or the
    # requested URL ends with one of the listed extensions
    RewriteCond %{REQUEST_URI} /component/|(/[^.]*|\.(php|html?|feed|pdf|vcf|raw))$ [NC]
    # and the requested path and file doesn't directly match a physical file
    RewriteCond %{REQUEST_FILENAME} !-f
    # and the requested path and file doesn't directly match a physical folder
    RewriteCond %{REQUEST_FILENAME} !-d
    # internally rewrite the request to the index.php script
    RewriteRule .* index.php [L]
    #
    ## End - Joomla! core SEF Section.

    <FilesMatch "\.(ico|pdf|flv|jpg|ttf|jpg|jpeg|png|gif|js|css|swf)$">
    Header set Expires "Wed, 15 Apr 2020 20:00:00 GMT"
    Header set Cache-Control "public"
    </FilesMatch>

    <ifModule mod_headers.c>
    Header set Connection keep-alive
    </ifModule>

    ########## Begin - Remove Etags
    #
    FileETag none
    #
    ########## End - Remove Etags

hosting company blocking google bots and crawlers [closed]

- by Jayapal Chandran

Hi, I have had a site for the past three years, and it has been very active for the past two. Until now the site was working well, and it still works, but not since the hosting company blocked Google's bots. Many pages used to appear on the first page of Google search. After they started blocking, I couldn't see my links on the first page; instead they appeared after 5 pages or did not appear at all. Would hosting companies really be so stupid as to block the bots and not mention it to their users? They want to protect themselves by putting the websites at stake. I display Google ads, and this month I have earned only half as much for these 10 days. I have made requests to other hosting companies like blue host and monster host, saying that I want to transfer my domain on the condition that they will not block Google's bots, which indirectly stops the business. So any kind of help would be welcome. How can I claim what I lost from the hosting company? Which other hosting companies are considerate of their users (by informing them of events like changing the IP or blocking Googlebot)? It was really hard work to bring up my site, but these people crashed it in a few days. :-(

Firewall - Preventing Content Theft & Rogue Crawlers

- by drodecker

Our websites are being crawled by content thieves on a regular basis. We obviously want to let through the nice bots and legitimate user activity, but block questionable activity. We have tried IP blocking at our firewall, but the block lists become difficult to manage. We have also used IIS handlers, but that complicates our web applications. Is anyone familiar with network appliances, firewalls or application services (say for IIS) that can reduce or eliminate the content scrapers?

How to best develop web crawlers

- by Fernando Barrocal

Heyall, I often create crawlers to compile information, and when I come to a website whose info I need, I start a new crawler specific to that site, using shell scripts most of the time and sometimes PHP. The way I do it is with a simple for loop to iterate over the page list, wget to download each page, and sed, tr, awk or other utilities to clean the page and grab the specific info I need. The whole process takes some time depending on the site, and more to download all the pages. And I often run into an AJAX site that complicates everything. I was wondering if there are better ways to do that, faster ways, or even some applications or languages to help with such work.

Can the .htaccess file slow down a website to a crawl? If so, are there better ways to solve these problems with different rewrite rules and such?

- by Parimal

here is my htaccess file...... RewriteCond %{REQUEST_URI} ^/patients/billing/FAQ_billing\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/billing/getintouch\.html$ RewriteRule ^patients/billing/(.*)\.html$ $1.php [L,NC] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/a\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/b\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/c\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/d\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/e\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/f\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/g\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/h\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/i\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/j\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/k\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/l\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/m\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/n\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/o\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/p\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/q\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/r\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/s\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/t\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/u\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/v\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/w\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/x\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/y\.html$ [OR] RewriteCond %{REQUEST_URI} ^/patients/findadoctor/z\.html$ RewriteRule ^patients/findadoctor/(.*)\.html$ findadoctor.php?id=$1 [L,NC] like that there is lots of rules around 250 
line please help me...
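The 26 findadoctor conditions above each test a single letter, so they can plausibly be collapsed into one pattern with a character class, `([a-z])`. The sketch below uses Python's `re` module only to check that the single pattern accepts exactly the same paths as the 26 originals; the names are hypothetical and this is not a test of mod_rewrite itself. In Apache the equivalent would presumably be a single RewriteRule using `([a-z])` in place of `(.*)`, with no RewriteCond lines and without the NC flag if the original case-sensitive matching is to be preserved.

```python
import re
import string

# One pattern with a character class replacing the 26 alternatives.
SINGLE = re.compile(r"^/patients/findadoctor/([a-z])\.html$")

# The 26 original conditions, one regex per letter.
ORIGINALS = [
    re.compile(rf"^/patients/findadoctor/{letter}\.html$")
    for letter in string.ascii_lowercase
]

def matches_original(path):
    """True if any of the 26 original conditions would match."""
    return any(pattern.match(path) for pattern in ORIGINALS)

def rewrite(path):
    """Return the rewritten target, or None if the path is left alone."""
    match = SINGLE.match(path)
    return f"findadoctor.php?id={match.group(1)}" if match else None

print(rewrite("/patients/findadoctor/q.html"))
```

Fewer conditions mean less regex work per request, though with about 250 lines the larger cost may be that .htaccess files are re-read on every request rather than the patterns themselves.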

suspicious crawler activity

- by ithkuil

I'm noticing that I get accesses like these:

    66.249.66.198 - - [01/Jul/2011:17:13:46 +0200] "GET /img/clip.incubus.torrent.phtml HTTP/1.1" 404 143 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    66.249.66.198 - - [01/Jul/2011:17:13:48 +0200] "GET /img/clip.global.deejays.download.phtml HTTP/1.1" 404 143 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Those files don't exist, and there is no file on my site that has this content (I hope). Why is Googlebot trying out these links? Reverse DNS and whois state that 66.249.66.198 really is Googlebot.
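The reverse DNS check mentioned above is usually done as a forward-confirmed lookup: resolve the IP to a hostname, check that the hostname is under googlebot.com or google.com, then resolve that hostname back and confirm it returns the same IP. A minimal sketch using Python's `socket` module follows; the function name and the injectable `reverse`/`forward` parameters are assumptions added so the logic can be exercised without network access.

```python
import socket

def is_verified_googlebot(ip, reverse=None, forward=None):
    """Forward-confirmed reverse DNS check for a claimed Googlebot IP."""
    # Default resolvers use the system DNS; tests can pass in fakes.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip)
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # The hostname must resolve back to the original IP.
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

A user agent string alone proves nothing, since anyone can send "Googlebot/2.1"; the lookup is what distinguishes the real crawler from an impostor.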

Screen resolution of Googlebot mobile?

- by Baumr

Does Googlebot-Mobile have a viewport resolution it sends across? If so, what is it? It's a general question with broad relevance, but I am asking with reference to responsive design: particularly when serving different image resolutions to different viewports via JavaScript. While Googlebot has its issues with JavaScript, it will become better with time. Thus, it would be good to know which version of the same image would be crawled (since most responsive image JS solutions base their logic on resolution).

Feature phones Googlebot-Mobile:

    SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
    DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

Smartphone Googlebot-Mobile:

    Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

Is hiding content with JavaScript or "text-indent: -9999px" bad for SEO?

- by Samuel

So apparently hiding content using "display: none" is bad for SEO and seen by googlebot as being deceptive. This is according to a lot of the posts I read online, and even questions on this site. But what if I hide keyword-rich text using JavaScript? A jQuery example:

    $(function() {
        $('#keywordRichTextContainer').hide();
    });

or using visibility hidden:

    $(function() {
        $('#keywordRichTextContainer').css({
            visibility: 'hidden',
            position: 'absolute'
        });
    });

Would any of these techniques cause my site to be penalized? If googlebot can't read JavaScript, then if I'm hiding through JS it shouldn't know, right? What about using "text-indent: -9999px"?

Google Webmaster Tools reports fake 404 errors

- by Edgar Quintero

I have a website where Google Webmaster Tools reports 15,000 links as 404 errors. However, all links return a 200 when I visit them. The problem is that, even though I can visit these pages and get a 200, none of those 15,000 pages will index in Google. They aren't appearing in search results. These are constant errors that Google Webmaster Tools keeps reporting and I'm not sure what the problem is. We've thought of a DNS issue, but it shouldn't be a DNS issue, because if it were, no page would be indexed (I have 10,000 perfectly indexed). Regarding URL parameters, my pages do not share a similarity in URL parameters that would make it obvious to me what could be causing the error.

Why deny access to website for msnbot/bingbot?

- by Quandary

I've seen quite a lot of tutorials that recommend banning user agents containing the strings libwww-perl and msnbot. I understand why one would ban libwww-perl: it's mainly, if not only, used for hacking and spamming. But why are there so many sites recommending a ban on msnbot/bingbot? Since it's a search engine, even if only with a marginal market share, I would expect one would want this bot to crawl one's sites. What is it that msnbot does that makes people ban it?
