Search Results

Search found 499 results on 20 pages for 'robots'.

Page 2/20 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12  | Next Page >

  • Valid robots.txt? [closed]

    - by psot
    I am waiting for Google to crawl my site and display the results in search. Is my robots.txt all right, and will it let Google, Bing, etc. crawl my site? Thanks!

        User-agent: *
        Disallow: /cgi-bin/
        Disallow: /wp-admin/
        Disallow: /wp-includes/
        Disallow: /wp-content/
        Disallow: /build/
        Disallow: /css/
        Disallow: /trackback/
        Disallow: /comments
        Disallow: /assets/graphics/
        Disallow: /assets/visual/
        Disallow: /category/*/*
        Disallow: */trackback
        Disallow: */feed
        Disallow: */comments
        Disallow: /*?*
        Disallow: /*?

        User-agent: Slurp
        Disallow: /

        User-agent: Baiduspider
        Disallow: /

        User-agent: ia_archiver
        Disallow: /

        User-agent: duggmirror
        Disallow: /

        User-agent: Yandex
        Disallow: /

        Sitemap: http://example.com/sitemap.xml.gz
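A quick way to sanity-check a file like the one in this question is to run it through a parser. The sketch below uses Python's standard `urllib.robotparser` on an abbreviated copy of the rules; the helper name `can_crawl` and the test URLs are illustrative assumptions. The standard-library parser only does plain prefix matching, so the wildcard lines (`/category/*/*`, `/*?*`) are left out and would need a wildcard-aware checker.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the rules from the question (wildcard lines omitted,
# because urllib.robotparser only performs prefix matching).
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /comments

User-agent: Yandex
Disallow: /

Sitemap: http://example.com/sitemap.xml.gz
"""


def can_crawl(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("Googlebot", "http://example.com/"),
        ("Googlebot", "http://example.com/wp-content/uploads/logo.png"),
        ("bingbot", "http://example.com/about/"),
        ("Yandex", "http://example.com/"),
    ]:
        print(agent, url, can_crawl(agent, url))
```

Under these rules the home page stays crawlable for Googlebot and Bingbot, while everything under `/wp-content/` is blocked, which may matter if themes, scripts, or uploaded images live there.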

    Read the article

  • Googlebot fetches my pages very frequently: rel-nofollow, meta-noindex or robots.txt-disallow?

    - by trante
    Googlebot fetches pages on my site very frequently, and this slows down my website. I don't want Googlebot to crawl so often. I have already decreased the crawl rate in Google Webmaster Tools, but I'm also considering these three methods: (1) adding rel="nofollow" to my inner pages, so Googlebot won't crawl and index them; (2) adding a "noindex" meta tag, so Google will remove the page from its index and won't fetch it again; (3) adding Disallow: /mySomeFolder/ to robots.txt, so Googlebot won't crawl those pages. I'm planning to use these methods for my 56,000 pages, except the 6-7 most important ones. Which method would you prefer, and what would be the advantages or disadvantages? Or won't it change my website's speed at all?

    Read the article

  • Grapeshot crawler ignoring robots.txt

    - by QF_Developer
    Has anyone come across a crawler called Grapeshot? They are hammering the same page repeatedly on our website. I believe they are looking for ad-related keywords, based on previous content ad campaigns. The odd thing is we never ran any such campaigns on the page they are so interested in. We do have only a few pages running AdSense; is this what has attracted Grapeshot? I've added the following declaration to my robots.txt, but they don't seem to be honouring it: User-agent: grapeshot Disallow: / Any ideas on how to block this nuisance crawler? I'm starting to think the best way is to set up IP rules in IIS.

    Read the article

  • Robots meta tag with "noimageindex"

    - by jimy
    I have a question about the noimageindex value in the robots meta tag. Suppose I add this tag to a page, say http://www.example.com/somepage/someaction.php, and the images on that page are served from another site, say http://www.exampleimg.com. Will the tag have any effect? That is, will the images be ignored by bots, or is exampleimg not affected by the tag, so that all the images will be indexed? Note: We want to stop indexing of the images on that particular page.

    Read the article

  • robots.txt disallow URL containing string with a '/' at the end

    - by thanili
    I have a website with thousands of dynamic pages. I want to use the robots.txt file to disallow certain URL patterns corresponding to pages with duplicate content. For example, I have a page for article itemA belonging to category catA/subcatA, with the URL /catA/subcatA/itemA. This is the URL that I want Google to index. The article is also reachable via tagging in various other places on the website. The URLs produced via tagging look like /tagA1/itemA, and I do NOT want Google to index these. However, I do want all tag listings, such as /tagA1, to be indexed. How can I achieve this? Can I disallow URLs that include a specific string with a '/' at the end? That is: /tagA1/itemA - disallow; /tagA1 - allow.
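Because robots.txt rules are matched as path prefixes, a rule ending in `/` already separates the two cases in this question. The sketch below checks that with Python's standard `urllib.robotparser`; the host `example.com` and the helper name `allowed` are assumptions for illustration, and one such rule would be needed per tag unless the tags share a common prefix.

```python
from urllib.robotparser import RobotFileParser

# One prefix rule per tag: the trailing slash is what spares the listing page.
RULES = [
    "User-agent: *",
    "Disallow: /tagA1/",
]


def allowed(path):
    """Return True if a generic crawler may fetch `path` under RULES."""
    parser = RobotFileParser()
    parser.parse(RULES)
    return parser.can_fetch("Googlebot", "http://example.com" + path)


if __name__ == "__main__":
    for path in ["/tagA1", "/tagA1/itemA", "/catA/subcatA/itemA"]:
        print(path, allowed(path))
```

Note that the listing is only spared when it is served without a trailing slash; `/tagA1/` itself also starts with the disallowed prefix.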

    Read the article

  • Google Webmaster Tools robots test not working

    - by tracy_snap
    Within Webmaster Tools I have supplied my test content: User-agent: * Disallow:/admin/ Disallow: /tag/ When I specify the URL to test against, for example: http://www.site.com/tag/ It gives me this result: "Allowed: Detected as a directory; specific files may have different restrictions" As far as I know I have set this up correctly, shouldn't Google be saying that the /tag/ directory is "disallowed"?
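As an independent cross-check, the same test content can be fed to Python's standard `urllib.robotparser`. This is only a sketch of how one prefix-matching parser reads the rules, not a statement about how the Webmaster Tools tester works; the URL `/admin/login` is a made-up example.

```python
from urllib.robotparser import RobotFileParser

# Test content exactly as given in the question, including the
# missing space in "Disallow:/admin/".
TEST_CONTENT = [
    "User-agent: *",
    "Disallow:/admin/",
    "Disallow: /tag/",
]

parser = RobotFileParser()
parser.parse(TEST_CONTENT)

tag_allowed = parser.can_fetch("Googlebot", "http://www.site.com/tag/")
admin_allowed = parser.can_fetch("Googlebot", "http://www.site.com/admin/login")

print("tag allowed:", tag_allowed)
print("admin allowed:", admin_allowed)
```

This parser treats both paths as disallowed, so the rules themselves look consistent with what the asker expects.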

    Read the article

  • Stop bots from crawling old links with extensions

    - by Jared
    I've recently switched to MVC3, which uses extension-less URLs, but Google and Bing have a wealth of links that they are crawling which no longer exist. So I'm trying to find out if there is a way to format robots.txt (or some other method) to tell Google/Bing that any link that ends in an extension isn't a valid link. Is this possible? On pages that I'm concerned a user may have saved as a favorite, I'm displaying a 404 page that lists the links to take once they are redirected to the new page (I decided not to just redirect them, as I don't want to maintain these forever). For Google's and Bing's sake I do have the canonical tag in the header. User-agent: * Allow: / Disallow: /*.* EDIT: I just added the third line (in the text above) and it APPEARS to do what I want: allow a path, but disallow a file. Can anyone confirm this?
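To see what `Allow: /` plus `Disallow: /*.*` does under wildcard matching, here is a minimal sketch of the "longest matching rule wins, ties go to Allow" evaluation described in RFC 9309. The functions `pattern_to_regex` and `is_allowed` are written for this illustration and are not any crawler's actual code; real crawlers may differ in details.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, optional '$' end anchor)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """rules: list of (allow: bool, pattern: str). Longest match wins; ties favour Allow."""
    best = None  # (pattern length, allow flag)
    for allow, pattern in rules:
        if pattern and pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), allow)
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


# The two rules from the question.
RULES = [(True, "/"), (False, "/*.*")]

if __name__ == "__main__":
    for path in ["/products/list", "/old/page.aspx", "/css/site.css"]:
        print(path, is_allowed(RULES, path))
```

Under this reading, extension-less paths stay allowed and anything containing a dot is blocked, which would also cover stylesheets, scripts, and images.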

    Read the article

  • Googlebot can't access my site; Webmaster Tools replies "Unreachable robots.txt"

    - by Ahmad Ahmadi
    When I try to fetch my site as Googlebot in Webmaster Tools, it returns "Unreachable robots.txt". After investigating, I found that Googlebot can reach my server: tcpdump | grep google shows that Google connects to my server from IP 66.249.81.172 or 66.249.75.111. But there is nothing in the access log, the error log, or the other Apache logs: cat access_log | grep google or cat error_log | grep 66.249.81.172 Other bots (Bing, ...) can access Apache, but Google can't. There is no problem with my robots.txt or its permissions: since robots.txt is not required, I deleted it, but Webmaster Tools again returned "Unreachable robots.txt", not a 404 Not Found! Information about the server: Server OS: CentOS 6; Web Server: Apache 2.x; Firewall: iptables is stopped; SELinux is disabled. There is nothing else for security on my server. How can I investigate the problem, and is there any other command that can help me find it?
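The `grep` checks above can be made a little more precise by parsing the access log and listing only `/robots.txt` requests from the address range seen in `tcpdump`. This is a sketch that assumes Apache's common/combined log format; the function name `robots_requests` and the sample log lines are invented for illustration, and only the `66.249.` prefix comes from the question.

```python
import re

# Matches the start of a common/combined log format line:
# client IP, two ignored fields, timestamp, request line, status code.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3})')


def robots_requests(lines, ip_prefix="66.249."):
    """Return (ip, status) for every /robots.txt request from the given IP prefix."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        ip, _method, target, status = match.groups()
        if ip.startswith(ip_prefix) and target.split("?")[0] == "/robots.txt":
            hits.append((ip, int(status)))
    return hits


# Hypothetical sample lines, not taken from the asker's server.
SAMPLE = [
    '66.249.81.172 - - [10/Oct/2013:13:55:36 +0000] "GET /robots.txt HTTP/1.1" 200 120',
    '157.55.39.1 - - [10/Oct/2013:13:56:01 +0000] "GET /robots.txt HTTP/1.1" 200 120',
    '66.249.75.111 - - [10/Oct/2013:13:57:12 +0000] "GET /index.html HTTP/1.1" 200 5120',
]

if __name__ == "__main__":
    print(robots_requests(SAMPLE))
```

An empty result on the real log would be consistent with the description in the question: packets arrive at the machine, but the requests never reach Apache.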

    Read the article

  • apache robots.txt with SSL

    - by user224013
    I have an .htaccess file with a rewrite rule that redirects every HTTP request to HTTPS. But now I have a problem: my robots.txt is not recognized by some online checker. If I remove the redirect from the .htaccess file, the robots.txt is recognized correctly. Maybe I should exclude robots.txt from the redirect to HTTPS? This is the part of .htaccess that redirects to HTTPS: RewriteCond %{SERVER_PORT} !^443$ RewriteRule (.*) https://%{HTTP_HOST}/$1 [L]

    Read the article

  • Is there a way to disallow only crawling in https in robots.txt?

    - by David Wilkins
    I just realized that Bingbot is crawling my company's website's pages over https. Bing already crawls the site over http, so this seems frivolous. Is there a way to specify Disallow: / for https only? According to Wikipedia, each protocol has its own robots.txt, and according to Google's robots.txt specification, the robots.txt applies to http AND https. I don't want to Disallow: / for Bing totally, just over https.
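If crawlers request `/robots.txt` separately for each scheme, one way to get an https-only `Disallow: /` is to serve a different file body depending on how the request arrived. The sketch below does this as a tiny WSGI application; the name `app` and the two rule sets are assumptions for illustration, and behind a TLS-terminating proxy the scheme would have to come from a forwarded header instead of `wsgi.url_scheme`.

```python
# Rules served over plain http: nothing is disallowed.
HTTP_ROBOTS = "User-agent: *\nDisallow:\n"
# Rules served over https: everything is disallowed.
HTTPS_ROBOTS = "User-agent: *\nDisallow: /\n"


def app(environ, start_response):
    """Minimal WSGI app that answers /robots.txt differently per URL scheme."""
    if environ.get("PATH_INFO") != "/robots.txt":
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    if environ.get("wsgi.url_scheme") == "https":
        body = HTTPS_ROBOTS
    else:
        body = HTTP_ROBOTS
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [body.encode("utf-8")]
```

The same idea can be expressed as a server rewrite rule that maps `/robots.txt` to a second file when the connection is secure; the WSGI form is used here only because it is easy to run and test.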

    Read the article

  • Strange robots.txt - how and why did it get there?

    - by Mick
    I recently created a very simple, pure HTML website which I have hosted with "hostmonster". Hostmonster had very good reviews on some comparison website, and so far they appeared to be perfectly good in every way. At least I thought so until just now. I have been making lots of edits to my site on an almost daily basis. My site now appears on the first page (7th on the list) for my most important keyphrase in a Google search. But I did notice a problem with the snippet chosen by Google. I asked a question on this site about snippets and got some great answers. I then made some modifications to my metadata, and within 48 hours the Google snippet for my search was perfect. The odd thing, though, was that the "cached" version Google had still appeared to be very old, about three weeks. How could the Google robots have read my new metadata without updating the cache? This puzzled me greatly. Just now it occurred to me that maybe I had some goofy setting in my robots.txt file. I didn't remember even making one, but I thought I'd have a look just in case. Much to my horror, I saw that there was a robots.txt and it contained the disturbing text below: sitemap: http://cdn.attracta.com/sitemap/728687.xml.gz Intuitively this looks like some kind of junk or spam trick, and I had indeed been getting some spam from "attracta". So my questions are: 1. Should I simply delete this robots.txt? 2. Was the file there all along, placed there because of some commercial tie-in between attracta and hostmonster? 3. Does the attracta robots file explain the lack of re-caching?

    Read the article

  • Disallow robots.txt from being accessed in a browser but still accessible by spiders?

    - by Michael Irigoyen
    We make use of the robots.txt file to prevent Google (and other search spiders) from crawling certain pages/directories in our domain. Some of these directories/files are secret, meaning they aren't linked (except perhaps on other pages covered by the robots.txt file). Some of these directories/files aren't secret; we just don't want them indexed. If somebody browses directly to www.mydomain.com/robots.txt, they can see the contents of the robots.txt file. From a security standpoint, this is not something we want publicly available to anybody. Any directories that contain secure information are set behind authentication, but we still don't want them to be discoverable unless the user specifically knows about them. Is there a way to provide a robots.txt file but have its presence hidden from John Doe accessing it from his browser? Perhaps by using PHP to generate the document based on certain criteria? Perhaps something I'm not thinking of? We'd prefer a way to do it centrally (meaning a <meta> tag solution is less than ideal).

    Read the article

  • Disadvantages of a fake phpMyAdmin honeypot that causes ip blacklisting and robots.txt disallow/exclusion of the honeypot?

    - by Tchalvak
    I'm trying to figure out whether I should set up a honeypot system with a fake phpMyAdmin (the site gets hits all the time from people spidering for vulnerabilities in that app). My thought was to create a honeypot PHP script that would mimic a phpMyAdmin login, and then blacklist IPs that hit that URL (and aren't already whitelisted). I would then add the appropriate URLs to the robots.txt so that spiders that actually respect my robots.txt wouldn't be caught by the blacklist. Are there disadvantages to this approach? Do legitimate robots sometimes not respect robots.txt in certain circumstances? Are there any problems with this that I should consider in advance?
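The bookkeeping behind this idea can be sketched in a few lines: a set of honeypot paths, a whitelist, a blacklist that grows on hits, and a robots.txt generated from the same path list so the two never drift apart. Everything below (the class name `HoneypotBlacklist`, the path `/phpmyadmin/index.php`, the IP addresses) is hypothetical and only illustrates the logic, not a hardened implementation.

```python
class HoneypotBlacklist:
    """Track IPs that request honeypot paths, sparing whitelisted addresses."""

    def __init__(self, honeypot_paths, whitelist=()):
        self.honeypot_paths = set(honeypot_paths)
        self.whitelist = set(whitelist)
        self.blacklist = set()

    def record_request(self, ip, path):
        """Blacklist `ip` on a honeypot hit; return True if `ip` is now blocked."""
        if path in self.honeypot_paths and ip not in self.whitelist:
            self.blacklist.add(ip)
        return ip in self.blacklist

    def robots_txt(self):
        """Generate Disallow lines so well-behaved crawlers avoid the trap."""
        lines = ["User-agent: *"]
        lines += ["Disallow: " + path for path in sorted(self.honeypot_paths)]
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    trap = HoneypotBlacklist(["/phpmyadmin/index.php"], whitelist=["198.51.100.7"])
    print(trap.record_request("203.0.113.5", "/phpmyadmin/index.php"))
    print(trap.record_request("198.51.100.7", "/phpmyadmin/index.php"))
    print(trap.robots_txt())
```

One design consequence worth noting: listing the honeypot in robots.txt also advertises its path to anyone who reads the file, so hits may come from curious humans as well as scanners.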

    Read the article

  • What dangers await if I block non-standard, non-major-usa search engine bots from my USA only website?

    - by Ryan
    I noticed tons of bandwidth being used by non-USA search engine bots, so I began blocking them in an effort to save bandwidth and cpu cycles for actual users and the search engines they come from (Google, Bing, Yahoo, Ask, etc.). Other than potentially losing some international traffic (which isn't really important to us since all of our content is very USA-centric), what additional dangers should I be concerned about? I'm using a modified version of Jeff Starr's User Agent Blocklist

    Read the article

  • No description for any page on the website is available in Google despite robots.txt allowing crawling

    - by Abhijit
    I seem to have the weirdest issue with Search Engine Optimization, and I asked the IT folks at my university, I asked people on Joomla forums and I am trying to sort this issue out using Google Webmaster Tools for more than 2 months to little avail. I want to know if I have some blatantly wrong configuration somewhere that is causing search engines to be unable to index this site. I noticed a similar issue with another website I searched for online (ECEGSA - The University of British Columbia at gsa.ece.ubc.ca), making me believe this might be a concern that people might be looking an answer for. Here are the details: The website in question is: http://gsa.ece.umd.edu/. It runs using Joomla 2.5.x (latest). The site was up since around mid December of 2013, and I noticed right from the get go that the site was not being indexed correctly on Google. Specifically I see the following message when I search for the website on Google: A description for this result is not available because of this site's robots.txt – learn more. The thing is in December till around March I used the default Joomla robots.txt file which is: User-agent: * Disallow: /administrator/ Disallow: /cache/ Disallow: /cli/ Disallow: /components/ Disallow: /images/ Disallow: /includes/ Disallow: /installation/ Disallow: /language/ Disallow: /libraries/ Disallow: /logs/ Disallow: /media/ Disallow: /modules/ Disallow: /plugins/ Disallow: /templates/ Disallow: /tmp/ Nothing there should stop Google from searching my website. And even more confusingly, when I go to Google Webmaster tools, under "Blocked URLs" tab, when I try many of the links on the site, they are all shown up as "Allowed". I then tried adding a sitemap, putting it in the robots.txt file. That did not help. Same exact search result, same behavior in the "Blocked URLs" tab on the webmaster tools. Now additionally, the "sitemaps" tab says for several links an error saying "URL is robotted out". 
I tried those exact links in the "Blocked URLs" tab and they are allowed! I then tried deleting the robots.txt file. No use. Same exact problem. Here is an example screenshot from Google's Webmaster Tools: At this point I cannot give a rational explanation for why this is happening, and neither can anyone in the IT department here. No one on the Joomla forums seems to understand what is going on. Based on what I explained, does it seem that I have somehow configured a setting incorrectly in the robots.txt, in .htaccess, or somewhere else?

    Read the article

  • Why do Google search results include pages disallowed in robots.txt?

    - by Ilmari Karonen
    I have some pages on my site that I want to keep search engines away from, so I disallowed them in my robots.txt file like this: User-Agent: * Disallow: /email Yet I recently noticed that Google still sometimes returns links to those pages in their search results. Why does this happen, and how can I stop it? Background: Several years ago, I made a simple web site for a club a relative of mine was involved in. They wanted to have e-mail links on their pages, so, to try and keep those e-mail addresses from ending up on too many spam lists, instead of using direct mailto: links I made those links point to a simple redirector / address harvester trap script running on my own site. This script would return either a 301 redirect to the actual mailto: URL, or, if it detected a suspicious access pattern, a page containing lots of random fake e-mail addresses and links to more such pages. To keep legitimate search bots away from the trap, I set up the robots.txt rule shown above, disallowing the entire space of both legit redirector links and trap pages. Just recently, however, one of the people in the club searched Google for their own name and was quite surprised when one of the results on the first page was a link to the redirector script, with a title consisting of their e-mail address followed by my name. Of course, they immediately e-mailed me and wanted to know how to get their address out of Google's index. I was quite surprised too, since I had no idea that Google would index such URLs at all, seemingly in violation of my robots.txt rule. I did manage to submit a removal request to Google, and it seems to have worked, but I'd like to know why and how Google is circumventing my robots.txt like that and how to make sure that none of the disallowed pages will show up in their search results. Ps. 
I actually found out a possible explanation and solution, which I'll post below, while preparing this question, but I thought I'd ask it anyway in case someone else might have the same problem. Please do feel free to post your own answers. I'd also be interested in knowing if other search engines do this too, and whether the same solutions work for them also.
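One commonly documented reason for this behaviour is that `Disallow` stops crawling, not indexing: a URL that is linked from elsewhere can still be listed. Google documents a `noindex` directive, sent as a robots meta tag or an `X-Robots-Tag` response header, for keeping pages out of results, and it is only seen if the crawler is allowed to fetch the URL. The sketch below adds that header for one path prefix as WSGI middleware; the names `noindex_middleware` and `NOINDEX_PREFIXES` are assumptions, and whether lifting the `Disallow` suits the trap setup described above is a separate question.

```python
# Path prefixes whose responses should carry a noindex header.
# "/email" mirrors the Disallow rule in the question.
NOINDEX_PREFIXES = ("/email",)


def noindex_middleware(app, prefixes=NOINDEX_PREFIXES):
    """Wrap a WSGI app so matching paths get 'X-Robots-Tag: noindex'."""

    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start_response(status, headers, exc_info=None):
            if path.startswith(prefixes):
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def demo_app(environ, start_response):
    """Stand-in application used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]
```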

    Read the article

  • How to resolve "Google can't find your site's robots.txt" error?

    - by Manivasagam
    I've recently found "Google can't find your site's robots.txt" in crawl errors. When I tried Fetch as Google, I got the result "SUCCESS", but when I looked at crawl errors again it still showed "Google can't find your site's robots.txt". What can I do to resolve this issue? Before this issue arose, my site was indexed within a few minutes, but now it takes longer to be indexed in Google's search. When I access http://mydomain.com/robots.txt, it shows the data below: User-agent: *Disallow: /wp-admin/ Disallow: /wp-includes/ I found Blocked URLs = 0 and no other errors. Is there anything else I need to change? Or what could be the solution for this? Any help would be appreciated.

    Read the article

  • Cross-submission robots.txt for multiple domains on single host

    - by sidd.darko
    We are running a site with multiple languages hosted in a single environment on IIS7. For example: oursite.com - English; oursite.de - German; oursite.es - Spanish. This is a single-host environment. All of these sites are in the same application space on the same physical machine. I need to do cross-submission of sitemaps via robots.txt. The sitemaps.org guidelines suggest this is possible, but the example there indicates different physical machines. Will the following entries in oursite.com/robots.txt work? http://www.oursite.com/sitemap-oursite-de.xml http://www.oursite.com/sitemap-oursite-es.xml
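As far as the sitemaps.org protocol describes it, cross-submission works by adding a `Sitemap:` line to the robots.txt of the host whose URLs the sitemap lists, pointing at wherever the file is stored. Since all three domains here share one application, the robots.txt body could be generated per requested host. The sketch below assumes that reading of the protocol; the function name `robots_txt_for` and the host keys are illustrative.

```python
# Where the sitemap files are stored (from the question).
SITEMAP_HOST = "http://www.oursite.com"

# Which sitemap belongs to which language domain (assumed mapping).
SITEMAPS = {
    "oursite.de": "/sitemap-oursite-de.xml",
    "oursite.es": "/sitemap-oursite-es.xml",
}


def robots_txt_for(host):
    """Build the robots.txt body to serve when `host` is requested."""
    lines = ["User-agent: *", "Disallow:"]
    sitemap_path = SITEMAPS.get(host)
    if sitemap_path:
        # The Sitemap directive takes a full URL and may point to another host.
        lines.append("Sitemap: " + SITEMAP_HOST + sitemap_path)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for host in ["oursite.com", "oursite.de", "oursite.es"]:
        print("# robots.txt for", host)
        print(robots_txt_for(host))
```

Under this reading, the two entries from the question would belong in the robots.txt answered for oursite.de and oursite.es respectively, each prefixed with `Sitemap:`, rather than in oursite.com/robots.txt.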

    Read the article

  • Should I add a "nofollow" attribute to download links, or disallow the URLs in robots.txt?

    - by Laurent
    I have a download link very similar to Opera's: it's just a script that sends the file. It doesn't have an extension and there's no obvious way to tell that it's actually a download link. Since I don't want robots to crawl this link, do I need to add it to robots.txt, or maybe add a "nofollow" attribute to it? I see that on Opera's website they didn't do either of these, so perhaps it's not necessary?

    Read the article

  • Disallowed images in the robots.txt of my Joomla site can't be displayed when shared in Facebook

    - by opk
    I have noticed that since I have disallowed images using the robots.txt in my Joomla site, when sharing an article in Facebook, the image will not be displayed. Why is that? Is it indeed related? My robots.txt file: User-agent: * Disallow: /administrator/ Disallow: /cache/ Disallow: /cli/ Disallow: /components/ Disallow: /images/ Disallow: /includes/ Disallow: /installation/ Disallow: /language/ Disallow: /libraries/ Disallow: /logs/ Disallow: /media/ Disallow: /modules/ Disallow: /plugins/ Disallow: /templates/ Disallow: /tmp/

    Read the article

  • Google Sitemap and Robots.txt Issue

    - by Sarfaraz Soomro
    Hi, we have a sitemap at our site, http://www.gamezebo.com/sitemap.xml. Some of the URLs in the sitemap are being reported in Webmaster Central as being blocked by our robots.txt (see gamezebo.com/robots.txt), although these URLs are not disallowed in robots.txt. There are other such URLs as well; for example, gamezebo.com/gamelinks is present in our sitemap, but it's being reported as "URL restricted by robots.txt". Also, I have this parse result in Webmaster Central that says, "Line 21: Crawl-delay: 10 Rule ignored by Googlebot". What does it mean? I appreciate your help. Thanks.
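`Crawl-delay` is a non-standard directive that some crawlers honour and others skip, which is what the "Rule ignored by Googlebot" message reports. The sketch below shows how Python's standard `urllib.robotparser` (3.6 or later) exposes it. The three-line sample file is hypothetical, since the real gamezebo.com/robots.txt is not quoted in the question; only the `Crawl-delay: 10` line comes from the reported message.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical sample; only the Crawl-delay line is taken from the question.
SAMPLE = [
    "User-agent: *",
    "Crawl-delay: 10",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(SAMPLE)

# Seconds a crawler that honours the directive would wait between requests.
delay = parser.crawl_delay("Googlebot")
# The directive does not change which URLs are allowed.
gamelinks_allowed = parser.can_fetch("Googlebot", "http://www.gamezebo.com/gamelinks")

print("crawl delay:", delay)
print("gamelinks allowed:", gamelinks_allowed)
```

In other words, an ignored `Crawl-delay` line on its own would not explain URLs being reported as restricted; that would have to come from a `Disallow` rule elsewhere in the file.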

    Read the article

  • How difficult is it to write our own Robots API, similar to G Wave Robots API ? Please read the deta

    - by user169650
    Consider the following entities: a) my own Wave server; b) my own Robots API; c) Tomcat; d) the Google Wave server or any other Wave server. Let us assume that a) and d) interact with one another via the Google Wave federation protocol. Now, I want to write my own Robots API in Java (similar to the Google Wave Robots API) and use it to create robots, which I want to host in entity c) and which may in turn connect to a) to listen for events and respond with operations. Let us assume that a) is already in place, i.e. implemented. Let us also assume that the robot running on Tomcat and entity a) are co-located, so that we do not need to use JSON-RPC for receiving events and sending operations; instead we can use Java interfaces. Now, my questions are: 1. How much effort is it to write my own Robots API to run in a Tomcat container? 2. What are the salient points to take care of? Am I missing some important point here? 3. How can I reuse some of the classes/packages/interfaces (e.g. com.google.wave.api.AbstractRobot, com.google.wave.api.event) with little or no change?

    Read the article
