robots - Page 4 - Developer IT

Asterisk in robots.txt

- by Alexey

I am wondering whether the following will work for Google in robots.txt:

    Disallow: /*.action

I need to exclude all URLs ending with .action. Is this correct?

Meaning of Disallow: /*? in robots.txt

- by hussain

Yahoo's robots.txt contains:

    User-agent: *
    Disallow: /p/
    Disallow: /r/
    Disallow: /*?

What does the last line mean? ("Disallow: /*?")
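The two wildcard patterns asked about above can be explored with a small matcher. This is a minimal sketch, assuming Google-style semantics ('*' matches any run of characters, a trailing '$' anchors the end of the URL, everything else is a prefix match). The function names and sample paths are hypothetical and not taken from any crawler's code.

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' is a wildcard, a trailing '$' anchors the
    end of the URL, and the pattern is otherwise a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then rejoin them with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(pattern: str, path: str) -> bool:
    """Return True if the URL path (with query string) matches the pattern."""
    return robots_pattern_to_regex(pattern).match(path) is not None

# '/*.action' is a prefix match, so it also catches URLs that merely
# contain '.action' followed by something else.
print(is_blocked("/*.action", "/cart/add.action"))
print(is_blocked("/*.action", "/cart/add.action?id=7"))
# '/*.action$' only matches URLs that really end with '.action'.
print(is_blocked("/*.action$", "/cart/add.action?id=7"))
# '/*?' matches any URL that contains a query string.
print(is_blocked("/*?", "/search?q=robots"))
print(is_blocked("/*?", "/search"))
```

Under these assumptions, `Disallow: /*.action` blocks more than URLs ending in .action, and `Disallow: /*?` blocks every URL with a query string. Crawlers that do not implement wildcards may treat both lines differently.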

my robots.txt file in web application

- by Zerotoinfinite

Hi, I am using ASP.NET with C#. To increase the searchability of my site in Google, I searched and found out that I can do it by using robots.txt, but I really don't have any idea how to create it or where I can place my tags like 'asp.net, C#' in the txt file. I also need the necessary steps to include it in my application. Please help. Thanks in advance.

"X-Robots-Tag: noindex" on an HTTP 301 response

- by Peter O.

I understand that a resource with X-Robots-Tag: noindex forces some search engines, including Google, not to index the resource further. I also understand that an HTTP 301 response causes search engines to use the redirected URL instead of the original URL to refer to the resource. But what happens if both "X-Robots-Tag: noindex" and status code 301 occur on the same response? It's likely that the original URL will no longer be indexed, but will that cause the redirected URL to no longer be indexed too? This possibility is not mentioned in the X-Robots-Tag specification.

How can robots beat CAPTCHAs?

- by totymedli

I have a website e-mail form. I use a custom CAPTCHA to prevent spam from robots. Despite this, I still get spam. Why? How do robots beat the CAPTCHA? Do they use some kind of advanced OCR or just get the solution from where it is stored? How can I prevent this? Should I change to another type of CAPTCHA? I am sure the e-mails are coming from the form, because it is sent from my email-sender that serves the form messages. Also the letter style is the same. For the record, I am using PHP + MySQL, but I'm not searching for a solution to this problem. I was interested in the general situation how the robots beat these technologies. I just told this situation as an example, so you can understand better what I'm asking about.

Google indexed page a day before also reflecting in search but today everything vanish

- by ganesh

We had a robots.txt that disallowed all robots while we were in development. We are live now. A day ago we changed robots.txt as per our requirements and submitted the site for indexing using Google Webmaster Tools index status. After this we could see proper results in search, and Google image search was also working as expected. Suddenly, today all of this vanished from Google Search. Now I again see the old result, i.e. the under-construction message. I checked robots.txt in Google Webmaster Tools and it's OK, with no crawling errors. Kindly let me know what exactly happened. How can I report this issue to Google?

mod evasive not working properly on ubuntu 10.04

- by Joe Hopfgartner

I have an Ubuntu 10.04 server where I installed mod_evasive using:

    apt-get install libapache2-mod-evasive

I already tried several configurations; the result stays the same. The blocking does work, but randomly. I tried with low limits and long blocking periods as well as short limits. The behaviour I expect is that I can request pages until either the page or the site limit is reached within the given interval. After that I expect to be blocked until I have made no further request for as long as the block period. However, the actual behaviour is that I can request pages and after a while I get random 403 blocks, which increase and decrease in percentage but are very scattered. This is an output of siege, so you get an idea:

    HTTP/1.1 200 0.09 secs: 75 bytes ==> /robots.txt
    HTTP/1.1 403 0.08 secs: 242 bytes ==> /robots.txt
    HTTP/1.1 200 0.08 secs: 75 bytes ==> /robots.txt
    HTTP/1.1 403 0.08 secs: 242 bytes ==> /robots.txt
    HTTP/1.1 200 0.11 secs: 75 bytes ==> /robots.txt
    HTTP/1.1 403 0.08 secs: 242 bytes ==> /robots.txt
    HTTP/1.1 200 0.08 secs: 75 bytes ==> /robots.txt
    HTTP/1.1 403 0.09 secs: 242 bytes ==> /robots.txt
    HTTP/1.1 200 0.08 secs: 75 bytes ==> /robots.txt
    HTTP/1.1 200 0.09 secs: 75 bytes ==> /robots.txt
    HTTP/1.1 200 0.08 secs: 75 bytes ==> /robots.txt
    HTTP/1.1 200 0.09 secs: 75 bytes ==> /robots.txt
    HTTP/1.1 403 0.08 secs: 242 bytes ==> /robots.txt
    HTTP/1.1 200 0.08 secs: 75 bytes ==> /robots.txt
    HTTP/1.1 403 0.08 secs: 242 bytes ==> /robots.txt
    HTTP/1.1 200 0.10 secs: 75 bytes ==> /robots.txt
    HTTP/1.1 403 0.08 secs: 242 bytes ==> /robots.txt
    HTTP/1.1 200 0.08 secs: 75 bytes ==> /robots.txt
    HTTP/1.1 403 0.09 secs: 242 bytes ==> /robots.txt
    HTTP/1.1 200 0.10 secs: 75 bytes ==> /robots.txt
    HTTP/1.1 403 0.09 secs: 242 bytes ==> /robots.txt
    HTTP/1.1 200 0.09 secs: 75 bytes ==> /robots.txt
    HTTP/1.1 200 0.08 secs: 75 bytes ==> /robots.txt
    HTTP/1.1 200 0.09 secs: 75 bytes ==> /robots.txt
    HTTP/1.1 200 0.08 secs: 75 bytes ==> /robots.txt
    HTTP/1.1 200 0.10 secs: 75 bytes ==> /robots.txt
    HTTP/1.1 200 0.08 secs: 75 bytes ==> /robots.txt

The exact limits in place during this test run were:

    DOSHashTableSize 3097
    DOSPageCount 10
    DOSSiteCount 100
    DOSPageInterval 10
    DOSSiteInterval 10
    DOSBlockingPeriod 120
    DOSLogDir /var/log/mod_evasive
    DOSEmailNotify ***@gmail.com
    DOSWhitelist 127.0.0.1

So I would expect to be blocked for at least 120 seconds after being blocked once. Any ideas about this? I also tried adding my configuration in different places (vhost, server config, directory context) and with or without an IfModule directive. This doesn't change anything.
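To quantify how scattered the blocks are, the siege output can be tallied by status code. This is a minimal sketch that assumes the log lines have the shape shown in the question; the helper names are hypothetical and the sample is only the first four lines of that output.

```python
import re
from collections import Counter

# Matches the status code in lines such as
# "HTTP/1.1 403 0.08 secs: 242 bytes ==> /robots.txt"
STATUS = re.compile(r"HTTP/1\.1 (\d{3}) ")

def tally_statuses(siege_output: str) -> Counter:
    """Count how often each HTTP status code appears in siege output."""
    return Counter(int(code) for code in STATUS.findall(siege_output))

def block_ratio(counts: Counter) -> float:
    """Fraction of requests answered with 403 (0.0 for empty input)."""
    total = sum(counts.values())
    return counts[403] / total if total else 0.0

sample = """\
HTTP/1.1 200 0.09 secs: 75 bytes ==> /robots.txt
HTTP/1.1 403 0.08 secs: 242 bytes ==> /robots.txt
HTTP/1.1 200 0.08 secs: 75 bytes ==> /robots.txt
HTTP/1.1 403 0.08 secs: 242 bytes ==> /robots.txt
"""

counts = tally_statuses(sample)
print(counts[200], counts[403], block_ratio(counts))
```

Running the same tally over successive time windows would show whether the 403 share ever reaches 100% for a full blocking period, which is what the configuration in the question is expected to produce.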

Blocking Just the Parent Domain via robots.txt

- by Bryan Hadaway

Let's say you have a parent domain:

    parent.com

and children subdomains under that parent domain:

    child1.com
    child2.com
    child3.com

Is there a way to use just the following within parent.com:

    User-agent: *
    Disallow: /

Considering each child has their own robots.txt stating:

    User-agent: *
    Allow: /

Or is the parent robots.txt still going to have to make an exception for every single subdomain:

    User-agent: *
    Disallow: /
    Allow: /child1/
    Allow: /child2/
    Allow: /child3/

Obviously this is important and tricky territory SEO wise so I'm looking to learn the definitive and safe, best practice method here to sharpen my skills. Thanks, Bryan
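A robots.txt file applies only to the host it is served from, so each host can be evaluated independently. The sketch below illustrates this with Python's standard `urllib.robotparser`. The hostnames `parent.com` and `child1.parent.com` and the helper names are assumptions made for the example, and the rules are fed in as text rather than fetched over the network.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def make_parser(lines):
    """Build a parser from robots.txt lines without any network access."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser

# Hypothetical hosts, each carrying the rules quoted in the question.
RULES_BY_HOST = {
    "parent.com": make_parser(["User-agent: *", "Disallow: /"]),
    "child1.parent.com": make_parser(["User-agent: *", "Allow: /"]),
}

def can_crawl(url: str, user_agent: str = "*") -> bool:
    """Look up the rules for the URL's own host only."""
    parser = RULES_BY_HOST.get(urlsplit(url).netloc)
    if parser is None:
        return True  # no robots.txt known for this host: assume allowed
    return parser.can_fetch(user_agent, url)

print(can_crawl("http://parent.com/page"))         # parent rules apply
print(can_crawl("http://child1.parent.com/page"))  # child rules apply
```

In this model the parent's `Disallow: /` never reaches the subdomain, so no per-child `Allow` exceptions are needed in the parent file.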

Rewrite for robots.txt and favicon.ico [closed]

- by BHare

I have set up some rules in which subdomains (my users) will default to where I have located the robots.txt, favicon.ico, and crossdomain.xml. Therefore, if a user creates a site, say testing.mywebsite.com, and they don't make their own favicon.ico at testing.mywebsite.com/favicon.ico, then it will use the favicon.ico I have in /misc/favicon.ico. This works perfectly, but it doesn't work for the main website. If you attempt to go to mywebsite.com/favicon.ico it will check if "/" exists, which it does, and then never redirects to /misc/favicon.ico. How can I get it so both instances redirect to /misc/favicon.ico?

    # Set all crossdomain (openpalace file) favorite icons and robots.txt doesnt exist on their
    # side, then redirect to site's just to have something to go on.
    RewriteCond %{REQUEST_URI} crossdomain.xml$
    RewriteCond ^(.+)crossdomain.xml !-f
    RewriteRule ^(.*)$ /misc/crossdomain.xml [L]

    RewriteCond %{REQUEST_URI} favicon.ico$
    RewriteCond ^(.+)favicon.ico !-f
    RewriteRule ^(.*)$ /misc/favicon.ico [L]

    RewriteCond %{REQUEST_URI} robots.txt$
    RewriteCond ^(.+)robots.txt !-f
    RewriteRule ^(.*)$ /misc/robots.txt [L]

Make Google Apps site publicly accessible while disabling crawlers with robots.txt?

- by Joannes Vermorel

I would like to create a publicly accessible Google Apps site (i.e. users do not need to be authenticated to access the content) while maintaining a policy of excluding crawlers and bots with robots.txt. Does anyone know how to do that?

How to test robots.txt in googlebot to find out what is being indexed

- by Amar Jarubula

This question is a continuation of this answer: How to check if googlebot will index a given url? As advised, I went to Webmaster Tools and tested the contents of my robots.txt file. However, this only tells me whether that content is good enough or not. For my scenario I need to test whether URLs matching some disallowed patterns are being indexed or not. For example, I have something like this in my robots.txt:

    disallow:/pattern*

My understanding is that URLs with the word pattern should not be crawled, but how do I test that this pattern is enforced while the website is indexed?

Prevent azure subdomain indexation

- by Leg10n

Let me explain my situation. I have an Azure website (with an azurewebsites.net subdomain) and a custom domain.com, built with ASP.NET MVC. Both are being indexed by Google, but I've noticed the custom domain is being penalized and doesn't show up in results; it only shows when I search for "site:domain.com". I want to remove and block the azurewebsites.net subdomain from Google. I've read the "possible" solutions:

- Adding robots.txt: won't work, because the subdomain and the domain serve the exact same content, so subdomain.azures.net/robots.txt will lead to domain.com/robots.txt, removing the domain as well.
- Adding the tag: the same situation as the previous point.

I'm using a CNAME record to redirect the domain to the subdomain, so I can't redirect to a subdirectory. Do you have any other ideas?
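When one application answers for both hostnames, one possible approach is to generate robots.txt dynamically and choose its body from the request's Host header. The sketch below shows only that decision logic, in Python rather than ASP.NET. The function name is hypothetical, and it assumes the application can see the original Host header of each request.

```python
def robots_txt_for_host(host_header: str) -> str:
    """Return a robots.txt body chosen by the requested hostname.

    Assumption: requests to the Azure-provided name arrive with a Host
    header ending in '.azurewebsites.net', while requests to the custom
    domain carry that domain's name.
    """
    hostname = host_header.split(":")[0].lower()  # drop any port suffix
    if hostname.endswith(".azurewebsites.net"):
        # Keep crawlers away from the duplicate hostname entirely.
        return "User-agent: *\nDisallow: /\n"
    # An empty Disallow value allows everything on the custom domain.
    return "User-agent: *\nDisallow:\n"

print(robots_txt_for_host("mysite.azurewebsites.net"))
print(robots_txt_for_host("domain.com"))
```

The same Host check could drive a noindex header or a redirect to the custom domain instead. Which of these suits a given site depends on its hosting setup.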

How to create robots.txt for a domain that contains international websites in subfolders?

- by aaandre

Hi, I am working on a site that has the following structure:

    site.com/us - us version
    site.com/uk - uk version
    site.com/jp - Japanese version
    etc.

I would like to create a robots.txt that points the local search engines to a localized sitemap page and has them exclude everything else from the local listings. So, google.com (us) will index ONLY site.com/us and take into consideration site.com/us/sitemap.html; google.co.uk will index only site.com/uk and site.com/uk/sitemap.html. Same for the rest of the search engines, including Yahoo, Bing etc. Any idea on how to achieve this? Thank you!

Ranking drop after using reverse proxy for blog subdirectory and robots.txt for old blog subdomain

- by user40387

We have a 3Dcart store and a WordPress blog hosted on a separate server. Originally, we had a CNAME set up to point the blog to http://blog.example.com/. However, in our attempt to boost link-based and traffic-based authority on the main site, we've opted to do a reverse proxy to http://www.example.com/blog/. It’s been about two months since we finished the reverse proxy migration. It appears that everything is technically working as intended, including some robots and sitemap changes; the new URLs are even generating some traffic, as indicated on Google Analytics. While Google has been indexing the new URL locations, they’re ranking very poorly, even for non-competitive, long-tail keywords. Meanwhile, the old subdomain URLs are still ranking mostly as well as they used to (even though they aren’t showing meta titles and descriptions due to being blocked by robots.txt). Our working theory is that Google has an old index of the subdomain URLs, and is considering the new URLs to be duplicate content, since it’s being told not to crawl the subdomain and therefore can’t see the rel canonicals we have in place. To resolve this, we’ve updated the subdomain’s robots.txt to no longer block crawling and indexing. Theoretically, seeing the canonical tag on the subdomain pages will resolve any perceived duplicate content issues. In the meantime, we were wondering if anyone would have any other ideas. We are very concerned that we’ll be losing valuable traffic, as we’re entering our on season at the moment.

<meta name="robots" content="noindex"> in "Fetch as Google"

- by Rodrigo Azevedo

I don't know why, but when I execute "Fetch as Google" it returns:

    HTTP/1.1 200 OK
    Cache-Control: private
    Content-Type: text/html
    Content-Encoding: gzip
    Vary: Accept-Encoding
    Server: Microsoft-IIS/7.5
    Set-Cookie: ASPSESSIONIDQACRADAQ=ECAINNFBMGNDEPAEBKBLOBOP; path=/
    X-Powered-By: ASP.NET
    Date: Wed, 26 Jun 2013 15:18:29 GMT
    Content-Length: 153

    <meta name="robots" content="noindex">

The noindex doesn't exist. Does anybody know what could be wrong?

New features for Robots: Bundled Annotations, Inline Blips, Read-Only Roles

Over the last few releases, we've been rolling out incremental improvements to the robots API, based on the feedback from all of you developers. For those of...

After Caldera.com's Robots.txt is Removed, Some Evidence Surfaces

Groklaw: "Now that SCO has sold off the caldera.com domain name, their previous robots.txt file no longer blocks access to the legacy Caldera web pages on Internet Archive. And what has popped up?"

Android OS Now Used To Drive Real Robots

Robot Reviews: "For those wondering about the propriety of the name "Android" as a mobile device operating system, wonder no more because its real purpose has finally been revealed. It's really an operating system for robots."

Qbo, Based On Linux, To Join Growing Field Of Open Source Robots

OStatic: "Now, one of the more interesting new open source robots is Qbo (shown), a Linux-based robot from the folks at thecorpora.com."

How do I remove a LOT of indexed pages from Google?

- by Thierry

A few weeks ago we figured out that Google had indexed some information we would rather keep confidential, in the form of individual PDF files. Our assumption was that this was a problem with our robots.txt that we had overlooked. Even though we are not sure whether or not this is the case, we are certain that the robots.txt file is in a valid format and is, according to Google's webmaster tools, blocking the files. However, even after this adjustment, which was made weeks ago, Google still has the PDF files indexed, but tells us that further information cannot be provided because of the robots.txt file. As you can hopefully understand, this is unwanted behaviour due to the nature of the documents. I am aware that Google provides a request page for this purpose, but there are a lot of files. Is there an easier way to get Google to remove all of the files from its search engine? If not, is there anything else you could advise us to do besides manually requesting that Google remove every single page? Thanks in advance.

What does robots.txt file do in PHP project?

- by OM The Eternity

What does robots.txt file do in PHP project?

Google-Bot fell in love with my 404-page

- by 32bitfloat

Every day my access log looks something like this:

    66.249.78.140 - - [21/Oct/2013:14:37:00 +0200] "GET /robots.txt HTTP/1.1" 200 112 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    66.249.78.140 - - [21/Oct/2013:14:37:01 +0200] "GET /robots.txt HTTP/1.1" 200 112 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    66.249.78.140 - - [21/Oct/2013:14:37:01 +0200] "GET /vuqffxiyupdh.html HTTP/1.1" 404 1189 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

or this:

    66.249.78.140 - - [20/Oct/2013:09:25:29 +0200] "GET /robots.txt HTTP/1.1" 200 112 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    66.249.75.62 - - [20/Oct/2013:09:25:30 +0200] "GET /robots.txt HTTP/1.1" 200 112 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    66.249.78.140 - - [20/Oct/2013:09:25:30 +0200] "GET /zjtrtxnsh.html HTTP/1.1" 404 1186 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The bot requests robots.txt twice and after that tries to access a file (zjtrtxnsh.html, vuqffxiyupdh.html, ...) which cannot exist and must return a 404 error. The same procedure repeats every day; only the nonexistent HTML filename changes. The content of my robots.txt:

    User-agent: *
    Disallow: /backend
    Sitemap: http://mysitesname.de/sitemap.xml

The sitemap.xml is readable and valid, so there seems to be no reason why the bot should want to force a 404 error. How should I interpret this behaviour? Does it point to a mistake I've made, or should I ignore it?
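To see how regular these probes are, the access log can be filtered for Googlebot requests that ended in a 404. This is a minimal sketch that assumes the combined log format shown above; the regex and function names are hypothetical, and the two sample lines are copied from the question.

```python
import re

# One entry of a combined-format access log, as in the lines above.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_404s(lines):
    """Return (time, path) pairs for Googlebot requests that got a 404."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match["status"] == "404" and "Googlebot" in match["agent"]:
            hits.append((match["time"], match["path"]))
    return hits

sample = [
    '66.249.78.140 - - [21/Oct/2013:14:37:00 +0200] "GET /robots.txt HTTP/1.1" 200 112 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.78.140 - - [21/Oct/2013:14:37:01 +0200] "GET /vuqffxiyupdh.html HTTP/1.1" 404 1189 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

for when, path in googlebot_404s(sample):
    print(when, path)
```

The user-agent string alone does not prove a request came from Google, so a real analysis would also verify the client IP addresses.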

How to submit sitemap when your website has partial https? - Error: "Not in Domain"

- by Ralph N

My website is an e-commerce site that is set up to use http for the item-browsing portion, but https for things like the shopping cart, contact us, etc. (anything that has forms on it). I submitted my website a long time ago to Google Webmaster Tools as http://www.mywebsite.com. I also submitted a sitemap with about 40 links, 8 of which are https. I've noticed that for the longest time, Google Webmaster Tools was reporting that 32 out of the 40 links had been crawled. I tested all the links against my robots.txt and realized that my robots.txt was blocking the https links. Google says those links are "Not In Domain". Is there a way I'm supposed to get around this so that I can have a hybrid-SSL site? I understand the concept that one site is mywebsite.com:80 and the other is mywebsite.com:443, but I'd like to avoid submitting and maintaining 2 separate websites in Google Webmaster Tools.
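Because robots.txt is looked up per scheme, host, and port, the http and https versions of a site are governed by two separate files. The sketch below derives the governing robots.txt URL for any page URL using only the standard library. The function name is hypothetical and the sample paths are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host (with any port); replace path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.mywebsite.com/items/42"))
print(robots_url("https://www.mywebsite.com/cart?step=1"))
```

The two results differ only in scheme, which is consistent with the https links in the question being treated as belonging to a different site than the one registered as http://www.mywebsite.com.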

Weird entry for robots.txt on a Naked Domain in Google Webmaster Tools

- by Metalshark

We own a .co.uk address and use an Internet hosting company that has made mistakes around DNS in the past. Our main site is hosted on www. and their reluctance to allow editing of AAAA records on-line means our naked domain does not resolve. Currently when we attempt to reach the naked version there is no entry for the browser to go to and it displays an unreachable page (nslookup just says Name: name of domain with no further entries such as an IP or Canonical Name). We recently added the relevant TXT records to verify us to view both the www. version and the naked version of the domain in Google Webmaster Tools (in anticipation of the requests to our Internet host coming to fruition). Imagine our shock when double checking the Site configuration Crawler access and finding a (admittedly failing) robots.txt with a dynamically generated HTML page (full of crude pop-up JavaScript) with references to 3 of our most prominent competitors. What could cause this to happen? As we are in the UK I am assuming some DNS server is serving Google bad information. We are going to contact the Internet hosting company to fix our A and AAAA records once and for all, then check that they work in the US (using something like OpenDNS). Should we be doing more though, for instance informing Google (through Webmaster Tools) that we are now aware there is something currently wrong with our naked domain? UPDATE: We have fixed our A records (not AAAA) and that has resolved the issue. But if there are further actions we should take for effectively having a parking page hosted on our active visitor-heavy, SEO-rich domain that advertised our competitors to US visitors, what would they be?

Best way to prevent Google from indexing a directory [duplicate]

- by Gkhan14

I've researched many methods on how to prevent Google and other search engines from crawling a specific directory. The two most popular ones I've seen are:

- Adding it into the robots.txt file: Disallow: /directory/
- Adding a meta tag: <meta name="robots" content="noindex, nofollow">

Which method would work the best? I want this directory to remain "invisible" to search engines so it does not affect any of my site's ranking. In other words, I want this directory to be neutral/invisible and "just there." Which method would be the best to achieve this?

Search Results

Search found 499 results on 20 pages for 'robots'.