Search Results

Search found 4984 results on 200 pages for 'robots txt'.

Page 1/200 | 1 2 3 4 5 6 7 8 9 10 11 12  | Next Page >

  • Robots.txt and "Bad" Robots

    - by Lynda
    I understand robots.txt and its purpose. I have read some people saying that using a Robots.txt gives "bad" robots or robots who do not obey a robots.txt a way to access pages on your site that you do not want accessed. While I am not looking to get into a debate about that I do have a question: If I have a structure like this: /Folder/ /Sub-Folder 1/ /Sub-Folder 2/ (Note: There are no pages within /Folder/ only other folders.) If I Disallow: /Folder/ it will prevent "good" robots from accessing the directory and any contents within the sub-folders. While we know that bad robots will see the /Folder/ will they be able to see and acess the sub-folders and the pages within the subfolders if they are not listed in the robots.txt? (Note: I do not fully understand how robots good or bad crawl a site beyond using a robots.txt and links within the site.)

    Read the article

  • Google Webmaster Central tells me that robots is blocking access to the sitemap

    - by Gaia
    This is my robots.txt User-agent: * Disallow: /wp-admin/ Disallow: /wp-includes/ Sitemap: http://www.mydomain.org/sitemap.xml.gz But Google Webmaster Central tells me that robots is blocking access to the sitemap: We encountered an error while trying to access your Sitemap. Please ensure your Sitemap follows our guidelines and can be accessed at the location you provided and then resubmit: URL restricted by robots.txt I read that Google Webmaster Central caches robots.txt, but the file has been updated more than 10 hours ago.

    Read the article

  • Do or can robots cause considerable performance issues?

    - by Anicho
    So the question in the title is exactly what I am trying to find out. My case is: At work we are in a discussion with team members who seem to think bots will cause us problems relating to performance when running on our services website. Out setup: Lets say I have site www.mysite.co.uk this is a shop window to our online services which sit on www.mysiteonline.co.uk. When people search in google for mysite they see mysiteonline.co.uk as well as mysite.co.uk. Cases against stopping bots crawling: We don't store gb's of data publicly available on the web Most friendly bots, if they were to cause issues would have done so already In our instance the bots can't crawl the site because it requires username & password Stopping bots with robot .txt causes an issue with seo (ref.1) If it was a malicious bot, it would ignore robot.txt or meta tags anyway Ref 1. If we were to block mysiteonline.co.uk from having robots crawl this will affect seo rankings and make it inconvenient for users who actively search for mysite to find mysiteonline. Which we can prove is the case for a good portion of our users.

    Read the article

  • Robots.txt practices with .htaccess redirections (inherits)

    - by Jayhal
    I have a question regarding how to write robots.txt files for many domains and subdomains with redirects in place. We have a hosting account that enacts primary and add-on domains. All of our domains and subdomains, including the primary domain, is redirected via htaccess 301s to their own subdirectories in the primary domain's root directory. I'm confused about how I would write the robots.txt for certain directories. First, I wanted to confirm I am right in understanding that for domains and subdomains, crawlers will look to the directory that acts as that urls root directory for the crawling rules(robots.txt). Also, that a directory will not be affected by a robots.txt present in their parent directory if the directory has its own domain/subdomain, and that url is the one being accessed by crawlers. (Am pretty sure, but I wanted to confirm I didnt have a fundamentally flawed understanding of robots.txt) In the original root directory on the account(where the primary domain was directed before htaccess was put in place) what should the robots.txt contain? When crawlers look to crawl our primary domain, will they look to the original root directory for the robots.txt or will they reference the file contained in the new subdirectory where all the primary domain's site files are located? If so, what should the root's robot.txt include if anything at all. Would I be right to include a simple 'disallow: /' for all agents, and then include more specific robots.txt files in each subdirectory with more specific instructions. Would that affect the crawling of the directory where the primary domain is now redirected? Any help is greatly appreciated, Thanks!

    Read the article

  • how to get this working if exist tasklist=="notepad %DATE%.txt" [closed]

    - by blade19899
    @echo off title Log file creator if exist "%CD%\%DATE%.txt" (msg * "the log of today(%DATE%.txt) Has already been made" goto :notepad) else (goto :start) goto :eof :start echo %date%>>"%date%".txt echo %username% >>"%date%".txt echo.>>"%date%".txt echo.>>"%date%".txt echo ----------------------------------------Reports---------------------------------------- >>"%date%".txt echo.>>"%date%".txt echo.>>"%date%".txt echo.>>"%date%".txt echo.>>"%date%".txt echo.>>"%date%".txt echo.>>"%date%".txt echo.>>"%date%".txt echo.>>"%date%".txt echo ----------------------------------------TO-DO---------------------- >>"%date%".txt echo.>>"%date%".txt echo.>>"%date%".txt :notepad if exist tasklist=="notepad %DATE%.txt" (msg * "the log of today(%DATE%.txt) Has already been made, en is opened") else (goto :start) goto :eof :start start notepad %DATE%.txt This code right now pops up the log of today(%DATE%.txt) Has already been made and after clicking OK it doesn't do anything it should open the msg the log of today(%DATE%.txt) Has already been made, en is opened i have notepad opened. with process explorer it shows notepad and the date.txt my question is how to make it show the the log of today(%DATE%.txt) Has already been made, en is opened box... and perhaps bring notepad to the foreground? ps not sure if this question belongs here. Apologize if it doesn't!

    Read the article

  • Redirect Google crawler to different robots.txt via .htaccess

    - by user3474818
    I have googled for the answer all day and still couldn't find an answer. I have a virtual subdomain www.static.example.com which is a mirror site of www.example.com. It means I have just one root folder for subdomain and domain aswell. I want to redirect crawlers to different robots.txt file - robots_static.txt when they see .static in url in which I will forbid indexing via /disallow command. I want to do this because I have duplicated content in Google search results. Subdomain is showing the exact same content as the main domain. Does anyone know how could I achieve that crawlers sees robots_static.txt instead of robots.txt? What I have managed to find so far is this: RewriteCond %{HTTP_HOST} ^www.static.*$ [NC] RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*robots\.txt.*\ HTTP/ [NC] RewriteRule ^robots\.txt /robots_static.txt [NC,L] but when I check in webmaster tools, it still sees robots.txt as my robots file instead of robots_static.txt, so it crawls and index everything twice. What did I do wrong? Thanks EDIT: This is my .htaccess file ## # @package Joomla # @copyright Copyright (C) 2005 - 2013 Open Source Matters. All rights reserved. # @license GNU General Public License version 2 or later; see LICENSE.txt ## ## # READ THIS COMPLETELY IF YOU CHOOSE TO USE THIS FILE! # # The line just below this section: 'Options +FollowSymLinks' may cause problems # with some server configurations. It is required for use of mod_rewrite, but may already # be set by your server administrator in a way that dissallows changing it in # your .htaccess file. If using it causes your server to error out, comment it out (add # to # beginning of line), reload your site in your browser and test your sef url's. If they work, # it has been set by your server administrator and you do not need it set here. ## ## Can be commented out if causes errors, see notes above. Options +FollowSymLinks ## Mod_rewrite in use. RewriteEngine On RewriteEngine On RewriteCond %{HTTP_HOST} !^www\. RewriteRule ^(.*)$ http://www.%{HTTP_HOST}/$1 [R=301,L] RewriteCond %{HTTP_HOST} ^www.static.*$ [NC] RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*robots\.txt.*\ HTTP/ [NC] RewriteRule ^robots\.txt /robots_static.txt [NC,L] ## Begin - Rewrite rules to block out some common exploits. # If you experience problems on your site block out the operations listed below # This attempts to block the most common type of exploit `attempts` to Joomla! # # Block out any script trying to base64_encode data within the URL. RewriteCond %{QUERY_STRING} base64_encode[^(]*\([^)]*\) [OR] # Block out any script that includes a <script> tag in URL. RewriteCond %{QUERY_STRING} (<|%3C)([^s]*s)+cript.*(>|%3E) [NC,OR] # Block out any script trying to set a PHP GLOBALS variable via URL. RewriteCond %{QUERY_STRING} GLOBALS(=|\[|\%[0-9A-Z]{0,2}) [OR] # Block out any script trying to modify a _REQUEST variable via URL. RewriteCond %{QUERY_STRING} _REQUEST(=|\[|\%[0-9A-Z]{0,2}) # Return 403 Forbidden header and show the content of the root homepage RewriteRule .* index.php [F] # ## End - Rewrite rules to block out some common exploits. ## Begin - Custom redirects # # If you need to redirect some pages, or set a canonical non-www to # www redirect (or vice versa), place that code here. Ensure those # redirects use the correct RewriteRule syntax and the [R=301,L] flags. # ## End - Custom redirects ## # Uncomment following line if your webserver's URL # is not directly related to physical file paths. # Update Your Joomla! Directory (just / for root). ## # RewriteBase / RewriteCond %{THE_REQUEST} ^GET.*index\.php [NC] RewriteCond %{THE_REQUEST} !/system/.* RewriteRule (.*?)index\.php/*(.*) /$1$2 [R=301,L] RewriteCond %{THE_REQUEST} ^GET ## Begin - Joomla! core SEF Section. # RewriteRule .* - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization}] # # If the requested path and file is not /index.php and the request # has not already been internally rewritten to the index.php script RewriteCond %{REQUEST_URI} !^/index\.php # and the request is for something within the component folder, # or for the site root, or for an extensionless URL, or the # requested URL ends with one of the listed extensions RewriteCond %{REQUEST_URI} /component/|(/[^.]*|\.(php|html?|feed|pdf|vcf|raw))$ [NC] # and the requested path and file doesn't directly match a physical file RewriteCond %{REQUEST_FILENAME} !-f # and the requested path and file doesn't directly match a physical folder RewriteCond %{REQUEST_FILENAME} !-d # internally rewrite the request to the index.php script RewriteRule .* index.php [L] # ## End - Joomla! core SEF Section. <FilesMatch "\.(ico|pdf|flv|jpg|ttf|jpg|jpeg|png|gif|js|css|swf)$"> Header set Expires "Wed, 15 Apr 2020 20:00:00 GMT" Header set Cache-Control "public" </FilesMatch> <ifModule mod_headers.c> Header set Connection keep-alive </ifModule> ########## Begin - Remove Etags # FileETag none # ########## End - Remove Etags

    Read the article

  • Different robots.txt for two different domains point to same folder

    - by Ali
    Hi, I have the following two domains: domain.com test.domain.com Both point to same folder which is "public_html". What I want is a different robots.txt file for each domain. So when someone browse domain.com/robots.txt then a different file is shown. And when someone go to test.domain.com/robots.txt then a different file is shown. How can I do this using URL rewriting in .htacces? Thanks

    Read the article

  • How to remove old robots.txt from google as old file block the whole site

    - by KnowledgeSeeker
    I have a website which still shows old robots.txt in the google webmaster tools. User-agent: * Disallow: / Which is blocking Googlebot. I have removed old file updated new robots.txt file with almost full access & uploaded it yesterday but it is still showing me the old version of robots.txt Latest updated copy contents are below User-agent: * Disallow: /flipbook/ Disallow: /SliderImage/ Disallow: /UserControls/ Disallow: /Scripts/ Disallow: /PDF/ Disallow: /dropdown/ I submitted request to remove this file using Google webmaster tools but my request was denied I would appreciate if someone can tell me how i can clear it from the google cache and make google read the latest version of robots.txt file.

    Read the article

  • Quick Question, robots.txt Disallow: /*/ does what exactly?

    - by Exit
    A SEO firm suggested changing the robots.txt to: User-agent: * Disallow: /*/ Allow: /ims/ I'm not sure what that would do, but my guess is that is would tell all robots to index nothing but the ims folder. I understand the wildcard, but I'm confused by the slashes and don't know how they would play out in conjunction with the wildcard. * Update * I didn't mention that there is a sitemap listed in the robots.txt file, but according to one tech blogger, he realized that sitemaps trump robots exclusions. So, even though this says in Google Webmaster Tools that everything with a trailing slash will not be indexed, the sitemap contains the important links. I did notice that the link count on Google went from 360 to 336, and the sitemap links under the URL scaled back to 3 from 6. I'm not sure the cause or what links were removed, though. Perhaps it cleaned out garbage. I'm still clueless why they would add in 'Allow: /ims/', that seems pointless. And a quick list of what would index according to the robots rules above (withouth the sitemap) using /*/: domain.com Indexed domain.com/page.html Indexed domain.com/folder/ Not Indexed domain.com/folder/page.html Not Indexed

    Read the article

  • robots.txt not updated

    - by Haridharan
    I have updated some url's and files in robots.txt file to block url's and files from google search results but, still files displaying in search results. As per a suggestion from a site I tried to update the robots.txt by below steps. In Google Webmaster tools, Health - Fetch as Google - type the url and click the fetch button. but, still files displaying in search results. Note: in Google Webmaster tools, Health - Blocked URL's - robots.txt file - downloaded date looks two dates back.

    Read the article

  • Recovering from an incorrectly deployed robots.txt?

    - by Doug T.
    We accidentally deployed a robots.txt from our development site that disallowed all crawling. This has caused traffic to dip dramatically, and google results to report: A description for this result is not available because of this site's robots.txt – learn more. We've since corrected the robots.txt about a 1.5 weeks ago, and you can see our robots.txt here. However, search results still report the same robots.txt message. The same appears to be true for Bing. We've taken the following action: Submitted site to be recrawled through google webmaster tools Submitted a site map to google (basically doing everything possible to say "Hey we're here! and we're crawlable!") Indeed a lot of crawl activity seems to be happening lately, but still no description is crawled. I noticed this question where the problem was specific to a 303 redirect back to a disallowed path. We are 301 redirecting to /blog, but crawling is allowed here. This redirect is due to a site redesign, wordpress paths for posts such as /2012/02/12/yadda yadda have been moved to /blog/2012/02/12. We 301 redirect to wordpress for /blog to keep our google juice. However, the sitemap we submitted might have /blog URLs. I'm not sure how much this matters. We clearly want to preserve google juice for URLs linked to us from before our redesign with the /2012/02/... URLs. So perhaps this has prevented some content from getting recrawled? How can we get all of our content, with links pointed to our site from pre-and-post redesign reporting descriptions? How can we resolve this problem and get our search traffic back to where it used to be?

    Read the article

  • Ignoring Robots - Or Better Yet, Counting Them Separately

    - by [email protected]
    It is quite common to have web sessions that are undesirable from the point of view of analytics. For example, when there are either internal or external robots that check the site's health, index it or just extract information from it. These robotic session do not behave like humans and if their volume is high enough they can sway the statistics and models.One easy way to deal with these sessions is to define a partitioning variable for all the models that is a flag indicating whether the session is "Normal" or "Robot". Then all the reports and the predictions can use the "Normal" partition, while the counts and statistics for Robots are still available.In order for this to work, though, it is necessary to have two conditions:1. It is possible to identify the Robotic sessions.2. No learning happens before the identification of the session as a robot.The first point is obvious, but the second may require some explanation. While the default in RTD is to learn at the end of the session, it is possible to learn in any entry point. This is a setting for each model. There are various reasons to learn in a specific entry point, for example if there is a desire to capture exactly and precisely the data in the session at the time the event happened as opposed to including changes to the end of the session.In any case, if RTD has already learned on the session before the identification of a robot was done there is no way to retract this learning.Identifying the robotic sessions can be done through the use of rules and heuristics. For example we may use some of the following:Maintain a list of known robotic IPs or domainsDetect very long sessions, lasting more than a few hours or visiting more than 500 pagesDetect "robotic" behaviors like a methodic click on all the link of every pageDetect a session with 10 pages clicked at exactly 20 second intervalsDetect extensive non-linear navigationNow, an interesting experiment would be to use the flag above as an output of a model to see if there are more subtle characteristics of robots such that a model can be used to detect robots, even if they fall through the cracks of rules and heuristics.In any case, the basic and simple technique of partitioning the models by the type of session is simple to implement and provides a lot of advantages.

    Read the article

  • URL blocked in robots.txt but still showing up on Google search [closed]

    - by Ahmad Alfy
    Possible Duplicate: Why do Google search results include pages disallowed in robots.txt? In my robots.txt I am disallowing a lot of URLs. Google webmaster tools says there're +750 URL blocked. The problem is the URLs are still showing on Google search. For example I have the following rule: Disallow: /entity/child-health/ But when I search some-keyword + child health the following URL shows up : http://www.sitename.com/entity/child-health/ Am I doing anything wrong? Is is possible for a URL to be blocked using robots.txt and still show up on search results?

    Read the article

  • HTTP 303 redirection and robots.txt

    - by Ian Dickinson
    On a site I'm working on, we're using the HTTP 303 redirect pattern (see this article for background) to distinguish between information and non-information resources. So: some URL's under /id get redirected to dynamically-created pages under /doc. These dynamic pages are built from a database, and contain links to other /doc/ resources, so in general we don't want them to be crawled. Our robots.txt contains: Disallow: /doc However, we do want the non-redirected pages under /id to get indexed by Google et al: Allow: /id So the question I have, which I can't find an answer to so far, is: if an allowed /id page 303-redirects to a /doc page, will it still be blocked by robots.txt? If yes, we're OK, but otherwise I'm going to disallow all /id resources in the robots file, as having the crawler hammer the db would be worse than losing search indexing for the /id pages.

    Read the article

  • Restricting crawler activity to certain directories with robots.txt

    - by neimad
    I would like to use robots.txt to prevent indexing of some parts of my website. I want search engines to index only the / directory and not search inside my controllers. In my robots.txt, I have this: User-Agent: * Disallow: /compagnies/ Disallow: /floors/ Disallow: /spaces/ Disallow: /buildings/ Disallow: /users/ Disallow: / I put this file in /mysite/public. I tested the file with a robots.txt validator and got no errors. However, Google always returns the result of my site. For testing, I added Disallow: /, but again, Google indexed all pages. floors, spaces, buildings, etc. are not physical directories. Is this a bug? How can I work around it?

    Read the article

  • Is my robots.txt working as it should?

    - by TigerBlood
    I want crawlers to have access to http://www.example.com but not http://www.example.com/ My robots.txt is as follows: User-agent: * Allow: /$ Disallow: / My site is in google search results, but I am not coming up in Bing, Yahoo, etc. I have had the same robots.txt since last year, and I initially requested inclusion ~1 year ago, having also resubmitted the URL to those latter search engines several times since as well. Is my robots.txt blocking those other crawlers? And if so, why not google as well? Thanks in advance!

    Read the article

  • Robots.txt Disallow command [on hold]

    - by Saahil Sinha
    How to disallow folders through Robots.txt, which are been crawled due to wrong url structure, which thus cause duplicate page error The URL been crawled as incorrectly by Google leading to duplicate page error: www.abc.com/forum/index.php?option=com_forum However, The actual correct pages however are: www.abc.com/index.php?option=com_forum Is this a correct way by excluding them through robots.txt: To exclude www.abc.com/forum/index.php?option=com_forum Below is command Disallow: /forum/ Will it not block in legitimate component folder 'Forum' of site?

    Read the article

  • Can search engine robots read file with permission 640?

    - by dkjain
    I am on a shared web hosting linux server. I want search engine robots/spiders to be able to read the robots.txt but not any one typing www.mysite.com/robots.txt. As per the following google group post, the user specifies that by setting file permission to 640, it's possible to deny access to robots.txt file by the world but still enable search engine robots to read them. Is that true? If not how it's possible to deny general public access to robots.txt but still allow Search engine robots to read them.

    Read the article

  • How come the ls command prints in multiple columns on tty but only one column everywhere else?

    - by David Lou
    Even after using Unix-like OSes for a couple years, this behaviour still baffles me. When I use the ls command in a directory that has lots of files, the output is usually nicely formatted into multiple columns. Here's an example: $ ls a.txt C.txt f.txt H.txt k.txt M.txt p.txt R.txt u.txt W.txt z.txt A.txt d.txt F.txt i.txt K.txt n.txt P.txt s.txt U.txt x.txt Z.txt b.txt D.txt g.txt I.txt l.txt N.txt q.txt S.txt v.txt X.txt B.txt e.txt G.txt j.txt L.txt o.txt Q.txt t.txt V.txt y.txt c.txt E.txt h.txt J.txt m.txt O.txt r.txt T.txt w.txt Y.txt However, if I try to redirect the output to a file, or pipe it to another command, only a single column appears in the output. Using the same example directory as above, here's what I get when I pipe ls to wc: $ ls | wc 52 52 312 In other words, wc thinks there are 52 lines, even though the output to the terminal has only 5. I haven't observed this behaviour in any other command. Would you like to explain this to me?

    Read the article

  • Google indexing page with parameters but page is Disallowed in robots.txt

    - by Jakobud
    I have the following in robots.txt: User-agent: * Disallow: /refer.php User-agent: NinjaBot Allow: / Sitemap: http://www.mysite.com/sitemap.xml The refer.php file does various things depending on what GET parameters are passed to it. When I do a Google search, I see tons of results for pages like this: http://www.mysite.com/refer.php?o=23945 http://www.mysite.com/refer.php?o=39858 http://www.mysite.com/refer.php?o=9683 http://www.mysite.com/refer.php?o=10569 http://www.mysite.com/refer.php?o=58304 http://www.mysite.com/refer.php?o=69604 Is the reason that Google is indexing these because I don't have an asterisk * after refer.php in the robots.txt ? Should changing it to Disallow: /refer.php* fix the problem?

    Read the article

  • Need assistance making a batch file for renaming files in separate folders

    - by Carnaxus
    Ok, here's one for you. I'm trying to use a batch file to rename a bunch of files, but none of them are in the same folder as the batch file itself. The command prompt keeps telling me that the directory can't be found. I suppose I could just rename all the files in all the folders that match the filename, but I don't want to do that either; I only want to change certain ones. My batch file as it stands is: @echo off ren "engine/info.txt" "disabled.txt" ren "gravplating/info.txt" "disabled.txt" ren "HAWX content/info.txt" "disabled.txt" ren "laserz/info.txt" "disabled.txt" ren "NeuroNaval/info.txt" "disabled.txt" ren "NeuroPlanes/info.txt" "disabled.txt" ren "NeuroTanks/info.txt" "disabled.txt" ren "NeuroWeapons/info.txt" "disabled.txt" ren "WAC Base/info.txt" "disabled.txt" ren "WAC DamageSystem/info.txt" "disabled.txt" ren "WAC GravityController/info.txt" "disabled.txt" ren "WAC Helicopters/info.txt" "disabled.txt" ren "WAC Sweps/info.txt" "disabled.txt" ren "weapons/info.txt" "disabled.txt" ren "AFF_ships/info.txt" "disabled.txt" ren "AntiTakeRifle/info.txt" "disabled.txt" ren "Catmull-Rom Cameras/info.txt" "disabled.txt" ren "Displacer Cannon/info.txt" "disabled.txt" ren "Drumdevil's Trains/info.txt" "disabled.txt" ren "EVEOnline/info.txt" "disabled.txt" ren "gm_botmap_v3/info.txt" "disabled.txt" ren "gm_construct_flatgrass_v5-2/info.txt" "disabled.txt" ren "gm_mobenix_v3_final/info.txt" "disabled.txt" ren "gm_mobenix_v3_highquality_Water/info.txt" "disabled.txt" ren "gm_snabbansairfield_b1/info.txt" "disabled.txt" ren "gm_XhS_construct/info.txt" "disabled.txt" ren "linedraw/info.txt" "disabled.txt" ren "ModelManipulator/info.txt" "disabled.txt" ren "NeuroCars/info.txt" "disabled.txt" ren "Propeller Engine/info.txt" "disabled.txt" ren "VanDookie and Predaaator's pack/info.txt" "disabled.txt" ren "WAC ECM/info.txt" "disabled.txt" ren "WAC Extra Helicopters/info.txt" "disabled.txt" echo Done! pause

    Read the article

  • Robots.txt never downloaded but some blocked URLs in GWT

    - by Zistoloen
    There is something I don't understand in Google Webmaster Tools (GWT) for my Wordpress site. In menu "Blocked URLs", it mention that my robots.txt has never been downloaded but there are some blocked URLs. It's kind of weird and not logical. Am i missing something? User-agent : * Disallow: /*? Disallow: /wp-login.php Disallow: /wp-admin Disallow: /wp-includes Disallow: /wp-content Allow: /wp-content/uploads Disallow: */trackback Disallow: /*/feed Disallow: /*/comments Disallow: /cgi-bin Disallow: /*.php$ Disallow: /*.inc$ Disallow: /*.gz$ Disallow: /*.cgi$ Disallow: /author/* I'm afraid my robots.txt doesn't block several URLs I want to block.

    Read the article

  • How to set robots.txt globally in nginx for all virtual hosts

    - by anup
    I am trying to set robots.txt for all virtual hosts under nginx http server. I was able to do it in Apache by putting the following in main httpd.conf: <Location "/robots.txt"> SetHandler None </Location> Alias /robots.txt /var/www/html/robots.txt I tried doing something similar with nginx by adding the lines given below (a) within nginx.conf and (b) as include conf.d/robots.conf location ^~ /robots.txt { alias /var/www/html/robots.txt; } I have tried with '=' and even put it in one of the virtual host to test it. Nothing seemed to work. What am I missing here? Is there another way to achieve this?

    Read the article

  • How to secure robots.txt file?

    - by CompilingCyborg
    I would like for User-agents to index my relative pages only without accessing any directory on my server. As initial thought, i had this version in mind: User-agent: * Disallow: */* Sitemap: http://www.mydomain.com/sitemap.xml My Questions: Is it correct to block all directories like that - Disallow: */*? Would still search engines be able to see and index my sitemap if i disallowed all directories? What are the best practices for securing the robots.txt file? For Reference: Here is a good tutorial for robots.txt #Add this if you want to stop Alexa from indexing your site. User-agent: ia_archiver Disallow: / #Add this to stop duggmirror User-agent: duggmirror Disallow: / #Add this to allow specific agents User-agent: Googlebot Disallow: #Add this to allow all agents while blocking specific directories User-agent: * Disallow: /cgi-bin/ Disallow: /*?*

    Read the article

  • My First robots.txt

    - by Whitechapel
    I'm creating my first robots.txt and wanted to get a second opinion on it. Basically I have a FTP setup on my board for some special users to transfer files between each other and I do NOT want that included in the search by the bots. I also want to point to my sitemap which gets auto generated by a PHP page. So here is what I have, what else should I include, and if I need to fix anything with it? Also, it's linking to xmlsitemap.php because that generates the sitemap when called. My goal is to allow any search bot crawl the forums to grab meta data. User-agent: * Disallow: /admin/ Disallow: /ali/ Disallow: /benny/ Disallow: /cgi-bin/ Disallow: /ders/ Disallow: /empire/ Disallow: /komodo_117/ Disallow: /xanxan/ Disallow: /zeroordie/ Disallow: /tmp/ Sitemap: http://www.vivalanation.com/forums/xmlsitemap.php Edit, I'm not sure how to handle all the user's folders under /public_html/ since the robots.txt will be going in /public_html.

    Read the article

1 2 3 4 5 6 7 8 9 10 11 12  | Next Page >