Search Results

Search found 499 results on 20 pages for 'robots'.

Page 3/20

  • Robots.txt help

    - by Kyle R
    Google has just thrown up thousands of duplicate-content errors for the link tracker I am using. I want Google and any other search engine to stay away from the tracker pages. The pages I want to disallow to robots are http://www.site.com/page1.html and http://www.site.com/page2.html. How would I write my robots.txt so that no robot visits these links when they appear in my pages?

    Read the article
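
    A minimal robots.txt sketch of what is being asked for, assuming the tracker pages live at the root-level paths shown in the question (site.com is the asker's placeholder); the file would have to sit at http://www.site.com/robots.txt:

      User-agent: *
      Disallow: /page1.html
      Disallow: /page2.html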

  • Is it safe to block redirected (but still linked) URLs with robots.txt?

    - by Edgar Quintero
    I have a website whose URLs have all been optimized, with 301 redirects from the nasty URLs to clean ones. However, the unclean URLs are still linked everywhere throughout the site: in menus, content, products, etc. Google currently has all the clean URLs indexed, along with a few unclean ones too. So the old URLs are still linked all over the site (ideally this wouldn't be the case, but this is how it is at the moment). I would like to block the unclean URLs with robots.txt. The question: if I block these unclean URLs in robots.txt while the entire site still links to them (and they all redirect to the clean versions), will this affect the indexing status at all?

    Read the article
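
    A sketch of the kind of rule being asked about, assuming (purely for illustration) that the unclean URLs share a hypothetical /old/ prefix; note that a URL blocked this way can no longer be crawled, so crawlers will not get to see its 301 redirect:

      User-agent: *
      Disallow: /old/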

  • Which token from a long User-Agent should I use in robots.txt?

    - by Gaia
    The definition of User-Agent states that several tokens can be included, as deemed necessary by the client. I want to block certain bots via robots.txt, and I am confused about which part of the User-Agent string to use, especially for the more obscure bots. For example:

      Mozilla/5.0 (compatible; uMBot-LN/1.0; mailto: [email protected])
      JS-Kit URL Resolver, http://js-kit.com/
      Mozilla/5.0 (compatible; SEOkicks-Robot +http://www.seokicks.de/robot.html

    Do I use the second token? Can tokens contain spaces, or did the SEOkicks folks forget a semicolon after SEOkicks-Robot? I don't actually intend to make my question specific to a couple of bots; I want to know the guideline: which part of the UA do I place in robots.txt for these exotic bots with a UA as long as a haiku?

      User-agent: uMBot-LN/1.0
      Disallow: /

    PS: Thank you, but I do not need to hear that undesirable bots are better blocked with mod_security. I already have commercial mod_sec rules in place.

    Read the article
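
    A sketch of the kind of per-bot record under discussion. It assumes the crawler matches the User-agent line against its own product token as a case-insensitive substring, which is how the robots exclusion convention is commonly implemented, so the short name without the version number is generally enough:

      User-agent: uMBot-LN
      Disallow: /

      User-agent: SEOkicks-Robot
      Disallow: /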

  • Can I include a robots meta tag outside of the head in HTML snippets intended to be SSIed?

    - by Dan
    I have a number of files on my site which are not intended for independent viewing, but rather to be AJAXed into content within the site. They obviously don't meet HTML standards (no body, head, etc.) as independent entities. I would like to prevent search engines from indexing these pages, but I do not have access to /robots.txt (which would be much more ideal). My question is: could I include the following at the top of these partial HTML files and get the desired result?

      <meta name="robots" content="noindex, noarchive">

    I guess there are two parts to this question. Will this cause any rendering issues in any browsers? Will search engines (at least Google & Bing) interpret it as intended?

    Read the article
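
    A hedged alternative sketch for the same goal when robots.txt is out of reach: the major engines also honor an X-Robots-Tag response header, which leaves the markup of the snippets untouched. The file pattern below is hypothetical and assumes Apache with mod_headers enabled:

      <FilesMatch "\.part\.html$">
          Header set X-Robots-Tag "noindex, noarchive"
      </FilesMatch>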

  • robots.txt file with more restrictive rules for certain user agents

    - by Carson63000
    I'm a bit vague on the precise syntax of robots.txt, but what I'm trying to achieve is: tell all user agents not to crawl certain pages, and tell certain user agents not to crawl anything (basically, some pages with enormous amounts of data should never be crawled, and some voracious but useless search engines, e.g. Cuil, should never crawl anything). If I do something like this:

      User-agent: *
      Disallow: /path/page1.aspx
      Disallow: /path/page2.aspx
      Disallow: /path/page3.aspx

      User-agent: twiceler
      Disallow: /

    ...will it work as expected, with all user agents matching the first record and skipping page1, page2 and page3, and twiceler matching the second record and skipping everything?

    Read the article

  • PHP robots.txt parsing

    - by omfgroflmao
    Is there an easier way to do this?

      function parse_robots_txt($URL){
          $parsed = parse_url($URL);
          $robots = file_get_contents('http://'.$parsed['host'].'/robots.txt', FILE_TEXT);
          $exploded = explode('user-agent:', strtolower($robots));
          foreach($exploded as $user_agent){
              $user_agent = trim($user_agent);
              if(substr($user_agent, 0, 1) == '*'){
                  $user_agent = str_replace('#', '', preg_replace('/#.*\\n/i', '', $user_agent));
                  $user_agent = str_replace('disallow:', '', substr($user_agent, 1));
                  $user_agent = preg_replace('/allow:/i', '+-+-+-+', $user_agent, 1);
                  $user_agent = str_replace('allow:', '', $user_agent);
                  print_r(explode('+-+-+-+', $user_agent));
              }
          }
      }

    Read the article
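
    For illustration, a minimal usage sketch of the function above (example.com is a placeholder); only the host part of the URL is used, and the function prints the rule fragments it extracts from the wildcard user-agent block:

      <?php
      // assumes parse_robots_txt() from the snippet above has been defined or included
      parse_robots_txt('http://www.example.com/some/page.html');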

  • Robots Crawling Across Namespace?

    - by Codex73
    I migrated a site from one domain to another and placed a permanent redirect on the old account. My stats logs are capturing this:

      Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
      /libro_metaboforte_chap5.php/members/members/file_chap6.php

    I added the robots.txt below, which wasn't present at the time of migration.

    robots.txt contents:

      User-agent: *
      Allow: /
      Disallow: /members/
      Disallow: /includes/

    .htaccess contents:

      DirectoryIndex index.php index.html
      Options +FollowSymlinks
      RewriteEngine On   # Turn on the rewriting engine
      RewriteBase /
      RewriteCond %{REQUEST_FILENAME} !-d
      RewriteCond %{REQUEST_FILENAME} !-f
      RewriteCond %{REQUEST_URI} !^/store/?$
      RewriteCond %{QUERY_STRING} !.
      RewriteRule ^.+/?$ index.php [QSA,L]
      RewriteCond %{QUERY_STRING} ^curlang=([a-z]*)$
      RewriteRule ^.+/?$ index.php? [QSA,L]

    I will continue to log incoming bot hits. My .htaccess does rewrite, and I only just added the robots file. The funny part is that the bot is stepping into doubled directories. I don't know whether the problem was not having robots.txt in place, or the .htaccess rewrites themselves.

    Read the article

  • Rewrite for robots.txt and favicon.ico

    - by BHare
    I have set up some rules so that subdomains (my users) default to where I have located the robots.txt, favicon.ico, and crossdomain.xml. So if a user creates a site, say testing.mywebsite.com, and they don't make their own favicon.ico at testing.mywebsite.com/favicon.ico, then it will use the favicon.ico I have in /misc/favicon.ico. This works perfectly, but it doesn't work for the main website: if you attempt to go to mywebsite.com/favicon.ico, it checks whether "/" exists, which it does, and then never redirects to /misc/favicon.ico. How can I get both cases to redirect to /misc/favicon.ico?

      # If crossdomain.xml (openpalace file), favicon.ico or robots.txt doesn't exist on their
      # side, then redirect to the site's own, just to have something to go on.
      RewriteCond %{REQUEST_URI} crossdomain.xml$
      RewriteCond ^(.+)crossdomain.xml !-f
      RewriteRule ^(.*)$ /misc/crossdomain.xml [L]

      RewriteCond %{REQUEST_URI} favicon.ico$
      RewriteCond ^(.+)favicon.ico !-f
      RewriteRule ^(.*)$ /misc/favicon.ico [L]

      RewriteCond %{REQUEST_URI} robots.txt$
      RewriteCond ^(.+)robots.txt !-f
      RewriteRule ^(.*)$ /misc/robots.txt [L]

    Read the article
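
    One way to express the intended fallback, sketched here as an assumption rather than a tested fix: test the mapped filesystem path via %{REQUEST_FILENAME} instead of the back-reference, so the condition also fails (and the rewrite fires) on the main site whenever the file itself is missing:

      RewriteCond %{REQUEST_FILENAME} !-f
      RewriteRule favicon\.ico$ /misc/favicon.ico [L]

      RewriteCond %{REQUEST_FILENAME} !-f
      RewriteRule robots\.txt$ /misc/robots.txt [L]

      RewriteCond %{REQUEST_FILENAME} !-f
      RewriteRule crossdomain\.xml$ /misc/crossdomain.xml [L]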

  • Should I disallow (robots.txt) archive/author pages with links already available on the front page? [on hold]

    - by WPRookie82
    I am working on a simple WordPress blog where, when an article is published, it appears on ALL of these pages:

      Homepage - headline (clickable) + 3-line summary
      Parent category page - headline (clickable) + 3-line summary
      Child category page - headline (clickable) + 3-line summary
      Author page - headline (clickable)
      sitemap.xml

    I've been told that I should add all author pages to my robots.txt under Disallow, so that search engine bots do not spider /author/*, since all links on these pages are available elsewhere. Is this a good approach, or is rel=nofollow better, or should I not worry about this at all?

    Read the article
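
    A sketch of the robots.txt rule being discussed, assuming the default WordPress author permalink base of /author/; note that Disallow only stops crawling and does not by itself remove URLs that are already indexed:

      User-agent: *
      Disallow: /author/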

  • Robots.txt syntax

    - by Sinan
    I'm not an expert on robots.txt, and I have the following in one of my clients' robots.txt files:

      User-agent: *
      Disallow:
      Disallow: /backup/
      Disallow: /stylesheets/
      Disallow: /admin/

    I am not sure about the second line. Does that line disallow all spiders?

    Read the article
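
    For comparison, a minimal sketch of a record that really does block all spiders from everything; a bare Disallow: with no value, as in the second line above, is the opposite and disallows nothing:

      User-agent: *
      Disallow: /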

  • Procedural modeling of Robots?

    - by anon
    Procedural techniques are common for texture synthesis, modeling plants, and modeling terrain. However, I've seen very little work on the algorithmic construction of robots, which is a bit surprising given how mechanical these systems are. Does anyone have a good resource on the algorithmic construction of robots / robotic humanoids? Thanks!

    Read the article

  • Multiple SiteMap: entries in robots.txt?

    - by user306942
    I have been searching around using Google but I can't find an answer to this question. A robots.txt file can contain the following line:

      Sitemap: http://www.mysite.com/sitemapindex.xml

    But is it possible to specify MULTIPLE sitemap index files in the robots.txt and have the search engines recognize that and crawl ALL of the sitemaps referenced in each sitemap index file? For example, will this work:

      Sitemap: http://www.mysite.com/sitemapindex1.xml
      Sitemap: http://www.mysite.com/sitemapindex2.xml
      Sitemap: http://www.mysite.com/sitemapindex3.xml

    Read the article

  • Meaning of robots.txt at yahoo.com

    - by hussain
    I want to know the meaning of Yahoo's robots.txt. The website ( http://www.yahoo.com/robots.txt ) has the following lines:

      User-agent: *
      Disallow: /p/
      Disallow: /r/
      Disallow: /*?

    I don't know the meaning of the last line (Disallow: /*?). Please let me know. Thanks in advance.

    Read the article

  • wget not respecting my robots.txt. Is there an interceptor?

    - by Jane Wilkie
    I have a website where I post CSV files as a free service. Recently I have noticed that wget and libwww have been scraping pretty hard, and I was wondering how to circumvent that, even if only a little. I have implemented a robots.txt policy, posted below:

      User-agent: wget
      Disallow: /

      User-agent: libwww
      Disallow: /

      User-agent: *
      Disallow: /

    Issuing wget http://myserver.com/file.csv from my totally independent Ubuntu box shows that the robots.txt just doesn't seem to stop it. Anyway, I don't mind people grabbing the info; I just want to implement some sort of flood control, like a wrapper or an interceptor. Does anyone have a thought about this, or could you point me in the direction of a resource? I realize it might not even be possible; I'm just after some ideas. Janie

    Read the article
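
    Since robots.txt is purely advisory (and wget only consults it when mirroring recursively), one hedged server-side sketch, assuming Apache with mod_rewrite, is to refuse requests whose User-Agent advertises wget or libwww; a client can trivially change its User-Agent, and real flood control would need a firewall or rate-limiting module:

      RewriteEngine On
      RewriteCond %{HTTP_USER_AGENT} (wget|libwww) [NC]
      RewriteRule . - [F,L]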

  • Is there any advantage/disadvantage to using robots.txt to disallow access to legal pages such as terms, privacy policy, etc.?

    - by CaptainCodeman
    As I understand it, repetitive content is a detriment to search engine placement. Given that many websites use similar or even identical "Terms and Conditions" and "Privacy Policy" pages, due to similar legal wording or to copying and pasting from the same source, would it be a good idea to disallow access to these pages via robots.txt in order to avoid being penalized for "non-original content"? Or, on the contrary, could the search engines identify this as circumvention and penalize the site for trying to hide content? Or does it not matter?

    Read the article

  • robots.txt, how effective is it and how long does it take?

    - by Stefan
    We recently updated the site to a single-page site that uses jQuery to slide between "pages", so we now have only index.php. When you search for the company on engines such as Google, you get the site and a listing of its sub-pages, which now lead to outdated pages. Our plan doesn't allow us to edit the .htaccess, and the old pages are .html docs, so I cannot use PHP redirects either. So if I put in place a robots.txt telling the engines not to crawl beyond index.php, how effective will this be in preventing/removing the crawled sub-pages? And, as a rough guess, how long before the search engines would update?

    Read the article
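
    A sketch of the kind of robots.txt being described, with hypothetical old page names; note that Disallow stops future crawling but does not by itself remove pages already in the index, for which the engines' URL removal tools are usually the quicker route:

      User-agent: *
      Disallow: /about.html
      Disallow: /services.html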

  • Cross-domain jQuery using YQL gives robots.txt error

    - by Jens Roland
    On the page http://qxlapps.dk/test.htm I am trying to perform an Ajax load from another domain, qxlapp.dk. I am using James Padolsey's xdomainajax.js plugin from http://james.padolsey.com/javascript/cross-domain-requests-with-jquery/. When I open my test page I get no output, but Firebug shows the JSON result, including the error message:

      "forbidden":"robots.txt for the domain disallows crawling for url: http://qxlapp.dk/projects/dagens_kup/show.php"

    The robots.txt on the qxlapp.dk domain contains the following:

      User-agent: Yahoo Pipes 2.0
      Allow: /

      User-agent: *
      Allow: /

    So I don't see what the problem is. Shouldn't it pull the page just fine with those settings?

    Read the article

  • SEO chaos from changing robots.txt file in Wordpress site

    - by Seedorf
    Hi there, I recently edited the robots.txt file on my site using a WordPress plugin. However, since I did this, Google seems to have removed my site from its search results. I'd appreciate an expert opinion on why this is so, and a possible solution. I'd initially done it to increase my search ranking by limiting the pages accessed by Google. This is my robots.txt file in WordPress:

      User-agent: *
      Disallow: /cgi-bin
      Disallow: /wp-admin
      Disallow: /wp-includes
      Disallow: /wp-content/plugins
      Disallow: /wp-content/cache
      Disallow: /trackback
      Disallow: /feed
      Disallow: /comments
      Disallow: /category/*/*
      Disallow: */trackback
      Disallow: */feed
      Disallow: */comments
      Disallow: /*?*
      Disallow: /*?
      Allow: /wp-content/uploads
      Sitemap: http://www.instant-wine-cellar.co.uk/wp-content/themes/Wineconcepts/Sitemap.xml

    Read the article

  • disallow certain url in robots.txt

    - by chrism
    We implemented a rating system on a site a while back that involves a link to a script. However, with the vast majority of ratings on the site at 3/5 and the ratings very even across 1-5, we're beginning to suspect that search engine crawlers etc. are getting through. The URLs used look like this:

      http://www.thesite.com/path/to/the/page/rate?uid=abcdefghijk&value=3

    When we started, we added the following to our robots.txt:

      User-agent: *
      Disallow: /rate

    Is this incorrect, or are Googlebot and others simply ignoring our robots.txt?

    Read the article
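
    Worth sketching out: Disallow: /rate only matches URLs whose path begins with /rate at the site root, while the example URLs have rate nested several levels deep. Under the wildcard extension honored by Googlebot and the other major crawlers (an assumption about which bots are involved), a pattern like the following would cover them:

      User-agent: *
      Disallow: /*/rate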

  • How can I use Varnish to generate a robots.txt file, even for subdomains of the same site?

    - by Sam
    I want to generate a robots.txt file using Varnish 2.1, so that domain.com/robots.txt is served by Varnish and subdomain.domain.com/robots.txt is also served by Varnish. The robots.txt must be hard-coded into the default.vcl file. Is that possible? I know Varnish can generate a maintenance page on error; I'm trying to make it generate a robots.txt file. Can anyone help?

      sub vcl_error {
          set obj.http.Content-Type = "text/html; charset=utf-8";
          synthetic {"
      <?xml version="1.0" encoding="utf-8"?>
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
      <html>
        <head>
          <title>Maintenance in progress</title>
        </head>
        <body>
          <h1>Maintenance in progress</h1>
        </body>
      </html>
      "};
          return (deliver);
      }

    Read the article
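
    A minimal sketch of the usual Varnish 2.x approach, assuming an arbitrary custom status code (702 here) to route the request from vcl_recv into vcl_error, where the body is synthesized; since the check is on req.url only, every host, including subdomains, gets the same file. The directives themselves are placeholders:

      sub vcl_recv {
          if (req.url == "/robots.txt") {
              error 702 "robots.txt";
          }
      }

      sub vcl_error {
          if (obj.status == 702) {
              set obj.status = 200;
              set obj.http.Content-Type = "text/plain; charset=utf-8";
              synthetic {"User-agent: *
      Disallow:
      "};
              return (deliver);
          }
      }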

  • Google Wave Robots API v2

    Pamela Fox describes how Wave robots work and the new features in the Robots API v2. From: GoogleDevelopers. Time: 17:28.

    Read the article

  • Google I/O 2010 - Making smart & scalable Wave robots

    Wave 201, presented by David Byttow and Marcel Prasetya. A smart robot must be able to store persistent data. Wave robots can store data in wave structures, like wavelets, datadocs, and annotations, instead of traditional datastores. A scalable robot must perform operations with minimal bandwidth. Wave robots can optimize by selecting the appropriate amount of context, the optimal events, and narrow filters for events. In this talk, we'll share best practices on data storage and scaling. For all I/O 2010 sessions, please go to code.google.com. From: GoogleDevelopers. Time: 58:25.

    Read the article

  • Pages still show up in Google search even after being disallowed in robots.txt [duplicate]

    - by Jota Onasys
    This question already has an answer here: "With Robots.txt disallow all, why was my site still getting traffic?" (5 answers). Why do some pages still show up in Google search even though they are disallowed in robots.txt? Is the best solution here to remove the Disallow from robots.txt and just add noindex, nofollow meta tags to the pages I want blocked? Or should I submit a request to Google directly to remove those pages?

    Read the article
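
    For reference, a sketch of the meta tag route mentioned in the question; for the tag to work, the page has to remain crawlable (i.e. not disallowed in robots.txt), otherwise the crawler never gets to see it:

      <meta name="robots" content="noindex, nofollow">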
