Search Results

Search found 499 results on 20 pages for 'robots'.

Page 6/20 | < Previous Page | 2 3 4 5 6 7 8 9 10 11 12 13  | Next Page >

  • CDN virtual subdomain causes duplicated content

    - by user3474818
    I have created a subdomain and a CNAME record which points to the domain root. The subdomain www.static.example.com is actually a copy of the entire website www.example.com and is supposed to act as a CDN and serve static content in order to improve speed. However, all of my content can be accessed via the subdomain as well, so Google has indexed it all and now I am dealing with duplicated content. How could I deny access to crawlers for the subdomain, bearing in mind that I do not have a separate subfolder for the subdomain, so I can't create a separate robots.txt file?
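
    For illustration, one common workaround (a sketch assuming Apache with mod_rewrite; robots-static.txt is a hypothetical second file kept in the shared docroot) is to serve a different robots file depending on the requested host:

        RewriteEngine On
        RewriteCond %{HTTP_HOST} ^www\.static\.example\.com$ [NC]
        RewriteRule ^robots\.txt$ robots-static.txt [L]

    where robots-static.txt simply disallows everything:

        User-agent: *
        Disallow: /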

    Read the article

  • SEO: disallowing Google from indexing forms in iframes or not?

    - by Marco Demaio
    I usually place forms in iframes (i.e. order forms, request assistance forms, contact forms, etc.). Just the forms; I never place other contents or pages in iframes. From an SEO point of view, would you exclude forms from being indexed/crawled by Google or not? I mean, my forms hardly ever contain keywords/keyphrases; moreover, I obviously place empty title/meta description tags in the pages shown in iframes to display forms, because those titles are never displayed in the browser title bar. So I'm wondering, what's the point of letting Google index them? Moreover, I think these form pages might suck PR out of all the other pages that are more valuable for SEO. If your answer is "yes, I would exclude them from indexing", would you simply use robots.txt to exclude them? Thanks!
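
    For illustration, if the form pages were grouped under one directory (a hypothetical /forms/ path), a short robots.txt rule would keep crawlers out:

        User-agent: *
        Disallow: /forms/

    A <meta name="robots" content="noindex"> tag inside each form page works too, and unlike robots.txt it also drops pages that are already indexed.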

    Read the article

  • Googlebot can't access my site when crawling from rootdomain

    - by PéCé
    I can't explain why I get this message for my root domain result in Google: trocmalin.com/ "A description for this result is not available because of this site's robots.txt – learn more." Here are my site's specifics: vide-greniers.trocmalin.com is the site address; www.trocmalin.com redirects (301) to vide-greniers.trocmalin.com; trocmalin.com redirects (301) to vide-greniers.trocmalin.com too. My robots.txt reads:

        User-agent: *
        Disallow: /orga/

        User-agent: *
        Disallow: /sitemap-update

    Google results for vide-greniers.trocmalin.com are rendered well, as are the sub-pages allowed for bots. But the result for my root domain (trocmalin.com) gives this message. Can you help me?
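
    One detail worth checking (an observation, not a confirmed cause): robots.txt parsers conventionally expect a single group per user agent, so the two separate User-agent: * blocks above would normally be written as one:

        User-agent: *
        Disallow: /orga/
        Disallow: /sitemap-update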

    Read the article

  • How do I deal with content scrapers? [closed]

    - by aem
    Possible Duplicate: How to protect SHTML pages from crawlers/spiders/scrapers? My Heroku (Bamboo) app has been getting a bunch of hits from a scraper identifying itself as GSLFBot. Googling for that name produces various results from people who've concluded that it doesn't respect robots.txt (e.g., http://www.0sw.com/archives/96). I'm considering updating my app to keep a list of banned user agents, serving all requests from those user agents a 400 or similar, and adding GSLFBot to that list. Is that an effective technique, and if not, what should I do instead? (As a side note, it seems weird to have an abusive scraper with a distinctive user agent.)
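
    For comparison (a sketch only; Heroku Bamboo has no front-end Apache you control, but the idea translates to app-level middleware), the same technique expressed as an Apache rule refuses matching user agents outright:

        RewriteEngine On
        RewriteCond %{HTTP_USER_AGENT} GSLFBot [NC]
        RewriteRule .* - [F]

    [F] answers with a 403 rather than a 400; either way the point is to reply cheaply without rendering the page.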

    Read the article

  • Will duplicate international (i18n) content hinder SEO rankings?

    - by Rhys
    Google clearly states that duplicate content within a single domain, or across multiple domains, is not advised. This is understood, but I am not sure whether there are any exceptions for sites with region-specific content that is often replicated across locales. For example, a site's /en-us/about page could be identical to /en-uk/about, whereas /en-ja/about is most likely unique. Are GYM (Google, Yahoo, Microsoft) smart enough to understand that the initial URL segment is a locale specifier? Is there any robots.txt or header trickery, etc., that I should include to outline the site's international structure?
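
    One mechanism aimed at exactly this case (a sketch with placeholder URLs; note that hreflang takes ISO region codes, so en-gb rather than en-uk) is the rel="alternate" hreflang annotation in each locale page's head:

        <link rel="alternate" hreflang="en-us" href="http://www.example.com/en-us/about" />
        <link rel="alternate" hreflang="en-gb" href="http://www.example.com/en-gb/about" />

    This declares the pages as deliberate locale variants rather than accidental duplicates.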

    Read the article

  • Search engine bots accessing strange URLs

    - by casasoft
    We have ELMAH enabled on our site and get an error logged whenever a Page Not Found error is triggered on the website. We have recently redesigned the website, so we understand that search engine robots might try to access previously indexed pages and trigger Page Not Found errors. For this reason, we have set up permanent redirects from such previously indexed pages to the respective new pages. The website in question is www.chambercollege.com, and for example, a previously indexed URL was www.chambercollege.com/special-offers.aspx. This page is no longer accessible, so we have created the necessary permanent redirect to the respective page at www.chambercollege.com/en/content/special-offers-161/. Now we are starting to receive Page Not Found errors from search engine bots (e.g. the MSN bot) trying to access the URL www.chambercollege.com/special-offers.aspx/images/shadow_right.jpg/. Any idea how a search engine could make up that strange URL, and do you have any suggestions about what to do best?
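
    A plausible explanation is that a crawler resolved a relative image path (images/shadow_right.jpg) against the old URL as if special-offers.aspx were a directory. A catch-all permanent redirect (a sketch assuming IIS with the URL Rewrite module; the rule name is arbitrary) would absorb such junk suffixes along with the original URL:

        <rule name="old-special-offers" stopProcessing="true">
          <match url="^special-offers\.aspx" />
          <action type="Redirect" url="/en/content/special-offers-161/" redirectType="Permanent" />
        </rule>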

    Read the article

  • Htaccess/robots.txt to allow search bots to explore main domain but not directory on other domain

    - by gX
    OK, I understand the title didn't make much sense, so here I've tried to explain it in detail. I'm using a host that gives me space for my domain and lets me "add on" other domains to it. So let's say I have a domain A, and I add on a domain B. Basically my host gives me a public_html where I can put the stuff that shows when someone visits website A. But when I add the domain B, it has me put the content of B inside of that public_html, so that website B.com can also be visited by going to A.com/siteB. That's all good, except that Google has started indexing B.com as well as A.com/siteB. I'm OK with it indexing B.com, but I somehow want to prevent it from indexing A.com/siteB, so that when people search for B, it doesn't end up showing A.com/siteB. Any ideas? Let me know if the question is still unclear.
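
    Since the overlap only exists from A's side, one sketch (siteB being the add-on folder name from the question) is to disallow that path in the robots.txt at the top of public_html, which is only ever served as A.com/robots.txt:

        User-agent: *
        Disallow: /siteB/

    B.com resolves its own robots.txt to public_html/siteB/robots.txt, so crawling of B.com is unaffected.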

    Read the article

  • Handling SEO for infinite pages that cause slow external API calls

    - by Noam
    I have an 'infinite' number of pages on my site which rely on an external API. Generating each page takes time (1 minute). Links on the site point to such pages, and when a user clicks one it is generated while they wait. Considering I cannot pre-create them all, I am trying to figure out the best SEO approach to handle these pages. Options:

    1. Create really simple pages for the web spiders, so that only real users fetch the data and generate the full page. I'm a little 'afraid' Google will see this as low-quality content, which might also feel duplicated.

    2. Put the pages under a directory on my site (e.g. /non-generated/) and put a disallow in robots.txt. The problem here is that I don't want users to have to deal with a different URL when they want to share the page or make sense of it. I thought about maybe redirecting real users from this URL back to the regular hierarchy, 'fooling' Google into not reaching them that way, but again I'm not sure Google would like me for that.

    3. Let Google crawl these pages. The main problem is that I can't control the rate of the API calls, and my site would seem slower than it should from a spider's perspective (if it crawled only the already-generated pages, it would think the site is much faster).

    Which approach would you suggest?
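
    A variation on option 2 that keeps the URLs unchanged (a sketch assuming Apache with mod_headers; the /external/ prefix is hypothetical) is to mark generated pages with a noindex header instead of moving them:

        <LocationMatch "^/external/">
            Header set X-Robots-Tag "noindex"
        </LocationMatch>

    This does not save the API call when a spider fetches the page, but it removes the different-URL problem for users who want to share links.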

    Read the article

  • mod_rewrite and SEO friendliness

    - by John Doe
    My website has an atypical structure and I'm not sure if this could create problems in the long run, especially for SEO positioning purposes. I have a single, large PHP script, and I use the Apache module mod_rewrite in the .htaccess file to create friendly URLs, for example:

        RewriteRule ^$ /index.php?section=Main
        RewriteRule ^createArticle$ /index.php?section=Main&view=CreateArticle
        RewriteRule ^configuration$ /index.php?section=Configuration
        RewriteRule ^article/([0-9]{1,10})$ /index.php?section=Article&view=Default&id=$1
        RewriteRule ^deleteArticle/([0-9]{1,10})$ /index.php?section=Article&view=Delete&id=$1
        RewriteRule ^reportArticle/([0-9]{1,10})$ /index.php?section=Article&view=Report&id=$1
        RewriteRule ^logIn$ /index.php?section=Authentication
        ...

    So, www.example.com/index.php?section=Article&view=Default&id=105 would become www.example.com/article/105. The only real physical file is index.php, in which the parameters of the queried URL are processed and the corresponding result is outputted. My question is: do the crawling robots (e.g. Googlebot) recognize these links? Do they index the resulting HTML outputted by index.php with the specified parameters as if it were an actual HTML file? Also, would this become a problem when creating a Sitemap?
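
    For the sitemap question, listing only the friendly URLs should suffice, since crawlers treat rewritten URLs as ordinary pages. A minimal sketch using the example URL above:

        <?xml version="1.0" encoding="UTF-8"?>
        <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
          <url>
            <loc>http://www.example.com/article/105</loc>
          </url>
        </urlset>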

    Read the article

  • What kind of redirect (301 or 302) for an email links tracker?

    - by MaxiWheat
    We are developing an email sending application ("à la" Mailchimp). Hyperlinks inserted by our users in the emails they want to send are replaced by a tracking URL on our application (https://ourdomain.com/trackingurl?blablabla) which then redirects the email reader to the original URL our users included in their emails. This allows us to record statistics about link clicks. Until now, we used 301s for those redirections, but we noticed that Google began indexing pages on our application which are in fact redirects to other domains. (The title and snippet in the Google results are from the other domain, but the link in green is from our application.) We took action by adding those URLs to our robots.txt, but Google seems to take forever (months!) to remove them from its index, and removing them by hand in Webmaster Tools would take a lot of time since there are a lot of them. I would like to know which kind of HTTP redirect (301 or 302) is best suited for this kind of operation. Do you think switching to 302 redirects could improve this situation, since we don't really want Google to index redirected links from our clients' emails?
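
    Independent of the 301/302 choice, one option (a sketch assuming the tracker sits behind Apache with mod_headers; /trackingurl is the path from the question) is a noindex header on the redirect responses themselves:

        <Location "/trackingurl">
            Header always set X-Robots-Tag "noindex"
        </Location>

    Note that the robots.txt entries can work against removal here: once the tracking URLs are disallowed, Google can no longer fetch them to see either the redirect or the noindex.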

    Read the article

  • Does GoogleBot respect User-agent: *?

    - by rkulla
    I blocked a page in robots.txt under User-agent: *, and tried to do a manual removal of that URL from Google's cache in Webmaster Tools. Google said it wasn't being blocked in my robots.txt, so I then blocked it specifically under User-agent: GoogleBot and tried removing it again, and this time it worked. Does that mean Google doesn't respect User-agent: *, or what?
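
    For reference, Googlebot obeys only the single most specific group that matches it, so if a robots.txt contains a Googlebot group anywhere, the * group is ignored entirely for Googlebot. A minimal illustration with placeholder paths:

        User-agent: *
        Disallow: /private/

        User-agent: Googlebot
        Disallow: /other/

    Here Googlebot may crawl /private/, because only the second group applies to it.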

    Read the article

  • [Disallow: /index.php] seems to block /my-beautiful-sef-url-123

    - by Jaroslav Záruba
    Hello, I have a robots.txt that looks like this:

        User-agent: *
        Disallow: /system/
        Disallow: /admin/
        Disallow: /index.php

    The obvious goal has been to prevent all the ugly URLs from being indexed, as they all begin with "/index.php". But for some reason all URLs like /my-beautiful-sef-url-123 are listed under Crawl errors in Google Webmaster Tools with "URL restricted by robots.txt". (When I test such a URL it yields Allowed for both Googlebot and Googlebot-Mobile.) Can anyone help, please?
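
    If the intent is to match only the index.php URLs themselves, a tighter sketch uses the $ end-of-URL anchor and a literal ?, which Googlebot supports as extensions to the basic standard:

        Disallow: /index.php$
        Disallow: /index.php?

    This leaves no room for the prefix rule to be applied more broadly than intended.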

    Read the article

  • I'm getting a "Does not implement IController" error on images and robots.txt in MVC2

    - by blesh
    I'm getting a strange error on my web server for seemingly every file but the .aspx files. Here is an example; just replace '/robots.txt' with any .jpg or .gif name or whatever and you'll get the idea: The controller for path '/robots.txt' was not found or does not implement IController. I'm sure it's something to do with how I've set up routing, but I'm not sure what exactly I need to do about it. Also, this is a mixed MVC and WebForms site, if that makes a difference.
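
    The usual cure for this class of error (a sketch against ASP.NET MVC 2's routing API; the registrations would live in Global.asax.cs) is to tell routing to leave static files alone before the MVC routes are mapped:

        public static void RegisterRoutes(RouteCollection routes)
        {
            // Let IIS serve physical files (robots.txt, images, ...) directly
            routes.RouteExistingFiles = false;
            routes.IgnoreRoute("{resource}.axd/{*pathInfo}");
            routes.IgnoreRoute("robots.txt");
            // ... MapRoute calls for the MVC pages follow here
        }

    RouteExistingFiles already defaults to false, so if files on disk are still reaching MVC, something earlier in the pipeline (e.g. a wildcard handler mapping in IIS) is likely funneling every request into routing.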

    Read the article

  • Unit testing Google wave robots in Java

    - by Paul
    Any tips or best practices for unit testing Google Wave robots written in Java? I'm expecting to deploy on AppEngine, if that helps. I'm a fan of TDD but new to both Wave Robots and AppEngine, so I'm hoping to use TDD to help me explore the design space.

    Read the article

  • Why deny access to website for msnbot/bingbot?

    - by Quandary
    I've seen quite a lot of tutorials that recommend banning user agents containing the strings libwww-perl and msnbot. I understand why one would ban libwww-perl; it's mainly, if not only, used for hacking and spamming. But why are there so many sites recommending banning msnbot/bingbot? Since it's a search engine, even if only one with a marginal market share, I would expect one would want this bot to crawl one's sites. What is it that msnbot does that makes people ban it?

    Read the article

  • What bots are really worth letting onto a site?

    - by blunders
    Having written a number of bots, and having seen the massive number of random bots that happen to crawl a site, I am wondering: if the goal of allowing bots onto a site is the potential for them to send real traffic back, is there any reason to allow bots that are not known to send real traffic back? And how do you spot these "good" bots, based on how they identify themselves, the IPs they come from, their behaviors, etc.?

    Read the article

  • Using rel=canonical and noindex in a 1-n partners environment

    - by Telemako Mako
    We sell whole sites (domain, etc.) to partners who create content that is also shown, aggregated, on the main site. What we want to achieve is that the main site's copy is the original, but the one that is indexed is the partner's copy. We want it this way so that the search results point to the partner sites and never to the main site, while the main site still gets all the credit for the links. We are trying to set the main site's article with noindex, follow and a link to the partner article, and in the partner article we have a rel=canonical pointing to the main site's article. Are we correct, or will the noindex on the main site break the canonical reference?
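
    In markup terms the described setup looks like this (a sketch with placeholder URLs):

        <!-- main site article -->
        <meta name="robots" content="noindex, follow">

        <!-- partner article -->
        <link rel="canonical" href="http://mainsite.example.com/article-123">

    The tension the question hints at is real: the canonical asks the engines to consolidate onto a page that the noindex asks them to drop, so the two signals pull in opposite directions.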

    Read the article

  • How to get search engines to properly index an ajax driven search page

    - by Redtopia
    I have an ajax-driven search page that will allow users to search through a large collection of records. Each search result points to index.php?id=xyz (where xyz is the id of the record). The initial view does not have any records listed, and there is no interface that allows you to browse through all records; you can only conduct a search. How do I build the page so that spiders can crawl each record? Or is there another way (outside of this specific search page) that will allow me to point spiders to a list of all records? FYI, the collection is rather large, so dumping links to every record in a single request is not a workable solution; outputting the records must be done in multiple requests. Each record can be viewed via a single page (e.g. "record.php?id=xyz"). I would like all the records indexed, without anything being indexed from the crawl page that shows where the records exist, for example:

        <a href="/result.php?id=record1">Record 1</a>
        <a href="/result.php?id=record2">Record 2</a>
        <a href="/result.php?id=record3">Record 3</a>
        <a href="/seo.php?page=2">next</a>

    Assuming this is the correct approach, I have these questions: How would the search engines find the crawl page? Is it possible to prevent the search engines from indexing the words "Record 1", etc., and "next"? Can I output only the links? Or maybe something like:

    Read the article

  • Should I block bots from my site and why?

    - by Frank E
    My logs are full of bot visits, often from Eastern Europe and China. The bots identify themselves as Ahrefs, Seznam, LSSRocketCrawler, Yandex, Sogou and so on. Should I block these bots from my site, and why? Which ones have a legitimate purpose in increasing traffic to my site? Many of them are SEO crawlers. I have to say I see less traffic, if anything, since the bots arrived in large numbers. It would not be too hard to block them, since they all admit in their User-Agent that they are bots.

    Read the article

  • Adsense click bot is click bombing my site

    - by Graham
    I have a site that gets roughly 7,000 - 10,000 page views per day right now. Starting around 1 AM on 7/1/12, I noticed the CTR was rising dramatically. These clicks would be credited and then de-credited soon after, so they were obviously fraudulent clicks. The next day I had about 200 clicks in my account, with about 100 of them being fraudulent. It's about 3 - 8 per hour, evenly dispersed across each of the three ads, 24 hours a day. This leads me to believe that it's some sort of AdSense click bot. Also, I removed the ads last evening, then put them back up around 3 AM, and the invalid clicks started within 10 minutes. I signed up for statcounter.com to analyze the exit links on the AdSense ads. Then I conditionally blocked ads for the IP address of the person/bot I suspected of doing this. But I think the bot has several proxies to choose from and can refresh IP addresses. I've notified Google through the invalid click form/email 4 times over the past two days to let them know I'm aware of the situation and am working on a solution. I've also temporarily removed all ads on that site. How can I block a bot like this? Thank you.

    Read the article

  • Receiving requests where absolute URL on page are morphed to relative URLs

    - by Jacob
    In our web pages, we have a hyperlink with an href to an absolute URL: https://some.other.host.com/blah.aspx?var1=val1&var2=val2 For some reason, in our logs, we see a lot of requests to URLs of this format: http://our.site.com/https:/some.other.host.com/blah.aspx?var1=val1&var2=val2 We don't have any JavaScript that would request that URL; it only appears inside of a hyperlink. Is there some sort of known bot, browser plugin, bug, etc. that could be responsible for these requests being made?

    Read the article

  • Yandex frequently replaces page names with ampersands

    - by Guy
    The Yandex spider is a frequent visitor to one of the sites I manage. On occasion it replaces the page name with two ampersands and a space. So if the page is /mypage.aspx?param=value then it will try to crawl it as /&& ?param=value. Any idea why it is doing this? [EDIT] If I remember correctly, the IP that this "mistake" is coming from is based in California and not Russia. I believe that they crawl US sites from a US-based IP address. Not sure if that helps. More info about the request:

        IP: 199.21.99.82
        City: Palo Alto
        State: California
        Country: United States
        ISP: Yandex Inc.
        User-Agent: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

    Read the article

  • Uploading a non-finished website

    - by Daniel
    I have a pretty basic question. I developed a neat little website which I'm ready to upload, but it still needs a bit of work. The designer needs the HTML to do his work, so the website needs to be uploaded. Besides that, I have to correct a couple of details, do the friendly URLs, etc. What's the best way to set up the site on its definitive hosting, with the definitive domain, while blocking it to any unknown users and without affecting SEO and those kinds of things? If I were to just upload it, the non-definitive website might be crawled by a search engine bot. Thanks!
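
    The usual low-friction answer (a sketch assuming Apache; the .htpasswd path is a placeholder) is HTTP basic authentication on the docroot, which keeps out crawlers and strangers alike, and has no SEO side effects because nothing gets indexed in the first place:

        AuthType Basic
        AuthName "Under construction"
        AuthUserFile /home/user/.htpasswd
        Require valid-user

    Removing these four lines takes the site live. Unlike robots.txt, this also blocks bots that ignore the rules.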

    Read the article

  • How to hide download file from bots? [closed]

    - by CJ7
    Possible Duplicate: How to restrict the download of all files in a folder? I want to make a private file available for download, but without using username/password protection. I want to put the file into a directory called something like download. How can I ensure that: the file does not become part of search engine results, and the file cannot be accessed by bots that might guess the directory name?
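
    A robots.txt rule covers the first requirement (using the download directory name from the question):

        User-agent: *
        Disallow: /download/

    It does nothing for the second, though: robots.txt is advisory, so a bot that guesses the directory name can still fetch the file. An unguessable random filename, or a script that checks a token before serving the file, are the usual next steps short of real authentication.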

    Read the article

  • Is browser and bot whitelisting a practical approach?

    - by Sn3akyP3t3
    With blacklisting, it takes plenty of time to monitor events to uncover undesirable behavior and then take corrective action. I would like to avoid that daily drudgery if possible. I'm thinking whitelisting would be the answer, but I'm unsure if that is a wise approach, due to its deny-all, allow-only-a-few nature. My fear is that eventually someone out there will be blocked unintentionally. Even so, whitelisting would also block plenty of undesired traffic to pay-per-use items such as the Google Custom Search API, as well as preserve bandwidth and my sanity. I'm not running Apache, but the idea would be the same, I'm assuming. I would essentially be depending on the User-Agent identifier to determine who is allowed to visit. I've tried to take accessibility into account, because some web browsers are more geared toward those with disabilities, although I'm not aware of any specific ones at the moment. I fully understand the need not to depend on whitelisting alone to keep the site away from harm; other means to protect the site still need to be in place. I intend to have a honeypot, a checkbox CAPTCHA, use of OWASP ESAPI, and blacklisting of previously known bad IP addresses.
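
    Expressed in the Apache terms the question alludes to (a sketch only; the allow-list here is deliberately tiny and would need real curation), a User-Agent whitelist is typically a negated match that refuses everything else:

        RewriteEngine On
        RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot|Firefox|Chrome|Safari|Opera) [NC]
        RewriteRule .* - [F]

    The practical risk shows up immediately: any legitimate browser, screen reader, or feed reader whose User-Agent string lacks one of the listed tokens is locked out, which is exactly the unintentional-blocking fear raised above.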

    Read the article
