Search Results

Search found 160 results on 7 pages for 'googlebot'.

Page 3/7 | < Previous Page | 1 2 3 4 5 6 7  | Next Page >

  • Do web crawlers/spiders index azure web sites?

    - by Clay Shannon
    For somebody who wants their web site to be as discoverable as possible (and who doesn't?), are Microsoft's Azure web sites (azurewebsites.net) a feasible domain to host sites? I have a site that is both on an azurewebsites.net and hosted under a completely different name by discountasp.net Both of these sites are exactly the same, except for the URL; whenever I update the code, I republish the site to/in both places. So obviosuly, they both have the same H1 and H2 elements. Searching for the value/content in my H1 tag, I find my .com site listed #3 on google and #2 on both Bing and Yahoo; OTOH, my azurewebsites.net site doesn't show up on the first page at all, in any of them. This makes me wonder if azurewebsites.net should only be used for Web API hosting and such-like, not for generic/commercial "public" sites. Are my conclusions valid?

    Read the article

  • HTTP 303 redirection and robots.txt

    - by Ian Dickinson
    On a site I'm working on, we're using the HTTP 303 redirect pattern (see this article for background) to distinguish between information and non-information resources. So: some URL's under /id get redirected to dynamically-created pages under /doc. These dynamic pages are built from a database, and contain links to other /doc/ resources, so in general we don't want them to be crawled. Our robots.txt contains: Disallow: /doc However, we do want the non-redirected pages under /id to get indexed by Google et al: Allow: /id So the question I have, which I can't find an answer to so far, is: if an allowed /id page 303-redirects to a /doc page, will it still be blocked by robots.txt? If yes, we're OK, but otherwise I'm going to disallow all /id resources in the robots file, as having the crawler hammer the db would be worse than losing search indexing for the /id pages.

    Read the article

  • When will my old page stop appearing on Google?

    - by Bane
    I recently bought a new address for my Blogger blog, from yannbane.blogspot.com to www.yannbane.com. However, www.yannbane.com addresses do not appear when they are searched for! Is this natural? How much time will it take for Google to update its index? yannbane.blogspot.com 301's to www.yannbane.com. Both are added to my Webmaster Tools account, but it shows no data for www.yannbane.com (strangely). And, finally, is there something I could do to speed up the process?

    Read the article

  • Foolproof way to ensure Google news pulls the correct image for it's thumbnails?

    - by Anthony
    Google news results have an acompanying thumbnail next to articles that show up in the results. If google's crawler can't find a thumbnail to pull from our site, it uses its next best guess from another site, therefore linking the image to another site but still uses our headline. Example: Headline from Reuters, Image from Livemint: Our pages absolutely have images, they are not massive in file-size or dimensions, yet we are not having them pulled / crawled correctly. We have read up on the suggestions from google, and from others around the web and nothing is panning out. Has anyone had any experience where they can ensure google news will pull a thumbnail of our choosing?

    Read the article

  • How do I make my hosting detect _escaped_fragment_ and fetch the corresponding HTML? [on hold]

    - by Eric
    I have an AJAX site and I'm using shebangs (#!) in my urls with the intention of then providing the correct HTML versions when google bots replace the #! with _escaped_fragment_. How do I go about routing/proxying/redirecting the url with _escaped_fragment_ to the corresponding html pages? I can't find documentation on this part of the process specifically, and my first thought was that I should be using a 301 or 302 redirect, but I was told that wasn't the case, albeit not given any more info.

    Read the article

  • Does Submit to Index on a page with new content update Content Keywords for the site?

    - by Dan Kanze
    Using Google Webmaster Tools I'm trying to update the Content Keywords of my site. I'm confused about the relationship between Submit to Index and Content Keywords Does Fetch as Google -- Submit to Index on a previously existing indexed page containing new content expidite updating the Content Keywords crawled by the real Google bot? Does Submit to Index only submit new URL's so that previously indexed URL's still point to the older cached version until Google crawls specifically for new content on its own? Does Submit to Index have anything to do with Content Keywords or crawling new content being a previously indexed page or never been indexed page?

    Read the article

  • How to disallow indexing but allow crawling?

    - by John Doe
    In the front page of my website, I have some previews to articles (with a small introduction to them) that link to the full articles. I want to disallow the front page to prevent duplicate content. But if I do this (in robots.txt), would it still be crawled? I mean, the full articles would be still reached by the crawler even though I disallowed the only page that links to them? I don't want the webcrawler not to access the page and enter the links in them, but I just don't want it to save the information (that will be repeated in the full articles).

    Read the article

  • Directing crawlers to content in language per language sub-domain

    - by Noam
    I have a site with multilingual website with many pages (40M). The site has UGC, and each translation is actually for the titles. Each sub-domain points to the same content with different titles per language. As far as I understand, each sub-domain should be indexed by search engines, meaning they will actually need to crawl 40M x supported-languages. So I thought it might be best to direct each subdomain crawler, to pages that are fully in that language (titles + UGC). Is there a way to do this? Should search engines understand this on their own?

    Read the article

  • How to find first cached date and time of a website?

    - by John Sanjay
    I found some of my webpage contents has been copied in other website. I need to check the cache date and time of both pages so that i can compare and guess which page is original and which one is duplicate according to Google. Because i know that if Google found two webpages with same content, it consider the first cached page by Google bot as unique. Is there any online tool for checking that? or other ways?

    Read the article

  • Is this Anti-Scraping technique viable with Crawl-Delay?

    - by skibulk
    I want to prevent web scrapers from abusing 1,000,000 on my website. I'd like to do this by returning a "503 Service Unavailable" error code for users that access an abnormal number of pages per minute. I don't want search engine spiders to ever receive the error. My inclination is to set a robots.txt crawl-delay which will ensure spiders access a number of pages per minute under my 503 threshold. Is this an appropriate solution? Do all major search engines support the directive? Could it negatively affect SEO? Are there any other solutions or recommendations?

    Read the article

  • Google SEO - Migrating website from a sub-directory of another website to its own domain name

    - by DarioP
    At the present moment I have a website hosted on example.com/myWebsite, where example.com hosts in its root directory a different website. I have the domain example.net, which redirects to example.com/myWebsite. The point, however, that right now when somebody accesses example.net they are redirected to example.com/myWebsite and consequently to example.com/myWebsite/dirA, example.com/myWebsite/dirB etc.. I am now thinking about upgrading my account so that example.net no longer redirects to example.com - I was however wondering, however, since Google shows results searches in terms of example.com/myWebsite, how would this affect my rankings?

    Read the article

  • Site returning 404 header to google, not sure why

    - by Damon
    A Drupal site that works fine for regular users returns a 404 not found error when I try to use the W3C validator on it; it is also not being indexed by google at all (which is the main issue but I suspect there is a connection). It is a https:// site with .htaccess rule to redirect any http:// request to the https://. I had had it running in google webmaster tools and thought it was fine, but it turns out I had not added the https domain. After adding the https domain it's also returning the header as HTTP/1.1 404 Not Found Date: Mon, 15 Oct 2012 19:37:43 GMT Server: Apache Expires: Sun, 19 Nov 1978 05:00:00 GMT Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0 Robots.txt just has User-agent: * Crawl-delay: 10 # Files Disallow: /cron.php How can I check what the issue is here?

    Read the article

  • When Google gives up recrawling 301 that led to 404?

    - by Easy Life
    I've transferred a domain and made a mistake in the redirects (the URL structure is identical). Even though they went to the new domain, the error caused a 404 when crawled by Google bot. 10 days after I saw and corrected my redirect mistake, and now the site should (hopefully) redirect to proper pages. Q1: The URLs of the 404 pages in the Webmaster Tools all bear the mistake and will never be available at the new site. I marked them as fixed in the tools. Do I need to do something about that, like 301 rewrite them with a condition to fix the error? Q2: Does Google bot attempt to recrawl 301 pages that pointed to a 404?

    Read the article

  • In addition to Google's First Flick Free, should you whitelist search engine bots past a paywall?

    - by tobek
    Our site has subscription-only pages - non-subscribed visitors see a snippet preview. As per Google's FCF requirements, your first 5 hits to a subscriber-only pages with .google. as the referrer, you see the full page. In addition to this, should we whitelist search engine bots so that they can index the full content? I assume this is not required for Google, which can use FCF to index our content, but what about other search engines? Is this considered cloaking? My gut says that whitelisting bots past the paywall is bad practice., but I wanted to confirm - any evidence or references would be amazing.

    Read the article

  • In addition to Google's First Click Free, should you whitelist search engine bots past a paywall?

    - by tobek
    Our site has subscription-only pages - non-subscribed visitors see a snippet preview. As per Google's FCF requirements, your first 5 hits to a subscriber-only pages with .google. as the referrer, you see the full page. In addition to this, should we whitelist search engine bots so that they can index the full content? I assume this is not required for Google, which can use FCF to index our content, but what about other search engines? Is this considered cloaking? My gut says that whitelisting bots past the paywall is bad practice., but I wanted to confirm - any evidence or references would be amazing.

    Read the article

  • Will news ticker using overflow:hidden cause Google to see site as spam?

    - by molipix
    In the hope of tempting Googlebot with fresh content, I've implemented a homepage news ticker which displays the 20 most recent headlines on our site. The implementation I have chosen is a <ul> with each headline being a <li> Initially all the <li> elements have no style but Javascript kicks in on page load and gives all but one of them a display="style:none" attribute. Javascript then displays each of the other 19 headlines in a loop. So far so good. However, in order to prevent a visually unplesant page load where the 20 items display and then immediately collapse, I am using overflow:hidden on the <ul> element. Anyone got a view on what Googlebot is likely to make of this? Does the fact that I'm using overflow:hidden make the content look like spam?

    Read the article

  • apache-memory-hacker-linux

    - by bibhudatta
    When we start the linux system it take only 435mb memory and it is 4GB memory server. When we start the httpd services it take 1000mb and outmatically it take all the memory and the server crase. even we stop the apache just it release 200mb memory. What will be the problem Can any one tell me what these hacker are doing. I see they are goinging some hit to my apache by some but I thing they are doing from this system. Below is the log. Please help me out for this. [root@host ~]# tail -20 /var/log/httpd/dostizone.com-combined.log 180.76.5.143 - - [14/Nov/2011:02:30:16 +0530] "GET /blogs/10248/209403/nfl-panties-since-the-quality-of HTTP/1.1" 403 2298 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" 180.76.5.88 - - [14/Nov/2011:02:30:31 +0530] "GET /blogs/815/158725/new-jersey-attorney-search HTTP/1.1" 403 2290 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" 220.181.108.186 - - [14/Nov/2011:02:30:32 +0530] "GET / HTTP/1.1" 403 5043 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" crawl-66-249-67-137.googlebot.com - - [14/Nov/2011:02:30:20 +0530] "GET /blogs/805/11279/supra-suprano-high-shoes HTTP/1.1" 200 30642 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" crawl-66-249-68-51.googlebot.com - - [14/Nov/2011:02:30:37 +0530] "GET /blogs/10514/215084/oakland-raiders-sweatpants-tags HTTP/1.1" 403 2297 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 220.181.94.237 - - [14/Nov/2011:02:30:12 +0530] "GET /profile/8509 HTTP/1.1" 200 236894 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)" 220.181.94.237 - - [14/Nov/2011:02:30:43 +0530] "GET /mode-switch?return_url=%2Fblogs%2F8529%2F160217%2Fclimate-jordan-6 HTTP/1.1" 302 1 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)" crawl-66-249-68-51.googlebot.com - - [14/Nov/2011:02:30:44 +0530] "GET /blogs/390/61573/blackhawk-jerseys-from-the-you HTTP/1.1" 403 2293 "-" "SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)" 124.115.0.159 - - [14/Nov/2011:02:30:24 +0530] "GET /blogs/693/46081/application/modules/Hecore/externals/scripts/core.js HTTP/1.1" 200 26869 "http://dostizone.com/blogs/693/46081/thomas-sabo-charms-hot-chilli" "Sosospider+(+http://help.soso.com/webspider.htm)" 124.115.0.159 - - [14/Nov/2011:02:30:24 +0530] "GET /blogs/693/46081/application/modules/Activity/externals/scripts/core.js HTTP/1.1" 200 26873 "http://dostizone.com/blogs/693/46081/thomas-sabo-charms-hot-chilli" "Sosospider+(+http://help.soso.com/webspider.htm)" 124.115.0.159 - - [14/Nov/2011:02:30:24 +0530] "GET /blogs/693/46081/application/modules/Hecore/externals/scripts/imagezoom/core.js HTTP/1.1" 200 26899 "http://dostizone.com/blogs/693/46081/thomas-sabo-charms-hot-chilli" "Sosospider+(+http://help.soso.com/webspider.htm)" 180.76.5.153 - - [14/Nov/2011:02:30:50 +0530] "GET /blogs/10252/212268/cleveland-browns-authentic-jerse HTTP/1.1" 403 2298 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" crawl-66-249-68-51.googlebot.com - - [14/Nov/2011:02:30:51 +0530] "GET /blogs/741/46260/chocolate-ugg-women-boots-1873 HTTP/1.1" 403 2293 "-" "SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)" 124.115.1.7 - - [14/Nov/2011:02:30:40 +0530] "GET /blogs/682/97454/swarovski-jewellry-sale-articles HTTP/1.1" 200 25770 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" crawl-66-249-68-51.googlebot.com - - [14/Nov/2011:02:30:56 +0530] "GET /blogs/779/60941/players-a-to-z-michael-cuddyer HTTP/1.1" 403 2293 "-" "SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)" crawl-66-249-68-51.googlebot.com - - [14/Nov/2011:02:31:01 +0530] "GET /blogs/469/58551/chicago-bears-news-there-exist HTTP/1.1" 403 2293 "-" "SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)" 220.181.94.237 - - [14/Nov/2011:02:30:54 +0530] "GET /blogs/8529/160217/climate-jordan-6 HTTP/1.1" 200 30750 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)" 180.76.5.59 - - [14/Nov/2011:02:31:05 +0530] "GET /blogs/815/158197/cheap-calgary-flames-jerseys HTTP/1.1" 403 2292 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" crawl-66-249-68-51.googlebot.com - - [14/Nov/2011:02:31:06 +0530] "GET /mode-switch?return_url=%2Fblogs%2F387%2F45679%2Fhandbag-louis-vuitton-judy-mm-m4 HTTP/1.1" 403 2258 "-" "SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)" crawl-66-249-67-137.googlebot.com - - [14/Nov/2011:02:31:10 +0530] "GET /public/temporary/c83b731ecc556d7fd1a7732d9ac16ed6.png HTTP/1.1" 404 2305 "-" "Googlebot-Image/1

    Read the article

  • Detecting well behaved / well known bots

    - by Simon_Weaver
    I found this question very interesting : Programmatic Bot Detection I have a very similar question, but I'm not bothered about 'badly behaved bots'. I am tracking (in addition to google analytics) the following per visit : Entry URL Referer UserAgent Adwords (by means of query string) Whether or not the user made a purchase etc. The problem is that to calculate any kind of conversion rate I'm ending up with lots of 'bot' visits that are greatly skewing my results. I'd like to ignore as many as possible bot visits, but I want a solution that I don't need to monitor too closely, and that won't in itself be a performance hog and preferably still work if someone has javascript disabled. Are there good published lists of the top 100 bots or so? I did find a list at http://www.user-agents.org/ but that appears to contain hundreds if not thousands of bots. I don't want to check every referer against thousands of links. Here is the current googlebot UserAgent. How often does it change? Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

    Read the article

  • How to Fix this specific Google "Fetch as Googlebot" error appearing on my Webmaster Tools?

    - by UXdesigner
    Good day, I'm currently finding out why I have lost all of my website's rank in google. I don't even appear in google results by the domain. But other sites do link me and they appear in the google results. I think it's all about leaving my site two months alone and finding out I had 20k in comment spam, which I completely deleted and fixed with filters and adding a new Disqus comment service. Thing is, I added my site to Google Webmaster Tools and I'm finding out several awful things. For example, when I click in Google Fetch As GoogleBot. I receive this error message below in response to my request. And I don't even know what's the real problem and how to fix it. I simply don't get it. This is what appears: Date: Wednesday, July 20, 2011 9:43:35 AM PDT Googlebot Type: Web Download Time (in milliseconds): 55 HTTP/1.1 403 Forbidden Date: Wed, 20 Jul 2011 16:43:36 GMT Server: Apache Vary: Accept-Encoding Content-Encoding: gzip Content-Length: 248 Keep-Alive: timeout=2, max=100 Connection: Keep-Alive Content-Type: text/html; charset=iso-8859-1 403 Forbidden Forbidden You don't have permission to access / on this server. Additionally, a 403 Forbidden error was encountered while trying to use an ErrorDocument to handle the request. Do you guys know anything about this problem ? I need to have Google crawl my site again. I used to have a really nice google result in the past three years. Now, there's nothing. thanks,

    Read the article

  • SEO: does google bot see text in hidden divs

    - by Alexey
    I have login/signup popups on my site which are in hidden div by default. According to http://stackoverflow.com/questions/1547426/google-seo-and-hidden-elements googlebot should NOT see it. But Google Webmaster tool says that keywords "email" and "password" are top keywords over the site. Why it is so? Why google bot sees them? Should I worry about relevancy of top keywords at all?

    Read the article

  • Google bots are severely affecting site performance

    - by Lynn
    I have an aggregate site on a linux server that pulls in feeds from a universe of about 2,000 blogs. It's in Wordpress 3.4.2 and I have a cron job that is staggered to run five times an hour on another server to pull in the stories and then publish them to the front page of this site. This is so I didn't put too much pressure all on one server. However, the Google bots, which visit a few times every hour bring the server to its knees in the morning and evenings when there is an increase in traffic on the site. The bots have something like 30,000 links to follow at this point. How do I throttle the bots to simply grab the new stories off the front page and stop there? EDIT- Details of my server configuration: The way we have this set up is the server that handles all the publishing is an unmanaged instance via AWS. It mounts the NFS server and connects to the RDS to update content, etc. You get to this publishing instance via a plugin that detects the wp-admin link and then redirects you into there. The front end app server also mounts the NFS and requests data from the RDS. It is the only one that has the WP Super Cache on it.... The OS is Ubuntu on the App server and the NFS runs CentOs. The front end is Nginx and the publishing server is Apache.

    Read the article

  • How to return proper 404 for google while providing user friendly content to the user?

    - by Marek
    I am bouncing between posting this here and on Superuser. Please excuse me if you feel this does not belong here. I am observing the behavior described here - Googlebot is requesting random urls on my site, like aecgeqfx.html or sutwjemebk.html. I am sure that I am not linking these urls from anywhere on my site. I suspect this may be google probing how we handle non existent content - to cite from an answer to the linked question: [google is requesting random urls to] see if your site correctly handles non-existent files (by returning a 404 response header) We have a custom page for nonexistent content - a styled page saying "Content not found, if you believe you got here by error, please contact us", with a few internal links, served (naturally) with a 200 OK. The URL is served directly (no redirection to a single url). I am afraid this may discriminate the site at google - they may not interpret the user friendly page as a 404 - not found and may think we are trying to fake something and provide duplicate content. How should I proceed to ensure that google will not think the site is bogus while providing user friendly message to users in case they click on dead links by accident?

    Read the article

  • Why can't GoogleBot forget a 404 File Does Not Exist?

    - by Sam
    Hi folks, For two months now, googlebot is trying to get a file which does not exist anymore. This is just one example out of many: I had renamed the file to a better name and removed the old file. Now, why does Google insist on getting a file which it already has seen does not exist for months? doesn't it just give up and get on being a happy bot? My Error log file is filled with these repeating lines all wanting to get that one file, although they know for already long time that its not there: What to do in these situations? Are there automatic redirection rules that handle via 301 to the home page? [Sat Mar 05 01:55:41 2011] [error] [client 66.249.66.177] File does not exist: /var/www/vhosts/website.org/httpdocs/extraNeus.php [Sat Mar 05 01:58:20 2011] [error] [client 66.249.66.177] File does not exist: /var/www/vhosts/website.org/httpdocs/extraNeus.php [Sat Mar 05 02:03:57 2011] [error] [client 66.249.66.177] File does not exist: /var/www/vhosts/website.org/httpdocs/extraNeus.php on and on ... and on...

    Read the article

< Previous Page | 1 2 3 4 5 6 7  | Next Page >