bingbot - Developer IT

Bingbot requests from Google IP address

- by JITHIN JOSE

We have some suspicious requests to our server, 74.125.186.46 - - [24/Aug/2014:23:24:11 -0500] "GET <url> HTTP/1.1" 200 16912 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 74.125.187.193 - - [24/Aug/2014:23:24:12 -0500] "GET <url> HTTP/1.1" 200 20119 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" As it shows, user-agent shows it is bingbot. But whois data of IP address(74.125.186.46 and 74.125.187.193) shows it is from google servers. So is it Google,Bing or any other content scrappers?

Read the article

Bingbot seems to be adding "ForceRecrawl: 0" to URLs when crawling my sites

- by Louis Somers

I'm seeing this in the iis-logs of two websites that I maintain: GET /an/existing/page/on/my/site+ForceRecrawl:+0 - 80 - 207.46.195.105 HTTP/1.1 Mozilla/5.0+(compatible;+bingbot/2.0;++http://www.bing.com/bingbot.htm) I get about one or two of these per day from these IP addresses: 207.46.195.105, 65.52.110.190.. an more, all belonging to msnbot-ip.search.msn.com Probably Microsoft has a bug in their crawler? Any way, doing a search on "ForceRecrawl: 0" in major search engines comes up with a bunch of random sites. Doing the search on StackOverflow or here gave no results (to my amazement). Am I the only one seeing this? I first noticed these on the 9th of this month, and I'm seeing them pass almost daily since... Another thing that I think is crazy, is that the URL http://www.bing.com/bingbot.htm redirects to mail.live.com (hotmail). Currently I'm returning 404's but I'm considering to catch these, strip the trailing " ForceRecrawl: 0" and process as if it were a legitimate url. Could anyone shed some light on this? Could it have to do with some configuration or so in Bing's Webmaster Tools?

Read the article

Why deny access to website for msnbot/bingbot?

- by Quandary

I've seen quite a lot of tutorials that recommend you to ban user agents containing the strings libwww-perl and msnbot. I understand why one would ban libwww-perl, it's mainly if not only used for hacking and spamming. But why are there so many sites recommending to ban msnbot/bingbot? Since it's a search engine, even if only with a marginal market share, I would except one would want this bot to crawl one's sites. What is it that msnbot does that makes people ban it?

Read the article

Will bing bot index pages with invalid SSL certificates?

- by Martin

Bingbot and Yahoo slurp do not support SNI(Server Name Indication when using SSL). Ignoring other workarounds (multi domain certificates, non-SSL content etc.), will Bingbot index pages that have an invalid SSL certificate, eg. issued for example.net, but used on example.com? If possible please provide an example from Yahoo or Bing. I have found websites in bing, that use self signed certificates and are indexed correctly, but what about invalid certificates?

Read the article

What bots are really worth letting onto a site?

- by blunders

Having written a number of bots, and seen the massive amounts of random bots that happen to crawl a site, I am wondering if the goal of the site allowing bots is for the potential for the bot to send real traffic back to the site if there is any reason to allow bots that are not known to be sending real traffic back, and how to spot these "good" bots; based on how they ID themselves, IPs they come from, behaviors, etc.

Read the article

Server overhead caused by bots?

- by giuseppe

I have one customer website causing overhead (http://www.modacalcio.it/en/by-kind/football-boots.html). With htop opened, I am trying navigate the website and the much load of the website is done by the ajax link being placed on the left side of the website. The website is hosted by a VPS with 3 proc and 2GB RAM, with enough hard with disk space. The real problem is that this website is new and not visited much. From the http-status module I am seeing that the overhead is caused by bots (Google bots, Bing bots, hrefs checker and so on). So I thought that's probably due to those spiders trying to crawl all those links at once - could this be causing this overhead? I have also put rel="nofollow" in those links, but this doesn't keep the bots away. Is there any way through code or Plesk to disable those links to those bots?

Read the article

Do web crawlers/spiders index azure web sites?

- by Clay Shannon

For somebody who wants their web site to be as discoverable as possible (and who doesn't?), are Microsoft's Azure web sites (azurewebsites.net) a feasible domain to host sites? I have a site that is both on an azurewebsites.net and hosted under a completely different name by discountasp.net Both of these sites are exactly the same, except for the URL; whenever I update the code, I republish the site to/in both places. So obviosuly, they both have the same H1 and H2 elements. Searching for the value/content in my H1 tag, I find my .com site listed #3 on google and #2 on both Bing and Yahoo; OTOH, my azurewebsites.net site doesn't show up on the first page at all, in any of them. This makes me wonder if azurewebsites.net should only be used for Web API hosting and such-like, not for generic/commercial "public" sites. Are my conclusions valid?

Read the article

Blocking 'good' bots in nginx with multiple conditions for certain off-limits URL's where humans can go

- by Glenn Plas

After 2 days of searching/trying/failing I decided to post this here, I haven't found any example of someone doing the same nor what I tried seems to be working OK. I'm trying to send a 403 to bots not respecting the robots.txt file (even after downloading it several times). Specifically Googlebot. It will support the following robots.txt definition. User-agent: * Disallow: /*/*/page/ The intent is to allow Google to browse whatever they can find on the site but return a 403 for the following type of request. Googlebot seems to keep on nesting these links eternally adding paging block after block: my_domain.com:80 - 66.x.67.x - - [25/Apr/2012:11:13:54 +0200] "GET /2011/06/ page/3/?/page/2//page/3//page/2//page/3//page/2//page/2//page/4//page/4//pag e/1/&wpmp_switcher=desktop HTTP/1.1" 403 135 "-" "Mozilla/5.0 (compatible; G ooglebot/2.1; +http://www.google.com/bot.html)" It's a wordpress site btw. I don't want those pages to show up, even though after the robots.txt info got through, they stopped for a while only to begin crawling again later. It just never stops .... I do want real people to see this. As you can see, google get a 403 but when I try this myself in a browser I get a 404 back. I want browsers to pass. root@my_domain:# nginx -V nginx version: nginx/1.2.0 I tried different approaches, using a map and plain old nono if's and they both act the same: (under http section) map $http_user_agent $is_bot { default 0; ~crawl|Googlebot|Slurp|spider|bingbot|tracker|click|parser|spider 1; } (under the server section) location ~ /(\d+)/(\d+)/page/ { if ($is_bot) { return 403; # Please respect the robots.txt file ! } } I recently had to polish up my Apache skills for a client where I did about the same thing like this : # Block real Engines , not respecting robots.txt but allowing correct calls to pass # Google RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ $compatible;\ Googlebot/2\.[01];\ \+http://www\.google\.com/bot\.html$$ [NC,OR] # Bing RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ $compatible;\ bingbot/2\.[01];\ \+http://www\.bing\.com/bingbot\.htm$$ [NC,OR] # msnbot RewriteCond %{HTTP_USER_AGENT} ^msnbot-media/1\.[01]\ $\+http://search\.msn\.com/msnbot\.htm$$ [NC,OR] # Slurp RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ $compatible;\ Yahoo!\ Slurp;\ http://help\.yahoo\.com/help/us/ysearch/slurp$$ [NC] # block all page searches, the rest may pass RewriteCond %{REQUEST_URI} ^(/[0-9]{4}/[0-9]{2}/page/) [OR] # or with the wpmp_switcher=mobile parameter set RewriteCond %{QUERY_STRING} wpmp_switcher=mobile # ISSUE 403 / SERVE ERRORDOCUMENT RewriteRule .* - [F,L] # End if match This does a bit more than I asked nginx to do but it's about the same principle, I'm having a hard time figuring this out for nginx. So my question would be, why would nginx serve my browser a 404 ? Why isn't it passing, The regex isn't matching for my UA: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.30 Safari/536.5" There are tons of example to block based on UA alone, and that's easy. It also looks like the matchin location is final, e.g. it's not 'falling' through for regular user, I'm pretty certain that this has some correlation with the 404 I get in the browser. As a cherry on top of things, I also want google to disregard the parameter wpmp_switcher=mobile , wpmp_switcher=desktop is fine but I just don't want the same content being crawled multiple times. Even though I ended up adding wpmp_switcher=mobile via the google webmaster tools pages (requiring me to sign up ....). that also stopped for a while but today they are back spidering the mobile sections. So in short, I need to find a way for nginx to enforce the robots.txt definitions. Can someone shell out a few minutes of their lives and push me in the right direction please ? I really appreciate ANY response that makes me think harder ;-)

Read the article

Is there a way to disallow only crawling in https in robots.txt?

- by David Wilkins

I just realized that Bingbot is crawling my company's website's pages over https. Bing already crawls the site over http, so this seems frivolous. Is there a way to specify Disallow: / for https only? According to Wikipedia, each protocol has its own robots.txt And according to Google's Robots.txt Specification, the robots.txt applies to http AND https I don't want to Disallow: / for Bing totally, just over https.

Read the article

Is there a way to disallow crawling of only HTTPS in robots.txt?

- by David Wilkins

I just realized that Bingbot is crawling my company's website's pages over https. Bing already crawls the site over http, so this seems frivolous. Is there a way to specify Disallow: / for https only? According to Wikipedia, each protocol has its own robots.txt And according to Google's Robots.txt Specification, the robots.txt applies to http AND https I don't want to Disallow: / for Bing totally, just over https.

Read the article

Can a client dictate whether or not HttpContext is created?

- by Keivan

We are getting a lot of hits from Googlebot and BingBot and it appears that none of these requests have an HttpContext. I originally thought that every http request will get a context which obviously is not the case so I'm trying to understand how does an HttpContext gets constructed, is it part of the negotiation between client and server?

Read the article

How can I avoid a 302 for Fetch as Bot?

- by CookieMonster

I originally posted this on Stackoverflow, but I believe here is a better place to ask. My web application is very similar to notepad.cc which redirects to a randomly generated URL upon access, e.g. http://myapp.com/roTr94h4Gd. (Please note that notepad.cc is not my site.) Probably because of this redirect feature, when I do "fetch as Google" or "fetch as Bingbot", I get a 302 and no html content. Not even a <html></html> tag. HTTP/1.1 302 Moved Temporarily Server: nginx/1.4.1 Date: Tue, 01 Oct 2013 04:37:37 GMT Content-Type: text/html Transfer-Encoding: chunked Connection: keep-alive X-Powered-By: PHP/5.4.17-1~dotdeb.1 Set-Cookie: PHPSESSID=vp99q5e5t5810e3bnnnvi6sfo2; expires=Thu, 03-Oct-2013 04:37:37 GMT; path=/ Expires: Thu, 19 Nov 1981 08:52:00 GMT Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Pragma: no-cache Location: /roTr94h4Gd How should I avoid 302 in this case? I suppose I could modify my site to prevent the redirect, but it is a necessary feature of my web app to generate a random URL on each access. I added <meta name="fragment" content="!"> tag into my index page and set it to return a static snapshot of my page when the flag is set. But this still returns a 302. I also added a header to return 200 before redirecting, but this had no effect, either. Could someone tell me a good suggestion to solve this problem?

Read the article

Is a 302 redirect to a random URL from the homepage an SEO problem?

- by CookieMonster

I originally posted this on Stackoverflow, but I believe here is a better place to ask. My web application is very similar to notepad.cc which redirects to a randomly generated URL upon access, e.g. http://myapp.com/roTr94h4Gd. (Please note that notepad.cc is not my site.) Probably because of this redirect feature, when I do "fetch as Google" or "fetch as Bingbot", I get a 302 and no html content. Not even a <html></html> tag. HTTP/1.1 302 Moved Temporarily Server: nginx/1.4.1 Date: Tue, 01 Oct 2013 04:37:37 GMT Content-Type: text/html Transfer-Encoding: chunked Connection: keep-alive X-Powered-By: PHP/5.4.17-1~dotdeb.1 Set-Cookie: PHPSESSID=vp99q5e5t5810e3bnnnvi6sfo2; expires=Thu, 03-Oct-2013 04:37:37 GMT; path=/ Expires: Thu, 19 Nov 1981 08:52:00 GMT Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Pragma: no-cache Location: /roTr94h4Gd How should I avoid 302 in this case? I suppose I could modify my site to prevent the redirect, but it is a necessary feature of my web app to generate a random URL on each access. I added <meta name="fragment" content="!"> tag into my index page and set it to return a static snapshot of my page when the flag is set. But this still returns a 302. I also added a header to return 200 before redirecting, but this had no effect, either. Could someone tell me a good suggestion to solve this problem?

Read the article

Why am I experiencing random connection timeouts? (CentOS)

- by Ryan

I have a CentOS server setup that currently hosts several websites (all relative of each other in some form or another). As of recently throughout the day at the most random times the website speed will lag to a crawl and eventually hit a connection timeout. When I say random times this typically happens anywhere between 10am and 1pm usually, however, this morning this happened to me at 8am. I do not have a lot of familiarity with server knowledge as far as what I am looking for in this situation. What are some possible causes of why my server is slowing the websites down to a complete crawl or timing out? Are there specific things I should be checking for when this happens? I have noticed using: tail /var/log/httpd/access_log That usually when this down time occurs there are lot of IP addresses related to BingBot, Googlebot, and sometimes various bots or spiders that I am unfamiliar with. Could this be related and if so how can I avoid this from causing my websites to lag out? Thanks in advance for any help or advice. The websites that are timing out are built with PHP and use a MySQL database to display information.

Search Results

Search found 14 results on 1 pages for 'bingbot'.

Page 1/1 | 1

- by JITHIN JOSE

- by Louis Somers

- by Quandary

- by Martin

- by blunders

- by giuseppe

- by Clay Shannon

- by Glenn Plas

- by David Wilkins

- by David Wilkins

- by Keivan

- by CookieMonster

- by CookieMonster

- by Ryan