Search Results

Search found 287 results on 12 pages for 'crawling pasta hellion'.

Page 7/12 | < Previous Page | 3 4 5 6 7 8 9 10 11 12  | Next Page >

  • Google indexed our pages a day ago and they appeared in search, but today everything vanished

    - by ganesh
    We had a robots.txt that disallowed all robots while we were in development. We are live now. A day ago we changed robots.txt to suit our requirements and submitted the site for indexing through the index status page in Google Webmaster Tools. After this we could see proper results in search, and Google Images search was working as expected. Suddenly today all of this vanished from Google Search, and I again see the old result, i.e. the under-construction message. I checked robots.txt in Google Webmaster Tools and it's OK, with no crawling errors. Kindly let me know what exactly happened. How can I report this issue to Google?

  • Which token from a long User-Agent should I use in robots.txt?

    - by Gaia
    The definition of User-Agent states that several tokens can be included, as deemed necessary by the client. I want to block certain bots via robots.txt and I am confused as to which part of the User-Agent string to use, especially for more obscure bots. For example:

      Mozilla/5.0 (compatible; uMBot-LN/1.0; mailto: [email protected])"
      JS-Kit URL Resolver, http://js-kit.com/
      Mozilla/5.0 (compatible; SEOkicks-Robot +http://www.seokicks.de/robot.html

    Do I use the second token? Can tokens contain spaces, or did the SEOkicks folks forget a semicolon after SEOkicks-Robot? I don't intend to make my question specific to a couple of bots. I want to know the guideline: which part of the UA do I place in robots.txt for these exotic bots with a UA as long as a haiku?

      User-agent: uMBot-LN/1.0
      Disallow: /

    PS: Thank you, but I do not need to hear that undesirable bots are better blocked with mod_security. I already have commercial mod_sec rules in place.
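
One way to sanity-check which token a rule matches is Python's standard `urllib.robotparser`. This is only a sketch: it assumes the crawler matches on its product token without the version suffix, which is what the original robots.txt guidance recommends, and it says nothing about how uMBot-LN or SEOkicks actually read robots.txt. The `example.com` URL and the `is_allowed` helper are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the rule names the product token only,
# without the "/1.0" version or the rest of the User-Agent string.
ROBOTS_TXT = """\
User-agent: uMBot-LN
Disallow: /
"""

def is_allowed(product_token: str, url: str = "http://example.com/page") -> bool:
    """Return True if the robots.txt above lets this product token fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(product_token, url)

blocked = is_allowed("uMBot-LN/1.0")   # token taken from the long User-Agent
other = is_allowed("SEOkicks-Robot")   # no rule names this bot
```

Here `blocked` is `False` because the parser compares the part of the token before the slash, while `other` stays `True` because no rule applies to it.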

  • Doubt regarding search engine/plugin(One present on the website itself)

    - by Ravi Gupta
    I am new to web development and am studying various types of websites as case studies. Right now my focus is on how search engines work for an eCommerce website. I know the basic functioning of a search engine, i.e. crawl web pages, index them, and then display the results using those indexes. But I got a little confused in the case of an eCommerce website. Wouldn't it be better if the search engine, instead of crawling the web pages containing products, directly crawled the database and indexed the products stored there? Then, when a user searches for a product, it would simply return the rows of the table that match the user's query. If this is not the case, can someone please explain how the usual method works on an eCommerce website?
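
The idea in the question, indexing product rows directly rather than crawled pages, can be sketched as a small inverted index. The `PRODUCTS` rows and the `build_index` and `search` helpers are hypothetical stand-ins for a real database table and search layer, not how any particular shop does it.

```python
from collections import defaultdict

# Hypothetical product rows, standing in for a database table.
PRODUCTS = [
    {"id": 1, "name": "Red running shoes"},
    {"id": 2, "name": "Blue running jacket"},
    {"id": 3, "name": "Red wool scarf"},
]

def build_index(rows):
    """Map each lowercase word to the set of product ids that contain it."""
    index = defaultdict(set)
    for row in rows:
        for word in row["name"].lower().split():
            index[word].add(row["id"])
    return index

def search(index, query):
    """Return the sorted ids of products matching every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)

index = build_index(PRODUCTS)
red_items = search(index, "red")
red_running = search(index, "red running")
```

The lookup returns row ids, so the matching rows can then be read from the table itself.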

  • Does Submit to Index on a page with new content update Content Keywords for the site?

    - by Dan Kanze
    Using Google Webmaster Tools, I'm trying to update the Content Keywords of my site. I'm confused about the relationship between Submit to Index and Content Keywords. Does Fetch as Google -- Submit to Index on a previously indexed page containing new content expedite updating the Content Keywords crawled by the real Googlebot? Does Submit to Index only submit new URLs, so that previously indexed URLs still point to the older cached version until Google crawls for new content on its own? Does Submit to Index have anything to do with Content Keywords or with crawling new content, whether the page was previously indexed or never indexed?

  • How to allow Google Images search to bypass hotlink protection?

    - by Marco Demaio
    I noticed that Google Images seems to index my images only if hotlink protection is off. I use hotlink protection anyway because I don't like the idea of people sucking my bandwidth. I simply use this code to protect my sites from being hotlinked:

      RewriteEngine on
      RewriteCond %{HTTP_REFERER} !^$
      RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?mydomain\.com/.*$ [NC]
      RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?mydomain\.com$ [NC]
      RewriteRule .*\.(jpg|jpeg|png|gif)$ - [F,NC,L]

    To allow Google Images search to bypass my hotlink protection (I want Google Images search to show my images), would it suffice to add lines like these?

      RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?google\.com/.*$ [NC]
      RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?google\.com$ [NC]

    I'm wondering: is the crawler crawling just from google.com? What about google.it, google.co.uk, etc.? FYI: I did not find information about this in Google's official guidelines. I suppose hotlink protection prevents Google Images from showing images in its results, because I did some tests and it does seem to keep my images out of Google Images search.
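
To see what a broader referer pattern would accept before turning it into a RewriteCond, the check can be prototyped in Python. This is a sketch under assumptions: the `GOOGLE` pattern is a guess at the shape of Google's country domains (google.com, google.it, google.co.uk), `mydomain.com` is the placeholder from the question, and whether Google Images requests actually carry such a Referer is not established here.

```python
import re

# The site's own domain, using the placeholder from the question.
OWN_SITE = re.compile(r"^https?://(www\.)?mydomain\.com(/.*)?$", re.IGNORECASE)

# Assumption: Google referers look like google.com, google.it, google.co.uk,
# optionally with subdomains such as www. or images.
GOOGLE = re.compile(
    r"^https?://([a-z0-9-]+\.)*google\.[a-z]{2,3}(\.[a-z]{2})?(/.*)?$",
    re.IGNORECASE,
)

def image_request_allowed(referer: str) -> bool:
    """Allow empty referers, the site itself, and Google country domains."""
    if referer == "":
        return True
    return bool(OWN_SITE.match(referer) or GOOGLE.match(referer))
```

Anchoring the pattern at both ends matters: it keeps a host such as `google.com.evil.example` from slipping through.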

  • SSL Certificate

    - by outdoorcat
    I've received the email below from Google about my WordPress site and have no idea how to follow the instructions. Any help out there? Dear Webmaster, The host name of your site, https://www.example.com/, does not match any of the "Subject Names" in your SSL certificate, which were: *.wordpress.com, wordpress.com. This will cause many web browsers to block users from accessing your site, or to display a security warning message when your site is accessed. To correct this problem, please get a new SSL certificate from a Certificate Authority (CA) with a "Subject Name" or "Subject Alternative DNS Names" that matches your host name. Thanks, The Google Web-Crawling Team
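
The mismatch the email describes can be illustrated with a simplified name check. This is a sketch, not the full certificate validation that browsers perform: it assumes a wildcard covers exactly one leftmost label, and `name_matches` and `host_covered` are hypothetical helpers.

```python
def name_matches(pattern: str, hostname: str) -> bool:
    """Simplified check: exact match, or '*.' covering one leftmost label."""
    pattern, hostname = pattern.lower(), hostname.lower()
    if pattern.startswith("*."):
        # Drop the first label of the host and compare the remainder.
        _, _, parent = hostname.partition(".")
        return parent != "" and parent == pattern[2:]
    return pattern == hostname

# Subject names quoted in the email.
SUBJECT_NAMES = ["*.wordpress.com", "wordpress.com"]

def host_covered(hostname: str) -> bool:
    """Return True if any certificate name covers the hostname."""
    return any(name_matches(name, hostname) for name in SUBJECT_NAMES)
```

Under this check `www.example.com` is not covered by either name, which is the situation the email reports, while a host such as `myblog.wordpress.com` would be.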

  • Should I prevent search engines indexing tag/category pages?

    - by Macha
    On my site, I currently have no special rules for search engines. It is a blog, statically generated using a Python program. When I search for some of my articles on Google, there is usually a tag or category page included in the results. Sometimes it even ranks ahead of the article itself. Obviously, as these pages aren't always going to have the article on them, these aren't the results I want people to click on. So I'm thinking of setting noindex on these pages. Is there any possible downside to doing so? Is this possible via robots.txt, or do I have to add it to all the relevant templates? All I can find for robots.txt are ways to stop the search engine from crawling those pages, which isn't what I want: while I don't want them indexed, they are still the only surefire way to find all my blog posts.
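
Since the blog is generated by a Python program, one option is to have the templates emit a robots meta tag for listing pages. This is a minimal sketch: `robots_meta` and the `"tag"` and `"category"` page kinds are hypothetical names, not part of any particular generator.

```python
def robots_meta(page_kind: str) -> str:
    """Return a robots meta tag for listing pages, or '' for everything else."""
    # "tag" and "category" are hypothetical page kinds used by this sketch.
    if page_kind in {"tag", "category"}:
        # noindex asks engines to keep the page out of results, while
        # follow lets them keep using its links to discover the posts.
        return '<meta name="robots" content="noindex, follow">'
    return ""

tag_head = robots_meta("tag")
post_head = robots_meta("post")
```

The tag goes in the page's `<head>`. The pages have to remain crawlable for the directive to be seen, which is why a robots.txt Disallow does not achieve the same thing.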

  • How to identify whether the client is a search robot?

    - by Yau Leung
    I have built my entire site using AJAX (in fact it's GWT). I have also implemented the AJAX crawling scheme proposed by Google. However, after the implementation, I found that neither Yahoo, Bing, nor Baidu implemented that scheme! I'm wondering if there is a way to identify whether the web client is a search robot. If it is, it will be shown the HTML snapshot I created. It would be best if I could identify robots at the Apache level, so that I can just do a mod_rewrite. But it's still OK if I can do that in PHP or GWT.
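
A common heuristic is to look for crawler names in the User-Agent header. The sketch below is written in Python rather than the Apache, PHP, or GWT setups mentioned in the question; the token list is illustrative rather than exhaustive, and the header can be spoofed, so this identifies clients that claim to be robots rather than proving it.

```python
import re

# Substrings that Google, Bing, Yahoo and Baidu crawlers are known to put in
# their User-Agent header. Illustrative only; real lists are longer.
BOT_PATTERN = re.compile(r"googlebot|bingbot|slurp|baiduspider", re.IGNORECASE)

def is_search_robot(user_agent: str) -> bool:
    """Return True if the User-Agent header names a known search crawler."""
    return bool(BOT_PATTERN.search(user_agent or ""))

bing = is_search_robot(
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
)
browser = is_search_robot(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)
```

The same alternation could serve as the pattern of a RewriteCond on `%{HTTP_USER_AGENT}` if the check is done with mod_rewrite.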

  • How could I manage Google Adsense to approve my Web App? It keeps denying it

    - by Javierfdr
    Google AdSense keeps denying my app from having ads because of an "insufficient content" issue. I manage a web application that allows users to set YouTube videos as alarm clocks. It includes an in-site YouTube search to retrieve videos from user queries and lists the user's alarms. The site has good traffic (500 users per day), is currently promoted by Google in the Google Chrome Web Store, and the AJAX requests are crawlable, following Google's guidelines (https://developers.google.com/webmasters/ajax-crawling/). Although I understand there is not much content beyond what is user-generated, I really don't know what else I should include in the site. Perhaps adding contact and about pages, and maybe another section, would increase the navigation. Google argues I need a "fully launched and functioning site, allowing users to navigate throughout your site with a menu, sitemap, or appropriate links". They also ask for "full sentences or paragraphs". Isn't there a Google AdSense solution for web applications? Would all web apps have to include useless navigable subpages?

  • Why isn't Google updating my site title in search results? [closed]

    - by SharkTheDark
    Possible Duplicate: Google doesn't seem to update the description or title of my homepage. I had my domain for a few days before I uploaded the site to it, and it had one title. When I uploaded content it should have got a new title, but because I misunderstood WordPress, it blocked robots in robots.txt and set the no-index and no-follow keywords. I removed that about 7 days ago, and I see in reports that Googlebot is crawling my site, but my site title isn't updating. It still shows the old domain title from when the site wasn't there. My robots.txt now has: User-agent: * Allow: / I have a clear title tag on every page. How long does it take to update? Do I need to check something else?

  • Where would you start if you were trying to solve this PDF classification problem?

    - by burtonic
    We are crawling and downloading lots of companies' PDFs and trying to pick out the ones that are Annual Reports. Such reports can be downloaded from most companies' investor-relations pages. The PDFs are scanned and the database is populated with, among other things, the: title, contents (full text), page count, word count, orientation, and first line. Using this data we are checking for obvious phrases such as: annual report, financial statement, quarterly report, interim report. We then record the frequency of these phrases and others. So far we have around 350,000 PDFs to scan and a training set of 4,000 documents that have been manually classified as either a report or not. We are experimenting with a number of different approaches, including Bayesian classifiers and weighting the different factors available. We are building the classifier in Ruby. My question is: if you were thinking about this problem, where would you start?
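
The phrase-frequency step described above can be sketched as a small feature extractor. The question says the real classifier is being built in Ruby; this is a Python illustration of the counting step only, with `PHRASES` taken from the list in the question and `phrase_counts` a hypothetical helper. Feeding the counts into a Bayesian classifier is left out.

```python
import re
from collections import Counter

# Phrases listed in the question.
PHRASES = [
    "annual report",
    "financial statement",
    "quarterly report",
    "interim report",
]

def phrase_counts(text: str) -> Counter:
    """Count case-insensitive occurrences of each key phrase in the text."""
    # Collapse whitespace so a phrase split across a line break still matches.
    normalized = re.sub(r"\s+", " ", text.lower())
    return Counter({phrase: normalized.count(phrase) for phrase in PHRASES})

sample = "Annual Report 2012. This annual\nreport includes the financial statement."
counts = phrase_counts(sample)
```

Counts like these, together with the page count, word count, and first line, could form the feature vector for each of the 4,000 labeled documents.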

  • Getting a lot of '/_' errors from webmaster tools

    - by Vermino
    I'm using a WordPress site and I thought I had got all the kinks out of it. For some reason Webmaster Tools is crawling my website and showing a lot of 404 errors for /_ URLs, apparently additional pages that I've never created. I just can't figure out what is creating these for Google's crawlers and then displaying a 404. My robots.txt is here. My sitemap (created by the Yoast plugin) is here. I have the Yoast and Jetpack plugins installed. What could be causing these links to appear?

  • Another website is mirroring and ranks above my site in search results

    - by Marlboro Goodluck
    There is a site of ill repute known as thedirty which has completely mirrored my site and now has links appearing on Google at the #1 spot using my content. I checked my log files and noticed that this site has been crawling mine for some time, and also has 10,000 links from their site to mine. I have blocked user access that is referred from this site and have already reported them to Google as web spam. I also disavowed the domain. How are they getting top links in Google (even overtaking mine) with such nefarious tactics? What are the steps to completely eliminate an issue such as this?

  • Duplicate page content and the Google index

    - by Kit Sunde
    I have static pages with dynamically expanding content that Google is indexing. I also have deep links into virtually duplicate pages, each of which pre-expands the relevant section of content. It seems like Google is ignoring all my specialized pages and not putting them in the index, even after going through Webmaster Tools, crawling them, and submitting them to the index manually. I also use the Google API for integrating search on the site, and the deep-linked pages won't show up. Is there a good solution for this?

  • Another website is mirroring my site

    - by Marlboro Goodluck
    Question for you all. There is a site of ill repute known as thedirty which has completely mirrored my site and now has links appearing on Google at the #1 spot using my content. I checked my log file and noticed that this site has been crawling mine for some time, and also has 10k links from their site to mine. I have blocked user access that is referred from this site and have already reported them to Google as web spam. I also disavowed the domain. How are they getting top links in Google (even overtaking mine) with such nefarious tactics? What are the steps to completely eliminate an issue such as this?

  • Google indexing pages with #! although we don't have any

    - by Benjamin Gruenbaum
    Our company has developed a Single Page Application using AngularJS and its routing. Google indexed our site decently with JavaScript, but it did not index some pages very well, so we have developed an HTML-only version. We have followed the Ajax Crawling Specification posted here and have a <meta name='fragment' content='!'> tag and canonical URLs. We expect http://www.example.com/foo/bar to be fetched from http://www.example.com/?_escaped_fragment_=/foo/bar. However, we have found that since we rolled out the AJAX specification, all pages are indexed twice: once with the JavaScript version as http://www.example.com/foo/bar and once with the new version as http://www.example.com/#!/foo/bar. This is harmful to us since it's duplicate content and also misrepresents our site. I have tried looking for similar questions here and in the Google product forum but could not come up with anything.
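
For reference, the mapping from a `#!` URL to its `_escaped_fragment_` form can be sketched as below. This is a simplified illustration: the escaping uses `urllib.parse.quote` and only approximates the character list in the specification, and `escaped_fragment_url` is a hypothetical helper. It covers hashbang URLs only, not pages that opt in through the meta tag alone.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def escaped_fragment_url(url: str) -> str:
    """Rewrite a '#!' URL into its '_escaped_fragment_' form."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # not a hashbang URL, leave it alone
    # Everything after '#!' becomes the parameter value; '/' is kept as is.
    value = quote(parts.fragment[1:], safe="/")
    param = "_escaped_fragment_=" + value
    query = parts.query + "&" + param if parts.query else param
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

snapshot_url = escaped_fragment_url("http://www.example.com/#!/foo/bar")
```

For the hashbang URL in the question, `snapshot_url` is the `?_escaped_fragment_=/foo/bar` address quoted there.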

  • Xpath Injection detection Tool

    - by preeti
    Hi, I am working on XPath injection attacks and want to build a tool to detect XPath injection in a website. Can web crawling and scanning be used for this? What would the detection logic be? Are there any open source tools that detect it, so that I can study the logic in their code and develop my own in Java? Thank you.

  • WebCrawling Dynamic Links

    - by Jojo
    Hi everyone, does anybody have any idea how to crawl websites that have dynamic pages/queries? I mean that if I click a certain link, it has different values every time I reload it in a web browser. My web crawler cannot download the contents of these pages. Please advise.

  • How to write a crawler?

    - by Jason
    Hi all, I have had thoughts of trying to write a simple crawler that would crawl our NPO's websites and content and produce a list of its findings. Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does it send back its findings and still keep crawling? How does it know what it finds, etc.? Thanks! -Jason
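
The usual shape of a simple crawler is a queue of URLs to visit plus a set of URLs already seen. The sketch below uses only the standard library and takes the page fetcher as a parameter, so it runs here against a hypothetical in-memory `SITE` instead of the network. The `npo.example` URLs, `LinkParser`, and `crawl` are illustrative names.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    seen = {start_url}
    queue = deque([start_url])
    found = []
    while queue and len(found) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page missing or not fetchable
        found.append(url)  # record the finding, then keep crawling
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return found

# Hypothetical in-memory site so the sketch runs without network access.
SITE = {
    "http://npo.example/": '<a href="/about">About</a> <a href="/news">News</a>',
    "http://npo.example/about": '<a href="/">Home</a>',
    "http://npo.example/news": '<a href="/about">About</a>',
}
pages = crawl("http://npo.example/", SITE.get)
```

The start URL answers "where do you point it", the `found` list is the report of findings, and the `seen` set is what stops it from revisiting pages. A real crawler would replace `SITE.get` with an HTTP fetch and add politeness rules such as honoring robots.txt.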

  • How to retrieve Directories size including all sub-directories?

    - by vikingosegundo
    I have stored images from the net like this:

      Documents/imagecache/domain1/original/path/inURI/foo.png
      Documents/imagecache/domain2/original/path/inURI/bar.png
      Documents/imagecache/...
      Documents/imagecache/...

    Now I'd like to check the size of imagecache including all its sub-directories. Is there a convenient way of doing it, preferably without crawling through all the data manually?
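
Some traversal is hard to avoid, since as far as I know most file systems do not store a recursive total per directory, but the walk itself can be left to a library routine. The sketch below shows the idea in Python with `os.walk`; the question's platform may differ, and the temporary directory tree and file names are made up for the demo.

```python
import os
import tempfile

def directory_size(root: str) -> int:
    """Sum the sizes in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # skip symlinks to avoid double counting
                total += os.path.getsize(path)
    return total

# Demo on a throwaway tree shaped like the imagecache layout above.
with tempfile.TemporaryDirectory() as cache:
    for domain, name, payload in [
        ("domain1", "foo.png", b"0123456789"),
        ("domain2", "bar.png", b"01234"),
    ]:
        folder = os.path.join(cache, domain, "original")
        os.makedirs(folder)
        with open(os.path.join(folder, name), "wb") as handle:
            handle.write(payload)
    size = directory_size(cache)
```

This reports the sum of file lengths, which can differ from the space actually allocated on disk.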

  • Google search box

    - by user343282
    I am working on a Google box, something like this: http://mytwentyfive.com/blog/wp-content/uploads/byme/Google%20Search%20Appliances.jpg I am pointing the crawler to a folder that contains HTML files. Previously the crawler was crawling the files and indexing them, but right now it finds the pattern or the folder and does not follow any HTML files within the folder. I have tried everything I could think of and can't think of anything else. Can someone help? Thanks.

  • Investment advice data dump analysis

    - by portoalet
    For my year-end pet project, I'd like to analyze investment advice and its correlation with stock market performance. The problem is, where do I get a (free) dump of investment advice data? Something like the stackoverflow.com data dump would be nice. Or maybe it's easier to do distributed crawling and crawl public finance webpages for investment advice? Investment advice here means buy/sell advice for stocks/forex, issued by an institution or investment advisor.
