crawling - Developer IT

https & ajax crawling

- by Christoph Gassauer

We made on our webpage https://www.1point618.com a transition to ssl and now we using nearly entirely ajax to load the content. Therefore all urls of existing pages have changed. We used the 301 redirect as recommended, also we have implemented google's specification that the webpage is still crawl-able. We thought that maybe it would last a month that we have the same ranking in google's search results, but still google's search results are much worse than before these changes. Most of the content (artist profiles) isn't indexed anymore. For example of the submitted sitemap only 3 of around 450 urls are indexed. Before almost all urls were indexed. My question is now: Does google's ajax crawling work together with ssl? (It looks like it would work, cause of the access log file.)

Read the article

Understanding Ajax crawling of search site

- by vacuum

I have a couple of questions about Ajax crawling of site, which is kind of search engine itself. The base article explains the mechanism of making AJAX application crawlable. All this stuff with HTML-snapshots is clear and easy to implement, but I cant understand where will Google bot will get "the crawler finds a pretty AJAX URL"( ie www.example.com/ajax.html#key=value) to work with. First thing, that came on mind - is breadcrumb. In sitemap we can specify pages with breadcrumb on it. so bot will go to these pages and get HTML-snapshots from here. But I'm sure, there are exists other ways to give bot this "pretty AJAX URL". In our case, we have simple search site, where user enters keyword, presses "Find", js execute Ajax request, receives JSON reponce and fill page with results(without any refresh of course). In this case - how to make google bot crawle all the presults in addition to sitemap? Is there some example of solution, described in article above?

Read the article

How to disallow indexing but allow crawling?

- by John Doe

In the front page of my website, I have some previews to articles (with a small introduction to them) that link to the full articles. I want to disallow the front page to prevent duplicate content. But if I do this (in robots.txt), would it still be crawled? I mean, the full articles would be still reached by the crawler even though I disallowed the only page that links to them? I don't want the webcrawler not to access the page and enter the links in them, but I just don't want it to save the information (that will be repeated in the full articles).

Read the article

How to get rid of crawling errors due to the URL Encoded Slashes (%2F) problem in Apache

- by user14198

The Google web crawler has indexed a whole set of URLs with encoded slashes (%2F) for our site. I assume it has picked up the pages from our XML sitemap file. The problem is that the live pages will actually result in a failure because of the Url Encoded Slashes Problem in Apache. Some solutions are mentioned here We are implementing a 301 redirect scheme for all the error pages. This should make the Google bot delete the pages from the crawling errors (no more crashing pages). Does implementing the 301s require the pages to be "live"? In that case we may be forced to implement solution 1 in the article. The problem is that solution 1 will pose a security vulnerability..

Read the article

Crawling not working windows2008

- by axtolf

Hi, We installed a new MOSS 2007 farm on windows 2008 SP2 enviroment. We used SQL2008 too. Configuration is 1 index, 1 FE and 1 server with 2008, all on ESX 4.0. All the Service that need it uses a dedicated user, so search has a dedicated user. Installation went well and we found no problem. We installed an SP1 MOSS from a ISO and after we upgraded WSS and MOSS to SP2. We installed the Italian language pack too and patched it to SP2. We created a new SSP. We created a web application and created a root website under it. The problem is that we can't male crawling work in any way. Seems that crawling is not able to reach the web application that we want to crawl. In event viewer of the index we have this error when we try to crawl it: The start address <h..p://name.domain.it:81 cannot be crawled. Context: Application 'SSP1', Catalog 'Portal_Content' Details: The object was not found. (0x80041201) The log of crawling from the search admin, only says: h..p://name.domani.it:81 The object was not found. (The item was deleted because it was either not found or the crawler was denied access to it.) The domain is fully accessible from everywhere using both farm admin user or the search user that we are using for service to run. Site is fully accessible from the index and seem not have problem. Inside the we application we created a root site collection with a couple of file. The log of the farm simply says.... nothing! When we ask to do a full crawl of the site, it runs for a second and after we have the errors that I wrote above. But the farm's log says nothing. Any suggestion or help is really appreciated since we are losing a lot of time on it and really we do not have any idea of what's wrong about this farm.

Read the article

How to protect/monitor your site from crawling by malicious user

- by deathy

Situation: Site with content protected by username/password (not all controlled since they can be trial/test users) a normal search engine can't get at it because of username/password restrictions a malicious user can still login and pass the session cookie to a "wget -r" or something else. The question would be what is the best solution to monitor such activity and respond to it (considering the site policy is no-crawling/scraping allowed) I can think of some options: Set up some traffic monitoring solution to limit the number of requests for a given user/IP. Related to the first point: Automatically block some user-agents (Evil :)) Set up a hidden link that when accessed logs out the user and disables his account. (Presumably this would not be accessed by a normal user since he wouldn't see it to click it, but a bot will crawl all links.) For point 1. do you know of a good already-implemented solution? Any experiences with it? One problem would be that some false positives might show up for very active but human users. For point 3: do you think this is really evil? Or do you see any possible problems with it? Also accepting other suggestions.

Read the article

Website content crawling

- by klork

We have a Business Listings directory hosted on IIS 6 Windows 2003. Our competitors crawl and steal our content and customers. We have tried IP blocking using honeypot URLs and log parsing without much success. Is anyone aware of a network device or a proxy server that I can run in front of my web server to minimize this issue? All suggestions are highly appreciated.

Read the article

Nutch crawling with seeds urls are in range

- by user365345

Some site have url pattern as www..com/id=1 to www..com/id=1000. How can I crawl the site using nutch. Is there any why to provide seed for fetching in range??

Read the article

Configure HTTP Post data input to Nutch before crawling a site

- by user365345

I have to crawl a site which list item based on user input through http post submission. How to configure post http submission details in Nutch. I got help on how to do HttpPostAuthentication, but I got no help on "how to do post data submit other than username and password".

Read the article

Crawling an ajax based page with both a hash fragment and a meta tag

- by Christofian

According to google's documentation on crawling ajax based web pages, if a url contains a hash fragment, or something at the end of an url that looks like #helloworld, and if there is an ! after the #, as in #!helloworld, google will then request the url url?_escaped_fragment_=helloworld. I currently have an ajax based webpage that I want google to be able to crawl. Sometimes, the page uses hash fragments, and for those situations I set up the server so it will return an html snapshot for that page using _escaped_fragment_. However, that webpage often does not load a hash fragment, and when that happens the webpage still loads content using ajax. I couldn't find a good solution to enable ajax crawling for pages that sometimes have a hash fragment and sometimes don't. How can I tell google to use _escaped_fragment_ when there is a hash fragment, and to use something else to get an html snapshot of a page when there isn't a hash fragment?

Read the article

Stop bots from crawling old links with extensions

- by Jared

I've recently switched to MVC3 which is extension-less for the URL's, but Google and Bing have a wealth of links that they are crawling which no longer exist. So I'm trying to find out if there is a way to format robots.txt (or by some other method) to tell google/bing that any link that ends in an extension isn't a valid link... Is this possible? On pages that I'm concerned about a User having saved as a fav I'm displaying a 404 page that lists the links to take once they are redirected to the new page (I decided to not just redirect them as I don't want to maintain these forever). For Google/Bing sake I do have the canonical tag in the header. User-agent: * Allow: / Disallow: /*.* EDIT: I just added the 3rd line (in text above) and it APPEARS to do what I'm wanting. Allow a path, but disallow a file. Can anyone confirm this?

Read the article

Why do people crawl sites without downloading pictures?

- by Michael

Let me show you what I mean: IP Pages Hits Bandwidth 85.xx.xx.xxx 236 236 735.00 KB 195.xx.xxx.xx 164 164 533.74 KB 95.xxx.xxx.xxx 90 90 293.47 KB It's very clear that these person are crawling my site with bots. There's no way that you could visit my site and use <1MB bandwidth. You might say that there's the possibility that they could be browsing the site using some browser or plug-in that does not download images, js/css files, etc., but the simple fact of the matter is that there are not 90-236 pages that are linked from the home page (outside of WP files), even if you visited every page twice. I could understand if these people were crawling the site for pictures, but once again, the bandwidth indicates that this isn't what is happening. Why, then, would they crawl the site to simply view the HTML/txt/js/etc. files? The only thing that I can come up with is that they are scanning for outdated versions of WordPress, SQL injection vulnerabilities, etc., which makes me inclined to outright ban the IPs, but I'm curious, is it possible that this person is a legitimate user, or at the very least, not intending to be harmful?

Read the article

Bingbot seems to be adding "ForceRecrawl: 0" to URLs when crawling my sites

- by Louis Somers

I'm seeing this in the iis-logs of two websites that I maintain: GET /an/existing/page/on/my/site+ForceRecrawl:+0 - 80 - 207.46.195.105 HTTP/1.1 Mozilla/5.0+(compatible;+bingbot/2.0;++http://www.bing.com/bingbot.htm) I get about one or two of these per day from these IP addresses: 207.46.195.105, 65.52.110.190.. an more, all belonging to msnbot-ip.search.msn.com Probably Microsoft has a bug in their crawler? Any way, doing a search on "ForceRecrawl: 0" in major search engines comes up with a bunch of random sites. Doing the search on StackOverflow or here gave no results (to my amazement). Am I the only one seeing this? I first noticed these on the 9th of this month, and I'm seeing them pass almost daily since... Another thing that I think is crazy, is that the URL http://www.bing.com/bingbot.htm redirects to mail.live.com (hotmail). Currently I'm returning 404's but I'm considering to catch these, strip the trailing " ForceRecrawl: 0" and process as if it were a legitimate url. Could anyone shed some light on this? Could it have to do with some configuration or so in Bing's Webmaster Tools?

Read the article

Voxel Face Crawling (Mesh simplification, possibly using greedy)

- by Tim Winter

This is in regards to a Minecraft-like terrain engine. I store blocks in chunks (16x256x16 blocks in a chunk). When I generate a chunk, I use multiple procedural techniques to set the terrain and to place objects. While generating, I keep one 1D array for the full chunk (solid or not) and a separate 1D array of solid blocks. After generation, I iterate through the solid blocks checking their neighbors so I only generate block faces that don't have solid neighbors. I store which faces to generate in their own list (that's 6 lists, one per possible face). When rendering a chunk, I render all lists in the camera's current chunk and only the lists facing the camera in all other chunks. Using a 2D atlas with this little shader trick Andrew Russell suggested, I want to merge similar faces together completely. That is, if they are in the same list (same normal), are adjacent to each other, have the same light level, etc. My assumption would be to have each of the 6 lists sorted by the axis they rest on, then by the other two axes (the list for the top of a block would be sorted by it's Y value, then X, then Z). With this alone, I could quite easily merge strips of faces, but I'm looking to merge more than just strips together when possible. I've read up on this greedy meshing algorithm, but I am having a lot of trouble understanding it. To even use it, I would think I'd need to perform a type of flood-fill per sorted list to get the groups of merge-able faces. Then, per group, perform the greedy algorithm. It all sounds awfully expensive if I would ever want dynamic terrain/lighting after initial generation. So, my question: To perform merging of faces as described (ignoring whether it's a bad idea for dynamic terrain/lighting), is there perhaps an algorithm that is simpler to implement? I would also quite happily accept an answer that walks me through the greedy algorithm in a much simpler way (a link or explanation). I don't mind a slight performance decrease if it's easier to implement or even if it's only a little better than just doing strips. I worry that most algorithms focus on triangles rather than quads and using a 2D atlas the way I am, I don't know that I could implement something triangle based with my current skills. PS: I already frustum cull per chunk and as described, I also cull faces between solid blocks. I don't occlusion cull yet and may never.

Read the article

GWT: reporting crawling errors for non existing links

- by pixeline

Google Webmaster Tools is reporting crawl errors for links that never existed, and if i check the "Linked from" tab for a given error link, it shows another that never existed. They all mention joomla/ which is not the cms used on this domain (it's wordpress fyi). Exampled: http://example.com/joomla/index.php/component/user/register Linked from: http://example.com/joomla/component/user/login?return=L2###### What is going on? UPDATE 1 I tried something: I provided one of the faulty urls to the "Fetch as Google" functionality. Instead of returning a 404, it returns a 301 to another Joomla page. HTTP/1.1 301 Moved Permanently Server: Apache/2.4.3 X-Powered-By: PHP/5.4.4-10 X-Pingback: http://example.com/xmlrpc.php Expires: Wed, 11 Jan 1984 05:00:00 GMT Cache-Control: no-cache, must-revalidate, max-age=0 Pragma: no-cache Set-Cookie: PHPSESSID=1fgr5v2oip39miibuptd51s8h0; path=/ Set-Cookie: woocommerce_items_in_cart=0; expires=Sat, 12-Jan-2013 11:44:01 GMT; path=/ Location: http://example.com/joomla/component/user/register Content-Type: text/html; charset=iso-8859-1 Content-Length: 387 Date: Sat, 12 Jan 2013 12:44:01 GMT Via: 1.1 varnish Connection: keep-alive Accept-Ranges: bytes Age: 0 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head> <title>301 Moved Permanently</title> </head><body> <h1>Moved Permanently</h1> <p>The document has moved <a href="http://example.com/joomla/component/user/register">here</a>.</p> <p>Additionally, a 301 Moved Permanently error was encountered while trying to use an ErrorDocument to handle the request.</p> </body></html>

Read the article

Crawling a Content Folio

- by Kyle Hatlestad

Content Folios in WebCenter Content allow you to assemble, track, and access a logical group of documents and/or links. It allows you to manage them as just a list of items (simple folio) or organized as a hierarchy (advanced folio). The built-in UI in content server allows you to work with these folios, but publishing them or consuming them externally can be a bit of a challenge. [Read More]

Read the article

Crawling a Content Folio

- by Kyle Hatlestad

Content Folios in WebCenter Content allow you to assemble, track, and access a logical group of documents and/or links. It allows you to manage them as just a list of items (simple folio) or organized as a hierarchy (advanced folio). The built-in UI in content server allows you to work with these folios, but publishing them or consuming them externally can be a bit of a challenge. The folios themselves are actually XML files that contain the structure, attributes, and pointers to the content items. So to publish this somewhere, such as a Site Studio page, you could perhaps use an XML parser to traverse the structure and create your output. But XML parsers are not always the easiest or most efficient to use. In order to more easily crawl and consume a Content Folio, Ed Bryant - Principal Sales Consultant, wrote a component to do just that. His component adds a service which does all the work for you and returns the folio structure as a simple resultset. So consuming and publishing that folio on a Site Studio page or in your portal using RIDC is a breeze! For example, let's take an advanced Content Folio example like this: If we look at the native file, the XML looks like this: But if we access the folio using the new service - http://server/cs/idcplg?IdcService=FOLIO_CRAWL&dDocName=ecm008003&IsPageDebug=1 - this is what the result set looks like (using the IsPageDebug parameter). Given this as the result set, it makes it very easy to consume and repurpose that folio. You can download a copy of the sample component here. Special thanks to Ed for letting me share this component!

Read the article

Why google is not crawling my website

- by Aman Virk

I am running a design and development blog http://www.thetutlage.com/ . From last couple of days my search traffic have been reduced from 70% to 10%. I myself is against black hat seo and all it do is write my own unique content almost everyday. Last week my search traffic was really good but now is dropping like heck. I have checked my webmasters dashboard and no message there from google. When i checked server logs i came to know last time google crawled my website was on 27 september 2012. Really i have no idea what i am doing wrong. I follow all google guidelines like bible, please help me

Read the article

indexing and crawling

- by ricky

hello mate my site is dailytopup.com...earlier my site was indexed imediately i post anything but last month my website was crashed due to sever problem and i adont have back up at that time so i recover everything from cached copies but before doing that i remove old urls from the webmaster and then repost again.but after that my website is not indexed properly reaults in no optimsation.everytime i have to use fetch as google but this is not that effective..can you please tell where um lacking or what should i do now?

Read the article

Google crawling the site but refusing to index dynamic content

- by Omeoe

I am trying to get Google to index an AJAX site (davidelifestyle.com). It's crawlable with JavaScript turned off and I have also recently implemented _escaped_content_ snapshot mechanism but all that's indexed is a home page and PDF files that are not directly available from the home page. Also when I use Fetch as Google in Webmaster Tools, it downloads the dynamic page but does not index it ("Submit to Index" just reloads the page). Any ideas what might be wrong?

Read the article

Googlebot can't access my site when crawling from rootdomain

- by PéCé

I can't explain why I get this message for my rootdomain result in Google : trocmalin.com/ A description for this result is not available because of this site's robots.txt – learn more. Here is my site specifics : vide-greniers.trocmalin.com is the site address www.trocmalin.com redirects (301) to vide-greniers.trocmalin.com trocmalin.com redirects (301) to vide-greniers.trocmalin.com too... User-agent: * Disallow: /orga/ User-agent: * Disallow: /sitemap-update Google results for vide-greniers.trocmalin.com are well rendered, as well as sub pages allowed for bots. But the result for my rootdomain (trocmalin.com) gives this message... Can you help me ?

Read the article

Crawling for geotagged data

- by abe3

I have no experience with web crawlers -- but I know that Apache maintains an open source web crawler called "Lucene." How would I go about writing such a crawler to search the web for geo tagged data close to a particular location? What would a general road map look like? How do I pick which slice of the web to crawl? Do I use regular expressions to find things that look like longitudes and latitudes? What does a general sketch of that solution look like?

Read the article

SEO-meta description crawling issue [duplicate]

- by user3707382

This question already has an answer here: Meta Descriptions not working for google search 3 answers i have following code where i m including my title and description for the page But google crawled only title not the meta description from the code. Where as meta description was read from the keywords present in html of the page.. Please guide me guys where i m coding wrongly <!DOCTYPE html> <html> <head> <title>title inserted here</title> <meta http-equiv="Content-Type" content="text/html;charset=utf-8"> <meta name="description" content="description here"/>

Read the article

Is there a weight for html link?

- by Questions

I would like to know if there is any weight associated with a html link (its backlink) when google does crawling/indexing. Will 1,2 & 3 be ever considered as a backlink by Google? 1. <a href="xyz.com">1</a> // one character link 2. <a href="xyz.com"> </a> //blank space link 3. <a href="xyz.com">!</a> //special character link 4. <a href="xyz.com">keyword</a> //meaningful word link I hope my question is understandable and guess this is right forum. I don't know to put it in other words. Thanks in advance.

Read the article

404s on password protected content

- by tjb1982

I'm new to WordPress and SEO, generally, but we've been running into problems with our site that don't seem to make sense to me. The problem is that our editor likes to schedule posts and/or mark them private until she is ready to make them public, but somehow Google is crawling these posts and getting 404s (because they are password protected). How does Google know they exist in the first place? I checked the sitemap.xml file and don't see a record of the post. One of the offending posts was marked public, but is scheduled for a future date. Could that have something to do with it? I've tried to Google the answer, and I came up with a good amount of reassurance that this won't hurt the site, but I'm still wondering how it's happening in the first place. It's hard because I don't know exactly what the editor's workflow is. Is it possible she's posting publicly first and then revising it to be private only after it's too late? Does anyone know how Google finds WordPress URLs it shouldn't have access to?

Search Results

Search found 241 results on 10 pages for 'crawling'.

Page 1/10 | 1 2 3 4 5 6 7 8 9 10 | Next Page >

- by Christoph Gassauer

- by vacuum

- by John Doe

- by user14198

- by axtolf

- by deathy

- by klork

- by user365345

- by user365345

- by Christofian

- by Jared

- by Michael

- by Louis Somers

- by Tim Winter

- by pixeline

- by Kyle Hatlestad

- by Kyle Hatlestad

- by Aman Virk

- by ricky

- by Omeoe

- by PéCé

- by abe3

- by user3707382

- by Questions

- by tjb1982

1 2 3 4 5 6 7 8 9 10 | Next Page >