Does the Google crawler really guess URL patterns and index pages that were never linked to?

Posted by Dominik on Pro Webmasters
Published on 2012-02-06T19:25:17Z Indexed on 2012/04/04 23:45 UTC

I'm experiencing problems with indexed pages which were (probably) never linked to. Here's the setup:

  1. Data-Server: Application with RESTful interface which provides the data
  2. Website A: Provides the data of (1) at http://website-a.example.com/?id=RESOURCE_ID
  3. Website B: Provides the data of (1) at http://website-b.example.com/?id=OTHER_RESOURCE_ID

So all of the (non-private) data is stored on (1). The websites (2) and (3) fetch this data and display a representation of it, with additional cross-linking between them.

In fact, the URL /?id=1 on website-a points to the same resource as /?id=1 on website-b. However, the resource with id 1 is useless on website-b. Unfortunately, the Google index for website-b now contains several links to resources belonging to website-a, and vice versa.
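To make the setup concrete, here is a minimal sketch of how both websites could end up serving the same resource. Everything in it is an assumption for illustration: the `DATA_SERVER` dictionary stands in for the RESTful data server, the `site` field is a hypothetical marker for which website a resource is meant for, and `fetch_for_url` is a made-up helper, not code from either website.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical stand-in for the data server (1): resource id -> record.
# The "site" field is an assumed marker for the website a resource belongs to.
DATA_SERVER = {
    "1": {"site": "website-a", "title": "Resource one"},
    "2": {"site": "website-b", "title": "Resource two"},
}

def fetch_for_url(url):
    """Return the record a website would display for the given URL, or None."""
    query = parse_qs(urlparse(url).query)
    resource_id = query.get("id", [None])[0]
    # The host is ignored and the "site" field is never checked,
    # so any known id resolves on either website.
    return DATA_SERVER.get(resource_id)

a = fetch_for_url("http://website-a.example.com/?id=1")
b = fetch_for_url("http://website-b.example.com/?id=1")
print(a is b)  # both hosts resolve to the same underlying record
```

Under these assumptions, a crawler that reaches /?id=1 on website-b by any route gets a valid-looking page, even though that resource is only meaningful on website-a.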

I "heard" that the Google crawler tries to determine the URL pattern (which makes sense for deciding which pages should go into the index and which should not) and then guesses other URLs by trying different values (like "I know that id 1 exists, let's try 2, 3, 4, ...").
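The rumored behavior can be sketched as follows. This only illustrates what "guessing" would mean for a numeric query parameter; it is not a claim about how Google's crawler actually works, and `guess_neighbor_urls` with its `param` and `count` arguments is a hypothetical helper.

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def guess_neighbor_urls(known_url, param="id", count=3):
    """Yield URLs whose numeric `param` value is incremented by 1..count."""
    parts = urlparse(known_url)
    query = parse_qs(parts.query)
    start = int(query[param][0])  # assumes the parameter exists and is numeric
    for offset in range(1, count + 1):
        query[param] = [str(start + offset)]
        yield urlunparse(parts._replace(query=urlencode(query, doseq=True)))

for url in guess_neighbor_urls("http://website-a.example.com/?id=1"):
    print(url)
```

If a crawler did this, requests for such sequential ids should show up in the server access logs without any referring page, which would be one way to look for evidence.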

Is there any evidence that the Google crawler really behaves that way? I doubt it. My guess is that the crawler submitted an HTML form and somehow got links to those unwanted resources.

I found some similar questions, including "Google webmaster central: indexing and posting false pages" [link removed]; however, none of those pages provides any evidence.

© Pro Webmasters or respective owner
