Does the Google crawler really guess URL patterns and index pages that were never linked to?
Posted by Dominik on Pro Webmasters
Published on 2012-02-06T19:25:17Z
I'm experiencing problems with indexed pages which were (probably) never linked to. Here's the setup:
1. Data-Server: application with a RESTful interface that provides the data
2. Website A: presents the data of (1) at http://website-a.example.com/?id=RESOURCE_ID
3. Website B: presents the data of (1) at http://website-b.example.com/?id=OTHER_RESOURCE_ID
So all the non-private data is stored on (1), and the websites (2) and (3) fetch and display it, each rendering its own representation of the data with additional cross-links between resources.
In fact, the URL /?id=1 of website-a fetches the same underlying resource as /?id=1 of website-b; however, resource id:1 is useless at website-b. Unfortunately, the Google index for website-b now contains several URLs for resources belonging to website-a, and vice versa.
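Regardless of how the crawler found the URLs, they stay in the index because both sites answer any id with a 200 page. A minimal sketch of one common remedy (assuming each site knows which id range it owns; the range and helper name here are hypothetical, not from the question) is to answer foreign ids with a 404 so they are dropped from or never enter the index:

```python
# Hypothetical id-ownership check for website-b: it only owns ids 1000-1999.
# Requests for any other id get a 404 instead of a 200 page that renders
# useless foreign data, so crawlers will not keep those URLs indexed.

OWNED_IDS = range(1000, 2000)  # assumption: website-b's own id range

def status_for(resource_id: int) -> int:
    """HTTP status website-b should send for a request to /?id=resource_id."""
    return 200 if resource_id in OWNED_IDS else 404

print(status_for(1500))  # id owned by website-b -> 200
print(status_for(1))     # foreign id -> 404, not indexable
```

A `noindex` robots meta tag or header on foreign-id pages would achieve the same effect without hiding the content from users who follow a stale link.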
I have "heard" that the Google crawler tries to detect the URL pattern (which makes sense for deciding which pages should go into the index and which should not) and furthermore guesses other URLs by trying different values ("I know that id 1 exists, let's try 2, 3, 4, ...").
Is there any evidence that the Google crawler really behaves that way? (I doubt it.) My guess is that the Google crawler submitted an HTML form and thereby obtained links to those unwanted resources.
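The form-submission theory is easy to picture: a GET form with an `id` field submits to its action URL with the field values URL-encoded into the query string, which yields exactly the observed URL shape. A small sketch (the form and the id value are illustrative assumptions, not taken from the sites in question):

```python
from urllib.parse import urlencode

# A GET form such as <form action="http://website-b.example.com/">
# <input name="id"></form> submits to action + "?" + urlencode(fields).
# Any id value a crawler fills in produces a crawlable URL.
def form_submit_url(action: str, fields: dict) -> str:
    """Build the URL a browser would request when submitting a GET form."""
    return action + "?" + urlencode(fields)

print(form_submit_url("http://website-b.example.com/", {"id": 2}))
```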
I found some similar questions, including "Google webmaster central: indexing and posting false pages" [link removed]; however, none of them provides any evidence.