Does the Google crawler really guess URL patterns and index pages that were never linked to?
Posted by Dominik on Pro Webmasters
Published on 2012-02-06T19:25:17Z
I'm experiencing problems with indexed pages which were (probably) never linked to. Here's the setup:
1. Data-Server: application with a RESTful interface that provides the data
2. Website A: presents the data of (1) at http://website-a.example.com/?id=RESOURCE_ID
3. Website B: presents the data of (1) at http://website-b.example.com/?id=OTHER_RESOURCE_ID
So all the non-private data is stored on (1), and the websites (2) and (3) fetch and display it, each rendering its own representation of the data with additional cross-links between resources.
In fact, the URL /?id=1 of website-a fetches the same underlying resource as /?id=1 of website-b; however, resource id:1 is useless at website-b. Unfortunately, the Google index for website-b now contains several URLs for resources belonging to website-a, and vice versa.
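Regardless of how the crawler found the URLs, they stay in the index because both sites answer any id with a 200 page. A minimal sketch of one common remedy (assuming each site knows which id range it owns; the range and helper name here are hypothetical, not from the question) is to answer foreign ids with a 404 so they are dropped from or never enter the index:

```python
# Hypothetical id-ownership check for website-b: it only owns ids 1000-1999.
# Requests for any other id get a 404 instead of a 200 page that renders
# useless foreign data, so crawlers will not keep those URLs indexed.

OWNED_IDS = range(1000, 2000)  # assumption: website-b's own id range

def status_for(resource_id: int) -> int:
    """HTTP status website-b should send for a request to /?id=resource_id."""
    return 200 if resource_id in OWNED_IDS else 404

print(status_for(1500))  # id owned by website-b -> 200
print(status_for(1))     # foreign id -> 404, not indexable
```

A `noindex` robots meta tag or header on foreign-id pages would achieve the same effect without hiding the content from users who follow a stale link.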
I have "heard" that the Google crawler tries to detect the URL pattern (which makes sense for deciding which pages should go into the index and which should not) and furthermore guesses other URLs by trying different values ("I know that id 1 exists, let's try 2, 3, 4, ...").
Is there any evidence that the Google crawler really behaves that way? (I doubt it.) My guess is that the Google crawler submitted an HTML form and thereby obtained links to those unwanted resources.
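The form-submission theory is easy to picture: a GET form with an `id` field submits to its action URL with the field values URL-encoded into the query string, which yields exactly the observed URL shape. A small sketch (the form and the id value are illustrative assumptions, not taken from the sites in question):

```python
from urllib.parse import urlencode

# A GET form such as <form action="http://website-b.example.com/">
# <input name="id"></form> submits to action + "?" + urlencode(fields).
# Any id value a crawler fills in produces a crawlable URL.
def form_submit_url(action: str, fields: dict) -> str:
    """Build the URL a browser would request when submitting a GET form."""
    return action + "?" + urlencode(fields)

print(form_submit_url("http://website-b.example.com/", {"id": 2}))
```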
I found some similar questions, including "Google webmaster central: indexing and posting false pages" [link removed]; however, none of them provides any evidence.