How to scrape Google SERP based on copyright year?

Posted by Michael Mao on Stack Overflow See other posts from Stack Overflow or by Michael Mao
Published on 2010-03-08T00:15:21Z Indexed on 2010/03/08 0:17 UTC
Read the original article Hit count: 405

Filed under:
|
|

Hi all:

I know there must be ways to do this sort of things. I am not pro in RoR or Python, not even an expert in PHP. So my solution tends to be quite dumb: It uses a FireFox add-on called imarcos to scrape the target urls from Google SERP, and use PHP to store info into the database.

At the very core of my workaround there lies a problem: How to specifically find target urls based on their copyright year? I mean, something like "copyright 1998-2006" in the footer is to be considered a target, but my search results are not 100% accurate.

I used the following url to search :

http://www.google.com.au/#hl=en&q=inurl:.com.au+intext:copyright+1995..2007+--2008+--2009&start=0&cad=b&fp=6a8119b094529f00

It reads : search for pages that have .com.au in URL and a copyright range from 1995 to 2007 exclude the year of 2008 or 2009. Starting position is 0, of course the offset can be changed.

I've already done a dummy list and honestly I am not pleased with the result. That's mostly because I cannot find a way to restrict search terms in the exact order as they are entered into the search url. copyright can appear in anywhere on page and doesn't necessarily before the years, that's the current story.

Is there a more clear way to sort out this? Oh, almost forgot to say the client doesn't wanna spent too much in this - I cannot persuade him simply buy some cool software, unfortunately. I hope there is a way to use clever Google search operators or similar things to go around this issue.

Many thanks in advance!

© Stack Overflow or respective owner

Related posts about google

Related posts about scrape