How to scrape Google SERP based on copyright year?

Posted by Michael Mao on Stack Overflow
Published on 2010-03-08T00:15:21Z Indexed on 2010/03/08 0:17 UTC
I know there must be ways to do this sort of thing. I am not a pro in RoR or Python, nor even an expert in PHP, so my solution tends to be quite dumb: it uses a Firefox add-on called iMacros to scrape the target URLs from the Google SERP, and PHP to store the info in a database.

At the very core of my workaround lies a problem: how do I specifically find target URLs based on their copyright year? I mean, a page with something like "copyright 1998-2006" in its footer should be considered a target, but my search results are not 100% accurate.
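To make the target definition concrete, here is a minimal Python sketch of a check that could be run on the text of each candidate page after it has been fetched. It assumes the page text is already available as a string. The names `COPYRIGHT_RE`, `latest_copyright_year` and `is_target` are hypothetical, and the 1995 to 2007 window is simply borrowed from the search query in this post.

```python
import re

# Hypothetical pattern: the word "copyright", the (c) marker or the © sign,
# followed within a few non-digit characters by a year and an optional
# second year that closes a range such as "1998-2006".
COPYRIGHT_RE = re.compile(
    r"(?:copyright|©|\(c\))\D{0,20}?"
    r"((?:19|20)\d{2})(?!\d)"
    r"(?:\s*[-–]\s*((?:19|20)\d{2})(?!\d))?",
    re.IGNORECASE,
)


def latest_copyright_year(text):
    """Return the latest year found in any copyright notice, or None."""
    years = []
    for match in COPYRIGHT_RE.finditer(text):
        years.extend(int(year) for year in match.groups() if year)
    return max(years) if years else None


def is_target(text, first=1995, last=2007):
    """True when the newest copyright year on the page lies in [first, last]."""
    year = latest_copyright_year(text)
    return year is not None and first <= year <= last


if __name__ == "__main__":
    print(is_target("Copyright 1998-2006 Example Pty Ltd"))  # range ends in 2006
    print(is_target("© 2009 Example Pty Ltd"))                # too recent
```

Because the pattern requires the year to follow the copyright marker directly, a page that merely mentions "copyright" in one place and a year somewhere else would not be counted.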

I used the following URL to search:

http://www.google.com.au/#hl=en&q=inurl:.com.au+intext:copyright+1995..2007+--2008+--2009&start=0&cad=b&fp=6a8119b094529f00

It reads: search for pages that have .com.au in the URL and contain the word "copyright" plus a number in the range 1995 to 2007, excluding pages that mention 2008 or 2009. The start parameter is the result offset; it is 0 here, but of course it can be changed.
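For paging through the results, the same URL can be generated for each offset. The following is a rough Python sketch, not a tested scraper: `build_search_url` and `search_urls` are hypothetical helpers, the search terms (including the exclusion terms) are copied verbatim from the URL above, the `cad` and `fp` parameters are left out because their meaning is not explained in this post, and the page size of 10 results is an assumption.

```python
from urllib.parse import urlencode

# Base taken from the URL in the post; the parameters live after the "#".
BASE = "http://www.google.com.au/#"


def build_search_url(start=0, site=".com.au", first=1995, last=2007,
                     exclude_terms=("--2008", "--2009")):
    """Build the search URL for one result offset (hypothetical helper)."""
    # Terms are joined exactly as they were typed in the original query.
    terms = [f"inurl:{site}", "intext:copyright", f"{first}..{last}"]
    terms.extend(exclude_terms)
    params = {"hl": "en", "q": " ".join(terms), "start": start}
    return BASE + urlencode(params)


def search_urls(pages=3, page_size=10):
    """Return one URL per result page; page_size=10 is an assumption."""
    return [build_search_url(start=page * page_size) for page in range(pages)]


if __name__ == "__main__":
    for url in search_urls():
        print(url)
```

Each generated URL could then be handed to whatever does the actual scraping, such as the iMacros script.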

I've already built a dummy list and honestly I am not pleased with the result. That's mostly because I cannot find a way to require the search terms to appear in the exact order they are entered in the search URL. As it stands, "copyright" can appear anywhere on the page and does not necessarily come before the years.

Is there a cleaner way to sort this out? Oh, I almost forgot to say that the client doesn't want to spend too much on this, so unfortunately I cannot persuade him to simply buy some cool software. I hope there is a way to use clever Google search operators or something similar to get around this issue.

Many thanks in advance!

Tags: google, scrape, imacros
