Screen Scraping Twitter

Posted by BRADINO on Bradino
Published on Sat, 28 Mar 2009 17:20:22 +0000 Indexed on 2010/03/23 5:22 UTC

I got an email today asking for help scraping Twitter, in particular with logging in. So I am going to show everyone how, NOT to encourage anyone to violate Twitter's terms of use, but as an educational blog post about how PHP and cURL can be used to post variables and store cookies.

Again, I am using the cScrape class I wrote, which you can download.

Step 1
First go to twitter.com and look at the source code of the login form to get the form field names and the form post location. You will see that the form posts to https://twitter.com/sessions and that the username and password fields are session[username_or_email] and session[password] respectively.
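
If you would rather not read the page source by eye, the same information can be pulled out with PHP's DOM extension. This is only a rough sketch: listFormFields() is a name I made up for illustration (it is not part of the cScrape class), the sample markup is invented to resemble the login form described above, and it assumes the DOM extension is enabled.

```php
<?php
// Sketch: list the action URL and input names of every form in an HTML page.
// listFormFields() is a hypothetical helper, not part of the cScrape class.
function listFormFields(string $html): array
{
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $forms = array();
    foreach ($doc->getElementsByTagName('form') as $form) {
        $names = array();
        foreach ($form->getElementsByTagName('input') as $input) {
            if ($input->hasAttribute('name')) {
                $names[] = $input->getAttribute('name');
            }
        }
        $forms[] = array(
            'action' => $form->getAttribute('action'),
            'fields' => $names,
        );
    }
    return $forms;
}

// Invented markup modelled on the login form described above.
$sample = '<html><body>'
        . '<form action="https://twitter.com/sessions" method="post">'
        . '<input type="text" name="session[username_or_email]">'
        . '<input type="password" name="session[password]">'
        . '</form></body></html>';

$forms = listFormFields($sample);
print_r($forms);
```

Each entry in the returned array holds one form's post location under 'action' and its field names under 'fields', which is what the next step needs.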

Step 2
Now you are ready to log in. Create an associative array containing the form values you want to post and pass it to the fetch function in the cScrape class. You will also need to uncomment the lines for CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR, because cookies are required to stay logged in and scrape around. The paths to the cookie files need to be writable by your app. Finally, uncomment the line for CURLOPT_FOLLOWLOCATION.
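
For anyone not using the class, here is roughly what those options look like in plain cURL. Treat it as a sketch under assumptions, not the actual cScrape code: buildCurlOptions(), fetchPage() and the default cookie file name are all made up for illustration, and the cURL extension needs to be installed.

```php
<?php
// Sketch of the cURL options discussed above. buildCurlOptions(), fetchPage()
// and the default cookie file name are hypothetical, not taken from cScrape.
function buildCurlOptions(string $url, array $postData = array(), ?string $cookieFile = null): array
{
    // Assumption: the system temp directory is writable by this script.
    $cookieFile = $cookieFile ?? (sys_get_temp_dir() . '/scrape_cookies.txt');

    $options = array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects, e.g. after a login post
        CURLOPT_COOKIEFILE     => $cookieFile, // read previously stored cookies from here
        CURLOPT_COOKIEJAR      => $cookieFile, // store received cookies here
    );

    if (!empty($postData)) {
        $options[CURLOPT_POST]       = true;
        $options[CURLOPT_POSTFIELDS] = http_build_query($postData);
    }
    return $options;
}

function fetchPage(string $url, array $postData = array()): string
{
    $ch = curl_init();
    curl_setopt_array($ch, buildCurlOptions($url, $postData));
    $result = curl_exec($ch);
    // Releasing the handle lets cURL write the cookies out to the jar file.
    unset($ch);
    return $result === false ? '' : $result;
}
```

Calling fetchPage() with only a URL makes a GET request; passing an array of form values as the second argument turns it into a POST.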

$data = array('session[username_or_email]' => "bradino", 'session[password]' => "secret");

$scrape->fetch('https://twitter.com/sessions',$data);

Step 1.5
Oops, that didn't work. All I got back was "403 Forbidden: The server understood the request, but is refusing to fulfill it." Ahhh, I see there is another form variable called authenticity_token; I bet Twitter was looking for that. So let's back up and first hit twitter.com to get the authenticity_token value, and then make the login post request with that value included in our array of parameters.

$scrape->fetch('https://twitter.com');

$data = array('session[username_or_email]' => "bradino", 'session[password]' => "secret");

$data['authenticity_token'] = $scrape->fetchBetween('name="authenticity_token" type="hidden" value="','"',$scrape->result);

$scrape->fetch('https://twitter.com/sessions',$data);

echo $scrape->result;
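
In case it is not obvious what fetchBetween is doing, here is a minimal standalone sketch of the idea: return whatever sits between a start marker and an end marker. It is a stand-in I wrote for illustration, so the real method in the class may differ in details, and the sample input and token value are made up.

```php
<?php
// Hypothetical stand-in for the fetchBetween method; the real cScrape
// implementation may differ.
function fetchBetween(string $start, string $end, string $haystack): string
{
    $from = strpos($haystack, $start);
    if ($from === false) {
        return ''; // start marker not found
    }
    $from += strlen($start);

    $to = strpos($haystack, $end, $from);
    if ($to === false) {
        return ''; // end marker not found after the start marker
    }
    return substr($haystack, $from, $to - $from);
}

// Invented sample input; the token value is not a real one.
$html  = '<input name="authenticity_token" type="hidden" value="abc123" />';
$token = fetchBetween('name="authenticity_token" type="hidden" value="', '"', $html);
echo $token, "\n"; // prints abc123
```

Note that this kind of string matching depends on the exact attribute order in the markup, so if the token comes back empty, compare the start marker against the page source first.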

So that's basically it. Now you are logged in and can scrape around and request other pages as you normally would. Sorry it wasn't a longer post. I really do enjoy this kind of stuff so if anyone has a request, hit me up.

Errors?
1) Make sure that you are properly parsing the token variable.
2) Make sure that you uncommented the lines about CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR; both options need to be enabled.
3) Make sure that the path to the cookie file is writable by your application and that data is actually being written to it.
4) If you get a message about being redirected, you need to uncomment the line about CURLOPT_FOLLOWLOCATION; that option needs to be set to true.
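
The cookie file checks can be done quickly from PHP itself. This is just a sketch: checkCookieFile() is a hypothetical helper, and the path you pass in should be whatever you set for CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR.

```php
<?php
// checkCookieFile() is a hypothetical diagnostic helper, not part of cScrape.
function checkCookieFile(string $path): string
{
    // PHP caches file status results, so clear them before checking.
    clearstatcache(true, $path);

    if (!file_exists($path)) {
        return is_writable(dirname($path))
            ? 'missing, but the directory is writable'
            : 'missing, and the directory is not writable';
    }
    if (!is_writable($path)) {
        return 'exists, but is not writable';
    }
    return filesize($path) > 0
        ? 'writable and contains data'
        : 'writable, but empty';
}
```

If the file is writable but still empty after a request, the cookie options are probably still commented out.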
