Anyone have a good solution for scraping the HTML source of a page with content (in this case, HTML

Posted by phpwns on Stack Overflow
Published on 2010-05-18T04:25:43Z Indexed on 2010/05/18 4:30 UTC

Filed under: php | scraper | dom | java | table

Does anyone have a good solution for scraping the HTML source of a page whose content (in this case, HTML tables) is generated with JavaScript?

An embarrassingly simple, though workable solution using Crowbar:

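<?php
// Fetch the rendered HTML of a page through Crowbar.
function get_html($url) // $url must be urlencode(d)
{
    $context = stream_context_create(array(
        'http' => array('timeout' => 120) // HTTP timeout in seconds
    ));
    // The second argument is $use_include_path, a boolean.
    $response = file_get_contents('http://127.0.0.1:10000/?url=' . $url . '&delay=3000&view=browser', false, $context);
    if ($response === false) {
        return false; // the request to Crowbar failed or timed out
    }
    // substr removes HTML from the Crowbar web service, returning only the $url HTML
    return substr($response, 730, -32);
}
?>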
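
A minimal sketch of how the request URL for the function above could be assembled, assuming Crowbar is listening on 127.0.0.1:10000 as in the snippet. The helper name crowbar_request_url and the example target URL are hypothetical; the only point is that the target URL goes through urlencode() before it is appended to the query string.

```php
<?php
// Hypothetical helper: builds the Crowbar request URL used by get_html().
// Host, port, and query parameters are copied from the snippet above.
function crowbar_request_url($targetUrl, $delayMs = 3000)
{
    return 'http://127.0.0.1:10000/?url=' . urlencode($targetUrl)
        . '&delay=' . (int) $delayMs
        . '&view=browser';
}

// Placeholder target URL: ':', '/', '?' and '=' are escaped in the result.
echo crowbar_request_url('http://example.com/page?id=1'), "\n";
```

Encoding inside a helper like this means callers cannot forget the urlencode() step that get_html() relies on.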

The advantage of using Crowbar is that the tables will be rendered (and accessible) thanks to its headless Mozilla-based browser. The problem, of course, is being dependent on an external web service, especially given that SIMILE seems to undergo regular server maintenance. :(
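
Once the rendered HTML is available, the tables can be read with PHP's DOM extension. The following is only a sketch: extract_table_rows is a hypothetical name, and the sample markup stands in for whatever get_html() would return for a real page.

```php
<?php
// Sketch: collect the cell text of every table row found in an HTML string.
// extract_table_rows() is a hypothetical helper, not part of Crowbar.
function extract_table_rows($html)
{
    if ($html === '' || $html === false) {
        return array(); // nothing to parse
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally; real-world markup is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = array();
    foreach ($xpath->query('//table//tr') as $tr) {
        $cells = array();
        // Header and data cells that are direct children of this row.
        foreach ($xpath->query('./th | ./td', $tr) as $cell) {
            $cells[] = trim($cell->textContent);
        }
        if ($cells) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

// Stand-in for the HTML returned by get_html().
$html = '<table><tr><th>Name</th><th>Qty</th></tr><tr><td>Widget</td><td>3</td></tr></table>';
print_r(extract_table_rows($html));
```

Rows from nested tables are included as well, since the query matches every tr below any table element.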

A pure PHP solution would be nice, but any functional (and reliable) alternative would be great.

© Stack Overflow or respective owner
