urllib2 misbehaving with dynamically loaded content

Some Code

import urllib.request
from urllib import error

# sURL and sFileName are set earlier in the script

headers = {}
# NB: this value accidentally repeats the 'User-Agent: ' prefix, which is why
# it shows up doubled in the captured request headers further down
headers['User-Agent'] = 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0'
headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
headers['Accept-Language'] = 'en-gb,en;q=0.5'
#headers['Accept-Encoding'] = 'gzip, deflate'

request = urllib.request.Request(sURL, headers=headers)
try:
    response = urllib.request.urlopen(request)
except error.HTTPError as e:
    print("The server couldn't fulfill the request.")
    print('Error code: {0}'.format(e.code))
except error.URLError as e:
    print('We failed to reach a server.')
    print('Reason: {0}'.format(e.reason))
else:
    with open('output/{0}.html'.format(sFileName), 'w') as f:
        f.write(response.read().decode('utf-8'))

A URL

http://groupon.cl/descuentos/santiago-centro

The situation

Here's what I did:

  1. enable JavaScript in the browser
  2. open the URL above and keep an eye on the console
  3. disable JavaScript
  4. repeat step 2
  5. use the urllib script above to grab the webpage and save it to a file
  6. enable JavaScript
  7. open the file in the browser and observe the console
  8. repeat step 7 with JavaScript off

Results

  • In step 2 I saw that a whole lot of the page content was loaded dynamically using AJAX. So the HTML that arrived was a sort of skeleton, and AJAX was used to fill in the gaps. This is fine and not at all surprising.

  • Since the page should be SEO-friendly, it should work fine without JS. In step 4 nothing happens in the console and the skeleton page loads pre-populated, rendering the AJAX unnecessary (a quick way to check for this programmatically is sketched after this list). This is also completely not confusing.

  • In step 7 the AJAX calls are made but fail. This is also OK: the URLs they use are not local, so the calls break. The page looks like the skeleton. This is also great and expected.

  • In step 8 no AJAX calls are made and the skeleton is just a skeleton. I would have expected this to behave much like step 4.
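
As an aside, here's the quick check mentioned above for telling the bare skeleton apart from a pre-populated page. This is just a minimal sketch: looks_populated, the file path and the marker string are made-up placeholders, and you'd substitute text that you know only appears once the deals are filled in.

def looks_populated(html_path, marker):
    # 'marker' should be text that only appears in the pre-populated page,
    # e.g. the title of a deal you can see in the browser
    with open(html_path, encoding='utf-8') as f:
        return marker in f.read()

print(looks_populated('output/santiago-centro.html', 'some deal title'))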

Question

What I want to do is use urllib to grab the HTML from step 4, but I can't figure out how. What am I missing, and how could I pull this off?

To paraphrase

If I were writing a spider I would want to be able to grab plain ol' HTML (as in what resulted in step 4). I don't want to execute AJAX stuff or any JavaScript at all. I don't want to populate anything dynamically. I just want HTML.

The SEO-friendly site wants me to get what I want, because that's what SEO is all about.

How would one go about getting plain HTML content given the situation I outlined? To do it manually I would turn off JS, navigate to the page and copy the HTML. I want to automate that.

Stuff I've tried

I used Wireshark to look at packet headers, and the GETs sent from my PC in steps 2 and 4 have the same headers. Reading about SEO makes me think this is pretty normal, otherwise techniques such as Hijax wouldn't be used.

Here are the headers my browser sends:

Host: groupon.cl
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive

Here are the headers my script sends:

Accept-Encoding: identity
Host: groupon.cl
Accept-Language: en-gb,en;q=0.5
Connection: close
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0

The differences are:

  • my script sends Connection: close instead of keep-alive. I can't see how this would cause a problem.
  • my script sends Accept-Encoding: identity. This might be the cause of the problem, although I can't really see why the host would use this field to work out the user agent. If I change the encoding to match the browser's request headers then I have trouble decoding the response. I'm working on this now; my current attempt is sketched below.
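
For reference, here is roughly what I'm trying: offer the browser's Accept-Encoding, then decompress the body if the server actually gzips it. A minimal sketch, assuming the server answers with either gzip or plain text (deflate would need zlib handling as well); fetch_html is just an illustrative name.

import gzip
import urllib.request

def fetch_html(sURL, headers):
    # offer the same encodings the browser does
    headers['Accept-Encoding'] = 'gzip, deflate'
    request = urllib.request.Request(sURL, headers=headers)
    response = urllib.request.urlopen(request)
    raw = response.read()
    # decompress only if the server actually gzipped the body
    if response.info().get('Content-Encoding') == 'gzip':
        raw = gzip.decompress(raw)
    return raw.decode('utf-8')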

Watch this space; I'll update the question as new info comes up.
