urllib2 misbehaving with dynamically loaded content
- by Sheena
Some Code
import urllib.request
import urllib.error

headers = {}
headers['user-agent'] = 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0'
headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
headers['Accept-Language'] = 'en-gb,en;q=0.5'
#headers['Accept-Encoding'] = 'gzip, deflate'

# sURL and sFileName are defined elsewhere in the script
request = urllib.request.Request(sURL, headers=headers)
try:
    response = urllib.request.urlopen(request)
except urllib.error.HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: {0}'.format(e.code))
except urllib.error.URLError as e:
    print('We failed to reach a server.')
    print('Reason: {0}'.format(e.reason))
else:
    with open('output/{0}.html'.format(sFileName), 'w') as f:
        f.write(response.read().decode('utf-8'))
A url
http://groupon.cl/descuentos/santiago-centro
The situation
Here's what I did:
1. enable javascript in the browser
2. open the url above and keep an eye on the console
3. disable javascript
4. repeat step 2
5. use urllib2 to grab the webpage and save it to a file
6. enable javascript
7. open the saved file with the browser and observe the console
8. repeat step 7 with javascript off
results
In step 2 I saw that a whole lot of the page content was loaded dynamically using ajax. So the HTML that arrived was a sort of skeleton, and ajax was used to fill in the gaps. This is fine and not at all surprising.
Since the page should be SEO-friendly, it should work fine without JS. In step 4 nothing happens in the console and the skeleton page loads pre-populated, rendering the ajax unnecessary. This is also completely not confusing.
In step 7 the ajax calls are made but fail. This is also OK, since the URLs they use are not local, so the calls are broken. The page looks like the skeleton. This is also great and expected.
In step 8 no ajax calls are made and the skeleton is just a skeleton. I would have thought this should behave very much like step 4.
question
What I want to do is use urllib2 to grab the HTML from step 4, but I can't figure out how.
What am I missing and how could I pull this off?
To paraphrase
If I were writing a spider I would want to be able to grab plain ol' HTML (as in what resulted from step 4). I don't want to execute ajax stuff or any javascript at all. I don't want to populate anything dynamically. I just want HTML.
The SEO-friendly site wants me to get what I want, because that's what SEO is all about.
How would one go about getting plain HTML content given the situation I outlined? To do it manually I would turn off JS, navigate to the page and copy the HTML. I want to automate this.
stuff I've tried
I used Wireshark to look at packet headers, and the GETs sent off from my PC in steps 2 and 4 have the same headers. Reading about SEO makes me think that this is pretty normal; otherwise techniques such as hijax wouldn't be used.
Here are the headers my browser sends:
Host: groupon.cl
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Here are the headers my script sends:
Accept-Encoding: identity
Host: groupon.cl
Accept-Language: en-gb,en;q=0.5
Connection: close
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0
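As a sanity check, the headers attached to the Request object itself can also be listed from Python. A minimal sketch using the same request object built above; note it only shows headers set on the Request, not the ones urllib adds when the request is actually sent (Host, Connection, Accept-Encoding):

# Print the headers explicitly attached to the Request object.
for name, value in request.header_items():
    print('{0}: {1}'.format(name, value))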
The differences are:
my script sends Connection: close instead of keep-alive. I can't see how this would cause a problem
my script sends Accept-Encoding: identity. This might be the cause of the problem. I can't really see why the host would use this field to identify the user agent, though. If I change the encoding to match the browser request headers then I have trouble decoding the response. I'm working on this now (see the sketch below)...
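My understanding so far (a sketch, not yet verified against this site): once Accept-Encoding advertises gzip, the body can come back compressed and has to be gunzipped before the utf-8 decode, something like:

import gzip
import urllib.request

# Assumes `request` was built as in the code above, but with the
# headers['Accept-Encoding'] = 'gzip, deflate' line uncommented.
response = urllib.request.urlopen(request)
raw = response.read()

# The Content-Encoding response header says how the body is encoded;
# only gunzip when the server actually compressed it. (A 'deflate'
# body would need zlib.decompress instead; not handled here.)
if response.info().get('Content-Encoding') == 'gzip':
    html = gzip.decompress(raw).decode('utf-8')
else:
    html = raw.decode('utf-8')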
Watch this space; I'll update the question as new info comes up.