Downloading a web page and all of its resource files in Python

Posted by Mark on Stack Overflow
Published on 2009-05-09T21:28:26Z Indexed on 2010/05/14 21:24 UTC


I want to be able to download a page and all of its associated resources (images, style sheets, script files, etc.) using Python. I am (somewhat) familiar with urllib2 and know how to download individual URLs, but before I start hacking at BeautifulSoup + urllib2 I wanted to be sure there wasn't already a Python equivalent to "wget --page-requisites http://www.google.com".
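If nothing ready-made turns up, here is a minimal sketch of the BeautifulSoup + urllib2 idea using only the standard library (written for Python 3's `urllib.request` and `html.parser`; the names `ResourceExtractor` and `fetch_page_requisites` are illustrative, not from any existing package):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ResourceExtractor(HTMLParser):
    """Collect absolute URLs of images, stylesheets and scripts from a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and "src" in attrs:
            self.resources.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("rel") == "stylesheet" and "href" in attrs:
            self.resources.append(urljoin(self.base_url, attrs["href"]))


def fetch_page_requisites(url):
    """Hypothetical helper: download a page, then every resource it references."""
    html = urlopen(url).read().decode("utf-8", errors="replace")
    parser = ResourceExtractor(url)
    parser.feed(html)
    # Fetch each resource sequentially; a real tool would also handle
    # CSS @import rules, srcset attributes, errors, and duplicates.
    for resource in parser.resources:
        urlopen(resource).read()
    return parser.resources
```

This deliberately ignores the corner cases wget handles (redirects, cookies, CSS-referenced images), but it shows the overall shape: parse the HTML, resolve each relative URL against the page URL, and fetch them one by one.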

Specifically, I am interested in gathering statistics on how long it takes to download an entire web page, including all of its resources.
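For the timing side, a simple wrapper around `time.perf_counter` is enough to record per-URL download times (again a sketch; `time_call` and `profile_page` are made-up names, and a real measurement would want retries and parallel fetches):

```python
import time
from urllib.request import urlopen


def time_call(func, *args):
    """Time a single call; returns (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def profile_page(url, resource_urls):
    """Hypothetical: fetch the page and each resource, recording per-URL timings."""
    timings = {}
    for u in [url] + list(resource_urls):
        _, elapsed = time_call(lambda u=u: urlopen(u).read())
        timings[u] = elapsed
    return timings  # sum(timings.values()) gives the total sequential time
```

Note that sequential timing overstates what a browser experiences, since browsers fetch resources concurrently; summing per-URL times gives a worst-case figure.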

Thanks, Mark

© Stack Overflow or respective owner
