recursive wget with hotlinked requisites

Posted by dongle on Stack Overflow, 2012-10-17
I often use wget to mirror very large websites. Sites that contain hotlinked content (whether images, video, CSS, or JS) pose a problem: I can't seem to tell wget to grab page requisites that live on other hosts without having the crawl also follow hyperlinks to those hosts.

For example, let's look at this page: https://dl.dropbox.com/u/11471672/wget-all-the-things.html

Let's pretend that this is a large site that I would like to mirror completely, with all page requisites, even the hotlinked ones.

wget -e robots=off -r -l inf -pk https://dl.dropbox.com/u/11471672/wget-all-the-things.html

^^ gets everything but the hotlinked image

wget -e robots=off -r -l inf -pk -H https://dl.dropbox.com/u/11471672/wget-all-the-things.html

^^ gets everything, including hotlinked image, but goes wildly out of control, proceeding to download the entire web
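When the hotlink hosts happen to be known ahead of time, -H can at least be fenced in with -D, which restricts spanning to a comma-separated list of domains. A sketch, with hotlink-host.example as a hypothetical stand-in for wherever the hotlinked image actually lives:

# Sketch: hotlink-host.example is a hypothetical stand-in for the
# host serving the hotlinked image; -D limits -H to the listed domains
wget -e robots=off -r -l inf -pk -H \
  -D dropbox.com,hotlink-host.example \
  https://dl.dropbox.com/u/11471672/wget-all-the-things.html

This keeps the crawl from wandering off across the web, but it only works when I can enumerate the hosts up front, and -D can't distinguish a requisite on those domains from a hyperlink to them.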

wget -e robots=off -r -l inf -pk -H --ignore-tags=a https://dl.dropbox.com/u/11471672/wget-all-the-things.html

^^ gets the first page, including both the hotlinked and the local image; it does not follow the hyperlink to the out-of-scope site, but obviously it also does not follow the hyperlink to the next page of the site.
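The closest I've gotten with wget alone is a clumsy two-pass approach: mirror the site without -H first, then refetch each saved page with -p -H but without -r, so requisites are pulled from any host while no hyperlinks are followed at all. A rough sketch (file names and the grep pattern are illustrative only):

# Pass 1: mirror the site itself, staying on-host, logging as we go
wget -e robots=off -r -l inf -pk -o pass1.log \
  https://dl.dropbox.com/u/11471672/wget-all-the-things.html

# Collect the URLs of the pages that were fetched (pattern may need tuning)
grep -oE 'https://dl\.dropbox\.com/[^ ]+\.html' pass1.log | sort -u > pages.txt

# Pass 2: -p -H without -r fetches each page plus all of its requisites,
# hotlinked ones included, and follows no hyperlinks at all
wget -e robots=off -pk -H -i pages.txt

--warc-file could be added to each pass, but then the record is split across two WARCs, which is part of why I'd prefer a single-crawl solution.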

I know there are various other tools and methods of accomplishing this (HTTrack and Heritrix both let the user distinguish hotlinked content on other hosts from hyperlinks to other hosts), but I'd like to see whether this is possible with wget. Ideally it would not be done in post-processing, as I want the external content, requests, and headers included in the WARC file I'm outputting.
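For reference, HTTrack expresses exactly this distinction with its -n/--near option, which fetches non-HTML files "near" a page even when they are located outside the mirrored scope, without following external hyperlinks. A sketch (check httrack --help on your version for the exact spelling):

httrack "https://dl.dropbox.com/u/11471672/wget-all-the-things.html" \
  -O ./mirror --near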
