Download HTML and Images with WGet without first few lines

Posted by St. John Johnson on Stack Overflow
Published on 2010-03-31T15:30:58Z Indexed on 2010/03/31 15:33 UTC

Filed under: curl | wget | html

I'm attempting to use wget with the -p option to download specific documents and the images linked in the HTML.

The problem is that the site hosting the HTML puts some non-HTML information before the HTML. As a result, wget does not interpret the document as HTML and does not search it for images.

Is there a way to have wget strip the first X lines and/or force searching for images?

Example URL:

http://www.sec.gov/Archives/edgar/data/13239/000119312510070346/ds4.htm

First Lines of Content:

<DOCUMENT>
<TYPE>S-4
<SEQUENCE>1
<FILENAME>ds4.htm
<DESCRIPTION>FORM S-4
<TEXT>
<HTML><HEAD>
<TITLE>Form S-4</TITLE>

Last Lines of Content:

</BODY></HTML>
</TEXT>
</DOCUMENT>
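
One possible workaround, sketched here in Python rather than wget itself: fetch the document by any means, cut away everything outside the <HTML>...</HTML> block, and collect the image URLs from what remains. The function and class names below (strip_wrapper, ImageCollector, image_urls) are hypothetical helpers invented for this illustration, and the sketch assumes the wrapper always surrounds a single <HTML>...</HTML> block as in the sample above. The download step is left out.

```python
import re
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


def strip_wrapper(raw: str) -> str:
    """Return only the <HTML>...</HTML> part of the document.

    Assumption: the non-HTML header and footer surround exactly one
    <HTML>...</HTML> block. If no such block is found, the input is
    returned unchanged.
    """
    match = re.search(r"<html\b.*?</html\s*>", raw, flags=re.IGNORECASE | re.DOTALL)
    return match.group(0) if match else raw


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that is fed in."""

    def __init__(self) -> None:
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <IMG SRC=...> matches.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_urls(raw: str, base_url: str) -> List[str]:
    """Return absolute URLs for all images referenced in the wrapped HTML."""
    collector = ImageCollector()
    collector.feed(strip_wrapper(raw))
    # Resolve relative image paths against the URL the document came from.
    return [urljoin(base_url, src) for src in collector.sources]
```

The resulting list can be written to a file and handed to wget with -i, or each URL can be fetched directly. wget also has a --force-html option that treats a local input file (-i) as HTML, with --base supplying the original URL for relative links, so saving the cleaned HTML and pointing wget at it may be worth trying as well.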

© Stack Overflow or respective owner
