Spider a Website and Return URLs Only

Posted by Rob Wilkerson on Stack Overflow
Published on 2010-05-10T16:37:18Z

Filed under: wget | spider | grep | uri

I'm not quite sure how best to articulate this, but I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, just a simple list of URIs. I can get reasonably close with Wget's --spider option, but when I pipe that output through grep, I can't seem to find the right magic to make it work:

wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:'

The grep filter seems to have absolutely no effect on the wget output. Have I got something wrong, or is there another tool I should try that's more geared towards providing this kind of limited result set?

Thanks.

UPDATE

So I just found out offline that, by default, wget writes its log output to stderr. I missed that in the man pages (in fact, I still haven't found it, if it's in there). Once I redirected stderr to stdout, I got closer to what I need:

wget --spider --force-html -r -l1 http://somesite.com 2>&1 | grep 'Saving to:'
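
To see why the redirect matters without touching the network, here is a minimal sketch. The function log_to_stderr is a hypothetical stand-in for wget, not part of any tool: it only writes one log-style line to stderr. A pipe carries stdout only, so grep gets nothing until 2>&1 merges the two streams.

```shell
#!/bin/sh
# Hypothetical stand-in for wget: writes its log line to stderr only.
log_to_stderr() {
    echo "Saving to: 'index.html'" >&2
}

# Without a redirect the pipe carries only stdout, so grep receives nothing.
# The line still shows up on the terminal, unfiltered, via stderr.
# grep -c exits non-zero on zero matches; '|| true' keeps 'set -e' scripts alive.
without_redirect=$(log_to_stderr | grep -c 'Saving to:' || true)

# With 2>&1, stderr is merged into stdout before the pipe, so grep sees the line.
with_redirect=$(log_to_stderr 2>&1 | grep -c 'Saving to:' || true)

echo "without 2>&1: $without_redirect match(es)"
echo "with 2>&1:    $with_redirect match(es)"
```

The unfiltered line that still appears in the first case is the same symptom as in the question: the output reaches the terminal directly and never passes through grep.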

I'd still be interested in other/better means for doing this kind of thing, if any exist.
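
One possible refinement, sketched under an assumption: wget builds I have seen start each request with a log line of the form "--DATE TIME--  URL". If that holds, the URL is the third whitespace-separated field and can be pulled out directly. The helper name extract_urls and the sample log text are made up for illustration, and the log format may differ between wget versions or locales.

```shell
#!/bin/sh
# Pull the URL out of each wget request line, e.g.
#   --2010-05-10 16:37:18--  http://somesite.com/
# Assumes that log format; adjust the pattern if your wget prints it differently.
extract_urls() {
    grep '^--' | awk '{ print $3 }' | sort -u
}

# Illustrative log text standing in for real wget output (not captured from a run).
sample_log='--2010-05-10 16:37:18--  http://somesite.com/
Resolving somesite.com...
HTTP request sent, awaiting response... 200 OK
--2010-05-10 16:37:19--  http://somesite.com/about.html
--2010-05-10 16:37:20--  http://somesite.com/'

# Keep only the request lines, take the URL field, and drop duplicates.
urls=$(printf '%s\n' "$sample_log" | extract_urls)
printf '%s\n' "$urls"

# Against a live site, the same filter would be used as:
#   wget --spider --force-html -r -l1 http://somesite.com 2>&1 | extract_urls
```

Filtering on the request lines rather than on 'Saving to:' lists every URI wget tries, including ones it never saves.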
