Website crawler/spider to get site map

Posted by ack__ on Super User, published 2012-09-03

I need to retrieve a whole website's map, i.e. a list of all of its pages/URLs.

The crawl must be link-based (no file or directory brute-forcing):

parse the homepage -> retrieve all links -> explore them -> retrieve their links, ...
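The steps above amount to a breadth-first crawl. As a minimal sketch of the idea (not one of the tools discussed; `fetch` is a caller-supplied function that returns a page's HTML, so any HTTP client can be plugged in):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(base_url, html):
    """Return absolute, fragment-free URLs linked from a page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href))[0] for href in parser.links]

def crawl(start_url, fetch, max_pages=1000):
    """Link-based breadth-first crawl: parse a page, queue its
    same-site links, repeat. `fetch(url)` must return HTML text."""
    seen = {start_url}
    queue = deque([start_url])
    site_map = []
    while queue and len(site_map) < max_pages:
        url = queue.popleft()
        site_map.append(url)
        for link in extract_links(url, fetch(url)):
            # Stay on-site and never visit a URL twice.
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return site_map
```

The `fetch` callable is an assumption made for testability; in practice it would wrap `urllib.request.urlopen` or similar, with error handling and politeness delays.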

I also need the ability to detect when a page is a "template", so that the crawler does not retrieve all of its "child pages". For example, if several links to http://example.org/product/viewproduct are found, differing only in their query parameters:

I need to get http://example.org/product/viewproduct only once.
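One way to collapse such child pages is to normalize each URL down to its "template" form before deduplicating. A minimal sketch, assuming (as the viewproduct example suggests) that the variants differ only by query string; a real site might need a different heuristic, e.g. stripping trailing path segments:

```python
from urllib.parse import urlsplit, urlunsplit

def template_key(url):
    """Drop the query string and fragment so that
    viewproduct?id=1 and viewproduct?id=2 map to the same key.
    (Heuristic; adapt to the target site's URL scheme.)"""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def dedupe(urls):
    """Keep one URL per template key, preserving first-seen order."""
    seen = set()
    out = []
    for url in urls:
        key = template_key(url)
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out
```

This would be applied to the crawler's output (or inside its `seen` check) so each template appears in the site map only once.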

I've looked into HTTrack and wget (with its --spider option), but nothing conclusive so far.

The tool should be downloadable, and I'd prefer it to run on Linux. It can be written in any language.

Thanks

© Super User or respective owner
