Website crawler/spider to get site map

Posted by ack__ on Super User, published 2012-09-03

I need to retrieve a whole website's map, i.e. a list of all of its pages/URLs.

The crawl must be link-based (no file or directory brute-forcing):

parse the homepage -> retrieve all links -> explore them -> retrieve their links, ...
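The steps above amount to a breadth-first crawl. As a minimal sketch of the idea (not one of the tools discussed; `fetch` is a caller-supplied function that returns a page's HTML, so any HTTP client can be plugged in):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(base_url, html):
    """Return absolute, fragment-free URLs linked from a page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href))[0] for href in parser.links]

def crawl(start_url, fetch, max_pages=1000):
    """Link-based breadth-first crawl: parse a page, queue its
    same-site links, repeat. `fetch(url)` must return HTML text."""
    seen = {start_url}
    queue = deque([start_url])
    site_map = []
    while queue and len(site_map) < max_pages:
        url = queue.popleft()
        site_map.append(url)
        for link in extract_links(url, fetch(url)):
            # Stay on-site and never visit a URL twice.
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return site_map
```

The `fetch` callable is an assumption made for testability; in practice it would wrap `urllib.request.urlopen` or similar, with error handling and politeness delays.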

I also need the ability to detect when a page is a "template", so that the crawler does not retrieve all of its "child pages". For example, if several links to http://example.org/product/viewproduct are found, differing only in their query parameters:

I need to get http://example.org/product/viewproduct only once.
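One way to collapse such child pages is to normalize each URL down to its "template" form before deduplicating. A minimal sketch, assuming (as the viewproduct example suggests) that the variants differ only by query string; a real site might need a different heuristic, e.g. stripping trailing path segments:

```python
from urllib.parse import urlsplit, urlunsplit

def template_key(url):
    """Drop the query string and fragment so that
    viewproduct?id=1 and viewproduct?id=2 map to the same key.
    (Heuristic; adapt to the target site's URL scheme.)"""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def dedupe(urls):
    """Keep one URL per template key, preserving first-seen order."""
    seen = set()
    out = []
    for url in urls:
        key = template_key(url)
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out
```

This would be applied to the crawler's output (or inside its `seen` check) so each template appears in the site map only once.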

I've looked into HTTrack and wget (with its --spider option), but nothing conclusive so far.

The tool should be downloadable, and I'd prefer it to run on Linux. It can be written in any language.

Thanks

© Super User or respective owner
