screen scraper templates for various websites

Posted by intuited on Super User
Published on 2010-07-21T20:00:20Z Indexed on 2011/02/02 23:27 UTC

I'm looking specifically for a convenient way to locally archive posts from this and other similar sites. I'd like to separate the question itself from the answers, or maybe crop the question and store it, keeping the page title. Obviously I don't need to store the menu or the various other site interface chrome.

The best way to do this seems to be to associate an XSLT template with a pattern that matches the URL, then use that template to pull out the relevant information and format it.

My two-part question:

  • Is there a tool specifically built for this task? That is, something that takes a URL, checks it against a map from path-matching expressions to templates, and outputs the result of applying the matching template to that resource?

    xmlto seems to be most of the way there, and could probably just be called from a script that does the pattern-matching, but something already integrated would be more convenient.

  • Is such a URL_pattern-to-XSLT_template map publicly available somewhere?
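
To make the first bullet concrete, the dispatch step could be sketched roughly as below. This is only an illustration under stated assumptions: the patterns, the stylesheet paths, and the function names `select_template` and `build_command` are hypothetical, and the command is built for `xsltproc` (with its `--html` input option) operating on a locally saved copy of the page, rather than for xmlto.

```python
import re
from urllib.parse import urlsplit

# Hypothetical map from URL patterns to stylesheet paths. The patterns and
# file names are illustrative assumptions, not an existing public map.
TEMPLATE_MAP = [
    (re.compile(r"^superuser\.com/questions/\d+(/.*)?$"),
     "templates/stackexchange-question.xsl"),
    (re.compile(r"^stackoverflow\.com/questions/\d+(/.*)?$"),
     "templates/stackexchange-question.xsl"),
]


def select_template(url):
    """Return the stylesheet for the first pattern matching url, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Match against "host/path" so one pattern covers http and https.
    target = host + parts.path
    for pattern, stylesheet in TEMPLATE_MAP:
        if pattern.match(target):
            return stylesheet
    return None


def build_command(url, saved_page):
    """Build an xsltproc command line for a locally saved copy of url."""
    stylesheet = select_template(url)
    if stylesheet is None:
        raise LookupError(f"no template registered for {url}")
    # --html asks xsltproc to parse the input as HTML rather than XML.
    return ["xsltproc", "--html", stylesheet, saved_page]
```

The returned list could be passed to `subprocess.run` once the page has been saved locally; the sketch stops short of fetching or transforming anything, since that part depends on the tool chosen.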

Question 2.5: Is it legal to do this with sites like this one that have public licenses on their content?

© Super User or respective owner
