Search Results

Search found 1 results on 1 pages for 'itemio'.

Page 1/1 | 1 

  • Storing URLs while Spidering

    - by itemio
    I created a little web spider in python which I'm using to collect URLs. I'm not interested in the content. Right now I'm keeping all the visited URLs in a set in memory, because I don't want my spider to visit URLs twice. Of course that's a very limited way of accomplishing this. So what's the best way to keep track of my visited URLs? Should I use a database? * which one? MySQL, sqlite, postgre? * how should I save the URLs? As a primary key trying to insert every URL before visiting it? Or should I write them to a file? * one file? * multiple files? how should I design the file-structure? I'm sure there are books and a lot of papers on this or similar topics. Can you give me some advice what I should read?

    Read the article

1