architecture python question

Posted by tom smith on Stack Overflow See other posts from Stack Overflow or by tom smith
Published on 2010-04-19T19:46:13Z Indexed on 2010/04/19 19:53 UTC
Read the original article Hit count: 347

hi.

creating a distributed crawling python app. it consists of a master server, and associated client apps that will run on client servers. the purpose of the client app is to run across a targeted site, to extract specific data. the clients need to go "deep" within the site, behind multiple levels of forms, so each client is specifically geared towards a given site.

each client app looks something like

main:

parse initial url

call function level1 (data1)

function level1 (data)
 parse the url, for data1
 use the required xpath to get the dom elements
 call the next function
 call level2 (data)


function level2 (data2)
 parse the url, for data2
 use the required xpath to get the dom elements
 call the next function
 call level3

function level3 (dat3)
 parse the url, for data3
 use the required xpath to get the dom elements
 call the next function
 call level4

function level4 (data)
 parse the url, for data4
 use the required xpath to get the dom elements

 at the final function.. 
 --all the data output, and eventually returned to the server        
 --at this point the data has elements from each function...

my question: given that the number of calls that is made to the child function by the current function varies, i'm trying to figure out the best approach.

 each function essentialy fetches a page of content, and then parses 
 the page using a number of different XPath expressions, combined 
 with different regex expressions depending on the site/page.

 if i run a client on a single box, as a sequential process, it'll 
 take awhile, but the load on the box is rather small. i've thought 
 of attempting to implement the child functions as threads from the 
 current function, but that could be a nightmare, as well as quickly 
 bring the "box" to its knees!

 i've thought of breaking the app up in a manner that would allow 
 the master to essentially pass packets to the client boxes, in a 
 way to allow each client/function to be run directly from the 
 master. this process requires a bit of rewrite, but it has a number 
 of advantages. a bunch of redundancy, and speed. it would detect if 
 a section of the process was crashing and restart from that point. 
 but not sure if it would be any faster...

i'm writing the parsing scripts in python..

so... any thoughts/comments would be appreciated...

i can get into a great deal more detail, but didn't want to bore anyone!!

thanks!

tom

© Stack Overflow or respective owner

Related posts about python

Related posts about architecture