Crawling engine architecture - Java/Perl integration

Posted by Bigtwinz on Stack Overflow
Published on 2009-12-22T06:55:55Z Indexed on 2010/03/13 17:45 UTC

I am looking to develop a management and administration solution around our web-crawling Perl scripts. Right now our scripts are saved in SVN and are kicked off manually by sysadmins, devs, etc. Every time we need to retrieve data from new sources, we have to create a ticket with business instructions and goals. As you can imagine, this is not an optimal solution.

There are 3 consistent themes with this system:

the retrieval of data has a "conceptual structure", for lack of a better phrase, i.e. the retrieval of information follows a particular path
we are only looking for very specific information, so we don't really have to worry about extensive crawling for a while (think thousands to tens of thousands of pages vs. millions)
crawls are URL-based instead of site-based

As I enhance this alpha version into a more production-level beta, I am looking to add automation and management of the data retrieval. Additionally, our other systems are Java (which I'm more proficient in), and I'd like to compartmentalize the Perl aspects so we don't have to lean heavily on outside help.

I've evaluated the usual suspects (Nutch, Droid, etc.), but the time spent on modifying those frameworks to suit our specific information retrieval can't be justified.

So I'd like your thoughts regarding the following architecture.

I want to create a solution that:

uses Java as the interface for managing and executing the Perl scripts
uses Java for configuration and data access
sticks with Perl for retrieval
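
One way the "Java manages, Perl retrieves" split could look is a thin Java wrapper that starts a script as a child process through `ProcessBuilder`. This is only a minimal sketch, not a tested design: the class `PerlRunner`, its `Result` type and the timeout handling are hypothetical names made up for illustration, and it assumes a `perl` executable is on the PATH.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not from the original post): runs one script as a child process. */
class PerlRunner {

    /** Exit code plus combined stdout/stderr of a finished script. */
    static final class Result {
        final int exitCode;
        final String output;

        Result(int exitCode, String output) {
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    /** Builds the command line "perl <script> <args...>". Assumes "perl" is on the PATH. */
    static List<String> perlCommand(Path script, List<String> args) {
        List<String> command = new ArrayList<>();
        command.add("perl");
        command.add(script.toString());
        command.addAll(args);
        return command;
    }

    /** Runs the command, waits up to timeoutSeconds, and returns its exit code and output. */
    static Result run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Send output to a temp file so a chatty script cannot fill a pipe buffer and stall.
        Path log = Files.createTempFile("crawl-", ".log");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)      // merge stderr into stdout
                    .redirectOutput(log.toFile())
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();          // kill a hung crawler
                throw new IOException("Timed out after " + timeoutSeconds + "s: " + command);
            }
            String output = new String(Files.readAllBytes(log), StandardCharsets.UTF_8);
            return new Result(process.exitValue(), output);
        } finally {
            Files.deleteIfExists(log);
        }
    }
}
```

A caller would do something like `PerlRunner.run(PerlRunner.perlCommand(script, args), 300)` and then store the exit code and output wherever the Java side keeps its job history. The Perl scripts stay untouched; the only contract between the two sides is the command line and the exit code.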

An example use case would be:

a data analyst delivers us a requirement for crawling
a Perl developer creates the required script and uses this webapp to submit it (the script gets saved to the filesystem)
the script gets kicked off from the webapp with specific parameters ...
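
The submit-then-run flow above might be backed by something as small as the following sketch. Everything here is an assumption for illustration: the class name `ScriptStore`, the rule that script names look like `prices.pl`, and the `--key value` convention for passing parameters are not from the original post.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical store that keeps submitted crawler scripts in one directory. */
class ScriptStore {
    private final Path root;

    ScriptStore(Path root) throws IOException {
        this.root = Files.createDirectories(root.toAbsolutePath().normalize());
    }

    /** Saves the script source under a validated name and returns its path. */
    Path save(String name, String source) throws IOException {
        // Allow only simple names such as "prices.pl"; this also rejects "../" path tricks.
        if (!name.matches("[A-Za-z0-9_-]+\\.pl")) {
            throw new IllegalArgumentException("Invalid script name: " + name);
        }
        Path target = root.resolve(name);
        Files.write(target, source.getBytes(StandardCharsets.UTF_8));
        return target;
    }

    /**
     * Turns webapp parameters into command-line arguments, e.g. {url=X} becomes [--url, X].
     * The "--key value" style is an assumed convention; the scripts must parse it themselves.
     */
    static List<String> toArguments(Map<String, String> params) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            args.add("--" + entry.getKey());
            args.add(entry.getValue());
        }
        return args;
    }
}
```

Passing parameters as separate list elements (rather than gluing them into one shell string) means values that contain spaces or quotes reach the script unchanged, with no shell in between to misinterpret them.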

The webapp should be able to create multiple threads of the Perl script to run multiple crawlers at once.
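
One reading of "multiple threads of the Perl script" is several independent Perl processes, each watched by a Java worker thread from a bounded pool. The sketch below shows that idea with an `ExecutorService`; the class `CrawlerPool`, the per-crawler log file naming and the `-1` sentinel are hypothetical choices, not something the original post specifies.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical pool that runs several crawler commands, at most maxParallel at a time. */
class CrawlerPool {

    /** Sentinel reported when a process could not be started or its wait was interrupted. */
    static final int FAILED = -1;

    /** Runs every command and returns the exit codes in the same order as the input. */
    static List<Integer> runAll(List<List<String>> commands, int maxParallel, Path logDir)
            throws IOException, InterruptedException {
        Files.createDirectories(logDir);
        List<Callable<Integer>> tasks = new ArrayList<>();
        for (int i = 0; i < commands.size(); i++) {
            final List<String> command = commands.get(i);
            final Path log = logDir.resolve("crawler-" + i + ".log");
            // Each task starts one OS process and blocks until that process exits.
            tasks.add(() -> new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.appendTo(log.toFile()))
                    .start()
                    .waitFor());
        }
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        try {
            List<Integer> exitCodes = new ArrayList<>();
            // invokeAll blocks until every task has finished and keeps the input order.
            for (Future<Integer> result : pool.invokeAll(tasks)) {
                try {
                    exitCodes.add(result.get());
                } catch (ExecutionException e) {
                    exitCodes.add(FAILED);
                }
            }
            return exitCodes;
        } finally {
            pool.shutdown();
        }
    }
}
```

The pool size caps how many crawlers hit the network at once, which is probably the knob that matters most at a scale of thousands to tens of thousands of pages. The Java threads do no crawling themselves; they only wait on the child processes.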

So my questions are:

What do you think?
How solid is the integration between Java and Perl, specifically when calling Perl from Java?
Has anyone used such a system, one that is partly a Perl script repository?

The goal really is to not have a whole bunch of unorganized Perl scripts, and to put some management and organization around our information retrieval. Also, I know I can use Perl to do the web part of what we want, but as I mentioned before, I'm trying to keep Perl focused. But if that seems ass-backwards, I'm not averse to making it an all-Perl solution.

Open to any and all suggestions and opinions.

Thanks

Filed under: java, perl, webcrawling, nutch, hadoop