Nutch - how to crawl in small batches?

Posted by Yurish on Stack Overflow
Published on 2010-03-29T12:40:01Z Indexed on 2010/03/29 12:43 UTC

Hi everyone!

I am stuck! I can't get Nutch to crawl in small batches. I start it with the bin/nutch crawl command, passing -depth 7 and -topN 10000, and it never finishes. It stops only when my hard disk runs out of free space. What I need to do:

  1. Start crawling from my seed URLs, with the ability to follow outlinks.
  2. Crawl 20000 pages, then index them.
  3. Crawl another 20000 pages, index them, and merge the result with the first index.
  4. Repeat step 3 n times.
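
To make these steps concrete, here is a rough sketch of the loop I have in mind, written as a small Python driver. It assumes the step-by-step Nutch 1.x commands (inject, generate, fetch, updatedb, invertlinks, index, merge) behave as I understand them and that the fetcher also parses pages. If parsing is disabled, a separate parse step would be needed after fetch. The function names, the directory layout under crawl_dir and the per-batch index naming are my own assumptions, not anything Nutch prescribes.

```python
import os
import subprocess
from typing import Callable, List

# A runner takes one command line (as a list) and executes it.
Runner = Callable[[List[str]], object]


def newest_segment(segments_dir: str) -> str:
    # Nutch names each segment after its creation timestamp, so the
    # lexicographically last directory is the one just generated.
    return os.path.join(segments_dir, sorted(os.listdir(segments_dir))[-1])


def crawl_batch(crawl_dir: str, top_n: int = 20000,
                nutch: str = "bin/nutch",
                run: Runner = subprocess.check_call) -> str:
    """Fetch and index one batch of at most top_n pages (hypothetical helper)."""
    crawldb = os.path.join(crawl_dir, "crawldb")
    linkdb = os.path.join(crawl_dir, "linkdb")
    segments = os.path.join(crawl_dir, "segments")

    # Select the next top_n URLs that are due for fetching.
    run([nutch, "generate", crawldb, segments, "-topN", str(top_n)])
    segment = newest_segment(segments)

    # Assumption: the fetcher parses too; otherwise add a "parse" step here.
    run([nutch, "fetch", segment])
    # Feed newly discovered outlinks back, so the next batch goes further.
    run([nutch, "updatedb", crawldb, segment])
    run([nutch, "invertlinks", linkdb, segment])

    # One index directory per batch, named after the segment (my own layout).
    part_index = os.path.join(crawl_dir, "indexes", os.path.basename(segment))
    run([nutch, "index", part_index, crawldb, linkdb, segment])
    return part_index


def crawl_in_batches(seed_dir: str, crawl_dir: str, batches: int,
                     top_n: int = 20000, nutch: str = "bin/nutch",
                     run: Runner = subprocess.check_call) -> List[str]:
    crawldb = os.path.join(crawl_dir, "crawldb")
    # Inject the seeds only on the very first run; later runs reuse the
    # existing crawldb and therefore continue instead of starting over.
    if not os.path.isdir(crawldb):
        run([nutch, "inject", crawldb, seed_dir])
    return [crawl_batch(crawl_dir, top_n, nutch, run) for _ in range(batches)]


def merge_indexes(crawl_dir: str, merged_name: str = "index",
                  nutch: str = "bin/nutch",
                  run: Runner = subprocess.check_call) -> None:
    # Merge all per-batch indexes into one; the output must not exist yet.
    indexes = os.path.join(crawl_dir, "indexes")
    parts = [os.path.join(indexes, d) for d in sorted(os.listdir(indexes))]
    run([nutch, "merge", os.path.join(crawl_dir, merged_name)] + parts)
```

With this layout, something like crawl_in_batches("urls", "crawl", batches=1) could be called repeatedly, followed by merge_indexes("crawl", "index-1") with a fresh output name each time. The point of the sketch is that the crawldb is kept between runs and the seeds are injected only once.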

I also tried the scripts I found in the wiki, but none of them continue the crawl. If I run them again, they start everything from the beginning, and when the script finishes I have the same index I had when I started. What I need is to continue my crawl from where it left off.

Any help would be very useful!

© Stack Overflow or respective owner