Crawl websites from a Java web application without using bin/nutch

Posted by Marcel on Stack Overflow
Published on 2010-05-17T09:12:22Z Indexed on 2010/06/13 15:12 UTC

Filed under: webapp, nutch, crawl

I am trying to use Nutch (1.1) without bin/nutch from my (Java) Mojarra 2.0.2 webapp. I have searched Google for examples, but found none that show how to do this. I get an exception and the job fails (I think the cause has something to do with Hadoop). Here is my code:

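  public void run() throws Exception {
      // Arguments in the order they would be passed to "bin/nutch crawl":
      // the seed-URL directory first, then the options.
      // Note: both paths assume JSFUtils.getWebAppRoot() ends with a file separator.
      final String[] args = new String[] {
          String.format("%s%s%s%s", JSFUtils.getWebAppRoot(), "nutch", File.separator, DIRECTORY_URLS),
          "-dir", String.format("%s%s%s%s", JSFUtils.getWebAppRoot(), "nutch", File.separator, DIRECTORY_CRAWL),
          "-threads", this.preferences.get("threads"),
          "-depth", this.preferences.get("depth"),
          "-topN", this.preferences.get("topN"),
          "-solr", this.preferences.get("solr")
      };
      // Run the crawl in-process instead of through the bin/nutch script.
      Crawl.main(args);
  }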

And part of the log output:

10/05/17 10:42:54 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
10/05/17 10:42:54 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/05/17 10:42:54 INFO mapred.FileInputFormat: Total input paths to process : 1
10/05/17 10:42:54 INFO mapred.JobClient: Running job: job_local_0001
10/05/17 10:42:54 INFO mapred.FileInputFormat: Total input paths to process : 1
10/05/17 10:42:55 INFO mapred.MapTask: numReduceTasks: 1
10/05/17 10:42:55 INFO mapred.MapTask: io.sort.mb = 100
java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:211)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
        at lan.localhost.process.NutchCrawling.run(NutchCrawling.java:108)
        at lan.localhost.main.Index.indexing(Index.java:71)
        at lan.localhost.bean.FeedingBean.actionStart(FeedingBean.java:25)
        ....

Can someone help me or tell me how I can crawl from a Java application? I have increased Xms to 256m and Xmx to 768m, but nothing changed.

Best regards, Marcel
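One detail in run() that is easy to check independently of Nutch and Hadoop is how the argument array is assembled. The sketch below is only an illustration, not a confirmed fix for the "Job failed!" exception: it joins the paths with java.io.File, so the result does not depend on whether the web app root ends with a separator, and it rejects missing preference values before anything is handed to the crawler. The class name CrawlArgs and the methods path and build are hypothetical helpers and are not part of Nutch. The option names are taken from the code in the question.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch: assembles the argument array
// used in the question's run() method.
class CrawlArgs {

    // Joins path segments through java.io.File, so the result is the same
    // whether or not the root ends with a file separator.
    static String path(String root, String... segments) {
        File file = new File(root);
        for (String segment : segments) {
            file = new File(file, segment);
        }
        return file.getPath();
    }

    // Builds the arguments in the same order as the question's code:
    // seed-URL directory, -dir, -threads, -depth, -topN, -solr.
    // Throws IllegalArgumentException if a preference is missing or blank.
    static String[] build(String webAppRoot, String urlsDir, String crawlDir,
                          Map<String, String> preferences) {
        List<String> args = new ArrayList<String>();
        args.add(path(webAppRoot, "nutch", urlsDir));
        args.add("-dir");
        args.add(path(webAppRoot, "nutch", crawlDir));
        for (String key : new String[] {"threads", "depth", "topN", "solr"}) {
            String value = preferences.get(key);
            if (value == null || value.trim().isEmpty()) {
                throw new IllegalArgumentException("Missing crawl preference: " + key);
            }
            args.add("-" + key);
            args.add(value.trim());
        }
        return args.toArray(new String[0]);
    }
}
```

In run(), the array could then be obtained with something like CrawlArgs.build(JSFUtils.getWebAppRoot(), DIRECTORY_URLS, DIRECTORY_CRAWL, preferences), assuming preferences is a Map<String, String>, and passed to Crawl.main as before.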
