Search Results

Search found 30 results on 2 pages for 'nutch'.

Page 1/2 | 1 2  | Next Page >

  • Problem with running the Nutch command from PHP exec()

    - by Annibigi
    My Nutch directory is /home/myserv/nutch/nutch-1.0/ and my PHP application is in /home/myserv/www/. A PHP file in /home/myserv/www/ runs an exec() call to launch a Nutch command. The PHP code looks like: $output = exec("bin/nutch all"); When I run the command from the command line I need to be in the /home/myserv/nutch/nutch-1.0/ directory. When I try to run it through PHP's exec(), I just can't seem to make it execute. I have tried giving the full path, like below, but nothing works: $output = exec("/home/myserv/nutch/nutch-1.0/bin/nutch all"); Desperately looking for help.
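    A likely culprit is the working directory and environment of the process exec() spawns: bin/nutch expects to be launched from the Nutch home, and the web server's user also needs execute permission and a sane JAVA_HOME. Since exec() goes through the shell, the usual PHP-side workaround is exec("cd /home/myserv/nutch/nutch-1.0 && bin/nutch all 2>&1", $out, $ret). As a language-neutral sketch of the same working-directory point (Java, to match most examples in this listing; paths are placeholders from the question):

      import java.io.BufferedReader;
      import java.io.File;
      import java.io.InputStreamReader;

      public class RunNutch {
          public static void main(String[] args) throws Exception {
              ProcessBuilder pb = new ProcessBuilder("bin/nutch", "all");
              pb.directory(new File("/home/myserv/nutch/nutch-1.0"));  // run from the Nutch home
              pb.redirectErrorStream(true);                            // merge stderr into stdout
              Process p = pb.start();
              BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
              for (String line; (line = r.readLine()) != null; ) {
                  System.out.println(line);                            // surface Nutch's own error output
              }
              System.out.println("exit code: " + p.waitFor());
          }
      }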

    Read the article

  • nutch spell checker | nutch navigation filter

    - by Sascha
    Hello, I am trying to configure the Nutch 1.0 search engine. First, I need to integrate a spell checker or something similar; is there a plugin available? My next question is how to exclude HTML tags like "" so that navigation is not part of the index. Thanks for all answers.

    Read the article

  • posting nutch data into a BASIC auth secured Solr instance

    - by mlathe
    Hi. I've secured a Solr instance using BASIC auth, roughly as shown here: http://blog.comtaste.com/2009/02/securing_your_solr_server_on_t.html Now I'm trying to update my batch processes to push data into the authenticated instance. The ones using curl are easy, but I also have a Nutch crawl that uses the "solrindex" command to push data into Solr. When I do that I get this error:
    2010-02-22 12:09:28,226 INFO auth.AuthChallengeProcessor - basic authentication scheme selected
    2010-02-22 12:09:28,229 INFO httpclient.HttpMethodDirector - No credentials available for BASIC 'Tomcat Manager Application'@ninja:5500
    2010-02-22 12:09:28,236 WARN mapred.LocalJobRunner - job_local_0001
    org.apache.solr.common.SolrException: Unauthorized
    Unauthorized request: http://ninja:5500/solr/foo/update?wt=javabin&version=2.2
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:343)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:183)
        at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:217)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:48)
        at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:69)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:447)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
    2010-02-22 12:09:29,134 FATAL solr.SolrIndexer - SolrIndexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
        at org.apache.nutch.indexer.solr.SolrIndexer.indexSolr(SolrIndexer.java:73)
        at org.apache.nutch.indexer.solr.SolrIndexer.run(SolrIndexer.java:95)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.solr.SolrIndexer.main(SolrIndexer.java:104)
    Apparently Nutch uses SolrJ to push the content, and after going through the SolrJ code it's clear that it uses commons-httpclient without providing a way to set the credentials. Here are my questions: Is this possible to do, i.e. push from Nutch into a BASIC-auth-secured Solr instance? Is it possible to tell commons-httpclient about a credential without explicitly doing an _httpclient.getState().setCredentials(...)? Any other ideas? One idea I had was to use an IP-filtering Valve for just the "update" Solr web services; that would mean you could only make an update call from certain nodes. Thanks.
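    For reference, SolrJ itself can talk to a BASIC-auth Solr if you hand it a pre-configured commons-httpclient 3.x client; Nutch's SolrWriter constructs its own CommonsHttpSolrServer from the solrindex URL, so wiring this in would mean patching that class (or restricting /update by IP, as suggested above). A minimal SolrJ-side sketch, assuming SolrJ 1.4, with the host and core taken from the log and placeholder credentials:

      import org.apache.commons.httpclient.HttpClient;
      import org.apache.commons.httpclient.UsernamePasswordCredentials;
      import org.apache.commons.httpclient.auth.AuthScope;
      import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
      import org.apache.solr.common.SolrInputDocument;

      public class SecuredSolrPush {
          public static void main(String[] args) throws Exception {
              // commons-httpclient 3.x client carrying the BASIC credentials
              HttpClient http = new HttpClient();
              http.getState().setCredentials(
                      new AuthScope("ninja", 5500),                        // host/port from the log above
                      new UsernamePasswordCredentials("user", "secret"));  // placeholder credentials
              http.getParams().setAuthenticationPreemptive(true);          // send the auth header up front

              // hand the authenticated client to SolrJ instead of letting it build its own
              CommonsHttpSolrServer solr =
                      new CommonsHttpSolrServer("http://ninja:5500/solr/foo", http);

              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", "test-1");   // placeholder document
              solr.add(doc);
              solr.commit();
          }
      }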

    Read the article

  • crawl websites out of java web application without using bin/nutch

    - by Marcel
    Hi. I am trying to use Nutch 1.1 without bin/nutch from my (Java) Mojarra 2.0.2 web app. I have searched Google for examples, but there are none showing how to do this. I get an exception and the job fails (I think it is something in Hadoop). Here is my code:
    public void run() throws Exception {
        final String[] args = new String[] {
            String.format("%s%s%s%s", JSFUtils.getWebAppRoot(), "nutch", File.separator, DIRECTORY_URLS),
            "-dir", String.format("%s%s%s%s", JSFUtils.getWebAppRoot(), "nutch", File.separator, DIRECTORY_CRAWL),
            "-threads", this.preferences.get("threads"),
            "-depth", this.preferences.get("depth"),
            "-topN", this.preferences.get("topN"),
            "-solr", this.preferences.get("solr")
        };
        Crawl.main(args);
    }
    And part of the logging:
    10/05/17 10:42:54 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
    10/05/17 10:42:54 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    10/05/17 10:42:54 INFO mapred.FileInputFormat: Total input paths to process : 1
    10/05/17 10:42:54 INFO mapred.JobClient: Running job: job_local_0001
    10/05/17 10:42:54 INFO mapred.FileInputFormat: Total input paths to process : 1
    10/05/17 10:42:55 INFO mapred.MapTask: numReduceTasks: 1
    10/05/17 10:42:55 INFO mapred.MapTask: io.sort.mb = 100
    java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:211)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
        at lan.localhost.process.NutchCrawling.run(NutchCrawling.java:108)
        at lan.localhost.main.Index.indexing(Index.java:71)
        at lan.localhost.bean.FeedingBean.actionStart(FeedingBean.java:25)
    ....
    Can someone help me or tell me how I can crawl from a Java application? I have increased Xms to 256m and Xmx to 768m, but nothing changed. Best regards, Marcel.

    Read the article

  • Nutch search always returns 0 results

    - by darbour
    I have set up Nutch 1.0 on a cluster and it has successfully crawled. I copied the crawl directory to the local filesystem using dfs -copyToLocal and set the value of searcher.dir in the nutch-site.xml file in the Tomcat directory to point to that directory. Still, when I try to search I receive 0 results. Any help would be greatly appreciated.

    Read the article

  • Nutch - how to crawl in small batches?

    - by Yurish
    Hi everyone! I am stuck: I can't get Nutch to crawl in small batches. I start it with the bin/nutch crawl command with the parameters -depth 7 and -topN 10000, and it never ends; it stops only when my HDD runs out of space. What I need to do: (1) start crawling my seeds with the possibility of following outlinks; (2) crawl 20,000 pages, then index them; (3) crawl another 20,000 pages, index them and merge with the first index; (4) repeat step 3 n times. I also tried the scripts found in the wiki, but none of the scripts I found continue a crawl. If I run them again, they start everything from the beginning, and at the end of the script I have the same index I had when I started. But I need to continue my crawl. Some help would be very useful!
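    One pattern that fits this, sketched rather than prescribed: skip the all-in-one crawl command and drive the individual Nutch steps yourself, capping each round with -topN, then index and merge after every round. A rough Java sketch of one bounded round, assuming a conventional crawl/crawldb and crawl/segments layout under a placeholder NUTCH_HOME:

      import java.io.File;
      import java.util.Arrays;

      public class CrawlRound {
          // Placeholder install location; adjust to your setup.
          static final File NUTCH_HOME = new File("/home/myserv/nutch/nutch-1.0");

          // Run one bin/nutch subcommand from the Nutch home and fail fast on errors.
          static void nutch(String... args) throws Exception {
              String[] cmd = new String[args.length + 1];
              cmd[0] = "bin/nutch";
              System.arraycopy(args, 0, cmd, 1, args.length);
              ProcessBuilder pb = new ProcessBuilder(cmd);
              pb.directory(NUTCH_HOME);
              pb.inheritIO();
              if (pb.start().waitFor() != 0) {
                  throw new RuntimeException("failed: " + Arrays.toString(cmd));
              }
          }

          public static void main(String[] args) throws Exception {
              // one bounded round: at most 20,000 URLs
              nutch("generate", "crawl/crawldb", "crawl/segments", "-topN", "20000");

              // generate creates a new timestamped segment; pick the newest one
              File[] segments = new File(NUTCH_HOME, "crawl/segments").listFiles();
              Arrays.sort(segments);
              String segment = "crawl/segments/" + segments[segments.length - 1].getName();

              nutch("fetch", segment);
              nutch("updatedb", "crawl/crawldb", segment);
              // then invert links and index this segment, merging with the previous index,
              // and loop the whole round as many times as needed
          }
      }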

    Read the article

  • Do not filter outlinks in Nutch?

    - by sigpwned
    I'm currently trying to perform a deep crawl within a small list of sites. To accomplish this, I updated conf/domain-urlfilter.txt with the domains of the sites I wish to scrape, which worked nicely. However, I found that not only were the links crawled at every step filtered, but the outlinks captured from each page crawled were filtered as well. Is there a way to avoid filtering captured outlinks while still filtering crawled URLs?

    Read the article

  • Can I run a site search like Lucene on a single 2 gig server that's also a web & mysql server

    - by ian.evans
    My site's pages have exceeded the page limit for Google Custom Search, so many of the results are not found in our site search. I've been reading about Lucene, Nutch, Solr, etc., and I'm wondering whether I could run those on a single server that also runs the site (on nginx) and our MySQL server. We have 2 GB of RAM. I'd appreciate any suggestions for migrating to a new site search.

    Read the article

  • Can't search in a certain field using Solr

    - by intrance
    Hi, I'm setting up an environment using Nutch 1.0 + Solr 1.4. In Nutch I configured the subcollection plugin, which seems to work nicely. If I search as normal and add fl=*, I can see the subcollection field is filled as intended (something like <str name="subcollection">mysite.com</str>). My problem is that I would like to be able to search only in one or more given subcollections, but whenever my search query is something like q=subcollection:mysite.com it won't work. I've also tried adding fl=* or searching for mysite* instead, but I never get any results. Obviously Solr "knows" the subcollection field, as it doesn't return an error, just an empty result. I'd be glad for any help.
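    Whether that query can match depends on how the subcollection field is analyzed in schema.xml (a tokenized text field may split "mysite.com" into tokens, while a string field only matches the exact stored value). One way to narrow results to a subcollection, sketched with SolrJ 1.4 and placeholder URL and field names, is a filter query alongside the main query:

      import org.apache.solr.client.solrj.SolrQuery;
      import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
      import org.apache.solr.client.solrj.response.QueryResponse;
      import org.apache.solr.common.SolrDocument;

      public class SubcollectionSearch {
          public static void main(String[] args) throws Exception {
              // placeholder Solr URL
              CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

              SolrQuery query = new SolrQuery("content:nutch");        // the actual search terms
              query.addFilterQuery("subcollection:\"mysite.com\"");    // restrict to one subcollection
              query.setFields("url", "title", "subcollection");        // assumed Nutch/Solr field names

              QueryResponse response = solr.query(query);
              for (SolrDocument doc : response.getResults()) {
                  System.out.println(doc.getFieldValue("url"));
              }
          }
      }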

    Read the article

  • Crawling engine architecture - Java/ Perl integration

    - by Bigtwinz
    Hi all, I am looking to develop a management and administration solution around our web-crawling Perl scripts. Right now our scripts are saved in SVN and are manually kicked off by sysadmins/devs etc. Every time we need to retrieve data from new sources we have to create a ticket with business instructions and goals. As you can imagine, not an optimal solution. There are three consistent themes with this system: (1) the retrieval of data has a "conceptual structure", for lack of a better phrase, i.e. the retrieval of information follows a particular path; (2) we are only looking for very specific information, so we don't have to worry about extensive crawling for a while (think thousands to tens of thousands of pages vs. millions); (3) crawls are URL-based instead of site-based. As I enhance this alpha version to a more production-level beta, I am looking to add automation and management of the retrieval of data. Additionally, our other systems are Java (which I'm more proficient in), and I'd like to compartmentalize the Perl aspects so we don't have to lean heavily on outside help. I've evaluated the usual suspects (Nutch, Droids, etc.) but the time spent modifying those frameworks to suit our specific information retrieval can't be justified. So I'd like your thoughts on the following architecture. I want to create a solution which uses Java as the interface for managing and executing the Perl scripts, uses Java for configuration and data access, and sticks with Perl for retrieval. An example use case: a data analyst delivers a requirement for crawling; a Perl developer creates the required script and uses this web app to submit it (which gets saved to the filesystem); the script gets kicked off from the web app with specific parameters; the web app should be able to create multiple threads of the Perl script to initiate multiple crawlers. So the questions are: what do you think? How solid is integration between Java and Perl, specifically calling Perl from Java? Has anyone built such a system that is effectively part Perl repository? The goal really is to not have a whole bunch of unorganized Perl scripts and to put some management and organization on our information retrieval. Also, I know I could use Perl to do the web part of what we want, but as I mentioned, I'm trying to keep Perl focused; if that seems backwards, I'm not averse to making it an all-Perl solution. Open to any and all suggestions and opinions. Thanks.
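    On the calling-Perl-from-Java question: the plainest integration is to treat each Perl script as an external process, which leaves the Perl side untouched and lets the Java web app own scheduling, parameters, and logging. A minimal sketch with a hypothetical script path and arguments:

      import java.io.BufferedReader;
      import java.io.InputStreamReader;

      public class PerlCrawlerLauncher {
          public static void main(String[] args) throws Exception {
              // hypothetical crawler script and parameters supplied by the web app
              ProcessBuilder pb = new ProcessBuilder(
                      "perl", "/opt/crawlers/fetch_prices.pl", "--url", "http://example.com/catalog");
              pb.redirectErrorStream(true);   // fold the script's stderr into its stdout

              Process process = pb.start();
              BufferedReader out = new BufferedReader(new InputStreamReader(process.getInputStream()));
              for (String line; (line = out.readLine()) != null; ) {
                  System.out.println("[crawler] " + line);   // in the web app this would go to a log or DB
              }

              int exitCode = process.waitFor();
              System.out.println("crawler finished with exit code " + exitCode);
          }
      }

    Each submitted script can be launched this way on its own worker thread (for example via an ExecutorService), which would cover the "multiple crawlers" requirement without any tighter Java-Perl coupling.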

    Read the article

  • How to crawl a webPage with dynamic content added by javascript

    - by blunderboy
    There is news that Google's bots can now understand our JavaScript code, which means it is possible to fully crawl a web page that has lazy loading enabled. I am using Apache Nutch to crawl websites, but I don't think it can fetch URLs that are injected into the HTML page by JavaScript when the page is scrolled down. I see a lot of websites doing lazy loading for performance reasons. Can somebody please explain how I can crawl the data that appears in the HTML page on lazy load (i.e. on scrolling the page down)?
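    Stock Nutch fetches the raw HTML and does not execute page scripts, so content injected on scroll never reaches its parser. One common workaround is to render such pages with a headless browser outside Nutch and feed the discovered URLs back to the crawl as seeds. A rough sketch using HtmlUnit, with a hypothetical URL; the scroll call only helps if the page's lazy loader is actually triggered by scroll events:

      import com.gargoylesoftware.htmlunit.WebClient;
      import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
      import com.gargoylesoftware.htmlunit.html.HtmlPage;

      public class LazyLoadLinkExtractor {
          public static void main(String[] args) throws Exception {
              WebClient client = new WebClient();   // JavaScript execution is enabled by default
              HtmlPage page = client.getPage("http://example.com/infinite-feed");  // hypothetical URL

              // nudge the page's lazy loader by scrolling to the bottom, then let scripts finish
              page.executeJavaScript("window.scrollTo(0, document.body.scrollHeight);");
              client.waitForBackgroundJavaScript(5000);

              // collect the links that now exist in the rendered DOM (e.g. to write out as Nutch seeds)
              for (HtmlAnchor anchor : page.getAnchors()) {
                  System.out.println(anchor.getHrefAttribute());
              }
              client.closeAllWindows();
          }
      }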

    Read the article

  • can i use hadoop cloudera without root access?

    - by in_the_cloud
    A bit of a binary question (okay, not exactly), but I was wondering whether one can configure Cloudera/Hadoop to run on the nodes without root shell access to the node machines (although I can set up passwordless SSH login). It appears from their instructions that root access is needed, and yet I found a Hadoop wiki page which suggests root access might not be needed: http://wiki.apache.org/nutch/NutchHadoopTutorial

    Read the article

  • Are there any decent open-source search engine solutions?

    - by Nazariy
    A few weeks ago my friend asked me how hard it is to launch your own search engine service with a list of websites that are supposed to be crawled from time to time. The first thing that came to mind was Google Custom Search; however, the pricing policy is quite tricky and would drain your budget if you reach 500K queries per year. Another solution I found was SearchBlox, which can be compared to the Google Mini service. It's quite a good solution if you are planning to cover search over a small number of websites, but for larger projects it is not very handy. I also found a few other search platforms, like Lucene, Hadoop and Xapian, which seem to be powerful enough to approach Google's search quality, plus Nutch as a web crawler. Like most open-source projects, they share the same problem: a lack of comprehensive usage guidance and examples, and an expectation that you are already an expert in the subject. I'm wondering if any of you are using these solutions, which of them you would recommend, and what I should be aware of.

    Read the article

  • Building intranet search

    - by gmkv
    At work, we have lots of information squirreled away in many different sites -- wikis, product docs, ticketing system, etc. -- many of which require authentication. I'm very interested in having a single way to search all our various silos, and in my spare time I have looked at Nutch, Grub, Django + Haystack, etc. None of these is a complete solution a la Google Mini or Google Search Appliance. Has anybody built a basic intranet search engine out of a mixture of these tools? Would you have recommendations on how to go about it? I like Django, and Haystack seems to be a mildly popular search solution for it, but I'd need to wire up a crawler to it that can crawl authenticated sites.

    Read the article

1 2  | Next Page >