Search Results

Search found 30 results on 2 pages for 'nutch'.

Page 1/2 | 1 2  | Next Page >

  • Problem with running the Nutch command from PHP exec()

    - by Annibigi
    My Nutch directory is /home/myserv/nutch/nutch-1.0/ and my PHP application is in /home/myserv/www/. A PHP file in /home/myserv/www/ runs an exec() call to launch a Nutch command. The PHP code looks like: $output = exec("bin/nutch all"); When I run the command from the command line I need to be in the /home/myserv/nutch/nutch-1.0/ directory. When I try to run it through PHP's exec(), I just can't seem to make it execute. I have tried giving the full path, like below, but nothing works: $output = exec("/home/myserv/nutch/nutch-1.0/bin/nutch all"); Desperately looking for help.
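    A likely culprit is the working directory and environment of the process exec() spawns: bin/nutch expects to be launched from the Nutch home, and the web server's user also needs execute permission and a sane JAVA_HOME. Since exec() goes through the shell, the usual PHP-side workaround is exec("cd /home/myserv/nutch/nutch-1.0 && bin/nutch all 2>&1", $out, $ret). As a language-neutral sketch of the same working-directory point (Java, to match most examples in this listing; paths are placeholders from the question):

      import java.io.BufferedReader;
      import java.io.File;
      import java.io.InputStreamReader;

      public class RunNutch {
          public static void main(String[] args) throws Exception {
              ProcessBuilder pb = new ProcessBuilder("bin/nutch", "all");
              pb.directory(new File("/home/myserv/nutch/nutch-1.0"));  // run from the Nutch home
              pb.redirectErrorStream(true);                            // merge stderr into stdout
              Process p = pb.start();
              BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
              for (String line; (line = r.readLine()) != null; ) {
                  System.out.println(line);                            // surface Nutch's own error output
              }
              System.out.println("exit code: " + p.waitFor());
          }
      }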

    Read the article

  • nutch spell checker | nutch navigation filter

    - by Sascha
    Hello, I am trying to configure the Nutch 1.0 search engine. First, I need to integrate a spell checker or something similar; is there a plugin available? My next question is how to exclude HTML tags like "" so that navigation is not part of the index. Thanks for all answers.

    Read the article

  • posting nutch data into a BASIC auth secured Solr instance

    - by mlathe
    Hi. I've secured a Solr instance using BASIC auth, roughly as shown here: http://blog.comtaste.com/2009/02/securing_your_solr_server_on_t.html Now I'm trying to update my batch processes to push data into the authenticated instance. The ones using curl are easy, but I also have a Nutch crawl that uses the "solrindex" command to push data into Solr. When I do that I get this error:
    2010-02-22 12:09:28,226 INFO auth.AuthChallengeProcessor - basic authentication scheme selected
    2010-02-22 12:09:28,229 INFO httpclient.HttpMethodDirector - No credentials available for BASIC 'Tomcat Manager Application'@ninja:5500
    2010-02-22 12:09:28,236 WARN mapred.LocalJobRunner - job_local_0001
    org.apache.solr.common.SolrException: Unauthorized
    Unauthorized request: http://ninja:5500/solr/foo/update?wt=javabin&version=2.2
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:343)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:183)
        at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:217)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:48)
        at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:69)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:447)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
    2010-02-22 12:09:29,134 FATAL solr.SolrIndexer - SolrIndexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
        at org.apache.nutch.indexer.solr.SolrIndexer.indexSolr(SolrIndexer.java:73)
        at org.apache.nutch.indexer.solr.SolrIndexer.run(SolrIndexer.java:95)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.solr.SolrIndexer.main(SolrIndexer.java:104)
    Apparently Nutch uses SolrJ to push the content, and after going through the SolrJ code it's clear that it uses commons-httpclient without providing a way to set the credentials. Here are my questions: Is this possible to do, i.e. push from Nutch into a BASIC-auth-secured Solr instance? Is it possible to tell commons-httpclient about a credential without explicitly doing an _httpclient.getState().setCredentials(...)? Any other ideas? One idea I had was to use an IP-filtering Valve for just the "update" Solr web services; that would mean you could only make an update call from certain nodes. Thanks.
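    For reference, SolrJ itself can talk to a BASIC-auth Solr if you hand it a pre-configured commons-httpclient 3.x client; Nutch's SolrWriter constructs its own CommonsHttpSolrServer from the solrindex URL, so wiring this in would mean patching that class (or restricting /update by IP, as suggested above). A minimal SolrJ-side sketch, assuming SolrJ 1.4, with the host and core taken from the log and placeholder credentials:

      import org.apache.commons.httpclient.HttpClient;
      import org.apache.commons.httpclient.UsernamePasswordCredentials;
      import org.apache.commons.httpclient.auth.AuthScope;
      import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
      import org.apache.solr.common.SolrInputDocument;

      public class SecuredSolrPush {
          public static void main(String[] args) throws Exception {
              // commons-httpclient 3.x client carrying the BASIC credentials
              HttpClient http = new HttpClient();
              http.getState().setCredentials(
                      new AuthScope("ninja", 5500),                        // host/port from the log above
                      new UsernamePasswordCredentials("user", "secret"));  // placeholder credentials
              http.getParams().setAuthenticationPreemptive(true);          // send the auth header up front

              // hand the authenticated client to SolrJ instead of letting it build its own
              CommonsHttpSolrServer solr =
                      new CommonsHttpSolrServer("http://ninja:5500/solr/foo", http);

              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", "test-1");   // placeholder document
              solr.add(doc);
              solr.commit();
          }
      }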

    Read the article

  • crawl websites out of java web application without using bin/nutch

    - by Marcel
    Hi. I am trying to use Nutch 1.1 without bin/nutch from my (Java) Mojarra 2.0.2 web app. I have searched Google for examples, but there are none showing how to do this. I get an exception and the job fails (I think it is something in Hadoop). Here is my code:
    public void run() throws Exception {
        final String[] args = new String[] {
            String.format("%s%s%s%s", JSFUtils.getWebAppRoot(), "nutch", File.separator, DIRECTORY_URLS),
            "-dir", String.format("%s%s%s%s", JSFUtils.getWebAppRoot(), "nutch", File.separator, DIRECTORY_CRAWL),
            "-threads", this.preferences.get("threads"),
            "-depth", this.preferences.get("depth"),
            "-topN", this.preferences.get("topN"),
            "-solr", this.preferences.get("solr")
        };
        Crawl.main(args);
    }
    And part of the logging:
    10/05/17 10:42:54 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
    10/05/17 10:42:54 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    10/05/17 10:42:54 INFO mapred.FileInputFormat: Total input paths to process : 1
    10/05/17 10:42:54 INFO mapred.JobClient: Running job: job_local_0001
    10/05/17 10:42:54 INFO mapred.FileInputFormat: Total input paths to process : 1
    10/05/17 10:42:55 INFO mapred.MapTask: numReduceTasks: 1
    10/05/17 10:42:55 INFO mapred.MapTask: io.sort.mb = 100
    java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:211)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
        at lan.localhost.process.NutchCrawling.run(NutchCrawling.java:108)
        at lan.localhost.main.Index.indexing(Index.java:71)
        at lan.localhost.bean.FeedingBean.actionStart(FeedingBean.java:25)
    ....
    Can someone help me or tell me how I can crawl from a Java application? I have increased Xms to 256m and Xmx to 768m, but nothing changed. Best regards, Marcel.

    Read the article

  • Nutch search always returns 0 results

    - by darbour
    I have set up Nutch 1.0 on a cluster and it has successfully crawled. I copied the crawl directory to the local filesystem using dfs -copyToLocal and set the value of searcher.dir in the nutch-site.xml file in the Tomcat directory to point to that directory. Still, when I try to search I receive 0 results. Any help would be greatly appreciated.

    Read the article

  • Nutch - how to crawl in small batches?

    - by Yurish
    Hi everyone! I am stuck: I can't get Nutch to crawl in small batches. I start it with the bin/nutch crawl command with the parameters -depth 7 and -topN 10000, and it never ends; it stops only when my HDD runs out of space. What I need to do: (1) start crawling my seeds with the possibility of following outlinks; (2) crawl 20,000 pages, then index them; (3) crawl another 20,000 pages, index them and merge with the first index; (4) repeat step 3 n times. I also tried the scripts found in the wiki, but none of the scripts I found continue a crawl. If I run them again, they start everything from the beginning, and at the end of the script I have the same index I had when I started. But I need to continue my crawl. Some help would be very useful!
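    One pattern that fits this, sketched rather than prescribed: skip the all-in-one crawl command and drive the individual Nutch steps yourself, capping each round with -topN, then index and merge after every round. A rough Java sketch of one bounded round, assuming a conventional crawl/crawldb and crawl/segments layout under a placeholder NUTCH_HOME:

      import java.io.File;
      import java.util.Arrays;

      public class CrawlRound {
          // Placeholder install location; adjust to your setup.
          static final File NUTCH_HOME = new File("/home/myserv/nutch/nutch-1.0");

          // Run one bin/nutch subcommand from the Nutch home and fail fast on errors.
          static void nutch(String... args) throws Exception {
              String[] cmd = new String[args.length + 1];
              cmd[0] = "bin/nutch";
              System.arraycopy(args, 0, cmd, 1, args.length);
              ProcessBuilder pb = new ProcessBuilder(cmd);
              pb.directory(NUTCH_HOME);
              pb.inheritIO();
              if (pb.start().waitFor() != 0) {
                  throw new RuntimeException("failed: " + Arrays.toString(cmd));
              }
          }

          public static void main(String[] args) throws Exception {
              // one bounded round: at most 20,000 URLs
              nutch("generate", "crawl/crawldb", "crawl/segments", "-topN", "20000");

              // generate creates a new timestamped segment; pick the newest one
              File[] segments = new File(NUTCH_HOME, "crawl/segments").listFiles();
              Arrays.sort(segments);
              String segment = "crawl/segments/" + segments[segments.length - 1].getName();

              nutch("fetch", segment);
              nutch("updatedb", "crawl/crawldb", segment);
              // then invert links and index this segment, merging with the previous index,
              // and loop the whole round as many times as needed
          }
      }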

    Read the article

  • Do not filter outlinks in Nutch?

    - by sigpwned
    I'm currently trying to perform a deep crawl within a small list of sites. To accomplish this, I updated conf/domain-urlfilter.txt with the domains of the sites I wish to scrape, which worked nicely. However, I found that not only were the links crawled at every step filtered, but the outlinks captured from each page crawled were filtered as well. Is there a way to avoid filtering captured outlinks while still filtering crawled URLs?

    Read the article

  • Can I run a site search like Lucene on a single 2 gig server that's also a web & mysql server

    - by ian.evans
    My site's pages have exceeded the page limit for Google Custom Search, so many of the results are not found in our site search. I've been reading about Lucene, Nutch, Solr, etc., and I'm wondering whether I could run those on a single server that also runs the site (on nginx) and our MySQL server. We have 2 GB of RAM. I'd appreciate any suggestions for migrating to a new site search.

    Read the article

  • Can't search in a certain field using Solr

    - by intrance
    Hi, I'm setting up an environment using Nutch 1.0 + Solr 1.4. In Nutch I configured the subcollection plugin, which seems to work nicely. If I search as normal and add fl=*, I can see the subcollection field is filled as intended (something like <str name="subcollection">mysite.com</str>). My problem is that I would like to be able to search only in one or more given subcollections, but whenever my search query is something like q=subcollection:mysite.com it won't work. I've also tried adding fl=* or searching for mysite* instead, but I never get any results. Obviously Solr "knows" the subcollection field, as it doesn't return an error, just an empty result. I'd be glad for any help.
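    Whether that query can match depends on how the subcollection field is analyzed in schema.xml (a tokenized text field may split "mysite.com" into tokens, while a string field only matches the exact stored value). One way to narrow results to a subcollection, sketched with SolrJ 1.4 and placeholder URL and field names, is a filter query alongside the main query:

      import org.apache.solr.client.solrj.SolrQuery;
      import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
      import org.apache.solr.client.solrj.response.QueryResponse;
      import org.apache.solr.common.SolrDocument;

      public class SubcollectionSearch {
          public static void main(String[] args) throws Exception {
              // placeholder Solr URL
              CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

              SolrQuery query = new SolrQuery("content:nutch");        // the actual search terms
              query.addFilterQuery("subcollection:\"mysite.com\"");    // restrict to one subcollection
              query.setFields("url", "title", "subcollection");        // assumed Nutch/Solr field names

              QueryResponse response = solr.query(query);
              for (SolrDocument doc : response.getResults()) {
                  System.out.println(doc.getFieldValue("url"));
              }
          }
      }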

    Read the article

  • Crawling engine architecture - Java/ Perl integration

    - by Bigtwinz
    Hi all, I am looking to develop a management and administration solution around our web-crawling Perl scripts. Right now our scripts are saved in SVN and are manually kicked off by sysadmins/devs etc. Every time we need to retrieve data from new sources we have to create a ticket with business instructions and goals. As you can imagine, not an optimal solution. There are three consistent themes with this system: (1) the retrieval of data has a "conceptual structure", for lack of a better phrase, i.e. the retrieval of information follows a particular path; (2) we are only looking for very specific information, so we don't have to worry about extensive crawling for a while (think thousands to tens of thousands of pages vs. millions); (3) crawls are URL-based instead of site-based. As I enhance this alpha version to a more production-level beta, I am looking to add automation and management of the retrieval of data. Additionally, our other systems are Java (which I'm more proficient in), and I'd like to compartmentalize the Perl aspects so we don't have to lean heavily on outside help. I've evaluated the usual suspects (Nutch, Droids, etc.) but the time spent modifying those frameworks to suit our specific information retrieval can't be justified. So I'd like your thoughts on the following architecture. I want to create a solution which uses Java as the interface for managing and executing the Perl scripts, uses Java for configuration and data access, and sticks with Perl for retrieval. An example use case: a data analyst delivers a requirement for crawling; a Perl developer creates the required script and uses this web app to submit it (which gets saved to the filesystem); the script gets kicked off from the web app with specific parameters; the web app should be able to create multiple threads of the Perl script to initiate multiple crawlers. So the questions are: what do you think? How solid is integration between Java and Perl, specifically calling Perl from Java? Has anyone built such a system that is effectively part Perl repository? The goal really is to not have a whole bunch of unorganized Perl scripts and to put some management and organization on our information retrieval. Also, I know I could use Perl to do the web part of what we want, but as I mentioned, I'm trying to keep Perl focused; if that seems backwards, I'm not averse to making it an all-Perl solution. Open to any and all suggestions and opinions. Thanks.
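    On the calling-Perl-from-Java question: the plainest integration is to treat each Perl script as an external process, which leaves the Perl side untouched and lets the Java web app own scheduling, parameters, and logging. A minimal sketch with a hypothetical script path and arguments:

      import java.io.BufferedReader;
      import java.io.InputStreamReader;

      public class PerlCrawlerLauncher {
          public static void main(String[] args) throws Exception {
              // hypothetical crawler script and parameters supplied by the web app
              ProcessBuilder pb = new ProcessBuilder(
                      "perl", "/opt/crawlers/fetch_prices.pl", "--url", "http://example.com/catalog");
              pb.redirectErrorStream(true);   // fold the script's stderr into its stdout

              Process process = pb.start();
              BufferedReader out = new BufferedReader(new InputStreamReader(process.getInputStream()));
              for (String line; (line = out.readLine()) != null; ) {
                  System.out.println("[crawler] " + line);   // in the web app this would go to a log or DB
              }

              int exitCode = process.waitFor();
              System.out.println("crawler finished with exit code " + exitCode);
          }
      }

    Each submitted script can be launched this way on its own worker thread (for example via an ExecutorService), which would cover the "multiple crawlers" requirement without any tighter Java-Perl coupling.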

    Read the article

  • How to crawl a webPage with dynamic content added by javascript

    - by blunderboy
    There is news that Google's bots can now understand our JavaScript code, which means it is possible to fully crawl a web page that has lazy loading enabled. I am using Apache Nutch to crawl websites, but I don't think it can fetch URLs that are injected into the HTML page by JavaScript when the page is scrolled down. I see a lot of websites doing lazy loading for performance reasons. Can somebody please explain how I can crawl the data that appears in the HTML page on lazy load (i.e. on scrolling the page down)?
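    Stock Nutch fetches the raw HTML and does not execute page scripts, so content injected on scroll never reaches its parser. One common workaround is to render such pages with a headless browser outside Nutch and feed the discovered URLs back to the crawl as seeds. A rough sketch using HtmlUnit, with a hypothetical URL; the scroll call only helps if the page's lazy loader is actually triggered by scroll events:

      import com.gargoylesoftware.htmlunit.WebClient;
      import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
      import com.gargoylesoftware.htmlunit.html.HtmlPage;

      public class LazyLoadLinkExtractor {
          public static void main(String[] args) throws Exception {
              WebClient client = new WebClient();   // JavaScript execution is enabled by default
              HtmlPage page = client.getPage("http://example.com/infinite-feed");  // hypothetical URL

              // nudge the page's lazy loader by scrolling to the bottom, then let scripts finish
              page.executeJavaScript("window.scrollTo(0, document.body.scrollHeight);");
              client.waitForBackgroundJavaScript(5000);

              // collect the links that now exist in the rendered DOM (e.g. to write out as Nutch seeds)
              for (HtmlAnchor anchor : page.getAnchors()) {
                  System.out.println(anchor.getHrefAttribute());
              }
              client.closeAllWindows();
          }
      }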

    Read the article

  • can i use hadoop cloudera without root access?

    - by in_the_cloud
    A bit of a binary question (okay, not exactly), but I was wondering whether one can configure Cloudera/Hadoop to run on the nodes without root shell access to the node machines (although I can set up passwordless SSH login). It appears from their instructions that root access is needed, and yet I found a Hadoop wiki page which suggests root access might not be needed: http://wiki.apache.org/nutch/NutchHadoopTutorial

    Read the article

  • Are there any decent open-source search engine solutions?

    - by Nazariy
    A few weeks ago my friend asked me how hard it is to launch your own search engine service with a list of websites that are supposed to be crawled from time to time. The first thing that came to mind was Google Custom Search; however, the pricing policy is quite tricky and would drain your budget if you reach 500K queries per year. Another solution I found was SearchBlox, which can be compared to the Google Mini service. It's quite a good solution if you are planning to cover search over a small number of websites, but for larger projects it is not very handy. I also found a few other search platforms, like Lucene, Hadoop and Xapian, which seem to be powerful enough to approach Google's search quality, plus Nutch as a web crawler. Like most open-source projects, they share the same problem: a lack of comprehensive usage guidance and examples, and an expectation that you are already an expert in the subject. I'm wondering if any of you are using these solutions, which of them you would recommend, and what I should be aware of.

    Read the article

  • Building intranet search

    - by gmkv
    At work, we have lots of information squirreled away in many different sites -- wikis, product docs, ticketing system, etc. -- many of which require authentication. I'm very interested in having a single way to search all our various silos, and in my spare time I have looked at Nutch, Grub, Django + Haystack, etc. None of these is a complete solution a la Google Mini or Google Search Appliance. Has anybody built a basic intranet search engine out of a mixture of these tools? Would you have recommendations on how to go about it? I like Django, and Haystack seems to be a mildly popular search solution for it, but I'd need to wire up a crawler to it that can crawl authenticated sites.

    Read the article

1 2  | Next Page >