Search Results

Search found 30 results on 2 pages for 'nutch'.

  • .Net search engine architecture and technology choice

    - by shrivb
    I am designing a search engine for an ASP.NET site. The site currently uses Microsoft Indexing Server (MIS) to index and search content ranging from simple text files to MS documents and PDFs. MIS is also used to crawl file servers and, in tandem with Index Server Companion, to crawl content from external sites. I intend to replace MIS with the indexer/crawler I am building. Since my platform is entirely on the Microsoft stack, I can't afford to run a Java application server, so Solr, and effectively SolrNet, is ruled out. With that context, I have a couple of questions. 1. Technology choice: In my initial investigation I looked at Lucene.Net, and there seem to be two issues with it. First, it can't crawl external content, and there doesn't seem to be a direct port of Nutch to .NET. Second, since it is just an indexer, it can't parse the various document types; parsing is left to the developer. So what would be the best technology choice on the .NET platform for indexing and crawling? Are there any open source .NET libraries for document parsing? 2. Architectural pattern: Is there a general architectural pattern or best practice to follow when designing such a search engine? Thanks in advance.

  • What is a good Java crawler library?

    - by DrDee
    Hi, I am about to develop a crawler in Java but don't feel like reinventing the wheel. A quick Google search turns up a whole bunch of Java libraries for building a web crawler. Nutch is of course a very robust package, but it seems a bit too advanced for my needs: I only need to crawl a handful of websites a week, each containing a couple of thousand pages. Which open source Java library would you recommend, considering speed, multithreading (or even distributed crawling), extensibility with new functionality, active maintenance, and documentation?
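
For a sense of scale, the multithreaded core of a small crawler can be sketched with the JDK alone. This is a minimal sketch, not a recommendation of any library: `MiniCrawler`, `crawl`, and the `fetchLinks` callback are hypothetical names introduced here, and the callback stands in for real page fetching and link extraction, which this sketch does not implement. It also ignores politeness delays and robots.txt, which a real crawler should respect.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: level-by-level (breadth-first) crawl with a fixed thread pool.
final class MiniCrawler {
    private MiniCrawler() {}

    // fetchLinks is an assumed callback: given a URL, it returns the links found on that page.
    static Set<String> crawl(String seed,
                             Function<String, List<String>> fetchLinks,
                             int threads,
                             int maxPages) {
        Set<String> visited = new LinkedHashSet<>(); // only touched by the calling thread
        visited.add(seed);
        List<String> frontier = List.of(seed);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            while (!frontier.isEmpty()) {
                // Fetch every page of the current level in parallel.
                List<Future<List<String>>> pending = new ArrayList<>();
                for (String url : frontier) {
                    Callable<List<String>> task = () -> fetchLinks.apply(url);
                    pending.add(pool.submit(task));
                }
                // Collect results and build the next level from unseen links.
                List<String> next = new ArrayList<>();
                for (Future<List<String>> result : pending) {
                    List<String> links;
                    try {
                        links = result.get();
                    } catch (ExecutionException e) {
                        continue; // a failed page is skipped, not fatal
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return visited;
                    }
                    for (String link : links) {
                        if (visited.size() < maxPages && visited.add(link)) {
                            next.add(link);
                        }
                    }
                }
                frontier = next;
            }
            return visited;
        } finally {
            pool.shutdown();
        }
    }
}
```

Keeping the fetch step behind a callback means the traversal can be exercised against an in-memory link map before any HTTP client or HTML parser is chosen.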

  • How to normalize a URL in Java?

    - by dfrankow
    URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal is to transform a URL into a normalized, or canonical, form so that it is possible to determine whether two syntactically different URLs are equivalent. Strategies include lowercasing, adding trailing slashes, treating https and http as equivalent, and so on; the Wikipedia page lists many. Got a favorite method of doing this in Java? Perhaps a library (Nutch?), but I'm open. Smaller, with fewer dependencies, is better. I'll hand-code something for now and keep an eye on this question.
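
A dependency-free starting point is possible with `java.net.URI` from the JDK. The following is a minimal sketch covering only a few of the strategies mentioned: lowercasing the scheme and host, removing `.` and `..` path segments, dropping default ports, using `/` for an empty path, and discarding the fragment. `UrlNormalizer` and `normalize` are hypothetical names introduced here. The sketch assumes an absolute URL with a host, drops any user-info part, and deliberately keeps `http` and `https` distinct, since the two are not equivalent in general.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper: a small subset of URL normalization rules.
final class UrlNormalizer {
    private UrlNormalizer() {}

    static String normalize(String url) {
        // URI.create throws IllegalArgumentException on malformed input;
        // normalize() removes "." and ".." path segments where possible.
        URI uri = URI.create(url.trim()).normalize();
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            throw new IllegalArgumentException("expected an absolute URL with a host: " + url);
        }
        // Scheme and host are case-insensitive, so lowercase both.
        scheme = scheme.toLowerCase(Locale.ROOT);
        host = host.toLowerCase(Locale.ROOT);

        // Drop the port when it is the default for the scheme.
        int port = uri.getPort();
        boolean defaultPort = (scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443);
        if (defaultPort) {
            port = -1;
        }

        // An empty path is treated as "/".
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }

        StringBuilder out = new StringBuilder();
        out.append(scheme).append("://").append(host);
        if (port != -1) {
            out.append(':').append(port);
        }
        out.append(path);
        if (uri.getRawQuery() != null) {
            out.append('?').append(uri.getRawQuery()); // query kept as-is, not reordered
        }
        return out.toString(); // fragment intentionally dropped
    }
}
```

With this sketch, `UrlNormalizer.normalize("HTTP://Example.COM:80/a/./b/../c?x=1#frag")` and `UrlNormalizer.normalize("http://example.com/a/c?x=1")` return the same string, so the two URLs compare as equivalent. Rules such as sorting query parameters or normalizing percent-encoding would need to be added on top.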

  • Can a raw Lucene index be loaded by Solr?

    - by wynz
    Some colleagues of mine have a large Java web app that uses a search system built with Lucene Java. What I'd like to do is have a nice HTTP-based API to access those existing search indexes. I've used Nutch before and really liked how simple the OpenSearch implementation made it to grab results as RSS. I've tried setting Solr's dataDir in solrconfig.xml, hoping it would happily pick up the existing index files, but it seems to just ignore them. My main question is: Can Solr be used to access Lucene indexes created elsewhere? Or might there be a better solution?
