lucene - Developer IT

performance comparision between Zend Lucene and Java Lucene

- by Carson

Zend Lucene and Java Lucene are built in PHP and java repectively, and PHP language has a higher level than java. Just wondering How big the performance difference among these two, regarding to index building and data searching? Is it much more effective to let java create and rebuild index, and let php use the index?

Read the article

Lucene - querying with long strings

- by Mikos

I have an index, with a field "Affiliation", some example values are: "Stanford University School of Medicine, Palo Alto, CA USA", "Institute of Neurobiology, School of Medicine, Stanford University, Palo Alto, CA", "School of Medicine, Harvard University, Boston MA", "Brigham & Women's, Harvard University School of Medicine, Boston, MA" "Harvard University, Cambridge MA" and so on... (the bottom-line being the affiliations are written in multiple ways with no apparent consistency) I query the index on the affiliation field using say "School of Medicine, Stanford University, Palo Alto, CA" (with QueryParser) to find all Stanford related documents, I get a lot of false +ves, presumably because of the presence of School of Medicine etc. etc. (note: I cannot use Phrase query because of variability in the way affiliation is constructed) I have tried the following: Use a SpanNearQuery by splitting the search phrase with a whitespace (here I get no results!) Tried boosting (using ^) by splitting with the comma and boosting the last parts such as "Palo Alto CA" with a much higher boost than the initial phrases. Here I still get lots of false +ves. Any suggestions on how to approach this? If SpanNearQuery the way to go, Any ideas on why I get 0 results?

Read the article

Lucene and .NET Part I

- by javarg

I’ve playing around with Lucene.NET and trying to get a feeling of what was required to develop and implement a full business application using it. As you would imagine, many things are required for you to implement a robust solution for indexing content and searching it afterwards. Lucene is a great and robust solution for indexing content. It offers fast and performance enhanced search engine library available in Java and .NET. You will want to use this library in many particular scenarios: In Windows Azure, to support Full Text Search (a functionality not currently supported by SQL Azure) When storing files outside or not managed by your database (like in large document storage solutions that uses File System) When Full Text Search is not really what you need Lucene is more than a Full Text Search solution. It has several analyzers that let you process and search content in different ways (decomposing sentences, deriving words, removing articles, etc.). When deciding to implement indexing using Lucene, you will need to take into account the following: How content is to be indexed by Lucene and when. Using a service that runs after a specific interval Immediately when content changes When content is to available for searching / Availability of indexed content (as in real time content search) Immediately when content changes = near real time searching After a few minutes.. Ease of maintainability and development Some Technical Concerns.. When indexing content, indexes are locked for writing operations by the Index Writer. This means that Lucene is best designed to index content using single writer approach. When searching, Index Readers take a snapshot of indexes. This has the following implications: Setting up an index reader is a costly task. Your are not supposed to create one for each query or search. A good practice is to create readers and reuse them for several searches. The latter means that even when the content gets updated, you wont be able to see the changes. You will need to recycle the reader. In the second part of this post we will review some alternatives and design considerations.

Read the article

Lucene.NET - sorting by int

- by Judah Himango

In the latest version of Lucene (or Lucene.NET), what is the proper way to get the search results back in sorted order? I have a document like this: var document = new Lucene.Document(); document.AddField("Text", "foobar"); document.AddField("CreationDate", DateTime.Now.Ticks.ToString()); // store the date as an int lucene.AddDocument(document); Now I want do a search and get my results back in order of most recent. How can I do a search that orders results by CreationDate? All the documentation I see is for old Lucene versions that use now-deprecated APIs.

Read the article

Using Solr and Zends Lucene port together...

- by thebluefox

Afternoon chaps, After my adventures with Zend-Lucene-Search, and discovering it isn't all its cracked up to be when indexing large datasets, I've turned to Solr (thanks to Bill Karwin for that :) ) I've got Solr indexing the db far far quicker now, taking just over 8 minutes to index a table of just over 1.7million rows - which I'm very pleased with. However, when I come to try and search the index with the Zend port, I run into the following error; Fatal error: Uncaught exception 'Zend_Search_Lucene_Exception' with message 'Unsupported segments file format' in /var/www/Zend/Search/Lucene.php:407 Stack trace: #0 /var/www/Zend/Search/Lucene.php(555): Zend_Search_Lucene-_readSegmentsFile() #1 /var/www/z_search.php(12): Zend_Search_Lucene-__construct('tmp/feeds_index') #2 {main} thrown in /var/www/Zend/Search/Lucene.php on line 407 I've tried to have a search around but can't seem to find anything about this problem, everyone just seems to be able to get them to work? Any help as always much appreciated :) Thanks, Tom

Read the article

Lucene.NET (strings fuzzy matching)

- by dark-elf2

Good day The question is: Could anyone give me an example about how to do fuzzy matching of two strings using Lucene.NET (or using Java version of Lucene, or in any other language that has port of Lucene).

Read the article

Tutorial for Examine/Lucene

- by Kumar

I am interested in Examine for building searching in a standalone desktop app for searching db tables as well as office/.pdf files This looks like an excellent scenario for Lucene/examine However the doc there is minimal and while i have plenty of experience with SQL full text search, Lucene is a different beast altogether and hence looking for help/pointers on how/where to start And yes, i did a google search but did not find any resources as the terms are fairly common ( lucene examine tutorial etc. )

Read the article

Lucene HTMLFormatter skipping last character

- by Midhat

I have this simple Lucene search code (Modified from http://www.lucenetutorial.com/lucene-in-5-minutes.html) class Program { static void Main(string[] args) { StandardAnalyzer analyzer = new StandardAnalyzer(); Directory index = new RAMDirectory(); IndexWriter w = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED); addDoc(w, "Table 1 <table> content </table>"); addDoc(w, "Table 2"); addDoc(w, "<table> content </table>"); addDoc(w, "The Art of Computer Science"); w.Close(); String querystr = "table"; Query q = new QueryParser("title", analyzer).Parse(querystr); Lucene.Net.Search.IndexSearcher searcher = new Lucene.Net.Search.IndexSearcher(index); Hits hitsFound = searcher.Search(q); SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("*", "*"); Highlighter highlighter = null; highlighter = new Highlighter(formatter, new QueryScorer(searcher.Rewrite(q))); for (int i = 0; i < hitsFound.Length(); i++) { Console.WriteLine(highlighter.GetBestFragment(analyzer, "title", hitsFound.Doc(i).Get("title"))); // Console.WriteLine(hitsFound.Doc(i).Get("title")); } Console.ReadKey(); } private static void addDoc(IndexWriter w, String value) { Document doc = new Document(); doc.Add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED)); w.AddDocument(doc); } } The highlighted results always seem to skip the closing '' of my last table tag. Any suggestions?

Read the article

Updating Lucene index from two different threads in a web application

- by Jimmy

Hi, I've a .net web application which uses Lucene.net for company search functionality. When registered users add a new company,it is saved to database and also gets indexed in Lucene based company search index in real time. When adding company in Lucene index, how do I handle use case of two or more logged-in users posting a new company at the same time?Also, will both these companies get indexed without any file lock, lock time out, etc. related issues? Would appreciate if i could help with code as well. Thanks.

Read the article

Lucene.NET - Find documents that do not contain a specified field

- by Brandon

Let's say I have 2 instance of a class called 'Animal'. Animal has 3 fields: Name, Age, and Type The name field is nullable, so before I insert an instance of Animal as a Lucene indexed document, I check if Animal.Name == null, and if it does, I do not insert it as a field in my document. If I were to retrieve all animals, I would see that the Name field does not exist and I can set its value to null. However, there may be situations where I want to say "Get me all animals that do not have a name specified yet." In this situation I want to retrieve all Lucene.NET documents from my animal index that do not contain the Name field. Is there an easy way to do this with Lucene.NET? I want to stay away from having to perform some sort of hack to check if my name field has a value of 'null'.

Read the article

Lucene: Wildcards are missing from index

- by Eleasar

Hi - i am building a search index that contains special names - containing ! and ? and & and + and ... I have to tread the following searches different: me & you me + you But whatever i do (did try with queryparser escaping before indexing, escaped it manually, tried different indexers...) - if i check the search index with Luke they do not show up (question marks and @-symbols and the like show up) The logic behind is that i am doing partial searches for a live suggestion (and the fields are not that large) so i split it up into "m" and "me" and "+" and "y" and "yo" and "you" and then index it (that way it is way faster than a wildcard query search (and the index size is not a big problem). So what i would need is to also have this special wildcard characters be inserted into the index. This is my code: using System; using System.Collections.Generic; using System.IO; using System.Linq; using System.Text; using Lucene.Net.Analysis; using Lucene.Net.Util; namespace AnalyzerSpike { public class CustomAnalyzer : Analyzer { public override TokenStream TokenStream(string fieldName, TextReader reader) { return new ASCIIFoldingFilter(new LowerCaseFilter(new CustomCharTokenizer(reader))); } } public class CustomCharTokenizer : CharTokenizer { public CustomCharTokenizer(TextReader input) : base(input) { } public CustomCharTokenizer(AttributeSource source, TextReader input) : base(source, input) { } public CustomCharTokenizer(AttributeFactory factory, TextReader input) : base(factory, input) { } protected override bool IsTokenChar(char c) { return c != ' '; } } } The code to create the index: private void InitIndex(string path, Analyzer analyzer) { var writer = new IndexWriter(path, analyzer, true); //some multiline textbox that contains one item per line: var all = new List<string>(txtAllAvailable.Text.Replace("\r","").Split('\n')); foreach (var item in all) { writer.AddDocument(GetDocument(item)); } writer.Optimize(); writer.Close(); } private static Document GetDocument(string name) { var doc = new Document(); doc.Add(new Field( "name", DeNormalizeName(name), Field.Store.YES, Field.Index.ANALYZED)); doc.Add(new Field( "raw_name", name, Field.Store.YES, Field.Index.NOT_ANALYZED)); return doc; } (Code is with Lucene.net in version 1.9.x (EDIT: sorry - was 2.9.x) but is compatible with Lucene from Java) Thx

Read the article

Solr WordDelimiterFilter + Lucene Highlighter

- by Lucas

I am trying to get the Highlighter class from Lucene to work properly with tokens coming from Solr's WordDelimiterFilter. It works 90% of the time, but if the matching text contains a ',' such as "1,500" the output is incorrect: Expected: 'test 1,500 this' Observed: 'test 11,500 this' I am not currently sure whether it is Highlighter messing up the recombination or WordDelimiterFilter messing up the tokenization but something is unhappy. Here are the relevant dependencies from my pom: org.apache.lucene lucene-core 2.9.3 jar compile org.apache.lucene lucene-highlighter 2.9.3 jar compile org.apache.solr solr-core 1.4.0 jar compile And here is a simple JUnit test class demonstrating the problem: package test.lucene; import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertTrue; import java.io.IOException; import java.io.Reader; import java.util.HashMap; import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.TokenStream; import org.apache.lucene.queryParser.ParseException; import org.apache.lucene.queryParser.QueryParser; import org.apache.lucene.search.Query; import org.apache.lucene.search.highlight.Highlighter; import org.apache.lucene.search.highlight.InvalidTokenOffsetsException; import org.apache.lucene.search.highlight.QueryScorer; import org.apache.lucene.search.highlight.SimpleFragmenter; import org.apache.lucene.search.highlight.SimpleHTMLFormatter; import org.apache.lucene.util.Version; import org.apache.solr.analysis.StandardTokenizerFactory; import org.apache.solr.analysis.WordDelimiterFilterFactory; import org.junit.Test; public class HighlighterTester { private static final String PRE_TAG = "<b>"; private static final String POST_TAG = "</b>"; private static String[] highlightField( Query query, String fieldName, String text ) throws IOException, InvalidTokenOffsetsException { SimpleHTMLFormatter formatter = new SimpleHTMLFormatter( PRE_TAG, POST_TAG ); Highlighter highlighter = new Highlighter( formatter, new QueryScorer( query, fieldName ) ); highlighter.setTextFragmenter( new SimpleFragmenter( Integer.MAX_VALUE ) ); return highlighter.getBestFragments( getAnalyzer(), fieldName, text, 10 ); } private static Analyzer getAnalyzer() { return new Analyzer() { @Override public TokenStream tokenStream( String fieldName, Reader reader ) { // Start with a StandardTokenizer TokenStream stream = new StandardTokenizerFactory().create( reader ); // Chain on a WordDelimiterFilter WordDelimiterFilterFactory wordDelimiterFilterFactory = new WordDelimiterFilterFactory(); HashMap<String, String> arguments = new HashMap<String, String>(); arguments.put( "generateWordParts", "1" ); arguments.put( "generateNumberParts", "1" ); arguments.put( "catenateWords", "1" ); arguments.put( "catenateNumbers", "1" ); arguments.put( "catenateAll", "0" ); wordDelimiterFilterFactory.init( arguments ); return wordDelimiterFilterFactory.create( stream ); } }; } @Test public void TestHighlighter() throws ParseException, IOException, InvalidTokenOffsetsException { String fieldName = "text"; String text = "test 1,500 this"; String queryString = "1500"; String expected = "test " + PRE_TAG + "1,500" + POST_TAG + " this"; QueryParser parser = new QueryParser( Version.LUCENE_29, fieldName, getAnalyzer() ); Query q = parser.parse( queryString ); String[] observed = highlightField( q, fieldName, text ); for ( int i = 0; i < observed.length; i++ ) { System.out.println( "\t" + i + ": '" + observed[i] + "'" ); } if ( observed.length > 0 ) { System.out.println( "Expected: '" + expected + "'\n" + "Observed: '" + observed[0] + "'" ); assertEquals( expected, observed[0] ); } else { assertTrue( "No matches found", false ); } } } Anyone have any ideas or suggestions?

Read the article

Lucene.Net PrefixQuery

- by Sole

Hi, i´m development a suggest box for my site search service. I has to search fields like these: Visual Basic Enterprise Edition Visual C++ Visual J++ My code is: Directory dir = Lucene.Net.Store.FSDirectory.GetDirectory("Index", false); IndexSearcher searcher = new Lucene.Net.Search.IndexSearcher( dir,true); Term term = new Term("nombreAnalizado", _que); PrefixQuery query = new PrefixQuery(term); TopDocs topDocs = searcher.Search(query, 10000); This code works well in this case: "Enterprise" will match "Visual Basic Enterprise Edition" But "Enterprise E" doesn´t match anything. I removed white spaces at indexing time and when the user is searching. Thanks.

Read the article

Lucene search taking TOOO long.

- by Josh Handel

I;m using Lucene.net (2.9.2.2) on a (currently) 70Gig index.. I can do a fairly complicated search and get all the document IDs back in 1 ~ 2 seconds.. But to actually load up all the hits (about 700 thousand in my test queries) takes 5+ minutes. We aren't using lucene for UI, this is a datastore between processes where we have hundreds of millions of pre-cached data elements, and the part I am working on exports a few specific fields from each found document. (ergo, pagination doesn't make since as this is an export between processes). My question is what is the best way to get all of the documents in a search result? currently I am using a custom collector that does a get on the document (with a MapFieldSelector) as its collecting.. I've also tried iterating through the list after the collector has finished.. but that was even worse. I'm open to ideas :-). Thanks in advance.

Read the article

CouchDB Lucene How to URL Encode Query containing Minus (-)

- by Peter

I'd like to query text containing a minus (-) Sign, e.g. foo-bar with a couchdb lucene fulltext query. Following lucene rules I'd have to escape the minus, resulting in foo\-bar Last I'd have to urlencode the backslash resulting in foo%5C-bar So the complete url would be: http://127.0.0.1:5984/_fti/local/db/_design/foo/by_subject?q=foo%5C-bar Neither works. The result is always split in two phrases "q":"default:foo default:bar" Leading to documents containing only foo or bar being found also. Thanks for your help!

Read the article

Why wasn't fast-vector-highlighter (lucene-contrib) made an official part of Lucene 3.0 core

- by mP

I've read some Jira entries and they mentioned moving fast-vector-highlighter to core about a year ago but it never made it. Looking at the svn for contrib it seems incomplete. There are no tests for FastVectorHighlighter Documentation is lacking No samples anywhere on apache.org Anyone have any ideas what its status is?

Read the article

Lucene and Special Characters

- by Brandon

I am using Lucene.Net 2.0 to index some fields from a database table. One of the fields is a 'Name' field which allows special characters. When I perform a search, it does not find my document that contains a term with special characters. I index my field as such: Directory DALDirectory = FSDirectory.GetDirectory(@"C:\Indexes\Name", false); Analyzer analyzer = new StandardAnalyzer(); IndexWriter indexWriter = new IndexWriter(DALDirectory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED); Document doc = new Document(); doc.Add(new Field("Name", "Test (Test)", Field.Store.YES, Field.Index.TOKENIZED)); indexWriter.AddDocument(doc); indexWriter.Optimize(); indexWriter.Close(); And I search doing the following: value = value.Trim().ToLower(); value = QueryParser.Escape(value); Query searchQuery = new TermQuery(new Term(field, value)); Searcher searcher = new IndexSearcher(DALDirectory); TopDocCollector collector = new TopDocCollector(searcher.MaxDoc()); searcher.Search(searchQuery, collector); ScoreDoc[] hits = collector.TopDocs().scoreDocs; If I perform a search for field as 'Name' and value as 'Test', it finds the document. If I perform the same search as 'Name' and value as 'Test (Test)', then it does not find the document. Even more strange, if I remove the QueryParser.Escape line do a search for a GUID (which, of course, contains hyphens) it finds documents where the GUID value matches, but performing the same search with the value as 'Test (Test)' still yields no results. I am unsure what I am doing wrong. I am using the QueryParser.Escape method to escape the special characters and am storing the field and searching by the Lucene.Net's examples. Any thoughts?

Read the article

Lucene (.NET) Document stucture and performance suggestions.

- by Josh Handel

Hello, I am indexing about 100M documents that consist of a few string identifiers and a hundred or so numaric terms.. I won't be doing range queries, so I haven't dugg too deep into Numaric Field but I'm not thinking its the right choose here. My problem is that the query performance degrades quickly when I start adding OR criteria to my query.. All my queries are on specific numaric terms.. So a document looks like StringField:[someString] and N DataField:[someNumber].. I then query it with something like DataField:((+1 +(2 3)) (+75 +(3 5 52)) (+99 +88 +(102 155 199))). Currently these queries take about 7 to 16 seconds to run on my laptop.. I would like to make sure thats really the best they can do.. I am open to suggestions on field structure and query structure :-). Thanks Josh PS: I have already read over all the other lucene performance discussions on here, and on the Lucene wiki and at lucid imiagination... I'm a bit further down the rabbit hole then that...

Read the article

Lucene search and underscores

- by Matt

When I use Luke to search my Lucene index using a standard analyzer, I can see the field I am searchng for contains values of the form MY_VALUE. When I search for field:"MY_VALUE" however, the query is parsed as field:"my value" Is there a simple way to escape the underscore (_) character so that it will search for it?

Read the article

How do i search 'and' with lucene?

- by acidzombie24

I am looking at the query syntax. and i could not figure out how to search 'and'. I tried "a sentence with and and words after it" i tried +and and \and. It always ignored it. How can i search 'and'? I am using lucene.net

Read the article

Lucene.Net Search result to highlight search keywords

- by Ali Shafai

I use Lucene.Net to index some documents. I want to show the user a couple of lines as to why that document is in the result set. just like when you use google to search and it shows the link and followed by the link there are a few lines with the keywords highlighted. any ideas?

Read the article

Lucene.Net support phrases?: What is best approach to tokenize comma-delimited data (atomically) in

- by Pete Alvin

I have a database with a column I wish to index that has comma-delimited names, e.g., User.FullNameList = "Helen Ready, Phil Collins, Brad Paisley" I prefer to tokenize each name atomically (name as a whole searchable entity). What is the best approach for this? Did I miss a simple option to set the tokenize delimiter? Do I have to subclass or write my own class that to roll my own tokenizer? Something else? ;) Or does Lucene.net not support phrases? Or is it smart enough to handle this use case automatically? I'm sure I'm not the first person to have to do this. Googling produced no noticeable solutions.

Read the article

Reading from compressed lucene index

- by Akhil

I created a lucene index and compressed the index directory with bz2 or zip. I donot want to uncompress it. Is there any API call that can read the index from this zipped directory and thus allow searching and other functionalities. That is, can lucence IndexReader read the index from a compressed file. I saw that Lucnene IndexReader does not support "Reader" to open the index, otherwise I would have created a Reader class that uncompresses the file and streams the uncompressed version. Any alternatives to this are welcome. Thanks, Akhil

Read the article

Disabling scoring in Lucene(.NET)

- by user72185

Hi, When searching, is there a way to disable scoring for any query? The scenario is that the user refines his query by trying different combinations of words, phrases etc., and needs realtime (well, reasonably fast at least) responses on the number of hits. Search time slows down a lot when there are millions of hits due to scoring, but the user really doesn't care about all these documents. As soon as he sees there are 1M+ hits he will start adding additional words to the query. A "Sort by relevance" option would allow him to do this quickly, while turning scoring back on when the number of hits is reasonable. Is this possible? I'm using Lucene.NET 2.9.2 but AFAIK it is identical to the Java version.

Read the article

Hyphens in Lucene

- by user72185

Hi, I'm playing around with Lucene and noticed that the use of a hyphen (e.g. "semi-final") will result in two words ("semi" and "final" in the index. How is this supposed to match if the users searches for "semifinal", in one word? Edit: I'm just playing around with the StandardTokenizer class actually, maybe that is why? Am I missing a filter? Thanks! (Edit) My code looks like this: StandardAnalyzer sa = new StandardAnalyzer(); TokenStream ts = sa.TokenStream("field", new StringReader("semi-final")); while (ts.IncrementToken()) { string t = ts.ToString(); Console.WriteLine("Token: " + t); }

Search Results

Search found 393 results on 16 pages for 'lucene'.

Page 1/16 | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >

- by Carson

- by Mikos

- by javarg

- by Judah Himango

- by thebluefox

- by dark-elf2

- by Kumar

- by Midhat

- by Jimmy

- by Brandon

- by Eleasar

- by Lucas

- by Sole

- by Josh Handel

- by Peter

- by mP

- by Brandon

- by Josh Handel

- by Matt

- by acidzombie24

- by Ali Shafai

- by Pete Alvin

- by Akhil

- by user72185

- by user72185

1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >