lucene - Page 10 - Developer IT

Spelling correction for data normalization in Java

- by dareios

I am looking for a Java library to do some initial spell checking / data normalization on user-generated text content; imagine the interests entered in a Facebook profile. This text will be tokenized at some point (before or after spell correction, whichever works better) and some of it used as keys to search for (exact match). It would be nice to cut down on misspellings and the like to produce more matches. It would be even better if the correction performed well on tokens longer than just one word, e.g. "trinking coffee" would become "drinking coffee" and not "thinking coffee".

I found the following Java libraries for doing spelling correction:

- JAZZY does not seem to be under active development. Also, the dictionary-distance based approach seems inadequate because of the use of non-standard language in social network profiles and multi-word tokens.
- APACHE LUCENE seems to have a statistical spell checker that should be much better suited. The question here would be how to create a good dictionary. (We are not using Lucene otherwise, so there is no existing index.)

Any suggestions are welcome!
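The "trinking coffee" example can be made concrete with a small sketch. It is not the API of Jazzy or Lucene; the class name, the tiny dictionary and the bigram counts are all hypothetical. It assumes candidates are generated by plain Levenshtein distance and then ranked by how often each candidate precedes the following word.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: edit-distance candidates ranked by a bigram count.
final class ContextSpellSketch {
    // Classic Levenshtein distance (insert, delete, substitute each cost 1).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // Among dictionary words within maxDistance of `word`, pick the one that
    // most often precedes `next`. Returns `word` unchanged if nothing is close.
    static String correct(String word, String next, List<String> dictionary,
                          Map<String, Integer> bigramCounts, int maxDistance) {
        String best = word;
        int bestCount = -1;
        for (String candidate : dictionary) {
            if (editDistance(word, candidate) > maxDistance) continue;
            int count = bigramCounts.getOrDefault(candidate + " " + next, 0);
            if (count > bestCount) {
                best = candidate;
                bestCount = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up dictionary and counts, for illustration only.
        List<String> dictionary = List.of("drinking", "thinking", "coffee");
        Map<String, Integer> bigrams = Map.of("drinking coffee", 120, "thinking coffee", 2);
        System.out.println(correct("trinking", "coffee", dictionary, bigrams, 1) + " coffee");
    }
}
```

Both "drinking" and "thinking" are one edit away from "trinking", so distance alone cannot decide; the bigram count is what breaks the tie in this sketch.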


Is having a single `IndexWriter` instance in Lucene a good idea?

- by Dragos

I am trying to understand how Lucene should be used. From what I have read, creating an IndexReader is costly, so using a Search Manager should be the right choice. However, a SearchManager should be produced by an NRTManager (which, by the way, should replace the IndexWriter for every add or delete operation performed). But in order to have an NRTManager, I should first have an IndexWriter, and here comes my problem.

The documentation says:

- an IndexWriter is thread-safe
- the constructor of this class takes a Directory object, so it seems creating an instance should be costly (as in the case of an IndexReader)
- all changes are buffered and flushed periodically (so they seem to encourage using a single instance)

but:

- the changes, although flushed, will only be visible after commit or close
- after finishing the updates (add/delete), the instance should be closed

I also found this: http://stackoverflow.com/questions/5374419/forgot-to-close-the-lucene-indexwriter-after-adding-documents-to-the-index where it is said that not closing a writer might ruin everything.

So what am I really supposed to do? Is having a single IndexWriter instance a good idea (only commit and never close it)?

EDIT: What is more, if I use NRTManager, how can I make a `commit`? Is it even possible?


Hosted full text search solutions?

- by James Cooper

Does anyone know of companies offering SaaS full text search? I'm looking for something that uses Lucene, Solr, or Sphinx on the backend, and provides a REST API for submitting documents to index and running searches. I could build my own EC2 AMI, but I'd have to configure EBS and other stuff, monitor it, etc. Curious if someone has already done all this and would charge per MB/GB indexed. Thank you. -- James


Asp.net library to extract plain text from docx, pptx, xlsx (for search index)

- by Myster

Is there a pre-existing library to extract plain text from docx, pptx, and xlsx files? I require this to populate a lucene.net index. I've found this example which extracts text from docx and it seems to work OK. But before building my own solution based on this, I was wondering if there's something already available for the other file formats?


Advice on reading indexes

- by London

Hello, I'm trying to figure out the right way to read a Lucene index only once while running the application multiple times. How can I do that in Java? The indexed data will not change, so reading it each time should not be necessary. Can someone explain to me the logic of reading it only once? Thank you


How can I search in transcluded categories?

- by Wikis

I want to add functionality to a MediaWiki wiki to search in specific categories:

- Platform 1
- Platform 2
- etc.

So I created a template which, based on a certain field, assigns pages to those categories. The template was already included on most of these pages. So now most pages are in either:

- Category:Platform 1 or
- Category:Platform 2

Then I thought I just need to add incategory to the search and I'm done, as described on the Wikipedia page. But then I reread it and to my horror discovered:

incategory: – using the incategory: parameter returns pages in a given category (as long as the pages are directly categorized, and not transcluded through templates).

Eeeek! Is there any other way to search even in transcluded templates? Or any other way of resolving this?


Best way to retrieve a certain field of all documents returned by a Lucene search

- by Philipp

Hi, I was wondering what the best way is to retrieve a certain field of all documents returned by a Lucene Searcher. Background: each document has a date field (written on) and I would like to show a timeline of all found documents, so I need to extract the date (day) field of all the documents I find with the search. I currently retrieve every document using Searcher.doc(int, FieldSelector), with the selector retrieving only that field. I have indexed 250k documents; the search itself takes no time and returns about 10k document ids. Retrieving those, however, takes 20+ seconds. What can I do to speed things up but still get all the values I need? Thx in advance, Philipp


Grails searchable plugin

- by Don

Hi, in my Grails app, I'm using the Searchable plugin for searching/indexing. I want to write a Compass/Lucene query that involves multiple domain classes. Within that query, when I want to refer to the id of a class, I can't simply use 'id' because all classes have an 'id' property. Currently, I work around this problem by adding the following property to a class Foo:

    public Long getFooId() {
        return id
    }

    static transients = ['fooId']

Then when I want to refer to the id of Foo within a query I use 'fooId'. Is there a way I can provide an alias for a property in the searchable mapping rather than adding a property to the class?


MSSQL Search Proper Names Full Text Index vs LIKE + SOUNDEX

- by Matthew Talbert

I have a database of names of people that has (currently) 35 million rows. I need to know what is the best method for quickly searching these names. The current system (not designed by me) simply has the first and last name columns indexed and uses "LIKE" queries with the additional option of using SOUNDEX (though I'm not sure this is actually used much). Performance has always been a problem with this system, and so currently the searches are limited to 200 results (which still take too long to run). So, I have a few questions:

- Does full text index work well for proper names?
- If so, what is the best way to query proper names? (CONTAINS, FREETEXT, etc.)
- Is there some other system (like Lucene.net) that would be better?

Just for reference, I'm using Fluent NHibernate for data access, so methods that work well with that will be preferred. I'm using MS SQL 2008 currently.


Need help in filtering records based on radius value in solr

- by kshama

Hi, I am using Solr with Lucene spatial 2.9.1 as per http://www.ibm.com/developerworks/java/library/j-spatial/ I want to write a query that will retrieve records within a given radius using the hsin function, with cartesian tiers as filters. So I wrote a query like this:

http://localhost:8983/solr/select/?q=body:engineering colleges^0 AND _val_:"recip(hsin(0.227486,1.354193 , lat_rad, lng_rad, 4), 1, 1, 0)"^100 &&fq={!tier x=13.033993 y=77.589569 radians=false dist=4 prefix=tier_ unit=m}

My records include many US records and a few Indian records. For US records, filtering based on radius is working fine. But for Indian records the results do not vary even if I change the radius. Can anyone tell me whether something is wrong with the query, whether there is a Solr configuration issue that prevents this from working, or whether the filtering is not happening properly because the record density is much lower for Indian records? I am not able to figure it out. Thanks in advance.
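One way to narrow this down is to compute the great-circle distance independently of Solr and check which records ought to fall inside the radius. The sketch below is plain Java, not Solr's hsin implementation; the class name and the Earth radius constant are assumptions. It also shows that the radian values in the query are the degree values from the tier filter converted to radians.

```java
// Hypothetical sketch: haversine distance computed outside Solr for sanity checks.
final class HaversineSketch {
    static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius, an approximation

    // Great-circle distance in km between two points given in radians.
    static double distanceKm(double lat1, double lng1, double lat2, double lng2) {
        double dLat = lat2 - lat1;
        double dLng = lng2 - lng1;
        double h = Math.pow(Math.sin(dLat / 2), 2)
                 + Math.cos(lat1) * Math.cos(lat2) * Math.pow(Math.sin(dLng / 2), 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
    }

    public static void main(String[] args) {
        // Centre point from the question, converted from degrees to radians.
        double lat = Math.toRadians(13.033993);
        double lng = Math.toRadians(77.589569);
        System.out.printf("centre in radians: %.6f, %.6f%n", lat, lng);

        // A made-up record 0.01 degrees further north, for illustration only.
        double otherLat = Math.toRadians(13.043993);
        System.out.printf("distance: %.3f km%n", distanceKm(lat, lng, otherLat, lng));
    }
}
```

If distances computed this way look right for the Indian records, the problem is more likely in the tier filter or the units than in the stored coordinates.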


JMS message received at only one server

- by BJH

I'm having a problem with a JEE6 application running in a clustered environment using WebSphere Application Server 8. A search index is used for quick search in the UI (using Lucene), and it must be re-indexed after new data arrives in the corresponding DB layer. To achieve this we're sending a JMS message to the application, and then the search index is refreshed. The problem is that the message only arrives at one of the cluster members, so only there is the search index up to date. On the other servers it remains outdated. How can I make sure the search index gets updated on all cluster members? Can I receive the message somehow on all servers? Or is there a better way to do this?


Nested BooleanQuery?

- by KailZhang

I'm using a BooleanQuery to combine several queries. I find that if I add a BooleanQuery to the BooleanQuery, then no result is returned. The added BooleanQuery is a MUST_NOT one, like -city_id:100. But as Lucene's spec says, a BooleanQuery can be nested, which I think means it's okay to add such a BooleanQuery. For now I have to get all clauses from the BooleanQuery to be added, and then add them to the container BooleanQuery one by one. I'm a bit confused. Could anybody help? Thank you very much!
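A likely explanation, as far as I know, is that a BooleanQuery containing only MUST_NOT clauses matches nothing by itself: prohibited clauses only remove documents that some other clause selected. A commonly suggested workaround is to give the nested query a match-all clause (MatchAllDocsQuery) as a MUST. The sketch below is a toy model with sets of document ids, not Lucene code; all names and ids are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model of required/prohibited clause semantics.
final class ProhibitedClauseSketch {
    // Result = documents selected by the required part, minus prohibited matches.
    static Set<Integer> evaluate(Set<Integer> required, Set<Integer> prohibited) {
        Set<Integer> result = new TreeSet<>(required);
        result.removeAll(prohibited);
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> allDocs = Set.of(1, 2, 3, 4);
        Set<Integer> cityId100 = Set.of(2, 4); // docs matching city_id:100

        // Nested query with only a prohibited clause: nothing was selected to begin with.
        System.out.println(evaluate(Set.of(), cityId100));

        // Nested query with a match-all required clause plus the prohibited clause.
        System.out.println(evaluate(allDocs, cityId100));
    }
}
```

This also fits the observation that flattening the clauses into the container query works: there the prohibited clause has the container's other clauses to subtract from.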


Boost Solr results based on the field that contained the hit

- by TomFor

Hi, I was browsing the web looking for an indexing and search framework and stumbled upon Solr. A piece of functionality that we absolutely need is to boost results based on which field contained the hit. A small example: consider a record like this:

    <movie>
      <title>The Dark Knight</title>
      <alternative_title>Batman Begins 2</alternative_title>
      <year>2008</year>
      <director>Christopher Nolan</director>
      <plot>Batman, Gordon and Harvey Dent are forced to deal with the chaos unleashed by an anarchist mastermind known only as the Joker, as it drives each of them to their limits.</plot>
    </movie>

I want to combine, for example, the title, alternative_title and plot fields into one search field, which isn't too difficult after looking at the Solr/Lucene documentation and tutorials. However, I also want movies that have a hit in title to score higher than hits on alternative_title, and those in turn should score higher than hits in the plot field. Is there any way to indicate this kind of scoring in the XML, or do we need to develop some custom scoring algorithm? Please also note that the example I've given is fictional and the real data will probably contain 100+ fields. Thanks in advance, Tom
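Solr's DisMax query parser accepts per-field boosts through its qf parameter (for example qf=title^10 alternative_title^5 plot), which may avoid custom scoring; the boost values here are made up. The toy sketch below only illustrates the ordering such weights produce. It is plain Java with hypothetical names and a deliberately naive substring match, not Solr's actual scoring.

```java
import java.util.Map;

// Hypothetical toy scorer: a hit in a field contributes that field's boost.
final class FieldBoostSketch {
    // Made-up boosts: title > alternative_title > plot.
    static final Map<String, Double> BOOSTS =
            Map.of("title", 10.0, "alternative_title", 5.0, "plot", 1.0);

    static double score(Map<String, String> doc, String term) {
        String needle = term.toLowerCase();
        double score = 0.0;
        for (Map.Entry<String, Double> boost : BOOSTS.entrySet()) {
            String value = doc.getOrDefault(boost.getKey(), "");
            if (value.toLowerCase().contains(needle)) {
                score += boost.getValue();
            }
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, String> movie = Map.of(
                "title", "The Dark Knight",
                "alternative_title", "Batman Begins 2",
                "plot", "Batman, Gordon and Harvey Dent are forced to deal with the chaos");
        System.out.println("knight: " + score(movie, "knight"));
        System.out.println("gordon: " + score(movie, "gordon"));
    }
}
```

With 100+ fields, the same idea applies as long as each field (or group of fields) gets its own weight.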


SQL Server Search Proper Names Full Text Index vs LIKE + SOUNDEX

- by Matthew Talbert

I have a database of names of people that has (currently) 35 million rows. I need to know what is the best method for quickly searching these names. The current system (not designed by me) simply has the first and last name columns indexed and uses "LIKE" queries with the additional option of using SOUNDEX (though I'm not sure this is actually used much). Performance has always been a problem with this system, and so currently the searches are limited to 200 results (which still take too long to run). So, I have a few questions:

- Does full text index work well for proper names?
- If so, what is the best way to query proper names? (CONTAINS, FREETEXT, etc.)
- Is there some other system (like Lucene.net) that would be better?

Just for reference, I'm using Fluent NHibernate for data access, so methods that work well with that will be preferred. I'm using SQL Server 2008 currently.

EDIT: I want to add that I'm very interested in solutions that will deal with things like commonly misspelled names, e.g. 'smythe', 'smith', as well as first names, e.g. 'tomas', 'thomas'.

Query Plan

    |--Parallelism(Gather Streams)
    |--Nested Loops(Inner Join, OUTER REFERENCES:([testdb].[dbo].[Test].[Id], [Expr1004]) OPTIMIZED WITH UNORDERED PREFETCH)
    |--Hash Match(Inner Join, HASH:([testdb].[dbo].[Test].[Id])=([testdb].[dbo].[Test].[Id]))
    | |--Bitmap(HASH:([testdb].[dbo].[Test].[Id]), DEFINE:([Bitmap1003]))
    | | |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([testdb].[dbo].[Test].[Id]))
    | | |--Index Seek(OBJECT:([testdb].[dbo].[Test].[IX_Test_LastName]), SEEK:([testdb].[dbo].[Test].[LastName] >= 'WHITDþ' AND [testdb].[dbo].[Test].[LastName] < 'WHITF'), WHERE:([testdb].[dbo].[Test].[LastName] like 'WHITE%') ORDERED FORWARD)
    | |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([testdb].[dbo].[Test].[Id]))
    | |--Index Seek(OBJECT:([testdb].[dbo].[Test].[IX_Test_FirstName]), SEEK:([testdb].[dbo].[Test].[FirstName] >= 'THOMARþ' AND [testdb].[dbo].[Test].[FirstName] < 'THOMAT'), WHERE:([testdb].[dbo].[Test].[FirstName] like 'THOMAS%' AND PROBE([Bitmap1003],[testdb].[dbo].[Test].[Id],N'[IN ROW]')) ORDERED FORWARD)
    |--Clustered Index Seek(OBJECT:([testdb].[dbo].[Test].[PK__TEST__3214EC073B95D2F1]), SEEK:([testdb].[dbo].[Test].[Id]=[testdb].[dbo].[Test].[Id]) LOOKUP ORDERED FORWARD)

SQL for above:

    SELECT * FROM testdb.dbo.Test WHERE LastName LIKE 'WHITE%' AND FirstName LIKE 'THOMAS%'

Based on advice from Mitch, I created an index like this:

    CREATE INDEX IX_Test_Name_DOB ON Test (LastName ASC, FirstName ASC, BirthDate ASC) INCLUDE (and here I list the other columns)

My searches are now incredibly fast for my typical search (last, first, and birth date).
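On the misspelled-name part of the question, a phonetic key is what lets 'smythe' and 'smith' meet. The sketch below implements the standard American Soundex rules in plain Java to show the idea; the class name is hypothetical, and SQL Server's own SOUNDEX may differ in edge cases, so treat it as an illustration rather than a drop-in equivalent.

```java
import java.util.Locale;

// Hypothetical sketch of standard American Soundex (letter + three digits).
final class SoundexSketch {
    // Digit for each letter a..z; '0' marks letters that are not coded.
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String name) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char raw : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') continue;            // skip non-letters
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {                          // keep the first letter as-is
                out.append(raw);
                last = code;
                continue;
            }
            if (raw == 'H' || raw == 'W') continue;           // H and W do not separate equal codes
            if (code != '0' && code != last) out.append(code);
            last = code;                                      // vowels reset the previous code
            if (out.length() == 4) break;
        }
        while (out.length() < 4) out.append('0');             // pad short keys with zeros
        return out.toString();
    }

    public static void main(String[] args) {
        for (String n : new String[] {"smith", "smythe", "tomas", "thomas"}) {
            System.out.println(n + " -> " + soundex(n));
        }
    }
}
```

Storing such a key in its own indexed column would let the lookup be an equality seek instead of a function call per row, assuming the schema can take an extra column.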


How to search a PDF in Acrobat Reader AND jump to a certain page via parameter?

- by agez

Hi, we are using Lucene within a web application to search in a great number of PDF documents. The workflow is like this:

- A user enters a search term.
- A list of search results is presented to the user. Each search result represents one PDF document and shows the user on which page the search term was found. Each of these pages is represented as a hyperlink.
- If the user now clicks on such a hyperlink, he jumps directly to that page.

But now the user has the problem that the search term isn't highlighted on the page. Therefore the user has to look on his own to find the search term on the page. What we want is a way to highlight the search term on the specific page in the PDF. The open parameters for Acrobat Reader allow for either searching a PDF document (with hit highlighting) OR jumping to a specific page. But the combination of both parameters, which we would need, doesn't work. Does anyone have an idea how jumping to a page and highlighting a search term in a PDF document could work? I had a look at the Acrobat SDK but don't see how we can use it (it's terribly documented). Cheers, Helmut


Full Text Search like Google

- by Eduardo

I would like to implement full-text search in my off-line (Android) application to search the user-generated list of notes. I would like it to behave just like Google (since most people are already used to querying Google). My initial requirements are:

- Fast: like Google or as fast as possible, having 100000 documents with 200 hundred words each.
- Searching for two words should only return documents that contain both words (not just one word), unless the OR operator is used.
- Case insensitive (aka normalization): if I have the word 'Hello' and I search for 'hello' it should match.
- Diacritical mark insensitive: if I have the word 'así' a search for 'asi' should match. In Spanish, many people incorrectly either leave out diacritical marks or put them in the wrong place.
- Stop word elimination: to avoid a huge index, meaningless words like 'and', 'the' or 'for' should not be indexed at all.
- Dictionary substitution (aka stemming): similar words should be indexed as one. For example, instances of 'hungrily' and 'hungry' should be replaced with 'hunger'.
- Phrase search: if I have the text 'Hello world!' a search of '"world hello"' should not match it but a search of '"hello world"' should match.
- Search all fields (in multifield documents) if no field is specified (not just a default field).
- Auto-completion in search results while typing to give popular searches (just like Google Suggest).

How may I configure a full-text search engine to behave as much as possible like Google? (I am mostly interested in Open Source, Java and in particular Lucene.)
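The case and diacritic requirements can be illustrated with the JDK alone. In Lucene these steps would normally live in the analyzer chain as token filters (for example LowerCaseFilter and ASCIIFoldingFilter); the sketch below is not that, just a minimal stand-alone version with a hypothetical class name, assuming the same folding is applied both at index time and at query time.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical sketch: fold case and strip combining diacritical marks.
final class FoldingSketch {
    static String fold(String text) {
        // NFD splits "í" into "i" plus a combining acute accent.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Drop all combining marks, then lower-case independently of the device locale.
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(fold("As\u00ed"));   // "Así" written with an escape
        System.out.println(fold("Hello"));
    }
}
```

Because both the stored word and the query go through the same function, 'así' and 'asi' end up as the same index term.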


How can I get a job at a company when I am unfamiliar with its technology? [closed]

- by Michael Z

Sorry if I have chosen the wrong Stack Exchange site for this question. Point me to the correct place if there is one... How can I get a job at a company that lists a technology unfamiliar to me in its job requirements? In other words: how can I get a job working with Lucene if I have no experience with Lucene, when to get experience with Lucene I need to work at a company that needs developers with Lucene experience? It is a vicious circle!


Compass Autocomplete to only return index words

- by tariqj

I am currently trying to configure a compass query for autocomplete. I have it working so that the compass query will return an object. I would like to modify it so that it will return matching words in the index, not matching results. Thanks.


Solrnet /ASP.NET sample without MVC

- by Mikos

I am trying to get a handle on Solrnet and on connecting an ASP.NET site to a Solr server. However, the sample app (in the code repository) is MVC based. Does anyone know of a version in plain vanilla ASP.NET? Thanks


SOLR - Boost function (bf) to increase score of documents whose date is closest to NOW

- by Mechanic

Hi all, I have a Solr instance containing documents which have a 'startTime' field ranging from last month to a year from now. I'd like to add a boost query/function to boost the scores of documents whose startTime field is close to the current time. So far I have seen a lot of examples which use rord to add boosts to documents that are newer, but I have never seen an example of something like this. Can anyone tell me how to do it, please? Thanks
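One commonly cited shape for this is a reciprocal of the absolute time distance, along the lines of bf=recip(abs(ms(NOW,startTime)),3.16e-11,1,1), assuming the Solr version in use provides the abs and ms functions. Solr's recip(x,m,a,b) evaluates a/(m*x+b). The sketch below only reproduces that arithmetic in plain Java so the curve can be inspected; the class name is hypothetical and the constant is the usual "roughly one over the milliseconds in a year" choice.

```java
// Hypothetical sketch of the arithmetic behind recip(abs(ms(NOW,startTime)),3.16e-11,1,1).
final class DateProximityBoostSketch {
    static final double M = 3.16e-11; // roughly 1 / (milliseconds in a year)

    // a / (m * x + b) with a = 1, b = 1 and x = |now - startTime| in milliseconds.
    static double boost(long nowMillis, long startTimeMillis) {
        double distance = Math.abs((double) nowMillis - (double) startTimeMillis);
        return 1.0 / (M * distance + 1.0);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long day = 24L * 60 * 60 * 1000;
        System.out.println("today:          " + boost(now, now));
        System.out.println("30 days ahead:  " + boost(now, now + 30 * day));
        System.out.println("30 days ago:    " + boost(now, now - 30 * day));
        System.out.println("365 days ahead: " + boost(now, now + 365 * day));
    }
}
```

The absolute value is what makes past and future start times equally "close"; without it, the function would favour only one direction.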


Is Lucene.Net a good choice for website search of a 1M item product database? (giving up on SQL Server

- by Pete Alvin

We currently have SQL Server 2005 in production and we use its full text search for an eCommerce site search of a million-product database. I've optimized it as much as possible (I think) and we're still seeing search times of five seconds. (We don't need site crawl or PDF (etc.) document indexing features... JUST "Google" speed for site search.) I was going to buy dtSearch but now I realize I can just use Lucene.Net and save the $2,500 for a two-server license. I read in a post that Lucene.Net is not good for website searches. Has anyone else used Lucene.Net from ASP.NET? Does it take a lot of memory? Any problems? Any comments?


Nutch - how to crawl in small batches?

- by Yurish

Hi everyone! I am stuck! I can't get Nutch to crawl in small batches for me. I start it with the bin/nutch crawl command with parameters -depth 7 and -topN 10000, and it never ends. It only ends when my HDD has no free space left. What I need to do:

1. Start to crawl my seeds with the possibility to go further on outlinks.
2. Crawl 20000 pages, then index them.
3. Crawl another 20000 pages, index them and merge with the first index.
4. Loop step 3 n times.

I also tried the scripts found in the wiki, but none of the scripts I found go further. If I run them again, they do everything from the beginning, and at the end of the script I have the same index I had when I started to crawl. But I need to continue my crawl. Some help would be very useful!


Search for short words with SOLR

- by Carsten Gehling

I am using SOLR along with NGramTokenizerFactory to help create search tokens for substrings of words. NGramTokenizer is configured with a minimum word length of 3. This means that I can search for e.g. "unb" and then match the word "unbelievable". However, I have a problem with short words like "I" and "in". These are not indexed by SOLR (I suspect it is because of NGramTokenizer) and therefore I cannot search for them. I don't want to reduce the minimum word length to 1 or 2, since this creates a huge search index. But I would like SOLR to include whole words whose length is already below this minimum. How can I do that? /Carsten
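The desired behaviour can be stated precisely with a small sketch. This is not the NGramTokenizer source and not a Solr configuration; it is a hypothetical stand-alone function showing the rule "emit n-grams between minGram and maxGram, but keep words shorter than minGram whole".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: n-grams with a pass-through for words shorter than minGram.
final class ShortWordNGramSketch {
    static List<String> tokens(String word, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        if (word.length() < minGram) {
            // Too short to produce any n-gram: index the word itself instead of dropping it.
            if (!word.isEmpty()) out.add(word);
            return out;
        }
        for (int size = minGram; size <= maxGram; size++) {
            for (int start = 0; start + size <= word.length(); start++) {
                out.add(word.substring(start, start + size));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("in", 3, 4));
        System.out.println(tokens("unbe", 3, 4));
    }
}
```

The index stays close to its current size under this rule, since only words of one or two characters add extra terms.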


Is Nutch's language identification available in C#/.NET?

- by Pranali Desai

Is Nutch's language identification available in C#/.NET and, if so, where can I find it?


Proper snowball analyzer configuration when using Grails Searchable Plugin

- by Wirsbro

To improve stemming we want to switch from the default analyzer to snowball; however, we are having a lot of difficulty with the proper settings and would appreciate any help.

Environment:

- Sun's Java 1.6.16
- Grails 1.2.2
- Searchable Plug-In 0.5.5

Config.groovy: we have tried both of these settings:

    compassSettings = ['compass.engine.analyzer.stemmed.type': 'snowball',
                       'compass.engine.analyzer.stemmed.name': 'English']

    compassSettings = ['compass.engine.analyzer.snowball.type': 'snowball',
                       'compass.engine.analyzer.snowball.name': 'English',
                       'compass.engine.analyzer.search.type': 'snowball',
                       'compass.engine.analyzer.search.name': 'English']

Search.groovy, the invocation:

    def searchResult = searchableService.search(params.q, withHighlighter: { highlighter, index, sr ->
        if (!sr.highlights) {
            sr.highlights = []
        }
        try {
            sr.highlights[index] = highlighter.fragments("content")[0..2].join(" ")
        } catch (IndexOutOfBoundsException ex) {
            sr.highlights[index] = highlighter.fragment("content")
        }
    })
    def suggestion = searchableService.suggestQuery(params.q)
    if (suggestion != params.q) {
        searchResult.suggestedQuery = suggestion
    }

Search Results

Search found 393 results on 16 pages for 'lucene'.
