Search Results

Search found 393 results on 16 pages for 'lucene'.

Page 5/16

  • Get highest frequency terms from Lucene index

    - by Julia
    Hello! I need to extract the highest-frequency terms from several Lucene indexes, to use them for some semantic analysis. I want to get roughly the top 30 most frequently occurring terms (I have not settled on a threshold yet; I will analyze the results) along with their per-index counts. I am aware that I might lose some precision because of potentially dropped duplicates, but for now let's say I am OK with that. Speed is not important, since this is a one-off static analysis; I would rather put the accent on simplicity of implementation, because I am not very experienced with Lucene and cannot wrap my mind around many of its concepts yet. I cannot find any code samples for something similar, so any concrete advice (code, pseudocode, links to code samples) would be much appreciated. Thank you!
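
    A minimal sketch of one way to do this, assuming the Lucene 2.9/3.x API (TermEnum) and a hypothetical field name "text": iterate the index's term dictionary and keep the 30 terms with the highest document frequency in a small priority queue. Note that docFreq() counts the number of documents containing a term, not its total number of occurrences, which is often good enough for this kind of analysis.

        import java.io.File;
        import java.util.Comparator;
        import java.util.PriorityQueue;

        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.index.TermEnum;
        import org.apache.lucene.store.FSDirectory;

        public class TopTerms {

            // Simple holder for a term and its document frequency.
            static class TermFreq {
                final String text;
                final int docFreq;
                TermFreq(String text, int docFreq) { this.text = text; this.docFreq = docFreq; }
            }

            // usage: java TopTerms <indexDir>
            public static void main(String[] args) throws Exception {
                IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
                // Min-heap ordered by docFreq: the smallest of the current top 30 sits at the head.
                PriorityQueue<TermFreq> queue = new PriorityQueue<TermFreq>(30, new Comparator<TermFreq>() {
                    public int compare(TermFreq a, TermFreq b) { return a.docFreq - b.docFreq; }
                });
                TermEnum terms = reader.terms();
                while (terms.next()) {
                    Term term = terms.term();
                    if (!"text".equals(term.field())) continue;      // hypothetical field name
                    queue.offer(new TermFreq(term.text(), terms.docFreq()));
                    if (queue.size() > 30) queue.poll();             // drop the current minimum
                }
                terms.close();
                reader.close();
                // Printed in ascending order of document frequency.
                while (!queue.isEmpty()) {
                    TermFreq tf = queue.poll();
                    System.out.println(tf.text + "\t" + tf.docFreq);
                }
            }
        }

    Running it once per index gives the per-index counts; Lucene's contrib also ships a HighFreqTerms tool that does essentially the same thing.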

    Read the article

  • Exception when indexing text documents with Lucene, using SnowballAnalyzer for cleaning up

    - by Julia
    Hello! I am indexing documents with Lucene and am trying to apply the SnowballAnalyzer for punctuation and stopword removal from the text, but I keep getting the following error:

        IllegalAccessError: tried to access method org.apache.lucene.analysis.Tokenizer.(Ljava/io/Reader;)V from class org.apache.lucene.analysis.snowball.SnowballAnalyzer

    Here is the code; I would very much appreciate any help, as I am new to this.

        import java.io.File;
        import java.io.FileNotFoundException;
        import java.io.FileReader;
        import java.io.IOException;
        import java.util.Date;

        import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
        import org.apache.lucene.analysis.SimpleAnalyzer;
        import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FSDirectory;

        public class Indexer {

            private Indexer() {}

            private String[] stopWords = {....};
            private String indexName;
            private IndexWriter iWriter;
            private static String FILES_TO_INDEX = "/Users/ssi/forindexing";

            public static void main(String[] args) throws Exception {
                Indexer m = new Indexer();
                m.index("./newindex");
            }

            public void index(String indexName) throws Exception {
                this.indexName = indexName;
                final File docDir = new File(FILES_TO_INDEX);
                if (!docDir.exists() || !docDir.canRead()) {
                    System.err.println("Something wrong... " + docDir.getPath());
                    System.exit(1);
                }
                Date start = new Date();
                // SimpleAnalyzer for every field except "text", which gets the SnowballAnalyzer.
                PerFieldAnalyzerWrapper analyzers = new PerFieldAnalyzerWrapper(new SimpleAnalyzer());
                analyzers.addAnalyzer("text", new SnowballAnalyzer("English", stopWords));
                Directory directory = FSDirectory.open(new File(this.indexName));
                IndexWriter.MaxFieldLength maxLength = IndexWriter.MaxFieldLength.UNLIMITED;
                iWriter = new IndexWriter(directory, analyzers, true, maxLength);
                System.out.println("Indexing to dir..........." + indexName);
                if (docDir.isDirectory()) {
                    File[] files = docDir.listFiles();
                    if (files != null) {
                        for (int i = 0; i < files.length; i++) {
                            try {
                                indexDocument(files[i]);
                            } catch (FileNotFoundException fnfe) {
                                fnfe.printStackTrace();
                            }
                        }
                    }
                }
                System.out.println("Optimizing...... ");
                iWriter.optimize();
                iWriter.close();
                Date end = new Date();
                System.out.println("Time to index was " + (end.getTime() - start.getTime()) + " milliseconds");
            }

            private void indexDocument(File someDoc) throws IOException {
                Document doc = new Document();
                Field name = new Field("name", someDoc.getName(), Field.Store.YES, Field.Index.ANALYZED);
                Field text = new Field("text", new FileReader(someDoc), Field.TermVector.WITH_POSITIONS_OFFSETS);
                doc.add(name);
                doc.add(text);
                iWriter.addDocument(doc);
            }
        }

    Read the article

  • Search Lucene with precise edit distances

    - by askullhead
    I would like to search a Lucene index with edit distances. For example, say there is a document with a field FIRST_NAME; I want all documents whose first names are exactly 1 edit distance away from, say, 'john'. I know that Lucene supports fuzzy searches (FIRST_NAME:john~) and takes a number between 0 and 1 to control the fuzziness. The problem (for me) is that this number does not directly translate to an edit distance. And when the values in the documents are short strings (fewer than 3 characters), the fuzzy search has difficulty finding them. For example, if there is a document with FIRST_NAME 'J' and I search for FIRST_NAME:I~0.0, I don't get anything back.
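
    One way around the fractional threshold, sketched against the Lucene 2.x/3.x TermEnum API with a plain Levenshtein implementation: enumerate the indexed terms for the field, keep the ones within the exact edit distance wanted, and OR them together as TermQuerys. For short fields like first names the term dictionary is small, so this is usually affordable; note that BooleanQuery has a default limit of 1024 clauses.

        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.index.TermEnum;
        import org.apache.lucene.search.BooleanClause;
        import org.apache.lucene.search.BooleanQuery;
        import org.apache.lucene.search.TermQuery;

        public class ExactEditDistanceQuery {

            // Build a query matching every term of 'field' whose Levenshtein distance
            // to 'target' is at most maxEdits, bypassing FuzzyQuery's fractional threshold.
            public static BooleanQuery build(IndexReader reader, String field,
                                             String target, int maxEdits) throws Exception {
                BooleanQuery query = new BooleanQuery();
                TermEnum terms = reader.terms(new Term(field, ""));
                try {
                    do {
                        Term t = terms.term();
                        if (t == null || !t.field().equals(field)) break;
                        if (levenshtein(t.text(), target) <= maxEdits) {
                            query.add(new TermQuery(t), BooleanClause.Occur.SHOULD);
                        }
                    } while (terms.next());
                } finally {
                    terms.close();
                }
                return query;
            }

            // Plain dynamic-programming Levenshtein distance.
            static int levenshtein(String a, String b) {
                int[][] d = new int[a.length() + 1][b.length() + 1];
                for (int i = 0; i <= a.length(); i++) d[i][0] = i;
                for (int j = 0; j <= b.length(); j++) d[0][j] = j;
                for (int i = 1; i <= a.length(); i++) {
                    for (int j = 1; j <= b.length(); j++) {
                        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                        d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                           d[i - 1][j - 1] + cost);
                    }
                }
                return d[a.length()][b.length()];
            }
        }

    For the 'J' example, build(reader, "FIRST_NAME", "I", 1) would match it, since the distance is computed directly rather than through the similarity fraction, which breaks down for one-character terms.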

    Read the article

  • lucene get matched terms in query

    - by iamrohitbanga
    What is the best way to find out which terms in a query matched a given document returned as a hit in Lucene? I have tried a somewhat awkward method involving the hit-highlighting package in Lucene contrib, and also a method that searches for every word in the query against the top-most document ("docId: xy AND description: each_word_in_query"). Neither gives satisfactory results: hit highlighting does not report some of the words that matched for documents other than the first one, and I am not sure the second approach is the best alternative.
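
    A sketch of a third option, assuming the Lucene 2.9/3.x API: rewrite the query against the reader (so wildcard and fuzzy queries expand to concrete terms), pull its terms out with extractTerms(), and then check each term against the hit's document via TermDocs. This reports matched terms for any hit, not just the first one.

        import java.util.HashSet;
        import java.util.Set;

        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.index.TermDocs;
        import org.apache.lucene.search.Query;

        public class MatchedTerms {

            // Returns the subset of the query's terms that actually occur in the given document.
            public static Set<Term> matchedTerms(IndexReader reader, Query query, int docId)
                    throws Exception {
                Set<Term> queryTerms = new HashSet<Term>();
                // rewrite() expands multi-term queries so extractTerms() has concrete terms to report.
                query.rewrite(reader).extractTerms(queryTerms);

                Set<Term> matched = new HashSet<Term>();
                for (Term term : queryTerms) {
                    TermDocs termDocs = reader.termDocs(term);
                    try {
                        // skipTo() lands on the first doc >= docId; check it is exactly our doc.
                        if (termDocs.skipTo(docId) && termDocs.doc() == docId) {
                            matched.add(term);
                        }
                    } finally {
                        termDocs.close();
                    }
                }
                return matched;
            }
        }

    searcher.explain(query, docId) is another way to see, per hit, which clauses contributed to the score, though its output takes some parsing.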

    Read the article

  • Can I store and join based on external attributes in Lucene/Solr

    - by Kibbee
    Is there a way to store information about documents that are stored in Lucene such that I don't have to update the entire document to update certain attributes of those documents? For instance, let's say I had a bunch of documents, and I wanted to update a list of permissions governing who is allowed to see each document on a daily (or more frequent) basis. Would it be possible to update all the permissions each day without updating all the documents? I could do it by keeping track of exactly which permissions were added and removed, but I would rather just take the final list of permissions and use that, rather than have to track every permission change and re-post the entire documents to Lucene.
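
    If the permission list can live outside Lucene entirely (a database, a cache, ...), one option is to apply it at query time as a filter instead of storing it in the documents, so nothing in the index needs updating when permissions change. A sketch assuming Lucene 2.9/3.x and a hypothetical stored "id" field:

        import java.util.List;

        import org.apache.lucene.index.Term;
        import org.apache.lucene.search.BooleanClause;
        import org.apache.lucene.search.BooleanQuery;
        import org.apache.lucene.search.Filter;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.QueryWrapperFilter;
        import org.apache.lucene.search.TermQuery;
        import org.apache.lucene.search.TopDocs;

        public class ExternalPermissionSearch {

            // Restricts a search to the documents a user may see, using a permission list
            // kept outside Lucene. The index itself is never rewritten; only this filter
            // is rebuilt when the permissions change.
            public static TopDocs search(IndexSearcher searcher, Query userQuery,
                                         List<String> allowedDocIds) throws Exception {
                BooleanQuery allowed = new BooleanQuery(true);   // disable coord: it is a pure filter
                for (String id : allowedDocIds) {
                    allowed.add(new TermQuery(new Term("id", id)), BooleanClause.Occur.SHOULD);
                }
                Filter permissionFilter = new QueryWrapperFilter(allowed);
                return searcher.search(userQuery, permissionFilter, 100);
            }
        }

    For large permission lists the 1024-clause default limit of BooleanQuery becomes a problem; a custom Filter that sets bits directly from the external list scales better, and a CachingWrapperFilter keeps it cheap across searches.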

    Read the article

  • Using FieldSelector when searching with Lucene

    - by Christian
    I'm searching articles in PubMed via Lucene. Each of the 20,000,000 articles has an abstract of ~250 words and an ID. At the moment I store the results of each search, which takes multiple seconds, in a TopDocs object. A search can find thousands of articles. I'm only interested in the ID of each article. Does Lucene internally load the abstracts into the TopDocs? If so, can I prevent that behavior through FieldSelectors, or do FieldSelectors only work with IndexReader and not with IndexSearcher?
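
    TopDocs itself only holds document numbers and scores; stored fields (including the abstracts) are read lazily, when IndexSearcher.doc() is called for a hit, and a FieldSelector can be passed at that point. A sketch assuming Lucene 2.9/3.x and a hypothetical stored "id" field:

        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.FieldSelector;
        import org.apache.lucene.document.MapFieldSelector;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.ScoreDoc;
        import org.apache.lucene.search.TopDocs;

        public class IdOnlySearch {

            public static void printIds(IndexSearcher searcher, Query query) throws Exception {
                TopDocs hits = searcher.search(query, 1000);
                // Load only the "id" field when materializing each hit; the abstract stays on disk.
                FieldSelector idOnly = new MapFieldSelector(new String[] { "id" });
                for (ScoreDoc hit : hits.scoreDocs) {
                    Document doc = searcher.doc(hit.doc, idOnly);
                    System.out.println(doc.get("id"));
                }
            }
        }

    If the abstracts do not need to be stored at all (only indexed), dropping Field.Store.YES for that field at indexing time shrinks the index and makes document loading cheap regardless of selectors.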

    Read the article

  • Lucene boost: I need to make it work better

    - by zvikico
    I'm using Lucene to index components with names and types. Some components are more important and thus get a bigger boost. However, I cannot get the boost to work properly: some components still appear later (with a worse score) even though they have a higher boost. Note that the indexing is done on one field only, and I've set the boost on that field alone. I'm using Lucene in Java. I don't think it has anything to do with field length; I've seen components with the same name (but a different type) get the wrong score.
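
    Two things worth checking, sketched below against the Lucene 2.9/3.x API. First, index-time field boosts are folded into the field norm, which is stored as a single byte, so small boost differences can be rounded away entirely; clearly larger multipliers (or query-time boosts) survive better. Second, IndexSearcher.explain() prints exactly how boost, norm, tf and idf combine for a given hit, which usually shows where the unexpected score comes from. The field name "name" and the boost values here are only placeholders.

        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.search.Explanation;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.Query;

        public class BoostDebug {

            // Index-time boost on the single searched field; more important components
            // get a clearly larger multiplier so the one-byte norm encoding keeps the difference.
            public static Document buildDoc(String componentName, boolean important) {
                Document doc = new Document();
                Field nameField = new Field("name", componentName, Field.Store.YES, Field.Index.ANALYZED);
                nameField.setBoost(important ? 4.0f : 1.0f);
                doc.add(nameField);
                return doc;
            }

            // Prints the full scoring breakdown for one hit.
            public static void explainHit(IndexSearcher searcher, Query query, int docId)
                    throws Exception {
                Explanation explanation = searcher.explain(query, docId);
                System.out.println(explanation.toString());
            }
        }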

    Read the article

  • Lucene complex structure search

    - by archer
    Basically I have a pretty simple database that I'd like to index with Lucene. The domain classes are:

        // Person domain
        class Person {
            Set<Pair> keys;
        }

        // Pair domain
        class Pair {
            KeyItem keyItem;
            String value;
        }

        // KeyItem domain; "name" is a unique field within the DB
        class KeyItem {
            String name;
        }

    I have tens of millions of profiles and hundreds of millions of Pairs; however, since most KeyItem "name" values are duplicates, there are only a few dozen KeyItem instances. I came up with this structure to save on KeyItem instances; basically a profile with any set of fields can be saved into it. Say we have a profile with the properties:

        - name: Andrew Morton
        - education: University of New South Wales
        - country: Australia
        - occupation: Linux programmer

    To store it, we'll have a single Profile instance, four KeyItem instances (name, education, country and occupation), and four Pair instances with the values "Andrew Morton", "University of New South Wales", "Australia" and "Linux Programmer". All other profiles will reference (all or some of) the same KeyItem instances: name, education, country and occupation. My question is: how do I index all of that so I can search for a Profile by particular values of KeyItem::name and Pair::value? Ideally I'd like this kind of query to work: name:Andrew* AND occupation:Linux*. Should I create a custom Indexer and Searcher, or could I use the standard ones and just map KeyItem and Pair onto Lucene constructs somehow?
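
    Assuming the standard Indexer/Searcher are fine for this, a sketch of the usual mapping: flatten each Person into one Document, using the KeyItem name as the Lucene field name and the Pair value as the field value (the domain classes are the ones defined above, with their fields assumed accessible). A query such as name:Andrew* AND occupation:Linux* then works with the stock QueryParser.

        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;

        public class ProfileIndexer {

            // One Document per Person; every Pair becomes a field named after its
            // (deduplicated) KeyItem and carrying the Pair's value.
            public static Document toDocument(Person person) {
                Document doc = new Document();
                for (Pair pair : person.keys) {
                    doc.add(new Field(pair.keyItem.name, pair.value,
                                      Field.Store.YES, Field.Index.ANALYZED));
                }
                return doc;
            }
        }

    Because only a few dozen distinct KeyItem names exist, the resulting index has a small, stable set of fields, which is exactly what the query parser expects.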

    Read the article

  • Indexing and Searching Over Word Level Annotation Layers in Lucene

    - by dmcer
    I have a data set with multiple layers of annotation over the underlying text, such as part-of-speech tags, chunks from a shallow parser, named entities, and others from various natural language processing (NLP) tools. For a sentence like "The man went to the store", the annotations might look like:

        Word    POS    Chunk    NER
        ====    ===    =====    ========
        The     DT     NP       Person
        man     NN     NP       Person
        went    VBD    VP       -
        to      TO     PP       -
        the     DT     NP       Location
        store   NN     NP       Location

    I'd like to index a bunch of documents with annotations like these using Lucene and then perform searches across the different layers. An example of a simple query would be to retrieve all documents where Washington is tagged as a person. While I'm not absolutely committed to the notation, syntactically end users might enter the query as follows:

        Query: Word=Washington,NER=Person

    I'd also like to do more complex queries involving the sequential order of annotations across different layers, e.g. find all the documents where a word tagged as a person is followed by the words "arrived at" followed by a word tagged as a location. Such a query might look like:

        Query: "NER=Person Word=arrived Word=at NER=Location"

    What's a good way to approach this with Lucene? Is there any way to index and search over document fields that contain structured tokens?
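
    One common approach is to index all layers into a single field, emitting the word token and its annotation tokens at the same position (position increment 0), so cross-layer and sequential patterns can be expressed as span queries. A sketch of such a token stream, assuming the Lucene 3.1-style attribute API and the Word=/POS=/Chunk=/NER= notation from above:

        import java.util.List;

        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
        import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

        // One annotated word: the surface form plus its layer tags.
        class AnnotatedWord {
            final String word, pos, chunk, ner;
            AnnotatedWord(String word, String pos, String chunk, String ner) {
                this.word = word; this.pos = pos; this.chunk = chunk; this.ner = ner;
            }
        }

        // Emits "Word=...", "POS=...", "Chunk=...", "NER=..." tokens. The first token of
        // each word advances the position; the remaining layers stack on top of it with
        // a position increment of 0, so they all occupy the same slot.
        public class AnnotationTokenStream extends TokenStream {
            private final List<AnnotatedWord> words;
            private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
            private final PositionIncrementAttribute posIncrAtt =
                    addAttribute(PositionIncrementAttribute.class);
            private int wordIndex = 0;
            private int layerIndex = 0;

            public AnnotationTokenStream(List<AnnotatedWord> words) {
                this.words = words;
            }

            @Override
            public boolean incrementToken() {
                if (wordIndex >= words.size()) return false;
                AnnotatedWord w = words.get(wordIndex);
                String[] layers = { "Word=" + w.word, "POS=" + w.pos,
                                    "Chunk=" + w.chunk, "NER=" + w.ner };
                clearAttributes();
                termAtt.setEmpty().append(layers[layerIndex]);
                posIncrAtt.setPositionIncrement(layerIndex == 0 ? 1 : 0);
                if (++layerIndex == layers.length) { layerIndex = 0; wordIndex++; }
                return true;
            }
        }

    A document would then get a field built from this stream (new Field("annotated", stream)), and queries across layers can be phrased as SpanQuerys over these composite tokens, e.g. an ordered SpanNearQuery for the sequential example. Whether and how the notation should be normalized for the analyzer is a separate decision.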

    Read the article

  • Lucene raise document score if sibling entity matches query

    - by Pitagoras
    I have the following design situation. I use Hibernate Search (Lucene under the hood). The application manages ITEMs, which have a title, description and tags; these are full-text indexed. We also have COLLECTIONs of ITEMs: a user can create a COLLECTION and add as many ITEMs as she wants, and an ITEM can belong to many COLLECTIONs. I have a boosted query so that search terms appearing in the tags count more than those in the title, and lastly those in the description. But I need an additional matching criterion: a given ITEM should rank better if other documents in some COLLECTION the ITEM belongs to also match the query. In other words, the title/tags/description of "fellow" items (i.e. items in a shared collection) should make the item rank better. I was thinking that adding an ITEM to a COLLECTION could add something like "extra tags" to every other ITEM in the collection, those extra tags being the elements of the added ITEM to match against. I feel a more clever solution, Lucene-wise, should exist. Any ideas/pointers are welcome. Thanks.
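
    A plain-Lucene sketch of the "extra tags" idea from the question (with Hibernate Search this would presumably be wired up through a custom bridge or a computed field, which is not shown here): besides its own boosted fields, each ITEM document gets a low-boosted field containing the tags of its fellow items, so queries matching those siblings lift the item's score without dominating it. Field names and boost values are placeholders.

        import java.util.List;

        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;

        public class ItemDocumentBuilder {

            public static Document build(String title, String description, String tags,
                                         List<String> fellowItemTags) {
                Document doc = new Document();
                addBoosted(doc, "tags", tags, 4.0f);
                addBoosted(doc, "title", title, 2.0f);
                addBoosted(doc, "description", description, 1.0f);
                // Tags of items sharing a collection with this one, weighted well below its own fields.
                addBoosted(doc, "fellowTags", join(fellowItemTags), 0.5f);
                return doc;
            }

            private static void addBoosted(Document doc, String name, String value, float boost) {
                Field field = new Field(name, value, Field.Store.NO, Field.Index.ANALYZED);
                field.setBoost(boost);
                doc.add(field);
            }

            private static String join(List<String> values) {
                StringBuilder sb = new StringBuilder();
                for (String value : values) sb.append(value).append(' ');
                return sb.toString();
            }
        }

    The obvious cost is that adding or removing an item from a collection means re-indexing its fellows, so this trades write volume for the scoring behaviour the question asks for.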

    Read the article

  • Lucene multiple indexes : Normalize document scores??

    - by Roey
    Hi all. Suppose I've got multiple Lucene indexes (not replicas) on several PCs. I query each index and then merge the results. Is there any way to normalize the document scores so that I can sort by score (relevance)? As I understand it, the score for document A from index A is not comparable with the score for document B from index B unless I do some sort of normalization, is that right? Thanks, Roey
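
    Scores from separate IndexSearchers are indeed not directly comparable, mainly because each index has its own document frequencies (IDF). If the indexes can be searched through one entry point, Lucene's MultiSearcher is meant for this: it gathers document frequencies across all sub-searchers before scoring, so the merged results share one scale. A local-disk sketch, assuming the pre-4.0 API where MultiSearcher still exists:

        import java.io.File;

        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.MultiSearcher;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.Searchable;
        import org.apache.lucene.search.TopDocs;
        import org.apache.lucene.store.FSDirectory;

        public class CombinedSearch {

            // MultiSearcher computes combined document frequencies before scoring,
            // so the returned scores can be sorted across indexes.
            public static TopDocs searchAll(String[] indexPaths, Query query) throws Exception {
                Searchable[] searchers = new Searchable[indexPaths.length];
                for (int i = 0; i < indexPaths.length; i++) {
                    searchers[i] = new IndexSearcher(FSDirectory.open(new File(indexPaths[i])));
                }
                MultiSearcher multi = new MultiSearcher(searchers);
                try {
                    return multi.search(query, 50);
                } finally {
                    multi.close();
                }
            }
        }

    With the indexes on different PCs the same idea applies, but the per-index searches would need to share global document frequencies in some way; merging raw per-index scores without that step stays apples to oranges.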

    Read the article

  • how to perform proper indexing and searching in Lucene.Net

    - by Ashish
    Dear all, I have a list of all the words in a document. I want to index it, and later I want to retrieve a particular word together with some nearby words (the 10 words before the match and the 10 words after it). What is the proper way to do this kind of indexing and searching in Lucene.Net? Please reply as soon as possible. Thank you, Ashish

    Read the article

  • How to change default conjunction with Lucene MultiFieldQueryParser

    - by Luke H
    I have some code using Lucene that leaves the default conjunction operator as OR, and I want to change it to AND. Some of the code just uses a plain QueryParser, and that's fine - I can just call setDefaultOperator on those instances. Unfortunately, in one place the code uses a MultiFieldQueryParser, and calls the static "parse" method (taking String, String[], BooleanClause.Occur[], Analyzer), so it seems that setDefaultOperator can't help, because it's an instance method. Is there a way to keep using the same parser but have the default conjunction changed?
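
    One option, sketched below for the Lucene 2.9/3.x API: MultiFieldQueryParser extends QueryParser, so constructing it as an instance (rather than going through the static parse helper) makes setDefaultOperator available. The trade-off is that the per-field BooleanClause.Occur flags the static method accepts are lost; if those matter, the individual field clauses would have to be built explicitly instead. The analyzer and version here are placeholders.

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.queryParser.MultiFieldQueryParser;
        import org.apache.lucene.queryParser.QueryParser;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.util.Version;

        public class AndByDefault {

            public static Query parse(String queryText, String[] fields) throws Exception {
                MultiFieldQueryParser parser = new MultiFieldQueryParser(
                        Version.LUCENE_30, fields, new StandardAnalyzer(Version.LUCENE_30));
                // The instance parser exposes what the static parse(...) helper hides.
                parser.setDefaultOperator(QueryParser.AND_OPERATOR);
                return parser.parse(queryText);
            }
        }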

    Read the article

  • Zend Lucene - cannot search numbers

    - by Pavel Dubinin
    Using Zend Lucene, I cannot search for numbers in the description field. I added the field like this:

        $doc->addField(Zend_Search_Lucene_Field::Text('description', $current_item['item_short_description'], 'utf-8'));

    Googling suggested that applying the following code should solve the problem, but it did not:

        Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive());

    Any thoughts?

    Read the article

  • Lucene.Net Keyword based case insensitive query ?

    - by Yoann. B
    Hi, I need to make an exact but case-insensitive keyword match query in Lucene.Net. I tried using KeywordAnalyzer, but it is case sensitive. Sample:

        Keyword: "Windows Server 2003" = got results
        Keyword: "windows server 2003" = no results

    Another sample (multiple keywords):

        Keywords: "ASP.NET, SQL Server" = got results
        Keywords: "asp.net, sql server" = no results
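
    KeywordAnalyzer keeps the whole field value as a single token but never lowercases it, so the two spellings end up as different terms. A common workaround is a small analyzer that chains KeywordTokenizer with LowerCaseFilter, used at both index and query time. Sketched here against the Java Lucene 3.x API, which Lucene.Net mirrors almost class for class:

        import java.io.Reader;

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.KeywordTokenizer;
        import org.apache.lucene.analysis.LowerCaseFilter;
        import org.apache.lucene.analysis.TokenStream;

        // Emits the entire field value as one lowercased token, making keyword
        // matches exact but case-insensitive.
        public class LowerCaseKeywordAnalyzer extends Analyzer {
            @Override
            public TokenStream tokenStream(String fieldName, Reader reader) {
                return new LowerCaseFilter(new KeywordTokenizer(reader));
            }
        }

    The simplest alternative is to lowercase the value yourself before indexing it as NOT_ANALYZED and lowercase the query term the same way; either way, the same normalization has to happen on both sides.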

    Read the article

  • Distributed Lucene.NET

    - by user72185
    Hi, I have a Terabyte of data, maybe more, which I'd like to index and search with Lucene. I'd like to be able to split the index out to different machines, similar to what Solr does (if I understand Solr correctly). Are there any existing tools to do this on the Windows platform? Thanks!

    Read the article

  • lucene query issue

    - by Sunil
    I am using Lucene with Alfresco. Here is my query:

        ( TYPE:"{com.company.customised.content.model}test" && (@\{com.company.customised.content.model\}testNo:111 && (@\{com.company.customised.content.model\}skill:or))

    I have to find documents whose skill property has the value "or". The query above does not return any results (I get a "failed to parse query" error). If I use the query only up to testNo (leaving out skill), I get the expected results:

        ( TYPE:"{com.company.customised.content.model}test" && (@\{com.company.customised.content.model\}testNo:111))

    Can you please help me? Thanks

    Read the article

  • Cheat sheets for Lucene/Solr?

    - by noname
    Is there any cheat sheet out there for Lucene/Solr query parameters and schema.xml elements (all the analyzers, tokenizers, etc.)? Or somewhere else I can find ALL the query parameters? I can't find any with Google.

    Read the article
