Search Results

Search found 7 results on 1 page for 'tfidf'.

Page 1/1

  • tfidf, am I understanding it right?

    - by alskndalsnd
    Hey everyone, I am interested in doing some document clustering, and right now I am considering using TF-IDF for it. If I am not wrong, TF-IDF is mainly used for evaluating the relevance of a document given a query. If I do not have a particular query, how can I apply TF-IDF to clustering?
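    No query is needed for clustering: each document becomes a TF-IDF weighted term vector, and any vector clustering algorithm can then run over those vectors. A minimal sketch of the idea, assuming scikit-learn (which the post does not mention; the toy corpus and cluster count are made up):

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import KMeans

        docs = ["the cat sat", "the dog barked", "cats and dogs play"]  # toy corpus
        X = TfidfVectorizer().fit_transform(docs)                 # one TF-IDF vector per document
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)   # cluster those vectors
        print(labels)                                             # cluster id per document, e.g. [0 1 1]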

    Read the article

  • Package to compare LSA, TFIDF, Cosine metrics and Language Models

    - by gouwsmeister
    Hi, I'm looking for a package (any language, really) that I can use on a corpus of 50 documents to perform inter-document similarity testing with various metrics, such as TF-IDF, Okapi BM25, language models, LSA, etc. As a result I want a document similarity matrix, i.e. doc1 is x% similar to doc2, etc. This is for research purposes, not for production. I specifically want the document similarity matrix because I want to correlate it with human ratings. Thank you in advance!
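    For the TF-IDF/cosine entry in such a comparison, the similarity matrix itself takes only a few lines in most toolkits. A sketch assuming scikit-learn (the post names no library, and `paths` standing for the 50 document files is hypothetical):

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        docs = [open(p).read() for p in paths]      # 'paths': the 50 corpus files (hypothetical)
        X = TfidfVectorizer().fit_transform(docs)   # TF-IDF vector per document
        sim = cosine_similarity(X)                  # sim[i, j] = cosine similarity of doc i and doc j

    The resulting 50x50 matrix can then be flattened and correlated against the human ratings.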

    Read the article

  • Lucene numDocs and docFreq in a custom Similarity class

    - by David A
    Hi all, I'm building an application with Lucene (I'm a noob with it) and I'm facing some problems. My application uses the Lucene 2.4.0 library with a custom Similarity implementation (the jar is imported). In my app I calculate docFreq and numDocs manually (I add up the values from all indexes and then compute a global value to use on every query), and I want to use those values in a custom Similarity implementation to calculate a new IDF. The problem is that I don't know how to use (or pass) the new docFreq and numDocs values from my app in that new Similarity implementation, as I don't want to change Lucene's code apart from this extra class. Any suggestions or examples? I read the docs but I don't know how to approach this. Thanks
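    The usual pattern is to pass the aggregated statistics into the custom Similarity through its constructor and have the overridden idf() ignore the per-index values Lucene supplies, so nothing in Lucene itself changes. The arithmetic being injected can be sketched outside Java; here in Python with invented names, where the formula mirrors the shape of Lucene's default IDF, log(numDocs / (docFreq + 1)) + 1:

        import math

        class GlobalIdfSimilarity(object):
            """Sketch only: IDF from externally aggregated counts; not Lucene API."""
            def __init__(self, global_num_docs, global_doc_freq):
                self.num_docs = global_num_docs    # numDocs summed over all indexes
                self.doc_freq = global_doc_freq    # {term: docFreq summed over all indexes}

            def idf(self, term):
                df = self.doc_freq.get(term, 0)
                # Same shape as Lucene's default IDF formula
                return math.log(self.num_docs / float(df + 1)) + 1.0

    In Java this would be a Similarity subclass whose constructor stores the two global values, so only the extra class is touched.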

    Read the article

  • Python Memory leak - Solved, but still puzzled

    - by disappearedng
    Dear everyone, I have successfully debugged my own memory leak problem. However, I have noticed a very strange occurrence.

        for fid, fv in freqDic.iteritems():
            outf.write(fid + "\t")  # ID
            for i, term in enumerate(domain):  # vector
                tfidf = self.tf(term, fv) * self.idf(term, docFreqDic)
                if i == len(domain) - 1:
                    outf.write("%f\n" % tfidf)
                else:
                    outf.write("%f\t" % tfidf)
            outf.flush()
            print "Memory increased by", int(self.memory_mon.usage()) - startMemory
        outf.close()

        def tf(self, term, freqVector):
            total = freqVector[TOTAL]
            if total == 0:
                return 0
            if term not in freqVector:  # when these two lines are missing, the memory leak occurs
                return 0
            return float(freqVector[term]) / freqVector[TOTAL]

        def idf(self, term, docFrequencyPerTerm):
            if term not in docFrequencyPerTerm:
                return 0
            return math.log(float(docFrequencyPerTerm[TOTAL]) / docFrequencyPerTerm[term])

    Basically, let me describe my problem:

      1) I am doing TF-IDF calculations.
      2) I traced the source of the memory leak to defaultdict.
      3) I am using memory_mon from http://stackoverflow.com/questions/276052/how-to-get-current-cpu-and-ram-usage-in-python
      4) The reason for my memory leak is as follows: in self.tf, if the two lines "if term not in freqVector: return 0" are not added, the memory leak occurs. (I verified this myself using memory_mon and noticed a sharp increase in memory that kept on growing.)

    The solution to my problem: since fv is a defaultdict, any lookup of a key that is not found in fv will create an entry for it. Over a very large domain, this causes the memory leak. I decided to use dict instead of defaultdict and the memory problem went away. My only puzzle is: since fv is created in "for fid, fv in freqDic.iteritems():", shouldn't fv be destroyed at the end of every iteration? I tried putting gc.collect() at the end of the for loop, but gc was not able to collect everything (it returns 0). Yes, the hypothesis is right, but the memory usage should stay fairly consistent from iteration to iteration if the loop destroyed all of its temporary variables.
    This is what it looks like with those two lines in self.tf, reading off the successive "Memory increased by" values: 12, 948, 28, 36, 36, 32, 28, 32, 32, 32, 40, 32, 32, 28. And without the two lines: 1652, 3576, 4220, 5760, 7296, 8840, 10456, 12824, 13460, 15000, 17448, 18084, 19628, 22080, 22708, 24248, 26704, 27332, 28864, 30404, 32856, 33552, 35024, 36564, 39016, 39924, 42104, 42724, 44268, 46720, 47352, 48952, 50428, 51964, 53508, 55960, 56584, 58404, 59668, 61208, 62744, 64400. I look forward to your answer.
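    The defaultdict behaviour behind the fix, and the resolution of the puzzle, can be reproduced in a few lines: merely looking up a missing key inserts it, and fv is not a per-iteration temporary but a reference to the value stored inside freqDic, so the inserted keys outlive the loop body:

        from collections import defaultdict

        freqDic = {"doc1": defaultdict(int)}
        for fid, fv in freqDic.iteritems():  # Python 2, as in the post
            fv["unseen term"]                # a mere lookup inserts the key...
        print len(freqDic["doc1"])           # ...and it persists in freqDic: prints 1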

    Read the article

  • create TableModel and populate jTable dynamically

    - by Julia
    Hi all! I want to store the results of reading a Lucene index into a JTable, so that I can make it sortable by different columns. From the index I am reading terms with different measures of their frequencies. The table columns are: [string term][int absFrequency][int docFrequency][double invFrequency]. In an AbstractTableModel I can define the column names, but I don't know how to get the Object[][] data with the results from the following method:

        public static void FrequencyMap(Directory indexDir) throws Exception {
            List<Object> redoviLista = new ArrayList<Object>();
            //final Map<String,TermRow> map = new TreeMap<String,TermRow>();
            List<String> termList = new ArrayList<String>();
            IndexReader iReader = IndexReader.open(indexDir);
            FilterIndexReader fReader = new FilterIndexReader(iReader);
            int numOfDocs = fReader.numDocs();
            TermEnum terms = fReader.terms();
            while (terms.next()) {
                Term term = terms.term();
                String termText = term.text();
                termList.add(termText);
                // Calculating the frequencies
                int df = iReader.docFreq(term);
                double idf = Math.log10((double) numOfDocs / df);
                double tfidf = df * idf;
                // Here comes the important part
                Object[] oneTableRow = {termText, df, idf, tfidf};
                redoviLista.add(oneTableRow); // store the rows in a list, copy them into the table later
            }
            iReader.close();
            // I need something like this, and the array has to be available outside this method
            Object[][] data = new Object[redoviLista.size()][];
            for (int i = 0; i < data.length; i++) {
                data[i] = (Object[]) redoviLista.get(i); // each stored row becomes one table row
            }
        }

    So I am kind of stuck on how to proceed to implement the AbstractTableModel and populate and display this table. Please help!

    Read the article

  • Scipy Negative Distance? What?

    - by disappearedng
    I have an input file in which all the numbers are floating point to four decimal places, i.e.:

        13359 0.0000 0.0000 0.0001 0.0001 0.0002 0.0003 0.0007 ...

    (the first number is the id). My class uses the loadVectorsFromFile method, which multiplies each value by 10000 and then int()s it. On top of that, I also loop through each vector to ensure that there are no negative values inside. However, when I perform _hclustering, I continually see the error "Linkage Z contains negative values". I seriously think this is a bug, because I checked my values and they are nowhere near small enough or big enough to approach the limits of floating point numbers, and the formula that I used to derive the values in the file uses absolute values (my input is DEFINITELY right). Can someone enlighten me as to why I am seeing this weird error? What is going on that is causing this negative distance error?

        def loadVectorsFromFile(self, limit, loc, assertAllPositive=True, inflate=True):
            """Inflate to prevent "negative" distance; we use 4 decimal points, so *10000"""
            vectors = {}
            self.winfo("Each vector is set to have %d limit in length" % limit)
            with open(loc) as inf:
                for line in filter(None, inf.read().split('\n')):
                    l = line.split('\t')
                    if limit:
                        scores = map(float, l[1:limit + 1])
                    else:
                        scores = map(float, l[1:])
                    if inflate:
                        vectors[l[0]] = map(lambda x: int(x * 10000), scores)  # int might save space
                    else:
                        vectors[l[0]] = scores
            if assertAllPositive:
                # Assert that no vector has a negative value
                for dirID, l in vectors.iteritems():
                    if reduce(operator.or_, map(lambda x: x < 0, l)):
                        self.werror("Vector %s has negative values!" % dirID)
            return vectors

        def main(self, inputDir, outputDir, limit=0, inFname="data.vectors.all",
                 mappingFname='all.id.features.group.intermediate'):
            """Loads vectors from a file and starts clustering

            INPUT: vectors is { featureID: tfidfVector (list), }
            """
            IDFeatureDic = loadIdFeatureGroupDicFromIntermediate(
                pjoin(self.configDir, mappingFname))
            if not os.path.exists(outputDir):
                os.makedirs(outputDir)
            vectors = self.loadVectorsFromFile(limit, pjoin(inputDir, inFname))
            for threshold in map(lambda x: float(x) / 30, range(20, 30)):
                clusters = self._hclustering(threshold, vectors)
                if clusters:
                    outputLoc = pjoin(outputDir, "threshold.%s.result" % str(threshold))
                    with open(outputLoc, 'w') as outf:
                        for clusterNo, cluster in clusters.iteritems():
                            outf.write('%s\n' % str(clusterNo))
                            for featureID in cluster:
                                feature, group = IDFeatureDic[featureID]
                                outline = "%s\t%s\n" % (feature, group)
                                outf.write(outline.encode('utf-8'))
                            outf.write("\n")
                else:
                    continue

        def _hclustering(self, threshold, vectors):
            """Call this to vary the threshold.

            vectors: { featureID: [tfidf score, tfidf score, ...] }
            """
            clusters = defaultdict(list)
            if len(vectors) > 1:
                try:
                    results = hierarchy.fclusterdata(
                        vectors.values(), threshold, metric='cosine')
                except ValueError, e:
                    self.werror("_hclustering: %s" % str(e))
                    return False
                for i, featureID in enumerate(vectors.keys()):
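    One common cause of this error, demonstrated below on assumed toy data, is floating point rounding in the cosine metric itself: SciPy computes the cosine distance as 1 - u.v/(|u||v|), and for identical or near-identical vectors the quotient can round to slightly more than 1, producing a tiny negative "distance" that the linkage step then rejects even though the inputs are all non-negative:

        import numpy as np
        from scipy.spatial.distance import pdist

        # Two identical rows: mathematically the cosine distance is exactly 0,
        # but the computed value may come out as a tiny negative number.
        X = np.array([[9.9, 2.3, 4.2], [9.9, 2.3, 4.2]])
        print(pdist(X, metric='cosine'))  # may print e.g. [-2.22044605e-16]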

    Read the article
