Search Results

Search found 39 results on 2 pages for 'corpus'.

Page 1/2 | 1 2  | Next Page >

  • Russian-to-English Parallel Word Corpus?

    - by Cygorger
    Hi: I am looking for a simple Russian to English word corpus. It can be as simple as a csv that lists a russian word in the first column and the equivalent English word in the second. Any ideas where I can find such a thing? Does the NLTK toolkit have something like this? Thanks

    Read the article

  • understanding semcor corpus structure h

    - by Sharmila
    I'm learning NLP. I currently playing with Word Sense Disambiguation. I'm planning to use the semcor corpus as training data but I have trouble understanding the xml structure. I tried googling but did not get any resource describing the content structure of semcor. <s snum="1"> <wf cmd="ignore" pos="DT">The</wf> <wf cmd="done" lemma="group" lexsn="1:03:00::" pn="group" pos="NNP" rdf="group" wnsn="1">Fulton_County_Grand_Jury</wf> <wf cmd="done" lemma="say" lexsn="2:32:00::" pos="VB" wnsn="1">said</wf> <wf cmd="done" lemma="friday" lexsn="1:28:00::" pos="NN" wnsn="1">Friday</wf> <wf cmd="ignore" pos="DT">an</wf> <wf cmd="done" lemma="investigation" lexsn="1:09:00::" pos="NN" wnsn="1">investigation</wf> <wf cmd="ignore" pos="IN">of</wf> <wf cmd="done" lemma="atlanta" lexsn="1:15:00::" pos="NN" wnsn="1">Atlanta</wf> <wf cmd="ignore" pos="POS">'s</wf> <wf cmd="done" lemma="recent" lexsn="5:00:00:past:00" pos="JJ" wnsn="2">recent</wf> <wf cmd="done" lemma="primary_election" lexsn="1:04:00::" pos="NN" wnsn="1">primary_election</wf> <wf cmd="done" lemma="produce" lexsn="2:39:01::" pos="VB" wnsn="4">produced</wf> <punc>``</punc> <wf cmd="ignore" pos="DT">no</wf> <wf cmd="done" lemma="evidence" lexsn="1:09:00::" pos="NN" wnsn="1">evidence</wf> <punc>''</punc> <wf cmd="ignore" pos="IN">that</wf> <wf cmd="ignore" pos="DT">any</wf> <wf cmd="done" lemma="irregularity" lexsn="1:04:00::" pos="NN" wnsn="1">irregularities</wf> <wf cmd="done" lemma="take_place" lexsn="2:30:00::" pos="VB" wnsn="1">took_place</wf> <punc>.</punc> </s> I'm assuming wnsn is 'word sense'. Is it correct? What does the attribute lexsn mean? How does it map to wordnet? What does the attribute pn refer to? (third line) How is the rdf attribute assigned? (again third line) In general, what are the possible attributes?

    Read the article

  • Is there any way to add a new location to the list of places where nltk looks for the wordnet corpus?

    - by Programming Noob
    I can't use the nltk wordnet lemmatizer because I can't download the wordnet corpus on my university computer due to access rights issues. I get the following error when I try to do so: ********************************************************************** Resource 'corpora/wordnet' not found. Please use the NLTK Downloader to obtain the resource: >>> nltk.download() Searched in: - '/home/XX/nltk_data' - '/usr/share/nltk_data' - '/usr/local/share/nltk_data' - '/usr/lib/nltk_data' - '/usr/local/lib/nltk_data' ********************************************************************** When I had the same issue at home, I could resolve it by two ways: Using nltk.download(), the standard way and Creating a new folder at location /home/XX/nltk_data and just pasting the corpus directory inside it. Now at the university I only have access to /home/XX/bin and not /home/XX directly. So is there anyway I could paste the wordnet corpus into /home/XX/bin and then somehow make nltk look for the corpus in that folder?

    Read the article

  • Downloadable HTML Test Corpus

    - by Alex Jordan
    I am working on a browser plug-in for Firefox, and I would like to be able to do some automated testing to make sure that it's handling a variety of different HTML/JavaScript features correctly. Does anyone know of a good downloadable corpus of HTML and/or JavaScript pages that could be used for this type of testing?

    Read the article

  • Algorithm detect repeating/similiar strings in a corpus of data -- say email subjects, in Python

    - by RizwanK
    I'm downloading a long list of my email subject lines , with the intent of finding email lists that I was a member of years ago, and would want to purge them from my Gmail account (which is getting pretty slow.) I'm specifically thinking of newsletters that often come from the same address, and repeat the product/service/group's name in the subject. I'm aware that I could search/sort by the common occurrence of items from a particular email address (and I intend to), but I'd like to correlate that data with repeating subject lines.... Now, many subject lines would fail a string match, but "Google Friends : Our latest news" "Google Friends : What we're doing today" are more similar to each other than a random subject line, as is: "Virgin Airlines has a great sale today" "Take a flight with Virgin Airlines" So -- how can I start to automagically extract trends/examples of strings that may be more similar. Approaches I've considered and discarded ('because there must be some better way'): Extracting all the possible substrings and ordering them by how often they show up, and manually selecting relevant ones Stripping off the first word or two and then count the occurrence of each sub string Comparing Levenshtein distance between entries Some sort of string similarity index ... Most of these were rejected for massive inefficiency or likelyhood of a vast amount of manual intervention required. I guess I need some sort of fuzzy string matching..? In the end, I can think of kludgy ways of doing this, but I'm looking for something more generic so I've added to my set of tools rather than special casing for this data set. After this, I'd be matching the occurring of particular subject strings with 'From' addresses - I'm not sure if there's a good way of building a data structure that represents how likely/not two messages are part of the 'same email list' or by filtering all my email subjects/from addresses into pools of likely 'related' emails and not -- but that's a problem to solve after this one. Any guidance would be appreciated.

    Read the article

  • Word frequency tally script is too slow

    - by Dave Jarvis
    Background Created a script to count the frequency of words in a plain text file. The script performs the following steps: Count the frequency of words from a corpus. Retain each word in the corpus found in a dictionary. Create a comma-separated file of the frequencies. The script is at: http://pastebin.com/VAZdeKXs Problem The following lines continually cycle through the dictionary to match words: for i in $(awk '{if( $2 ) print $2}' frequency.txt); do grep -m 1 ^$i\$ dictionary.txt >> corpus-lexicon.txt; done It works, but it is slow because it is scanning the words it found to remove any that are not in the dictionary. The code performs this task by scanning the dictionary for every single word. (The -m 1 parameter stops the scan when the match is found.) Question How would you optimize the script so that the dictionary is not scanned from start to finish for every single word? The majority of the words will not be in the dictionary. Thank you!

    Read the article

  • converting a treebank of vertical trees to s-expressions

    - by Andreas
    I need to preprocess a treebank corpus of sentences with parse trees. The input format is a vertical representation of trees, like so: S =NP ==(DT +def) the == (N +ani) man =VP ==V walks ...and I need it like: (S (NP (DT the) (N man)) (VP (V walks))) I have code that almost does it, but not quite. There's always a missing paren somewhere. Should I use a proper parser, maybe a CFG? The current code is at http://github.com/andreasvc/eodop/blob/master/arbobanko.py The code also contains real examples from the treebank.

    Read the article

  • Data on the Frequency of Edit Operations Required to Correct a Misspelt Word

    - by gvkv
    Does anybody know of any data that relates to the frequency of the types of mistakes the people make when they misspell a word? I'm not referring to words themselves, but tje errors that are made by the typist. For example, I personally make transposition errors the most followed by deletion errors (that is, not including a letter I should), substitution errors and lastly, insertion errors. However, it would not surprise me to find out that typing a wrong letter (a substitution error, e.g., xat instead of cat) is more frequent than not including a letter. My purpose is to be able to make best guesses at correcting a word when I only have the original user's input. The idea being that if one type of error is more frequent than others, then it's more likely that correcting a word via that type of operation is correct. I don't object to using a database of commonly misspelt words but I prefer an algorithmic solution to depending on a corpus--especially if it might be faster.

    Read the article

  • Hibernate : Foreign key constraint violation problem

    - by Vinze
    I have a com.mysql.jdbc.exceptions.MySQLIntegrityConstraintViolationException in my code (using Hibernate and Spring) and I can't figure why. My entities are Corpus and Semspace and there's a many-to-one relation from Semspace to Corpus as defined in my hibernate mapping configuration : <class name="xxx.entities.Semspace" table="Semspace" lazy="false" batch-size="30"> <id name="id" column="idSemspace" type="java.lang.Integer" unsaved-value="null"> <generator class="identity"/> </id> <property name="name" column="name" type="java.lang.String" not-null="true" unique="true" /> <many-to-one name="corpus" class="xxx.entities.Corpus" column="idCorpus" insert="false" update="false" /> [...] </class> <class name="xxx.entities.Corpus" table="Corpus" lazy="false" batch-size="30"> <id name="id" column="idCorpus" type="java.lang.Integer" unsaved-value="null"> <generator class="identity"/> </id> <property name="name" column="name" type="java.lang.String" not-null="true" unique="true" /> </class> And the Java code generating the exception is : Corpus corpus = Spring.getCorpusDAO().getCorpusById(corpusId); Semspace semspace = new Semspace(); semspace.setCorpus(corpus); semspace.setName(name); Spring.getSemspaceDAO().save(semspace); I checked and the corpus variable is not null (so it is in database as retrieved with the DAO) The full exception is : com.mysql.jdbc.exceptions.MySQLIntegrityConstraintViolationException: Cannot add or update a child row: a foreign key constraint fails (`xxx/Semspace`, CONSTRAINT `FK4D6019AB6556109` FOREIGN KEY (`idCorpus`) REFERENCES `Corpus` (`idCorpus`)) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:931) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2941) at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1623) at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:1715) at com.mysql.jdbc.Connection.execSQL(Connection.java:3249) at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1268) at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1541) at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1455) at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1440) at org.apache.commons.dbcp.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:102) at org.hibernate.id.IdentityGenerator$GetGeneratedKeysDelegate.executeAndExtract(IdentityGenerator.java:73) at org.hibernate.id.insert.AbstractReturningDelegate.performInsert(AbstractReturningDelegate.java:33) at org.hibernate.persister.entity.AbstractEntityPersister.insert(AbstractEntityPersister.java:2158) at org.hibernate.persister.entity.AbstractEntityPersister.insert(AbstractEntityPersister.java:2638) at org.hibernate.action.EntityIdentityInsertAction.execute(EntityIdentityInsertAction.java:48) at org.hibernate.engine.ActionQueue.execute(ActionQueue.java:250) at org.hibernate.event.def.AbstractSaveEventListener.performSaveOrReplicate(AbstractSaveEventListener.java:298) at org.hibernate.event.def.AbstractSaveEventListener.performSave(AbstractSaveEventListener.java:181) at org.hibernate.event.def.AbstractSaveEventListener.saveWithGeneratedId(AbstractSaveEventListener.java:107) at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.saveWithGeneratedOrRequestedId(DefaultSaveOrUpdateEventListener.java:187) at org.hibernate.event.def.DefaultSaveEventListener.saveWithGeneratedOrRequestedId(DefaultSaveEventListener.java:33) at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.entityIsTransient(DefaultSaveOrUpdateEventListener.java:172) at org.hibernate.event.def.DefaultSaveEventListener.performSaveOrUpdate(DefaultSaveEventListener.java:27) at org.hibernate.event.def.DefaultSaveOrUpdateEventListener.onSaveOrUpdate(DefaultSaveOrUpdateEventListener.java:70) at org.hibernate.impl.SessionImpl.fireSave(SessionImpl.java:535) at org.hibernate.impl.SessionImpl.save(SessionImpl.java:523) at org.hibernate.impl.SessionImpl.save(SessionImpl.java:519) at org.springframework.orm.hibernate3.HibernateTemplate$12.doInHibernate(HibernateTemplate.java:642) at org.springframework.orm.hibernate3.HibernateTemplate.execute(HibernateTemplate.java:373) at org.springframework.orm.hibernate3.HibernateTemplate.save(HibernateTemplate.java:639) at xxx.dao.impl.AbstractDAO.save(AbstractDAO.java:26) at org.apache.jsp.functions.semspaceManagement_jsp._jspService(semspaceManagement_jsp.java:218) [...] The foreign key constraint has been created (and added to database) by hibernate and I don't see where the constraint can be violated. The table are innodb and I tried to drop all tables and recreate it the problem remains... EDIT : Well I think I have a start of answer... I change the log level of hibernate to DEBUG and before it crash I have the following log insert into Semspace (name, [...]) values (?, [...]) So it looks like it does not try to insert the idCorpus and as it is not null it uses the default value "0" which does not refers to an existing entry in Corpus table...

    Read the article

  • algorithm q: Fuzzy matching of structured data

    - by user86432
    I have a fairly small corpus of structured records sitting in a database. Given a tiny fraction of the information contained in a single record, submitted via a web form (so structured in the same way as the table schema), (let us call it the test record) I need to quickly draw up a list of the records that are the most likely matches for the test record, as well as provide a confidence estimate of how closely the search terms match a record. The primary purpose of this search is to discover whether someone is attempting to input a record that is duplicate to one in the corpus. There is a reasonable chance that the test record will be a dupe, and a reasonable chance the test record will not be a dupe. The records are about 12000 bytes wide and the total count of records is about 150,000. There are 110 columns in the table schema and 95% of searches will be on the top 5% most commonly searched columns. The data is stuff like names, addresses, telephone numbers, and other industry specific numbers. In both the corpus and the test record it is entered by hand and is semistructured within an individual field. You might at first blush say "weight the columns by hand and match word tokens within them", but it's not so easy. I thought so too: if I get a telephone number I thought that would indicate a perfect match. The problem is that there isn't a single field in the form whose token frequency does not vary by orders of magnitude. A telephone number might appear 100 times in the corpus or 1 time in the corpus. The same goes for any other field. This makes weighting at the field level impractical. I need a more fine-grained approach to get decent matching. My initial plan was to create a hash of hashes, top level being the fieldname. Then I would select all of the information from the corpus for a given field, attempt to clean up the data contained in it, and tokenize the sanitized data, hashing the tokens at the second level, with the tokens as keys and frequency as value. I would use the frequency count as a weight: the higher the frequency of a token in the reference corpus, the less weight I attach to that token if it is found in the test record. My first question is for the statisticians in the room: how would I use the frequency as a weight? Is there a precise mathematical relationship between n, the number of records, f(t), the frequency with which a token t appeared in the corpus, the probability o that a record is an original and not a duplicate, and the probability p that the test record is really a record x given the test and x contain the same t in the same field? How about the relationship for multiple token matches across multiple fields? Since I sincerely doubt that there is, is there anything that gets me close but is better than a completely arbitrary hack full of magic factors? Barring that, has anyone got a way to do this? I'm especially keen on other suggestions that do not involve maintaining another table in the database, such as a token frequency lookup table :). This is my first post on StackOverflow, thanks in advance for any replies you may see fit to give.

    Read the article

  • Additional spaces in String having read text file to String using FileInputStream

    - by David
    Hi, I'm trying to read in a text file to a String variable. The text file has multiple lines. Having printed the String to test the "read-in" code, there is an additional space between every character. As I am using the String to generate character bigrams, the spaces are making the sample text useless. The code is try{ FileInputStream fstream = new FileInputStream(textfile); DataInputStream in = new DataInputStream(fstream); BufferedReader br = new BufferedReader(new InputStreamReader(in)); //Read corpus file line-by-line, concatenating each line to the String "corpus" while ((strLine = br.readLine()) != null) { corpus = (corpus.concat(strLine)); } in.close(); //Close the input stream } catch (Exception e) {//Catch exception if any System.err.println("Error test check: " + e.getMessage()); } I'd be grateful for any advice. Thanks.

    Read the article

  • Compose synthetic English phrase that would contain 160 bits of recoverable information

    - by Alexander Gladysh
    I have 160 bits of random data. Just for fun, I want to generate pseudo-English phrase to "store" this information in. I want to be able to recover this information from the phrase. Note: This is not a security question, I don't care if someone else will be able to recover the information or even detect that it is there or not. Criteria for better phrases, from most important to the least: Short Unique Natural-looking The current approach, suggested here: Take three lists of 1024 nouns, verbs and adjectives each (picking most popular ones). Generate a phrase by the following pattern, reading 20 bits for each word: Noun verb adjective verb, Noun verb adjective verb, Noun verb adjective verb, Noun verb adjective verb. Now, this seems to be a good approach, but the phrase is a bit too long and a bit too dull. I have found a corpus of words here (Part of Speech Database). After some ad-hoc filtering, I calculated that this corpus contains, approximately 50690 usable adjectives 123585 nouns 15301 verbs This allows me to use up to 16 bits per adjective (actually 16.9, but I can't figure how to use fractional bits) 15 bits per noun 13 bits per verb For noun-verb-adjective-verb pattern this gives 57 bits per "sentence" in phrase. This means that, if I'll use all words I can get from this corpus, I can generate three sentences instead of four (160 / 57 ˜ 2.8). Noun verb adjective verb, Noun verb adjective verb, Noun verb adjective verb. Still a bit too long and dull. Any hints how can I improve it? What I see that I can try: Try to compress my data somehow before encoding. But since the data is completely random, only some phrases would be shorter (and, I guess, not by much). Improve phrase pattern, so it would look better. Use several patterns, using the first word in phrase to somehow indicate for future decoding which pattern was used. (For example, use the last letter or even the length of the word.) Pick pattern according to the first bytes of the data. ...I'm not that good with English to come up with better phrase patterns. Any suggestions? Use more linguistics in the pattern. Different tenses etc. ...I guess, I would need much better word corpus than I have now for that. Any hints where can I get a suitable one?

    Read the article

  • Avoid an "out of memory error" in Java(eclipse), when using large data structure?

    - by gnomed
    OK, so I am writing a program that unfortunately needs to use a huge data structure to complete its work, but it is failing with a "out of memory error" during its initialization. While I understand entirely what that means and why it is a problem, I am having trouble overcoming it, since my program needs to use this large structure and I don't know any other way to store it. The program first indexes a large corpus of text files that I provide. This works fine. Then it uses this index to initialize a large 2D array. This array will have nXn entries, where "n" is the number of unique words in the corpus of text. For the relatively small chunk I am testing it on(about 60 files) it needs to make approximately 30,000x30,000 entries. this will probably be bigger once I run it on my full intended corpus too. It consistently fails every time, after it indexes, while it is initializing the data structure(to be worked on later). Things I have done include: revamp my code to use a primitive "int[]" instead of a "TreeMap" eliminate redundant structures, etc... Also, I have run eclipse with "eclipse -vmargs -Xmx2g" to max out my allocated memory I am fairly confident this is not going to be a simple line of code solution, but is most likely going to require a very new approach. I am looking for what that approach is, any ideas? Thanks, B.

    Read the article

  • NLP - Queries using semantic wildcards in full text searching, maybe with Lucene?

    - by Zsolt
    Let's say I have a big corpus (for example in english or an arbitrary language), and I want to perform some semantic search on it. For example I have the query: "Be careful: [art] armada of [sg] is coming to [do sg]!" And the corpus contains the following sentence: "Be careful: an armada of alien ships is coming to destroy our planet!" It can be seen that my query string could contain "semantic placeholders", such as: [art] - some placeholder for articles (for example a / an in English) [sg], [do sg] - some placeholders for NPs and VPs (subjects and predicates) I would like to develop a library which would be capable to handle these queries efficiently. I suspect that some kind of POS-tagging would be necessary for parsing the text, but because I don't want to fully reimplement an already existing full-text search engine to make it work, I'm considering that how could I integrate this behaviour into a search engine like Lucene? I know there are SpanQueries which could behave similarly in some cases, but as I can see, Lucene doesn't do any semantic stuff with stored texts. It is possible to implement a behavior like this? Or do I have to write an own search engine?

    Read the article

  • How does it hurt to use Linux (Ubuntu) as a guest OS for all my tasks?

    - by sauparna
    I have a machine running Windows, where the disk has two partitions C (50 GB) and D (250GB). I do research in Information Retrieval and need to work with a large corpus (more than 50 GB) and in Linux. So if I want to install Linux on the existing system, keeping the Windows installation intact, will it be fine to run it in a virtual box? (say, QEMU, VMWare, etc.) An alternative is using Wubi. In that case the Linux installation has to be on drive C. Then, if I keep a small Linux installation (say 5GB) on C, and my corpus on D (mounted in Linux), how will it affect the performance of my programs which would be accessing the mounted Windows drive D. Is it feasible to use Linux this way? Which of the above is better if at all they are a way out? Note : Since my post in July 2010, I have been using and have tried several ways of maintaining a disk-image that I can mount in Linux. I had a 100GB qcow2 disk and a 100GB raw disk, both formatted to an EXT3 file system. I was mounting and connecting to the qcow2 disk using qemu-nbd. The problem was that every now and then, the connection to the disk would get lost and the running programs would throw disk I/O errors. The raw disk would mount and work fine as a loop mounted device, but when writing data to it, the mount.ntfs program would hog the CPU and the process would take an enormous amount of time. I was in fact running make on a piece of software located on this raw disk, and after a point of time make was waiting while mount.ntfs would show 100% CPU usage.

    Read the article

  • How can I explain to dspam that the user "brandon" is the same as "brandon@mydomain"

    - by Brandon Craig Rhodes
    I am using dspam for spam filtering by running the "dspamd" daemon under Ubuntu 9.10 and then setting up a Postfix rule that says: smtpd_recipient_restrictions = ... check_client_access pcre:/etc/postfix/dspam_everything ... where that PCRE map looks like this: /./ FILTER lmtp:[127.0.0.1]:11124 This works well, and means that all users on my system get all of their email, whether "dspam" thinks it is innocent or not, and have the option of filtering on its decisions or ignoring them. The problem comes when I want to train dspam using my email archives. After reading about the "dspam" command, I tried this on the files in my Inbox and spam boxes (which date from when I was using another filtering solution): for file in Mail/Inbox/*; do cat $file | dspam --class=innocent --source=corpus; done for file in Mail/spam/*; do cat $file | dspam --class=spam --source=corpus; done The symptom I noticed after doing all of this was that dspam was horrible at classifying spam — it couldn't find any! The problem, when I tracked it down, was that I was training the user "brandon" with the above commands, but the incoming email was instead compared against the username "brandon@mydomain", so it was running against a completely empty training database! So, what can I do to make the above commands actually train my fully-qualified email address rather than my bare username? I would like to avoid having to run "dspam" as root with a "--user" option. I would have expected that the "dspam" configuration files would have had an "append_domain" attribute or something with which to decorate local usernames with an appropriate email domain, but I can't find any such thing. When I used to use the Berkeley DB backend to "dspam", I solved this problem by creating a symlink from one of the databases to the other. :-) But that solution eventually died because the BDB backend is not thread-safe, so now I have moved to the PostgreSQL back-end and need a way to solve the problem there. And, no, the table where it keeps usernames has a UNIQUE constraint that prevents me from listing both usernames as mapping to the same ID. :-)

    Read the article

  • How does it hurt to use Linux (Ubuntu) as a guest OS for all my tasks?

    - by sauparna
    I have a machine running Windows, where the disk has two partitions C (50 GB) and D (250GB). I do research in Information Retrieval and need to work with a large corpus (more than 50 GB) and in Linux. So if I want to install Linux on the existing system, keeping the Windows installation intact, will it be fine to run it in a virtual box? (say, QEMU, VMWare, etc.) An alternative is using Wubi. In that case the Linux installation has to be on drive C. Then, if I keep a small Linux installation (say 5GB) on C, and my corpus on D (mounted in Linux), how will it affect the performance of my programs which would be accessing the mounted Windows drive D. Is it feasible to use Linux this way? Which of the above is better if at all they are a way out? Note : Since my post in July 2010, I have been using and have tried several ways of maintaining a disk-image that I can mount in Linux. I had a 100GB qcow2 disk and a 100GB raw disk, both formatted to an EXT3 file system. I was mounting and connecting to the qcow2 disk using qemu-nbd. The problem was that every now and then, the connection to the disk would get lost and the running programs would throw disk I/O errors. The raw disk would mount and work fine as a loop mounted device, but when writing data to it, the mount.ntfs program would hog the CPU and the process would take an enormous amount of time. I was in fact running make on a piece of software located on this raw disk, and after a point of time make was waiting while mount.ntfs would show 100% CPU usage.

    Read the article

  • Using a "white list" for extracting terms for Text Mining, Part 2

    - by [email protected]
    In my last post, we set the groundwork for extracting specific tokens from a white list using a CTXRULE index. In this post, we will populate a table with the extracted tokens and produce a case table suitable for clustering with Oracle Data Mining. Our corpus of documents will be stored in a database table that is defined as create table documents(id NUMBER, text VARCHAR2(4000)); However, any suitable Oracle Text-accepted data type can be used for the text. We then create a table to contain the extracted tokens. The id column contains the unique identifier (or case id) of the document. The token column contains the extracted token. Note that a given document many have many tokens, so there will be one row per token for a given document. create table extracted_tokens (id NUMBER, token VARCHAR2(4000)); The next step is to iterate over the documents and extract the matching tokens using the index and insert them into our token table. We use the MATCHES function for matching the query_string from my_thesaurus_rules with the text. DECLARE     cursor c2 is       select id, text       from documents; BEGIN     for r_c2 in c2 loop        insert into extracted_tokens          select r_c2.id id, main_term token          from my_thesaurus_rules          where matches(query_string,                        r_c2.text)>0;     end loop; END; Now that we have the tokens, we can compute the term frequency - inverse document frequency (TF-IDF) for each token of each document. create table extracted_tokens_tfidf as   with num_docs as (select count(distinct id) doc_cnt                     from extracted_tokens),        tf       as (select a.id, a.token,                            a.token_cnt/b.num_tokens token_freq                     from                        (select id, token, count(*) token_cnt                        from extracted_tokens                        group by id, token) a,                       (select id, count(*) num_tokens                        from extracted_tokens                        group by id) b                     where a.id=b.id),        doc_freq as (select token, count(*) overall_token_cnt                     from extracted_tokens                     group by token)   select tf.id, tf.token,          token_freq * ln(doc_cnt/df.overall_token_cnt) tf_idf   from num_docs,        tf,        doc_freq df   where df.token=tf.token; From the WITH clause, the num_docs query simply counts the number of documents in the corpus. The tf query computes the term (token) frequency by computing the number of times each token appears in a document and divides that by the number of tokens found in the document. The doc_req query counts the number of times the token appears overall in the corpus. In the SELECT clause, we compute the tf_idf. Next, we create the nested table required to produce one record per case, where a case corresponds to an individual document. Here, we COLLECT all the tokens for a given document into the nested column extracted_tokens_tfidf_1. CREATE TABLE extracted_tokens_tfidf_nt              NESTED TABLE extracted_tokens_tfidf_1                  STORE AS extracted_tokens_tfidf_tab AS              select id,                     cast(collect(DM_NESTED_NUMERICAL(token,tf_idf)) as DM_NESTED_NUMERICALS) extracted_tokens_tfidf_1              from extracted_tokens_tfidf              group by id;   To build the clustering model, we create a settings table and then insert the various settings. Most notable are the number of clusters (20), using cosine distance which is better for text, turning off auto data preparation since the values are ready for mining, the number of iterations (20) to get a better model, and the split criterion of size for clusters that are roughly balanced in number of cases assigned. CREATE TABLE km_settings (setting_name  VARCHAR2(30), setting_value VARCHAR2(30)); BEGIN  INSERT INTO km_settings (setting_name, setting_value) VALUES     VALUES (dbms_data_mining.clus_num_clusters, 20);  INSERT INTO km_settings (setting_name, setting_value)     VALUES (dbms_data_mining.kmns_distance, dbms_data_mining.kmns_cosine);   INSERT INTO km_settings (setting_name, setting_value) VALUES     VALUES (dbms_data_mining.prep_auto,dbms_data_mining.prep_auto_off);   INSERT INTO km_settings (setting_name, setting_value) VALUES     VALUES (dbms_data_mining.kmns_iterations,20);   INSERT INTO km_settings (setting_name, setting_value) VALUES     VALUES (dbms_data_mining.kmns_split_criterion,dbms_data_mining.kmns_size);   COMMIT; END; With this in place, we can now build the clustering model. BEGIN     DBMS_DATA_MINING.CREATE_MODEL(     model_name          => 'TEXT_CLUSTERING_MODEL',     mining_function     => dbms_data_mining.clustering,     data_table_name     => 'extracted_tokens_tfidf_nt',     case_id_column_name => 'id',     settings_table_name => 'km_settings'); END;To generate cluster names from this model, check out my earlier post on that topic.

    Read the article

  • Package to compare LSA, TFIDF, Cosine metrics and Language Models

    - by gouwsmeister
    Hi, I'm looking for a package (any language, really) that I can use on a corpus of 50 documents to perform interdocument similarity testing in various metrics, like tfidf, okapi, language models, lsa, etc. I want as a result a document similarity matrix, i.e. doc1 is x% similar to doc2, etc... This is for research purposes, not for production. I specifically want the doc similarity matrix as I want to correlate this with human ratings. Thank you in advance!

    Read the article

  • Latent Dirichlet Allocation, pitfalls, tips and programs

    - by Gregg Lind
    I'm experimenting with Latent Dirichlet Allocation for topic disambiguation and assignment, and I'm looking for advice. Which program is the "best", where best is some combination of easiest to use, best prior estimation, fast How do I incorporate my intuitions about topicality. Let's say I think I know that some items in the corpus are really in the same category, like all articles by the same author. Can I add that into the analysis? Any unexpected pitfalls or tips I should know before embarking? I'd prefer is there are R or Python front ends for whatever program, but I expect (and accept) that I'll be dealing with C.

    Read the article

  • Ngram IDF smoothing

    - by adi92
    I am trying to use IDF scores to find interesting phrases in my pretty huge corpus of documents. I basically need something like Amazon's Statistically Improbable Phrases, i.e. phrases that distinguish a document from all the others The problem that I am running into is that some (3,4)-grams in my data which have super-high idf actually consist of component unigrams and bigrams which have really low idf.. For example, "you've never tried" has a very high idf, while each of the component unigrams have very low idf.. I need to come up with a function that can take in document frequencies of an n-gram and all its component (n-k)-grams and return a more meaningful measure of how much this phrase will distinguish the parent document from the rest. If I were dealing with probabilities, I would try interpolation or backoff models.. I am not sure what assumptions/intuitions those models leverage to perform well, and so how well they would do for IDF scores. Anybody has any better ideas?

    Read the article

1 2  | Next Page >