Search Results

Search found 234 results on 10 pages for 'stanford nlp'.

How do you parse a paragraph of text into sentences? (preferably in Ruby)

- by henry74

How do you take a paragraph or a large amount of text and break it into sentences (preferably using Ruby), taking into account cases such as Mr., Dr. and U.S.A.? (Assume you just put the sentences into an array of arrays.)

UPDATE: One possible solution I thought of involves using a part-of-speech tagger (POST) and a classifier to determine the end of a sentence.

Getting data from:

Mr. Jones felt the warm sun on his face as he stepped out onto the balcony of his summer home in Italy. He was happy to be alive.

CLASSIFIER:

Mr./PERSON Jones/PERSON felt/O the/O warm/O sun/O on/O his/O face/O as/O he/O stepped/O out/O onto/O the/O balcony/O of/O his/O summer/O home/O in/O Italy/LOCATION ./O He/O was/O happy/O to/O be/O alive/O ./O

POST:

Mr./NNP Jones/NNP felt/VBD the/DT warm/JJ sun/NN on/IN his/PRP$ face/NN as/IN he/PRP stepped/VBD out/RP onto/IN the/DT balcony/NN of/IN his/PRP$ summer/NN home/NN in/IN Italy./NNP He/PRP was/VBD happy/JJ to/TO be/VB alive./IN

Can we assume, since Italy is a location, that the period is a valid end of the sentence? Since ending on "Mr." would leave no other parts of speech, can we assume this is not a valid end-of-sentence period? Is this the best answer to my question? Thoughts?
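As a baseline to compare the tagger idea against, here is a minimal rule-based sketch in Python (the question prefers Ruby; Python is used here only for illustration). The abbreviation list and the function name `split_sentences` are assumptions for this example, not part of any library.

```python
# Hypothetical abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "u.s.a."}

def split_sentences(text):
    """Split after a token ending in '.', '!' or '?' unless it is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that did not end with punctuation.
        sentences.append(" ".join(current))
    return sentences

text = ("Mr. Jones felt the warm sun on his face as he stepped out onto "
        "the balcony of his summer home in Italy. He was happy to be alive.")
print(split_sentences(text))
```

A known weakness of this sketch: a sentence that really does end in an abbreviation ("He moved to the U.S.A.") is not split, which is exactly the ambiguity the tagger-plus-classifier idea tries to resolve.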

Extracting pure content / text from HTML Pages by excluding navigation and chrome content

- by Ankur Gupta

Hi, I am crawling news websites and want to extract the news title, news abstract (first paragraph), etc. I plugged into the WebKit parser code to easily navigate a web page as a tree. To eliminate navigation and other non-news content, I take the text version of the article (minus the HTML tags; WebKit provides an API for this). Then I run a diff algorithm comparing the text of various articles from the same website, which eliminates the text they share. This gives me the content minus the common navigation content. Despite this approach I am still getting quite a lot of junk in my final text, which results in an incorrect news abstract being extracted. The error rate is 5 in 10 articles, i.e. 50%. Can you suggest an alternative strategy for extracting pure content? Would learning Natural Language Processing help in extracting the correct abstract from these articles? How would you approach this problem? Are there any research papers on it? Regards, Ankur Gupta
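The "remove what the articles share" step can be approximated without a full diff. The sketch below is a line-frequency stand-in for the diff described in the question, not the asker's actual code: it drops any line that appears in more than a given share of the articles from one site. The function name and the `max_share` threshold are assumptions.

```python
from collections import Counter

def strip_shared_lines(articles, max_share=0.5):
    """Drop lines that occur in more than max_share of the articles from one site."""
    line_counts = Counter()
    for text in articles:
        # Count each distinct non-empty line once per article.
        line_counts.update({line.strip() for line in text.splitlines() if line.strip()})
    limit = max_share * len(articles)
    cleaned = []
    for text in articles:
        kept = [line.strip() for line in text.splitlines()
                if line.strip() and line_counts[line.strip()] <= limit]
        cleaned.append("\n".join(kept))
    return cleaned

pages = [
    "Home | News\nStory one text.\nCopyright",
    "Home | News\nStory two text.\nCopyright",
    "Home | News\nStory three.\nCopyright",
]
print(strip_shared_lines(pages))
```

Junk that varies from page to page (related-story teasers, timestamps, comment counts) survives this kind of filter, which may be one source of the remaining errors.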

Another Porter stemming algorithm implementation question

- by mike

Hi, I am trying to implement the Porter stemming algorithm, but I am having difficulty understanding this point:

Step 1c

    (*v*) Y -> I    happy -> happi
                    sky   -> sky

Isn't that the opposite of what we want to do? Why does the algorithm convert the Y into I? The complete algorithm is here: http://tartarus.org/~martin/PorterStemmer/def.txt Thanks
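Step 1c on its own can be sketched as below. The rule fires only when the stem before the final y contains a vowel, using Porter's definition of a consonant (y counts as a vowel when it follows a consonant). The function names are chosen for this example. The point of the rewrite is that a stem does not have to be a real word: it only has to be the same for related forms, so that "happy" can end up with the same stem as a form like "happiness".

```python
def is_consonant(word, i):
    """Porter's definition: not a, e, i, o, u, and not a y that follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of the word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def step_1c(word):
    """(*v*) Y -> I: replace a final y with i when the preceding stem contains a vowel."""
    if word.endswith("y"):
        stem = word[:-1]
        if any(not is_consonant(stem, i) for i in range(len(stem))):
            return stem + "i"
    return word

print(step_1c("happy"))  # happi
print(step_1c("sky"))    # sky
```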

Open Source Library for Linguistic Inquiry and Word Count (LIWC)

- by zfranciscus

Hi, I am looking for an open source library for Linguistic Inquiry and Word Count (LIWC). Something in Java or Python would be good, though I am open to using other languages. Does anyone know where I can get one? Cheers,

How to determine the (natural) language of a document?

- by Robert Petermeier

I have a set of documents in two languages: English and German. There is no usable meta information about these documents; a program can look at the content only. Based on that, the program has to decide which of the two languages the document is written in. Is there any "standard" algorithm for this problem that can be implemented in a few hours' time? Or alternatively, a free .NET library or toolkit that can do this? I know about LingPipe, but it is

- Java
- Not free for "semi-commercial" usage

This problem seems to be surprisingly hard. I checked out the Google AJAX Language API (which I found by searching this site first), but it was ridiculously bad. For six web pages in German to which I pointed it, only one guess was correct. The other guesses were Swedish, English, Danish and French...

A simple approach I came up with is to use a list of stop words. My app already uses such a list for German documents in order to analyze them with Lucene.Net. If my app scans the documents for occurrences of stop words from either language, the one with more occurrences would win. A very naive approach, to be sure, but it might be good enough. Unfortunately I don't have the time to become an expert at natural-language processing, although it is an intriguing topic.
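The stop-word idea can be sketched in a few lines. Python is used here for illustration although the question targets .NET. The two word lists are tiny placeholders chosen for this example; real lists, such as the one already used with Lucene.Net, would be much longer.

```python
import re

# Tiny illustrative stop-word lists (assumptions for this sketch).
STOP_WORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
}

def guess_language(text):
    """Return the language whose stop words occur most often in the text."""
    words = re.findall(r"[a-zäöüß]+", text.lower())
    scores = {lang: sum(w in stops for w in words)
              for lang, stops in STOP_WORDS.items()}
    # On a tie, max() returns the first language in STOP_WORDS.
    return max(scores, key=scores.get)

print(guess_language("Der Hund ist nicht in dem Haus und das ist gut."))
print(guess_language("The cat is in the house and it is fine."))
```

Very short documents, or documents that quote the other language heavily, are where this count is most likely to go wrong.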

Latent Dirichlet Allocation, pitfalls, tips and programs

- by Gregg Lind

I'm experimenting with Latent Dirichlet Allocation for topic disambiguation and assignment, and I'm looking for advice.

- Which program is the "best", where best is some combination of easiest to use, best prior estimation, and fast?
- How do I incorporate my intuitions about topicality? Let's say I think I know that some items in the corpus are really in the same category, like all articles by the same author. Can I add that into the analysis?
- Any unexpected pitfalls or tips I should know before embarking?

I'd prefer it if there are R or Python front ends for whatever program, but I expect (and accept) that I'll be dealing with C.

How to make concept representation with the help of bag of words

- by agazerboy

Hi all, thanks for stopping to read my question :) This is a very sweet place full of GREAT people! I have a question about "creating sentences with words". No, it is not about English grammar :) Let me explain. If I have a bag of words like "person apple apple person person a eat person will apple eat hungry apple hungry", it can generate a sentence something like "hungry person eat apple". I don't know which field this topic relates to or where I should look for an answer. I tried searching Google but only found English grammar stuff :) Can anybody tell me which algorithm, or any program, can work on this problem? Thanks. P.S.: It is not an assignment :) If it were, I would ask for source code!

Determining whether values can potentially match a regular expression, given more input

- by Andreas Grech

I am currently writing an application in JavaScript where I'm matching input to regular expressions, but I also need to find a way to match strings to parts of the regular expressions. For example:

    var invalid = "x",
        potentially = "g",
        valid = "ggg",
        gReg = /^ggg$/;

    gReg.test(invalid); //returns false (correct)
    gReg.test(valid);   //returns true (correct)

Now I need to find a way to somehow determine that the value of the potentially variable doesn't exactly match the /^ggg$/ expression, BUT with more input, it potentially can! So in this case, the potentially variable is g, but if two more g's are appended to it, it will match the regular expression /^ggg$/. In the case of invalid, though, it can never match the /^ggg$/ expression, no matter how many characters you append to it. So how can I determine whether a string has or doesn't have the potential to match a particular regular expression?

Naive Bayesian for Topic detection using "Bag of Words" approach

- by AlgoMan

I am trying to implement a naive Bayesian approach to find the topic of a given document or stream of words. Is there a Naive Bayesian approach that I might be able to look up for this? Also, I am trying to improve my dictionary as I go along. Initially, I have a bunch of words that map to topics (hard-coded). Depending on the occurrences of words other than the ones already mapped, I want to add them to the mappings, thereby learning new words that map to a topic, and also update the probabilities of words. How should I go about doing this? Is my approach the right one? Which programming language would be best suited for the implementation?
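A minimal sketch of a multinomial Naive Bayes classifier whose counts can be updated as new documents arrive. The class name, method names and the add-one smoothing are choices made for this example, not something the question specifies. The hard-coded seed words would be fed in through `learn`, and newly seen words are added to the counts in the same way.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTopics:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word -> count
        self.doc_counts = Counter()              # topic -> number of documents seen
        self.vocabulary = set()

    def learn(self, words, topic):
        """Add one labelled document; this can be called at any time to update the model."""
        self.word_counts[topic].update(words)
        self.doc_counts[topic] += 1
        self.vocabulary.update(words)

    def classify(self, words):
        """Return the topic with the highest log posterior, or None if nothing was learned."""
        total_docs = sum(self.doc_counts.values())
        best_topic, best_score = None, float("-inf")
        for topic, n_docs in self.doc_counts.items():
            counts = self.word_counts[topic]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for w in words:
                # Add-one (Laplace) smoothing so unseen words do not zero out the topic.
                score += math.log((counts[w] + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic

nb = NaiveBayesTopics()
nb.learn(["goal", "match", "team"], "sports")
nb.learn(["election", "vote", "party"], "politics")
print(nb.classify(["team", "goal"]))
```

One caveat if the model is updated with its own predictions: any early mistake gets reinforced, so it may be safer to add a document only when the winning score is well ahead of the runner-up.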

Writing annotation schemas for Callisto

- by Ken Bloom

Does anybody know where I can find documentation on how to write annotation schemas for Callisto? I'm looking to write something a little more complicated than I can generate from a DTD -- that only gives me the ability to tag different kinds of text mentions. I'm looking to create a schema that represents a single type of relationship between five or six different kinds of textual mentions (and some of these types of mentions have attributes that I need to assign values to), and possibly having a second type of relationship between the first two instances of the first type of relationship. (Alternatively, does anybody know of any software that would be better for this kind of schema? I've been looking at WordFreak, but it's a little clumsy, and it doesn't support attributes on its textual mentions.)

Searching text for geonames

- by Vojtech R.

Hi, which part of the huge NLTK package must I study and use if I need to mark geonames in text?

How to extract common / significant phrases from a series of text entries

- by arronsky

I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase, and ideally, not enforcing word-for-word matching). My example is any review on Yelp.com, which shows 3 snippets from hundreds of reviews of a given restaurant, in the format: "Try the hamburger" (in 44 reviews), e.g., the "Review Highlights" section of this page: http://www.yelp.com/biz/sushi-gen-los-angeles/ I have NLTK installed and I've played around with it a bit, but am honestly overwhelmed by the options. This seems like a rather common problem and I haven't been able to find a straightforward solution by searching here. Thanks in advance for any help.

Detecting syllables in a word

- by user50705

I need to find a fairly efficient way to detect syllables in a word. E.g., invisible -> in-vi-sib-le

There are some syllabification rules that could be used:

V CV VC CVC CCV CCCV CVCC

where V is a vowel and C is a consonant. E.g., pronunciation (5 syllables: pro-nun-ci-a-tion; CV-CVC-CV-V-CVC)

I've tried a few methods, among which were using regex (which helps only if you want to count syllables), hard-coded rule definitions (a brute force approach which proves to be very inefficient) and finally a finite state automaton (which did not result in anything useful). The purpose of my application is to create a dictionary of all syllables in a given language. This dictionary will later be used for spell checking applications (using Bayesian classifiers) and text-to-speech synthesis. I would appreciate it if someone could give me tips on an alternative way to solve this problem besides my previous approaches. I work in Java, but any tip in C/C++, C#, Python, Perl... would work for me.
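To make the C/V notation concrete, the sketch below maps a word to its consonant/vowel pattern and counts runs of vowels as a rough syllable estimate. This is only the counting baseline the question already mentions, not a syllabifier. Treating y as a vowel and the function names are assumptions for this example.

```python
import re

VOWELS = "aeiouy"  # counting y as a vowel is a simplifying assumption

def cv_pattern(word):
    """Map each letter to V (vowel) or C (consonant)."""
    return "".join("V" if ch in VOWELS else "C" for ch in word.lower())

def count_vowel_groups(word):
    """Estimate the syllable count as the number of maximal vowel runs."""
    return len(re.findall(r"V+", cv_pattern(word)))

print(cv_pattern("invisible"))          # VCCVCVCCV
print(count_vowel_groups("invisible"))  # 4
```

The estimate undercounts whenever two adjacent vowels belong to different syllables: "pronunciation" yields 4 groups, not 5, because "i-a" falls into one vowel run. Deciding where the boundaries go inside consonant clusters is the part that needs language-specific rules or a pronunciation dictionary.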

Dependency parsing

- by C.

Hi, I particularly like the transduce feature offered by AGFL in their EP4IR: http://www.agfl.cs.ru.nl/EP4IR/english.html The download page is here: http://www.agfl.cs.ru.nl/download.html Is there any way I can make use of this in a C# program? Do I need to convert classes to C#? Thanks :)

Python/PyParsing: Difficulty with setResultsName

- by Rosarch

I think I'm making a mistake in how I call setResultsName():

    from pyparsing import *

    DEPT_CODE = Regex(r'[A-Z]{2,}').setResultsName("Dept Code")
    COURSE_NUMBER = Regex(r'[0-9]{4}').setResultsName("Course Number")
    COURSE_NUMBER.setParseAction(lambda s, l, toks : int(toks[0]))

    course = DEPT_CODE + COURSE_NUMBER
    course.setResultsName("course")

    statement = course

From IDLE:

    >>> from myparser import *
    >>> statement.parseString("CS 2110")
    (['CS', 2110], {'Dept Code': [('CS', 0)], 'Course Number': [(2110, 1)]})

The output I hope for:

    >>> from myparser import *
    >>> statement.parseString("CS 2110")
    (['CS', 2110], {'Course': ['CS', 2110], 'Dept Code': [('CS', 0)], 'Course Number': [(2110, 1)]})

Does setResultsName() only work for terminals?

Theory: "Lexical Encoding"

- by _ande_turner_

I am using the term "Lexical Encoding" for lack of a better one. A Word is arguably the fundamental unit of communication, as opposed to a Letter. Unicode tries to assign a numeric value to each Letter of all known Alphabets. What is a Letter to one language is a Glyph to another. Unicode 5.1 currently assigns more than 100,000 values to these Glyphs. Out of the approximately 180,000 Words being used in Modern English, it is said that with a vocabulary of about 2,000 Words you should be able to converse in general terms. A "Lexical Encoding" would encode each Word, not each Letter, and encapsulate them within a Sentence.

    // A simplified example of a "Lexical Encoding"
    String sentence = "How are you today?";
    int[] encoded = { 93, 22, 14, 330, QUERY };

In this example each Token in the String was encoded as an Integer. The Encoding Scheme here simply assigned an int value based on a generalised statistical ranking of word usage, and assigned a constant to the question mark. Ultimately, though, a Word has both a Spelling and a Meaning. Any "Lexical Encoding" would preserve the meaning and intent of the Sentence as a whole, and not be language specific. An English sentence would be encoded into "...language-neutral atomic elements of meaning ..." which could then be reconstituted into any language with a structured Syntactic Form and Grammatical Structure. What are other examples of "Lexical Encoding" techniques? If you are interested in where the word-usage statistics come from: http://www.wordcount.org
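The rank-based scheme in the example can be sketched as follows. The ranks here come from a toy word list rather than from real usage statistics, and `QUERY`, `build_ranks` and `encode` are names made up for this illustration. This covers only the spelling-to-integer step; the language-neutral "meaning" layer described above is not attempted.

```python
from collections import Counter

QUERY = -1  # hypothetical constant standing in for the question mark

def build_ranks(corpus_words):
    """Rank words by frequency: the most frequent word gets rank 1."""
    ordered = [word for word, _ in Counter(corpus_words).most_common()]
    return {word: rank for rank, word in enumerate(ordered, start=1)}

def encode(sentence, ranks):
    """Replace each word by its rank and a trailing '?' by QUERY."""
    tokens = sentence.lower().replace("?", " ?").split()
    # A word missing from ranks raises KeyError; a real scheme needs an escape code.
    return [QUERY if token == "?" else ranks[token] for token in tokens]

ranks = build_ranks("you are how you are you today".split())
print(encode("How are you today?", ranks))
```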

Java text classification problem

- by yox

Hello, I have a set of Book objects. The class Book is defined as follows:

    class Book {
        String title;
        ArrayList<String> taglist;
    }

where title is the title of the book, for example "Javascript for dummies", and taglist is a list of tags, for our example: Javascript, jquery, "web dev", ...

As I said, I have a set of books talking about different things: IT, BIOLOGY, HISTORY, ... Each book has a title and a set of tags describing it. I have to automatically classify those books into separate sets by topic, for example:

IT BOOKS:
- Java for dummies
- Javascript for dummies
- Learn flash in 30 days
- C++ programming

HISTORY BOOKS:
- World wars
- America in 1960
- Martin luther king's life

BIOLOGY BOOKS:
- ....

Do you guys know a classification algorithm/method to apply to this kind of problem? One solution is to use an external API to determine the category of the text, but the problem here is that the books are in different languages: French, Spanish, English, ...

Compose synthetic English phrase that would contain 160 bits of recoverable information

- by Alexander Gladysh

I have 160 bits of random data. Just for fun, I want to generate a pseudo-English phrase to "store" this information in. I want to be able to recover this information from the phrase. Note: This is not a security question; I don't care if someone else will be able to recover the information or even detect whether it is there.

Criteria for better phrases, from most important to least:

- Short
- Unique
- Natural-looking

The current approach, suggested here: Take three lists of 1024 nouns, verbs and adjectives each (picking the most popular ones). Generate a phrase by the following pattern, reading 20 bits for each word:

Noun verb adjective verb, Noun verb adjective verb, Noun verb adjective verb, Noun verb adjective verb.

Now, this seems to be a good approach, but the phrase is a bit too long and a bit too dull. I have found a corpus of words here (Part of Speech Database). After some ad-hoc filtering, I calculated that this corpus contains approximately

- 50690 usable adjectives
- 123585 nouns
- 15301 verbs

This allows me to use up to

- 16 bits per adjective (actually 16.9, but I can't figure out how to use fractional bits)
- 15 bits per noun
- 13 bits per verb

For the noun-verb-adjective-verb pattern this gives 57 bits per "sentence" in the phrase. This means that, if I use all the words I can get from this corpus, I can generate three sentences instead of four (160 / 57 ≈ 2.8).

Noun verb adjective verb, Noun verb adjective verb, Noun verb adjective verb.

Still a bit too long and dull. Any hints on how I can improve it? What I see that I can try:

- Try to compress my data somehow before encoding. But since the data is completely random, only some phrases would be shorter (and, I guess, not by much).
- Improve the phrase pattern, so it would look better.
- Use several patterns, using the first word in the phrase to somehow indicate for future decoding which pattern was used. (For example, use the last letter or even the length of the word.) Pick the pattern according to the first bytes of the data. ...I'm not good enough at English to come up with better phrase patterns. Any suggestions?
- Use more linguistics in the pattern. Different tenses etc. ...I guess I would need a much better word corpus than I have now for that. Any hints on where I can get a suitable one?
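The bit arithmetic, and one way around the "fractional bits" problem, can be sketched as below. The counts are the ones quoted in the question. The mixed-radix `encode`/`decode` pair is a suggestion made for this example, not the approach the question describes: it treats the 160 bits as one big integer and peels off one "digit" per word slot, so no capacity is lost to rounding each list down to a power of two.

```python
import math

# Word counts quoted in the question.
CORPUS_SIZES = {"adjective": 50690, "noun": 123585, "verb": 15301}
PATTERN = ["noun", "verb", "adjective", "verb"]

def whole_bits(n_words):
    """Largest b such that 2**b <= n_words."""
    return n_words.bit_length() - 1

bits_per_sentence = sum(whole_bits(CORPUS_SIZES[pos]) for pos in PATTERN)
sentences_needed = math.ceil(160 / bits_per_sentence)
print(bits_per_sentence, sentences_needed)  # 57 3

def encode(value, sizes):
    """Mixed-radix digits of value: one word index per slot, least significant first."""
    digits = []
    for size in sizes:
        value, digit = divmod(value, size)
        digits.append(digit)
    if value:
        raise ValueError("not enough word slots for this value")
    return digits

def decode(digits, sizes):
    """Inverse of encode: rebuild the integer from the word indices."""
    value = 0
    for digit, size in zip(reversed(digits), reversed(sizes)):
        value = value * size + digit
    return value

sizes = [CORPUS_SIZES[pos] for pos in PATTERN] * 3  # three sentences
secret = 2**160 - 1
assert decode(encode(secret, sizes), sizes) == secret
```

With these counts the whole-bit split comes out as 16 bits for the 123585-word list and 15 for the 50690-word list; the total of 57 matches the question. Mixed-radix encoding raises the capacity to roughly 60 bits per sentence, which still needs three sentences for 160 bits, so a shorter phrase would need a richer pattern or larger lists.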

Ngram IDF smoothing

- by adi92

I am trying to use IDF scores to find interesting phrases in my pretty huge corpus of documents. I basically need something like Amazon's Statistically Improbable Phrases, i.e. phrases that distinguish a document from all the others. The problem that I am running into is that some (3,4)-grams in my data which have super-high IDF actually consist of component unigrams and bigrams which have really low IDF. For example, "you've never tried" has a very high IDF, while each of the component unigrams has a very low IDF. I need to come up with a function that can take in the document frequencies of an n-gram and all its component (n-k)-grams and return a more meaningful measure of how much this phrase will distinguish the parent document from the rest. If I were dealing with probabilities, I would try interpolation or backoff models. I am not sure what assumptions/intuitions those models leverage to perform well, and so how well they would do for IDF scores. Does anybody have any better ideas?

Using Markov models to convert all caps to mixed case and related problems

- by hippietrail

I've been thinking about using Markov techniques to restore missing information to natural language text.

- Restore mixed case to text in all caps
- Restore accents / diacritics to languages which should have them but have been converted to plain ASCII
- Convert rough phonetic transcriptions back into native alphabets

That seems to be in order of least difficult to most difficult. Basically the problem is resolving ambiguities based on context. I can use Wiktionary as a dictionary and Wikipedia as a corpus, using n-grams and Markov chains to resolve the ambiguities. Am I on the right track? Are there already some services, libraries, or tools for this sort of thing?

Examples:

- GEORGE LOST HIS SIM CARD IN THE BUSH -> George lost his SIM card in the bush
- tantot il rit a gorge deployee -> tantôt il rit à gorge déployée
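A context-free baseline for the first task is sketched below: for each word, pick the casing seen most often in a training corpus. This is a unigram model, not yet a Markov chain; a Markov version would condition the choice on the previous word or words, which is what resolves ambiguous cases such as a word that is also a name. The toy corpus and function names are assumptions for this example.

```python
from collections import Counter, defaultdict

def train_casing(corpus_tokens):
    """Count how often each cased form appears for every lowercased word."""
    forms = defaultdict(Counter)
    for token in corpus_tokens:
        forms[token.lower()][token] += 1
    return forms

def restore_case(text, forms):
    """Replace each word with its most frequent cased form; unknown words are lowercased."""
    words = []
    for token in text.split():
        seen = forms.get(token.lower())
        words.append(seen.most_common(1)[0][0] if seen else token.lower())
    return " ".join(words)

corpus = "George bought a SIM card . George lost his keys in the bush".split()
forms = train_casing(corpus)
print(restore_case("GEORGE LOST HIS SIM CARD IN THE BUSH", forms))
```

The same lookup table shape works for the accent task if the key is the accent-stripped form instead of the lowercased form.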

PyParsing: What does Combine() do?

- by Rosarch

What is the difference between: foo = TOKEN1 + TOKEN2 and foo = Combine(TOKEN1 + TOKEN2) Thanks.

PyParsing: Not all tokens passed to setParseAction()

- by Rosarch

I'm parsing sentences like "CS 2110 or INFO 3300". I would like to output a format like:

    [[("CS", 2110)], [("INFO", 3300)]]

To do this, I thought I could use setParseAction(). However, the print statements in statementParse() suggest that only the last tokens are actually passed:

    >>> statement.parseString("CS 2110 or INFO 3300")
    Match [{Suppress:("or") Re:('[A-Z]{2,}') Re:('[0-9]{4}')}] at loc 7(1,8)
    string CS 2110 or INFO 3300
    loc: 7
    tokens: ['INFO', 3300]
    Matched [{Suppress:("or") Re:('[A-Z]{2,}') Re:('[0-9]{4}')}] -> ['INFO', 3300]
    (['CS', 2110, 'INFO', 3300], {'Course': [(2110, 1), (3300, 3)], 'DeptCode': [('CS', 0), ('INFO', 2)]})

I expected all the tokens to be passed, but it's only ['INFO', 3300]. Am I doing something wrong? Or is there another way that I can produce the desired output? Here is the pyparsing code:

    from pyparsing import *

    def statementParse(str, location, tokens):
        print "string %s" % str
        print "loc: %s " % location
        print "tokens: %s" % tokens

    DEPT_CODE = Regex(r'[A-Z]{2,}').setResultsName("DeptCode")
    COURSE_NUMBER = Regex(r'[0-9]{4}').setResultsName("CourseNumber")

    OR_CONJ = Suppress("or")

    COURSE_NUMBER.setParseAction(lambda s, l, toks : int(toks[0]))

    course = DEPT_CODE + COURSE_NUMBER.setResultsName("Course")

    statement = course + Optional(OR_CONJ + course).setParseAction(statementParse).setDebug()

Algorithm to match natural text in mail

- by snøreven

I need to separate natural, coherent text/sentences in emails from lists, signatures, greetings and so on before further processing. Example:

    Hi tom,

    last monday we did bla bla, lore Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua.

    list item 2
    list item 3
    list item 3

    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid x ea commodi consequat. Quis aute iure reprehenderit in voluptate velit

    regards, K.
    ---line-of-funny-characters-#######
    example inc.
    33 evil street, london
    mobile: 00 234534/234345

Ideally the algorithm would match only the bold parts. Is there any recommended approach, or are there even existing algorithms for this problem? Should I try approximate regular expressions or more statistical stuff based on the number of punctuation marks, length and so on?

Hierarchy of meaning

- by asldkncvas

I am looking for a method to build a hierarchy of words. Background: I am an "amateur" natural language processing enthusiast, and right now one of the problems that I am interested in is determining the hierarchy of word semantics from a group of words. For example, if I have a set which contains a "super" representation of the others, i.e. [cat, dog, monkey, animal, bird, ... ], I am interested in any technique which would allow me to extract the word 'animal', which is the most meaningful and accurate representation of the other words inside this set. Note: they are NOT the same in meaning. cat != dog != monkey != animal, BUT cat is a subset of animal and dog is a subset of animal. I know by now a lot of you will be telling me to use WordNet. Well, I will try to, but I am actually interested in a very domain-specific area to which WordNet doesn't apply because: 1) most words are not found in WordNet; 2) all the words are in another language; translation is possible but only to limited effect. Another example would be: [ noise reduction, focal length, flash, functionality, .. ], so functionality includes everything in this set. I have also tried crawling Wikipedia pages and applying some techniques based on tf-idf etc., but Wikipedia pages don't really do much either. Can someone possibly enlighten me as to what direction my research should go in? (I could use anything)

Are there any C# libraries for Named Entity Recognition?

- by Taz

I am looking for any free libraries for Named Entity Recognition in C# or any other .NET language.
