understanding semcor corpus structure h

Posted by Sharmila on Stack Overflow
Published on 2011-01-03T10:27:00Z Indexed on 2011/01/04 2:53 UTC

I'm learning NLP and am currently experimenting with Word Sense Disambiguation. I'm planning to use the SemCor corpus as training data, but I have trouble understanding its XML structure. I tried googling but could not find any resource describing the content structure of SemCor. Here is a sample sentence from the corpus:

<s snum="1">
<wf cmd="ignore" pos="DT">The</wf>
<wf cmd="done" lemma="group" lexsn="1:03:00::" pn="group" pos="NNP" rdf="group" wnsn="1">Fulton_County_Grand_Jury</wf>
<wf cmd="done" lemma="say" lexsn="2:32:00::" pos="VB" wnsn="1">said</wf>
<wf cmd="done" lemma="friday" lexsn="1:28:00::" pos="NN" wnsn="1">Friday</wf>
<wf cmd="ignore" pos="DT">an</wf>
<wf cmd="done" lemma="investigation" lexsn="1:09:00::" pos="NN" wnsn="1">investigation</wf>
<wf cmd="ignore" pos="IN">of</wf>
<wf cmd="done" lemma="atlanta" lexsn="1:15:00::" pos="NN" wnsn="1">Atlanta</wf>
<wf cmd="ignore" pos="POS">'s</wf>
<wf cmd="done" lemma="recent" lexsn="5:00:00:past:00" pos="JJ" wnsn="2">recent</wf>
<wf cmd="done" lemma="primary_election" lexsn="1:04:00::" pos="NN" wnsn="1">primary_election</wf>
<wf cmd="done" lemma="produce" lexsn="2:39:01::" pos="VB" wnsn="4">produced</wf>
<punc>``</punc>
<wf cmd="ignore" pos="DT">no</wf>
<wf cmd="done" lemma="evidence" lexsn="1:09:00::" pos="NN" wnsn="1">evidence</wf>
<punc>''</punc>
<wf cmd="ignore" pos="IN">that</wf>
<wf cmd="ignore" pos="DT">any</wf>
<wf cmd="done" lemma="irregularity" lexsn="1:04:00::" pos="NN" wnsn="1">irregularities</wf>
<wf cmd="done" lemma="take_place" lexsn="2:30:00::" pos="VB" wnsn="1">took_place</wf>
<punc>.</punc>
</s>
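
To make the structure easier to inspect, here is a minimal sketch that reads a shortened copy of the sentence above with Python's standard xml.etree.ElementTree module. It assumes the quoted-attribute XML form shown in the sample; the names read_sentence and attribute_names are illustrative helpers, not part of any SemCor tooling.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the sample sentence (same attributes, fewer tokens).
SAMPLE = """<s snum="1">
<wf cmd="ignore" pos="DT">The</wf>
<wf cmd="done" lemma="group" lexsn="1:03:00::" pn="group" pos="NNP" rdf="group" wnsn="1">Fulton_County_Grand_Jury</wf>
<wf cmd="done" lemma="say" lexsn="2:32:00::" pos="VB" wnsn="1">said</wf>
<punc>.</punc>
</s>"""


def read_sentence(xml_text):
    """Return one dict per child element (<wf> or <punc>): tag, text, attributes."""
    root = ET.fromstring(xml_text)
    rows = []
    for elem in root:
        row = {"tag": elem.tag, "text": elem.text}
        row.update(elem.attrib)  # cmd, pos, lemma, lexsn, wnsn, ...
        rows.append(row)
    return rows


def attribute_names(rows):
    """Collect every attribute name that actually occurs in the parsed rows."""
    names = set()
    for row in rows:
        names.update(key for key in row if key not in ("tag", "text"))
    return sorted(names)


rows = read_sentence(SAMPLE)
for row in rows:
    # Tokens with cmd="ignore" carry no lemma/lexsn/wnsn, so .get() yields None.
    print(row["tag"], row["text"], row.get("lemma"), row.get("lexsn"), row.get("wnsn"))
print(attribute_names(rows))
```

Running attribute_names over a whole file, rather than one sentence, is one empirical way to list the attributes that occur in practice.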

I'm assuming wnsn is the WordNet sense number ('word sense'). Is that correct?
What does the attribute lexsn mean? How does it map to WordNet?
What does the attribute pn refer to? (third line of the sample, the Fulton_County_Grand_Jury token)
How is the rdf attribute assigned? (again on the third line)
In general, what are the possible attributes?
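
On the lexsn question, one reading worth checking: WordNet's sense index identifies each sense by a sense key of the form lemma%ss_type:lex_filenum:lex_id:head_word:head_id, and the lexsn values in the sample look like the part after the percent sign. The sketch below assumes that reading; sense_key and split_lexsn are hypothetical helper names used only for illustration.

```python
# Field names follow the WordNet sense-key layout; that lexsn holds exactly
# these five fields is an assumption based on the sample values.
FIELDS = ("ss_type", "lex_filenum", "lex_id", "head_word", "head_id")


def sense_key(lemma, lexsn):
    """Join a SemCor lemma and lexsn into a WordNet-style sense key."""
    return f"{lemma}%{lexsn}"


def split_lexsn(lexsn):
    """Split a lexsn value into its five colon-separated fields."""
    return dict(zip(FIELDS, lexsn.split(":")))


print(sense_key("group", "1:03:00::"))
# head_word and head_id are only filled for adjective satellites (ss_type 5),
# as in the "recent" token of the sample.
print(split_lexsn("5:00:00:past:00"))
```

If this reading holds, a key built this way can be looked up in WordNet, for example with NLTK's wordnet.lemma_from_key, provided the WordNet version matches the one the corpus was tagged against.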

nlp

linguistics

corpus

computational-linguistics
