Extracting Demographic and Contact Information from Unstructured Text Files

Posted by jn29098 on Stack Overflow
Published on 2010-06-01T01:50:48Z Indexed on 2010/06/01 1:53 UTC

Filed under: text | text-extraction | information-extraction

I am looking to extract specific items from a large pool of unstructured documents. Each document is 1-5 pages of text, formatted in various ways by the user, but in most cases contains at least:

Name
Address (physical)
Email Address
Phone number
Website URL

I'm looking for a semantic parser that can attempt to extract these elements from the documents so that I can load that information into a relational database and work with these records as contacts.
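
To make the goal concrete, here is a minimal sketch of the pattern-based part of such a pipeline, assuming plain-text input and US-style phone numbers. It is a baseline only, not the semantic parser being asked about. The regular expressions, the function names (extract_contact, save_contact), the contacts table layout and the sample text are all hypothetical and would need tuning against real documents. Names and physical addresses are left out on purpose, since they rarely follow a fixed pattern and usually need something closer to named-entity recognition or a dedicated address parser.

```python
import re
import sqlite3

# Illustrative patterns only (assumptions, not a complete grammar for any field).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumes 10-digit US-style numbers such as (555) 123-4567 or 555-123-4567.
PHONE_RE = re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}[\s.-]?)\d{3}[\s.-]?\d{4}\b")
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)


def first_match(pattern, text):
    """Return the first substring matching pattern, or None if there is none."""
    match = pattern.search(text)
    return match.group(0) if match else None


def extract_contact(text):
    """Pull the first email, phone number and URL found in the text."""
    url = first_match(URL_RE, text)
    return {
        "email": first_match(EMAIL_RE, text),
        "phone": first_match(PHONE_RE, text),
        # Drop sentence punctuation that may trail a URL in running text.
        "url": url.rstrip(".,;)") if url else None,
    }


def save_contact(conn, contact):
    """Insert one extracted record into a (hypothetical) contacts table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contacts (email TEXT, phone TEXT, url TEXT)"
    )
    conn.execute(
        "INSERT INTO contacts (email, phone, url) VALUES (:email, :phone, :url)",
        contact,
    )
    conn.commit()


# Made-up sample document for illustration.
sample = """Jane Doe
123 Main St, Springfield
Email: jane.doe@example.com
Tel: (555) 123-4567
Web: http://www.example.com/about
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    save_contact(conn, extract_contact(sample))
    print(conn.execute("SELECT email, phone, url FROM contacts").fetchall())
```

Fields that are missing from a document come back as None and are stored as NULL, which makes incomplete records easy to find and review by hand later.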

Other services I've looked at, while valuable for other purposes, do not address this specific need.

Any thoughts, suggestions or leads?

© Stack Overflow or respective owner
