Text mining on large database (data mining)
Posted by yox on Stack Overflow
        
        
        
        
Published on 2010-04-13T22:16:15Z
Indexed on 2010/04/13 22:23 UTC
        
        
        
Hello,
I have a large database of resumes (CVs), with a table skills that groups all users' skills.
Inside that table there is a field skill_text that describes each skill in free text.
I'm looking for an algorithm/software/method to extract significant terms/phrases from that table in order to build a new table of standardized skills.
Here are some example skills extracted from the DB:
- Sectoral and competitive analysis
 - Business Development (incl. in international settings)
 - Specific structure and road design software - Microstation, Macao, AutoCAD (basic knowledge)
 - Creative work (Photoshop, In-Design, Illustrator)
 - checking and reporting back on campaign progress
 - organising and attending events and exhibitions
 - Development : Aptana Studio, PHP, HTML, CSS, JavaScript, SQL, AJAX
 - Discipline: One to one marketing, E-marketing (SEO & SEA, display, emailing, affiliate program) Mix marketing, Viral Marketing, Social network marketing.
 
The output should be something like:
- Sectoral and competitive analysis
 - Business Development
 - Specific structure and road design software -
 - Macao
 - AutoCAD
 - Photoshop
 - In-Design
 - Illustrator
 - organising events
 - Development
 - Aptana Studio
 - PHP
 - HTML
 - CSS
 - JavaScript
 - SQL
 - AJAX
 - Mix marketing
 - Viral Marketing
 - Social network marketing
 - emailing
 - SEO
 - One to one marketing
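
To make the input-to-output mapping above concrete, here is a minimal Python sketch of the first step: splitting one skill_text value into rough candidate phrases. The function name split_candidates and the choice of separators (parentheses, commas, colons, semicolons, and a dash surrounded by spaces) are assumptions for illustration, not an established method; leftovers such as "basic knowledge" still come through and would need a later filtering step.

```python
import re


def split_candidates(skill_text):
    """Split one free-text skill entry into rough candidate phrases.

    Hypothetical helper: the separator set below is an assumption
    based on the example entries, not a general rule.
    """
    # Parentheses, commas, colons, semicolons and " - " act as separators.
    # A hyphen without surrounding spaces (e.g. "In-Design") is kept.
    parts = re.split(r"[(),:;]|\s-\s", skill_text)
    # Trim whitespace and stray periods, then drop empty pieces.
    cleaned = (p.strip(" .") for p in parts)
    return [p for p in cleaned if p]


if __name__ == "__main__":
    print(split_candidates("Creative work (Photoshop, In-Design, Illustrator)"))
```

Splitting on punctuation alone cannot shorten "organising and attending events and exhibitions" to "organising events"; that kind of normalization would need something beyond this sketch.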
 
As you can see, only the skills remain, with none of the surrounding descriptive text.
I know this is possible using text mining techniques, but how do I do it? The database is really large, which is a good thing because we can calculate term frequencies and decide whether a phrase is a real skill or just meaningless text. The big problem is: how do I determine that "blablabla" is a skill?
Thanks
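
The frequency idea can be sketched roughly as follows, assuming each skill_text entry has already been broken into a list of candidate phrases by some earlier step. The function name frequent_candidates and the min_count threshold are hypothetical; a sensible threshold would have to be tuned on the real data.

```python
from collections import Counter


def frequent_candidates(candidate_lists, min_count=2):
    """Keep candidate phrases that occur in at least min_count entries.

    candidate_lists: one list of phrases per skill entry (assumed input).
    min_count: hypothetical threshold, to be tuned on real data.
    """
    counts = Counter()
    for candidates in candidate_lists:
        # Count each phrase once per entry, ignoring case.
        counts.update({c.lower() for c in candidates})
    return {term: n for term, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    entries = [["PHP", "HTML"], ["php", "basic knowledge"], ["PHP"]]
    print(frequent_candidates(entries))
```

Frequency alone only removes rare noise. A common filler phrase such as "basic knowledge" may still pass the threshold, so this does not by itself answer the question of what counts as a skill.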
© Stack Overflow or respective owner