Extracting pure content / text from HTML Pages by excluding navigation and chrome content
        Posted by Ankur Gupta on Stack Overflow
        Published on 2009-11-08T15:42:04Z
        Indexed on 2010/05/22 23:41 UTC
        
Hi,
I am crawling news websites and want to extract the News Title, the News Abstract (first paragraph), etc.
I plugged into the WebKit parser code so that I can easily navigate a webpage as a tree. To eliminate navigation and other non-news content, I take the text version of the article (minus the HTML tags; WebKit provides an API for this). Then I run a diff algorithm that compares the text of various articles from the same website, so that text common to them is eliminated. This gives me the content minus the common navigation content, etc.
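A minimal sketch of the diff step described above, written in Python with the standard `difflib` module rather than the WebKit-based code. The function names `strip_shared_lines` and `extract_content` are hypothetical, and the sketch assumes the tag-free page text is already available with one block of text per line.

```python
import difflib


def strip_shared_lines(page_text, sibling_text):
    """Return the lines of page_text that a line diff does not match in sibling_text."""
    # Assumption: one text block per line; blank lines carry no content.
    page_lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    sibling_lines = [ln.strip() for ln in sibling_text.splitlines() if ln.strip()]
    matcher = difflib.SequenceMatcher(None, page_lines, sibling_lines, autojunk=False)
    unique = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "equal" runs are shared by both pages (menus, footers), so skip them.
        # "insert" runs exist only in the sibling page, so there is nothing to keep.
        if tag in ("replace", "delete"):
            unique.extend(page_lines[i1:i2])
    return unique


def extract_content(page_text, sibling_texts):
    """Keep only the lines that survive a diff against every sibling page."""
    remaining = page_text
    for sibling in sibling_texts:
        remaining = "\n".join(strip_shared_lines(remaining, sibling))
    return remaining.splitlines()


if __name__ == "__main__":
    article_a = "Home\nSports\nCity council approves new budget\nThe council voted on Monday.\nContact us"
    article_b = "Home\nSports\nLocal team wins final\nThe match ended 2-1.\nContact us"
    for line in extract_content(article_a, [article_b]):
        print(line)
```

A sketch like this only removes lines that are identical across pages. Chrome that varies per page (related-article links, timestamps, comment counts) survives the diff, which may be one source of the leftover junk.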
Despite the above approach, I still get quite a lot of junk in my final text, which results in an incorrect News Abstract being extracted. The error rate is 5 in 10 articles, i.e. 50%. Error as in
Can you
- Suggest an alternative strategy for extracting pure content?
- Say whether learning Natural Language Processing would help in extracting the correct abstract from these articles?
- Describe how you would approach the above problem?
- Point me to any research papers on the same?
Regards
Ankur Gupta
© Stack Overflow or respective owner