Extracting pure content / text from HTML Pages by excluding navigation and chrome content

Posted by Ankur Gupta on Stack Overflow
Published on 2009-11-08T15:42:04Z Indexed on 2010/05/22 23:41 UTC

Filed under: html | artificial-intelligence | nlp | html-content-extraction | text-extraction

Hi,

I am crawling news websites and want to extract the news title, the news abstract (first paragraph), etc.

I plugged into the WebKit parser code so that I can navigate a web page as a tree. To eliminate navigation and other non-news content, I take the text version of the article (minus the HTML tags; WebKit provides an API for this). Then I run a diff algorithm comparing the text of various articles from the same website, which eliminates the text they have in common. This gives me the content minus the common navigation content, etc.
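To make the pipeline concrete, here is a minimal sketch of the two steps described above. It is not my actual code: Python's standard `html.parser` stands in for the WebKit text API, `difflib.SequenceMatcher` stands in for the diff algorithm, and the names `TextLines`, `html_to_lines`, `strip_shared_lines` and the two sample pages are made up for illustration.

```python
import difflib
from html.parser import HTMLParser


class TextLines(HTMLParser):
    """Collect visible text chunks, skipping script and style bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.lines = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Normalise whitespace so identical chrome text compares equal.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.lines.append(text)


def html_to_lines(html):
    """Return the text version of a page as a list of text chunks."""
    parser = TextLines()
    parser.feed(html)
    parser.close()
    return parser.lines


def strip_shared_lines(target, others):
    """Drop chunks of `target` that a diff matches against any page in `others`."""
    shared = set()
    for other in others:
        matcher = difflib.SequenceMatcher(a=target, b=other, autojunk=False)
        for block in matcher.get_matching_blocks():
            shared.update(range(block.a, block.a + block.size))
    return [line for i, line in enumerate(target) if i not in shared]


# Hypothetical pages from the same site: same navigation and footer, different article.
page_a = ("<html><body><div>Home | World | Sports</div><h1>Title A</h1>"
          "<p>First paragraph A.</p><div>Copyright Example News</div></body></html>")
page_b = ("<html><body><div>Home | World | Sports</div><h1>Title B</h1>"
          "<p>First paragraph B.</p><div>Copyright Example News</div></body></html>")

cleaned = strip_shared_lines(html_to_lines(page_a), [html_to_lines(page_b)])
print(cleaned)
```

Note that a diff like this can only remove text that is identical across pages; anything that varies per article (related links, comment counts, dates) passes straight through.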

Despite the above approach, I am still getting quite a bit of junk in my final text, which results in an incorrect news abstract being extracted. The error rate is 5 in 10 articles, i.e. 50%. Error as in

Can you

1. Suggest an alternative strategy for extracting the pure content?
2. Would/can learning Natural Language Processing help in extracting the correct abstract from these articles?
3. How would you approach the above problem?
4. Are there any research papers on this?

Regards

Ankur Gupta

© Stack Overflow or respective owner
