Html Agility Pack for Reading “Real World” HTML

Posted by WeigeltRo on ASP.net Weblogs
Published on Sun, 07 Oct 2012 21:43:58 GMT

In an ideal world, all data you need from the web would be available via well-designed services. In the real world you sometimes have to scrape the data off a web page. Ugly, dirty – but if you really want that data, you have no choice.

Just don’t write (yet another) HTML parser.

I stumbled across the Html Agility Pack (HAP) a long time ago, but only now needed a robust way to read HTML.

A quote from the website:

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

Using the HAP was a simple matter of getting the NuGet package, taking a look at the example and dusting off some of my XPath knowledge from years ago.
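To give an idea of what that looks like in practice, here is a minimal sketch, not the example from the HAP site. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `LinkScraper` and the method `ExtractLinks` are made up for illustration. It loads an HTML string and uses an XPath query to collect the `href` of every link.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumes the "HtmlAgilityPack" NuGet package is installed

// Hypothetical helper for illustration; not part of the HAP itself.
public static class LinkScraper
{
    // Returns the href value of every <a> element that has an href attribute.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parses the markup even if tags are left unclosed

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return new List<string>();
        }

        return anchors
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .ToList();
    }
}
```

Calling `LinkScraper.ExtractLinks(html)` with the markup of a downloaded page returns the link targets as strings. The null check is the one thing that tends to surprise people coming from System.Xml, where an empty node list would be returned instead.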

The documentation on the CodePlex site is nonexistent, but if you’ve queried a DOM or used XPath or XSLT before, you shouldn’t have problems finding your way around using IntelliSense (ReSharper tip: press Ctrl+Shift+F1 on class members to read the full doc comments).
