Managed (.net) library with html-tidy like functionality?

Posted by Eamon Nerbonne on Stack Overflow
Published on 2010-04-27T11:54:05Z Indexed on 2010/04/27 14:33 UTC

Filed under:

htmltidy

Does anybody know of an HTML cleaner for .NET that can parse HTML and (for instance) convert it to a more machine-friendly format such as XHTML?

I've tried the HTML Agility Pack, but that fails to correctly parse even fairly simple examples.

To give an example of HTML that should be parsed correctly:

<html><body>
    <ul><li>TestElem1
        <li>TestElem2
        <li>TestElem3 List:
            <ul><li>Nested1
                <li>Nested2</li>
                <li>Nested3
            </ul>
        <li>TestElem4
    </ul>
    <p>paragraph 1
    <p>paragraph 2
    <p>paragraph 3
</body></html>

The end tags of li elements are optional (see spec), and so are those of p elements. In other words, the above sample should be parsed as:

<html><body>
    <ul><li>TestElem1</li>
        <li>TestElem2</li>
        <li>TestElem3 List:
            <ul><li>Nested1</li>
                <li>Nested2</li>
                <li>Nested3</li>
            </ul></li>
        <li>TestElem4</li>
    </ul>
    <p>paragraph 1</p>
    <p>paragraph 2</p>
    <p>paragraph 3</p>
</body></html>
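
To make the expected result concrete, the optional-end-tag rule can be sketched in a few lines. This is only an illustration, written in Python for brevity: it is not a .NET solution and not how HTML Tidy or any other library works. The names `ImpliedEndTagCloser` and `tidy` are made up for this sketch, and I'm assuming it only has to handle li and p; comments, doctypes and every other HTML 4 rule are ignored.

```python
from html import escape
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "meta", "link"}  # elements with no end tag
LIST_CONTAINERS = {"ul", "ol"}
ENDS_OPEN_P = {"p", "ul", "ol", "li"}  # assumption: small subset of block-level tags


class ImpliedEndTagCloser(HTMLParser):
    """Hypothetical helper: re-emits markup with </li> and </p> made explicit."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []  # stack of currently open element names
        self.out = []        # output fragments, joined at the end

    def _in_scope(self, tag, boundaries):
        # Walk the stack from the innermost element outwards; a boundary
        # element hides anything that was opened outside of it.
        for name in reversed(self.open_tags):
            if name == tag:
                return True
            if name in boundaries:
                return False
        return False

    def _close_through(self, tag):
        # Close every open element up to and including the nearest `tag`.
        while self.open_tags:
            name = self.open_tags.pop()
            self.out.append(f"</{name}>")
            if name == tag:
                break

    def handle_starttag(self, tag, attrs):
        # A block-level start tag ends a <p> that is still open.
        if tag in ENDS_OPEN_P and self._in_scope("p", LIST_CONTAINERS | {"li"}):
            self._close_through("p")
        # A new <li> ends the previous <li> of the same list.
        if tag == "li" and self._in_scope("li", LIST_CONTAINERS):
            self._close_through("li")
        rendered = "".join(
            f' {name}="{escape(value if value is not None else name)}"'
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append(f"<{tag}{rendered}/>")
        else:
            self.out.append(f"<{tag}{rendered}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Stray end tags with no matching open element are dropped.
        if tag in self.open_tags:
            self._close_through(tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def finish(self):
        self.close()  # flush any buffered text
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def tidy(html_text):
    parser = ImpliedEndTagCloser()
    parser.feed(html_text)
    return parser.finish()
```

Feeding the first sample to `tidy` should give markup that an ordinary XML parser accepts, with the same nesting as the second sample (whitespace aside). That is the behaviour I'm after from a managed library.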

Since the aim is to use the library on various machines, falling back to native code (such as a wrapper around HTML Tidy) is a big disadvantage: it would require extra deployment hassle and sacrifice platform independence.

Any suggestions? To recap, I'm looking for:

An HTML cleaner à la HTML Tidy
Must be able to deal with real-world HTML, not just XHTML; at the very least it must correctly read valid HTML 4
Must be able to convert to a more easily processable XML format
Should be purely managed code

.NET

html-parsing

htmltidy
