HtmlAgilityPack giving problems with malformed html
- by Kapil
I want to extract the meaningful text from an HTML document, and I am using HtmlAgilityPack to do it. Here is my code:
    string convertedContent = HttpUtility.HtmlDecode(ConvertHtml(HtmlAgilityPack.HtmlEntity.DeEntitize(htmlAsString)));
ConvertHtml:
    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }
ConvertTo:
    public void ConvertTo(HtmlAgilityPack.HtmlNode node, TextWriter outText)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlAgilityPack.HtmlNodeType.Comment:
                // don't output comments
                break;
            case HtmlAgilityPack.HtmlNodeType.Document:
                foreach (HtmlNode subnode in node.ChildNodes)
                {
                    ConvertTo(subnode, outText);
                }
                break;
            case HtmlAgilityPack.HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;
                // get text
                html = ((HtmlTextNode)node).Text;
                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;
                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html) + " ");
                }
                break;
            case HtmlAgilityPack.HtmlNodeType.Element:
                switch (node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }
                if (node.HasChildNodes)
                {
                    foreach (HtmlNode subnode in node.ChildNodes)
                    {
                        ConvertTo(subnode, outText);
                    }
                }
                break;
        }
    }
Now, in some cases the HTML pages are malformed. For example, the page http://rareseeds.com/cart/products/Purple_of_Romagna_Artichoke-646-72.html has a malformed meta tag, <meta content="text/html; charset=uft-8" http-equiv="Content-Type"> (note "uft" instead of "utf"). On such pages my code fails at the point where I try to load the HTML document.
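To narrow this down, here is a minimal sketch of what I suspect is happening. It assumes the parser hands the declared charset name to System.Text.Encoding.GetEncoding, which I have not confirmed in the HtmlAgilityPack source. The class and method names (CharsetCheck, IsKnownCharset) are made up for this illustration. The point is only that .NET does not recognize the name "uft-8", while "utf-8" resolves normally:

```csharp
using System;
using System.Text;

class CharsetCheck
{
    // Hypothetical helper: returns true if .NET recognizes the charset name.
    static bool IsKnownCharset(string name)
    {
        try
        {
            Encoding.GetEncoding(name);
            return true;
        }
        catch (ArgumentException)
        {
            // GetEncoding throws ArgumentException for names it does not support.
            return false;
        }
    }

    static void Main()
    {
        // A valid charset name.
        Console.WriteLine(IsKnownCharset("utf-8"));
        // The misspelled name taken from the page's meta tag.
        Console.WriteLine(IsKnownCharset("uft-8"));
    }
}
```

If that assumption holds, any page that declares an unknown charset would trigger the same failure, not just this one.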
Can someone suggest how I can handle these malformed pages and still extract the relevant text from the HTML document?
Thanks,
Kapil