Parsing HTML using HtmlParser

Posted by Blankman on Stack Overflow
Published on 2010-04-18T02:21:20Z

My HTML contains 20 or so instances of the following pattern. Each instance represents a product and spans multiple rows of the HTML table. A single instance is shown below.

<table>
..
<!-- product starts here, this html comment is not in the real html -->
<tr>
   <td rowspan="5" class="product" valign="top"><nobr> ????????????</td>
</tr>
<tr>
   <td class="title" ??????????>?????????</td>
   <td class="title" ??????????>?????????</td>
   <td class="title" ??????????>?????????</td>
   <td class="title" ??????????>?????????</td>
   <td class="title" ??????????>?????????</td>
   <td class="title" ??????????>?????????</td>
</tr>
<tr>
    <td class="data" ?????? </td>
    <td class="data" ?????? </td>
    <td class="data" ?????? </td>
    <td class="data" ?????? </td>
    <td class="data" ?????? </td>
    <td class="data" ?????? </td>
</tr>
<tr>
    <td colspan="5" ????????</td>
</tr>
<tr>
      <td colspan="6" width="100%">&nbsp;<hr></td>
</tr>
<!-- product ends here, this html comment is not in the real html -->
<!-- above pattern repeats multiple times in the HTML -->
..
</table>

I am trying to use HtmlParser for this.

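import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.HasChildFilter;
import org.htmlparser.filters.TagNameFilter;

Parser rowParser = new Parser();
rowParser.setInputHTML(page.getHtml());  // page object represents an HTML page
rowParser.setEncoding("UTF-8");

// Match every <tr> that has a direct child <td class="product">
NodeFilter productRowFilter = new AndFilter(
        new TagNameFilter("tr"),
        new HasChildFilter(
                new AndFilter(
                        new TagNameFilter("td"),
                        new HasAttributeFilter("class", "product"))));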

The filter above doesn't work; I'm just showing what I have so far.

I need to somehow combine these filters and use the last td to mark the end of the pattern, i.e. the td with colspan="6" and width="100%" that has an hr child element.
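One way to think about the start and end markers is as a plain grouping pass over the table rows once they have been extracted. The sketch below is only an illustration under stated assumptions: `ProductGrouper` and `Row` are hypothetical names, and the two boolean flags are assumed to have been computed beforehand from the parsed rows (start: the row holds a td with class="product"; end: the row holds the colspan="6" td with the hr). It does not use HtmlParser itself, so how the flags are derived from real nodes is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class ProductGrouper {

    /** Hypothetical, simplified view of one table row: only what the grouping needs. */
    static final class Row {
        public final String label;          // free-form description, for illustration only
        public final boolean startsProduct; // assumed: row contains <td class="product">
        public final boolean endsProduct;   // assumed: row contains the colspan="6" td with <hr>

        Row(String label, boolean startsProduct, boolean endsProduct) {
            this.label = label;
            this.startsProduct = startsProduct;
            this.endsProduct = endsProduct;
        }
    }

    /** Splits a flat list of rows into one sub-list per product, in document order. */
    static List<List<Row>> groupByProduct(List<Row> rows) {
        List<List<Row>> products = new ArrayList<>();
        List<Row> current = null; // null while we are outside any product

        for (Row row : rows) {
            if (row.startsProduct) {
                if (current != null) {
                    products.add(current); // previous product had no separator row
                }
                current = new ArrayList<>();
            }
            if (current == null) {
                continue; // rows before the first product are skipped
            }
            current.add(row);
            if (row.endsProduct) {
                products.add(current);
                current = null;
            }
        }
        if (current != null) {
            products.add(current); // last product had no trailing separator row
        }
        return products;
    }
}
```

Each returned sub-list then holds the rows of one product, including its separator row when there is one, so the title and data cells can be read per product rather than from the whole table at once.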

I have been struggling with this and resorted to regex, but I have been told numerous times NOT to use regex for HTML parsing, so here I am!

Your help is much appreciated!

© Stack Overflow or respective owner
