Proper usage of JTidy to purify HTML

Posted by Raj on Stack Overflow
Published on 2010-03-30T16:49:07Z Indexed on 2010/03/30 16:53 UTC


Hello, I am trying to use JTidy (jtidy-r938.jar) to sanitize an input HTML string, but I can't seem to get the default settings right. Strings such as "hello world" often end up as "helloworld" after tidying. Here is what I'm doing; any pointers would be really appreciated.

Assume that rawHtml is the String containing the input (real world) HTML. This is what I'm doing:

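    // Feed the raw HTML to Tidy as UTF-8 bytes
    InputStream is = new ByteArrayInputStream(rawHtml.getBytes("UTF-8"));

    Tidy tidy = new Tidy();
    tidy.setQuiet(true);           // suppress the summary output
    tidy.setShowWarnings(false);   // do not report markup warnings
    tidy.setXHTML(true);           // emit XHTML

    // Parse the input and write the tidied markup to baos
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    tidy.parseDOM(is, baos);
    // Decode with an explicit charset rather than the platform default
    String pure = baos.toString("UTF-8");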

First off, does anything look fundamentally wrong with the above code? The output is not what I expect, for example the dropped whitespace mentioned above.
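
One thing that can be checked independently of JTidy is the byte/String round trip: the charset used to decode the output bytes has to match the one used to encode the input. Below is a minimal sketch using only the JDK. The class name `CharsetRoundTrip` and the method `roundTrip` are made up for illustration, and nothing here calls JTidy, so on its own it cannot explain the missing whitespace; it only rules the encoding step in or out.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JTidy: encodes a String to UTF-8 bytes,
// pushes them through a ByteArrayOutputStream, and decodes them as UTF-8.
public class CharsetRoundTrip {

    static String roundTrip(String text) throws UnsupportedEncodingException {
        // Encode explicitly as UTF-8 (same as rawHtml.getBytes("UTF-8"))
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);

        // Stand-in for the stream that a tidying step would write into
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        baos.write(bytes, 0, bytes.length);

        // Decode with the same charset; the no-argument toString()
        // would use the platform default charset instead
        return baos.toString("UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String sample = "hello world \u00e9\u00e8";
        // true when the text survives the round trip unchanged
        System.out.println(sample.equals(roundTrip(sample)));
    }
}
```

If the round trip preserves the text, whatever still goes wrong with the whitespace would have to come from the tidying step itself rather than from the encoding.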

Thanks in advance!

© Stack Overflow or respective owner
