Search Results

Search found 4 results on 1 page for 'tagsoup'.

  • JDOM 1.1: hyphen is not a valid comment character

    - by Stefan Kendall
    I'm using tagsoup to clean some HTML I'm scraping from the internet, and I'm getting the following error when parsing through pages with comments:

        The data "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - " is not legal for a JDOM comment: Comment data cannot start with a hyphen.

    I'm using JDOM 1.1, and here's the code that does the actual cleaning:

        SAXBuilder builder = new org.jdom.input.SAXBuilder("org.ccil.cowan.tagsoup.Parser"); // build
        // Don't check the doctype! At our usage rate, we'll get 503 responses
        // from the w3.
        builder.setEntityResolver(dummyEntityResolver);
        Reader in = new StringReader(str);
        org.jdom.Document doc = builder.build(in);
        String cleanXmlDoc = new org.jdom.output.XMLOutputter().outputString(doc);

    Any idea what's going wrong, or how to fix this? I need to be able to parse pages with long comment strings of <!--------- data ------------>
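
    One possible workaround, assuming the comments themselves are not needed in the cleaned output, is to strip them from the raw markup before it reaches the parser, so no comment data is ever handed to JDOM. The sketch below uses only the JDK; the class and method names (CommentStripper, stripComments) are made up for illustration, and a regex like this will also remove comment-like text inside script blocks.

```java
import java.util.regex.Pattern;

class CommentStripper {
    // DOTALL lets a comment span several lines; the reluctant ".*?" stops at
    // the first "-->" instead of running on to the last one in the page.
    private static final Pattern HTML_COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    /** Removes every <!-- ... --> span from the raw markup. */
    static String stripComments(String html) {
        return HTML_COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>before</p><!--------- data ------------><p>after</p>";
        // prints "<p>before</p><p>after</p>"
        System.out.println(stripComments(page));
    }
}
```

    The stripped string would then be passed to the StringReader in place of str.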

  • Setting a custom XOM EntityResolver

    - by Stefan Kendall
    I need to not validate against a doctype, so I'd like to set a custom EntityResolver that accepts everything. I'm getting data back from tagsoup, so I know my data is well-formed and correct. Furthermore, I need to rapidly hit a number of documents, so when I do this with the default EntityResolver, I get 503 from w3.org. How, then, can I use a XOM builder with a custom entity resolver?
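
    A minimal sketch of an "accept everything" resolver, using only the JDK's SAX classes: every external entity, including the DTD, is answered with an empty stream, so nothing is fetched from w3.org. The class and method names are illustrative. The sketch assumes the configured XMLReader is then handed to XOM through the nu.xom.Builder constructor that takes an XMLReader (shown only as a comment, since XOM is not on the classpath here).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class OfflineResolverDemo {
    // Answers every external entity (including the DTD) with an empty stream,
    // so the parser has no reason to open a network connection.
    static final EntityResolver EMPTY_RESOLVER =
            (publicId, systemId) -> new InputSource(new StringReader(""));

    /** Builds a non-validating SAX reader wired to the empty resolver. */
    static XMLReader newOfflineReader() {
        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setEntityResolver(EMPTY_RESOLVER);
            // Assumed XOM usage: new nu.xom.Builder(reader).build(...)
            return reader;
        } catch (Exception e) {
            throw new IllegalStateException("could not create SAX reader", e);
        }
    }

    /** Parses the document and returns the name of its root element. */
    static String rootElementName(String xml) {
        final String[] root = new String[1];
        XMLReader reader = newOfflineReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (root[0] == null) {
                    root[0] = qName; // remember only the first element seen
                }
            }
        });
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("document could not be parsed", e);
        }
        return root[0];
    }
}
```

    Note that with an empty DTD, named entities such as &nbsp; are no longer declared, so documents that rely on them may fail to parse.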

  • Groovy XmlSlurper

    - by Langali
    I am trying to parse an HTML file using Groovy XmlSlurper.

        <div id="users">
          <h1>Name: Joe Doe</h1>
          <div id="user">
            <div id="user_summary">Game: 1</div>
            <object width="640" height="385"><param name="movie" value="http://www.youtube.com/v/DApLO_HDhD0&hl=en_US&fs=1&"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/DApLO_HDhD0&hl=en_US&fs=1&" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="640" height="385"></embed></object>
          </div>
          <div id="user">
            <div id="user_summary">Game: 2</div>
            ...
          </div>
          <div id="user"> .... </div>
        </div>
        <div id="featured_users">
          <div id="user"> ... </div>
          <div id="user"> .... </div>
        </div>

    I need to grab each user (and not featured user) with his name, summary and object tag (which is the video embed code). Anybody wanna give it a shot? Here's a start:

        def parser = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
        def response = parser.parseText(htmlString)
        def users = response.depthFirst().collect { it }.findAll { it.@id == "users" }
        users.each { ...... }

    I can't seem to get much further:
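
    This is not a Groovy answer. It is a sketch of the same traversal in plain Java with the JDK's DOM and XPath classes. It assumes the markup has already been normalised to well-formed, non-namespaced XML (for instance ampersands escaped as &amp;), which is the job TagSoup does in the snippet above. The class name UserExtractor and the {name, summary, embed} array layout are choices made for this example only.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class UserExtractor {
    /** Returns {name, summary, embed markup} for each user under div#users. */
    static List<String[]> extractUsers(String wellFormedXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // The name lives in the <h1> directly under div#users.
            String name = xpath.evaluate("//div[@id='users']/h1", doc).trim();

            // Anchoring on div#users keeps div#featured_users out of the result.
            NodeList users = (NodeList) xpath.evaluate(
                    "//div[@id='users']/div[@id='user']", doc, XPathConstants.NODESET);

            List<String[]> result = new ArrayList<>();
            for (int i = 0; i < users.getLength(); i++) {
                Node user = users.item(i);
                String summary = xpath.evaluate("div[@id='user_summary']", user).trim();
                Node embed = (Node) xpath.evaluate("object", user, XPathConstants.NODE);
                result.add(new String[] {
                        name, summary, embed == null ? "" : serialize(embed) });
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    /** Serialises a node back to markup, without an XML declaration. */
    private static String serialize(Node node) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }
}
```

    If the parser emits elements in the XHTML namespace, the XPath steps would need a namespace context or local-name() tests instead of bare element names.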

  • Vote on Pros and Cons of Java HTML to XML cleaners

    - by George Bailey
    I am looking to allow HTML emails (and other HTML uploads) without letting in scripts and the like. I plan to have a whitelist of safe tags and attributes as well as a whitelist of CSS tags and value regexes (to prevent automatic return receipts). I asked a question, "Parse a badly formatted XML document (like an HTML file)", and found there are many ways to do this. Some systems have built-in sanitizers (which I don't care so much about). This page is a very nice listing, but I get kinda lost: http://java-source.net/open-source/html-parsers It is very important that the parsers never throw an exception. There should always be best-guess results from the parse/clean. It is also very important that the result is valid XML that can be traversed in Java. I posted some product information and marked it Community Wiki. Please post any other product suggestions you like and mark them Community Wiki so they can be voted on. Also, any comments or wiki edits on which parts of a certain product are better and which are not would be greatly appreciated (for example, speed vs. accuracy). It seems that we will go with either jsoup (seems more active and up to date) or TagSoup (compatible with JDK4 and has been around a while). A +1 for any of these products would be if they could convert all style sheets into inline styles on the elements.
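
    The tag and attribute whitelist described above could be sketched as a DOM walk once one of the cleaners has produced well-formed XML. The whitelists, the class name WhitelistSanitizer and the http/https check on href are all assumptions for illustration. The CSS whitelist is not covered, and this is not a vetted security filter.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class WhitelistSanitizer {
    // Hypothetical whitelists; a real deployment would tune both sets.
    static final Set<String> ALLOWED_TAGS = new HashSet<>(
            Arrays.asList("div", "p", "b", "i", "a", "ul", "li", "br"));
    static final Set<String> ALLOWED_ATTRS = new HashSet<>(
            Arrays.asList("title", "href"));

    /** Removes non-whitelisted elements (with their subtrees) and attributes. */
    static void sanitize(Node parent) {
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // read before a possible removal
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                Element el = (Element) child;
                if (!ALLOWED_TAGS.contains(el.getTagName().toLowerCase(Locale.ROOT))) {
                    parent.removeChild(el);
                } else {
                    stripAttributes(el);
                    sanitize(el);
                }
            }
            child = next;
        }
    }

    private static void stripAttributes(Element el) {
        NamedNodeMap attrs = el.getAttributes();
        // Walk backwards because the map shrinks as attributes are removed.
        for (int i = attrs.getLength() - 1; i >= 0; i--) {
            String name = attrs.item(i).getNodeName();
            String value = attrs.item(i).getNodeValue().trim().toLowerCase(Locale.ROOT);
            boolean allowed = ALLOWED_ATTRS.contains(name.toLowerCase(Locale.ROOT));
            if (allowed && name.equalsIgnoreCase("href")) {
                // Only plain web links survive; javascript: and similar are dropped.
                allowed = value.startsWith("http://") || value.startsWith("https://");
            }
            if (!allowed) {
                el.removeAttribute(name);
            }
        }
    }

    /** Convenience parser for already well-formed input. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }
}
```

    Dropping a disallowed element together with its subtree is the conservative choice here; keeping its text content instead would be a different trade-off.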
