Search Results

Search found 7 results on 1 pages for 'cyberneko'.

Page 1/1 | 1 

  • Processing XML comments in order using SAX & Cyberneko

    - by Joel
    I'm using cyberneko to clean and process html documents. I need to be able to process all the comments that occur in the original html documents. I've configured the cyberneko sax parser to process comments like so: parser.setProperty("http://xml.org/sax/properties/lexical-handler", consumer); ...using the same consumer as I am for DOM events. I get a callback for each of the comments: @Override public void comment(char[] arg0, int arg1, int arg2) throws SAXException { System.out.println("COMMENT::: "+new String(arg0, arg1, arg2)); } The problem I have is that all the comments are processed first, out of context of the DOM. i.e. I get a callback for all the comments before the document head, body etc.... What I'd like is for the comment callbacks to occur in the order they occur in the DOM. Edit: what I'm actually trying to do is parse the instructions for IE in the original html, such as: <!--[if lte IE 6]><body class="news ie"><![endif]--> At the moment they are all dropped, I need to include them in the cleaned HTML document.

    Read the article

  • XmlSlurper/NekoHTML document fragment parsing - No HTML or BODY tags wanted

    - by Misha Koshelev
    Dear All, I am trying to parse the following HTML fragment, and I would like to get the same fragment as output (without HTML and BODY tags). Is this possible? If so, how? Thank you Misha p.s. I am reading here: http://nekohtml.sourceforge.net/faq.html#fragments and I believe I have added the correct options below. However, the output is still incorrect :( Thank you Misha import groovy.xml.MarkupBuilder import groovy.xml.StreamingMarkupBuilder import groovy.util.XmlNodePrinter import groovy.util.slurpersupport.NodeChild def text=""" <div><h2>Test</h2> <div>Hi</div> </div> """ // Parse def config=new org.cyberneko.html.HTMLConfiguration() config.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment",true) def html=new XmlSlurper(new org.cyberneko.html.parsers.SAXParser()).parseText(text) // Output def printNode(NodeChild node) { def writer = new StringWriter() writer << new StreamingMarkupBuilder().bind { mkp.declareNamespace('':node[0].namespaceURI()) mkp.yield node } new XmlNodePrinter().print(new XmlParser().parseText(writer.toString())) } printNode(html) Output: <HTML> <tag0:HEAD xmlns:tag0="http://www.w3.org/1999/xhtml"/> <BODY> <DIV> <H2> Test </H2> <DIV> Hi </DIV> </DIV> </BODY> </HTML>

    Read the article

  • Cleaning mixed type <script> tags

    - by yossale
    I'm cleaning HTML using cyberneko and xerces. However , some $#@@!@@ websites still use BOTH <script>...</script> and <script.../> So what happens is this : given <script..../> <div> Some Text </div> <script> scripting stuff </script> , neko parses all the above line as a script , so I get <script..../> &lt div &gt Some Text &lt/div &gt &lt script &gt scripting stuff </script> , And then I lose all the inside content :( Any advice?

    Read the article

  • Groovy pretty print XmlSlurper output from HTML?

    - by Misha Koshelev
    Dear All: I am using several different versions to do this but all seem to result in this error: [Fatal Error] :1:171: The prefix "xmlns" cannot be bound to any namespace explicitly; neither can the namespace for "xmlns" be bound to any prefix explicitly. I load html as: // Load html file def fis=new FileInputStream("2.html") def html=new XmlSlurper(new org.cyberneko.html.parsers.SAXParser()).parseText(fis.text) Versions I've tried: http://johnrellis.blogspot.com/2009/08/hmmm_04.html import groovy.xml.StreamingMarkupBuilder import groovy.xml.XmlUtil def streamingMarkupBuilder=new StreamingMarkupBuilder() println XmlUtil.serialize(streamingMarkupBuilder.bind{mkp.yield html}) http://old.nabble.com/How-to-print-XmlSlurper%27s-NodeChild-with-indentation--td16857110.html // Output import groovy.xml.MarkupBuilder import groovy.xml.StreamingMarkupBuilder import groovy.util.XmlNodePrinter import groovy.util.slurpersupport.NodeChild def printNode(NodeChild node) { def writer = new StringWriter() writer << new StreamingMarkupBuilder().bind { mkp.declareNamespace('':node[0].namespaceURI()) mkp.yield node } new XmlNodePrinter().print(new XmlParser().parseText(writer.toString())) } Any advice? Thank you! Misha

    Read the article

  • Interfacing HTTPBuilder and HTMLUnit... some code

    - by Misha Koshelev
    Ok, this isn't even a question: import com.gargoylesoftware.htmlunit.HttpMethod import com.gargoylesoftware.htmlunit.WebClient import com.gargoylesoftware.htmlunit.WebResponseData import com.gargoylesoftware.htmlunit.WebResponseImpl import com.gargoylesoftware.htmlunit.util.Cookie import com.gargoylesoftware.htmlunit.util.NameValuePair import static groovyx.net.http.ContentType.TEXT import java.io.File import java.util.logging.Logger import org.apache.http.impl.cookie.BasicClientCookie /** * HTTPBuilder class * * Allows Javascript processing using HTMLUnit * * @author Misha Koshelev */ class HTTPBuilder { /** * HTTP Builder - implement this way to avoid underlying logging output */ def httpBuilder /** * Logger */ def logger /** * Directory for storing HTML files, if any */ def saveDirectory=null /** * Index of current HTML file in directory */ def saveIdx=1 /** * Current page text */ def text=null /** * Response for processJavascript (Complex Version) */ def resp=null /** * URI for processJavascript (Complex Version) */ def uri=null /** * HttpMethod for processJavascript (Complex Version) */ def method=null /** * Default constructor */ public HTTPBuilder() { // New HTTPBuilder httpBuilder=new groovyx.net.http.HTTPBuilder() // Logging logger=Logger.getLogger(this.class.name) } /** * Constructor that allows saving output files for testing */ public HTTPBuilder(saveDirectory,saveIdx) { this() this.saveDirectory=saveDirectory this.saveIdx=saveIdx } /** * Save text and return corresponding XmlSlurper object */ public saveText() { if (saveDirectory) { def file=new File(saveDirectory.toString()+File.separator+saveIdx+".html") logger.finest "HTTPBuilder.saveText: file=\""+file.toString()+"\"" file<<text saveIdx++ } new XmlSlurper(new org.cyberneko.html.parsers.SAXParser()).parseText(text) } /** * Wrapper around supertype get method */ public Object get(Map<String,?> args) { logger.finer "HTTPBuilder.get: args=\""+args+"\"" args.contentType=TEXT httpBuilder.get(args) { resp,reader-> text=reader.text this.resp=resp this.uri=args.uri this.method=HttpMethod.GET saveText() } } /** * Wrapper around supertype post method */ public Object post(Map<String,?> args) { logger.finer "HTTPBuilder.post: args=\""+args+"\"" args.contentType=TEXT httpBuilder.post(args) { resp,reader-> text=reader.text this.resp=resp this.uri=args.uri this.method=HttpMethod.POST saveText() } } /** * Load cookies from specified file */ def loadCookies(file) { logger.finer "HTTPBuilder.loadCookies: file=\""+file.toString()+"\"" file.withObjectInputStream { ois-> ois.readObject().each { cookieMap-> def cookie=new BasicClientCookie(cookieMap.name,cookieMap.value) cookieMap.remove("name") cookieMap.remove("value") cookieMap.entrySet().each { entry-> cookie."${entry.key}"=entry.value } httpBuilder.client.cookieStore.addCookie(cookie) } } } /** * Save cookies to specified file */ def saveCookies(file) { logger.finer "HTTPBuilder.saveCookies: file=\""+file.toString()+"\"" def cookieMaps=new ArrayList(new LinkedHashMap()) httpBuilder.client.cookieStore.getCookies().each { cookie-> def cookieMap=[:] cookieMap.version=cookie.version cookieMap.name=cookie.name cookieMap.value=cookie.value cookieMap.domain=cookie.domain cookieMap.path=cookie.path cookieMap.expiryDate=cookie.expiryDate cookieMaps.add(cookieMap) } file.withObjectOutputStream { oos-> oos.writeObject(cookieMaps) } } /** * Process Javascript using HTMLUnit (Simple Version) */ def processJavascript() { logger.finer "HTTPBuilder.processJavascript (Simple)" def webClient=new WebClient() def tempFile=File.createTempFile("HTMLUnit","") tempFile<<text def page=webClient.getPage("file://"+tempFile.toString()) webClient.waitForBackgroundJavaScript(10000) text=page.asXml() webClient.closeAllWindows() tempFile.delete() saveText() } /** * Process Javascript using HTMLUnit (Complex Version) * Closure, if specified, used to determine presence of necessary elements */ def processJavascript(closure) { logger.finer "HTTPBuilder.processJavascript (Complex)" // Convert response headers def headers=new ArrayList() resp.allHeaders.each() { header-> headers.add(new NameValuePair(header.name,header.value)) } def responseData=new WebResponseData(text.bytes,resp.statusLine.statusCode,resp.statusLine.toString(),headers) def response=new WebResponseImpl(responseData,uri.toURL(),method,0) // Transfer cookies def webClient=new WebClient() httpBuilder.client.cookieStore.getCookies().each { cookie-> webClient.cookieManager.addCookie(new Cookie(cookie.domain,cookie.name,cookie.value,cookie.path,cookie.expiryDate,cookie.isSecure())) } def page=webClient.loadWebResponseInto(response,webClient.getCurrentWindow()) // Wait for condition if (closure) { for (i in 1..20) { if (closure(page)) { break; } synchronized(page) { page.wait(500); } } } // Return text text=page.asXml() webClient.closeAllWindows() saveText() } } Allows one to interface HTTPBuilder with HTMLUnit! Enjoy Misha

    Read the article

  • replace XmlSlurper tag with arbitrary XML

    - by Misha Koshelev
    Dear All: I am trying to replace specific XmlSlurper tags with arbitrary XML strings. The best way I have managed to come up with to do this is: #!/usr/bin/env groovy import groovy.xml.StreamingMarkupBuilder def page=new XmlSlurper(new org.cyberneko.html.parsers.SAXParser()).parseText(""" <html> <head></head> <body> <one attr1='val1'>asdf</one> <two /> <replacemewithxml /> </body> </html> """.trim()) import groovy.xml.XmlUtil def closure closure={ bind,node-> if (node.name()=="REPLACEMEWITHXML") { bind.mkp.yieldUnescaped "<replacementxml>sometext</replacementxml>" } else { bind."${node.name()}"(node.attributes()) { mkp.yield node.text() node.children().each { child-> closure(bind,child) } } } } println XmlUtil.serialize( new StreamingMarkupBuilder().bind { bind-> closure(bind,page) } ) However, the only problem is the text() element seems to capture all child text nodes, and thus I get: <?xml version="1.0" encoding="UTF-8"?> <HTML>asdf<HEAD/> <BODY>asdf<ONE attr1="val1">asdf</ONE> <TWO/> <replacementxml>sometext</replacementxml> </BODY> </HTML> Any ideas/help much appreciated. Thank you! Misha p.s. Also, out of curiosity, if I change the above to the "Groovier" notation as follows, the groovy compiler thinks I am trying to access the ${node.name()} member of my test class. Is there a way to specify this is not the case while still not passing the actual builder object? Thank you! :) def closure closure={ node-> if (node.name()=="REPLACEMEWITHXML") { mkp.yieldUnescaped "<replacementxml>sometext</replacementxml>" } else { "${node.name()}"(node.attributes()) { mkp.yield node.text() node.children().each { child-> closure(child) } } } } println XmlUtil.serialize( new StreamingMarkupBuilder().bind { closure(page) } )

    Read the article

  • HTTP Builder/Groovy - lost 302 (redirect) handling?

    - by Misha Koshelev
    Dear All: I am reading here http://groovy.codehaus.org/modules/http-builder/doc/handlers.html "In cases where a response sends a redirect status code, this is handled internally by Apache HttpClient, which by default will simply follow the redirect by re-sending the request to the new URL. You do not need to do anything special in order to follow 302 responses." This seems to work fine when I simply use the get() or post() methods without a closure. However, when I use a closure, I seem to lose 302 handling. Is there some way I can handle this myself? Thank you p.s. Here is my log output showing it is a 302 response [java] FINER: resp.statusLine: "HTTP/1.1 302 Found" Here is the relevant code: // Copyright (C) 2010 Misha Koshelev. All Rights Reserved. package com.mksoft.fbbday.main import groovyx.net.http.ContentType import java.util.logging.Level import java.util.logging.Logger class HTTPBuilder { def dataDirectory HTTPBuilder(dataDirectory) { this.dataDirectory=dataDirectory } // Main logic def logger=Logger.getLogger(this.class.name) def closure={resp,reader-> logger.finer("resp.statusLine: \"${resp.statusLine}\"") if (logger.isLoggable(Level.FINEST)) { def respHeadersString='Headers:'; resp.headers.each() { header->respHeadersString+="\n\t${header.name}=\"${header.value}\"" } logger.finest(respHeadersString) } def text=reader.text def lastHtml=new File("${dataDirectory}${File.separator}last.html") if (lastHtml.exists()) { lastHtml.delete() } lastHtml<<text new XmlSlurper(new org.cyberneko.html.parsers.SAXParser()).parseText(text) } def processArgs(args) { if (logger.isLoggable(Level.FINER)) { def argsString='Args:'; args.each() { arg->argsString+="\n\t${arg.key}=\"${arg.value}\"" } logger.finer(argsString) } args.contentType=groovyx.net.http.ContentType.TEXT args } // HTTPBuilder methods def httpBuilder=new groovyx.net.http.HTTPBuilder () def get(args) { httpBuilder.get(processArgs(args),closure) } def post(args) { args.contentType=groovyx.net.http.ContentType.TEXT httpBuilder.post(processArgs(args),closure) } } Here is a specific tester: #!/usr/bin/env groovy import groovyx.net.http.HTTPBuilder import groovyx.net.http.Method import static groovyx.net.http.ContentType.URLENC import java.util.logging.ConsoleHandler import java.util.logging.Level import java.util.logging.Logger // MUST ENTER VALID FACEBOOK EMAIL AND PASSWORD BELOW !!! def email='' def pass='' // Remove default loggers def logger=Logger.getLogger('') def handlers=logger.handlers handlers.each() { handler->logger.removeHandler(handler) } // Log ALL to Console logger.setLevel Level.ALL def consoleHandler=new ConsoleHandler() consoleHandler.setLevel Level.ALL logger.addHandler(consoleHandler) // Facebook - need to get main page to capture cookies def http = new HTTPBuilder() http.get(uri:'http://www.facebook.com') // Login def html=http.post(uri:'https://login.facebook.com/login.php?login_attempt=1',body:[email:email,pass:pass]) assert html==null // Why null? html=http.post(uri:'https://login.facebook.com/login.php?login_attempt=1',body:[email:email,pass:pass]) { resp,reader-> assert resp.statusLine.statusCode==302 // Shouldn't we be redirected??? // http://groovy.codehaus.org/modules/http-builder/doc/handlers.html // "In cases where a response sends a redirect status code, this is handled internally by Apache HttpClient, which by default will simply follow the redirect by re-sending the request to the new URL. You do not need to do anything special in order to follow 302 responses. " } Here are relevant logs: FINE: Receiving response: HTTP/1.1 302 Found Jun 4, 2010 4:37:22 PM org.apache.http.impl.conn.DefaultClientConnection receiveResponseHeader FINE: << HTTP/1.1 302 Found Jun 4, 2010 4:37:22 PM org.apache.http.impl.conn.DefaultClientConnection receiveResponseHeader FINE: << Cache-Control: private, no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Jun 4, 2010 4:37:22 PM org.apache.http.impl.conn.DefaultClientConnection receiveResponseHeader FINE: << Expires: Sat, 01 Jan 2000 00:00:00 GMT Jun 4, 2010 4:37:22 PM org.apache.http.impl.conn.DefaultClientConnection receiveResponseHeader FINE: << Location: http://www.facebook.com/home.php? Jun 4, 2010 4:37:22 PM org.apache.http.impl.conn.DefaultClientConnection receiveResponseHeader FINE: << P3P: CP="DSP LAW" Jun 4, 2010 4:37:22 PM org.apache.http.impl.conn.DefaultClientConnection receiveResponseHeader FINE: << Pragma: no-cache Jun 4, 2010 4:37:22 PM org.apache.http.impl.conn.DefaultClientConnection receiveResponseHeader FINE: << Set-Cookie: datr=1275687438-9ff6ae60a89d444d0fd9917abf56e085d370277a6e9ed50c1ba79; expires=Sun, 03-Jun-2012 21:37:24 GMT; path=/; domain=.facebook.com Jun 4, 2010 4:37:22 PM org.apache.http.impl.conn.DefaultClientConnection receiveResponseHeader FINE: << Set-Cookie: lxe=koshelev%40post.harvard.edu; expires=Tue, 28-Sep-2010 15:24:04 GMT; path=/; domain=.facebook.com; httponly Jun 4, 2010 4:37:22 PM org.apache.http.impl.conn.DefaultClientConnection receiveResponseHeader FINE: << Set-Cookie: lxr=deleted; expires=Thu, 04-Jun-2009 21:37:23 GMT; path=/; domain=.facebook.com; httponly Jun 4, 2010 4:37:22 PM org.apache.http.impl.conn.DefaultClientConnection receiveResponseHeader FINE: << Set-Cookie: pk=183883c0a9afab1608e95d59164cc7dd; path=/; domain=.facebook.com; httponly Jun 4, 2010 4:37:22 PM org.apache.http.impl.conn.DefaultClientConnection receiveResponseHeader FINE: << Content-Type: text/html; charset=utf-8 Jun 4, 2010 4:37:22 PM org.apache.http.impl.conn.DefaultClientConnection receiveResponseHeader FINE: << X-Cnection: close Jun 4, 2010 4:37:22 PM org.apache.http.impl.conn.DefaultClientConnection receiveResponseHeader FINE: << Date: Fri, 04 Jun 2010 21:37:24 GMT Jun 4, 2010 4:37:22 PM org.apache.http.impl.conn.DefaultClientConnection receiveResponseHeader FINE: << Content-Length: 0 Jun 4, 2010 4:37:22 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies FINE: Cookie accepted: "[version: 0][name: datr][value: 1275687438-9ff6ae60a89d444d0fd9917abf56e085d370277a6e9ed50c1ba79][domain: .facebook.com][path: /][expiry: Sun Jun 03 16:37:24 CDT 2012]". Jun 4, 2010 4:37:22 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies FINE: Cookie accepted: "[version: 0][name: lxe][value: koshelev%40post.harvard.edu][domain: .facebook.com][path: /][expiry: Tue Sep 28 10:24:04 CDT 2010]". Jun 4, 2010 4:37:22 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies FINE: Cookie accepted: "[version: 0][name: lxr][value: deleted][domain: .facebook.com][path: /][expiry: Thu Jun 04 16:37:23 CDT 2009]". Jun 4, 2010 4:37:22 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies FINE: Cookie accepted: "[version: 0][name: pk][value: 183883c0a9afab1608e95d59164cc7dd][domain: .facebook.com][path: /][expiry: null]". Jun 4, 2010 4:37:22 PM org.apache.http.impl.client.DefaultRequestDirector execute FINE: Connection can be kept alive indefinitely Jun 4, 2010 4:37:22 PM groovyx.net.http.HTTPBuilder doRequest FINE: Response code: 302; found handler: post302$_run_closure2@7023d08b Jun 4, 2010 4:37:22 PM groovyx.net.http.HTTPBuilder doRequest FINEST: response handler result: null Jun 4, 2010 4:37:22 PM org.apache.http.impl.conn.SingleClientConnManager releaseConnection FINE: Releasing connection org.apache.http.impl.conn.SingleClientConnManager$ConnAdapter@605b28c9 You can see there is clearly a location argument. Thank you Misha

    Read the article

1