Groovy XmlSlurper for HTML Parsing
A very common task: you need to parse XML. In Groovy there is groovy.util.XmlSlurper for that. HTML looks like a close relative of XML, but when you parse markup from online resources you have to assume that it is never well-formed. So there has to be a way to avoid parse errors and still use XmlSlurper's great node-traversing functionality…
Don't try to build your own SAXParser or SAXParserFactory with special features: one that doesn't load DTDs, has a dummy entity resolver, is non-validating, or ignores missing end tags. Far too much effort, and not groovy…
[groovy title="Not groovy"]
// SAXParserFactory saxParserFactory = javax.xml.parsers.SAXParserFactory.newInstance()
// saxParserFactory.validating = false
// saxParserFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
//// saxParserFactory.setFeature("http://xml.org/sax/features/validation", false)
//// saxParserFactory.setFeature("http://apache.org/xml/features/validation/schema", false)
//
// SAXParser saxParser = saxParserFactory.newSAXParser()
// saxParser.setEntityResolver(entityResolver);
// XmlSlurper xmlSlurper = new XmlSlurper(saxParser)

// Create instance of XmlSlurper and set EntityResolver
// def xmlSlurper = new XmlSlurper(false, false)
// xmlSlurper.setEntityResolver(DummyDTD.entityResolver)

// http://groovy.329449.n5.nabble.com/XmlSlurper-without-Validation-and-DTD-access-td334171.html
// class DummyDTD {
//     def static entityResolver = [
//         resolveEntity: { publicId, systemId ->
//         }
//     ] as org.xml.sax.EntityResolver
// }
[/groovy]
When you use the Groovy HTTPBuilder to grab your markup, you can simply use NekoHTML, because it is a direct dependency of HTTPBuilder.
[groovy title="http://repository.codehaus.org/org/codehaus/groovy/modules/http-builder/http-builder/0.5.2/http-builder-0.5.2.pom"]
<dependency>
    <!-- Only needed for HTML parsing -->
    <groupId>net.sourceforge.nekohtml</groupId>
    <artifactId>nekohtml</artifactId>
    <version>1.9.9</version>
</dependency>
[/groovy]
Because NekoHTML provides a ready-made SAXParser for HTML, creating a Groovy XmlSlurper for HTML parsing is as easy as passing NekoHTML's SAXParser to the constructor:
[groovy title="XmlSlurper for parsing HTML"]
// The class is groovy.util.XmlSlurper; it has no nested XmlSlurper type
groovy.util.XmlSlurper xmlSlurper = new groovy.util.XmlSlurper(new org.cyberneko.html.parsers.SAXParser())
[/groovy]
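To see the slurper at work, here is a minimal sketch that parses a deliberately sloppy HTML snippet (unclosed p tags) and walks the result. The sample markup and the variable names are made up for illustration, and the @Grab line assumes you run this as a standalone script without HTTPBuilder on the classpath. One thing to keep in mind: with its default settings NekoHTML reports element names in upper case and attribute names in lower case, so the GPath expressions use HEAD, TITLE and A. On newer Groovy versions, where XmlSlurper lives in groovy.xml, you would probably need to add an import for groovy.xml.XmlSlurper.

[groovy title="Traversing malformed HTML (sketch)"]
@Grab('net.sourceforge.nekohtml:nekohtml:1.9.9')
import org.cyberneko.html.parsers.SAXParser

// Hypothetical sample markup: the <p> tags are never closed
def html = '''<html><head><title>Demo</title></head>
<body><p>First<p>Second <a href="/home">Home</a></body></html>'''

// XmlSlurper backed by NekoHTML's forgiving SAX parser
def slurper = new XmlSlurper(new SAXParser())
def page = slurper.parseText(html)

// Element names are upper case with NekoHTML's default settings
def title = page.HEAD.TITLE.text()

// Depth-first search for all anchors, collecting their href attributes
def links = page.'**'.findAll { it.name() == 'A' }.collect { it.@href.text() }

println title
println links
[/groovy]

From here on it is plain XmlSlurper: everything you know about GPath traversal applies to the parsed page.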