Groovy XmlSlurper for HTML Parsing
A very common task: you need to parse XML. In Groovy there is groovy.util.XmlSlurper for that. HTML looks a lot like XML, but when you parse markup from online resources you have to assume that it is never well-formed. So there has to be a way to parse it without errors and still use XmlSlurper's great node-traversing functionality…
Don't try to build your own SAXParser or SAXParserFactory with special features so that it doesn't load DTDs, uses a dummy entity resolver, doesn't validate or ignores missing end tags. Far too much effort, not groovy…
[groovy title="Not groovy"]
// SAXParserFactory saxParserFactory = javax.xml.parsers.SAXParserFactory.newInstance()
// saxParserFactory.validating = false
// saxParserFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
//// saxParserFactory.setFeature("http://xml.org/sax/features/validation", false)
//// saxParserFactory.setFeature("http://apache.org/xml/features/validation/schema", false)
//
// SAXParser saxParser = saxParserFactory.newSAXParser()
// saxParser.setEntityResolver(entityResolver);
// XmlSlurper xmlSlurper = new XmlSlurper(saxParser)
// Create instance of XmlSlurper and set EntityResolver
// def xmlSlurper = new XmlSlurper(false, false)
// xmlSlurper.setEntityResolver(DummyDTD.entityResolver)
// http://groovy.329449.n5.nabble.com/XmlSlurper-without-Validation-and-DTD-access-td334171.html
// class DummyDTD {
// def static entityResolver = [
// resolveEntity: { publicId, systemId ->
// }
// ] as org.xml.sax.EntityResolver
// }
[/groovy]
When you use the Groovy HTTPBuilder to grab your markup, you can just use neko-html, because it's a direct dependency of HTTPBuilder.
[groovy title="http://repository.codehaus.org/org/codehaus/groovy/modules/http-builder/http-builder/0.5.2/http-builder-0.5.2.pom"]
<dependency>
    <!-- Only needed for HTML parsing -->
    <groupId>net.sourceforge.nekohtml</groupId>
    <artifactId>nekohtml</artifactId>
    <version>1.9.9</version>
</dependency>
[/groovy]
Because neko-html provides a ready-made SAXParser for HTML, creating a Groovy XmlSlurper for HTML parsing is as easy as passing neko-html's SAXParser to the constructor:
[groovy title="XmlSlurper for parsing HTML"]
groovy.util.XmlSlurper xmlSlurper = new groovy.util.XmlSlurper(new org.cyberneko.html.parsers.SAXParser())
[/groovy]
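To see the traversal in action, here is a minimal sketch that feeds a deliberately sloppy HTML string to such a slurper. The sample markup and the variable names are made up for illustration, the @Grab line assumes you run it as a standalone script without HTTPBuilder on the classpath (the version is simply the one from the POM above), and on newer Groovy versions the class lives in groovy.xml.XmlSlurper instead of groovy.util. Note that neko-html reports element names in upper case by default, so the GPath expressions use HTML, BODY, P and so on.
[groovy title="Traversing not well-formed HTML (sketch)"]
// Assumption: standalone script, so neko-html is fetched via Grape
@Grab('net.sourceforge.nekohtml:nekohtml:1.9.9')
import org.cyberneko.html.parsers.SAXParser

// Hypothetical sample markup: unclosed <p>, <br> and <a> tags
def html = '<html><head><title>Demo</title></head>' +
           '<body><p>first<p>second<br><a href="/docs">docs</a></body></html>'

// neko-html's SAXParser is an XMLReader, so XmlSlurper accepts it directly
def page = new groovy.util.XmlSlurper(new SAXParser()).parseText(html)

// Element names are upper case by default, attribute names lower case
assert page.HEAD.TITLE.text() == 'Demo'
assert page.BODY.P.size() == 2

// Depth-first search for all links, collecting their href attributes
def links = page.'**'.findAll { it.name() == 'A' }.collect { it.@href.text() }
assert links == ['/docs']
[/groovy]
If you prefer lower-case element names, neko-html has a parser property for that (http://cyberneko.org/html/properties/names/elems), which you can set on the SAXParser before handing it to XmlSlurper.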



