Groovy XmlSlurper for HTML Parsing


A very common task: you need to parse XML. In Groovy there is groovy.util.XmlSlurper for that. HTML looks a lot like XML, but when you parse markup from online resources you have to assume that it is never well-formed. So there has to be a way to avoid parse errors and still use XmlSlurper’s great node-traversing functionality…
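To see the problem, here is a minimal sketch. The markup string is made up for illustration, and the import assumes a Groovy version that still ships groovy.util.XmlSlurper (as far as I know, newer releases provide the class as groovy.xml.XmlSlurper instead). The unclosed `<br>` is perfectly normal HTML, but a strict XML parser refuses it:

```groovy
import groovy.util.XmlSlurper
import org.xml.sax.SAXParseException

// Hypothetical sample markup: <br> is never closed, which is fine in HTML but not in XML
def html = '<html><body><p>Hello<br>World</p></body></html>'

def failure = null
try {
    // The default XmlSlurper uses a strict XML parser
    new XmlSlurper().parseText(html)
} catch (SAXParseException e) {
    // The parser stops at the first well-formedness error
    failure = e
}

println "Parsing failed: ${failure?.message}"
```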

Don’t try to build your own SAXParser or SAXParserFactory with special features so that it doesn’t load DTDs, uses a dummy entity resolver, doesn’t validate, or ignores missing end tags. Far too much effort, and not groovy…

[groovy title="Not groovy"]
// Attempt 1: configure a SAXParserFactory by hand (feature names left blank here)
// SAXParserFactory saxParserFactory = javax.xml.parsers.SAXParserFactory.newInstance()
// saxParserFactory.validating = false
// saxParserFactory.setFeature("", false)
// SAXParser saxParser = saxParserFactory.newSAXParser()
// XmlSlurper xmlSlurper = new XmlSlurper(saxParser)
// The entity resolver is set on the XmlSlurper; javax.xml.parsers.SAXParser has no such setter
// xmlSlurper.setEntityResolver(DummyDTD.entityResolver)

// Attempt 2: create a non-validating XmlSlurper and set a dummy EntityResolver
// def xmlSlurper = new XmlSlurper(false, false)
// xmlSlurper.setEntityResolver(DummyDTD.entityResolver)

// class DummyDTD {
//     // Return an empty source so that no DTD is ever fetched
//     static entityResolver = [
//         resolveEntity: { publicId, systemId ->
//             new org.xml.sax.InputSource(new StringReader(''))
//         }
//     ] as org.xml.sax.EntityResolver
// }
[/groovy]

If you use the Groovy HttpBuilder to grab your markup, you can simply use neko-html, because it’s already a direct dependency of HttpBuilder.

[groovy title=""]
<!-- Only needed for HTML parsing -->
[/groovy]

Because neko-html provides a ready-made SAXParser for HTML, creating a Groovy XmlSlurper for HTML parsing is as easy as calling the constructor with neko-html’s SAXParser:

[groovy title="XmlSlurper for parsing HTML"]
groovy.util.XmlSlurper xmlSlurper = new groovy.util.XmlSlurper(new org.cyberneko.html.parsers.SAXParser())
[/groovy]
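A short usage sketch of what you can then do with such a slurper. Some assumptions on my side: the sample markup is invented, the `@Grab` version is just one I picked so the script runs standalone (inside a project that already pulls in HttpBuilder you would not need it), and the import assumes a Groovy version that still ships groovy.util.XmlSlurper. As far as I know, neko-html upper-cases element names by default, so the sketch compares tag names case-insensitively instead of relying on that:

```groovy
// Assumption: the version is picked for illustration; use whatever your build resolves
@Grab('net.sourceforge.nekohtml:nekohtml:1.9.22')
import groovy.util.XmlSlurper
import org.cyberneko.html.parsers.SAXParser

// Hypothetical, deliberately sloppy markup: unclosed <br>, unclosed second <a>, unclosed <p>
def html = '<html><body><p>Hello<br>World <a href="/one">one</a> <a href="/two">two</body></html>'

// neko-html's SAXParser is an XMLReader, so XmlSlurper accepts it directly
def page = new XmlSlurper(new SAXParser()).parseText(html)

// Depth-first traversal over all nodes, keeping only the anchors
def links = page.'**'
        .findAll { it.name().equalsIgnoreCase('a') }
        .collect { it.@href.text() }

println links
```

The nice part is that everything after the constructor call is plain XmlSlurper: GPath expressions, `'**'` traversal and attribute access work just as they do on well-formed XML.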