Choose Your Java XML Parser
The XML parser world is a dynamic one. As standards change, the parsers change as well, and XML parsers are becoming more sophisticated. For most programming projects, the parser must, at minimum, support DOM Level 2, SAX 2, XSLT, and Namespaces. All the parsers discussed here provide these capabilities; however, there are distinct differences in performance, reliability, and conformance to standards. In this article, I'll compare the latest parsers from Sun, Oracle, and the Apache Software Foundation.
Apache parser
The Apache parser version 1.2.3 (commonly known as Xerces) is an open-source effort based on IBM's XML4J parser. Xerces has full support for the W3C Document Object Model (DOM) Level 1 and the Simple API for XML (SAX) 1.0 and 2.0; however, it currently has only limited support for XML Schemas and DOM Level 2 (version 1). Add the xerces.jar file to your CLASSPATH to use the parser. You can use Xalan, also available from Apache's Web site, for XSLT processing. You can configure both the DOM and SAX parsers. Xerces uses the SAX2 methods getFeature() and setFeature() to query and set various parser features. For example, to create a validating DOM parser instance, you would write:
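// Requires org.apache.xerces.parsers.DOMParser and org.xml.sax.SAXException
DOMParser domp = new DOMParser();
try {
    // Turn on validation through the standard SAX2 feature identifier
    domp.setFeature("http://xml.org/sax/features/validation", true);
} catch (SAXException ex) {
    System.out.println(ex);
}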
Other modifiable features include support for Schemas and namespaces.
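The same getFeature()/setFeature() pattern can be tried without Xerces on the CLASSPATH. The following is a minimal sketch, assuming the JAXP parser bundled with the JDK rather than the Xerces DOMParser class; the class name FeatureSketch and its helpers newReader and enableFeature are hypothetical names chosen for illustration. It switches a feature on by its SAX2 identifier and reads the setting back.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical illustration class; not part of Xerces.
class FeatureSketch {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    // Creates a SAX2 XMLReader from whatever parser JAXP locates
    // (assumption: the parser built into the JDK).
    static XMLReader newReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException ex) {
            throw new IllegalStateException("No usable SAX parser", ex);
        }
    }

    // Tries to switch a feature on and reports the value the parser now holds.
    // Returns false when the parser does not recognize or support the feature.
    static boolean enableFeature(XMLReader reader, String featureId) {
        try {
            reader.setFeature(featureId, true);
            return reader.getFeature(featureId);
        } catch (SAXException ex) {
            // SAXNotRecognizedException and SAXNotSupportedException
            // are both subclasses of SAXException.
            return false;
        }
    }

    public static void main(String[] args) {
        XMLReader reader = newReader();
        System.out.println("namespaces: " + enableFeature(reader, NAMESPACES));
        System.out.println("validation: " + enableFeature(reader, VALIDATION));
    }
}
```

Returning false for an unknown identifier is a design choice of this sketch; production code would more likely let the exception propagate so that a misspelled feature name is noticed.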
The following example shows a minimal program that counts the number of <servlet> tags in an XML file using the DOM. The second import line specifically refers to the Xerces parser. The main method creates a new DOMParser instance and then invokes its parse() method. If the parse operation succeeds, you can retrieve a Document object through which you can access and manipulate the DOM tree using standard DOM API calls. This simple example retrieves the "servlet" nodes and prints out the number of nodes retrieved.
import org.w3c.dom.*;
import org.apache.xerces.parsers.DOMParser;

public class DOM
{
    public static void main(String[] args)
    {
        try {
            // Parse the file named on the command line
            DOMParser parser = new DOMParser();
            parser.parse(args[0]);
            Document doc = parser.getDocument();
            // Collect every <servlet> element in the document
            NodeList nodes = doc.getElementsByTagName("servlet");
            System.out.println("There are " + nodes.getLength() +
                " <servlet> elements.");
        } catch (Exception ex) {
            System.out.println(ex);
        }
    }
}
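For readers without Xerces on the CLASSPATH, the same count can be reproduced with the JAXP DocumentBuilder API that ships with the JDK. This is a sketch under that assumption, not the Xerces-specific code above: the class DomCountSketch, its countElements helper, and the inline sample document are all made up for illustration, and the XML is parsed from a string so the example needs no web.xml file.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical illustration class; uses JAXP instead of the Xerces DOMParser.
class DomCountSketch {
    // Invented sample data standing in for a real web.xml file.
    static final String SAMPLE =
        "<web-app>"
        + "<servlet><servlet-name>a</servlet-name></servlet>"
        + "<servlet><servlet-name>b</servlet-name></servlet>"
        + "</web-app>";

    // Builds a DOM tree from the XML text and counts elements with the given tag name.
    static int countElements(String xml, String tagName) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // getElementsByTagName matches the full tag name, so "servlet"
            // does not match <servlet-name>.
            return doc.getElementsByTagName(tagName).getLength();
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalArgumentException("Could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println("There are " + countElements(SAMPLE, "servlet")
            + " <servlet> elements.");
    }
}
```

Because the whole tree is built before the count is taken, this approach holds the entire document in memory, which is the usual trade-off of DOM against an event-based API.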
You can use SAX to accomplish the same task. SAX is event-oriented. In the following example, the SAX class inherits from DefaultHandler, which has default implementations for all the SAX event handlers, and overrides two methods: startElement() and endDocument(). The parser calls the startElement() method each time it encounters a new element in the XML file. In the overridden startElement() method, the code checks for the "servlet" tag and increments the tagCount counter variable. When the parser reaches the end of the XML file, it calls the endDocument() method, and the code prints out the counter variable at that point. The main() method sets the ContentHandler and the ErrorHandler properties of the SAXParser instance, and then uses the parse() method to start the actual parsing.
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import org.apache.xerces.parsers.SAXParser;

public class SAX extends DefaultHandler
{
    int tagCount = 0;

    // Called by the parser at the start of every element
    public void startElement(String uri, String localName,
        String rawName, Attributes attributes)
    {
        if (rawName.equals("servlet")) {
            tagCount++;
        }
    }

    // Called once, after the whole document has been read
    public void endDocument()
    {
        System.out.println("There are " + tagCount +
            " <servlet> elements.");
    }

    public static void main(String[] args)
    {
        try {
            SAX SAXHandler = new SAX();
            SAXParser parser = new SAXParser();
            // Register this class for both content and error events
            parser.setContentHandler(SAXHandler);
            parser.setErrorHandler(SAXHandler);
            parser.parse(args[0]);
        }
        catch (Exception ex) {
            System.out.println(ex);
        }
    }
}
With Xerces installed and your CLASSPATH set, you can compile and run the DOM and SAX programs above as follows:
javac DOM.java
javac SAX.java
C:\xml\code>java DOM web.xml
There are 6 <servlet> elements.
C:\xml\code>java SAX web.xml
There are 6 <servlet> elements.
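The event-driven SAX counter described earlier can also be tried without Xerces. The following is a minimal sketch, assuming the JAXP SAXParserFactory that ships with the JDK in place of org.apache.xerces.parsers.SAXParser; the class SaxCountSketch, its countElements helper, and the inline sample document are hypothetical. Unlike the SAX class above, it returns the count to the caller instead of printing it from endDocument(), and it reads XML from a string rather than from web.xml.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; uses JAXP instead of the Xerces SAXParser.
class SaxCountSketch extends DefaultHandler {
    // Invented sample data standing in for a real web.xml file.
    static final String SAMPLE =
        "<web-app>"
        + "<servlet><servlet-name>a</servlet-name></servlet>"
        + "<servlet><servlet-name>b</servlet-name></servlet>"
        + "</web-app>";

    private final String tagName;
    private int tagCount = 0;

    SaxCountSketch(String tagName) {
        this.tagName = tagName;
    }

    // The parser calls this once for every start tag it encounters.
    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        // A default JAXP factory is not namespace-aware, so the
        // qualified name is the one compared here.
        if (qName.equals(tagName)) {
            tagCount++;
        }
    }

    // Streams the XML text through a handler and returns the final count.
    static int countElements(String xml, String tagName) {
        SaxCountSketch handler = new SaxCountSketch(tagName);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalArgumentException("Could not parse XML", ex);
        }
        return handler.tagCount;
    }

    public static void main(String[] args) {
        System.out.println("There are " + countElements(SAMPLE, "servlet")
            + " <servlet> elements.");
    }
}
```

Since the handler keeps only a counter, memory use stays flat no matter how large the input is, which is the main reason to choose SAX over DOM for a simple tally like this one.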