Writing XML documents is very straightforward, but reading them is not nearly as
simple. Fortunately, we can use an XML parser to read the document for us. The
XML parser exposes the contents of an XML document through an API. A client
application reads an XML document through this API. As well as reading the
document and providing the contents to the client application, the parser also checks
the document for well-formedness and (optionally) validity. If it finds an error, it
informs the client application.
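As a small illustration of the well-formedness check, the sketch below feeds two strings to the JAXP DOM parser and reports whether each one is accepted. The class name WellFormednessCheck and the method isWellFormed are hypothetical names chosen for this example; the parser may also print its own error message to the standard error stream.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the example programs in this document
class WellFormednessCheck {

    /** Returns true if the text parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            db.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser informs the client of a well-formedness error via an exception
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b/></a>")); // every tag is closed
        System.out.println(isWellFormed("<a><b></a>"));  // <b> is never closed
    }
}
```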
An XML parser is a software module that reads XML documents and provides access to
their content and structure. Many parsers return the result to the calling application
(a browser, for example) as a structured tree. In that sense an XML parser is similar to
a processor that determines the structure and properties of the data. An application
can then use the parsed result, for instance, to generate a display form. The XML
Parser for Java runs on any platform that has a Java virtual machine. It is sometimes
called XML4J. Its interface lets you take a string of XML-formatted text, pick out the
XML tags and use them to extract the tagged information.
Among the various XML parsers, the two most widely used are the SAX parser and the
DOM parser. Here is a brief description of each.
SAX
SAX, the Simple API for XML, is the gold standard of XML APIs. It is the most
complete and correct by far. Given a fully validating parser that supports all its
optional features, there is very little you can’t do with it. It has one or two holes, but
they're really off in the weeds of the XML specifications, and you have to look pretty
hard to find them. SAX is an event driven API. The SAX classes and interfaces model
the parser, the stream from which the document is read, and the client application
receiving data from the parser. However, no class models the XML document itself.
Instead the parser feeds content to the client application through a callback interface,
much like the ones used in Swing and the AWT. This makes SAX very fast and very
memory efficient (since it doesn’t have to store the entire document in memory).
However, SAX programs can be harder to design and code because you normally
need to develop your own data structures to hold the content from the document.
SAX works best when your processing is fairly local; that is, when all the information
you need to use is close together in the document. For example, you might process
one element at a time. Applications that require access to the entire document at once
in order to take useful action would be better served by one of the tree-based APIs
like DOM or JDOM. Finally, because SAX is so efficient, it’s the only real choice for
truly huge XML documents. Of course, “truly huge” has to be defined relative to
available memory. However, if the documents you're processing are in the gigabyte
range, you really have no choice but to use SAX.
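To make the callback style concrete, here is a minimal sketch of a SAX handler that counts elements with a given name without ever holding the document in memory. The class ElementCounter and its count method are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: counts start tags with a given name
class ElementCounter extends DefaultHandler {
    private final String target;
    private int count;

    ElementCounter(String target) {
        this.target = target;
    }

    // Called by the parser each time it reads a start tag
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals(target)) {
            count++;
        }
    }

    /** Parses the XML text and returns how many elements are named elementName. */
    static int count(String xml, String elementName) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            ElementCounter handler = new ElementCounter(elementName);
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.count;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Office><Employee/><Employee/><Visitor/></Office>";
        System.out.println(count(xml, "Employee"));
    }
}
```

The only state the handler keeps is a single counter, which is why this style scales to very large inputs.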
DOM
DOM, the Document Object Model, is a fairly complex API that models an XML
document as a tree. Unlike SAX, DOM is a read-write API. It can both parse existing
XML documents and create new ones. Each XML document is represented as a
Document object. Documents are searched, queried, and updated by invoking methods
on this Document object and the objects it contains. This makes DOM much more
convenient when random access to widely separated parts of the original document is
required. However, it is quite memory intensive compared to SAX, and not nearly as
well suited to streaming applications.
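The read-write nature of DOM can be sketched as follows: the document is parsed into a tree, queried, and then modified in memory. The class DomReadWrite and its helper methods are hypothetical names for this illustration; only standard JAXP and org.w3c.dom calls are used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical example class showing DOM as a read-write API
class DomReadWrite {

    /** Parses XML text into an in-memory DOM tree. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Read access: counts the Employee elements anywhere in the tree. */
    static int countEmployees(Document doc) {
        return doc.getElementsByTagName("Employee").getLength();
    }

    /** Write access: appends a new Employee element to the root element. */
    static Element addEmployee(Document doc, String type) {
        Element e = doc.createElement("Employee");
        e.setAttribute("type", type);
        doc.getDocumentElement().appendChild(e);
        return e;
    }

    public static void main(String[] args) {
        Document doc = parse("<Office><Employee type=\"permanent\"/></Office>");
        System.out.println(countEmployees(doc));
        addEmployee(doc, "contract");
        System.out.println(countEmployees(doc));
    }
}
```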
Later in this document I have included examples of XML parsing with Java using both
of these parsers.
XML Parsing with JAVA
I would like to start with an example of how to parse an XML file, create Java objects
from it and manipulate them.
The idea here is to parse the employees.xml file, whose content is shown below.
<?xml version="1.0" encoding="UTF-8"?>
<Office>
<Employee type="permanent">
<Name>Debamalya</Name>
<Id>235960</Id>
<Age>25</Age>
</Employee>
<Employee type="contract">
<Name>Rishin</Name>
<Id>3675</Id>
<Age>24</Age>
</Employee>
<Employee type="permanent">
<Name>Debalina</Name>
<Id>3676</Id>
<Age>28</Age>
</Employee>
</Office>
From the parsed content create a list of Employee objects and print it to the console.
The output would be something like
Employee Details - Name:Debamalya, Type:permanent, Id:235960, Age:25.
Employee Details - Name:Rishin, Type:contract, Id:3675, Age:24.
Employee Details - Name:Debalina, Type:permanent, Id:3676, Age:28.
I will start with a DOM parser to parse the XML file, create Employee value objects
and add them to a list. To ensure we parsed the file correctly, let's iterate through the
list and print the employees' data to the console. Later we will see how to implement
the same using the SAX parser.
In a real-world situation you might get an XML file from a third-party vendor, which
you need to parse in order to update your database.
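Both parser examples rely on an Employee value object whose source is not listed in this document. The following is a minimal sketch of what such a class might look like. It is an assumption reconstructed from how the examples use it: a four-argument constructor (name, id, age, type), a no-argument constructor with setters, and a toString that matches the output format shown above.

```java
// Assumed shape of the Employee value object; the original source is not shown
class Employee {
    private String name;
    private int id;
    private int age;
    private String type;

    public Employee() {
    }

    public Employee(String name, int id, int age, String type) {
        this.name = name;
        this.id = id;
        this.age = age;
        this.type = type;
    }

    public void setName(String name) { this.name = name; }
    public void setId(int id) { this.id = id; }
    public void setAge(int age) { this.age = age; }
    public void setType(String type) { this.type = type; }

    // Produces one line in the format of the sample output above
    @Override
    public String toString() {
        return "Employee Details - Name:" + name + ", Type:" + type
                + ", Id:" + id + ", Age:" + age + ".";
    }

    public static void main(String[] args) {
        System.out.println(new Employee("Rishin", 3675, 24, "contract"));
    }
}
```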
Using DOM Parser:
The program DomParserExample.java uses the DOM API.
The steps are:
• Get a document builder using document builder factory and parse the xml file
to create a DOM object.
• Get a list of employee elements from the DOM.
• For each employee element get the id, name, age and type. Create an
employee value object and add it to the list.
• At the end iterate through the list and print the employees to verify we parsed
it right.
a) Getting a document builder
private void parseXmlFile() {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    try {
        // Using the factory, get an instance of a document builder
        DocumentBuilder db = dbf.newDocumentBuilder();
        // Parse using the builder to get a DOM representation of the XML file
        dom = db.parse("employees.xml");
    } catch (ParserConfigurationException pce) {
        pce.printStackTrace();
    } catch (SAXException se) {
        se.printStackTrace();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    }
}
b) Get a list of employee elements
Get the rootElement from the DOM object. From the root element get all employee
elements. Iterate through each employee element to load the data.
private void parseDocument() {
    // Get the root element
    Element docEle = dom.getDocumentElement();
    // Get a NodeList of Employee elements
    NodeList nl = docEle.getElementsByTagName("Employee");
    if (nl != null && nl.getLength() > 0) {
        for (int i = 0; i < nl.getLength(); i++) {
            // Get the employee element
            Element el = (Element) nl.item(i);
            // Get the Employee object
            Employee e = getEmployee(el);
            // Add it to the list
            myEmpls.add(e);
        }
    }
}
c) Reading in data from each employee.
/**
 * I take an employee element and read the values in, create
 * an Employee object and return it
 */
private Employee getEmployee(Element empEl) {
    // For each element get the text or int values of
    // name, id, age and type
    String name = getTextValue(empEl, "Name");
    int id = getIntValue(empEl, "Id");
    int age = getIntValue(empEl, "Age");
    String type = empEl.getAttribute("type");
    // Create a new Employee with the values read from the XML nodes
    Employee e = new Employee(name, id, age, type);
    return e;
}

/**
 * I take an XML element and a tag name, look for the tag and
 * get its text content. For example, if the Element points to an
 * Employee node and tagName is "Name", I return the text of its
 * Name child.
 */
private String getTextValue(Element ele, String tagName) {
    String textVal = null;
    NodeList nl = ele.getElementsByTagName(tagName);
    if (nl != null && nl.getLength() > 0) {
        Element el = (Element) nl.item(0);
        textVal = el.getFirstChild().getNodeValue();
    }
    return textVal;
}

/**
 * Calls getTextValue and returns an int value
 */
private int getIntValue(Element ele, String tagName) {
    // In a production application you would catch the exception
    return Integer.parseInt(getTextValue(ele, tagName));
}
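One caveat when reading text this way: getFirstChild() returns null for an empty element such as <Id/>, so chaining getNodeValue() onto it would throw a NullPointerException. As a hedged alternative, the sketch below uses the DOM Level 3 method getTextContent(), available since JDK 1.5. The class TextValues and the helpers textOf and rootOf are hypothetical names, not part of DomParserExample.java.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for reading child element text defensively
class TextValues {

    /**
     * Returns the trimmed text of the first descendant named tagName,
     * an empty string if that element is empty, or null if it is absent.
     */
    static String textOf(Element parent, String tagName) {
        NodeList nl = parent.getElementsByTagName(tagName);
        if (nl.getLength() == 0) {
            return null;
        }
        return nl.item(0).getTextContent().trim();
    }

    /** Parses XML text and returns its root element (helper for the demo). */
    static Element rootOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element emp = rootOf("<Employee><Name>Rishin</Name><Id/></Employee>");
        System.out.println(textOf(emp, "Name")); // element with text
        System.out.println(textOf(emp, "Id"));   // empty element
        System.out.println(textOf(emp, "Age"));  // missing element
    }
}
```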
d) Iterating and printing.
private void printData() {
    System.out.println("No of Employees '" + myEmpls.size() + "'.");
    Iterator it = myEmpls.iterator();
    while (it.hasNext()) {
        System.out.println(it.next().toString());
    }
}
Using SAX Parser:
The program SAXParserExample.java parses an XML document and prints it on the
console.
SAX parsing is event based. As a SAX parser reads an XML document, it calls the
corresponding handler method every time it encounters a tag.
When it encounters a Start Tag it calls this method
public void startElement(String uri,..
When it encounters a End Tag it calls this method
public void endElement(String uri,...
Like the DOM example this program also parses the xml file, creates a list of
employees and prints it to the console. The steps involved are
• Create a Sax parser and parse the xml
• In the event handler create the employee object
• Print out the data
The class extends DefaultHandler to listen for callback events, and we register this
handler with the SAX parser so that it notifies us of those events. We are only
interested in the start-element, end-element and character events.
In the start event, if the element is Employee we create a new instance of the Employee
object. For every element, including Name, Id and Age, we reset the character buffer
that collects the text value.
In the end event, if the node is Employee, we know we are at the end of the employee
node and we add the Employee object to the list. If it is another node such as Name, Id
or Age, we call the corresponding method (setName, setId or setAge) on the
Employee object. JavaBean classes can be used for this. In the character event we store
the data in a temporary string variable.
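A detail worth knowing about the character event: a SAX parser is allowed to deliver the text of a single element in several characters() calls, for example around an entity reference such as &amp;. A common way to handle this is to append each chunk to a buffer and read the buffer in the end event. The sketch below shows that pattern; the class NameCollector and its namesIn method are hypothetical names used only here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: collects the text of every Name element
class NameCollector extends DefaultHandler {
    private final List<String> names = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        buffer.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length); // append, because text may arrive in chunks
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("Name")) {
            names.add(buffer.toString().trim());
        }
    }

    /** Parses the XML text and returns the contents of all Name elements in order. */
    static List<String> namesIn(String xml) {
        try {
            NameCollector handler = new NameCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(namesIn(
                "<Office><Employee><Name>Tom &amp; Jerry</Name></Employee></Office>"));
    }
}
```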
a) Create a Sax Parser and parse the xml
private void parseDocument() {
    // Get a factory
    SAXParserFactory spf = SAXParserFactory.newInstance();
    try {
        // Get a new instance of the parser
        SAXParser sp = spf.newSAXParser();
        // Parse the file and also register this class for callbacks
        sp.parse("employees.xml", this);
    } catch (SAXException se) {
        se.printStackTrace();
    } catch (ParserConfigurationException pce) {
        pce.printStackTrace();
    } catch (IOException ie) {
        ie.printStackTrace();
    }
}
b) In the event handlers create the Employee object and call the corresponding setter
methods.
public void startElement(String uri, String localName, String qName,
        Attributes attributes) throws SAXException {
    // Reset the text buffer for the new element
    tempVal = "";
    if (qName.equalsIgnoreCase("Employee")) {
        // Create a new instance of Employee
        tempEmp = new Employee();
        tempEmp.setType(attributes.getValue("type"));
    }
}

public void characters(char[] ch, int start, int length) throws SAXException {
    // Append rather than overwrite: the parser may deliver the text of
    // one element in more than one call
    tempVal += new String(ch, start, length);
}

public void endElement(String uri, String localName, String qName)
        throws SAXException {
    if (qName.equalsIgnoreCase("Employee")) {
        // Add it to the list
        myEmpls.add(tempEmp);
    } else if (qName.equalsIgnoreCase("Name")) {
        tempEmp.setName(tempVal);
    } else if (qName.equalsIgnoreCase("Id")) {
        tempEmp.setId(Integer.parseInt(tempVal));
    } else if (qName.equalsIgnoreCase("Age")) {
        tempEmp.setAge(Integer.parseInt(tempVal));
    }
}
c) Iterating and printing.
private void printData() {
    System.out.println("No of Employees '" + myEmpls.size() + "'.");
    Iterator it = myEmpls.iterator();
    while (it.hasNext()) {
        System.out.println(it.next().toString());
    }
}
Writing XML with Java
The previous programs illustrated how to parse an existing XML file using both the SAX
and DOM parsers. Generating an XML file from scratch is a different story. For
instance, you might want to generate an XML file from data extracted from a database.
To keep the example simple, the program XMLCreatorExample.java generates XML
from a list preloaded with hard-coded data. The output is the file book.xml with the
following content.
<?xml version="1.0" encoding="UTF-8"?>
<Books>
<Book Subject="Java 1.5">
<Author>Kathy Sierra .. etc</Author>
<Title>Head First Java</Title>
</Book>
<Book Subject="Java Architect">
<Author>Kathy Sierra .. etc</Author>
<Title>Head First Design Patterns</Title>
</Book>
</Books>
The steps involved are
• Load Data
• Get an instance of Document object using document builder factory
• Create the root element Books
• For each item in the list create a Book element and attach it to Books element
• Serialize DOM to FileOutputStream to generate the xml file "book.xml".
a) Load Data.
/**
 * Add a list of books to the list.
 * In a production system you might populate the list from a db
 */
private void loadData() {
    myData.add(new Book("Head First Java", "Kathy Sierra .. etc", "Java 1.5"));
    myData.add(new Book("Head First Design Patterns", "Kathy Sierra .. etc", "Java Architect"));
}
b) Getting an instance of DOM.
/**
 * Using JAXP in an implementation-independent manner, create a document object
 * with which we create an XML tree in memory
 */
private void createDocument() {
    // Get an instance of the factory
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    try {
        // Get an instance of the builder
        DocumentBuilder db = dbf.newDocumentBuilder();
        // Create an instance of DOM
        dom = db.newDocument();
    } catch (ParserConfigurationException pce) {
        // Dump it
        System.out.println("Error while trying to instantiate DocumentBuilder " + pce);
        System.exit(1);
    }
}
c) Create the root element Books.
/**
 * The real workhorse which creates the XML structure
 */
private void createDOMTree() {
    // Create the root element
    Element rootEle = dom.createElement("Books");
    dom.appendChild(rootEle);
    // No enhanced for loop
    Iterator it = myData.iterator();
    while (it.hasNext()) {
        Book b = (Book) it.next();
        // For each Book object create an element and attach it to the root
        Element bookEle = createBookElement(b);
        rootEle.appendChild(bookEle);
    }
}
d) Creating a book element.
/**
 * Helper method which creates an XML element
 * @param b The book for which we need to create an XML representation
 * @return XML element snippet representing a book
 */
private Element createBookElement(Book b) {
    Element bookEle = dom.createElement("Book");
    bookEle.setAttribute("Subject", b.getSubject());
    // Create the author element and author text node and attach them to bookEle
    Element authEle = dom.createElement("Author");
    Text authText = dom.createTextNode(b.getAuthor());
    authEle.appendChild(authText);
    bookEle.appendChild(authEle);
    // Create the title element and title text node and attach them to bookEle
    Element titleEle = dom.createElement("Title");
    Text titleText = dom.createTextNode(b.getTitle());
    titleEle.appendChild(titleText);
    bookEle.appendChild(titleEle);
    return bookEle;
}
e) Serialize DOM to FileOutputStream to generate the xml file "book.xml".
/**
 * This method uses Xerces-specific classes and
 * prints the XML document to a file.
 */
private void printToFile() {
    try {
        // Print
        OutputFormat format = new OutputFormat(dom);
        format.setIndenting(true);
        // To generate output to the console use this serializer
        // XMLSerializer serializer = new XMLSerializer(System.out, format);
        // To generate file output use a FileOutputStream instead of System.out
        XMLSerializer serializer = new XMLSerializer(
                new FileOutputStream(new File("book.xml")), format);
        serializer.serialize(dom);
    } catch (IOException ie) {
        ie.printStackTrace();
    }
}
Note:
The Xerces internal classes OutputFormat and XMLSerializer are in different
packages.
In JDK 1.5 with built in Xerces parser they are under
com.sun.org.apache.xml.internal.serialize.OutputFormat
com.sun.org.apache.xml.internal.serialize.XMLSerializer
In Xerces 2.7.1 which we are using to run these examples they are under
org.apache.xml.serialize.XMLSerializer
org.apache.xml.serialize.OutputFormat
We are using Xerces 2.7.1 with JDK 1.4 and JDK 1.3, because the default parser in JDK
1.4 is Crimson and JDK 1.3 has no built-in parser.
Also remember that it is not advisable to use parser-specific classes such as
OutputFormat and XMLSerializer. They are available only in Xerces, so if you
switch to another parser in the future you may have to rewrite this code.
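One parser-independent alternative, sketched here as an option rather than as what XMLCreatorExample.java does, is the standard JAXP transformation API (javax.xml.transform), which has been part of the JDK since 1.4. An identity Transformer copies a DOM tree to an output stream or writer. The class PortableSerializer and the methods toXml and sampleBooks are hypothetical names for this illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class: serializes a DOM tree with standard JAXP only
class PortableSerializer {

    /** Serializes the DOM tree to a string using an identity transformer. */
    static String toXml(Document dom) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.INDENT, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Builds a one-book document shaped like the book.xml output above. */
    static Document sampleBooks() {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = dom.createElement("Books");
            dom.appendChild(root);
            Element book = dom.createElement("Book");
            book.setAttribute("Subject", "Java 1.5");
            Element title = dom.createElement("Title");
            title.appendChild(dom.createTextNode("Head First Java"));
            book.appendChild(title);
            root.appendChild(book);
            return dom;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(sampleBooks()));
    }
}
```

To write a file instead of a string, a StreamResult can be constructed from a java.io.File, for example new StreamResult(new File("book.xml")). The exact indentation of the output depends on the transformer implementation.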
Another example, which writes an XML document containing the first 10 Fibonacci
numbers, is as follows.
<?xml version="1.0"?>
<Fibonacci_Numbers>
<fibonacci>1</fibonacci>
<fibonacci>1</fibonacci>
<fibonacci>2</fibonacci>
<fibonacci>3</fibonacci>
<fibonacci>5</fibonacci>
<fibonacci>8</fibonacci>
<fibonacci>13</fibonacci>
<fibonacci>21</fibonacci>
<fibonacci>34</fibonacci>
<fibonacci>55</fibonacci>
</Fibonacci_Numbers>
To produce this, just add string literals for the <fibonacci> and </fibonacci> tags
inside the print statements, as well as a few extra print statements to produce the XML
declaration and the root element start- and end-tags. XML documents are just text,
and you can output them any way you’d output any other text document. The
FibonacciXML.java is created for this.
import java.math.BigInteger;

public class FibonacciXML {

    public static void main(String[] args) {
        BigInteger low = BigInteger.ONE;
        BigInteger high = BigInteger.ONE;

        // XML declaration and root element start-tag
        System.out.println("<?xml version=\"1.0\"?>");
        System.out.println("<Fibonacci_Numbers>");
        for (int i = 0; i < 10; i++) {
            // One fibonacci element per number
            System.out.print("  <fibonacci>");
            System.out.print(low);
            System.out.println("</fibonacci>");
            BigInteger temp = high;
            high = high.add(low);
            low = temp;
        }
        // Root element end-tag
        System.out.println("</Fibonacci_Numbers>");
    }
}
Running Programs in JAVA
The instructions to compile and run these programs vary, based on the JDK that you
are using. This is due to the way the XML parser is bundled with various Java
distributions. These instructions are for Windows OS. For Unix or Linux OS you just
need to change the folder paths accordingly. Xerces parser is bundled with the JDK
1.5 distribution. So you need not download the parser separately.
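If you are unsure which parser implementation your Java installation actually provides, one quick check is to print the class names of the factories that JAXP selects. This is a small sketch; the class ParserInfo and its methods are hypothetical names, and the names printed depend on the JDK and classpath in use.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical example class: reports the JAXP implementations in use
class ParserInfo {

    /** Class name of the DOM factory implementation that JAXP selected. */
    static String domFactoryName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    /** Class name of the SAX factory implementation that JAXP selected. */
    static String saxFactoryName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("DOM factory: " + domFactoryName());
        System.out.println("SAX factory: " + saxFactoryName());
    }
}
```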
Running DomParserExample
1. Place DomParserExample.java, Employee.java and employees.xml in
c:\xercesTest
2. Go to command prompt and type
cd c:\xercesTest
3. To compile, type
javac -classpath . DomParserExample.java
4. To run, type
java -classpath . DomParserExample
Running SAXParserExample
1. Place SAXParserExample.java, Employee.java and employees.xml in
c:\xercesTest
2. Go to command prompt and type
cd c:\xercesTest
3. To compile, type
javac -classpath . SAXParserExample.java
4. To run, type
java -classpath . SAXParserExample
Running XMLCreatorExample
1. Place XMLCreatorExample.java and Book.java in c:\xercesTest
2. Go to command prompt and type
cd c:\xercesTest
3. To compile, type
javac -classpath . XMLCreatorExample.java
4. To run, type
java -classpath . XMLCreatorExample
Running FibonacciXML
1. Place FibonacciXML.java in c:\xercesTest
2. Go to command prompt and type
cd c:\xercesTest
3. To compile, type
javac -classpath . FibonacciXML.java
4. To run, type
java -classpath . FibonacciXML
Comparison
Both SAX and DOM have their advantages and disadvantages, and the choice between
them depends on the requirement.
SAX:
- Parses node by node
- Doesn’t store the XML in memory
- We can’t insert or delete a node
- Top to bottom traversing
DOM
- Stores the entire XML document into memory before processing.
- Occupies more memory
- We can insert or delete nodes
- Traverse in any direction.
If we only need to find a node and do not need to insert or delete anything, SAX is
sufficient. Otherwise we can use the DOM parser, provided enough memory is available.
Conclusion
I hope this document helps a beginner write code that extracts data from an XML
document. XML is one of the most widely used formats for storing data, and Java
provides the most commonly used parsers. In real-life situations, we receive XML
files from a third-party source. They need to be parsed, and the data needs to be
stored in a database. Both needs can be met using the DOM or SAX XML parsers in
Java.
We can have a JMS-configured system where XML messages are received in an
automated way (this can be done using an MDB), and the same messages can be parsed
using the Java parsers. The parser code can be scheduled to run automatically using
a .ksh script. The parsed values can be stored in an Oracle database using simple JDBC
code. In many IT projects the XML files are large and complex. In such cases a DOM
parser is often impractical because of its memory use, and a SAX parser is used
instead. Although DOM parsers are easier to code against, SAX parsers are used more
frequently in real-life systems.