Thursday, May 6, 2010

Parsing XML with SAX


There are a few ways to parse XML. The first, which you may be familiar with, is the DOM parser. The DOM (Document Object Model) represents an XML document as a tree. You can traverse that tree to find the tag, attribute or value you are looking for. The other method, and the one I will be talking about, is SAX, or Simple API for XML.

The main difference between DOM parsing and SAX parsing is that SAX reads the file as a stream, whereas DOM constructs the complete document structure in memory. This makes SAX a good fit for mobile devices with limited memory or slow connections, and for very large XML files that can be gigabytes in size. The disadvantage is that you cannot traverse the document or backtrack: once an element has streamed past, it is gone unless you saved it yourself. SAX parsing is also a little more complicated.
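To make the difference concrete, here is a minimal sketch that counts elements both ways using the parsers that ship with Java. The sample document and the class and method names (DomVersusSax, countWithDom, countWithSax) are my own assumptions for illustration. The DOM version builds the whole tree before it can answer, while the SAX version only keeps a counter as the tags stream past.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class DomVersusSax {
    // Hypothetical sample document, used only for this sketch.
    static final String XML =
        "<library><book><title>Dune</title></book>"
        + "<book><title>Emma</title></book></library>";

    static InputStream stream() {
        return new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8));
    }

    // DOM: the complete tree is built in memory, then queried.
    static int countWithDom() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(stream());
        return doc.getElementsByTagName("book").getLength();
    }

    // SAX: nothing is stored except what the handler chooses to keep.
    static int countWithSax() throws Exception {
        final int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(stream(), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if ("book".equals(qName)) {
                    count[0]++;
                }
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println("DOM: " + countWithDom() + ", SAX: " + countWithSax());
    }
}
```

Both methods return the same count here; the difference only starts to matter once the document is too large to hold in memory comfortably.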

In order to set up your program to work with SAX you need to understand a little bit about event handling and have a quick big picture. As the stream is read, the SAX parser calls back into handler methods that you set up, reporting each tag, its attributes and its text. Your handler then checks whether they match the criteria you specified. Handlers in SAX are basically method overrides that allow you to take control over what happens. There are three methods you typically override: startElement, endElement and characters. When you hand your handler and your XML stream to the parser, startElement is called for each opening tag. Inside it you conditionally check the tag name that is passed in. For each tag that matches a conditional, you flip a boolean flag that you set up to keep track of which element you are currently inside.
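As a rough sketch of the startElement side of this, assuming a made-up document layout with book, title and author tags (the class name FlagHandler, the flag names and the "seen" list are hypothetical, not taken from any project), a handler might look like this:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class FlagHandler extends DefaultHandler {
    // Hypothetical sample document for this sketch.
    static final String SAMPLE =
        "<book id=\"b1\"><title>Dune</title><author>Frank Herbert</author></book>";

    // Flags that record which element the parser is currently inside.
    boolean inTitle = false;
    boolean inAuthor = false;

    // Log of the tags we reacted to, kept only so the sketch has visible output.
    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        // The parser calls this once per opening tag; we only react to tags we care about.
        if ("book".equals(qName)) {
            // Attributes of the tag are available here, as a parameter.
            seen.add("book#" + attributes.getValue("id"));
        } else if ("title".equals(qName)) {
            inTitle = true;
            seen.add("title");
        } else if ("author".equals(qName)) {
            inAuthor = true;
            seen.add("author");
        }
        // This sketch never resets the flags; that is the job of endElement.
    }

    static List<String> run(String xml) throws Exception {
        FlagHandler handler = new FlagHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.seen;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(SAMPLE));
    }
}
```

Note that you never call startElement yourself. You pass the handler to the parser and the parser calls it for you.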

Once startElement has returned (it covers a single opening tag), the characters method is called with the text inside that tag. Attributes are not delivered here; they arrive as a parameter of startElement. Keep in mind, these methods are called by the SAX library, not you, so you are simply catching values and flipping booleans to determine if you are in a location you care about. In the Java API, characters receives a char array along with a start offset and a length, which together give you the text you are looking for. The body typically looks like startElement and endElement, with the exception that the text is passed to a class of your own through a public mutator method. You could name that class ParsedData; it would contain fields such as Title, Author and ISBN.

Once the tag's content has been passed over, endElement is called. This method does the same as startElement, with the exception that the booleans you flipped to true in startElement are now flipped back to false. The next element is then streamed through and the process continues.
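Putting the three callbacks together, a minimal sketch could look like the following. The document layout, the sample values and the class names BookHandler and ParsedData are assumptions for illustration. One small deviation from the description above: the parser is allowed to deliver the text of one tag in several chunks, so this sketch collects the text in a buffer inside characters and only calls the ParsedData mutator in endElement.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Plain data holder with public mutators, as described above.
class ParsedData {
    private String title = "";
    private String author = "";
    private String isbn = "";

    public void setTitle(String title) { this.title = title; }
    public void setAuthor(String author) { this.author = author; }
    public void setIsbn(String isbn) { this.isbn = isbn; }
    public String getTitle() { return title; }
    public String getAuthor() { return author; }
    public String getIsbn() { return isbn; }
}

class BookHandler extends DefaultHandler {
    // Hypothetical sample document; the ISBN values are placeholders.
    static final String SAMPLE =
        "<library>"
        + "<book><title>Dune</title><author>Frank Herbert</author><isbn>111</isbn></book>"
        + "<book><title>Emma</title><author>Jane Austen</author><isbn>222</isbn></book>"
        + "</library>";

    private final List<ParsedData> books = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private ParsedData current;
    private boolean inTitle, inAuthor, inIsbn;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        text.setLength(0); // start collecting fresh text for this tag
        if ("book".equals(qName)) current = new ParsedData();
        else if ("title".equals(qName)) inTitle = true;
        else if ("author".equals(qName)) inAuthor = true;
        else if ("isbn".equals(qName)) inIsbn = true;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only keep text while we are inside a tag we subscribed to.
        if (inTitle || inAuthor || inIsbn) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) return; // tag outside any <book>; ignore it
        String value = text.toString().trim();
        if ("title".equals(qName)) { current.setTitle(value); inTitle = false; }
        else if ("author".equals(qName)) { current.setAuthor(value); inAuthor = false; }
        else if ("isbn".equals(qName)) { current.setIsbn(value); inIsbn = false; }
        else if ("book".equals(qName)) { books.add(current); current = null; }
    }

    static List<ParsedData> parse(String xml) throws Exception {
        BookHandler handler = new BookHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.books;
    }

    public static void main(String[] args) throws Exception {
        for (ParsedData book : parse(SAMPLE)) {
            System.out.println(book.getTitle() + " by " + book.getAuthor()
                + " (" + book.getIsbn() + ")");
        }
    }
}
```

The flags are what connect the three callbacks: startElement turns one on, characters checks it before saving anything, and endElement turns it off again.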

A Diagram Representing SAX's Possible States

In my project WattDroid (my first Android application) I've implemented SAX handlers with documentation to get you started.

Some more information is located here.