Mechanical assembly lines achieve efficiency by moving product through a series of fixed machines, each one specialized to perform a single function well. This familiar image has inspired software designers to attempt something similar. The first example I can think of is the Unix toolkit for pipeline processing of text.
At a higher level of abstraction, application architects use concepts such as "workflow" or "dataflow" to describe the movement of information as documents or messages through a set of processes. These days more and more documents and messages are formatted in XML, so why not an XML pipeline?
Why another XML tool?
Let's face it: XML may provide a lot of great data manipulation functionality, but speed and low memory use are not at the top of the list. General-purpose XML tools such as XSLT, XPath and XQuery are flexible, but the additional layers of code are anything but fast.
XML pipeline components can be designed to do one thing and do it well. I think there is reason to believe that we could evolve a set of pipeline components that could be configured and plugged together to accomplish processing tasks rapidly and with minimum resource use. Later in this article and in part 2 I will review some of the attempts that have been made, but for right now let's look at some basics.
How to feed a pipeline
There are three different ways to move XML data in a pipeline: as characters (either in a stream or as Strings), as SAX events, or as Document Object Model (DOM) elements. I ran some time trials by reading an 8.9 MB XML document three different ways using Java 1.5 standard library classes. Here are the results, normalized to the plain stream reading time: as a stream, 1.0; as a stream turned into SAX events, 2.0; and as a stream turned into a DOM, 9.2.
As far as memory usage is concerned, a DOM in memory takes much more space than just the characters in the file because of all the objects created. SAX events, on the other hand, are quite small. I think SAX pipelines offer many advantages in addition to speed and a small memory footprint for processing large XML documents.
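The kind of comparison described above can be sketched with nothing but standard JAXP classes. This is not the code used for the trials quoted here; the class name ThreeWaysToRead, the helper sampleDocument and the generated test data are all assumptions made for illustration. The sketch also reads from memory rather than from a file and does no warm-up pass, so the ratios it prints are rough and will not match the figures above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class ThreeWaysToRead {

    // Hypothetical test data: a root element holding `rows` small child elements.
    static byte[] sampleDocument(int rows) {
        StringBuilder sb = new StringBuilder("<rows>");
        for (int i = 0; i < rows; i++) {
            sb.append("<row id=\"").append(i).append("\">value</row>");
        }
        return sb.append("</rows>").toString().getBytes(StandardCharsets.UTF_8);
    }

    // 1. Plain stream: pull the bytes through a buffer and count them.
    static long readAsStream(byte[] xml) {
        long total = 0;
        byte[] buffer = new byte[8192];
        try (InputStream in = new ByteArrayInputStream(xml)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return total;
    }

    // 2. SAX: let the parser turn the stream into events and count start tags.
    static int readAsSax(byte[] xml) {
        final int[] elements = {0};
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        elements[0]++;
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return elements[0];
    }

    // 3. DOM: build the complete document tree in memory.
    static Document readAsDom(byte[] xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = sampleDocument(200_000);
        long t0 = System.nanoTime();
        readAsStream(xml);
        long t1 = System.nanoTime();
        readAsSax(xml);
        long t2 = System.nanoTime();
        readAsDom(xml);
        long t3 = System.nanoTime();
        double base = t1 - t0;
        System.out.printf("stream 1.0, SAX %.1f, DOM %.1f%n",
            (t2 - t1) / base, (t3 - t2) / base);
    }
}
```

To get closer to the conditions of the trials, the ByteArrayInputStream could be swapped for a FileInputStream over a real document, with each reading repeated several times.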
A pipeline for SAX events
SAX stands for Simple API for XML. A SAX parser recognizes the various parts of an XML document, creates objects incorporating the data and passes the objects to "handler" methods which have been registered with the parser. The key point, which opens the floodgates of possibilities, is that the stream of SAX events generated by a SAX parser contains the complete infoset of the original XML document, one piece at a time. Any process that can work with a single event can do some work and then pass the event on to the next process.
Here are the signatures of the Java methods that handle the data for XML start and end element tags in the org.xml.sax.ContentHandler interface. In a pipeline, you would use code in these methods to examine and possibly modify these parameters before passing them to the next handler in the pipeline.
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException
public void endElement(String uri, String localName, String qName) throws SAXException
The content of these parameters is as follows:
uri: If the document uses XML namespaces AND the parser has namespace processing turned on, this String will contain the URI - otherwise it will be an empty String.
localName: If the parser has namespace processing turned on, this String will contain the element name minus any namespace prefix - otherwise it will be an empty String.
qName: This String will have the complete name of the element, with prefix if any.
atts: This is a reference to an Attributes object containing the names and values of all attributes in the element start tag.
For example, given a SOAP message that starts like this:
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xmlns:xsd="http://www.w3.org/2001/XMLSchema"
               xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
The data passed to the startElement method when the parser encountered the <soap:Body> tag would be: uri = "http://schemas.xmlsoap.org/soap/envelope/", localName = "Body" and qName = "soap:Body". The Attributes object would exist but it would have a count of zero attributes.
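To see these values for yourself, here is a minimal sketch that parses the SOAP fragment with the JAXP parser and records what each startElement call receives. The class name StartElementDemo and the helper describeStartTags are hypothetical. The fragment has been closed off with end tags so that it is well-formed, and the W3C namespace URIs are written in their standard http:// form. Note that JAXP parsers have namespace processing turned off by default, so the sketch switches it on.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class StartElementDemo {

    // The SOAP fragment from the text, with end tags added (an assumption)
    // so that the parser sees a complete document.
    static final String SOAP =
        "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
        + "<soap:Envelope"
        + " xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\""
        + " xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\""
        + " xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
        + "<soap:Body></soap:Body></soap:Envelope>";

    // Parses the XML and returns one descriptive line per startElement call.
    static List<String> describeStartTags(String xml) {
        final List<String> lines = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    lines.add("uri=" + uri
                        + " localName=" + localName
                        + " qName=" + qName
                        + " attributes=" + atts.getLength());
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeStartTags(SOAP)) {
            System.out.println(line);
        }
    }
}
```

The line recorded for the soap:Body tag should carry the same uri, localName and qName values listed above, with an attribute count of zero. Setting setNamespaceAware(false) instead is a quick way to see the uri and localName parameters come back empty.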
Building and using SAX pipelines
The data flow in a practical SAX pipeline is a little hard to explain without diagrams. Fortunately, chapter 8, "SAX Filters," of Elliotte Rusty Harold's excellent book "Processing XML with Java" has numerous examples of connecting SAX handling components, with diagrams showing the data flow. This chapter has been made available online, as shown in the references below. These examples use the standard JAXP (Java API for XML Processing) library classes, so if you have Java 1.4 or 1.5 installed, you don't need anything else.
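As a rough illustration of how such components connect, the following sketch chains two filters built on the standard org.xml.sax.helpers.XMLFilterImpl class. It is not taken from the book chapter; the class names TwoStagePipeline, RenameFilter and UpperCaseFilter, and the two transformations themselves, are invented for this example. Events flow from the parser through RenameFilter, then through UpperCaseFilter, and finally into a handler that writes a simplified text form of the events.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class TwoStagePipeline {

    // Stage 1 (hypothetical): rename every <b> element to <strong>.
    static class RenameFilter extends XMLFilterImpl {
        RenameFilter(XMLReader parent) { super(parent); }

        private static String rename(String name) {
            return "b".equals(name) ? "strong" : name;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            super.startElement(uri, rename(localName), rename(qName), atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            super.endElement(uri, rename(localName), rename(qName));
        }
    }

    // Stage 2 (hypothetical): upper-case all character data.
    static class UpperCaseFilter extends XMLFilterImpl {
        UpperCaseFilter(XMLReader parent) { super(parent); }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            char[] upper = new String(ch, start, length)
                .toUpperCase(Locale.ROOT).toCharArray();
            super.characters(upper, 0, upper.length);
        }
    }

    // Runs parser -> RenameFilter -> UpperCaseFilter -> text output.
    static String run(String xml) {
        final StringBuilder out = new StringBuilder();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader parser = factory.newSAXParser().getXMLReader();
            XMLReader pipeline = new UpperCaseFilter(new RenameFilter(parser));
            // End of the pipeline: a deliberately naive writer that ignores
            // attributes and does not escape text.
            pipeline.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    out.append('<').append(qName).append('>');
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    out.append("</").append(qName).append('>');
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    out.append(ch, start, length);
                }
            });
            pipeline.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(run("<p>plain <b>bold</b> text</p>"));
        // prints <p>PLAIN <strong>BOLD</strong> TEXT</p>
    }
}
```

Each filter overrides only the events it cares about and hands everything on with a call to super, which is what lets stages be added, removed or reordered without touching the others.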
Applications of SAX pipeline components
Here are some of the tasks a single component can accomplish. Some of these I have coded myself; others come from published examples.
Extracting Statistics - using the startElement method, a component can keep a count of various elements and the frequency of various attribute values.
Removing Elements - a component can selectively remove specified elements so that one master XML document can serve many purposes.
Adding elements or attributes based on computation - for example you could do a database query to look up a part number and add a part description to a purchase order.
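The first two tasks in this list can be sketched as small filters built on the standard XMLFilterImpl class. The class names PipelineComponents, ElementCounter and ElementRemover, the helper countAfterRemoving, and the internalNote element in the usage example are all assumptions made for illustration. Only element and character events are handled here; a production component would also need to deal with events such as ignorable whitespace and processing instructions.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

class PipelineComponents {

    // Extracting statistics: count start tags by qualified name and pass
    // every event through unchanged.
    static class ElementCounter extends XMLFilterImpl {
        final Map<String, Integer> counts = new TreeMap<>();

        ElementCounter(XMLReader parent) { super(parent); }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            counts.merge(qName, 1, Integer::sum);
            super.startElement(uri, localName, qName, atts);
        }
    }

    // Removing elements: drop every element with the given qualified name,
    // together with everything nested inside it.
    static class ElementRemover extends XMLFilterImpl {
        private final String unwanted;
        private int depth = 0; // greater than zero while inside a removed element

        ElementRemover(XMLReader parent, String unwanted) {
            super(parent);
            this.unwanted = unwanted;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            if (depth > 0 || qName.equals(unwanted)) {
                depth++;
                return;
            }
            super.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            if (depth > 0) {
                depth--;
                return;
            }
            super.endElement(uri, localName, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            if (depth == 0) {
                super.characters(ch, start, length);
            }
        }
    }

    // Runs parser -> ElementRemover -> ElementCounter and returns the counts
    // of the elements that survived the removal stage.
    static Map<String, Integer> countAfterRemoving(String xml, String unwanted) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader parser = factory.newSAXParser().getXMLReader();
            ElementCounter counter =
                new ElementCounter(new ElementRemover(parser, unwanted));
            counter.parse(new InputSource(new StringReader(xml)));
            return counter.counts;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<order><item/><internalNote><item/></internalNote><item/></order>";
        System.out.println(countAfterRemoving(xml, "internalNote"));
        // prints {item=2, order=1}
    }
}
```

Because the counter sits after the remover, it sees only the two item elements that remain once internalNote and its contents have been dropped. A lookup component of the kind described in the third task would follow the same pattern, calling super.startElement, super.characters and super.endElement to emit the extra element it adds.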
Why not use XSLT instead?
While it is true that XSLT could in theory be used to perform these tasks, it is slower and more memory intensive. XSLT shines in many areas, especially when a major rearrangement of the data is required. A SAX pipeline is going to be much faster if the problem is suited to sequential processing of elements without a major rearrangement.
In my next article I will cover the W3C XML-Pipeline specification and some example toolkits based on pipeline principles.