Parsing a large XML document is a common problem in XML applications. A large XML document typically contains many repeatable elements, and the application needs to handle these elements iteratively. The problem is obtaining the elements from the document with the least possible overhead. Sometimes XML documents are so large (100MB or more) that they are difficult to handle with traditional XML parsers.
One traditional kind of parser is based on the Document Object Model (DOM). It is easy to use, supports navigation in any direction (e.g., parent and previous sibling) and allows for arbitrary modifications. In exchange, a DOM parser must parse the whole document and construct a complete document tree in memory before we can obtain the elements, so it may consume large amounts of memory when parsing large XML documents.
TIBCO BusinessWorks (BW) handles XML in much the same way as DOM. It loads the entire XML document into memory as a tree. Generally this is good, as it provides a convenient way to navigate, manipulate and map XML with XPath and XSLT. But it also shares the drawback of DOM. With large XML files, it may occupy too much memory and in some extreme situations may cause an OutOfMemory error.
Simple API for XML (SAX) may be a solution. But SAX is a push model, in which the parser drives the application through callbacks, and it may be too complicated for this specific task. With StAX, you can split large XML documents into chunks efficiently without the drawbacks of traditional push parsers.
This article shows how to retrieve repeatable elements from XML documents and handle each of them separately. It also shows how to implement a solution for large XML files in BW with StAX, the Java Code Activity and the File Poller Activity.
What is StAX
Streaming API for XML (StAX) is an application programming interface (API) to read and write XML documents in the Java programming language.
StAX offers a pull parser that gives client applications full control over the parsing process. The StAX parser provides a “cursor” in the XML document. The application moves the “cursor” forward, pulling the information from the parser as needed.
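As a minimal sketch of the cursor style, the class below uses the standard javax.xml.stream.XMLStreamReader to walk a small in-memory document and collect element names. The class name CursorDemo, the method elementNames() and the sample document are illustrative assumptions, not part of the original solution.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: shows the application moving the cursor itself.
public class CursorDemo {

    // Collects the local names of all start elements, in document order.
    static List<String> elementNames(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> names = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                // next() advances the cursor and returns the event type found there
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Sample document is an assumption made for this sketch
        System.out.println(elementNames("<orders><order/><order/></orders>"));
    }
}
```

Nothing is read until the application calls next(), which is what lets it stop early or skip content it does not need.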
StAX also provides an event-based pull API, layered on top of the cursor-based one. Instead of moving a cursor, the application pulls events from the parser one by one and handles each event as needed, until the end of the stream or until the application stops.
The XMLEventReader interface is the main interface for reading an XML document. It iterates over the document as a stream of events.
The XMLEventWriter interface is the main interface for writing XML documents.
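To illustrate how the two interfaces fit together, here is a small sketch that pulls every event from an XMLEventReader and hands it unchanged to an XMLEventWriter. The class name EventCopy, the method copy() and the in-memory input are assumptions chosen to keep the example self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class: identity copy of an XML document, event by event.
public class EventCopy {

    // Pulls each event from the reader and passes it unchanged to the writer.
    static String copy(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent(); // the application decides when to pull
            writer.add(event);
        }
        writer.flush();
        writer.close();
        reader.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        // Sample document is an assumption made for this sketch
        System.out.println(copy("<orders><order>1</order></orders>"));
    }
}
```

Because every event is an object, the loop is also the natural place to filter: an event that is never passed to add() simply does not appear in the output.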
Now, let’s see how to split a large XML file using StAX:
With XMLInputFactory.newInstance(), we get an instance of XMLInputFactory with the default implementation. It can be used to create an XMLEventReader to read XML files.
With XMLOutputFactory.newInstance(), we get an instance of XMLOutputFactory with the default implementation. It can be used to create an XMLEventWriter. We also set the “javax.xml.stream.isRepairingNamespaces” property to Boolean.TRUE, because we want to keep the namespace declarations in the output XML files.
In this way, we build an XMLEventReader to read the XML file.
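The setup described above might look like the following sketch. The class name StaxSetup, the helper openReader() and the temporary sample file are assumptions for illustration; the factory calls and the XMLOutputFactory.IS_REPAIRING_NAMESPACES constant (whose value is "javax.xml.stream.isRepairingNamespaces") come from the standard javax.xml.stream API.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;

// Hypothetical helper class holding the two factories.
public class StaxSetup {
    static final XMLInputFactory INPUT_FACTORY = XMLInputFactory.newInstance();
    static final XMLOutputFactory OUTPUT_FACTORY = XMLOutputFactory.newInstance();

    static {
        // Ask the writer to add any namespace declarations an element needs
        OUTPUT_FACTORY.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, Boolean.TRUE);
    }

    // Creates an event reader over the given stream; the caller closes the stream.
    static XMLEventReader openReader(InputStream in) throws XMLStreamException {
        return INPUT_FACTORY.createXMLEventReader(in);
    }

    public static void main(String[] args) throws IOException, XMLStreamException {
        // A temporary file stands in for the real large input (assumption)
        File input = File.createTempFile("sample", ".xml");
        Files.write(input.toPath(),
                "<orders><order/></orders>".getBytes(StandardCharsets.UTF_8));
        try (InputStream in = new FileInputStream(input)) {
            XMLEventReader reader = openReader(in);
            // The first event delivered by an event reader is the StartDocument event
            System.out.println(reader.nextEvent().isStartDocument());
            reader.close();
        }
    }
}
```

Repairing mode matters when splitting because a repeated element often relies on a namespace prefix declared on an ancestor, and that declaration would otherwise be missing from the chunk.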
Using XMLEventReader To Go Through XML File
With XMLEventReader.nextEvent(), we can get the next XMLEvent in the XML file. An XMLEvent can be a StartElement, EndElement, StartDocument, EndDocument, etc. Here, we check the QName of each StartElement. If it is the same as the target QName (which is the one repeatable element in the XML file in this case), we write this element and its content into an output file with writeToFile(). Below is the code for writeToFile().
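A minimal sketch of this loop follows. The class name SplitLoop, the method split() and the sample element name "order" are assumptions; writeToFile() appears here only as a placeholder that consumes the element without writing anything, so that the loop can be run on its own.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class for the main scanning loop.
public class SplitLoop {

    // Walks the document and hands every element named `target` to writeToFile().
    // Returns the number of matching elements found.
    static int split(XMLEventReader reader, QName target) throws XMLStreamException {
        int count = 0;
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && event.asStartElement().getName().equals(target)) {
                count++;
                writeToFile(reader, event.asStartElement(), count);
            }
        }
        return count;
    }

    // Placeholder (assumption): reads up to the matching end tag and discards
    // the events. A real version would pass them to an XMLEventWriter.
    static void writeToFile(XMLEventReader reader, StartElement start, int index)
            throws XMLStreamException {
        int depth = 1; // the start tag has already been consumed by the caller
        while (depth > 0) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                depth++;
            } else if (event.isEndElement()) {
                depth--;
            }
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        // Sample document and target name are assumptions made for this sketch
        String xml = "<orders><order id='1'/><order id='2'><item/></order></orders>";
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        System.out.println(split(reader, new QName("order")));
    }
}
```

QName equality compares the namespace URI and the local name, so for a namespaced document the target would be built with new QName(namespaceURI, localName).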
Writing Selected Element into file with XMLEventWriter
We create an XMLEventWriter with XMLOutputFactory.createXMLEventWriter(). With XMLEventWriter.add(), we can write XMLEvent objects to the target XML file. It is the user’s responsibility to make sure that the output XML is well-formed, so the code must watch for the EndElement event and make sure that every StartElement is matched by its EndElement. This completes the code required to split an XML file into chunks.
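One possible shape for writeToFile() is sketched below. It is not the original listing: the class name ChunkWriter, the parameter list and the depth counter used to pair start and end tags are assumptions, and the sketch writes the element without an XML declaration.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class: writes one element and its content to its own file.
public class ChunkWriter {
    private static final XMLOutputFactory OUTPUT_FACTORY = XMLOutputFactory.newInstance();

    static {
        OUTPUT_FACTORY.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, Boolean.TRUE);
    }

    // `start` is the StartElement the caller has just read from `reader`.
    static void writeToFile(XMLEventReader reader, StartElement start, File target)
            throws XMLStreamException, IOException {
        try (OutputStream out = new FileOutputStream(target)) {
            XMLEventWriter writer = OUTPUT_FACTORY.createXMLEventWriter(out, "UTF-8");
            writer.add(start);
            int depth = 1; // one open element: the target itself
            while (depth > 0) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    depth++;
                } else if (event.isEndElement()) {
                    depth--; // reaches 0 on the end tag that matches `start`
                }
                writer.add(event);
            }
            writer.flush();
            writer.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Sample document and temporary output file are assumptions for this sketch
        String xml = "<orders><order><item>pen</item></order><order/></orders>";
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        reader.nextEvent();                        // StartDocument
        reader.nextEvent();                        // <orders>
        StartElement first = reader.nextEvent().asStartElement(); // first <order>
        File chunk = File.createTempFile("chunk", ".xml");
        writeToFile(reader, first, chunk);
        System.out.println(new String(Files.readAllBytes(chunk.toPath()),
                StandardCharsets.UTF_8));
    }
}
```

Counting depth rather than comparing names keeps the pairing correct even when the target element contains nested elements with the same name.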
Build a Solution with StAX in BW
Integrating StAX in BW
First, choose an implementation of StAX. Several open source implementations are available, including Woodstox and the StAX Reference Implementation (RI).
Next, follow these steps to integrate StAX with BW and build a solution for handling large XML files.
- Copy the .jar file of the chosen StAX implementation into <BW_HOME>/lib.
- Create a new project in Designer named StAXSplitter and add a new process to it named splitXMLFile.
- Select a Java Code Activity in the process and add some input parameters as below:
- Copy and paste all of the code into Java Code Activity > Code, then add the following code in invoke():
- Compile the code by clicking the Compile button. This process can be used to split a large XML file into small chunks.
- Create another process to handle every chunk file separately. A File Poller starter can be used to trigger the process. The process can be similar to the following: