Streaming Transformations for XML (STX) is a transformation language designed to work on a stream of SAX events. It was born from the insight that XSL Transformations (XSLT) may not be the best choice for some kinds of transformation tasks. XSLT is relatively easy to learn and doesn't require programming skills for simple transformations. So, XSLT can be seen as a scripting language in XML for XML. Most if not all of XSLT's popularity stems from these advantages.
Since XSLT is based on XPath, it operates on a tree representation of an XML document. An XSLT-based transformation requires parsing the whole document, building an in-memory tree representation, performing the transformation process, building a result tree, and serializing this tree as XML. The first and the last step may be left out, depending on your application and the data it requires, but in the end the tree transformation is the heart of XSLT. That's why memory requirements proportional to the size of the XML data belong to XSLT like angle brackets to XML.
So, how can large XML files be processed? If you're able to write a program in your favorite programming language, then you can choose the Simple API for XML (SAX). SAX simply delivers events to the application and doesn't store any data by itself. That's the job of the application that uses SAX. The drawback of this approach is the effort you have to spend on storing relevant data yourself in custom data structures and on keeping track of the XML element hierarchy. An alternative on the programming level is the W3C Document Object Model (DOM). DOM provides a tree structure to the application in which all data from the XML document are stored in corresponding nodes. Of course DOM is not the only Application Programming Interface (API) for that purpose, but it is a typical representative.
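To make the bookkeeping effort concrete, here is a minimal Java sketch of a SAX handler that has to track the element hierarchy on its own. The class name PathCollector and its collect method are made up for this illustration; only the SAX and JAXP calls are standard.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical example class: records the path of every element it sees. */
class PathCollector extends DefaultHandler {
    // SAX keeps no state, so the handler maintains the stack of open elements itself.
    private final Deque<String> openElements = new ArrayDeque<>();
    final List<String> paths = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        openElements.addLast(qName);
        paths.add("/" + String.join("/", openElements));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        openElements.removeLast();
    }

    /** Parses the given XML text and returns the element paths in document order. */
    static List<String> collect(String xml) {
        try {
            PathCollector handler = new PathCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.paths;
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // prints [/a, /a/b, /a/b/c, /a/d]
        System.out.println(collect("<a><b><c/></b><d/></a>"));
    }
}
```

Everything beyond the stack (collected values, flags, counters) would need further hand-written fields of this kind.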
So, on the programming level you have the choice between tree-oriented (DOM) and event-oriented (SAX) processing of XML data. For a script-like transformation language there is (well, has been) only XSLT, which is tree-oriented. STX fills the gap by providing an event-based transformation language on top of SAX. As we will see shortly, STX reuses many of XSLT's concepts and language constructs.
The STX initiative began in February 2002 with a post from Petr Cimprich to the xml-dev mailing list. Shortly afterwards a dedicated mailing list and a project on sourceforge.net were established, hosting the current specifications, see [STX]. At the moment there are two open source implementations of STX available: the Perl-based processor XML::STX (http://stx.gingerall.cz/stx/xml-stx/) by Petr Cimprich and the Java-based processor Joost (http://joost.sourceforge.net/) by Oliver Becker, the author of this article. Everybody who is interested is invited to join the discussion and to help move STX development forward. An article introducing the basic concepts of STX has been published by xml.com, see [Intro].
The transformation of the Resource Description Framework (RDF) dump of the Open Directory (see http://rdf.dmoz.org/) seems to be a perfect task for STX.[1] The content dump is about 1 GByte in size and looks like this:
<?xml version='1.0' encoding='UTF-8'?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
xmlns:d="http://purl.org/dc/elements/1.0/"
xmlns="http://dmoz.org/rdf">
<Topic r:id="Top">
<catid>1</catid>
</Topic>
...
<Topic r:id="Top/Shopping">
<catid>13</catid>
<link r:resource="http://www.esmarts.com/"/>
<link r:resource="http://www.bdscodak.com"/>
<link r:resource="http://www.choicemall.com/"/>
</Topic>
<ExternalPage about="http://www.esmarts.com/">
<d:Title>eSmarts</d:Title>
<d:Description>
eSmarts helps consumers find the lowest possible prices on the
web. They compare prices at different Internet stores, list
coupons (including many $10 off coupons), discuss sales and
share great shopping tips.
</d:Description>
</ExternalPage>
<ExternalPage about="http://www.bdscodak.com">
<d:Title>
BD Scodak - personalized children's books for your child's education
</d:Title>
<d:Description>
BD Scodak is your source for personalized children's books
customized with your child's information right next to popular
cartoon, religious, sports, and tv characters and themes.
</d:Description>
</ExternalPage>
<ExternalPage about="http://www.choicemall.com/">
<d:Title>Choice World</d:Title>
<d:Description>
Choice Mall - The #1 global marketplace on the Internet.
Thousands of stores offer quality, unique products and services,
art and entertainment, books and music, gifts, food, real estate,
health, sports, and fitness -- all under one roof!
</d:Description>
</ExternalPage>
<Topic r:id="Top/Society">
<catid>14</catid>
<link r:resource="http://www.yforum.com/"/>
</Topic>
<ExternalPage about="http://www.yforum.com/">
<d:Title>Y? The National Forum on People's Differences</d:Title>
<d:Description>
The nation's only forum allowing people to ask and receive answers
to the uncomfortable and even embarrassing questions they've
always wanted to ask people who are different from themselves.
All questions and answers are acceptable, as long as they promote
dialogue and are not asked or answered out of hate.
</d:Description>
</ExternalPage>
<!-- more <Topic>s and <ExternalPage>s -->
...
</RDF>
Figure 1. The content RDF dump of the Open Directory
The hierarchical structure of the directory is represented as a flat list of Topic elements. Each of them uses the r:id attribute to identify the category within the hierarchy and a catid child for its unique identifier. The further content of a Topic is a list of resources for this category given in link elements. Each resource in turn is described by means of an ExternalPage element following the Topic.
The task for this kind of data is to find the resources belonging to a given category specified by its id (catid) and to output these as HTML. While this task is not particularly complicated to solve in XSLT, it is the size of the XML input that makes it impossible for XSLT. The STX transformation sheet for extracting the requested information is given here:
<?xml version="1.0"?>
<stx:transform xmlns:stx="http://stx.sourceforge.net/2002/ns"
               xmlns:r="http://www.w3.org/TR/RDF/"
               xmlns:d="http://purl.org/dc/elements/1.0/"
               xmlns:od="http://dmoz.org/rdf"
               xmlns="http://www.w3.org/1999/xhtml"
               version="1.0">

  <!-- External parameter identifying the requested category -->
  <stx:param name="catid" />

  <stx:template match="od:RDF">
    <html>
      <body>
        <stx:process-children />
      </body>
    </html>
  </stx:template>

  <stx:variable name="resources" />

  <!-- Group for Topic elements -->
  <stx:group>
    <stx:variable name="found" select="false()" />

    <stx:template match="od:Topic" visibility="public">
      <stx:assign name="resources" select="()" />
      <stx:process-children />
      <stx:if test="$found and $resources">
        <!-- We found the category and there are resources -->
        <h3>Resources in <stx:value-of select="@r:id" /></h3>
        <dl>
          <stx:process-siblings while="od:ExternalPage|text()" group="ep" />
        </dl>
      </stx:if>
    </stx:template>

    <stx:template match="od:catid">
      <stx:assign name="found" select=". = $catid" />
    </stx:template>

    <stx:template match="od:link">
      <stx:assign name="resources" select="($resources, @r:resource)" />
    </stx:template>
  </stx:group>

  <!-- Group for ExternalPage elements -->
  <stx:group name="ep">
    <stx:template match="od:ExternalPage">
      <!-- Is this page among the resources? -->
      <stx:if test="@about = $resources">
        <stx:process-children />
      </stx:if>
    </stx:template>

    <!-- Output Title and Description -->
    <stx:template match="d:Title">
      <dt><a href="{../@about}"><stx:value-of select="." /></a></dt>
    </stx:template>

    <stx:template match="d:Description">
      <dd><stx:value-of select="." /></dd>
    </stx:template>
  </stx:group>

</stx:transform>
Figure 2. The STX transformation for extracting the resources belonging to a category
Looking at this transformation you will notice a lot of similarities with XSLT.
The document element stx:transform encloses a set of templates which describe transformation rules for special kinds of nodes. A template must have a match attribute with a pattern that determines the nodes it is responsible for. This is just like in XSLT. The set of allowed patterns is almost the same. The conflict resolution with the help of priorities works in the usual way.
The first evident difference is the existence of the stx:process-children instruction. If you recall that STX processes its input as a stream of SAX events, then it should be clear that the transformation process itself must run in document order. stx:process-children is the counterpart to XSLT's xsl:apply-templates, whereas the possibility to select arbitrary nodes by means of an additional select attribute can't be part of the STX language.
The second apparent difference (well, actually I don't know in which order you will detect the differences ...) is the new element stx:group. A group encloses a set of templates and other groups. Only the templates of the current group are visible when processing a new node. Thus groups replace the mode concept in XSLT. The rules for entering a group are a little bit complex and are covered in Section 3.3.
Finally STX has changeable variables, a well-known concept in procedural programming languages. While XSLT is functional and stateless, STX maintains a state and works in a more imperative way. The order of instructions in STX is determined by the input (especially the document order of the input nodes), whereas an XSLT transformation may process a set of nodes from the input tree in parallel.
Many STX instructions were simply adopted from XSLT, for example stx:param, stx:value-of, stx:if, stx:choose, stx:element, stx:attribute, and others. Users familiar with XSLT should have little trouble getting started with STX.
However, there are some drawbacks to consider. An STX processor doesn't have random access to all nodes of the input document. It just maintains an ancestor stack, i.e. only the information about all opened elements is available. Additionally there are position counters for all kinds of node tests to allow positional predicates. It is the task of the author of an STX program to store data from preceding nodes ("preceding" in the sense of the preceding axis in XPath). Information from descendant and following nodes isn't available at all, simply because the STX processor hasn't yet received this information from its SAX source.
There's one exception to this rule: an STX processor already knows the next event, because it works with a look-ahead of one event. This way the information whether an element has child nodes or not is available, as well as the textual content of a text-only element in its node value (see the templates for catid, d:Title, and d:Description in the example above).
Having said this it's clear that full XPath can't be a part of STX. Instead STX comprises a subset of XPath called STXPath. This path language allows only abbreviated location paths which access only nodes on the ancestor stack. For example an absolute path /a/b/c selects at most one element node c and this element must be the context node or its ancestor. All expressions that deal only with simple types behave the same as in XSLT. It is intended that STXPath keeps as close as possible to the upcoming XPath 2.0 specification, including its new sequence data type and the functions defined for XQuery and XPath [F&O]. The example presented here uses a sequence for collecting all resource links into the variable resources.
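The restriction to the ancestor stack can be illustrated with a small Java sketch. It is not taken from any STX processor; the class AncestorStackPath and its select method are hypothetical, and the sketch deliberately ignores namespaces, wildcards, and predicates. It only shows why an absolute path such as /a/b/c can select at most one element: each step is compared with exactly one entry of the stack.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of any STX processor API. */
class AncestorStackPath {
    /**
     * Evaluates an absolute child-only path such as "/a/b/c" against the ancestor stack.
     * The stack lists the names of the open elements, outermost first, context node last.
     * Returns the stack index of the selected element (0 = document element), or -1.
     */
    static int select(List<String> ancestorStack, String absolutePath) {
        String[] steps = absolutePath.substring(1).split("/");
        if (steps.length > ancestorStack.size()) {
            return -1; // the path is deeper than the current position in the document
        }
        for (int i = 0; i < steps.length; i++) {
            if (!steps[i].equals(ancestorStack.get(i))) {
                return -1; // step i does not name the open element at depth i
            }
        }
        return steps.length - 1; // the context node or one of its ancestors
    }

    public static void main(String[] args) {
        List<String> stack = Arrays.asList("a", "b", "c", "d");
        System.out.println(select(stack, "/a/b/c")); // prints 2: an ancestor of the context node
        System.out.println(select(stack, "/a/x"));   // prints -1: nothing is selected
    }
}
```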
In the absence of a general apply-templates instruction, STX provides additional instructions besides stx:process-children: there are stx:process-attributes, stx:process-self (for processing the same node again with a different template), stx:process-siblings (for processing the following siblings of the context node, used in the example for processing the ExternalPage elements following the found Topic), stx:process-buffer (for processing a temporary XML fragment) and stx:process-document (for processing another XML document). The STX processor mustn't encounter an stx:process-children instruction twice when processing a template.
STX enhances XSLT's flat stylesheet structure with the ability to define groups of templates. Groups may be nested; the stx:transform element itself forms a top-level group. The search scope for the next matching template is restricted to the current group.
So, what are the benefits of such groups? Grouping allows some kind of modularization. The templates of a group form a local module; the transformation of the current node and its descendants is restricted to this group. STX encourages an author to develop well-structured stylesheets. Templates are not grouped implicitly by means of a mode attribute as in XSLT (where they need not be located side by side), but rather explicitly. The transformation process runs more efficiently since a smaller set of templates has to be searched for the best matching template.
There are two ways of entering a group. The first one uses explicit group names and corresponds to XSLT's mode. In the example there is a group named ep and an stx:process-siblings instruction with a group attribute targeting this group. The second way uses a fixed set of templates of each group which act as entry points for this group. Templates in child groups marked as "public" are also visible in the parent group and will therefore be considered for matching. Private templates (the default) are only visible within their group, but not in the parent group. Looking again at our example we see that the first (anonymous) group will be entered only via the public template for Topic elements. Furthermore, templates marked as "global" in whatever group can act as a last resort for matching. Such templates loosen the strict group-oriented processing flow and should be used with caution.
Another important aspect with regard to groups is the scope of variables. While XSLT distinguishes only between global (on stylesheet level) and local variables, STX introduces group-level variables. These variables are shared by all templates of this group. Moreover, they can be used for recursion. Templates with an attribute new-scope set to "yes" force the STX processor, upon matching, to create new instances of all group variables of this group. After the STX processor has finished processing such a template, the former variable values are restored.
The streaming character of STX, based on the transformation of SAX events, is its strength with respect to memory consumption, but it can also be seen as its principal weakness: it does not allow arbitrary transformations of the document. In particular, "wide-area" structural changes such as sorting cannot be performed by the means introduced so far.
STX offers a solution for this problem by providing buffers. Buffers are used to store SAX events produced in a transformation step which can be transformed again afterwards. This is similar to node-sets in XSLT 2.0 emerging from a transformation, which overcome the restrictions stemming from result tree fragments in XSLT 1.0.
Buffers are first in, first out (FIFO) stores. They must be declared just like variables (the same scoping rules apply), they can be filled by placing appropriate elements or instructions into the content of an stx:result-buffer instruction, and they can be processed with the stx:process-buffer instruction. Obviously, storing a whole document in a buffer would defeat STX's small memory footprint. For transformations which require random access to all data of the XML source, XSLT might be the better choice.
Nevertheless, it is not very difficult to show that with the help of buffers STX is able to perform any kind of algorithmically describable transformation. Using the TMML vocabulary for Turing machines from http://www.unidex.com/turing/, it is quite easy to develop an STX transformation that runs every Turing program coded in TMML. That means STX is Turing complete.
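The idea of a FIFO store for SAX events can be sketched in a few lines of Java. This is only an illustration of the concept under simplifying assumptions, not the way any STX processor implements buffers: the class EventBuffer, its Event interface, and the demo method are hypothetical, and only element and text events are recorded.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch of a buffer: records SAX events and replays them in arrival order. */
class EventBuffer extends DefaultHandler {
    /** One recorded SAX event that can be replayed later. */
    interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final Queue<Event> events = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        Attributes copy = new AttributesImpl(atts); // parsers may reuse the Attributes object
        events.add(target -> target.startElement(uri, localName, qName, copy));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add(target -> target.endElement(uri, localName, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        char[] text = Arrays.copyOfRange(ch, start, start + length); // the array is reused too
        events.add(target -> target.characters(text, 0, text.length));
    }

    /** Replays the stored events first in, first out, and empties the buffer. */
    void replayTo(ContentHandler target) throws SAXException {
        for (Event e = events.poll(); e != null; e = events.poll()) {
            e.replay(target);
        }
    }

    /** Fills a buffer by hand, replays it into a logging handler, and returns the log. */
    static String demo() {
        StringBuilder log = new StringBuilder();
        ContentHandler logger = new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                log.append('<').append(q).append('>');
            }
            @Override
            public void endElement(String u, String l, String q) {
                log.append("</").append(q).append('>');
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                log.append(ch, start, length);
            }
        };
        EventBuffer buffer = new EventBuffer();
        buffer.startElement("", "dt", "dt", new AttributesImpl());
        buffer.characters("eSmarts".toCharArray(), 0, 7);
        buffer.endElement("", "dt", "dt");
        try {
            buffer.replayTo(logger);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints <dt>eSmarts</dt>
    }
}
```

The memory held by such a store grows with the number of recorded events, which is exactly why buffering a whole document gives away the streaming advantage.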
For users familiar with XSLT the following short list of the main differences may help when starting with STX:
STX has no general xsl:apply-templates. Instead there is a family of stx:process-... instructions, such as stx:process-children, stx:process-siblings, stx:process-self, stx:process-attributes, stx:process-document, and stx:process-buffer.
Different transformation modes, distinguished by a mode attribute in xsl:apply-templates and xsl:template, can be modelled with named groups.
Variables can be changed with an stx:assign instruction.
Location paths in STX may use the abbreviated XPath syntax only. A path in STX selects only nodes from the ancestor stack.
Steps in match patterns may contain at most one predicate. This is only important for positional predicates because without accessing the position of a node the expressions of multiple predicates can be easily connected with the and operator. There's no restriction regarding variables in predicates within patterns.
A call to the position() function within a template body always returns the position with respect to the node test of the match pattern. In XSLT the position is defined in terms of a context node-set, determined for example by a preceding xsl:apply-templates.
Patterns beginning with "//", which are not actually necessary in XSLT either, have been removed from STX.
Things XSLT can't perform at all, but STX can:
Output the start and end of an element independently with stx:start-element and stx:end-element instructions. The STX processor ensures that the result consists of well-formed XML. XSLT doesn't allow the creation of single start tags or end tags because it operates on a tree level that has no notion of tags.
Process CDATA sections: STX is able to match CDATA sections as well as output CDATA sections.
Serialize markup as text.
Process text content in a similar manner to XML data, allowing complex transformations from text to markup in a simple way.
Having become acquainted with the STX language, let me answer the question of what kinds of transformation tasks STX provides a useful solution for. From the characterization of STX it should be clear that transformations requiring an overall view of, and random access to, the data in the XML document cannot be carried out practicably.
A general rule of thumb is: a transformation that keeps the order of the incoming data is a good candidate for STX. Thus typical tasks are:
Changes in the naming of elements or attributes. The overall structure of the data stays the same; only names have to be altered.
Providing a view or a subset of the data. The expected output simply omits certain information, for example a subelement in each record or all but a special record, etc.
Local transformations. The structural changes are constrained to a small amount of data in each case, for example when replacing attributes by subelements or vice versa.
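To show how little state an order-preserving task needs, here is the first item of the list (renaming elements) as a plain SAX filter in Java. It is a sketch for comparison only, not an STX sheet and not Joost code: the class RenameFilter and its apply method are made up for this example, prefixed element names are not handled, and the JDK's identity Transformer is used merely to serialize the filtered events.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: renames one (unprefixed) element and passes everything else through. */
class RenameFilter extends XMLFilterImpl {
    private final String from;
    private final String to;

    RenameFilter(XMLReader parent, String from, String to) {
        super(parent);
        this.from = from;
        this.to = to;
    }

    private String rename(String name) {
        return from.equals(name) ? to : name;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        super.startElement(uri, rename(localName), rename(qName), atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        super.endElement(uri, rename(localName), rename(qName));
    }

    /** Streams the XML text through the filter and returns the serialized result. */
    static String apply(String xml, String from, String to) {
        try {
            SAXParserFactory parsers = SAXParserFactory.newInstance();
            parsers.setNamespaceAware(true);
            XMLReader reader = parsers.newSAXParser().getXMLReader();
            // A Transformer created without a stylesheet copies its input unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(
                    new SAXSource(new RenameFilter(reader, from, to),
                            new InputSource(new StringReader(xml))),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(apply("<a><old>x</old></a>", "old", "new"));
    }
}
```

The filter keeps no data between events, so its memory use does not depend on the size of the input.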
For scenarios that don't clearly belong to one of the categories "pure sequential" and "full random access" a hybrid procedure appears to be useful: the STX processor collects preprocessed XML data chunks into a buffer and passes the resulting fragment to an independent XML processor, for example an XSLT engine. This way XML data that is too big for an XSLT engine but too complicated to transform solely in STX will become manageable.
The last aspect I want to address is the use of an STX transformation engine within a Java application. The simplest way is to treat such an engine as a SAX XMLFilter object that may be chained into your processing flow. This is a rather low-level approach and requires intimate knowledge of the engine used and its API. Fortunately there is a vendor-independent API for XML transformations.
The Transformation API for XML (also known as TrAX)[2] became a part of Sun's Java API for XML Processing (JAXP) with version 1.1 of JAXP. One of the design goals for TrAX was to introduce a level of abstraction which allows the usage of general XML transformations within a Java application without the need to struggle with special XSLT transformer APIs. Moreover, even the usage of XSLT as the transformation language wasn't hard-wired into the TrAX API. So it is not surprising that STX transformations can be enabled via TrAX. In the end this can even be seen as a proof of concept that TrAX is indeed able to handle different transformation languages.
The following refers to the STX processor implementation Joost, written in Java (http://joost.sourceforge.net/). To create a factory for STX transformations it is necessary to use the value "net.sf.joost.trax.TransformerFactoryImpl" for the system property "javax.xml.transform.TransformerFactory". After that you can simply create Transformer or Templates objects with an STX transformation sheet as source. The calls to invoke the transformation process itself remain unchanged. That means: switching from XSLT to STX within a Java application requires at most two lines of code to be changed:
// 1. Set the property for the new factory
System.setProperty("javax.xml.transform.TransformerFactory",
                   "net.sf.joost.trax.TransformerFactoryImpl");
...
// 2. Use an STX transformation sheet instead of an XSLT stylesheet
Transformer transformer =
    factory.newTransformer(new StreamSource("sheet.stx"));
Moreover, it is no problem to switch dynamically between several transformation methods just by changing the system property noted above and the source for the transformation at runtime. The integration of STX-based transformations into existing Java applications which already use the TrAX interfaces is indeed very simple.
You may wonder what happens to the DOM versions of Source and Result, since STX requires SAX events to be processed. Well, a specific TrAX implementation can of course handle these special types. Additional helper objects can accept a DOMSource by traversing a DOM tree and emitting SAX events from it, or supply a DOMResult by building a DOM tree during the consumption of SAX events. It should be clear that using DOM in this case again gives away STX's advantage of being more memory-efficient than XSLT.
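The traversal of a DOM tree into SAX events can be tried out with the standard JAXP classes alone. The following Java sketch uses the JDK's built-in identity Transformer for this purpose; it does not show how Joost handles a DOMSource internally, and the class DomToSax and its method elementNames are hypothetical names chosen for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXResult;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical example: turns a DOM tree into SAX events and lists the element names. */
class DomToSax {
    static List<String> elementNames(String xml) {
        try {
            // Building the DOM tree holds the whole document in memory.
            DocumentBuilderFactory builders = DocumentBuilderFactory.newInstance();
            builders.setNamespaceAware(true);
            Document doc = builders.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            List<String> names = new ArrayList<>();
            DefaultHandler handler = new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName,
                                         Attributes atts) {
                    names.add(qName);
                }
            };
            // The identity transformation walks the tree and fires SAX events at the handler.
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new SAXResult(handler));
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("conversion failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<a><b/><c/></a>")); // prints [a, b, c]
    }
}
```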
[STX] Streaming Transformations for XML (STX) 1.0, P. Cimprich, O. Becker, et al., http://stx.sourceforge.net/documents/
[Intro] An Introduction to Streaming Transformations for XML, O. Becker, P. Brown, P. Cimprich, http://www.xml.com/pub/a/2003/02/26/stx.html
[F&O] XQuery 1.0 and XPath 2.0 Functions and Operators, W3C Working Draft, http://www.w3.org/TR/2002/WD-xquery-operators-20021115/
[1] Well, nearly perfect. Unfortunately the file used, http://rdf.dmoz.org/rdf/content.rdf.u8.gz, contains invalid UTF-8 byte sequences and is therefore not well-formed XML! Before it can be used as input for an STX processor it must be converted into correct XML by non-XML software, for example by removing the invalid sections.
[2] Possibly the acronym TrAX won't be used in the future anymore. The interfaces are now part of Java's standard library and don't need a name of their own.