Transforming XML on the Fly
How STX Enables the Processing of Large Documents
Oliver Becker
Humboldt University Berlin
What will I talk about in this presentation?
- Why STX?
XSLT is fine, isn't it?
- What is STX?
The foundations of STX
- What is STX good for?
Show me a real-world use case!
- Is it more than XSLT?
Which new concepts does STX introduce?
- But what about my existing applications?
How to integrate STX-based transformations
- Great! Where do I find more information?
XML Transformations
The problem with XSLT ...
- Large XML documents
- Continuous transformation of XML streams (Pipelining)
Figure: The transformation process with XSLT
- XSLT uses XPath
- XPath provides a view of the entire XML document, so the whole document must be held in memory
What to do?
- Use a "real" programming language and an API for parsing XML
(for example SAX, "The Simple API for XML")
- Drawbacks:
- Requires programming skills
- A simple API usually leads to complex application logic
- Missing support for creating XML
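To make these drawbacks concrete, here is a minimal Java sketch (not part of the original slides) of a SAX handler that collects the link URLs of one topic. The class name TopicLinkHandler and the helper linksOf are hypothetical; the element names are borrowed from the Open Directory example used later in this presentation. Note how much state has to be tracked by hand, and that the handler does not create any XML output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the link URLs of the Topic with a given catid. */
class TopicLinkHandler extends DefaultHandler {
    private final String wantedCatid;
    private final List<String> pending = new ArrayList<>(); // links of the current Topic
    private final List<String> links = new ArrayList<>();   // links of the matching Topic
    private final StringBuilder text = new StringBuilder(); // text content of catid
    private boolean inCatid;
    private boolean found;

    TopicLinkHandler(String wantedCatid) {
        this.wantedCatid = wantedCatid;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        switch (qName) {
            case "Topic": pending.clear(); found = false; break;
            case "catid": inCatid = true; text.setLength(0); break;
            case "link":  pending.add(atts.getValue("r:resource")); break;
            default: break;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so it has to be accumulated.
        if (inCatid) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (qName.equals("catid")) {
            inCatid = false;
            found = text.toString().trim().equals(wantedCatid);
        } else if (qName.equals("Topic") && found) {
            links.addAll(pending);
        }
    }

    /** Parses the given XML string; checked exceptions are wrapped for brevity. */
    static List<String> linksOf(String xml, String catid) {
        try {
            TopicLinkHandler handler = new TopicLinkHandler(catid);
            // The default factory is not namespace aware, so qName carries the full name.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.links;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Even this small task needs two flags, a text accumulator and two lists, which is the kind of application logic a transformation language is meant to hide.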
            | API | Scripting Language
Tree based  | DOM | XSLT
Event based | SAX | ???
??? = STX
STX – Streaming Transformations for XML
Use the best of both worlds:
- XSLT-like syntax, uses many familiar constructs
- Built on top of SAX
- Transformation without an internal tree representation of
the document
Figure: The transformation process with STX
The Path Language of STX
Obviously, STX cannot use full XPath.
STXPath is an extended subset of XPath 1.0
- Only abbreviated paths (no explicit axes)
- Access restricted to the ancestors of the context node
+ Simple sequences and some XPath 2.0 functions
- No support for Schema datatypes
- Still to investigate: the optimal subset of XPath 2.0
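The restriction to ancestors follows from the event-based model: while a start tag is being processed, the only nodes known to be open are the ancestors of the current node. The following Java sketch illustrates this with plain SAX; it is an assumption about the general principle, not a description of how any STX processor is implemented, and the class name AncestorStackHandler is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: records the ancestor path of every element. */
class AncestorStackHandler extends DefaultHandler {
    // Currently open elements, innermost first. This stack is all the
    // context a streaming processor has without buffering.
    private final Deque<String> open = new ArrayDeque<>();
    private final List<String> paths = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        open.push(qName);
        StringBuilder path = new StringBuilder();
        // Walk from the document element down to the context node.
        for (Iterator<String> it = open.descendingIterator(); it.hasNext();) {
            path.append('/').append(it.next());
        }
        paths.add(path.toString());
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        open.pop();
    }

    /** Parses the given XML string; checked exceptions are wrapped for brevity. */
    static List<String> pathsOf(String xml) {
        try {
            AncestorStackHandler handler = new AncestorStackHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.paths;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Siblings and descendants never appear on this stack, which is why a path expression such as ../@about can be evaluated on the fly while an expression over following nodes cannot.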
What is STX good for?
Q: For what kind of transformations is STX a suitable technology?
A: Forward transformations that need only local access to the XML data.
For example:
- No structural changes, only renamings of elements or attributes
- Creating a subset (view) of the data by omitting unwanted information
- Locally constrained transformations that need only data from small local subtrees
Combining STX with XSLT enables powerful and memory-saving transformations.
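For comparison, the simplest case above (renaming elements) can be written directly against SAX in Java. The sketch below is not from the original slides; the class name RenameFilter and the helper rename are hypothetical. It uses the JDK's XMLFilterImpl to rewrite element names on the fly and an identity Transformer to serialize the resulting events, so no document tree is built by the filter itself.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical SAX filter: renames one unprefixed element, passes everything else through. */
class RenameFilter extends XMLFilterImpl {
    private final String from;
    private final String to;

    RenameFilter(XMLReader parent, String from, String to) {
        super(parent);
        this.from = from;
        this.to = to;
    }

    private String map(String name) {
        return name.equals(from) ? to : name;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        super.startElement(uri, map(local), map(qName), atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        super.endElement(uri, map(local), map(qName));
    }

    /** Renames elements in an XML string; checked exceptions are wrapped for brevity. */
    static String rename(String xml, String from, String to) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();
            // A Transformer created without a stylesheet copies its input (identity).
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(
                    new SAXSource(new RenameFilter(reader, from, to),
                            new InputSource(new StringReader(xml))),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For example, RenameFilter.rename("<a><b>x</b></a>", "b", "c") renames every b element to c. In STX the same intent would be stated declaratively in a template rather than in handler code.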
STX by Example
The RDF dump of the Open Directory (DMOZ)
http://rdf.dmoz.org/
The uncompressed contents dump is about 1 GByte in size.
<RDF>
  <Topic r:id="hierarchy path">
    <catid> ID </catid>
    <link r:resource="URL" />
    - more <link>s belonging to this topic
  <ExternalPage about="URL">
    <d:Title> ... </d:Title>
    <d:Description> ... </d:Description>
  - more <ExternalPage>s for each of the <link>s above
  - more <Topic>s and their <ExternalPage>s
<?xml version='1.0' encoding='UTF-8'?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf">
  <Topic r:id="Top">
    <catid>1</catid>
  </Topic>
  ...
  <Topic r:id="Top/Shopping">
    <catid>13</catid>
    <link r:resource="http://www.esmarts.com/"/>
    <link r:resource="http://www.bdscodak.com"/>
    <link r:resource="http://www.choicemall.com/"/>
  </Topic>
  <ExternalPage about="http://www.esmarts.com/">
    <d:Title>eSmarts</d:Title>
    <d:Description>
      eSmarts helps consumers find the lowest possible prices on the
      web. They compare prices at different Internet stores, list
      coupons (including many $10 off coupons), discuss sales and
      share great shopping tips.
    </d:Description>
  </ExternalPage>
  ...
</RDF>
STX by Example (cont'd)
The task for this data:
- Input: the ID of a topic
- Output: HTML view of the resources of this topic
Resources in Top/Shopping
eSmarts
    eSmarts helps consumers find the lowest possible
    prices on the web. They compare prices at different Internet
    stores, list coupons (including many $10 off coupons), discuss
    sales and share great shopping tips.
BD Scodak - personalized children's books for your child's education
    BD Scodak is your source for personalized
    children's books customized with your child's information right
    next to popular cartoon, religious, sports, and tv characters
    and themes.
Choice World
    Choice Mall - The #1 global marketplace on the
    Internet. Thousands of stores offer quality, unique products
    and services, art and entertainment, books and music, gifts,
    food, real estate, health, sports, and fitness -- all under
    one roof!
The STX Transformation for this Example
<?xml version="1.0"?>
<stx:transform xmlns:stx="http://stx.sourceforge.net/2002/ns"
               xmlns:r="http://www.w3.org/TR/RDF/"
               xmlns:d="http://purl.org/dc/elements/1.0/"
               xmlns:od="http://dmoz.org/rdf"
               xmlns="http://www.w3.org/1999/xhtml"
               version="1.0">
  <!-- External parameter identifying the requested category -->
  <stx:param name="catid" />
  <stx:template match="od:RDF">
    <html>
      <body>
        <stx:process-children />
      </body>
    </html>
  </stx:template>
...
  <stx:variable name="resources" />
  <!-- Group for Topic elements -->
  <stx:group>
    <stx:variable name="found" select="false()" />
    <stx:template match="od:Topic" public="yes">
      <stx:assign name="resources" select="()" />
      <stx:process-children />
      <stx:if test="$found and $resources">
        <!-- We found the category and there are resources -->
        <h3>Resources in <stx:value-of select="@r:id" /></h3>
        <dl>
          <stx:process-siblings while="od:ExternalPage|text()"
                                group="ep" />
        </dl>
      </stx:if>
    </stx:template>
    <stx:template match="od:catid">
      <stx:assign name="found" select=". = $catid" />
    </stx:template>
    <stx:template match="od:link">
      <stx:assign name="resources"
                  select="($resources, @r:resource)" />
    </stx:template>
  </stx:group>
...
  <!-- Group for ExternalPage elements -->
  <stx:group name="ep">
    <stx:template match="od:ExternalPage">
      <!-- Is this page among the resources? -->
      <stx:if test="@about = $resources">
        <stx:process-children />
      </stx:if>
    </stx:template>
    <!-- Output Title and Description -->
    <stx:template match="d:Title">
      <dt><a href="{../@about}"><stx:value-of select="." /></a></dt>
    </stx:template>
    <stx:template match="d:Description">
      <dd><stx:value-of select="." /></dd>
    </stx:template>
  </stx:group>
</stx:transform>
Overview: STX Elements
Known from XSLT
- stx:transform
- stx:template
- stx:value-of
- stx:if
- stx:variable, stx:param
- stx:with-param
- stx:choose, stx:when, stx:otherwise
- stx:text, stx:element, stx:attribute, stx:comment,
stx:processing-instruction
- stx:copy
- stx:result-document (XSLT 2.0)
- stx:message
- stx:include
Overview: STX Elements (cont'd)
Different Syntax than in XSLT
- stx:process-children, stx:process-siblings,
stx:process-attributes
- stx:process-document
- stx:procedure, stx:call-procedure
- stx:for-each-item
Providing new Functionality
- stx:process-self
- stx:process-text
- stx:else, stx:while
- stx:start-element, stx:end-element, stx:cdata
- stx:group
- stx:buffer, stx:result-buffer, stx:process-buffer
Understanding Groups, stx:group
- Means to structure the transformation sheet
- Combine several templates
- Speed up the transformation process
- Replace XSLT's mode
Entering a group
1. Explicitly: using named groups
<stx:group name="ep">
  <stx:template ...>
  <stx:template ...>
</stx:group>
<stx:process-children group="ep" />
Entering a group (cont'd)
2. Implicitly: using public templates in child groups
<stx:template ...>
  ...
  <stx:process-children />
</stx:template>
<stx:group>
  <stx:template match="..." public="yes">
    ...
  </stx:template>
</stx:group>
Groups can be nested.
Implicitly Entering a Group via Public Templates (schematic)
Buffers
Current Situation:
- An STX Transformation must run in document order.
- Difficult to perform wide-area changes (e.g. sorting)
Solution:
- Use a buffer (FIFO store) for result SAX events.
- The contents represent a temporary tree.
Buffers enable wide-area changes by means of repeated local
changes.
Drawback: increased memory and processing costs
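The idea of a FIFO store for SAX events can be illustrated in a few lines of Java. This is only a sketch of the general principle under simplifying assumptions (it records just element and text events); it is not how stx:buffer is implemented, and the class name EventBuffer is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical FIFO buffer: records SAX events and replays them later. */
class EventBuffer extends DefaultHandler {
    /** One recorded event that can be sent to another handler. */
    private interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final Queue<Event> events = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Parsers may reuse the Attributes object, so keep a copy.
        final Attributes copy = new AttributesImpl(atts);
        events.add(target -> target.startElement(uri, local, qName, copy));
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        events.add(target -> target.endElement(uri, local, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The character array is reused by the parser as well.
        final char[] copy = Arrays.copyOfRange(ch, start, start + length);
        events.add(target -> target.characters(copy, 0, copy.length));
    }

    /** Number of buffered events; this is where the memory cost shows up. */
    int size() {
        return events.size();
    }

    /** Sends all buffered events to the target in arrival order and empties the buffer. */
    void replayTo(ContentHandler target) throws SAXException {
        Event event;
        while ((event = events.poll()) != null) {
            event.replay(target);
        }
    }
}
```

Replaying the buffer into another handler is again a strictly sequential pass, which matches the slide's point that wide-area changes become a series of local ones. Everything held in the queue occupies memory until it is replayed.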
Buffers (schematic)
STX Integration into existing Applications
STX is a transformation language, so its functionality should be made
accessible via a standard API.
Java: JAXP/TrAX
Using the STX implementation Joost
Own application:
- Set the javax.xml.transform.TransformerFactory system property to net.sf.joost.trax.TransformerFactoryImpl
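In Java this could look roughly like the sketch below. The class name TraxStxDemo and the method transform are hypothetical; only the property name and the Joost factory class come from the slide. The sketch assumes Joost is on the classpath when its factory is requested. Passing null instead keeps the JDK's default XSLT processor, which makes it possible to try the code without Joost.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical demo: selects the TrAX implementation via the factory property. */
class TraxStxDemo {
    /** Factory class named on the slide; requires Joost on the classpath. */
    static final String JOOST_FACTORY = "net.sf.joost.trax.TransformerFactoryImpl";

    /**
     * Applies a transformation sheet to an input document.
     * factoryClass may be null to keep the default TransformerFactory.
     */
    static String transform(String factoryClass, String sheet, String input) {
        if (factoryClass != null) {
            // From here on, TransformerFactory.newInstance() returns this implementation.
            System.setProperty("javax.xml.transform.TransformerFactory", factoryClass);
        }
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer =
                    factory.newTransformer(new StreamSource(new StringReader(sheet)));
            StringWriter out = new StringWriter();
            // Strings keep the demo short; large documents would use file or network streams.
            transformer.transform(new StreamSource(new StringReader(input)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With Joost available, a call such as TraxStxDemo.transform(TraxStxDemo.JOOST_FACTORY, stxSheet, xml) would run an STX sheet through the same TrAX calls that existing XSLT code already uses, so the application code itself does not have to change.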
Other applications:
Current State of the STX Project
- Open Source Project since March 2002
- Homepage: http://stx.sourceforge.net/
- Discussion list for developing the STX specification
- Specification status: Working Draft
Implementations
Additional Reading
Thank you for your attention!
Questions?