1 / 23

Transforming XML on the Fly

How STX Enables the Processing of Large Documents



Oliver Becker

Humboldt University Berlin


2 / 23

What will I talk about in this presentation?

  1. Why STX?
    XSLT is fine, isn't it?

  2. What is STX?
    The foundations of STX

  3. What is STX good for?
    Show me a real-world use case!

  4. Is it more than XSLT?
    Which new concepts does STX introduce?

  5. But what about my existing applications?
    How to integrate STX-based transformations

  6. Great! Where do I find more information?

3 / 23

XML Transformations

The problem with XSLT ...


The transformation process with XSLT

4 / 23

What to do?


              API    Scripting Language
Tree based    DOM    XSLT
Event based   SAX    ???

  ??? = STX
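
To make the "event based" row concrete, here is a minimal Java sketch of SAX processing. The class name ElementCounter is hypothetical and not part of the talk. The handler reacts to start-element events as they arrive and never builds a tree, so memory use does not grow with document size (apart from the small table of counts).

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: counts elements by name using SAX events only.
class ElementCounter extends DefaultHandler {
    // Element name -> number of start tags seen so far
    final Map<String, Integer> counts = new TreeMap<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // Called once per start tag; no document tree is kept.
        counts.merge(qName, 1, Integer::sum);
    }

    // Parses the given XML text and returns the per-name element counts.
    static Map<String, Integer> count(String xml) throws Exception {
        ElementCounter handler = new ElementCounter();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.counts;
    }
}
```

The price of this approach is that all bookkeeping is left to the programmer, which is the gap a streaming scripting language is meant to fill.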

5 / 23

STX – Streaming Transformations for XML

Use the best of both worlds:


The transformation process with STX

6 / 23

The Path Language of STX

STX processes its input as a stream, so it cannot use full XPath.

STXPath is an extended subset of XPath 1.0

- Only abbreviated paths (no explicit axes)
- Access restricted to the ancestors of the context node
+ Simple sequences and some XPath 2.0 functions
  • No support for Schema datatypes

  • Still to investigate: the optimal subset of XPath 2.0
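
Why only the ancestors? A rough Java/SAX sketch, not taken from any STX implementation, shows what a streaming processor has at hand: a stack of the currently open elements. The class name PathTracker is hypothetical. Siblings and descendants have either already passed or not yet arrived, so only the ancestor chain can be addressed without buffering.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: keeps only the ancestor chain of the current node.
class PathTracker extends DefaultHandler {
    // Names of the currently open elements, outermost first
    private final Deque<String> ancestors = new ArrayDeque<>();
    // Path of every element, recorded when its start tag is seen
    final List<String> paths = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        ancestors.addLast(qName);
        paths.add("/" + String.join("/", ancestors));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The element is closed; it is no longer reachable afterwards.
        ancestors.removeLast();
    }

    // Parses the given XML text and returns the element paths in document order.
    static List<String> paths(String xml) throws Exception {
        PathTracker handler = new PathTracker();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.paths;
    }
}
```

The stack never grows beyond the nesting depth of the document, which is what keeps memory use independent of document size.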

7 / 23

What is STX good for?

Q: For what kind of transformations is STX a suitable technology?
A: Forward transformations that need only local access to the XML data.
For example:
  • No structural changes, only renamings of elements or attributes
  • Creating a subset (view) of the data by omitting unwanted information
  • Locally constrained transformations that need only data from small local subtrees
Combining STX with XSLT will enable powerful and memory-saving transformations.
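
For comparison, the first kind of transformation (renaming elements only) can be sketched by hand in Java as a SAX filter. This is plain JAXP/SAX, not STX, and the class name RenameFilter and its rename table are hypothetical. It illustrates that such a transformation needs no more than the current event.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical example class: renames elements while the events stream through.
class RenameFilter extends XMLFilterImpl {
    // Old local name -> new local name
    private final Map<String, String> renames;

    RenameFilter(XMLReader parent, Map<String, String> renames) {
        super(parent);
        this.renames = renames;
    }

    // Keeps a namespace prefix, if any, and swaps in the new local name.
    private static String withPrefix(String qName, String newLocal) {
        int colon = qName.indexOf(':');
        return colon < 0 ? newLocal : qName.substring(0, colon + 1) + newLocal;
    }

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        String renamed = renames.getOrDefault(local, local);
        super.startElement(uri, renamed, withPrefix(qName, renamed), atts);
    }

    @Override
    public void endElement(String uri, String local, String qName)
            throws SAXException {
        String renamed = renames.getOrDefault(local, local);
        super.endElement(uri, renamed, withPrefix(qName, renamed));
    }

    // Runs the filter over the XML text and serializes the filtered events.
    static String rename(String xml, Map<String, String> renames) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true); // needed so that local names are reported
        XMLReader reader = spf.newSAXParser().getXMLReader();
        // An identity transformer is used here only as a serializer.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(
            new SAXSource(new RenameFilter(reader, renames),
                          new InputSource(new StringReader(xml))),
            new StreamResult(out));
        return out.toString();
    }
}
```

Even this trivial task takes a fair amount of boilerplate in a general-purpose language; a template-based language expresses it as a rule per element.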

8 / 23

STX by Example

The RDF dump of the Open Directory (DMOZ) http://rdf.dmoz.org/

The contents dump (uncompressed) is about 1 GB in size.

<RDF>
  <Topic r:id="hierarchy path">
    <catid> ID </catid>
    <link r:resource="URL" />
    more <link>s belonging to this topic
  <ExternalPage about="URL">
    <d:Title> ... </d:Title>
    <d:Description> ... </d:Description>
  more <ExternalPage>s for each of the <link>s above
  more <Topic>s and their <ExternalPage>s

9 / 23
<?xml version='1.0' encoding='UTF-8'?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf">

<Topic r:id="Top">
  <catid>1</catid>
</Topic>
...
<Topic r:id="Top/Shopping">
  <catid>13</catid>
  <link r:resource="http://www.esmarts.com/"/>
  <link r:resource="http://www.bdscodak.com"/>
  <link r:resource="http://www.choicemall.com/"/>
</Topic>

<ExternalPage about="http://www.esmarts.com/">
  <d:Title>eSmarts</d:Title>
  <d:Description>
    eSmarts helps consumers find the lowest possible prices on the
    web.  They compare prices at different Internet stores, list
    coupons (including many $10 off coupons), discuss sales and
    share great shopping tips.
  </d:Description>
</ExternalPage>
...
</RDF>

10 / 23

STX by Example (cont'd)

The task for this data:

Resources in Top/Shopping

eSmarts
eSmarts helps consumers find the lowest possible prices on the web. They compare prices at different Internet stores, list coupons (including many $10 off coupons), discuss sales and share great shopping tips.
BD Scodak - personalized children's books for your child's education
BD Scodak is your source for personalized children's books customized with your child's information right next to popular cartoon, religious, sports, and tv characters and themes.
Choice World
Choice Mall - The #1 global marketplace on the Internet. Thousands of stores offer quality, unique products and services, art and entertainment, books and music, gifts, food, real estate, health, sports, and fitness -- all under one roof!

11 / 23

The STX Transformation for this Example

<?xml version="1.0"?>
<stx:transform xmlns:stx="http://stx.sourceforge.net/2002/ns"
               xmlns:r="http://www.w3.org/TR/RDF/"
               xmlns:d="http://purl.org/dc/elements/1.0/"  
               xmlns:od="http://dmoz.org/rdf"
               xmlns="http://www.w3.org/1999/xhtml"
               version="1.0">

  <!-- External parameter identifying the requested category -->
  <stx:param name="catid" />
      
  <stx:template match="od:RDF">
    <html>
      <body>
        <stx:process-children />
      </body>
    </html>
  </stx:template>

  ...

12 / 23
  <stx:variable name="resources" />

  <!-- Group for Topic elements -->
  <stx:group>
    <stx:variable name="found" select="false()" />

    <stx:template match="od:Topic" public="yes">
      <stx:assign name="resources" select="()" />
      <stx:process-children />
      <stx:if test="$found and $resources">
        <!-- We found the category and there are resources -->
        <h3>Resources in <stx:value-of select="@r:id" /></h3>
        <dl>
          <stx:process-siblings while="od:ExternalPage|text()"
                                group="ep" />
        </dl>
      </stx:if>
    </stx:template>
  
    <stx:template match="od:catid">
      <stx:assign name="found" select=". = $catid" />
    </stx:template>

    <stx:template match="od:link">
      <stx:assign name="resources"
                  select="($resources, @r:resource)" />
    </stx:template>
  </stx:group>

13 / 23
  ...
      
  <!-- Group for ExternalPage elements -->
  <stx:group name="ep">
    <stx:template match="od:ExternalPage">
      <!-- Is this page among the resources? -->
      <stx:if test="@about = $resources">
        <stx:process-children />
      </stx:if>
    </stx:template>
    
    <!-- Output Title and Description -->
    <stx:template match="d:Title">
      <dt><a href="{../@about}"><stx:value-of select="." /></a></dt>
    </stx:template>

    <stx:template match="d:Description">
      <dd><stx:value-of select="." /></dd>
    </stx:template>
  </stx:group>

</stx:transform>

14 / 23

Overview: STX Elements

Known from XSLT


15 / 23

Overview: STX Elements (cont'd)

Different Syntax than in XSLT

Providing new Functionality


16 / 23

Understanding Groups, stx:group

Entering a group

1. Explicitly: using named groups

<stx:group name="ep">
   <stx:template ...>
   <stx:template ...>
</stx:group>
<stx:process-children group="ep" />

17 / 23

Entering a group

2. Implicitly: using public templates in child groups

<stx:template ...>
  ...
  <stx:process-children />
</stx:template>
      
<stx:group>
  <stx:template match="..." public="yes">
    ...
  </stx:template>
</stx:group>

Groups can be nested.


18 / 23

Implicitly Entering a Group via Public Templates (schematic)


19 / 23

Buffers

Current Situation:

Solution:

Buffers enable wide-area changes by means of repeated local changes.

Drawback:
increasing memory and processing costs


20 / 23

Buffers (schematic)


21 / 23

STX Integration into Existing Applications

STX is a transformation language, so its functionality should be made accessible via a standard API.

Java: JAXP/TrAX

Using the STX implementation Joost

Own application:

Other applications:
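
A minimal Java sketch of the TrAX route follows; it is an illustration, not the code from the original slide. The class TraxDemo and its method are hypothetical. The Joost factory class name net.sf.joost.trax.TransformerFactoryImpl is an assumption that should be checked against the Joost distribution, and it requires Joost on the classpath. An application's own code can name the factory class directly; other applications can be pointed at it through the standard JAXP system property javax.xml.transform.TransformerFactory.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example class: runs a transformation sheet through TrAX.
class TraxDemo {
    // ASSUMPTION: class name of Joost's TrAX factory; verify before use.
    static final String JOOST_FACTORY = "net.sf.joost.trax.TransformerFactoryImpl";

    // factoryClass == null selects the platform default (an XSLT processor);
    // passing JOOST_FACTORY would select Joost if it is on the classpath.
    static String transform(String factoryClass, String sheet, String input)
            throws Exception {
        TransformerFactory factory = (factoryClass == null)
            ? TransformerFactory.newInstance()
            : TransformerFactory.newInstance(factoryClass, null);
        Transformer transformer =
            factory.newTransformer(new StreamSource(new StringReader(sheet)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(input)),
                              new StreamResult(out));
        return out.toString();
    }
}
```

Because only the factory changes, the rest of the calling code stays the same whether the sheet is written in XSLT or in STX.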


22 / 23

Current State of the STX Project


Implementations


Additional Reading


23 / 23

Thank you for your attention!


Questions?