XML Europe 2003

What's New in XSLT 2.0

Abstract

XSLT and XPath 2.0 are nearing completion, but the size of the new specifications means that many XSLT 1.0 users remain unaware of the changes that they bring. This talk outlines the major changes in XSLT 2.0 and their implications for the design of XSLT stylesheets. In particular, it examines the impact of the replacement of "result tree fragments" with "temporary trees", the ability to create multiple result trees, support for grouping, the definition of extension functions, parsing text documents with regular expressions and the introduction of typing.

Keywords

XPath, XSLT.

Table of Contents

1. Introduction
2. Temporary Trees
3. Result Documents
3.1. Multiple Result Documents
3.2. Output Serialisation
4. Grouping
5. Function Definitions
6. Text Parsing
7. Typing
7.1. Typed Values
7.2. Typed Nodes
7.3. Schema Import
7.4. Substitution Groups
8. Conclusions
Glossary
Biography

1. Introduction

Extensible Stylesheet Language Transformations (XSLT) 2.0 offers a number of helpful new features that make it easier to write complex stylesheets. Most of these are standardisations of proven extensions that have been available in XSLT 1.0 processors for some time, such as temporary trees and the creation of user-defined functions; others are more novel, such as the ability to parse a text document using regular expressions.

In this paper, we'll explore the major changes in XSLT 2.0 and the likely implications of those changes in terms of how we write our stylesheets. We'll look at the changes in six major areas:

  • The introduction of temporary trees

  • Support for multiple result documents and changes to result document serialisation

  • Grouping support

  • The ability to define extension functions within a stylesheet

  • Regular expressions and parsing of non-Extensible Markup Language (XML) documents

  • The addition of types and typing

Note

The latest specification for XSLT 2.0 is available at http://www.w3.org/TR/xslt20; the details in this paper might change between time of writing and XSLT 2.0 becoming a Recommendation. You can test many of XSLT 2.0's features using Saxon 7.x from http://saxon.sourceforge.net/.

2. Temporary Trees

In XSLT, you can assign values to variables and parameters in two ways: through a select attribute on the variable-binding element (such as <xsl:variable>) and through the variable-binding element's content.

In XSLT 1.0, when you use the content of a variable-binding element, you actually create a Result Tree Fragment (RTF): a small tree, with its own root node. This RTF is unlike the source tree(s) used in the transformation, however, in that you cannot use XPath to address into the RTF. Once an RTF has been constructed, pretty much the only things that you can do with it are copy it or use its string value. For example, say you declare a $menus variable as follows:

<xsl:variable name="menus">
  <menu name="File">
    <menuItem name="New..." shortcut="Ctrl-N" />
    <menuItem name="Open..." shortcut="Ctrl-O" />
    <menuItem name="Save..." shortcut="Ctrl-S" />
    ...
  </menu>
  ...
</xsl:variable>

In XSLT 1.0, the $menus variable will hold an RTF: a root node with several <menu> element children. You cannot then query into that RTF, however; doing something like:

<xsl:value-of select="$menus/menu/menuItem[starts-with(@shortcut, 'Ctrl')]/@name" />

will give you an error in XSLT 1.0.

Most XSLT 1.0 processors provide a xx:node-set() extension function that will convert an RTF into a node set, returning the root node of the RTF. With the exception of MSXML[1], the major XSLT processors support the exsl:node-set() extension function from Extensions to XSLT (EXSLT), so you can do:

<xsl:value-of select="exsl:node-set($menus)/menu/menuItem[starts-with(@shortcut, 'Ctrl')]/@name" />

In XSLT 2.0, Result Tree Fragments no longer exist. Instead, when you create a variable or parameter using the content of a variable-binding element, you create a temporary tree and the variable or parameter's value is the document node[2] of that temporary tree. Thus you can use XPath to access information from the temporary tree in the way you'd expect:

<xsl:value-of select="$menus/menu/menuItem[starts-with(@shortcut, 'Ctrl')]/@name" />

Being able to create a tree that you then process further is useful in several situations:

  • To break up a complex transformation into several steps, usually by filtering, sorting or annotating nodes during an initial pass so that the later transformation is easier to complete.

  • To create lookup tables to translate from codes or numbers to labels or vice versa, as you would create an array or matrix in another programming language.

  • To iteratively process a document until a certain constraint is true.
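
As a minimal sketch of the lookup-table case, suppose the source contains <order> elements whose status attribute holds a one-letter code. The $statuses variable, the <status> elements and the <order> element are all hypothetical names invented for this illustration:

<xsl:variable name="statuses">
  <status code="N" label="New" />
  <status code="S" label="Shipped" />
  <status code="C" label="Cancelled" />
</xsl:variable>

<xsl:template match="order">
  <!-- capture the code first: inside the predicate, @code belongs to the <status> element -->
  <xsl:variable name="code" select="@status" />
  <p>
    <xsl:value-of select="$statuses/status[@code = $code]/@label" />
  </p>
</xsl:template>

Because $statuses holds the document node of a temporary tree, the path $statuses/status[...] can be used directly in XSLT 2.0; in XSLT 1.0 the same lookup would need a node-set() extension function.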

3. Result Documents

There are two sets of changes in XSLT 2.0 surrounding result documents. The first is the ability to create multiple result documents during a single transformation. The second is the extra power of <xsl:output> as a means of controlling serialisation.

3.1. Multiple Result Documents

Most XSLT 1.0 processors (MSXML being the main exception) support some method of generating multiple result documents. Creating multiple result documents from the same transformation is useful when:

  • Paginating output, so that each document only includes part of the result of the transformation. For example, in transforming a book, each <chapter> element might create a separate result document; in transforming a table of records, you might want to only include 20 records on each page.

  • Generating pages that use Hypertext Markup Language (HTML) frames.

  • Creating supplementary files that are referenced by the main output, for example Scalable Vector Graphics (SVG) graphics, Cascading Style Sheets (CSS) stylesheets, or files containing meta information about the main output.

Without the ability to create multiple result documents from a single transformation, XSLT 1.0 applications usually use parameters to indicate which page should be created from a particular transformation, and use separate transformations each time. This can take more time, particularly when the same calculations must be repeated for each page.

In XSLT 2.0, secondary result trees are created within an <xsl:result-document> element. The href attribute, which is an attribute value template and thus can be set dynamically, specifies a Uniform Resource Identifier (URI) that acts as an identifier for the secondary result tree. If the secondary result tree is ever serialised, the URI may act as the location to which it is written, but all that is guaranteed is that if you use relative URIs for several secondary result trees then the relationships between the various result trees will be retained, wherever they are saved. For example, if you create two documents with:

<xsl:result-document href="navigation.html">...</xsl:result-document>
<xsl:result-document href="main.html">...</xsl:result-document>

then any references to navigation.html from the second document will point to the first document, and likewise any references to main.html from the first document will point to the second document.

Note

URIs are not limited to file or Hypertext Transfer Protocol (HTTP) URIs, though different XSLT 2.0 processors are likely to support different URI schemes. It would be perfectly possible for a processor to support mailto URIs by emailing the serialised result tree, or to support HTTP URIs by creating an HTTP POST request that includes the relevant result tree.

One thing to watch out for is that secondary result trees cannot be created during the creation of a temporary tree, so you can't use <xsl:result-document> at any level within a variable-binding element such as <xsl:variable> or <xsl:with-param>. This has repercussions when debugging or when using a multi-step transformation as described above: if you want to store the result of applying templates to a set of nodes in a variable, then none of the templates can include the creation of additional output documents. It's probably therefore a good idea to keep the generation of result trees as separate as possible: create secondary result trees at the same level at which you create the document element of the output rather than deep within that document.
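
The following sketch shows one way of following that advice when paginating a book by chapter. It assumes a hypothetical source document with a <book> document element containing <chapter> elements, each with a <title> child; the file names are likewise invented for the example:

<xsl:template match="/book">
  <!-- secondary result trees: one per chapter, created alongside the main output -->
  <xsl:for-each select="chapter">
    <xsl:result-document href="chapter{position()}.html">
      <html>
        <body><xsl:apply-templates /></body>
      </html>
    </xsl:result-document>
  </xsl:for-each>
  <!-- principal result tree: an index that links to each chapter by relative URI -->
  <html>
    <body>
      <xsl:for-each select="chapter">
        <p><a href="chapter{position()}.html"><xsl:value-of select="title" /></a></p>
      </xsl:for-each>
    </body>
  </html>
</xsl:template>

Since every <xsl:result-document> sits directly in the template for the document element, the templates applied to the chapter content can still be reused inside a variable without hitting the restriction described above.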

3.2. Output Serialisation

As in XSLT 1.0, the <xsl:output> top-level element controls how a result tree is serialised. There are several changes to <xsl:output> in the XSLT 2.0 Working Draft:

  • output definitions can be given names, so that they can be referred to from <xsl:result-document> elements in order to control how secondary result trees are serialised

  • an additional output method, xhtml, produces well-formed HTML

  • two extra attributes control serialisation in HTML and Extensible Hypertext Markup Language (XHTML) output: escape-uri-attributes, which governs whether attributes that hold URIs, such as the href attribute on the <a> element and the src attribute on the <img> element, are URI-escaped; and include-content-type, which determines whether a <meta> element is added to specify the content type (and character encoding) of the document

  • you now have control over whether the output is Unicode normalised, using the normalize-unicode attribute; only normalisation to Unicode Normalisation Form C is supported

  • during serialisation, characters can be substituted with strings, providing an alternative for disable-output-escaping
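
As a sketch of the first three items working together, a named output definition might be declared once and then referenced from an <xsl:result-document> element. The name "page" is a hypothetical choice, and the format attribute used to refer to it is an assumption about the draft in force, so check it against your processor:

<!-- a named output definition for XHTML pages -->
<xsl:output name="page"
            method="xhtml"
            include-content-type="yes"
            escape-uri-attributes="yes" />

<xsl:template match="/">
  <!-- the secondary result tree is serialised using the "page" definition -->
  <xsl:result-document href="main.html" format="page">
    <html>
      <head><title>Main</title></head>
      <body><xsl:apply-templates /></body>
    </html>
  </xsl:result-document>
</xsl:template>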

The last of these changes is worth looking at in a little more detail. In XSLT 1.0, users generating nearly-XML have to use disable-output-escaping to do so. For example, Java Server Pages (JSP) pages typically embed instructions using <%...%> syntax, which is not well-formed XML. In the following JSP example, the parts of the page that are not well-formed XML are the <%...%> instructions and the value of the page attribute on the <jsp:include> element:

<%@ page language="java" %>
<jsp:useBean id="internatBean" scope="request"
             class="com.devsphere.examples.mapping.internat.InternatBean"/>
<HTML>
  <HEAD><TITLE>Internat bean</TITLE></HEAD>
  <BODY>
    <H3>Internationalization Example</H3>
    <HR/>
    <%
      String suffix = "";
      if (internatBean.getLanguage() != 0)
        suffix = "_" + internatBean.getLocale().toString();
      String inclName = "InternatIncl" + suffix + ".jsp";
    %>
    <jsp:include page="<%=inclName%>" flush="true"/>
    <P><B>InternatBean properties: </B></P>
    <P> locale = <jsp:getProperty name="internatBean" property="locale"/></P>
    <P> parsedDate = <jsp:getProperty name="internatBean" property="parsedDate"/></P>
    <P> parsedNumber = <jsp:getProperty name="internatBean" property="parsedNumber"/></P>
  </BODY>
</HTML>

Using disable-output-escaping in XSLT 2.0, the XSLT might look like:

<xsl:template match="/example">
  <xsl:text disable-output-escaping="yes"><![CDATA[<%@ page language="java" %>]]></xsl:text>
  <jsp:useBean id="internatBean" scope="request"
               class="com.devsphere.examples.mapping.internat.InternatBean"/>
  <HTML>
    <HEAD><TITLE>Internat bean</TITLE></HEAD>
    <BODY>
      <H3>Internationalization Example</H3>
      <HR/>
      <xsl:text disable-output-escaping="yes"><![CDATA[<%
        String suffix = "";
        if (internatBean.getLanguage() != 0)
          suffix = "_" + internatBean.getLocale().toString();
        String inclName = "InternatIncl" + suffix + ".jsp";
      %>]]></xsl:text>
      <jsp:include flush="true">
        <xsl:attribute name="page" disable-output-escaping="yes">&lt;%=inclName%&gt;</xsl:attribute>
      </jsp:include>
      <P><B>InternatBean properties: </B></P>
      <xsl:for-each select="prop">
        <P>
          <xsl:value-of select="@name" />
          <xsl:text> = </xsl:text>
          <jsp:getProperty name="internatBean" property="{@name}" />
        </P>
      </xsl:for-each>
    </BODY>
  </HTML>
</xsl:template>

Note

It is impossible to disable output escaping on attributes in XSLT 1.0, but in XSLT 2.0 the <xsl:attribute> instruction has a disable-output-escaping attribute that enables you to disable output escaping for an entire attribute. This is shown in the above example for the page attribute on the <jsp:include> element.

In XSLT 2.0, you can specify characters to stand in for strings that should not be escaped on output. These characters will usually come from the Unicode private use area, between #xE000 and #xF8FF. For example, I could set up a character map to state that whenever the character #xE001 is encountered in a text node or attribute node, it should be replaced in the output by the string "<%". Similarly, any occurrence of the character #xE002 should be replaced in the output by the string "%>". Character maps are declared with an <xsl:character-map> element at the top level of the stylesheet, with <xsl:output-character> elements within them defining the mapping from character to string. In this case, the character map might look like:

<xsl:character-map name="jsp">
  <!-- JSP start -->
  <xsl:output-character character="&#xE001;" string="&lt;%" />
  <!-- JSP end -->
  <xsl:output-character character="&#xE002;" string="%&gt;" />
</xsl:character-map>

The <xsl:output> element can point to one or more character maps by name to indicate which replacements should be made when the document is serialised:

<xsl:output use-character-maps="jsp" />

In this case, to get the output above, the XSLT would look like:

<xsl:template match="/example">
  &#xE001;@ page language="java" &#xE002;
  <jsp:useBean id="internatBean" scope="request"
               class="com.devsphere.examples.mapping.internat.InternatBean"/>
  <HTML>
    <HEAD><TITLE>Internat bean</TITLE></HEAD>
    <BODY>
      <H3>Internationalization Example</H3>
      <HR/>
      &#xE001;
        String suffix = "";
        if (internatBean.getLanguage() != 0)
          suffix = "_" + internatBean.getLocale().toString();
        String inclName = "InternatIncl" + suffix + ".jsp";
      &#xE002;
      <jsp:include page="&#xE001;=inclName&#xE002;" flush="true" />
      <P><B>InternatBean properties: </B></P>
      <xsl:for-each select="prop">
        <P>
          <xsl:value-of select="@name" />
          <xsl:text> = </xsl:text>
          <jsp:getProperty name="internatBean" property="{@name}" />
        </P>
      </xsl:for-each>
    </BODY>
  </HTML>
</xsl:template>

Note

In most cases, for readability, the stylesheet will use an entity to represent the replaceable characters, such as &jsp-start; for &#xE001; and &jsp-end; for &#xE002; in the above example.
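
A minimal sketch of how such entities could be declared, using an internal DTD subset at the top of the stylesheet. The entity names follow the ones suggested above; the single template is only there to show the entities in use:

<!DOCTYPE xsl:stylesheet [
  <!ENTITY jsp-start "&#xE001;">
  <!ENTITY jsp-end   "&#xE002;">
]>
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output use-character-maps="jsp" />

  <xsl:character-map name="jsp">
    <xsl:output-character character="&jsp-start;" string="&lt;%" />
    <xsl:output-character character="&jsp-end;" string="%&gt;" />
  </xsl:character-map>

  <xsl:template match="/">
    <!-- the XML parser expands the entities before the XSLT processor sees them -->
    <xsl:text>&jsp-start;@ page language="java" &jsp-end;</xsl:text>
  </xsl:template>

</xsl:stylesheet>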

Using character maps is much more robust than using disable-output-escaping because the unescaped characters are guaranteed to persist even when a text node or attribute is copied in a temporary tree. In addition, unlike disable-output-escaping, all processors that support serialisation of result trees will support character maps, so you are more likely to get consistent results across processors. It's hoped that this capability will be able to replace all "good" uses of disable-output-escaping, enabling disable-output-escaping to be deprecated eventually.

4. Grouping

Grouping is one of the trickier things to do using XSLT 1.0. The standard approach is to use the Muenchian Method, which involves declaring keys that assign nodes different values depending on what group they should belong to, and then creating a node set containing one node from each group by picking the first node from each group. The Muenchian Method is complicated, especially when creating nested groups, not obvious to new users, and involves a fair amount of processing (to construct the relevant node sets) and memory (to hold the keys).

Thankfully, XSLT 2.0 introduces new methods for grouping, namely a new instruction, <xsl:for-each-group>, and a new function, current-group().

The <xsl:for-each-group> element works in a very similar way to <xsl:for-each> in that it has a select attribute that is used to select the items to be grouped, and holds a number of <xsl:sort> elements (for sorting the groups) followed by the instructions that create content for the particular group. As well as the select attribute, the <xsl:for-each-group> instruction can take one of four attributes, which are used to determine how the selected items are grouped together. These four attributes can be divided into two groups:

  • Attributes for grouping by value; the attribute is used to assign each item a grouping key, and the items are assigned to groups based on that grouping key. The relevant attributes are:

    • group-by, which ignores the order in which the items appear in the selected sequence.

    • group-adjacent, which only groups together adjacent items with the same value.

  • Attributes for grouping in sequence; these can only be used to group nodes, and the nodes are grouped in the order they appear. The relevant attributes are:

    • group-starting-with, which holds a pattern that matches the first node in each group.

    • group-ending-with, which holds a pattern that matches the last node in each group.

Within an <xsl:for-each-group> element, the current item is the first item in the particular group that's being processed. The members of the group can be returned with the current-group() function.

An example of grouping by value is taking a set of transactions in the form:

<IncomeStatement>
  <Trans LocalAcc="8100" LocalDescription="Erl. RV Stellenanzeigen 16%" 
         AccNo="401000" Period="2000-10-01T00:00:00" AmountEUR="-882705.05"/>
  <Trans LocalAcc="8101" LocalDescription="Erl. RV Stellenanzeigen nstb." 
         AccNo="401000" Period="2000-10-01T00:00:00" AmountEUR="-123788.21"/>
  <Trans LocalAcc="8100" LocalDescription="Erl. RV Stellenanzeigen 16%" 
         AccNo="401000" Period="2000-11-01T00:00:00" AmountEUR="-1268347.92"/>
  <Trans LocalAcc="8101" LocalDescription="Erl. RV Stellenanzeigen nstb." 
         AccNo="401000" Period="2000-11-01T00:00:00" AmountEUR="56790.6"/>
  ...
</IncomeStatement>

and grouping them first by their Period attribute and then by their AccNo attribute to create:

<IncomeStatement>
  <Period month="2000-10">
    <Account No="401000">
      <Trans LocalAcc="8100" LocalDescription="Erl. RV Stellenanzeigen 16%" AmountEUR="-882705.05"/>
      <Trans LocalAcc="8101" LocalDescription="Erl. RV Stellenanzeigen nstb." AmountEUR="-123788.21"/>
      ...
    </Account>
    ...
  </Period>
  <Period month="2000-11">
    <Account No="401000">
      <Trans LocalAcc="8100" LocalDescription="Erl. RV Stellenanzeigen 16%" AmountEUR="-1268347.92"/>
      <Trans LocalAcc="8101" LocalDescription="Erl. RV Stellenanzeigen nstb." AmountEUR="56790.6"/>
      ...
    </Account>
  </Period>
  ...
</IncomeStatement>

This can be achieved with the following XSLT:

<xsl:template match="IncomeStatement">
  <IncomeStatement>
    <xsl:for-each-group select="Trans" group-by="@Period">
      <xsl:sort select="@Period" />
      <Period month="{substring(@Period, 1, 7)}">
        <xsl:for-each-group select="current-group()" group-by="@AccNo">
          <Account No="{@AccNo}">
            <xsl:for-each select="current-group()">
              <Trans>
                <xsl:copy-of select="@LocalAcc | @LocalDescription | @AmountEUR" />
              </Trans>
            </xsl:for-each>
          </Account>
        </xsl:for-each-group>
      </Period>
    </xsl:for-each-group>
  </IncomeStatement>
</xsl:template>

Grouping in sequence is necessary when you want to collect together elements that are conceptually related but are not wrapped together. For example, in some XML I have been dealing with recently, the <reqpers> element has the content model (person, perscat, perskill?, trade?)+; each <person> element starts a group that includes a <perscat> element and, optionally, a <perskill> and/or <trade> element. To process this XML into an HTML table, we can use:

<xsl:template match="reqpers">
  <table>
    <xsl:for-each-group select="*" group-starting-with="person">
      <tr>
        <td><xsl:apply-templates select="current-group()[self::person]" /></td>
        <td><xsl:apply-templates select="current-group()[self::perscat]" /></td>
        <td><xsl:apply-templates select="current-group()[self::perskill]" /></td>
        <td><xsl:apply-templates select="current-group()[self::trade]" /></td>
      </tr>
    </xsl:for-each-group>
  </table>
</xsl:template>

There are still some limitations with the grouping provided in XSLT 2.0. For example, to group by more than one thing at once, you either have to create a grouping key by concatenating values or you have to nest two <xsl:for-each-group> elements inside each other. Creating groups where a particular item can belong to more than one of the groups (for example creating an index where each section can have more than one keyword) is also not straightforward. Regardless, the <xsl:for-each-group> instruction will simplify many stylesheets.
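
As a sketch of the concatenated-key approach, the transactions from the earlier example could be grouped by period and account in a single pass. The '|' separator is an arbitrary choice, assumed not to occur in either attribute value, and the month attribute on <Account> is invented for this flat output:

<xsl:template match="IncomeStatement">
  <IncomeStatement>
    <!-- one group per distinct combination of Period and AccNo -->
    <xsl:for-each-group select="Trans" group-by="concat(@Period, '|', @AccNo)">
      <Account No="{@AccNo}" month="{substring(@Period, 1, 7)}">
        <xsl:for-each select="current-group()">
          <Trans>
            <xsl:copy-of select="@LocalAcc | @LocalDescription | @AmountEUR" />
          </Trans>
        </xsl:for-each>
      </Account>
    </xsl:for-each-group>
  </IncomeStatement>
</xsl:template>

Unlike the nested version, this produces a flat list of <Account> elements rather than accounts wrapped in <Period> elements, so it suits cases where the intermediate level is not needed in the output.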

5. Function Definitions

The ability to create user-defined functions exists in most XSLT processors. Many XSLT processors support the definition of user-defined functions using external languages such as JavaScript, VBScript or Java. Quite a few also support using XSLT to define functions, using the <func:function> and <func:result> elements defined by the EXSLT initiative. XSLT 2.0 supports using XSLT to define functions (termed stylesheet functions), but not using other languages, since they are too diverse in their capabilities to provide standardised support.

In XSLT 2.0, a stylesheet function is defined using the <xsl:function> element at the top level of the stylesheet. The name attribute specifies the name of the function, as a qualified name. (All stylesheet functions must belong to a namespace, and their names must be written with a prefix to avoid confusion with built-in functions with the same name.)

The content of the <xsl:function> element starts with any number of <xsl:param> elements, which declare the arguments for the function. You can't have optional arguments in XSLT 2.0, but you can have two function definitions with the same name but different numbers of arguments, which enables you to simulate optional arguments. For example, in the following, the str:align function has two definitions: one with three arguments that allows you to specify an alignment, and one with two arguments that calls the first with the value 'left' as the third argument:

<xsl:function name="str:align">
  <xsl:param name="string" />
  <xsl:param name="padding" />
  <xsl:param name="alignment" />
  <xsl:variable name="str-length" select="string-length($string)" /> 
  <xsl:variable name="pad-length" select="string-length($padding)" /> 
  <xsl:variable name="result">
    <xsl:choose>
      <xsl:when test="$str-length >= $pad-length">
        <xsl:value-of select="substring($string, 1, $pad-length)" /> 
      </xsl:when>
      <xsl:when test="$alignment = 'center'">
        <xsl:variable name="half-remainder" select="floor(($pad-length - $str-length) div 2)" /> 
        <xsl:value-of select="concat(substring($padding, 1, $half-remainder),
                                     $string,
                                     substring($padding, $str-length + $half-remainder + 1))" /> 
      </xsl:when>
      <xsl:when test="$alignment = 'right'">
        <xsl:value-of select="concat(substring($padding, 1, $pad-length - $str-length),
                                     $string)" /> 
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="concat($string, substring($padding, $str-length + 1))" /> 
      </xsl:otherwise>
    </xsl:choose>
  </xsl:variable>
  <xsl:result select="string($result)" />
</xsl:function>

<xsl:function name="str:align">
  <xsl:param name="string" />
  <xsl:param name="padding" />
  <xsl:result select="str:align($string, $padding, 'left')" />
</xsl:function>

The last item in the content of the <xsl:function> element is an <xsl:result> element, which specifies the result of the function as a whole. The <xsl:result> element is like <xsl:param> and <xsl:variable>, in that it can either have a select attribute, in which case its value is set using an XPath expression and could be anything, or have content, in which case it will usually be set to a temporary tree.

As we'll see later, both the arguments to the function and the result of the function can be assigned types.

Stylesheet functions have many of the same features as named templates, the main difference being that they can be called from within an XPath expression or an XSLT pattern. This makes them very useful for:

  • Creating a value to sort by using <xsl:sort>

  • Creating a value to index by using <xsl:key>

  • Creating a value to group by using <xsl:for-each-group>

  • Carrying out complex tests on nodes, for use in match patterns in templates or keys
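
As an illustrative sketch of the sorting and grouping cases, a function could compute the month of a transaction for the income statement example. The dt:month function and the dt prefix are hypothetical, and the definition follows the <xsl:function> and <xsl:result> syntax described in this paper, which may differ in later versions of the specification:

<!-- dt is a hypothetical prefix, assumed to be bound to a namespace of your own
     on <xsl:stylesheet>, for example xmlns:dt="http://example.org/dates" -->
<xsl:function name="dt:month">
  <xsl:param name="period" />
  <xsl:result select="substring($period, 1, 7)" />
</xsl:function>

<xsl:template match="IncomeStatement">
  <IncomeStatement>
    <!-- the same function supplies the grouping key, the sort key and the label -->
    <xsl:for-each-group select="Trans" group-by="dt:month(@Period)">
      <xsl:sort select="dt:month(@Period)" />
      <Period month="{dt:month(@Period)}">
        <xsl:copy-of select="current-group()" />
      </Period>
    </xsl:for-each-group>
  </IncomeStatement>
</xsl:template>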

One of the potential "gotchas" of stylesheet functions is that they cannot refer to information from the focus (the context item, position and length) at the time the function was called. Stylesheet functions cannot follow the pattern of functions like name() and string(), which default to using the context node as their argument if no argument is specified. If you want to use information from the focus in the function, you have to pass it in as an explicit argument. For example, to test whether an element is an HTML heading element, you need to define a function like:

<xsl:function name="html:is-heading">
  <xsl:param name="element" />
  <xsl:result select="boolean($element[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6])" />
</xsl:function>

A template that matches only heading elements could then be declared by calling this function with:

<xsl:template match="*[html:is-heading(.)]">
  ...
</xsl:template>

Despite this, it seems likely that stylesheet functions will take over from named templates as the mechanism for creating named blocks of code that return a result based on some parameters.

6. Text Parsing

One of the commonly requested new features in XPath 2.0 is support for regular expressions. XPath 2.0 supports three functions that use regular expressions:

  • match(string, regex, flags?), which returns true if a substring of the string matches the regular expression

  • replace(string, regex, replacement, flags?), which returns the string with every substring that matches the regular expression replaced using the replacement string. The replacement string can contain references in the form $1 to $9 to indicate the values of sub-expressions from the regular expression

  • tokenize(string, regex, flags?), which returns a sequence of strings created by splitting the string on every occurrence of the regular expression

Note

The flags in each of these functions are used to indicate line-by-line (rather than whole string) and/or case-insensitive matching.
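
Two short illustrations of replace() and tokenize(), using literal strings invented for the purpose (the <item> element is likewise hypothetical):

<!-- gives '07/05/2003': $1, $2 and $3 refer to the three bracketed sub-expressions -->
<xsl:value-of select="replace('2003-05-07', '(\d+)-(\d+)-(\d+)', '$3/$2/$1')" />

<!-- splits on a comma plus any following whitespace, giving 'a', 'b' and 'c' -->
<xsl:for-each select="tokenize('a, b,c', ',\s*')">
  <item><xsl:value-of select="." /></item>
</xsl:for-each>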

In XSLT, these functions are joined by the <xsl:analyze-string> element and its two possible children, <xsl:matching-substring> and <xsl:non-matching-substring>.

The <xsl:analyze-string> element has two mandatory attributes: select, which is an expression that returns a string to be analysed; and regex (an attribute value template), which is a regular expression to be used in the analysis. The optional flags attribute controls the way the regular expression works in the same way as the flags argument in the regular expression functions above.

The selected string is matched by the regular expression and broken up into substrings, some of which match the regular expression and some of which don't. The <xsl:matching-substring> element determines how substrings that match the regular expression are processed, while the <xsl:non-matching-substring> element determines how substrings that don't match the regular expression are processed.

A simple example is the problem of taking a poem in which lines are separated by line breaks:

<poem>
  Mary had a little lamb,
  Its fleece was white as snow;
  And everywhere that Mary went
  The lamb was sure to go.
</poem>

and turning it into a <poem> element in which each line is wrapped in a separate <line> element:

<poem>
  <line>Mary had a little lamb,</line>
  <line>Its fleece was white as snow;</line>
  <line>And everywhere that Mary went</line>
  <line>The lamb was sure to go.</line>
</poem>

This can be achieved with the following XSLT 2.0 code:

<xsl:template match="poem">
  <poem>
    <xsl:analyze-string select="." regex="\S.*" flags="m">
      <xsl:matching-substring>
        <line><xsl:value-of select="." /></line>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </poem>
</xsl:template>

The flags attribute on the <xsl:analyze-string> element tells the processor to first parse the string into lines. Within each line, the regular expression matches a sequence of characters starting with a non-whitespace character. This sequence of characters is wrapped in a <line> element; the substrings that don't match the regular expression are simply ignored.

If there are sub-expressions within the regular expression, the substrings that match these sub-expressions are available using the regex-group() function within the <xsl:matching-substring> element. The regex-group() function takes a number as an argument and returns the substring matching that sub-expression.
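
As a small sketch, suppose a hypothetical <settings> element holds text such as "width=200 height=100". Each name=value pair could be turned into an element like this (the <settings> and <setting> names are invented for the example):

<xsl:template match="settings">
  <settings>
    <!-- the regex attribute is an attribute value template, so any literal curly
         braces in a regular expression would need to be doubled; none are used here -->
    <xsl:analyze-string select="." regex="(\w+)=(\w+)">
      <xsl:matching-substring>
        <!-- group 1 is the name before the equals sign, group 2 the value after it -->
        <setting name="{regex-group(1)}" value="{regex-group(2)}" />
      </xsl:matching-substring>
    </xsl:analyze-string>
  </settings>
</xsl:template>

The whitespace between the pairs does not match the regular expression and, with no <xsl:non-matching-substring> element present, is discarded.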

This capability for parsing strings is joined in XSLT 2.0 by the unparsed-text() function, which works like the document() function but for non-XML files. With the unparsed-text() function, it is possible to access text files of all kinds: comma-delimited files, HTML files, Rich Text Format files, Java source code and so on. With the <xsl:analyze-string> instruction, you are able to parse them.

Effectively, this enables XSLT to be used as a transformation language for more than just XML: any text format can be processed, as long as you can construct the regular expressions to do so. But it is here that difficulties arise: the regular expression capabilities supported by XSLT and XPath cannot cope with balancing braces, so constructing parse trees will often have to be done in two or more phases.
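
At the simple end of that range, a comma-delimited file can be handled in one pass. The sketch below assumes a hypothetical UTF-8 file called data.csv, located relative to the stylesheet, whose fields contain no quoted or embedded commas; the template and element names are invented:

<xsl:template name="csv-to-xml">
  <rows>
    <!-- split the file into lines, dropping empty ones, then split each line into fields -->
    <xsl:for-each select="tokenize(unparsed-text('data.csv', 'UTF-8'), '\r?\n')[. != '']">
      <row>
        <xsl:for-each select="tokenize(., ',')">
          <field><xsl:value-of select="." /></field>
        </xsl:for-each>
      </row>
    </xsl:for-each>
  </rows>
</xsl:template>

Quoted fields that contain commas are exactly the kind of case where a single regular expression stops being enough and a second phase is needed.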

7. Typing

The impact of typing is probably the main "unknown" in XSLT 2.0. Although MSXML4 has introduced extension functions that will return the type of a node, these haven't been used by the XSLT community to any great extent. There are two levels at which typing can occur in XSLT: on values, such as strings, numbers and dates, and on nodes (specifically elements and attributes), which are assigned a type during validation.

7.1. Typed Values

Values have always been typed in XSLT — they could be strings, numbers or booleans — but XPath 2.0 introduces many more types (all the built-in types from XML Schema, plus a few more specially created for XPath 2.0) and the type of a value is now more important to an XSLT author.

In XPath 1.0, it was rarely necessary to be aware of the type returned by a function or expression because if you happened to pass an argument of the wrong type to a function, or use it in an expression, then it would be converted automatically into the required type. This might give you unexpected output on occasion, but it would never give you an error. In XPath 2.0, you will get an error if you pass a value of the wrong type to a function or use it in an expression where it is not expected. For example, in the function call:

string-pad(' ', $n)

The $n variable must be convertible to an xs:integer — it must be a value of type xs:integer or one of its subtypes, or a node that can be atomised to give an xs:integer (nodes that are typed as xs:integer or one of its subtypes and nodes that do not have a type at all can be atomised in this way). If $n holds an xs:string, an xs:gYear, or even an xs:decimal with no significant decimal places, you will get a type error.

You can cast the $n variable to an xs:integer using a casting function (xs:integer()) or using a cast as expression; casting functions are easier to use as long as the names of the types that you use don't clash with the names of the functions that you define. For example, you can use:

string-pad(' ', xs:integer($n))

Again, however, not every value can be cast to an xs:integer. Strings that conform to a lexical representation of xs:integer can be converted, as can numeric values. However, despite the fact that, as a string, it looks like an integer, an xs:gYear value cannot be converted to an xs:integer; to convert it, you have to cast to an xs:string or xdt:untypedAtomic first, and then back into an xs:integer.
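The two-step cast described above might look like the following sketch. The variable names and the <next-year> result element are hypothetical, and the year value is assumed to carry no timezone (a timezone suffix would stop the string from being a valid integer).

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:template match="/">
    <!-- A value of type xs:gYear (hypothetical example value) -->
    <xsl:variable name="year" select="xs:gYear('2003')"/>

    <!-- xs:integer($year) would be an error: xs:gYear cannot be cast
         directly to xs:integer. Going via xs:string works. -->
    <xsl:variable name="n" as="xs:integer"
                  select="xs:integer(xs:string($year))"/>

    <!-- The result is now usable in arithmetic -->
    <next-year>
      <xsl:value-of select="$n + 1"/>
    </next-year>
  </xsl:template>

</xsl:stylesheet>
```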

XSLT 2.0 itself gives different behaviour for different types in a couple of places:

  • When sorting with <xsl:sort>, the type of the value selected by the select attribute determines the way in which sorting is carried out; strings are sorted alphabetically, numbers numerically and so on.

  • When grouping with <xsl:for-each-group>, the type of the value selected by the group-by or group-adjacent attributes determines how the values are compared with each other when it comes to creating groups; if they're xs:decimal values, for example, then items with grouping keys of 1 and 1.0 will be grouped together.

In many cases, this means that you should explicitly cast the sort key or grouping key to suit the kind of sorting/grouping that you want to do. If you want to sort/group by date, for example, you should wrap the expression you use in the select/group-by/group-adjacent attribute in a call to the xs:date() casting function.
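As a rough sketch of this advice, the stylesheet below groups and sorts by date. The <Transactions> wrapper, the <report> and <day> result elements and the shape of the input are assumptions made for illustration; the date attributes are assumed to be untyped strings in YYYY-MM-DD form.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Assumed input: <Transactions> holding <Trans date="YYYY-MM-DD"/> elements -->
  <xsl:template match="Transactions">
    <report>
      <!-- Grouping keys are compared as xs:date values, not as strings -->
      <xsl:for-each-group select="Trans" group-by="xs:date(@date)">
        <!-- Sort the groups chronologically, again by casting to xs:date -->
        <xsl:sort select="xs:date(@date)"/>
        <day date="{current-grouping-key()}"
             count="{count(current-group())}"/>
      </xsl:for-each-group>
    </report>
  </xsl:template>

</xsl:stylesheet>
```

Casting in both attributes keeps the grouping and the ordering consistent with each other, whatever type (if any) the date attribute was given during validation.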

XSLT 2.0 allows you to declare the type of a variable using the as attribute. The as attribute is optional on all variable-binding elements (including <xsl:variable>, <xsl:param> and <xsl:with-param>). On <xsl:param> elements, the as attribute indicates the required type of the parameter, and an error will be raised if the value that you pass to the parameter is not convertible to this required type. Similarly, the <xsl:key> element has an as attribute to indicate the type of the values stored in the key, and the <xsl:function> element has one to specify the return type of the function.

The as attribute holds a SequenceType, which is a pattern that matches sequences. If you use the as attribute, then the processor will raise an error if the value that's selected is not of the type specified in the as attribute. For example, if you do:

<xsl:variable name="n" as="xs:integer" select="..." />

then the value specified in the select attribute must be convertible to an xs:integer. Again, you have to explicitly cast the value to an xs:integer if it is not convertible (for example if it's an xs:string or xs:decimal).

Declaring the types of parameters is a useful feature both for those who write reusable templates/functions (they don't have to write their own error code to detect erroneous use of the template/function) and for those who use them (they will get errors if they use the wrong kind of values). The benefit of the remainder of the static typing features is not as clear cut; if the document you're working with has not been type-annotated, or is annotated with types that coincide with those you need to use in the stylesheet, then it will have little impact. If the document is annotated with types that are not derived from those you want to use in the stylesheet, then you will have to do a lot of casting.
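A reusable function with declared parameter and return types might be sketched as follows. The my:repeat function and its namespace are hypothetical, and the use of <xsl:sequence> to return the result is an assumption about the syntax, which may differ in the draft that your processor implements.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.com/my-functions"
    exclude-result-prefixes="xs my">

  <!-- Hypothetical function: repeats $text $count times -->
  <xsl:function name="my:repeat" as="xs:string">
    <xsl:param name="text" as="xs:string"/>
    <xsl:param name="count" as="xs:integer"/>
    <xsl:sequence
        select="string-join(for $i in 1 to $count return $text, '')"/>
  </xsl:function>

  <xsl:template match="/">
    <out>
      <!-- Accepted: the arguments are an xs:string and an xs:integer -->
      <xsl:value-of select="my:repeat('=', 10)"/>
      <!-- my:repeat('=', '10') would raise a type error, because an
           xs:string is not convertible to the required xs:integer. -->
    </out>
  </xsl:template>

</xsl:stylesheet>
```

The function contains no error-checking code of its own; the declared types do that work on behalf of both the author and the caller.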

7.2. Typed Nodes

In the data model for XPath 2.0, every node, and in particular each element and attribute, has a type. This type can be referred to in order to treat nodes with the same kind of content in the same way. New node tests in XPath 2.0 allow you to create templates that can be used with all elements/attributes of a particular type. For example, to create a template that matches all elements with a type of (or derived from) xs:anyURI, you can use:

<xsl:template match="element(*, xs:anyURI)">
  ...
</xsl:template>
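Filled in, such a template might look like the sketch below, which turns every element typed as xs:anyURI into a link. It assumes a processor that supports typed nodes and an input document that has been validated so that the relevant elements carry the xs:anyURI type; in an unvalidated document, the template would simply never match.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Matches any element, whatever its name, whose type is
       xs:anyURI or a type derived from it -->
  <xsl:template match="element(*, xs:anyURI)">
    <a href="{.}">
      <xsl:value-of select="."/>
    </a>
  </xsl:template>

</xsl:stylesheet>
```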

Having typed nodes can also be useful when sorting or grouping because, as we've seen above, the type of the node will determine the way the sorting/grouping works. If an attribute is labelled as being an xs:integer, for example, then sorting on that attribute will give you a numeric (rather than alphabetic) sort.

Since XSLT allows you to create elements and attributes as well as access information about those loaded from an input document, it also gives you the ability to indicate the type of an element or attribute when you create it. For example:

<Trans>
  <xsl:attribute name="date" type="xs:date">
    <xsl:value-of select="substring(@Period, 1, 10)" />
  </xsl:attribute>
  ...
</Trans>

creates a <Trans> element with a date attribute of type xs:date. If those <Trans> elements are later sorted by date:

<xsl:for-each select="Trans">
  <xsl:sort select="@date" />
  ...
</xsl:for-each>

then the sort will be based on the value of the date attribute as a date rather than as a string or number.

The types that are assigned to nodes using XSLT are only usable within temporary trees or if the result tree is passed on directly to another process. When you serialise a result tree, type information from the result tree is lost. You can create xsi:type attributes to preserve some type information, but nothing requires these xsi:type attributes to specify the same type as that of the element in the result tree.

7.3. Schema Import

The types used in XPath and XSLT 2.0 are arranged in a type hierarchy of supertypes and subtypes. Wherever a supertype is expected, a subtype can be used instead, so for example an xs:integer can be used wherever an xs:decimal is expected.

The type hierarchy that's used by an XSLT processor to work out how two types are related comes from importing a schema using <xsl:import-schema> elements, which appear at the top level of the stylesheet. An <xsl:import-schema> element can point to a schema using the namespace attribute and/or the schema-location attribute. The schemas that are imported in this way can be in any schema language (although it's anticipated that most will use XML Schema), and it is up to the implementation how the type hierarchy is constructed from the schemas.
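A schema import might be sketched as follows. The namespace and schema-location attributes are those described above; the schema file name is hypothetical, and how (or whether) the schema is actually located is up to the processor.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Top-level declaration: imports the types declared for the
       XHTML namespace from a (hypothetical) local schema file -->
  <xsl:import-schema namespace="http://www.w3.org/1999/xhtml"
                     schema-location="xhtml-partial.xsd"/>

  <!-- Templates that rely on the imported types go here -->

</xsl:stylesheet>
```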

Importing schemas into a stylesheet is a useful way of indicating the structure of the documents that the stylesheet is expected to work with and produce. It may also prove useful for importing partial schemas that are never used to validate a document but rather simply define those types that are useful within the stylesheet.

7.4. Substitution Groups

Another piece of information that's available from imported schemas is the substitution group hierarchy, which is a hierarchy of elements that can replace each other within a document. For example, in a schema for XHTML, all the heading elements (<h1> to <h6>) might belong to the same substitution group, headed by an abstract <heading> element. This is done in the schema using top-level element declarations such as:

<xs:element name="heading" abstract="true" />
<xs:element name="h1" substitutionGroup="heading" />
<xs:element name="h2" substitutionGroup="heading" />
<xs:element name="h3" substitutionGroup="heading" />
<xs:element name="h4" substitutionGroup="heading" />
<xs:element name="h5" substitutionGroup="heading" />
<xs:element name="h6" substitutionGroup="heading" />

With this schema imported, in the stylesheet you could then match all the heading elements with the template:

<xsl:template match="element(heading,*)">
  ...
</xsl:template>

Note that in this case, there is no need for the document to have actually been validated against the schema that you use: the elements will be recognised as being part of the heading substitution group on the basis of their name only. In this way, lightweight schemas that address only parts of a document can be used to simplify a stylesheet.

8. Conclusions

There's lots that's new in XSLT 2.0, some of it familiar from the extensions that implementations have been offering on top of XSLT 1.0, and some of it more novel and untested.

The ability to create temporary trees will make XML lookup tables and multi-step transformations more commonplace, reducing the complexity of many stylesheets that currently have to do everything at once. The provision of <xsl:for-each-group> for grouping and the ability to create user-defined functions with XSLT code will also make lots of stylesheets much simpler. Multiple result documents will make it a lot easier to create framesets and paginated documents from a single transformation without scripting, which will make client-side XSLT applications more accessible and more powerful.

The parsed-text() function and support for regular expressions open up the possibility of XSLT being used to process documents in formats other than XML. In some cases, this may prove to be more trouble than it's worth: a Simple API for XML (SAX) filter that will turn an HTML document into a sequence of SAX events is probably going to be a better solution than an XSLT transformation that does the same thing. However, this facility will prove useful for simple formats such as comma-delimited files, and for those who are more comfortable with XSLT than with Java.

Strong static typing is the most contentious of the additions to XSLT, and it remains to be seen whether XSLT implementers and authors will use schemas to any great extent. There are certainly some useful features: in many transformations, the ability to create templates that do the same thing to all elements in a particular group, whether membership is identified through type or through substitution group, will prove very useful. On the other hand, casting from one type to another in order to prevent type errors from being raised may prove to be more trouble than it's worth.

With XSLT 2.0 reaching Last Call, now is the time to test out these new features and to send comments to public-qt-comments@w3.org.

Glossary

CSS

Cascading Style Sheets

EXSLT

Extensions to XSLT

HTML

Hypertext Markup Language

HTTP

Hypertext Transfer Protocol

JSP

JavaServer Pages

RTF

Result Tree Fragment

SAX

Simple API for XML

SVG

Scalable Vector Graphics

URI

Uniform Resource Identifier

XHTML

Extensible Hypertext Markup Language

XML

Extensible Markup Language

XSLT

Extensible Stylesheet Language Transformations

Biography

Jeni Tennison is an independent consultant specialising in XSLT and XML Schema development. She trained as a knowledge engineer, gaining a PhD in collaborative ontology development, and since becoming a consultant has worked on XML and XSLT development in a wide variety of areas, including publishing, water monitoring and financial services.

She is the author of "XSLT & XPath On The Edge" (Hungry Minds, 2001) and "Beginning XSLT" (Wrox, 2002), one of the founders of the EXSLT initiative to standardise extensions to XSLT and XPath and an invited expert in the W3C XSL Working Group. She spends much of her spare time answering people's queries on XSL-List and xmlschema-dev mailing lists.



[1] MSXML has its own extension function for this purpose, msxsl:node-set().

[2] In XPath 2.0, the root of a tree may be an element or other node, so there is now a distinction between the "root node" of a tree, which may be any kind of node, and a "document node", which is a particular kind of node that can only appear as the root node of a tree.