XML Europe 2004

ISO DSDL Overview

Abstract

The notion of "validation" of XML documents covers too many different aspects (structure, content, integrity, business rules, ...) to be performed by a single schema language.

Furthermore, even when a single language is used, documents often need to be transformed, split or normalized to keep the schemas simple.

The ISO DSDL project (ISO/IEC JTC 1 SC 34 WG 1) is standardizing a set of simple, specialized schema languages and pre-validation transformation languages, together with a framework defining how these operations are applied. These languages include well-known technologies such as RELAX NG and Schematron as well as new languages.

This talk gives an overview of the whole project, explains the goal of each part and presents the latest developments of DSDL.

Table of Contents

1. Why DSDL?
1.1. Validation isn't optional
1.1.1. 1 XML document out of 10 contains at least 1 error
1.2. Too diverse for a single language
1.2.1. Structure (nesting of elements and attributes)
1.2.2. Datatypes
1.2.3. Integrity constraints
1.2.4. Business rules
1.3. Requires a modular toolset
2. A multi-part standard
2.1. Part 1: Overview
2.2. Part 2: Regular-grammar-based Validation
2.2.1. A rewrite of the RELAX NG OASIS specification
2.2.2. Example (RELAX NG)
2.2.3. Compact syntax
2.3. Part 3: Rule-based Validation
2.3.1. Schematron is the main input
2.3.2. Hosting language for expressing rules
2.3.3. Example (Schematron)
2.3.4. Extended to support more than XPath 1.0
2.4. Part 4: Selection of Validation Candidates
2.4.1. Splits documents instead of composing schemas
2.4.2. NRL (James Clark) as an input
2.4.3. Example
2.4.4. Supports wildcards, modes, ...
2.5. Part 5: Datatypes
2.5.1. To create new primitive types
2.5.2. Jeni Tennison's proposal as an input
2.5.3. Example
2.6. Part 6: Path-based Integrity Constraints
2.6.1. Not much to say yet
2.7. Part 7: Character Repertoire Validation
2.7.1. Character Repertoire Validation for XML (CRVX)
2.7.2. And a draft to define character sets
2.7.3. Why RELAX NG isn't enough here
2.8. Part 8: Declarative Document Architectures
2.8.1. Builds on Architectural forms
2.9. Part 9: Namespace and Datatype-aware DTDs
2.9.1. To keep SGML DTDs alive
2.10. Part 10: Validation Management
2.10.1. The glue
2.10.2. Example
3. What DSDL should bring you
3.1. The power of a full toolset
3.1.1. Part 2 (RELAX NG): the simplest and most powerful language to validate the structure of XML documents.
3.1.2. Part 3 (Schematron): highly flexible "business rules".
3.1.3. Part 4 (Selection of Validation Candidates): validate composite documents using individual schemas.
3.1.4. Part 5 (Datatypes): Create your own datatypes.
3.1.5. Part 6 (Path-based Integrity Constraints): specify integrity constraints between elements and attributes.
3.1.6. Part 7 (Character repertoire): specify which characters may be used.
3.1.7. Part 8 (Declarative Document Architectures): make implicit information explicit.
3.1.8. Part 9 (Namespace and Datatype-aware DTDs): keep your legacy DTDs alive.
3.1.9. Part 10 (Validation Management): combine these parts and plug in other transformation and validation tools.
Biography

1. Why DSDL?

1.1. Validation isn't optional

1.1.1. 1 XML document out of 10 contains at least 1 error

If I had to give only one reason why DSDL is needed, it would be this one: we must improve the quality of our XML documents.

1.2. Too diverse for a single language

Validation in the broad sense is too diverse for a single schema language. The types of validation applied to XML documents are commonly categorized as follows:

1.2.1. Structure (nesting of elements and attributes)

The first type of validation checks the structure, i.e. the nesting of elements and attributes. It acts at the markup level and does not test the content of text nodes or attributes.

1.2.2. Datatypes

A second category of validation consists in checking the content of text nodes and attributes independently of each other. With the exception of qualified names (QNames) in the content, an unfortunate practice that creates a dependency between the markup and the content, datatype validation ignores the markup and tests only the content of the document.
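
As a small illustration of the QName problem, consider the following W3C XML Schema fragment (the element name isbn is only an example). The value of the type attribute is plain attribute content, yet it can only be interpreted by looking at the namespace declarations in the markup:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- "xs:string" is ordinary attribute content, but its meaning depends on
     the xmlns:xs declaration: if the prefix is renamed in the markup, the
     content has to change too. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="isbn" type="xs:string"/>
</xs:schema>
```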

1.2.3. Integrity constraints

A third category of validation checks that identifiers are unique and that references point to existing identifiers. Integrity constraints may be checked within a document or between documents (link checking).

1.2.4. Business rules

What's left after these three categories is often called business rules. Business rules can be as simple as checking that a date of death, when present, is later than the date of birth, or as complex as spell checking.

1.3. Requires a modular toolset

Acknowledging that the validation of XML documents is vitally important and that it is too complex, diverse and application-dependent to rely on a single schema language, ISO/IEC JTC 1 SC 34 WG 1 is creating DSDL, a multi-part standard defining a modular and extensible toolset to address the issue.

2. A multi-part standard

DSDL is still a work in progress. It is a multi-part specification in which each part presents a different schema language, except Part 1, which is an introduction, and Part 10, which describes the framework itself.

2.1. Part 1: Overview

There isn't much to say about Part 1: it is a road map that describes DSDL and introduces each of the parts. Think of it as a more formal, better written version of this presentation!

2.2. Part 2: Regular-grammar-based Validation

2.2.1. A rewrite of the RELAX NG OASIS specification

This part is RELAX NG. It is a rewrite of the RELAX NG OASIS Technical Committee specification to meet the requirements of ISO publications. Its wording is more formal than the OASIS specification but the features of the language are the same. Any RELAX NG implementation that conforms to either of these two documents also conforms to the other.

DSDL Part 2 has reached the "Final Draft International Standard" (FDIS) stage, the last step before publication as an official ISO standard.

2.2.2. Example (RELAX NG)

<?xml version="1.0" encoding="UTF-8"?>
<element name="library" xmlns="http://relaxng.org/ns/structure/1.0">
  <oneOrMore>
    <element name="book">
      <attribute name="id"/>
      <attribute name="available"/>
      <element name="isbn">
        <text/>
      </element>
      <element name="title">
        <attribute name="xml:lang"/>
        <text/>
      </element>
      <oneOrMore>
        <element name="author">
          <attribute name="id"/>
          <element name="name">
            <text/>
          </element>
          <optional>
            <element name="born">
              <text/>
            </element>
          </optional>
          <optional>
            <element name="died">
              <text/>
            </element>
          </optional>
        </element>
      </oneOrMore>
      <zeroOrMore>
        <element name="character">
          <attribute name="id"/>
          <element name="name">
            <text/>
          </element>
          <optional>
            <element name="born">
              <text/>
            </element>
          </optional>
          <element name="qualification">
            <text/>
          </element>
        </element>
      </zeroOrMore>
    </element>
  </oneOrMore>
</element>
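
To make the schema more concrete, here is a minimal instance document that it should accept. All the values are made up for illustration; note that the schema only constrains the structure, so any text would do as content:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sample data: one book with one author and one character. -->
<library>
  <book id="_0000000000" available="true">
    <isbn>0000000000</isbn>
    <title xml:lang="en">An Example Title</title>
    <author id="jdoe">
      <name>Jane Doe</name>
      <born>1922-11-26</born>
      <!-- "died" is optional and omitted here -->
    </author>
    <character id="rex">
      <name>Rex</name>
      <!-- "born" is optional and omitted here -->
      <qualification>friendly dog</qualification>
    </character>
  </book>
</library>
```

Removing the isbn element, or moving title before isbn, would make the document invalid, since child elements must appear in the order given by the schema.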

2.2.3. Compact syntax

The compact syntax for RELAX NG should be included in DSDL Part 2 as an addendum. It is strictly equivalent to the XML syntax.

The same example written with the compact syntax is:

element library {
  element book {
    attribute id { text },
    attribute available { text },
    element isbn { text },
    element title {
      attribute xml:lang { text },
      text
    },
    element author {
      attribute id { text },
      element name { text },
      element born { text }?,
      element died { text }?
    }+,
    element character {
      attribute id { text },
      element name { text },
      element born { text }?,
      element qualification { text }
    }*
  }+
}

2.3. Part 3: Rule-based Validation

2.3.1. Schematron is the main input

This part of DSDL will describe the next release of the rule-based schema language known as Schematron. The current version of Schematron has been defined by Rick Jelliffe and other contributors as a language used to express sets of rules as XPath expressions (or more accurately as XSLT expressions, since XSLT functions such as document() are also supported in XPath expressions). Its home page is http://www.ascc.net/xml/schematron/.

2.3.2. Hosting language for expressing rules

Without going into the details of the language, we can say that a Schematron schema is composed of sets of rules named "patterns" (these patterns shouldn't be confused with RELAX NG patterns). Each pattern includes one or more rules. Each rule sets the context nodes under which tests are performed, and each test is written either as an assert or as a report. An assert raises an error if its test is not verified, while a report raises an error if its test is verified.

2.3.3. Example (Schematron)

A fragment of a Schematron schema for our library could be:

<?xml version="1.0" encoding="iso-8859-1"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">
  <sch:title>Schematron Schema for library</sch:title>
  <sch:pattern>
    <sch:rule context="/">
      <sch:assert test="library">The document element should be &quot;library&quot;.</sch:assert>
    </sch:rule>
    <sch:rule context="/library">
      <sch:assert test="book">There should be at least a book!</sch:assert>
      <sch:assert test="not(@*)">No attribute for library, please!</sch:assert>
    </sch:rule>
    <sch:rule context="/library/book">
      <sch:report test="following-sibling::book/@id=@id">Duplicated ID for this book.</sch:report>
      <sch:assert test="@id=concat('_', isbn)">The id should be derived from the ISBN.</sch:assert>
    </sch:rule>
    <sch:rule context="/library/*">
      <sch:assert test="self::book or self::author or self::character">This element shouldn't be here...</sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>

We see from this simple example that writing a full schema with Schematron would be very verbose, since it would mean writing a rule for each element. Each rule would have to spell out all the individual tests that check the content model and, possibly, the relative order of child elements. We also see that Schematron does very well at expressing what are often called business rules, such as:

<sch:assert test="@id=concat('_', isbn)">The id should be derived from the ISBN.</sch:assert>

which checks that the id attribute of a book is derived from its ISBN element by adding a leading underscore.
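
The date-of-death rule mentioned in section 1.2.4 could be sketched with the same constructs. This is only an illustration: it assumes that born and died hold dates written as YYYY-MM-DD, so that removing the hyphens gives numbers that XPath 1.0 can compare. The same check is written once as an assert and once as a report to show the difference between the two:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">
  <sch:title>Hypothetical date check for authors</sch:title>
  <sch:pattern name="Dates">
    <!-- Only authors that have both dates are checked. -->
    <sch:rule context="author[born and died]">
      <!-- assert: the message is issued when the test is false. -->
      <sch:assert test="number(translate(died, '-', '')) &gt; number(translate(born, '-', ''))">The date of death should be later than the date of birth.</sch:assert>
      <!-- report: the message is issued when the test is true. -->
      <sch:report test="number(translate(died, '-', '')) &lt;= number(translate(born, '-', ''))">The date of death is not later than the date of birth.</sch:report>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

In practice only one of the two forms would be kept. Values that do not follow the assumed date format turn into NaN, which makes the assert fail but leaves the report silent.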

2.3.4. Extended to support more than XPath 1.0

DSDL Part 3, the next version of Schematron, should keep this structure and add still more power by allowing rules to use not only XPath 1.0 expressions but also expressions taken from other languages such as EXSLT (a standard extension library for XSLT), XPath 2.0, XSLT 2.0 and even XQuery 1.0.
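
As a sketch of what this could bring, the business rule from the introduction (a date of death later than the date of birth) becomes a direct date comparison once XPath 2.0 is available. The queryBinding attribute, its value and the namespace URI used below are assumptions made for this illustration and should be checked against the Part 3 text and against the processor used:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumption: queryBinding="xslt2" selects XPath 2.0 expressions. -->
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:title>Hypothetical date check using XPath 2.0</sch:title>
  <!-- Declares the xs prefix used inside the test expression. -->
  <sch:ns prefix="xs" uri="http://www.w3.org/2001/XMLSchema"/>
  <sch:pattern>
    <sch:rule context="author[born and died]">
      <!-- xs:date() casts the text content; gt compares the two dates. -->
      <sch:assert test="xs:date(died) gt xs:date(born)">The date of death should be later than the date of birth.</sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```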

2.4. Part 4: Selection of Validation Candidates

2.4.1. Splits documents instead of composing schemas

Although schema languages such as RELAX NG often provide a way to write and combine modular schemas, it is often the case that you need to validate a composite document against existing schemas which can be written using different languages: you may want for instance to validate XHTML documents with embedded RDF statements. In this case, you need to split your documents into pieces and validate each of these pieces against its own schema.

2.4.2. NRL (James Clark) as an input

The first contribution to Part 4 was an ISO specification known as "RELAX Namespace" by Murata Makoto. This contribution was followed by a couple of others, namely Modular Namespaces (MNS) by James Clark and Namespace Switchboard by Rick Jelliffe. The latest contribution, Namespace Routing Language (NRL), was made by James Clark in June 2003 and builds on the previous proposals. It was decided in December 2003 that DSDL Part 4 will be based on NRL. NRL is implemented in the latest versions of Jing.

2.4.3. Example

The first example given in the specification (http://www.thaiopensource.com/relaxng/nrl.html) shows how NRL can be used to validate a SOAP message containing one or more XHTML documents:

<?xml version="1.0" encoding="iso-8859-1"?>
<rules xmlns="http://www.thaiopensource.com/validate/nrl">
  <namespace ns="http://schemas.xmlsoap.org/soap/envelope/">
    <validate schema="soap-envelope.xsd"/>
  </namespace>
  <namespace ns="http://www.w3.org/1999/xhtml">
    <validate schema="xhtml.rng"/>
  </namespace>
</rules>

This example splits a SOAP message into two parts. The SOAP envelope is validated against the W3C XML Schema soap-envelope.xsd, and the XHTML documents found in the body of the SOAP message are validated against the RELAX NG schema xhtml.rng.
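
A made-up message of this kind is shown below, with comments marking the parts that each schema would see. The content is purely illustrative:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Elements in the SOAP namespace: validated against soap-envelope.xsd -->
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <!-- Elements in the XHTML namespace: validated against xhtml.rng -->
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>Hello</title>
      </head>
      <body>
        <p>Hello, world.</p>
      </body>
    </html>
  </soap:Body>
</soap:Envelope>
```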

2.4.4. Supports wildcards, modes, ...

More advanced features are available, including namespace wildcards, validation modes, open schemas and transparent namespaces. These features seem able to handle the most complex cases, provided that the basic assumption holds: instance documents can be split according to the namespaces of their elements and attributes.
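
To give an idea of wildcards and modes, the SOAP example could be tightened along the following lines. This is a sketch based on my reading of the NRL specification; the element and attribute names (mode, startMode, useMode, anyNamespace, allow, reject) should be checked against it, and the mode names are arbitrary:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rules startMode="root" xmlns="http://www.thaiopensource.com/validate/nrl">
  <!-- Mode applied to the document element: only SOAP is accepted here. -->
  <mode name="root">
    <namespace ns="http://schemas.xmlsoap.org/soap/envelope/">
      <!-- What is found inside the envelope is processed in mode "body". -->
      <validate schema="soap-envelope.xsd" useMode="body"/>
    </namespace>
    <!-- Namespace wildcard: anything else at this level is rejected. -->
    <anyNamespace>
      <reject/>
    </anyNamespace>
  </mode>
  <!-- Mode applied to the content of the envelope. -->
  <mode name="body">
    <namespace ns="http://www.w3.org/1999/xhtml">
      <validate schema="xhtml.rng"/>
    </namespace>
    <!-- Other vocabularies are tolerated without being validated. -->
    <anyNamespace>
      <allow/>
    </anyNamespace>
  </mode>
</rules>
```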

2.5. Part 5: Datatypes

2.5.1. To create new primitive types

The goal of this part is to define a set of primitive datatypes with their constraining facets and the mechanisms to derive new datatypes from this set. It is fair to say that it is probably the least advanced, yet most complex, part of DSDL. While people agree on what shouldn't be done, it is difficult to get beyond the criticism of existing systems such as W3C XML Schema datatypes to propose something better.

2.5.2. Jeni Tennison's proposal as an input

The most advanced input for Part 5 is a proposal by Jeni Tennison, see: http://www.jenitennison.com/datatypes.

This proposal borrows elements from RELAX NG to define a kind of EBNF describing the lexical space of datatypes and uses XPath expressions to specify how the datatypes should be normalized.

2.5.3. Example

<datatype name="decimal">
  <parse>
    <optional>
      <choice>
        <string>+</string>
        <string>-</string>
      </choice>
    </optional>
    <oneOrMore>
      <charGroup>
        <range from="0" to="9" />
      </charGroup>
    </oneOrMore>
    <optional>
      <string>.</string>
      <oneOrMore>
        <charGroup>
          <range from="0" to="9" />
        </charGroup>
      </oneOrMore>
    </optional>
  </parse>
</datatype>
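
This definition reads as an optional sign, one or more digits and an optional fractional part made of a dot followed by one or more digits. For comparison, roughly the same lexical space can be described today with a regular expression facet from the W3C XML Schema datatype library in RELAX NG. The element name amount is an assumption made for this sketch:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical element whose content must look like a decimal number. -->
<element name="amount" xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <data type="string">
    <!-- optional sign, digits, optional "." followed by digits -->
    <param name="pattern">(\+|-)?[0-9]+(\.[0-9]+)?</param>
  </data>
</element>
```

The regular expression is more compact, but it says nothing about how values should be normalized or compared, which is what the proposal adds.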

2.6. Part 6: Path-based Integrity Constraints

2.6.1. Not much to say yet

The goal of this part is basically to define a feature covering W3C XML Schema's xs:unique, xs:key and xs:keyref. Part 6 hasn't had any contributions yet.
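
As a reminder of the kind of feature to be covered, here is how W3C XML Schema can declare that the id attribute of each book is a key within a library. This is a minimal sketch written for this paper; the content of book is left open on purpose:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="library">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <!-- The children of book are not checked in this sketch. -->
              <xs:any processContents="skip" minOccurs="0" maxOccurs="unbounded"/>
            </xs:sequence>
            <xs:attribute name="id" type="xs:string" use="required"/>
            <xs:attribute name="available" type="xs:string"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <!-- The selector picks the nodes to check, the field gives the value
         that must be present and unique among them. -->
    <xs:key name="bookKey">
      <xs:selector xpath="book"/>
      <xs:field xpath="@id"/>
    </xs:key>
  </xs:element>
</xs:schema>
```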

2.7. Part 7: Character Repertoire Validation

Part 7 will allow us to specify which characters may be used in specific elements and attributes or within entire XML documents. The W3C note, "A Notation for Character Collections for the WWW" (http://www.w3.org/TR/charcol/), is used as an input for Part 7.

2.7.1. Character Repertoire Validation for XML (CRVX)

A proposal was made at WWW 2003 under the name "Character Repertoire Validation for XML" (CRVX) (http://dret.net/netdret/docs/wilde-crvx-www2003.html).

A simple example of CRVX is:

<crvx xmlns="http://dret.net/xmlns/crvx10">
  <restrict structure="ename aname pitarget" 
    charrep="\p{IsBasicLatin}"/>
  <restrict structure="ename aname" 
    charrep="[^0-9]"/>
</crvx>

In this proposal, the structure attribute contains identifiers for element names (ename), attribute names (aname), Processing Instruction targets (pitarget) and other XML constructions, including element and attribute contents. This example thus requires that element names, attribute names and Processing Instruction targets all use characters from the BasicLatin block, and that element and attribute names do not use digits.

2.7.2. And a draft to define character sets

Diederik Gerth van Wijk has recently produced a draft of DSDL Part 7 focusing on the definition of character properties and character sets, which may be seen as a more advanced way of defining the CRVX “charrep” attribute.

2.7.3. Why RELAX NG isn't enough here

There is some overlap between Part 7 and other schema languages such as Part 2 (RELAX NG). If you use both, you need to take care that your names match the rules defined in both places, and in RELAX NG you can use the data pattern to check the content of attributes and simple content elements. However, Part 7 gives you a more focused means of expressing these rules independently of other schemas. It also fills some gaps: RELAX NG cannot express such constraints on name classes or on mixed content elements.
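
For instance, the following RELAX NG sketch restricts the content of a title element to the BasicLatin block through the pattern facet of the W3C XML Schema datatype library. The choice of the title element is only an example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<element name="title" xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <attribute name="xml:lang"/>
  <!-- Every character of the content must belong to the BasicLatin block. -->
  <data type="string">
    <param name="pattern">\p{IsBasicLatin}*</param>
  </data>
</element>
```

This works because title has simple content. The same restriction cannot be written this way for an element with mixed content, nor for the names of the elements and attributes themselves.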

2.8. Part 8: Declarative Document Architectures

2.8.1. Builds on Architectural forms

This part is still in development. The idea is to allow information (such as default values) to be added to documents depending on their structure. The only input considered for Part 8 so far is known as "Architectural Forms", an old technology with strong adherents and limited use.

2.9. Part 9: Namespace and Datatype-aware DTDs

2.9.1. To keep SGML DTDs alive

There were plenty of good things in DTDs, especially in SGML DTDs. Many people still use them and question the need to throw them away and define new schema languages just to support namespaces and datatypes. DSDL Part 9 is for those who would like to build on years of DTD usage without losing all the goodies of newer schema languages. Despite a burst of discussion in April 2002, this part hasn't advanced yet.

2.10. Part 10: Validation Management

2.10.1. The glue

Last but not least, Part 10 (formerly known as Part 1: Interoperability Framework) is the cement that will let you use the different parts of DSDL together with external tools such as XSLT, W3C XML Schema or your favorite spell checker, to come back to an example given in the introduction.

Here again, different contributions have been made, including my own "XML Validation Interoperability Framework" (XVIF) and Rick Jelliffe's Schemachine. The latest contribution is currently being specified.

2.10.2. Example

An example of a DSDL Validation Management document corresponding to this new version is:

<?xml version="1.0"?>
<dsvm xmlns="http://xmlns.xmlschemata.org/dsvm/" defaultGroup="anyVersion">
  <group name="anyVersion">
    <choice>
      <checkGroup name="version1"/>
      <checkGroup name="version2"/>
    </choice>
  </group>
  <group name="version1">
    <validate schema="schema1.rng"/>
    <validate schema="schema1.sch"/>
  </group>
  <group name="version2">
    <validate schema="schema2.rng"/>
    <validate schema="schema2.sch"/>
  </group>
</dsvm>

This document defines three validation groups ("anyVersion", "version1" and "version2"). These groups of rules may be selected at validation time through a parameter passed to the DSDL Validation Management processor:

  • The group "version1" checks that instance documents conform to version 1 of a vocabulary by applying a RELAX NG schema ("schema1.rng") and a Schematron schema ("schema1.sch").

  • The group "version2" does the same for a second version of the vocabulary through another version of these schemas.

  • The group "anyVersion", which is also declared as the default group used when no parameter is specified, checks that instance documents are valid according to either group "version1" or group "version2".

3. What DSDL should bring you

3.1. The power of a full toolset

DSDL should bring you everything that individual schema languages ignore because they are too focused on individual aspects of validation:

3.1.1. Part 2 (RELAX NG): the simplest and most powerful language to validate the structure of XML documents.

3.1.2. Part 3 (Schematron): highly flexible "business rules".

3.1.3. Part 4 (Selection of Validation Candidates): validate composite documents using individual schemas.

3.1.4. Part 5 (Datatypes): Create your own datatypes.

3.1.5. Part 6 (Path-based Integrity Constraints): specify integrity constraints between elements and attributes.

3.1.6. Part 7 (Character repertoire): specify which characters may be used.

3.1.7. Part 8 (Declarative Document Architectures): make implicit information explicit.

3.1.8. Part 9 (Namespace and Datatype-aware DTDs): keep your legacy DTDs alive.

3.1.9. Part 10 (Validation Management): combine these parts and plug in other transformation and validation tools.

If you already like one of them, I am sure that you'll enjoy the other members of the DSDL family. They share the same principle of focusing on very specific issues, and this focus keeps them powerful and easy to use.

Biography

Eric van der Vlist is an independent XML consultant, developer, trainer and writer.

He’s involved in many XML projects for the French administration, most of them related to the publication of XML vocabularies.

Eric is the editor of the ISO DSDL Part 10 specification (work in progress, see http://dsdl.org) describing “Validation Management”. He is also the author of Examplotron (http://examplotron.org) and one of the editors of RSS 1.0 (http://www.purl.org/rss/1.0/).

He is a contributing editor to XML.com and xmlhack.com, creator and chief editor of XMLfr.org, the main website dedicated to XML in French, and therefore contributes to the adoption of XML by the French community.

Eric is the author of "XML Schema, The W3C's Object-Oriented Descriptions for XML", O'Reilly 2002 and "RELAX NG, A Simpler Schema Language For XML", O'Reilly 2003.

He lives in Paris, France, but you will probably meet him at one of the many international conferences where he delivers his tutorials and presentations.

He welcomes your comments at vdv@dyomedea.com.