Abstract
The notion of "validation" of XML documents covers too many different aspects (structure, content, integrity, business rules, ...) to be performed by a single schema language.
Furthermore, even when a single language is used, documents often need to be transformed, split or normalized to keep the schemas simple.
The ISO DSDL project (ISO/IEC JTC 1 SC 34 WG 1) is standardizing a set of simple, specialized schema languages and pre-validation transformation languages, together with a framework that defines how these operations must be applied. These languages include well-known technologies such as RELAX NG and Schematron as well as new languages.
This talk gives a full overview of the project, explains the goal of each of its parts and presents the latest developments of DSDL.
Although schema languages such as RELAX NG and Schematron started as standalone projects, they are now being standardized at ISO (ISO/IEC JTC 1 SC 34 WG 1, to be precise) as parts of a multi-part standard named DSDL (see http://dsdl.org ).
Standing for "Document Schema Definition Languages", DSDL is a recognition that the validation of XML documents is a subject too wide and complex to be covered by a single language. It also acknowledges that the industry needs a set of simple and dedicated languages to perform different validation tasks - as well as a framework in which these languages may be used together.
There are many different aspects in validating (or schematizing) XML documents which can be categorized into:
Validating the structure of the document, i.e. checking the containment of elements and attributes (this is a domain in which RELAX NG is very good).
Validating the content of each text node and attribute independently of each other (this is where datatype libraries are needed).
Validating integrity constraints between different elements and attributes.
Validating any other rules, often called business rules (this is where Schematron is so good).
DSDL can be seen as a framework and a set of languages for checking the quality of XML documents, an issue that appears to be crucial for any XML-based application. Recent work, such as the presentation given by Simon Riggs at XML Europe 2003 or the work of Isabelle Boydens ("Informatique, normes et temps." Bruxelles, Éditions E. Bruylant, 1999.) on the quality of large databases, has shown that about 10% of XML documents (or data records) contain at least one error. This level of quality is unacceptable for many applications. DSDL could thus become indispensable for most XML applications.
DSDL is still a work in progress. It is a multi-part specification in which each part presents a different schema language (except Part 1, which is an introduction, and Part 10, which describes the framework itself).
Part 2 is RELAX NG. It is a rewrite of the RELAX NG OASIS Technical Committee specification to meet the requirements of ISO publications. Its wording is more formal than that of the OASIS specification, but the features of the language are the same. Any RELAX NG implementation that conforms to either of these two documents also conforms to the other.
DSDL Part 2 is now a "Final Draft International Standard" (FDIS), i.e. it has reached the last stage before publication as an official ISO standard.
An example of RELAX NG is:
<?xml version="1.0" encoding="utf-8"?>
<element xmlns="http://relaxng.org/ns/structure/1.0" name="library">
  <oneOrMore>
    <element name="book">
      <attribute name="id"/>
      <attribute name="available"/>
      <element name="isbn">
        <text/>
      </element>
      <element name="title">
        <attribute name="xml:lang"/>
        <text/>
      </element>
      <oneOrMore>
        <element name="author">
          <attribute name="id"/>
          <element name="name">
            <text/>
          </element>
          <optional>
            <element name="born">
              <text/>
            </element>
          </optional>
          <optional>
            <element name="died">
              <text/>
            </element>
          </optional>
        </element>
      </oneOrMore>
      <zeroOrMore>
        <element name="character">
          <attribute name="id"/>
          <element name="name">
            <text/>
          </element>
          <optional>
            <element name="born">
              <text/>
            </element>
          </optional>
          <element name="qualification">
            <text/>
          </element>
        </element>
      </zeroOrMore>
    </element>
  </oneOrMore>
</element>
Part 3 of DSDL will describe the next release of the rule-based schema language known as Schematron. The current version of Schematron was defined by Rick Jelliffe and other contributors as a language for expressing sets of rules as XPath expressions (or, more accurately, as XSLT expressions, since XSLT functions such as document() are also supported). Its home page is http://www.ascc.net/xml/schematron/.
Without going into the details of the language, we can say that a Schematron schema is composed of sets of rules named "patterns" (not to be confused with RELAX NG patterns). Each pattern includes one or more rules. Each rule sets the context nodes under which tests will be performed, and each test is performed either as an assert or as a report. An assert is a test that raises an error if it is not verified, while a report is a test that raises an error if it is verified.
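The difference between an assert and a report can be sketched in a few lines of Python. This only illustrates the semantics described above and is not Schematron itself: the Rule class and the run_rule function are hypothetical names, and plain Python callables stand in for the XPath tests a real schema would use.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical names for illustration only; each test is a Python callable
# standing in for the XPath expression of a real Schematron schema.
Test = Callable[[ET.Element], bool]


@dataclass
class Rule:
    context: str  # ElementTree path selecting the context nodes
    asserts: List[Tuple[Test, str]] = field(default_factory=list)
    reports: List[Tuple[Test, str]] = field(default_factory=list)


def run_rule(root: ET.Element, rule: Rule) -> List[str]:
    """Collect the messages raised by one rule."""
    messages = []
    for node in root.findall(rule.context):
        for test, message in rule.asserts:
            if not test(node):  # assert: error when the test is NOT verified
                messages.append(message)
        for test, message in rule.reports:
            if test(node):  # report: error when the test IS verified
                messages.append(message)
    return messages


doc = ET.fromstring('<library available="yes"><book id="_1"/></library>')
rule = Rule(
    context=".",
    asserts=[(lambda n: n.find("book") is not None,
              "There should be at least a book!")],
    reports=[(lambda n: bool(n.attrib),
              "No attribute for library, please!")],
)
messages = run_rule(doc, rule)
print(messages)
```

Here the assert stays silent because a book is present, while the report fires because the library element carries an attribute.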
A fragment of a Schematron schema for our library could be:
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">
  <sch:title>Schematron Schema for library</sch:title>
  <sch:pattern>
    <sch:rule context="/">
      <sch:assert test="library">The document element should be "library".</sch:assert>
    </sch:rule>
    <sch:rule context="/library">
      <sch:assert test="book">There should be at least a book!</sch:assert>
      <sch:assert test="not(@*)">No attribute for library, please!</sch:assert>
    </sch:rule>
    <sch:rule context="/library/book">
      <sch:report test="following-sibling::book/@id=@id">Duplicated ID for this book.</sch:report>
      <sch:assert test="@id=concat('_', isbn)">The id should be derived from the ISBN.</sch:assert>
    </sch:rule>
    <sch:rule context="/library/*">
      <sch:assert test="self::book or self::author or self::character">This element shouldn't be here...</sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
This simple example shows that writing a full schema with Schematron would be very verbose, since it would mean writing a rule for each element. Each rule would have to spell out all the individual tests that check the content model and, where it matters, the relative order of child elements. The example also shows that Schematron does very well at expressing what are often called business rules, such as:
<sch:assert test="@id=concat('_', isbn)">The id should be derived from the ISBN.</sch:assert>
which checks that the id attribute of a book is derived from its ISBN element by adding a leading underscore.
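To make this business rule concrete, here is a minimal Python sketch of the same check using only the standard library. The instance document and the check_book_ids function are assumptions made for this illustration. The test is written by hand rather than evaluated from the XPath expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document: the second book breaks the rule.
LIBRARY = """
<library>
  <book id="_0836217462"><isbn>0836217462</isbn></book>
  <book id="b2"><isbn>0805033106</isbn></book>
</library>
"""


def check_book_ids(root):
    """Return one message per book whose id is not '_' followed by its isbn."""
    errors = []
    for book in root.findall("book"):
        isbn = book.findtext("isbn", default="")
        # Hand-written equivalent of the test @id=concat('_', isbn)
        if book.get("id") != "_" + isbn:
            errors.append(
                f"book {book.get('id')}: The id should be derived from the ISBN."
            )
    return errors


errors = check_book_ids(ET.fromstring(LIBRARY))
print(errors)
```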
DSDL Part 3, the next version of Schematron, should keep this structure and add still more power by allowing not only XPath 1.0 expressions but also expressions taken from other languages such as EXSLT (a standard extension library for XSLT), XPath 2.0, XSLT 2.0, and even XQuery 1.0.
Although RELAX NG provides a way to write and combine modular schemas, you often need to validate a composite document against existing schemas that may be written in different languages: you may want, for instance, to validate XHTML documents with embedded RDF statements. In this case, you need to split your documents into pieces and validate each piece against its own schema. This is the purpose of Part 4 (Selection of Validation Candidates).
The first contribution to Part 4 was an ISO specification known as "RELAX Namespace" by Murata Makoto. This contribution has been followed by a couple of others, namely Modular Namespaces (MNS) by James Clark and Namespace Switchboard by Rick Jelliffe. The latest contribution, Namespace Routing Language (NRL) was made by James Clark in June 2003 and builds on previous proposals. Although it is too early to say if NRL will become DSDL Part 4, it will most likely influence it heavily. NRL is implemented in the latest versions of Jing.
The first example given in the specification (http://www.thaiopensource.com/relaxng/nrl.html) shows how NRL can be used to validate a SOAP message containing one or more XHTML documents:
<rules xmlns="http://www.thaiopensource.com/validate/nrl">
  <namespace ns="http://schemas.xmlsoap.org/soap/envelope/">
    <validate schema="soap-envelope.xsd"/>
  </namespace>
  <namespace ns="http://www.w3.org/1999/xhtml">
    <validate schema="xhtml.rng"/>
  </namespace>
</rules>
This example splits the SOAP message into two kinds of parts. The SOAP envelope is validated against the W3C XML Schema soap-envelope.xsd, and the one or more XHTML documents found in the body of the SOAP message are validated against the RELAX NG schema xhtml.rng.
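The splitting step can be sketched in Python as follows. This is a rough illustration of the idea, not of how Jing implements NRL: the sections function and the sample message are hypothetical. The sketch only works out where the document would be cut and which schema each piece would be sent to. It does not detach the pieces or run any validator.

```python
import xml.etree.ElementTree as ET

# Namespace-to-schema table mirroring the NRL rules shown above.
RULES = {
    "http://schemas.xmlsoap.org/soap/envelope/": "soap-envelope.xsd",
    "http://www.w3.org/1999/xhtml": "xhtml.rng",
}


def namespace_of(element):
    # ElementTree spells qualified names as "{namespace}local".
    tag = element.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def sections(root):
    """Yield (namespace, element) wherever the namespace differs from the parent's."""
    def walk(element, parent_ns):
        ns = namespace_of(element)
        if ns != parent_ns:
            yield ns, element
        for child in element:
            yield from walk(child, ns)
    yield from walk(root, None)


# Hypothetical SOAP message carrying one XHTML document.
MESSAGE = """
<env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/">
  <env:Body>
    <html xmlns="http://www.w3.org/1999/xhtml"><head><title>Hi</title></head><body/></html>
  </env:Body>
</env:Envelope>
"""

plan = [(element.tag.split("}")[-1], RULES.get(ns))
        for ns, element in sections(ET.fromstring(MESSAGE))]
print(plan)
```

Each entry of plan pairs the root of a section with the schema that the rules select for its namespace.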
More advanced features are available, including namespace wildcards, validation modes, open schemas, and transparent namespaces. These features seem able to handle the most complex cases, as long as the basic assumption holds that instance documents can be split according to the namespaces of their elements and attributes.
The goal of Part 5 (Datatypes) is to define a set of primitive datatypes with their constraining facets, and the mechanisms to derive new datatypes from this set. It is fair to say that it is probably the least advanced, yet most complex, part of DSDL. While people agree on what shouldn't be done, it is difficult to get beyond criticism of existing systems such as W3C XML Schema datatypes and propose something better.
Some interesting ideas were raised at the last DSDL meeting in May 2003, and they tend to converge with threads on the XML-DEV mailing list in June. We can hope that this will lead to something more constructive at the next DSDL meeting in December 2003.
The goal of Part 6 (Path-based Integrity Constraints) is basically to define a feature covering W3C XML Schema's xs:unique, xs:key and xs:keyref. Part 6 hasn't had any contributions yet.
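Since no syntax has been proposed yet, the kind of check such a feature would perform can only be sketched. The Python below is an illustration under assumptions: the duplicate_keys and dangling_refs functions and the loan elements are hypothetical and say nothing about what Part 6 will look like.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical document: "loan" elements refer to books by id.
LIBRARY = """
<library>
  <book id="_1"/>
  <book id="_2"/>
  <book id="_1"/>
  <loan book="_2"/>
  <loan book="_7"/>
</library>
"""


def duplicate_keys(root, path, attr):
    """Uniqueness check: key values that occur more than once."""
    counts = Counter(el.get(attr) for el in root.findall(path))
    return sorted(k for k, n in counts.items() if k is not None and n > 1)


def dangling_refs(root, key_path, key_attr, ref_path, ref_attr):
    """Key reference check: references that match no existing key."""
    keys = {el.get(key_attr) for el in root.findall(key_path)}
    refs = (el.get(ref_attr) for el in root.findall(ref_path))
    return sorted(r for r in refs if r is not None and r not in keys)


root = ET.fromstring(LIBRARY)
duplicates = duplicate_keys(root, "book", "id")
dangling = dangling_refs(root, "book", "id", "loan", "book")
print(duplicates, dangling)
```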
Part 7 will allow us to specify which characters may be used in specific elements and attributes or within entire XML documents. The W3C note, "A Notation for Character Collections for the WWW" (http://www.w3.org/TR/charcol/), is used as an input for Part 7. The first contribution is "Character Repertoire Validation for XML" (CRVX) (http://dret.net/netdret/docs/wilde-crvx-www2003.html).
A simple example of CRVX is:
<crvx xmlns="http://dret.net/xmlns/crvx10">
  <restrict structure="ename aname pitarget" charrep="\p{IsBasicLatin}"/>
  <restrict structure="ename aname" charrep="[^0-9]"/>
</crvx>
In this proposal, the structure attribute contains identifiers for element names (ename), attribute names (aname), processing instruction targets (pitarget) and other XML constructs, including element and attribute contents. This example thus requires that element names, attribute names and processing instruction targets use only characters from the BasicLatin block, and that element and attribute names contain no digits.
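The effect of these two restrictions on names can be approximated in Python. This is a sketch under assumptions rather than a CRVX implementation: the violations function is hypothetical and each charrep is read as a class that every character of the name must match. The class [\x00-\x7F] stands in for \p{IsBasicLatin}, which Python's re module does not support. Processing instruction targets are left out because ElementTree drops them by default.

```python
import re
import xml.etree.ElementTree as ET

# (kinds of names concerned, class that every character must match)
RESTRICTIONS = [
    ({"ename", "aname"}, re.compile(r"[\x00-\x7F]")),  # BasicLatin block
    ({"ename", "aname"}, re.compile(r"[^0-9]")),       # no digits
]


def local_name(name):
    # Strip the "{namespace}" prefix that ElementTree adds to qualified names.
    return name.split("}")[-1]


def names(root):
    """Yield (kind, name) for every element and attribute name."""
    for element in root.iter():
        yield "ename", local_name(element.tag)
        for attribute in element.attrib:
            yield "aname", local_name(attribute)


def violations(root):
    """Return the (kind, name) pairs that break at least one restriction."""
    bad = []
    for kind, name in names(root):
        for kinds, allowed in RESTRICTIONS:
            if kind in kinds and not all(allowed.fullmatch(ch) for ch in name):
                bad.append((kind, name))
                break
    return bad


# Hypothetical document with a digit in an element name and a
# non-BasicLatin character in an attribute name.
doc = ET.fromstring('<library><book2 id="x"/><livre num\u00e9ro="1"/></library>')
result = violations(doc)
print(result)
```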
There is some overlap between Part 7 and other schema languages such as Part 2 (RELAX NG): names must match the rules defined in both places, and RELAX NG's data pattern can already be used to check the content of attributes and simple-content elements. However, Part 7 gives you a more focused means of expressing these rules independently of other schemas. It also fills some gaps: RELAX NG cannot express such constraints on name classes or on mixed-content elements.
Part 8 (Declarative Document Architectures) is still in development. The idea is to allow information (such as default values) to be added to documents depending on their structure. The only input considered for Part 8 so far is known as "Architectural Forms", an old technology with strong adherents and limited use.
There were plenty of good things in DTDs, especially in SGML DTDs. Many people still use them and question the need to throw them away and define new schema languages just to support namespaces and datatypes. DSDL Part 9 is for those who would like to build on years of DTD usage without losing the goodies of newer schema languages. Despite a burst of discussion in April 2002, this part hasn't advanced yet.
Last but not least, Part 10 (formerly known as Part 1: Interoperability Framework) is the cement that will let you use the different parts of DSDL together with external tools such as XSLT, W3C XML Schema, or your favorite spell checker, to come back to an example given in the introduction to this chapter.
Here again, different contributions have been made, including my own "XML Validation Interoperability Framework" XVIF and Rick Jelliffe's Schemachine. The latest contribution is known (and implemented) as "xvif/outie" (see http://downloads.xmlschemata.org/python/xvif/outie/about.xhtml).
A simple example of an xvif/outie document is:
<?xml version="1.0" encoding="utf-8"?>
<framework>
  <rule>
    <instance>
      <transform transformation="normalize.xslt"/>
    </instance>
    <assert>
      <isValid schema="schema.rng"/>
      <isValid schema="schema.sch"/>
    </assert>
  </rule>
</framework>
This document defines a rule that checks the result of the XSLT transformation normalize.xslt applied to the instance document. The rule states that the result of the transformation must be valid against both schema.rng and schema.sch.
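The control flow of such a rule, transform first and then validate the result against every schema, can be sketched in Python. This is not the xvif/outie API: run_rule is a hypothetical function, documents are plain strings, and the transformation and the two validators are toy stand-ins for an XSLT processor and for RELAX NG and Schematron validators.

```python
from typing import Callable, Dict, List

Document = str  # stand-in for a parsed XML document


def run_rule(instance: Document,
             transform: Callable[[Document], Document],
             validators: Dict[str, Callable[[Document], bool]]) -> List[str]:
    """Apply the transformation, then list the schemas its result fails."""
    result = transform(instance)
    return [name for name, is_valid in validators.items()
            if not is_valid(result)]


# Toy stand-ins (assumptions): a real framework would call an XSLT
# processor here, and real RELAX NG and Schematron validators below.
failures = run_rule(
    "  <library/>  ",
    transform=str.strip,
    validators={
        "schema.rng": lambda doc: doc.startswith("<library"),
        "schema.sch": lambda doc: "<book" in doc,
    },
)
print(failures)
```

The rule holds only when the list of failures is empty, which matches the requirement that the transformed document be valid against both schemas.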
DSDL should bring you everything that individual schema languages leave out because each of them focuses on a single aspect of validation:
Part 2 (RELAX NG) is the simplest and most powerful language to validate the structure of XML documents.
Part 3 (Schematron) gives you the ability to add highly flexible "business rules" to your schemas.
Part 4 (Selection of Validation Candidates) lets you write and reuse schemas written in any language and combine them to validate composite documents.
Part 5 (Datatypes) should provide a better alternative to W3C XML Schema datatypes.
Part 6 (Path-based Integrity Constraints) will let you specify integrity constraints between elements and attributes.
Part 7 (Character repertoire) will let you specify which characters may be used in your documents.
Part 8 (Declarative Document Architectures) will let you make explicit, before validation, information that was previously only implicit in your documents.
Part 9 (Namespace and Datatype-aware DTDs) will let you upgrade and reuse your DTDs in the context of newer applications.
Part 10 (Validation Management) will let you combine these parts and plug in other transformation and validation tools.
If you already like one of them, I am sure that you'll enjoy the other members of the DSDL family. They share the same principle of focusing on very specific issues. This focus keeps them powerful and easy to use.