XML 2003 logo

XML Schema 1.0: A language for document grammars

Abstract

The abstract was not available at the time the proceedings were created. Please check an updated version of the paper abstracts at the conference proceedings web site.


Table of Contents

1. Introduction: document grammars
2. DTDs as document grammars
3. XML Schema 1.0
4. Schemas and namespaces
5. Simple types
6. Complex types
7. The Post-schema-validation Infoset
Biography
One of the characteristic features of SGML and XML which marks them as an advance over earlier systems of textual representation is their notion of document grammars: formal specifications of rules for distinguishing valid documents from other data streams. This paper introduces the idea of using formal grammars to describe data streams, and notes some of the fundamental characteristics of XML 1.0 document type definitions as formal grammars, before taking up the major characteristics of XML Schema 1.0 as a language for document grammars.

1. Introduction: document grammars

One fundamental rule of computing is: Garbage in, garbage out. In a sense, everything disucssed in this paper is an attempt to come to grips with this iron law.

Garbage can take the form of bad or erroneous information, but even good information can be garbage if the program we feed it to doesn't know what to do with it. By ‘garbage’ we typically mean nothing other than input for which a given program is not prepared.

When a program is expensive or takes a long time to run, it can be rather trying to discover only after the fact that the input to the program was not what the program was expecting — was garbage, in fact. It would be useful to have some reliable method of distinguishing good input — input for which a program is going to produce useful output — from bad input.

That is the reason that grammars are, or ought to be, of interest to builders of software systems.

Informally, a grammar is a set of rules that describes how something works, especially a natural language like English or French. In the context of computing, the term grammar is usually associated with Noam Chomsky's explication of the idea: in Chomsky's account, a grammar is a rewriting system — a system which performs calculations by writing and re-writing the contents of a buffer according to a specified set of rules, until the rules say the calculation is finished. A grammar in this sense consists of:

  • V N : a vocabulary of so-called non-terminal symbols

  • V T : a vocabulary of so-called terminal symbols, disjoint from VN

  • P: a set of production rules symbols, each with a left-hand side consisting of a sequence of non-terminal symbols, and a right-hand side consisting of a sequence of non-terminal and/or terminal symbols

  • S: the so-called start symbol, which is an element of the non-terminal vocabulary

The calculation starts with a buffer containing only the start symbol, and proceeds by locating production rules whose left-hand sides match part of the contents of the buffer; when there is such a match, the part of the buffer that matches the left-hand side of the rule can be replaced by the right-hand side of the rule; we call this a derivation step. The calculation proceeds until the buffer contains no non-terminal symbols, only terminal symbols. (They are terminal symbols because they terminate the calculation.) The contents of the buffer at that time will be a sequence of terminal symbols; that sequence of symbols is said to be generated by the grammar.

In this formalization, a language is just a set, whose members are strings, i.e. sequences of symbols. The language defined by a grammar contains all the strings generated by that grammar, and no others. The function of a grammar, in this account, is just to generate strings, but we can use it to recognize strings of the language, as well as generating them: in principle, for any finite grammar we can systematically generate all the strings which belong to the language, so one way of telling whether a given string is or is not a member of the language is just to generate all the members of the language and see whether it shows up in the list. Since for most interesting languages the list is infinitely long, however, this method has some practical drawbacks, and computer scientists have devoted a good deal of energy during the last fifty years to finding ways, given a grammar, to create a program which can reliably distinguish between strings which are members of the language defined by the grammar, and strings which are not. A piece of software which uses a grammar to recognize the members of the language is, not unreasonably, called a recognizer. A program which provides a bit more information about the string itself (for example, some account of its derivation using the grammar) is a parser.

As a simple example of a rewriting system, let us consider this fragment of a grammar for purchase orders:

purchase-order ::= shipto-address billto-address comments items
shipto-address ::= address
billto-address ::= address
address        ::= name street city state zip
name           ::= string
street         ::= string
city           ::= string
state          ::= 'AL'| 'AK' | 'AR' | ... | 'WV' | 'WI' | 'WY'
zip            ::= zip-five | zip-nine
zip-five       ::= DIGIT DIGIT DIGIT DIGIT DIGIT
zip-nine       ::= zip-five '-' DIGIT DIGIT DIGIT DIGIT
string         ::= CHAR | CHAR string
items          ::= item
items          ::= item items

DIGIT          ::= '0'| '1' | ... | '8' | '9'
CHAR           ::= ' ' | 'a' | 'b' | ...

In this example, terminal symbols (DIGIT and STRING) are written in capitals, and non-terminals in lower-case letters. The left- and right-hand sides of each rule are separated by ::= and alternative right-hand sides are separated by vertical bars.

The first rule says that the symbol purchase-order can be rewritten as the sequence shipto-address billto-address comments items — less formally, it can be interpreted as saying that a purchase order consists of a shipping address, a billing address, some comments, and a list of items. The next two rules say that the symbols shipto-address and billto-address can each be rewritten as address. The rule for state specifies the legal two-letter codes for states in U.S. addresses (the left-hand side state can be replaced by any one of the alternatives in the right-hand side), and the rules for zip codes capture the fact that U.S. zip codes must consist either of five numeric digits or of nine (with a hyphen after the fifth). No attempt is made here to distinguish valid zip codes from invalid zip codes: even though it would in theory be possible to do so, it would be difficult to maintain. This is a good example of a constraint which is better checked by specialized software (in this case, a certified address-checking package, which can also check that the zip codes match the city and state information). The two rules for items work together recursively to allow the symbol items to be replaced in the buffer by one or more occurrences of the symbol item; the expression of repetition by means of recursion is a basic idiom of conventional context-free grammars.

One of the most important results of work on this idea of grammars as rewriting systems is the observation that if we restrict the form of the production rules (P), the result is a restriction on the complexity of the grammars we can write. Various seemingly mechanical restrictions on the form of production rules turn out to correspond to distinct classes of languages, resulting in what is called the Chomsky hierarchy:

  1. unrestricted phrase-structure grammars

  2. monotonic grammars

  3. context-free grammars

  4. regular grammars

The more restrictive the form of grammar, the less expressive it is, but also the easier it is to recognize the language it describes. For practical purposes, in computing we are interested primarily in context-free and regular grammars. The example rules shown above, which are limited to a single non-terminal symbol in the left-hand side, obey the rules for context-free grammars.

Another crucial result of formal language theory is the basic fact that there are languages (sets of strings) which cannot be generated by any grammar. Some languages, that is, are ineffable; there is a necessary limit to our ability to express arbitrarily complex distinctions in defining languages we want to recognize by machine.

One of the most important innovations in the development of SGML was its incorporation, under the name document type definition, of the notion of formal, formally checkable grammars for document classes. Since the grammars defined by DTDs define classes of documents, we can refer to them as document grammars.

Although they are very interesting from the point of view of textual theory, document grammars were introduced in SGML and retained in XML not for theoretical but for pragmatic reasons. They prove useful in routine quality assurance, because they make it possible to find and clean up dirty data — they cannot find all dirty data, since there are classes of error which cannot be detected conveniently or at all using a formal grammar, but formally checkable document grammars make it possible to detect mechanically whole classes of typographic error or data corruption in transmission which would otherwise require expensive manual proofreading. ISO 8879 distinguishes explicitly between the set of rules governing the application of markup to a particular document type, and the formally checkable expression of some of those rules. Formal rules can typically capture only part of the expectations we may have in building a particular application, but every rule we can capture formally is a rule which can be checked without human intervention.

Document grammars can be used to document agreements between data providers and data consumers, or to describe the contents of data flows. And in some cases they can be and have been used to specify the allowable contents of messages in client/server protocols.

2. DTDs as document grammars

As an example of DTD notation, consider the DTD equivalents of some of the purchase-order rules given above in context-free form:

<!ENTITY % address "(name, street, city, state, zip)" >
<!ELEMENT purchaseOrder (shipTo, billTo, comment, items) >
<!ELEMENT shipTo        (%address;) >
<!ELEMENT billTo        (%address;) >
<!ELEMENT name          (#PCDATA) >
<!ELEMENT street        (#PCDATA) >
<!ELEMENT city          (#PCDATA) >
<!ELEMENT state         (#PCDATA) >
<!ELEMENT zip           (#PCDATA) >
<!ELEMENT items         (item+) >

XML DTDs are often described as analogous to grammars in Backus-Naur Form, the conventional notation for representing context-free grammar. Formally, however, DTDs differ from BNF in several ways.

  • First of all, XML DTDs allow regular-expression notation in content models (the right-hand sides of rules in a DTD), which is a great convenience. Consider for example the definition of items, which is somewhat more compact and easier to understand than the BNF equivalent given earlier.

  • Second, XML DTDs are strictly less powerful than context-free grammars; instead, they resemble a subset of context-free languages known as ’bracketed‘ languages.

  • Third, XML DTDs are intentionally restricted to languages which can be recognized without look-ahead; this rule, inherited from SGML, goes by a variety of names. The SGML specification specifies that content models must be “unambiguous”, although its definition of “ambiguity” is somewhat broader than usual in computing. To avoid misusing the term ambiguity, the XML specification introduced the term determinism. Owing to an editorial confusion, meanwhile, the XML Schema specification uses a third term (“unique particle attribution”) for the same concept.

  • Finally, XML DTDs do more than provide an account of the legal structure of XML documents: they also provide for the declaration of entities and notations. This has often been identified as a weakness in DTDs, both for theoretical reasons (it is a mixing of unrelated concerns) and for practical reasons (the entities declared in documents often depend more on the individual document than on its document type, and they have different patterns of maintenance and revision).

In practice, SGML and XML DTDs have demonstrated the utility of document grammars as a tool for describing the structure of data streams and allowing for independent specification-based validation of data. They allow for the independent description of an agreement between data providers and data consumers about the structure and possible content of data streams, thus making it much easier to process the same data with software from more than one vendor. By contrast, formats without a DTD or similarly declarative definition (e.g. RTF or, in practice, HTML) have tended to be defined by the behavior of one or two prominent pieces of software, with the result that there is no way to determine whether a data stream is acceptable or not except by checking the behavior of that piece of software. Since there is no declarative description of the legal format, it's difficult or impossible to prove that the output of a particular program (for example) will always be legal according to the specification. And using the behavior of one program, rather than a written specification, as the acid test has the drawback, for information owners, of encouraging or enforcing dependence on a particular software vendor.

As the scope of application of XML has grown, however, some shortcomings of XML document type definitions have become more and more apparent.

Perhaps the most pressing for many is that DTDs have no type system. There is no standard way, in a DTD, to specify that the content of a particular elements, or the value of a particular attribute, must be an integer, or a Boolean value, or a date — and still less that it must be a positive integer less than 150, or a date between 1989 and the present. In the example, note for example that the DTD fragment for purchase orders does not constrain the contents of the state and zip elements; DTDs offer no ability to check the content of elements against an enumeration or against any kind of datatype. This lack contributes to a mismatch between XML and existing programming languages and database management systems.

Less immediately pressing for most users, but nevertheless important, is the fact that while XML DTDs are incontestably structured data, they don't use the increasingly common standard for writing, manipulating, and transmitting structured data. They aren't XML documents! So we cannot use our standard tools for working with XML: we cannot use the same XML editor for them that we use for our documents, we cannot read them and manipulate them using XSLT or SAX or the DOM. And we cannot apply stylesheets to them to affect their display when we include them in system documentation.

In an effort to address these and related problems, the World Wide Web Consortium developed XML Schema 1.0, an XML vocabulary for defining XML vocabularies. XML Schema 1.0 became a W3C Recommendation in May, 2001, and with the coming publication of XQuery 1.0 and XSLT 2.0 it seems likely to become increasingly important to users.

3. XML Schema 1.0

XML Schema 1.0 differs from DTDs in several ways (some of these are also characteristic of other non-DTD schema languages for XML documents, some are not).

It uses an XML vocabulary, rather than an ad hoc specialized non-XML notation, to represent document grammars. Schema documents are more verbose than equivalent schemas in DTD notation but also more easily processable. The declaration of purchaseOrder, for example, might look like this in a schema document:

 <xsd:element name="purchaseOrder">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element name="shipTo" type="address"/>
    <xsd:element name="billTo" type="address"/>
    <xsd:element name="comment" type="xsd:string"/>
    <xsd:element name="items" type="item-type"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

It is not only possible, but commonplace, to edit schema documents using standard off-the-shelf XML editors (driven by the schema for schemas) and process them using XSLT. XSLT stylesheets to generate documentation from a schema document are relatively common; by contrast, software to generate document from DTDs is relatively uncommon.

XML Schema separates constraints on logical structure from constraints on physical organization of the document. This has the advantage of cleanly separating the logical from the physical. Entities can still be used in XML documents governed by XML Schema, however, simply by using the existing DTD notation. Since XML Schema takes the XML Information Set output by a standard XML parser as its input, entity expansion can be handled up front by the XML parser without complicating schema processing. The ability of XML Schema to co-exist with DTDs is one of the least well understood features of XML Schema, and the least well exploited.

XML Schema processors operate on XML information sets, rather than on raw XML character streams: entity references have already been expanded, and the details of the XML transfer syntax have already been abstracted away from. Validation is defined, formally, as a mapping from an input information set to an output information set, which includes information about datatypes and validity of elements and attributes.

4. Schemas and namespaces

One of the most important developments in the XML area since the original XML specification is the definition of XML namespaces.

The XML namespaces specification (like the original XML specification, this is a W3C Recommendation) allows XML users to combine tags from different vocabularies in the same document. Each namespace is identified by a Uniform Resource Identifier (URI), and local names in different namespaces do not collide with each other. There is thus no problem if one XML vocabulary uses a name like title in one sense, and another vocabulary uses the same name with a different meaning.

The biggest drawback about namespaces has historically been that they made it very difficult to validate documents: XML DTDs do not have any syntax for associating elements and attributes with namespaces, and DTD-based validation does not understand the use of namespace prefixes. Even more important, DTDs provide no way to show, in a content model, where elements from other namespaces should be legal, and where they should raise an error.

As a result, while it is possible (within limits) to use DTDs together with namespaces, it is a difficult task at best, and most people who have tried have eventually given up the attempt.

Unlike DTDs, XML Schema deals well with XML namespaces. Each schema document which declares elements and attributes can also specify the namespace to which those elements and attributes belong. There is explicit support for namespace-based validation and for combining XML vocabularies from different namespaces into a single composite schema.

The top-level element of our sample schema document for purchase orders, for example, might look like this:

<xsd:schema targetNamespace="http://www.example.com/po" 
 xmlns="http://www.example.com/po"
 xmlns="http://www.w3.org/2001/XMLSchema" >
 ...
</xsd:schema>

Namespaces are used both in schema-defined documents and in schema documents themselves: in the example just shown, the XML Schema namespace (the namespace which contains the declarations for the elements found in schema documents) is bound to the prefix xsd, while the namespace being defined by the schema document itself (http://www.example.com/po) is identified by the targetNamespace attribute and is made the default namespace.

5. Simple types

XML Schema provides a basic set of predefined `simple' datatypes, which can be associated with attribute values or with elements whose content is a simple character string without sub-elements. Builtin types include types inherited from XML as well as types commonly found in programming languages and database management systems: exact decimal numbers and integers, floating-point and double-precision numbers, dates and times, and so on. Schema authors can define new simple types by restricting existing ones in certain well defined ways.

For example, we can specify that when quantity is specified in a purchase order, it must be a positive integer less than 100:

<xsd:element name="quantity">
 <xsd:simpleType>
  <xsd:restriction base="xsd:positiveInteger">
   <xsd:maxExclusive value="100"/>
  </xsd:restriction>
 </xsd:simpleType>
</xsd:element>

6. Complex types

In addition to simple types, schemas can also define complex types, for elements which can contain sub-elements; complex types correspond to the content models and attribute declarations of DTDs. From object-oriented systems, however, XML Schema has adopted the concept of class inheritance: it is possible to derive new complex types from existing complex types, just as it is possible to derive new object classes from existing classes in an object-oriented programming language. Experience with DTDs shows that two quite separate kinds of inheritance may be needed for document grammars: one in which the derived type inherits some properties of its content model and attributes from the ancestor types, and another in which what is inherited is the ability of an element to occur in particular locations. (The TEI models these two different kinds of inheritance by distinguishing attribute-classes and model-classes.)

A simple example of complex types can be seen in our purchase-order example. Both the shipping address and the billing address have the same structure. In the Backus-Naur Form grammar fragment, this was expressed by allowing the two non-terminals each to be rewritten in the same way:

shipto-address ::= address
billto-address ::= address

This has the advantage of making the parallel explicit, but has the odd effect of suggesting that an address is a constituent part of a shipping address, and also of a billing address, instead of correctly capturing the fact that shipping and billing addresses are addresses.

In the DTD fragment, the parallelism is captured by using a parameter entity:

<!ELEMENT shipTo        (%address;) >
<!ELEMENT billTo        (%address;) >

This makes the parallel structure obvious to any reader of the DTD, and makes it easier to maintain consistency if the definition of address changes (for example, to accommodate addresses outside the U.S.), but typical XML systems expand the parameter entity before passing any of this information on to the downstream application (if, that is, any information about the DTD is available at all), and so the information about the common structure of shipping and billing addresses tends not to be readily available to applications using XML DTDs.

In an XML Schema document, by contrast, the shipTo and billTo elements are explicitly assigned a type, just as the quantity element is.

    
<xsd:element name="shipTo" type="address"/>
<xsd:element name="billTo" type="address"/>

The commonality of structure is thus more explicit in XML Schema than in either of the other notations.

7. The Post-schema-validation Infoset

Perhaps the most important innovation in XML Schema 1.0 is that schema-based validation provides much more information than the simple yes/no is-this-valid? information provided by DTD-based validation. Information about the simple or complex type assigned to an attribute or element is provided by an XML Schema processor as part of the standard post-schema-validation information set (PSVI). The validity of each element and attribute is checked and recorded separately; this entails a distinction between the concept of full validity, which is recursive and requires that all descendants also be fully valid, and of local validity, which is not recursive. Since schema validation need not start at the root element of the document, and since a schema can direct that the contents of particular elements are not to be validated, or that the elements encountered in particular contexts need not be declared, XML Schema 1.0 effectively introduces a coherent notion of partial validation.

Biography

C. M. Sperberg-McQueen is a member of the Technical Staff of the World Wide Web Consortium, an international membershp organization responsible for developing Web standards. He co-edited the XML 1.0 specification and the [Guidelines] of the Text Encoding Initiative, and he serves as a co-chair of the W3C XML Schema Working Group.