XML 2002 logo

Resolving the Distinction Between Typing and Validating

Abstract

We attempt to unify the traditional markup notion of validation with the process of type assignment used by W3C Schema. We show how validation can be extended to provide typing information for a schema language, such as Relax NG, which doesn't have explicit type declarations.

Keywords


Table of Contents

1. Introduction
2. Modeling Schemas as Grammars
3. Turning a W3C Schema into a Grammar
4. Getting Types Out of Relax
5. Conclusion
Bibliography
Biography

1. Introduction

The traditional mechanism for describing correct markup has been the DTD, a nebulous concept that nonetheless has one very concrete component - a grammar describing correct members of the document type. The process for determining if a document is correct according to its stated grammar is called validation. A validation attempt has only one required output - either "yes", the document is a member of the document type, or "no", the document is not a member. In either case this is a single statement about the entire document. Among recent schema languages, Relax NG (RNG) [RNG]is a proponent of this approach. The W3C Schema language (W3C Schema) [WXS], while continuing to support validation, introduces a new ingredient, typing, to the mix, especially through the addition of the "Post Schema Validation Infoset" (PSVI), i.e., the Infoset available to an application after a document has been validated against a (W3C Schema) schema. W3C Schema mandates a reference to some defining schema construct be available for just about every Information Item found in a document.

The introduction of such a heavyweight process requires justification, and certainly requires a clear relationship back to the underlying concept of validation. The PSVI has certainly been controversial, judging from traffic on XML-DEV, and has clearly not been easy to implement, as supporting validators are only starting to appear.

Nevertheless, the PSVI is only reifying something that application developers do all the time. Knowing that an entire document belongs to a particular schema doesn't tell the application writer which element is which. In writing an application, the programmer must divide the different items described by the schema into groups and determine which elements belong to which group.

Furthermore, the programmer must go through the process of looking at the schema and determining, essentially "if the document was valid, then the validator must have seen a foo element here, and a bar attribute there, so I know I can have access to those." In other words, the work done automatically by the W3C Schema type system is done consciously by the programmer with just a validator.

The preceding leads to the idea that types are just abstractions of validator states. We hope to demonstrate this notion in two directions. First, by converting a W3C Schema into a grammar based system and showing how the additional typing information is easily generated through the validation process, and second by showing how one could infer or overlay a type system onto a validation based system, such as RelaxNG. The ability to "compile" both forms to a single representation opens up the possibility of allowing each language to reference definitions from each other. The result also creates a W3C Schema implementation that's much lower overhead.

2. Modeling Schemas as Grammars

At this point we need a more formal representation of a schema. Given a description of some type, if we're to automatically determine whether a document conforms, we need some kind of machine that can automate the process. There are a variety of abstract machine classes that can serve this purpose, from COBOL style pattern matching at the low end to full scale Turing machines or hand written programs at the overly complex end. Clearly finding a point in this spectrum that is both sufficient to support the vast majority of applications but which is still mathematically tractable is very desireable.

The markup community has zeroed in at a particular level in the chain, hedge regular languages and their corresponding automata [MUR95], [MUR99],[CHID00], [TATA], so we will direct our discussion to such models. RelaxNG is the most popular language directly based on this abstraction. W3C Schema can be generally expressed as such [FD], to the degree that it is understandable. XDuce, an influential programming language for XML, is the first work to actually link forest grammars with types, although that's not explicitly stated. The XQuery formal semantics also refers to tree grammars, while making little direct use of them in its exposition.

These hedge regular languages (as with regular languages generally) have two different representations. On the one hand they can be expressed via grammars, as is universally the case with current schema languages, or they can be expressed as automata. The grammar level is manifestly more useful for human description of a language while the automaton level is generally far more efficient for execution. It turns out they're also very useful for making comparisons among languages.

For example, two grammars - one in W3C Schema and one in RelaxNG - may compile to the same automaton, but look very different. In addition, different grammar systems will provide different rules and support.

We will change slightly the standard definition of a hedge-regular automaton [MUR99]. We define a hedge-regular grammar as a 6-tuple <Σ, X, N, P, R, r f > where:

  • Σis a finite set of symbols

  • X is a finite set of variables, where variables are either of the form v, or the form @v

  • N is a set of non-terminals

  • R is a finite set of production rules, each of the form:

    • r → e, where e is a regular expression over N

  • P is a finite set of production rules, each of one of 2 forms

    • n → x where n ∈ N and x ∈ X

    • n → a<r> where n ∈ N, a ∈ Sigma, and r ∈ R

  • r f is a regular expression over N.

We've added R to the usual definition, which provides absolutely no change to expressive power, but which will simplify the issue of type assignment discussed later.

Hedge-regular grammars are of interest because many existing XML schema languages, including XML Schema [formal], can be represented using HRGs. We will have more to say about the encoding of XML Schema below, when we talk about W3C Schema's two forms of extension.

Closely related,a hedge regular automaton is defined as a 6-tuple <Sigma, X, Q, a, j, F> where:

  • Sigma is a set of symbols

  • X are variables

  • Q is a finite set of states

  • α is a transition relation from Sigma X Q* to 2^Q s.t. for every q ∈ Q and x ∈ X, { q 1 q 2... q n | n >0, α(x, q 1 q 2... q n, q)} is a regular string language. Note that if α is a function Sigma X Q → Q, then this is a deterministic automaton.

  • j is a relation from X to Q.

  • F is a regular language over states.

In this instance, we are specifically interested in automata compiled from the languages specified above. Given that the transition relation is over regular languages, we can express the set of allowed q i as a regular expression or regular automaton. In particular, as we want to compile our languages into automata, we can strengthen the transition relation to α(x, q 1... q n, q) only if q 1... q n ∈ r, r E R, and there is an n ∈ N such that n→x<r> is a production.

Machines operate on documents/hedges. As a validator, such a machine is required to determine if a document is in the language described by the automaton (or in the grammar from which the automaton is compiled):

T accepts u if:

  • T(u) /\ F ≠ 

  • T(a<c>) = {q | α(x, r, q) ∈ α & T(c) /\ L(r) ≠ }

  • T(x) = {q | q ∈ j(x)}

  • T(uv) = {wz | w ∈ T(u) & z ∈ T(v)}, where u, v, w, and z are sequences

Our acceptance criteria is specified specifically in terms of the regular expressions from the grammar. This is to point out that we can represent determining the sequence of states within an element by a (not necessarily deterministic) automaton.

We've adopted the RelaxNG approach of allowing attributes (in the form of terms such as @xsi:type< typeName > ) to appear in content models. An attribute in the element matches an attribute in the content model and is then removed, so a content model such as (@att1<x>,@att1<x>) would be legal, but wouldn't match any XML documents.

We've defined this as an automatic and formal process, but the same process goes on in the mind of the developer, or else there would be little reason to validate. The developer looks at the schema and determines which trace of states the automaton must have followed to be at a particular point, hence the current element must have the desired properties. When this process fails, an error is likely to occur.

At this point we can start talking about the PSVI and types. Validation with an automaton can be seen as a form of implicit typing. If we write an application on the assumption that a document is valid, we're essentially depending on the machine having matched certain items in the document, or on entailment - if the validator matched this, then there must be one of those. In essence, we create implicit types as abstractions of inevitable traces of the validator.

As the significant difference between validating and typing is the production of type assertions about the document, we can present validation as a machine that generates type assignments. We do this by having certain state transitions in the machine generate output as well as consume input. This is a fairly straightforward transducer. However, we must be very careful about the outputs - they're type names. For explicit types we can construct the output directly as we construct the initial automaton, by having certain states output the type, or types, to which some item in the document belongs.

3. Turning a W3C Schema into a Grammar

This brings us to the question of what to type. Types are something defined using our input language, not the automaton. If we return to our definition of a HRG, we find that a production is n→a<r>, where n is a non-terminal, a is a symbol, and r is a regular expression. Which do we want to type? In the W3C Schema case, complex and simple types clearly are constraints on r - the regular expression defining the content of an element. Substitution groups relate sets of symbols that can appear in the same location. Clearly these are types specified over non-terminals - for each member of the substitution group, there is a production for the same non-terminal.

W3C Schema soon looks like an elaborate system to generate a grammar with very particular rules. For example, if there is a regular expression, r, corresponding to a complex type, R, and then there is a derived type, D, extending it with the further expression, d, in the grammar we construct two regular expressions:

  • (r,d) for the content model of D

  • (r | @t,r,d), where @t stands for the xsi:type attribute required for an element of type R to have content of type D.

Note that without that attribute, the regular expression for type R becomes very non-deterministic and may very well be ambiguous - meaning there may be more than one extension of R describing the same language and there'd be no way to tell them apart. We can compile an element, A, of type R into declaration into a production left hand side such as a<r>. If the element in question is nillable, then we can transform a<r> into a<r|@nil=true>.

Using these and similar rules we construct a grammar. We divide our non-terminals into two groups, global and local. For every global element declaration, g<r> we create a global non-terminal of the same name, G, with a production G→g<r>. For every other element, g', for which g is in the substitution group, we create another production, G'-> g<r>. For every local element, l<r>, we create a local non-terminal of the same name, L, and a production L → l<r>. We create the regular expressions roughly as above, where the name of the regular expression is the name of the type.

We build the corresponding W3C Schema automaton roughly as follows, taking advantage the language being top-down deterministic. First we modify the transition relation, α, to potentially output information across each transition. The accepting regular expression has one state for each global element and is a disjunction over these states. We build a finite state automaton for this language; it'll have a start state and a unique transition to each of these accepting states. On each of these transitions, the automaton outputs the corresponding element name. We now have each of the global elements - an LHS of the form a<r>. At each level we build an NDFA corresponding to r. If α has a rule α(a, r', q), where a is the name of a global element, r' a language for its contents, and q the associated non-terminal occuring in r, then any transition in the NDFA for r with q as its input outputs q if q corresponds to a type.

Each regular expression, r, corresponds to a complex type definition. It is composed of other non-terminals. Some of those correspond to global definitions, some to local ones. Those corresponding to global definitions (ref) will use global non-terminals, as above. Those corresponding to local definitions will have their own non-terminals. We build an automaton from that. Now, as an instance will either correspond to the base type or correspond to a derived type, the instance will either have an xsi:type attribute or not. Therefore in the automaton built from the regular expression we will have some transitions corresponding to the base type and other transitions corresponding to xsi:type attributes for derived types. For the first set we output the name of the type, as well as anything else we'd output for that state. For the others we output the value of the xsi:type attribute.

For example, assume the following schema:

<schema>
  <element name="foo" type="bar"/>
  <element name="fuzz" type="buzz" substitutionGroup="foo"/>
  <complexType name="bar">
    <sequence>
       <element name="a" type="int"/>
    </sequence>
  </complexType>
  <complexType name="buzz">
      <complexContent>
         <extension base="bar">
            <choice>
               <sequence>
                  <element name ="b" type="int"/>
                  <element name = "foo" type="bar"/>
               </sequence>
               <element ref="foo"/>
           </choice>
       </extension>
    </complexContent>
  </complexType>
</schema>

We first need a process to distinctly name every declaration. We use an XPath-like shortcut by giving the path through the definitions to the item of interest. We then end up with the following productions:

Foo → foo<bar> Foo → foo<buzz> Foo → fuzz<buzz> Fuzz → fuzz<buzz> Schema → (Foo) bar/a → a int buzz/b → b int buzz/foo → foo<bar>

And the following regular expressions:

bar := (bar/a | (@type='buzz', bar/a, ((buzz/b, buzz/Foo) | Foo))) buzz := (bar/a, ((buzz/b, buzz/Foo) | Foo))

From this we build the following transition relation:

α(foo, bar, Foo) α(foo, buzz, Foo) α(fuzz, buzz, Foo) α(fuzz, buzz, Fuzz) α(foo, bar, LocFoo) α(@type, 'buzz', LocType) α(a, int, LocA) α(b, int, LocB)

We compile bar into the following automaton:

click image for full size view

We compile buzz into the following automaton:

click image for full size view

In both cases, the transition has the state from the automaton, followed by the information to be output for that transition. The transition from q 0 to the following states outputs the type of the content model if there's an xsi:type attribute, and both the type and the element name otherwise. Note, of course, that the same non-terminal may appear multiple times in a content model, so more than one transition may correspond to the same type from the original schema language (for example a sequence of foo, fuzz, foo).

Given the appropriate nesting, the outputs of the automaton is a tree corresponding to the types assigned to the items included in the input tree, corresponding to the PSVI. One significant difference is the use of names here that do not currently exist in W3C Schema. Only globally defined types have names - types defined within elements such as:

<element name="foo">
   <complexType>
   ...
   </complexType>
</element>

are considered as "anonymous types". In addition, there are no unique names for elements - while global elements are always qualified, local elements may be either qualified or not. Therefore several local and global elements may have the same qualified name, but with different structure. The XML Schema Formal Description [fd] names all of these types distinctly by, essentially, identifying unique paths through the namespace containment graph (toplevel constructs have names directly within the namespace, constructs which are local to some other construct, such as an element inside a complexType, have multipart names). Alternatively, in a pure W3C Schema world, we can replace these names with references to the actual schema components. However, one of the great advantages of the use of automata and names rather than pointers, is the ability to apply the same automata regardless of the schema language of origin. This also provides a lower-overhead validator, as detailed information about schema components (as found in the PSVI) does not need to be maintained at runtime. Rather that information can be accessed as necessary, using type names as indices.

4.  Getting Types Out of Relax

Applying types to a RelaxNG schema is a more difficult issue, in particular because the language has paid no attention to issues of type (in particular, there's no - and doesn't require the determinism that is so useful in the W3C Schema case. In addition, RelaxNG describes a transformation from a full language to a core language in which it would be easy for information to be lost. On the other hand, RelaxNG is directly designed to be expressible as a hedge automaton.

As with W3C Schema, we can assign a named non-terminal to every element definition and a language to every content model - these are the items we need to keep track of. As with W3C Schema, we can either use the names given in the schema document, or generate names based on the tree structure of the definitions. While non-determinism in W3C Schema is under strict control, in RelaxNG it runs rampant. We therefore need to make a further distinction between non-deterministic (or n-deterministic) and ambiguous languages. Non-deterministic languages require some amount of lookahead greater than 1 to parse correctly, but in the end there is only one parse. This means that when the children of an element are finally parsed, only one of the possible parses will turn out to have been correct. For an ambiguous language, there may be several possible parses, none of which can be shown to be correct.

For example, we can have the following grammar:

<grammar>
   <start>
      <ref name=”bar.element”/>
   </start>
   <define name="foo.element">
     <element name="foo">
        <group>
          <element name=”a”>
                <empty/>
          </element>
        </group>
     </element>
   </define>
   <define name=”bar.element”>
     <element name="bar">
        <choice>
           <element name=”foo”>
              <group>
                <element name=”b”>
                   </empty>
                </element>
              </group>
           </element>
           <ref name="foo.element"/>
        </choice>
      </element>
   </define> 
</grammar>
  

This gives us the following language:

Foo → foo<r0> Bar →bar<r 1 > Lfoo → foo<r 2 > r 0 → foo/a r 1 → Lfoo | Foo r 2 → bar/foo/b F → Foo

The interesting part here is compiling the automaton for checking r 1 : α(foo, a, Foo) α(foo, b, Lfoo) …

click image for full size view

In the case where the language is unambiguous, as here, we keep track of the sets of non- terminals that may be applied to each item in a content model. Again, we compile NDFAs for all the regular languages in R, using states q0'...qn'. Starting with the start state, q0', at any point when we see an element or attribute, we have a set of current states, and the transition relation gives us a set of potential inputs leading to new states. We make this set the current set of states and prune the set of incoming states - only those states from the input set with transitions across those inputs offered by the transition relation can remain. Others are dropped and we proceed backwards. If the language is unambiguous, then at the end there will be only a single possible state for each input. This then corresponds to our types, and the typing for any particular node may only be determined after validating the entire document, as opposed to assigning them in order.

If the language is ambiguous, then at any point, starting with the root, we have a tree of possible states. This, if anything, comes close to a kind of multiple inheritance. The element may be seen as belonging to any of the possible states. An application may safely choose any of them. Any choice, of course, may invalidate some states for either the parent or the child, and would need to be eliminated through a similar pruning process. T="foo""""his provides a very interesting non-determinism, in that the final decisions about the types of elements is determined by the choices the application makes among the possibilities affirmed by the validator.

5. Conclusion

Of course, the type inferences made from an arbitrary RelaxNG schema through this kind of automated type generator may be less than may be less than totally illuminating for the developer if the schema is developed so that every element is its own type. However, the more that definitions are reused, the smaller the number of types that need to be inferred. Also patterns could be turned into types. For example, because subset is calculable for automata, it is possible to order any set of elements by inclusion. Stick them in a choice group, and one has a makeshift substitution group, which could be tracked by a validator. Another simple way to take advantage of this would be for an application designer to give arbitrary type names directly to elements in the schema. The mechanism above could be used to carry that information through validation to the application.

The goal of this exercise has been to demonstrate that types and the generation of type-specific information from validation is not incompatible with "pure" validation. RelaxNG, as specified, adds no information to a document's infoset. Nevertheless, significant information about a document can be derived automatically from the structure of a schema when that document is valid. A reasonable intermediary position would be to clearly demarcate a document's well-formed infoset from any information added through a validation process.

Typing, in the sense of W3C Schema, then, is reflecting these states/nonterminals out from the automaton/expression and giving them names. Once this is done, for all the interesting non-terminals, some interesting things happen. It is easy to share them. Components become easily reusable - once a component is known to work with a particular type/non- terminal, it can be used anywhere the nonterminal is referenced. Of course, as the non- terminal is now freed from its surroundings, only downward properties can be considered persistent - ancestor and sibling properties can only be determined within a context in which the nonterminal appears. Another side benefit would be the ability to, for example, have a W3C Schema type extend from a RelaxNG type.

It is reuse at the application component level that is the greatest promise of types. Static typing has been shown to be the best way to achieve this. Externalizing types provides the essential bridge from XML to programming languages to enable that.

Bibliography

[RNG] Relax NG SpecificationJames Clark and MURATA Makoto, Oasis, 2001

[MUR95] Murata Makoto, "Forest-Regular Languages and Tree-Regular Languages". Technical Report. Fuji Xerox, Japan, 1995.

[MUR99] Murata Makoto, "Hedge Automata: A Formal Model for XML Schemata", 1999.

[CHID00] Boris Chidlovskii, "Using Regular Tree Automata as XML Schemas", Proceedings of IEEE Advances in Digital Libraries 2000 Washington, USA, 2000, eds., J. Hoppenbrouwers, et al.

[TATA] Hubert Comon, et al, Tree Automata Techniques and Applications, 1998

[WXS] Henry Thompson, et al., XML Schema Part One: Structures, World Wide Web Consortium, May 2001, http://www.w3.org/TR/xmlschema-1/

[FD] Allen Brown, et al., XML Schema: Formal Description, World Wide Web Consortium, Sept. 2001, http://www.w3.org/TR/xmlschema-formal/

Biography

Matthew Fuchs, PhD, is senior architect at Westbridge Technology with a long interest in sending markup over the Web. A charter member of the W3C XML Activity and a member of the Schema Working Group, he pioneered a number of XML related technologies. He received his PhD from NYU in 1995.

Senior Software Architect

Dr. Allen L. Brown, Jr. is a Senior Software Architect at Microsoft with a primary focus on the development of international standards related to Web Services. In the years immediately prior to joining Microsoft, he co-founded and developed a venture-backed startup enterprise focused on digital media management. More than half of Dr. Brown's nearly four decades of professional involvement in computing was spent at the Xerox Corporation where he served, among other things, as a Research Fellow in its Corporate Research and Technology Division and as CTO of its XSoft Division. While much the Xerox era was devoted to co-authoring 75+ papers and 24 US patents, Dr. Brown also contributed directly to products, including architecting some of the key facilities of the landmark Xerox 'Star' system. He has also held academic appointments as Professor of Computer and Information Science at Syracuse University and James Welling Clark Professor at the George Washington University. Dr. Brown earned undergraduate degrees in Mathematics and Chemical Engineering, and a doctoral degree in Artificial Intelligence, all from the Massachusetts Institute of Technology.