Abstract
Implementing validation and restriction checking for W3C XML Schema content models is harder than for DTDs. This paper gives complete details on how to convert W3C XML Schema content models to Finite State Automata, including handling of numeric exponents and wildcards. Enforcing the Unique Particle Attribution constraint and implementing restriction checking in polynomial time using these FSAs is also described.
Keywords
Table of Contents
W3C XML Schema content models [XML Schema] are richer and more powerful than their XML DTD [XML] equivalent in several ways:
Numeric occurrence ranges;
Substitution groups;
Wildcards.
Content models are also required to obey the Unique Particle Attribution constraint, a weak form of determinism requirement, and, if derived by restriction, to stand in a certain relationship to the content model of the base from which they are derived.
In this paper we present a complete approach to all these issues, based on the translation of content models into augmented Finite State Automata (hereafter FSAs).
Three algorithms are presented:
| Algorithm 1: |
A modified version of Aho and Ullman's algorithm [Aho & Ullman] for converting regular expressions to FSAs, in which edges are labelled with term schema components. The modifications treat substitution group as disjunctions and handle occurrence indicators other than 0, 1 or unbounded by unrolling the sub-expression governed by the indicator; |
| Algorithm 2: |
Enforcement of the Unique Particle Attribution constraint by relabelling the FSA constructed by Section 2.1 with expanded element names, and then checking by cases for ambiguity for element-element, element-wildcard and wildcard-wildcard cases. |
| Algorithm 3: |
The current W3C XML Schema Recommendation [XML Schema] defines valid restriction by cases, in a way which not only is difficult to understand and implement, but which also fails to implement the stated design goal for restriction, namely that a restriction is valid if and only if it accepts a subset of the information items accepted by its base. The W3C XML Schema Working Group has expressed an interest in replacing the current definition with an extensional one, which simply uses the subset formulation. But doing this without specifying how it can be checked would be unhelpful to implementors to say the least. The third algorithm achieves this: A subsumption check algorithm for two augmented FSAs, which essentially works by attempting to parse every path through the derived FSA using the base FSA, with special attention to certain corner cases to do with wildcards. If the parse succeeds, the base FSA subsumes the derived FSA, which in turn therefore accepts a subset of what the base accepts. |
XML DTD content models are regular expressions over an atomic term vocabulary of element names. Alternation and concatenation (indicated with vertical bar (|) and comma (,) respectively) are the basic term constructors for composing complex terms from other terms. In addition the three Kleene operators ?, + and *, indicating optionality, one-or-more and zero-or-more respectively, can be added to any term.
XML Schema goes beyond this in three ways:
| Wildcards |
In addition to element names, XML Schema adds wildcards, which accept any element, possible constrained with respect to namespace, as atomic terms; |
| Substitution groups |
A replacement for one common use of parameter entities, the impact on content models is that a single atomic element term may accept a number of alternative elements; |
| Numeric occurrence ranges |
Replacing and extending Kleene operators on terms, these allow any range of occurrence, with lower bound any non-negative integer, and upper bound and positive integer or infinity. |
The impact of these three additions is that the schema-validation process cannot use existing approachs to regular-expression matching to check that a content model is satisfied. The first algorithm we present addresses this problem, by augmenting the normal algorithm for converting a regular expression to an FSA in ways which accommodate the three differences outlined above. The result can not only serve to check local schema-validity (that is, check that the sequence of element children of a parent in an instance are accepted by the parent's content model), but also as the basis for solving our other two content-model related problems.
The algorithm that follows is a modification of the one beginning on page 95 of [Aho & Ullman].
We need to explain the abstract data model which XML Schema uses for content models in a bit more detail before setting out the algorithm in detail.
The {content model} of a complex type definition is a single ??? .
| particle |
A ??? has three properties:
|
||||||
| Element declaration |
An ??? has three properties of relevance:
|
||||||
| Wildcard |
A ??? has two properties:
|
||||||
| Sequence |
A ??? has one property:
|
||||||
| Choice |
A ??? has one property:
|
XML Schema also defines an all term, but its use is heavily constrained -- in particular it cannot occur within other terms, but only as the ??? of the ??? which is a content model. Given this, and the fact that it does have any simple translation into an ordinary FSA, it is appropriate that it should be handled specially. Accordingly we will have nothing further to say about such particles in this paper.
The algorithm for converting a content model to an FSA has two parts, one for particles and one for terms.
To translate a ??? to an FSA ending at a state S
Set n to S
If the ??? 's ??? is ∞
otherwise ( ??? is numeric)
Now build a chain of ??? copies of the translation of ??? back from n, and return (the start state of) the resulting machine.
To translate a term to an FSA ending at a state S
If the term is a ??? , connect a new state b to S with an edge labelled with the term itself and return b;
Otherwise if the term is an ??? , create a new state b, then for each ??? in its ??? create an edge from b to S labelled with that ??? , and return b
Otherwise if the term is a ??? , create a new state b, translate each particle in its ??? into an FSA ending at S using Tp(S), connect b to the start state of all the results with a lambda edge and return b;
Otherwise (the term is a ??? ), chain translations of each particle in its ??? , in reverse order, back from S and return the first state in the chain.
It should be noted that the non-lambda edges of the resulting automata are all labelled either with ??? or ??? terms.
There are two circumstances in which the algorithms above produce a state with no edges leaving it, which may (although it need not) render the whole FSA unsatisfiable:
An empty ??? , i.e. <xs:choice/>. This results from an empty ??? .
An abstract ??? which is not identified as the substitutionGroup of any other ??? . This results from an empty ??? at the component level.
It should be obvious that FSAs produced by the above algorithm can be used either as-is or after determinisation and/or minimisation to do local schema-validity assessment.
XML Schema's Unique Particle Attribution (UPA) constraint on content models (http://www.w3.org/TR/xmlschema-1/#cos-nonambig in [XML Schema]) reconstructs a constraint inherited by XML DTDs from SGML. Given an FSA constructed from a content model using Section 2.1, it is easy to enforce this constraint by simple manipulations and checks.
Call the translation of a content model into a (non-deterministic) FSA ending at a final state F using Section 2.1 M1;
As noted above, the non-lambda edges in M1 are labelled with simple terms, either ??? or ??? , not element names. Determinise M1 to yield M2, using the algorithm given on page 93 of [Aho & Ullman], collapsing non-lambda edges only if they are labelled with the same simple term.
M2 violates the UPA if it is non-deterministic ignoring term identity, that is, if there is any state in M2 which has two outgoing edges such that any of the following hold:
Their labels are both ??? with the same ??? and ??? .
Their labels are both ??? whose ranges overlap.
Their labels are a ??? and an ??? and the ??? of the ??? is in the range of the ??? .
Since step 3.1 above is by construction operating on an FSA which has been determinised with respect to terms, any name collision is necessarily between two distinct terms from the original content model, thus violating UPA. Steps 3.2 and 3.3 are the obvious additions to handle ??? - ??? and ??? - ??? overlaps respectively.
Note there's nothing special in any of the above for numeric ranges. That is all handled by Section 2.1.
The XML Schema REC [XML Schema] defines two ways by which new complex type definitions can be derived from old, extension and restriction. Restriction, as its name suggests, is intended to define a new type, call it D for 'derived', which accepts a subset of what its base type definition (call this B) accepts. The REC as it stands does not, however, base the formal definition of restriction by appeal to subsetting, but rather by a set of detailed constraints on the allowed relationship between the content models of B and D (see http://www.w3.org/TR/xmlschema-1/#cos-particle-restrict in [XML Schema]). Not only is the current approach complex and difficult to understand (and implement), but it is also inaccurate, in that it both allows some defintions as restrictions which accept more than their base, and bars some definitions which accept lest. There is therefore strong interest in moving to a definition which accurately correctly implements the subset story.
Two broad classes of approach to this can be imagined: One which deals explicitly with types, defined as (infinite) sets of infosets, and one which deals only with type definitions. We adopt the latter approach, as it fits neatly with the use of FSAs set out above, which we have already adopted in our XML Schema implementation [XSV].
Since we are working in the domain of type definitions, it's appropriate to describe our goal as checking whether a the content model of a type definition B subsumes that of a derived type definition D. Here's a detailed sketch of an algorithm which attempts to check subsumption of two FSAs B and D, constructed with Section 2.1 and determinised and checked with Section 2.2:
Initialise SS to be a set containing the unprocessed pair <B.initialState,D.initialState>
Access such pairs with .base for the first member and .derived for the second
Choose an unprocessed pair P from SS. For each edge DE leaving P.derived
If there is an* edge BE leaving P.base which Section 2.3.1.1 DE, then add a pair <BE.end,DE.end> to SS (unless it's already there).
Otherwise fail
Mark P as processed, and repeat from step (2) until no unprocessed pairs are left in SS
Check that for every end-state in D, there is a processed pair in SS which pairs it with an end-state of B, otherwise fail.
*There can be only one such edge because B satisfies UPA.
An edge BE matches another edge DE iff one of the following three cases holds:
They are both labelled with ??? with the same ??? s and ??? s;
BE is labelled with a ??? , DE is labelled with an ??? and DE's ??? is allowed by BE;
BE and DE are both labelled with ??? and DE's label is an intensional subset (defined in [XML Schema]) of BE's.
The above algorithm is too strict in one area, involving ??? . A single ??? in D may require two ??? s in B to cover it, as it were. Consider the following:
B: <xs:choice>
<xs:any ns='a'/>
<xs:any ns='b c d'/>
</xs:choice>vs.
D: <xs:choice>
<xs:any ns='a b'/>
<xs:any ns='c'/>
</xs:choice>The second clearly accepts a subset of what the first accepts, but Section 2.3 will reject it, because no edge in the top choice subsumes the 'a b' edge in the bottom one.
Here's the fix:
This problem only arises in cases where there is more than one wild edge leaving a given state in B and at least one wild edge leaving its paired derived state which is either negated or has more than one allowed namespace;
If the wild edges in B share a destination, then simply compute their union and replace them with it, problem solved;
Otherwise, unpack (see below) the wild edges from the corresponding state in D and use them instead of the actual wild edges therefrom.
In all cases, then proceed with Section 2.3 as given above.
To unpack a set of wild edges with no negated wild edge, simply replace each edge therein which has more than one allowed namespace with a set of wild edges, one for each allowed namespace;
To unpack a set of wild edges which include one negated wild edge, replace it with a non-negated edge which allows any namespace mentioned in the corresponding base ??? s as well as a new created namespace, and proceed as in (1) immediately above.
Note that to avoid violating the UPA there cannot be more than one negated wild edge leaving any state, and when there is one all the non-negated wild edges can only allow namespaces which are negated by the negated one.
The simple example above is covered by case (2) in the fixup above. Here's an example that is covered by (3) and so requires unpacking:
B: <xs:choice>
<xs:seq>
<xs:any ns='a'/>
<xs:elt ref='z'/>
</xs:seq>
<xs:seq>
<xs:any ns='not(a)'/>
<xs:elt ref='z'/>
</xs:seq>
</xs:choice>versus
D: <xs:choice>
<xs:seq>
<xs:any ns='b'/>
<xs:elt ref='z'/>
</xs:seq>
<xs:seq>
<xs:any ns='not(b)'/>
<xs:elt ref='z'/>
</xs:seq>
</xs:choice>The rules for unpacking give us a transformation in two steps:
First D becomes
<xs:choice> <xs:seq> <xs:any ns='b'/> <xs:elt ref='z'/> </xs:seq> <xs:seq> <xs:any ns='q a'/> <xs:elt ref='z'/> </xs:seq> </xs:choice>
then
<xs:choice> <xs:seq> <xs:any ns='b'/> <xs:elt ref='z'/> </xs:seq> <xs:seq> <xs:any ns='q'/> <xs:elt ref='z'/> </xs:seq> <xs:seq> <xs:any ns='a'/> <xs:elt ref='z'/> </xs:seq> </xs:choice>
and it should be clear that the subsumption check will now succeed, as it should.
So far, we've been looking at very local properties of content models, looking at the elements accepted only in terms of their names. To complete the picture, we need to check a few other properties of the ??? s involved, and place a recursive constraint on their type definitions. For this we need to look at some other properties of ??? s:
| {type definition} |
A simple or complex type definition, constraining the content and attributes of the element being declared. |
| {value constraint} |
Optional provision for a default and/or fixed value. |
| {identity constraints} |
A set of key, keyref and unique constraints (see http://www.w3.org/TR/xmlschema-1/#cIdentity-constraint_Definitions in [XML Schema]). |
Each of these properties needs to be checked to ensure valid restriction. The following should be considered as additions to clause (1) in the definition of Section 2.3.1.1 above:
DE's ??? must be either the same as BE's ??? , or declared to be derived from it by (a chain of) restriction.
If BE has a fixed ??? , DE must also have a fixed ??? with the same value.
DE's ??? must be either the same as BE's ??? or a superset thereof.
Clause (1) above is the point at which the necessary recursion into the children of an element and their types is invoked---each step of the derivation chain will itself be checked by Section 2.3.
There is one place where the move beyond local issues discussed immediately above interacts with our earlier treatment of ??? s, and the fact that that treatment ignores the ??? property. This leads to problems in certain cases. Consider the case where B has a ??? and D has one or more explicit
However further thought suggests some interesting possible corner cases which arise when B has a wildcard and D has one or more explicit ??? s:
B: <any ns='xyzzy' processContents='strict'>
D: <elt name='{xyzzy}:foo' type='xs:decimal'/>when there is a global
<elt name='{xyzzy}foo' type='xs:integer'/>??? available.
As defined above, Section 2.3 allows this, via clause (2) of the definition of Section 2.3.1.1. But this is wrong, since D accepts things not accepted by B, for example <foo>3.1415</foo>. To fix this we need to both take account of ??? , and extend the consideration of type definitions from clause (1) to clause (2) of Section 2.3.1.1.
In clause (2) of Section 2.3.1.1, we have a ??? in B which corresponds to an ??? in D. Let EEN be the expanded name (that is, the ??? and ??? ) of the explicit ??? , and ET be its ??? .
If the ??? doesn't allow EEN's namespace name, then clause (2) is fails right away.
If it does allow it, then we need to consider sub-cases based on the ??? 's ??? :
skip: OK, since this means anything is allowed by the ??? in B;
strict: if there exists a top-level ??? whose expanded name is EEN and from whose ??? ET is derived by (a chain of) restriction, then OK, otherwise fail.
lax: if exists a top-level ??? whose expanded name is EEN but ET is not derived from its ??? by (a chain of) restriction, then fail, otherwise (there is no top-level ??? with the right name, or there is and its ??? is OK) OK.
Basically this addition to (2) takes account of the fact that lax and skip ??? may function as if they were an ??? .
Section 2.1 is polynomial in time and exponential in space with respect to the size of the original content model considered as a regular expression.
The time complexity of determinisation is quadratic in the number of states. The time complexity of the UPA checking part of Section 2.2 has a linear term with respect to the size of the input FSA and an exponential term with respect to the maximum fanout of the input FSA.
The time complexity of Section 2.3 is linear with respect to the size of the FSA for D and the product of the maximum fanouts of the two FSAs. It's not exponential because both FSAs are known to respect the Unique Particle Attribution constraint, which is sufficient to render the pseudo-parsing deterministic in all cases. The exponential space requirement arises because of the unfolding approach to numeric exponents of Section 2.1.
The work reported here was supported in part by a W3C Fellowship for the first author and a research contract from Microsoft.
The algorithms presented here have been discussed by the W3C XML Schema Interest Group [XML Schema IG], and a number of improvements derive from comments made there. Particular thanks to Sandy Gao of IBM for identifying several flaws in an early version.
[Aho & Ullman] Aho, A. and J. Ullman Principles of Compiler Design, Addison-Wesley, Reading, MA, 1977.
[XML] Bray, T., Paoli, J., Sperberg-McQueen, C. M. and E. Maler, eds. Extensible Markup Language (XML) 1.0 (Second Edition), W3C, Boston, 2000. Available online as http://www.w3.org/TR/REC-xml.
[XML Schema] Thompson, Henry S., Mendelsohn, N., Maloney, M. and D. Beech, eds. XML Schema Part 1: Structures, W3C, Boston, 2001. Available online as http://www.w3.org/TR/xmlschema-1/.
[XML Schema IG] The W3C XML Schema Interest Group: mailing list archives at http://lists.w3.org/Archives/Member/w3c-xml-schema-ig, W3C members only.
[XSV] Thompson, Henry S. and R. Tobin, XML Schema Validator, W3C and University of Edinburgh, Boston and Edinburgh, 2003. See http://www.ltg.ed.ac.uk/~ht/xsv-status.html for description and download information.
![]() ![]() |
Design & Development by deepX Ltd. |