XML Europe 2004 logo

Towards a logical foundation for XML Schema

Abstract

This paper defines a logic in which to express constraints on W3C XML Schema components and relationships between components and XML infoset items, for use in a formal rewriting of the W3C XML Schema Recommendations. The logic is essentially a constraint language over path expressions, interpreted equally with respect to an Infoset graph or a schema component graph. By 'logic' I mean the traditional three-part story comprised of a sentential form, a model theory and an interpretation.

Keywords


Table of Contents

1. Introduction
2. Illustrative rewrite
3. A preliminary sketch of the logic
3.1. Sentential form
3.2. The Model
3.3. Signatures
3.4. Interpretation and satisfaction
4. Open questions and next steps
4.1. Component-valued properties identified by name
4.2. Focussing on Infosets
Bibliography
Glossary
Biography

1. Introduction

In the period after the publication of [XML Schema] the W3C's XML Schema Working Group received frequent requests to tighten up the normative definitions of schema components, both in terms of the constraints on their well-formedness and their validation semantics. And as our experience has grown the Working Group has wished to clarify the fundamental nature of schemas as such, for instance when discussing a possible move to subsumption-based restriction checking (see [FSA for Schema Restriction]).

I have thought for some time that the basis for a response to these requests lies in the work of Rounds, Moshier, Johnson, Carpenter and others on a logic of feature structures. This paper is a preliminary sketch of what such a logic might look like, and how it might be used, intended to provide sufficient detail to assess the likely value of the idea, without being anywhere near complete.

The basic idea is to define a logic in which to express constraints on components and relationships between components and infoset items. By 'a logic' I mean a pretty traditional three-part story:

  • A sentential form, that is, an inductive definition of what constitutes a well-formed formula (WFF),

  • A model theory, that is, a set-theoretic definition of possible model,

  • An interpretation, that is, a constructive definition of what it means for a WFF to be 'valid' with respect to a model.

A brief trivial example may help explain what these three things are. First a BNF to define our WFF:

  • birds ::= 'doves' | 'crows' | . . .

  • stuff ::= 'tar' | 'snow' | . . .

  • color ::= 'black' | 'white' | . . .

  • spred ::= 'is' | 'is not'

  • ppred ::= 'are' | 'are not'

  • lcon ::= 'and' | 'or'

  • ssent ::= stuff ' ' spred ' ' color

  • psent ::= birds ' ' ppred ' ' color

  • sent ::= ssent | psent

  • wff ::= sent | '(' wff ') ' lcon ' (' wff ')'

Some examples of WFF are then

  • crows are black

  • snow is white

  • (doves are black) or (doves are not black)

Our models are very simple -- a model consists of a set of individuals (the domain) and a set of named disjoint subsets of that set (the colourings), e.g.

domain

{a, b, c, d, e, f, g, h}

colourings

{black:{a, b, c},white:{d, e, f, g}}

Finally the constructive definition of validity. This is expressed in terms of a mapping from WFFs to conditions on the model. We use double square-brackets []to denote this mapping. We treat our birds and stuff as variables, that is, they have an arbitrary association with individuals in a model. We denote the variable assignment function which establishes this assignment using the Greek letter iota (?). The mapping rules then are as follows, using X and Y is meta-variables over parts of WFFs:

  • [(X) or (Y)] == ([X]) ? ([Y])

  • [(X) and (Y)] == ([X]) ? ([Y])

  • [X is Y] == ?(X) ? [Y]

  • [X are Y] == ?(X) ? [Y]

  • [X is not Y] == ?(X) ? [Y]

  • [X are not Y] == ?(X) ? [Y]

  • [white] == the set named 'white' in the model

  • [black] == the set named 'black' in the model

So, given the specific model above and a variable assignment which maps 'doves' to 'f', we get, for the interpretation of the last example WFF above, the interpretation:

( f ? {a, b, c} ) ? ( f ? {a, b, c} )

which is evidently true. Furthermore, for models in which every individual has a colour, it should be clear that it's a tautology, that is, it's true in all such models for any variable assignment.

This may serve to explain why Frege was not being childish when he, famously, said

[Snow is white] is true just in case [snow] is [white] [punctuation added :-]

So the idea for XML Schema is to define a sentential form which can express all component constraints and validation rules, a model which formalises components and the PSVI, and an interpretation which maps WFF from the sentential form onto set-theoretic constraints on the model. We won't do anywhere near all the work ourselves -- the sentential form is a modest extension of the standard sentential form for first-order predicate calculus with equality, and the model and interpretation bootstrap from both first-order logic and graph theory.

Before beginning the formal definitions required, the next section provides an extended worked example.

2. Illustrative rewrite

There follows hereafter a rewrite in the proposed new approach of the Schema Component Constraint: Complex Type Definition Properties Correct from [XML Schema]. The original looks like this:

  1. The values of the properties of a complex type definition must be as described in the property tableau in The Complex Type Definition Schema Component (§3.4.1), modulo the impact of Missing Sub-components (§5.3).

  2. If the {base type definition} is a simple type definition, the {derivation method} must be extension.

  3. Circular definitions are disallowed, except for the ·ur-type definition·. That is, it must be possible to reach the ·ur-type definition· by repeatedly following the {base type definition}.

  4. Two distinct attribute declarations in the {attribute uses} must not have identical {name}s and {target namespace}s.

  5. Two distinct attribute declarations in the {attribute uses} must not have {type definition}s which are or are derived from ID.

The rewrite would look something like this, with informal glosses on formal statements of the constraints. We use a variable C to stand for the type definition node itself, and double vertical bars to signal de-referencing of a QName, with the sort of component indicated by a subscript, e.g. t for a type definition. We miss out the first constraint, which will be handled elsewhere via signatures for the sorts:

  1. [skipped]

  2. C·base type definition ISA Simple Type Definition

       ?

    C·derivation method = extension

    Derivation from simple type definitions must be by extension only.

  3. ?xs:anyType?t ? C·base type definition*

    Circular derivations are not allowed -- all derivations originate with the ·ur-type definition·.

  4. ? a1, a2 ? C·attribute uses ,

      (a1·name = a2·name ? a1·target namespace = a2·target namespace)

        ?

      a1 = a2

    Only one attribute declaration per qualified name.

  5. ? a1, a2 ? C·attribute uses ,

      (?xs:ID?t ? a1·type definition·base type definition* ?

      ?xs:ID?t ? a2·type definition·base type definition*)

        ?

      a1 = a2

    Only one declaration of type ID allowed.

3. A preliminary sketch of the logic

3.1. Sentential form

We use the same sentential form as standard first order predicate calculus with equality, replacing the standard atoms with paths. A path is a specification of a route through the component/PSVI graph, denoting a (possibly empty) set. It starts from a graph node, and takes any number of property-name-labelled steps (· is used to join steps, * indicates transitive closure):

  • path ::= pathStart ( "·" propertyName "*"? )*

  • pathStart ::= nodeVar | lookup

  • lookup ::= "?" QName "?" sortKey

  • sortKey ::= "t" | "e" | "a" | "k" | "n" | "mg" | "ag"

For example, here's a familiar path, assuming S is a node variable which contains a simple type definition:

S·base type definition·variety

This denotes the {variety} (atomic, list or union) of the {base type definition} of S

The optional star indicates transitive closure, so for example

S·base type definition*·name

would denote a set consisting of the {name}s of every type in the {base type definition} chain up from S.

The double vertical bar indicates QName lookup, with the sort key identifying the sort of the named component. The question of where namespace bindings come from for this remains to be addressed. For example, here's a path from a top-level element named my:doc to its {type definition}:

?my:doc?e·type definition

3.2. The Model

The model starts from the standard formulation of a labelled directed graph, and adds some bells and whistles we'll need. A model is a model for a component language. A component language is four-tuple < S, P, A, G >:

S

a set of sorts, e.g. Complex Type Definition, Attribute Information Item

P

a set of property names e.g. base type definition, validation attempted

A

a union of disjoint sets of atomic values A 1 ? A 2 ? ? A n e.g. strings, numbers, dates

G

a sort signature function G: S ?

   P ×

   {required, optional} ×

   {component, infoitem, micro-component, literal} ×

   {singleton, set, sequence} ×

   A* ? S*

Given a component language, a model for that language is a triple < N, E, L >:

N

a set of nodes

E

a set of directed labelled edges, a subset of N × ( P ? ( P × N ) ) × N

L

a node labelling function L: N ? S ? A

This model does not represent sets as such, but rather by allowing multiple edges with the same label. That is, a property such as [children] will appear in our model as zero or more child edges. The presence of indexed edge labels is to provide for sequences, so that for instance a model group with two particles would be represented in the model by a node with edges labelled <particle,1> and <particle,2>.

We constrain our models to use 1-origin indexing and to make coherent use of indexed labels as follows:

? <n, <p, i>, m> ? E , (i > 0) ? (? j ? 0 < j < i , <n, <p, j>, m> ? E)

We constrain node labels so that atomic values label leaf vertices and sorts label the rest:

? n ? N , (L(n) ? A) = ¬ (? l, m, <n, l, m> in E)

This approach (instead of just partitioning V into sorted internal vertices and atomic values) is perhaps overkill -- it allows us to represent bags of atomic values, which I don't believe is required by anything in the PSVI or component inventory.

3.3. Signatures

Signatures are the way we specify what properties go with what sorts, and what we can expect about their values. For example,

  • G(Attribute Use)

      =

    {<required, required, literal, singleton, {boolean}>,

     <attribute declaration, required, component, singleton, {Attribute Declaration}>,

     <value constraint, optional, micro-component, singleton, {Value Constraint}> }

corresponding to the definition of the Attribute Use component in the [XML Schema].

3.4. Interpretation and satisfaction

We define the relationship between the sentential form and model in terms of the normal satisfaction relation, written with the double-turnstile character (?). Here are enough examples of what we add to the rules we inherit from first-order predicate logic with equality to make the project clear, I hope, without pretending to be complete.

  • M ? n·p1·p2·. . .·pn iff [ n·p1·p2·. . .·pn ] ? {}

    A path on its own is interpreted as requiring that there be something at its end.

  • [ n·p1·p2·. . .·pn ] ==

      {o| ? m ? [ n·p1·p2·. . .·pn-1 ] ,

          (<m, pn, o> ? E) ? (? i, <m, <pn, i>, o> ? E) }

    Path interpretation is defined by induction.

    We're liberal in interpreting simple property names in paths, following both indexed and non-indexed edges.

  • [ n·p1·p2·. . .·pn * ] ==

      {o| ? m ? [ n·p1·p2·. . .·pn-1 ] , (m pn * o) }

    * is for transitive closure -- should shift to relational notation throughout

  • [ n·p1·p2·. . .·pn[i] ] ==

      {o| ? m ? [ n·p1·p2·. . .·pn-1 ] , <m, <pn, i>, o> ? E }

    This means we're strict in interpreting indexed properties.

  • [n] == {?(n)}

    To get a path interpretation started, once we've gotten all the way back to a node variable, we just interpret that with the variable interpretation function.

  • [?nn:ln?t] == {o| L(o·namespace name) = nslookup(nn) ?

      L(o·local name) = ln ?

      L(o) ? {Complex Type Definition, Simple Type Definition} }

    Alternatively, if a path begins with a lookup, we find the node with the right name and sort. This still begs the nslookup question. Note the fact that this set has cardinality <=1 is a condition on valid models which needs to be stated elsewhere.

    Similar clauses will work for k (keys etc.), mg and ag (named group definitions) and n (notations), but e (elements) and a (attributes) will need an extra clause checking that {scope} is global.

    We cheat notationally here and assume that L({n}) is the same as L(n).

  • M ? P ISA s iff [P] = {o} ? L(o) = s

    An ISA constraint wrt a path is satisfied if the path denotes exactly one node, and that node is of the specified sort.

4. Open questions and next steps

This section surveys the areas in which work remains to be done before the logic proposed here might be ready for use in a revised version of [XML Schema].

4.1. Component-valued properties identified by name

As presented above, the logic makes no provision for anything less than complete schemas, with no dangling references -- indeed, the whole issue of schema construction has been ignored. If we want to use this approach to address schema construction and composition, we need to look more closely at this area.

Ignoring annotation-related properties and the schema component itself, [XML Schema] has the following component-valued properties today (* means that when a schema is constructed from a schema document, the value is always established by dereferencing a QName, ? means it may be established by dereferencing):

Attribute_Declaration

a-simple_type_definition?

singleton Simple_Type_Definition

Attribute_Group_Definition

ag-attribute_use

set Attribute_Use

ag-attribute_wildcard

singleton Wildcard

Attribute_Use

attribute_declaration?

singleton Attribute_Declaration

Complex_Type_Definition

ct-attribute_use

set Attribute_Use

ct-attribute_wildcard

singleton Wildcard

ct-base_type_definition?

singleton Type_Definition

Element_Declaration

class_exemplar*

singleton Element_Declaration

identity-constraint_definition

set Identity-Constraint_Definition

type_definition?

singleton Type_Definition

Identity-Constraint_Definition

referenced_key*

singleton Identity-Constraint_Definition

Model_Group

particle

sequence Particle

Model_Group_Definition

model_group

singleton Model_Group

Particle

term?

singleton Term

Simple_Type_Definition

facet

set Facet

fundamental_facet

set Fundamental_Facet

st-base_type_definition?

singleton Simple_Type_Definition

st-item_type_definition?

singleton Simple_Type_Definition

st-member_type_definition?

sequence Simple_Type_Definition

st-primitive_type_definition?

singleton Simple_Type_Definition

There are actually only a small number of cases here:

  1. Declarations in their context of use (Attribute Declarations in Attribute Uses, Element Declarations in Particles), which may be either referenced by name or locally declared;

  2. Type definitions of Declarations, which may be either referenced by name or locally declared;

  3. Base type definitions, almost always referenced by name, a few corner cases wrt ur-type;

  4. Collateral uses: substitution groups, referenced key, item and member type definitions, almost always referenced by QName.

  5. Named attribute groups, which are a major embarassment, because they disappear when referenced. I'm inclined to say we actually need to change that, i.e. change the definition of AttributeUse to allow it to contain a group as well as single attributes. This would also require a different treatment of attribute wildcards.

There are a number of ways we could deal with QName resolution and laziness -- I've picked one to illustrate how it would be formalised.

Suppose the signature of all * and ? properties has in fifth (value sort) place not only every 'real' value sort but also an expanded-name micro-component placeholder sort with a related name (e.g. Element Declaration Name goes with Element Declaration). Now we can actually define what it means for a schema to be complete:

We say a model < N, E, L > is complete if and only if

? o ? N , L(o) ? S ? ¬ ? p x a ss s, (<p, x, component, a, ss>) ? G(L(o)) ? <o, p, s> ? E ? s ? S N

where by S N I mean the set of 'placeholder' components defined above. This formalises the notion of completeness as the state in which no component-valued property of any component has as its value a placeholder component.

4.2. Focussing on Infosets

The design presented here intentionally covers both schema components and infoset infoitems. Some of its complexity derives from needing to cover both -- it would be worth exploring a logic for the infoset alone, comparable to [The Essence of XML].

Bibliography

[XML Schema] Thompson, Henry S., Mendelsohn, N., Maloney, M. and D. Beech, eds. XML Schema Part 1: Structures, W3C, Boston, 2001. Available online as http://www.w3.org/TR/xmlschema-1/. The particular constraint referred to is at http://www.w3.org/XML/Group/2002/09/xmlschema-1/#ct-props-correct.

[FSA for Schema Restriction] Thompson, Henry S. and Richard Tobin, Using Finite State Automata to Implement W3C XML Schema Content Model Validation and Restriction Checking, W3C, London, 2003. Available online as http://www.idealliance.org/papers/dx_xmle03/papers/02-02-05/02-02-05.html.

[The Essence of XML] Siméon, Jérôme and Philip Wadler, The Essence of XML, POPL 2003. Available online as http://homepages.inf.ed.ac.uk/wadler/papers/xml-essence/xml-essence.pdf

Glossary

WFF

well-formed formula

Biography

University of Edinburgh
World Wide Web Consortium

Henry S. Thompson is Managing Director of Markup Technology Ltd. He is also Reader in Artificial Intelligence and Cognitive Science in the Division of Informatics at the University of Edinburgh, based in the Language Technology Group of the Human Communication Research Centre. He received his Ph.D. in Linguistics from the University of California at Berkeley in 1980. His university education was divided between Linguistics and Computer Science, in which he holds an M.Sc. His research interests have ranged widely, including natural language parsing, speech recognition, machine translation evaluation, modelling human lexical access mechanisms, the fine structure of human-human dialogue, language resource creation and architectures for linguistic annotation. His current research is focussed on articulating and extending the architectures of XML. Markup Technology is focussed on delivering the power of XML pipelines for application development. Henry was member of the SGML Working Group of the World Wide Web Consortium which designed XML, is the author of the XED, the first free XML instance editor and co-author of the LT XML toolkit and is currently a member of the XSL and XML Schema Working Groups of the W3C. He currently is a member of the W3C Team, and is lead editor of the Structures part of the XML Schema W3C Recommendation, for which he co-wrote the first publicly available implementation, XSV. He has presented many papers and tutorials on SGML, DSSSL, XML, XSL and XML Schemas in both industrial and public settings over the last five years.