XML 2002 logo

RELAX NG Validator for Mobile Phones

Abstract

We propose a two-phase validation for RELAX NG. The first phase creates a state machine from a RELAX NG schema. This phase can be performed without having any instance documents. The second phase validates instance documents by using this state machine. This phase does not require schemas in their original forms.

This two-phase validation has three advantages. First, it becomes easier to implement RELAX NG validators on many platforms, since only the second phase has to be ported. Second, validators become lightweight, since the first phase is not necessary at run-time. Third, validation is expected to become faster, because a state machine works in a way much simpler than the schema validation.

Although this approach can be easily applied to DTD-based validation, epressiveness of RELAX NG imposes significant challenges as below:

Non-deterministic tree (or hedge) automata: Since RELAX NG can capture any tree (or hedge) regular language, state machines are required to mimic non-deterministic tree (or hedge) automata.

Attribute-element constraints: RELAX NG allows elements and attributes to be freely combined in one content model. Although such compound content models are powerful in expressing constraints between attributes and elements, our state machines have to handle attributes as well as elements.

Interleaving: RELAX NG supports supports fully-generalized unordered content models (interleaving). Naive construction of state machines for interleaving leads to exponential blowup.

We first introduce state machines for handling RELAX NG in its entirety. Our state machines meet these challenges: (1) our state machines capture non-deterministic tree (or hedge) automata by maintaining a set of (possibly multiple) states per element or attribute; (2) state machines handle attribute-element constraints by converting attribute-element automata to element automata at run time; and (3) state machines simulate variations of shuffle automata for handling interleave patterns without exponential blowup.

Next, we demonstrate how a RELAX NG schema is compiled into such a state machine. This process is done by computing derivatives of content models. We also present ptimization techniques for reducing the size of state machines.

Finally, we show our open-source implementations of RELAX NG validators. As of this writing, two schema compilers (the first-phase) have been implemented, and two run-time systems (Java and Win32/Visual C++,) have been implemented. We also show how these run-time systems interface with other components in their respective environments.

Keywords