XML 2002 logo

RELAX NG Validator for Mobile Phones

Abstract

We propose a two-phase validation for RELAX NG. The first phase creates a state machine from a RELAX NG schema. This phase can be performed without having any instance documents. The second phase validates instance documents by using this state machine. This phase does not require schemas in their original forms.

This two-phase validation has three advantages. First, it becomes easier to implement RELAX NG validators on many platforms, since only the second phase has to be ported. Second, validators become lightweight, since the first phase is not necessary at run-time. Third, validation is expected to become faster, because a state machine works in a way much simpler than the schema validation.

Although this approach can be easily applied to DTD-based validation, epressiveness of RELAX NG imposes significant challenges as below:

Non-deterministic tree (or hedge) automata: Since RELAX NG can capture any tree (or hedge) regular language, state machines are required to mimic non-deterministic tree (or hedge) automata.

Attribute-element constraints: RELAX NG allows elements and attributes to be freely combined in one content model. Although such compound content models are powerful in expressing constraints between attributes and elements, our state machines have to handle attributes as well as elements.

Interleaving: RELAX NG supports supports fully-generalized unordered content models (interleaving). Naive construction of state machines for interleaving leads to exponential blowup.

We first introduce state machines for handling RELAX NG in its entirety. Our state machines meet these challenges: (1) our state machines capture non-deterministic tree (or hedge) automata by maintaining a set of (possibly multiple) states per element or attribute; (2) state machines handle attribute-element constraints by converting attribute-element automata to element automata at run time; and (3) state machines simulate variations of shuffle automata for handling interleave patterns without exponential blowup.

Next, we demonstrate how a RELAX NG schema is compiled into such a state machine. This process is done by computing derivatives of content models. We also present ptimization techniques for reducing the size of state machines.

Finally, we show our open-source implementations of RELAX NG validators. As of this writing, two schema compilers (the first-phase) have been implemented, and two run-time systems (Java and Win32/Visual C++,) have been implemented. We also show how these run-time systems interface with other components in their respective environments.

Keywords


Note

This talk was presented at the conference on Thursday December 12 at 11:45 am.

1. Waitlisted Paper

Since this was a waitlisted talk, the author did not prepare a paper for the proceedings.

Biography

MURATA is a co-editor of the RELAX NG specifications of OASIS and is the editor of its predecessor, RELAX Core of ISO/IEC. He has been promoting hedge or tree automata as formal models for schemas since 1994. He was a member of the original XML WG, co-editor of media-type RFCs for XML, and the editor of the Japanese XML Profile.

Member of Technical Staff

Kohsuke Kawaguchi is a schema expert at Sun, working on RELAX NG and W3C XML Schema. He developed Sun's Multi-Schema XML Validator (MSV) and several other RELAX NG related projects.For his personal work on RELAX NG, see http://www.kohsuke.org/