XML 2003

Incremental XML Parsing and Validation in a Text Editor

Abstract

XML editors can be divided into text editors and structure editors. In a structure editor, the user interacts with the document as an abstract tree of elements. In a text editor, the user interacts with a document as a sequence of characters or lines of text.

In a normal text editor, a user is not constrained in how they can modify the content of the document: any text can be inserted at any point and any range of text can be deleted. Preserving this characteristic in an XML editor, while providing useful support for XML editing and acceptable performance, presents some challenges.

A normal XML parser or validator starts at the beginning of the document, and processes the entire document until it reaches the end or possibly until it encounters an error. This kind of implementation is not useful for an XML editor. Completely reprocessing the document on every edit cannot scale to large documents. To solve this problem, XML processing must work incrementally: as the document is processed, additional information is recorded, so that when the document is subsequently modified, the necessary reprocessing is minimized.
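To make the idea of recording additional information concrete, here is a minimal Python sketch. It is not the algorithm presented in this talk, which was implemented in Emacs Lisp and is not specified here. It is only an illustration of the general checkpointing technique, under simplifying assumptions: tags never span lines, and comments, CDATA sections, processing instructions and attribute values containing ">" are ignored. The names `scan_line` and `IncrementalScanner` are hypothetical. The scanner caches the stack of open elements at the start of every line. After an edit it resumes from the first changed line and stops as soon as its state matches the cached state of an unchanged line.

```python
import re

# Start tag, end tag, or empty-element tag (simplified: see assumptions above).
TAG = re.compile(r"<(/?)([A-Za-z_:][\w:.\-]*)[^<>]*?(/?)>")


def scan_line(line, stack):
    """Return the open-element stack after `line`, given the stack before it."""
    stack = list(stack)
    for m in TAG.finditer(line):
        closing, name, empty = m.group(1), m.group(2), m.group(3)
        if closing:
            if stack and stack[-1] == name:
                stack.pop()
            # Mismatched end tags are simply ignored in this sketch.
        elif not empty:
            stack.append(name)
    return tuple(stack)


class IncrementalScanner:
    """Caches the scanner state before each line so edits can be reprocessed locally."""

    def __init__(self, lines):
        self.lines = list(lines)
        self.before = [None] * len(self.lines)  # before[i]: stack at start of line i
        self.final = ()                          # stack at end of document
        self._scan_from(0, ())

    def edit(self, start, end, new_lines):
        """Replace lines[start:end] with new_lines; return how many lines were re-scanned."""
        new_lines = list(new_lines)
        # The state before `start` is unaffected by the edit.
        entry = self.before[start] if start < len(self.before) else self.final
        self.lines[start:end] = new_lines
        # States for the new lines are unknown until they are scanned.
        self.before[start:end] = [None] * len(new_lines)
        return self._scan_from(start, entry)

    def _scan_from(self, i, stack):
        scanned = 0
        while i < len(self.lines):
            if self.before[i] == stack:
                # Same state before an unchanged line: everything cached
                # from here to the end of the document is still valid.
                return scanned
            self.before[i] = stack
            stack = scan_line(self.lines[i], stack)
            scanned += 1
            i += 1
        self.final = stack
        return scanned


if __name__ == "__main__":
    doc = ["<doc>", "<p>one</p>", "<p>two</p>", "<p>three</p>", "</doc>"]
    scanner = IncrementalScanner(doc)
    # A balanced change to one line should only require re-scanning that line.
    print(scanner.edit(1, 2, ["<p>one <b>bold</b></p>"]))
    # Removing an end tag changes the state of every later line.
    print(scanner.edit(2, 3, ["<p>two"]), scanner.final)
```

In this sketch the cost of an edit is proportional to the number of lines whose starting state actually changes, rather than to the size of the document. A real editor would need a richer state than the open-element stack (for example namespace bindings and validation state), which is why this should be read as an illustration of the principle only.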

Three kinds of XML processing will be addressed: XML 1.0 parsing, XML Namespaces processing and RELAX NG validation. This session will describe two algorithms that allow all three kinds of processing to be performed incrementally. These algorithms have been implemented for GNU Emacs entirely in Emacs Lisp. This is a particularly challenging environment, since the implementation of Emacs Lisp in GNU Emacs is much slower than typical implementations of languages such as C++, Java or C#, in which a text editor would usually be written. Moreover, GNU Emacs lacks any support for multithreading.

Note that this work is also relevant to W3C XML Schemas, since, for the purposes of validation, W3C XML Schemas (minus integrity constraints) can be translated into RELAX NG schemas.

1. Late-breaking Talk

Since this was a late-breaking talk, the author did not have time to complete the paper for the proceedings.

Biography

James Clark has been involved with SGML and XML for more than 10 years, both in contributing to standards and in creating open source software. James was technical lead of the XML WG during the creation of the XML 1.0 Recommendation. He was editor of the XPath and XSLT Recommendations. He was the main author of the DSSSL (ISO 10179) standard. Currently, he is chair of the OASIS RELAX NG TC and editor of the RELAX NG specification.

The open source software that James has written includes SGML parsers (sgmls and SP), a DSSSL implementation (Jade), XML parsers (expat and XP), an XPath/XSLT processor (XT) and a RELAX NG validator (Jing). Prior to his involvement with SGML and XML, James wrote the GNU groff typesetting system.

James read Mathematics and Philosophy at Merton College, Oxford, where he obtained First Class Honours. James lives in Thailand, where he runs the Thai Open Source Software Center.