XML 2002

XML For The Masses - An XML Based File Format for Office Documents

Abstract

The Open Source office productivity suite OpenOffice.org features a new file format utilizing the Extensible Markup Language, XML. With over 450 elements and more than 1600 attributes, the resulting DTD for this format can justly be called one of the more complex applications of XML. With several million downloads of the supporting Open Source application, and with a commercial offering reaching #1 sales rank on Amazon in several countries, this format can also be called one of the more successful applications of XML. Specific features of the XML file format have made this success possible.

In this talk, we will examine the design rationale of the OpenOffice.org XML File Format. We will also show how XML streaming can be used inside applications to simplify document processing. Finally, we will introduce some uses of the format outside of its supporting applications.

In the format definition, a fundamental decision was made that the format was to be deliberately designed: to fully realize the XML promise of data exchange, it is not sufficient to simply encode existing program structures in XML syntax. Instead, an explicit, reviewed design process was established to ensure that additional benefits could be realized by the format. This common theme led to the definition of three design principles which governed the format definition:

1. use existing standards - don't reinvent the wheel

The use of existing standards is embodied by generous 'borrowing' of elements and structures from e.g. HTML, XSL, CSS, SVG, Dublin Core, XLink, and MathML.

2. transformability - the format must be usable outside of the office application

A consistent design makes it possible for transformation developers to focus on their area of interest, allowing them to ignore the remainder of the format. Similarly, limited redundancy between presentation and content allows processing tools to deal with either aspect on its own.

The format features a unique approach for dealing with layout and content of a document, in that both must be contained in an office document to allow faithful output reproduction, but should be separate to allow easy processing and generation.

3. first class XML - all structured content must be accessible through XML structures

All structured information embodied in the document must be accessible as XML elements and attributes, thus making them fully accessible to XSLT and similar XML based tools. No information is stored in 'special' comments or names, and no information is encoded in strings that require elaborate parsing to be understood.
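
As a rough illustration of the third principle, the sketch below reads typed cell values directly from attributes, without parsing the displayed text. The fragment, the namespace URI `urn:example:table`, and all element and attribute names are hypothetical and are not taken from the actual OpenOffice.org DTD; the `typed_cells` helper is likewise an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: names and namespace are illustrative only,
# not the actual OpenOffice.org vocabulary.
DOC = """\
<document xmlns:t="urn:example:table">
  <t:table t:name="Sales">
    <t:row>
      <t:cell t:value-type="float" t:value="1234.5">1,234.50</t:cell>
      <t:cell t:value-type="date" t:date-value="2002-12-10">Dec 10, 2002</t:cell>
    </t:row>
  </t:table>
</document>
"""

NS = {"t": "urn:example:table"}
T = "{urn:example:table}"  # ElementTree's expanded-name prefix for attributes


def typed_cells(xml_text):
    """Return cell values taken from attributes, ignoring the display text."""
    root = ET.fromstring(xml_text)
    values = []
    for cell in root.iterfind(".//t:cell", NS):
        value_type = cell.get(T + "value-type")
        if value_type == "float":
            values.append(float(cell.get(T + "value")))
        elif value_type == "date":
            values.append(cell.get(T + "date-value"))
    return values


print(typed_cells(DOC))  # → [1234.5, '2002-12-10']
```

Because the machine-readable value sits in its own attribute, a tool never has to guess the locale-dependent formatting of the displayed string.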

This office document XML format is also used within the application for file format conversion, or "filters" in OpenOffice.org parlance. Using XML turns file format filters into XML transformations to or from the office document XML format. This use of a documented, human-readable format simplifies both filter development and debugging. The inefficiencies associated with XML processing for large documents can be mitigated by using XML pipelines based on the Simple API for XML, SAX.
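
The idea of a SAX-based pipeline can be sketched with Python's standard `xml.sax` module, used here as a stand-in and not as the application's own filter API. Each stage receives events from its parent and forwards them downstream, so no stage has to hold the whole document in memory. The stage classes `DropElements` and `RenameElements`, the `run_pipeline` helper, and the element names `para`, `p`, and `note` are assumptions made for this example.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class DropElements(XMLFilterBase):
    """Pipeline stage that drops the named elements and everything inside them."""

    def __init__(self, parent, names):
        super().__init__(parent)
        self._names = names
        self._depth = 0  # > 0 while inside a dropped element

    def startElement(self, name, attrs):
        if self._depth or name in self._names:
            self._depth += 1
        else:
            super().startElement(name, attrs)

    def endElement(self, name):
        if self._depth:
            self._depth -= 1
        else:
            super().endElement(name)

    def characters(self, content):
        if not self._depth:
            super().characters(content)


class RenameElements(XMLFilterBase):
    """Pipeline stage that renames elements according to a mapping."""

    def __init__(self, parent, mapping):
        super().__init__(parent)
        self._mapping = mapping

    def startElement(self, name, attrs):
        super().startElement(self._mapping.get(name, name), attrs)

    def endElement(self, name):
        super().endElement(self._mapping.get(name, name))


def run_pipeline(xml_text):
    """Parser -> DropElements -> RenameElements -> serializer, all event-driven."""
    out = io.StringIO()
    reader = xml.sax.make_parser()
    stage1 = DropElements(reader, {"note"})
    stage2 = RenameElements(stage1, {"para": "p"})
    stage2.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    stage2.parse(io.StringIO(xml_text))
    return out.getvalue()


result = run_pipeline("<doc><para>Hello</para><note>internal</note></doc>")
print(result)
```

Stages can be added, removed, or reordered without touching the others, which is the property that makes this style attractive for import and export filters.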

Storing documents in a transformation friendly XML format allows users to access and manipulate their office documents using standardized tools. Support for attaching arbitrary XML attributes to certain XML elements should foster better integration of office documents into content management systems or custom solutions.
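
A minimal sketch of how such custom attributes might be used by an external tool follows. The namespace `urn:example:cms`, the `review-status` attribute, the simplified `document`/`p` markup, and both helper functions are hypothetical; which elements of the real format accept foreign attributes is not specified here.

```python
import xml.etree.ElementTree as ET

CMS_NS = "urn:example:cms"  # hypothetical namespace for a content management system
ET.register_namespace("cms", CMS_NS)

# Simplified stand-in for an office document; not the real format.
DOC = "<document><p>Intro</p><p>Pricing details</p></document>"


def tag_paragraph(root, index, key, value):
    """Attach a namespaced custom attribute to the paragraph at the given index."""
    paragraphs = root.findall("p")
    paragraphs[index].set("{%s}%s" % (CMS_NS, key), value)


def find_tagged(root, key, value):
    """Return the text of all paragraphs carrying the given custom attribute value."""
    attr = "{%s}%s" % (CMS_NS, key)
    return [p.text for p in root.iter("p") if p.get(attr) == value]


root = ET.fromstring(DOC)
tag_paragraph(root, 1, "review-status", "pending")

# Round-trip through serialization to show the attribute survives as plain XML.
serialized = ET.tostring(root, encoding="unicode")
reparsed = ET.fromstring(serialized)
print(find_tagged(reparsed, "review-status", "pending"))  # → ['Pricing details']
```

Keeping the custom data in its own namespace lets generic XML tools find it while the rest of the document stays untouched.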

1. Late-breaking Talk

Since this was a late-breaking talk, the author did not have time to complete the paper for the proceedings.

Biography

Daniel Vogelheim is a software engineer at Sun Microsystems, Inc. and a co-architect of the OpenOffice.org XML file format. He has been a significant contributor to OpenOffice.org since its launch in 2000, working mainly on the XML file format and the Writer word processing application.