Abstract
The era of the personal computer brought with it a proliferation of proprietary binary file formats. With the rise of the Internet, there has been a need to break down the barriers between machines and between applications, resulting in the current trend toward open, Web-accessible publication formats including XML grammars. Therefore there is a pervasive need for software to convert binary file formats to XML.
Just as XML is at the root of this conversion problem, so is it a basis for a solution. By taking advantage of the XML meta-language at every stage of the data conversion process, one can maximize code re-use.
The author will present research aimed at facilitating and partially automating the process of creating binary-to-XML file format translators. The process consists of the following stages: (i) file format analysis, (ii) creation of a parser, (iii) mapping analysis, (iv) creation of a mapper, (v) serialization of the target format. Here is how XML processing software is leveraged throughout:
First, we define an XML grammar for binary file format schema. Then we write special-purpose parser-generator software that reads in binary file format schema instances. Next, we write a rapid file format analysis tool that allows the user to discover file format schema in an iterative process: at each step a single schema modification is suggested by the user, a new parser is generated, a series of test files are parsed, and the results are dumped as XML for inspection.
The outcome of applying the rapid file format analysis tool to any particular binary format is a parser that handles that format. The outcome of applying this parser to any particular test file is an in-memory tree representation of that file, serializable as XML. Therefore the mapping stage reduces to an XML to XML transformation. In simple cases this mapping can be done efficiently enough with XSLT; in more realistic cases the XSLT needs to be cross-compiled with more traditional languages (e.g. with Java, using XSLTC).
Because the source and target of the mapping are of a common meta-language, it is possible to create a rapid mapping analysis tool that generates the mapper software. Again, the tool is used in an iterative process that culminates in the final mapping software.
As long as the target format is an open standard, one can take advantage of existing software to style, view, search, augment, edit or otherwise process that format. An important special case is the use of SVG (Scalable Vector Graphics) as a target. By translating from a binary format to SVG, one effectively has a viewer for that binary format without having to write any rendering software. The rendering stage is accomplished by standard browser or browser plug-in functionality. With the help of CSS, the view can be customized for user, purpose or device. With the help of SMIL Animation, the view can be dynamic. With the help of JavaScript and the DOM, the view can be interactive, with rich navigation, search or redlining functionality.
Keywords
Table of Contents
File sharing on the Internet requires shared, standard file formats that can be processed by widely available, cross-platform software. For this reason, vendor-specific binary file formats have become unpopular, and are being replaced with XML formats.
These binary formats arose in an environment of individual, mostly unconnected personal computers. The inaccessibility of the formats allowed vendors to maintain market share, since users had a dependency on vendor-specific applications to access their information.
PC software evolved rapidly, leading to a proliferation of binary file formats and file format versions. This became a serious problem for users, who found that file sharing was often impractical and that upgrades to application and operating system software could make their older files inaccessible.
With the emergence of the Web, many more PCs have become connected and demand for file sharing has increased greatly. This implies a distinct advantage for open, standardized text formats such as XML. Consequently, application vendors are scrambling to create and process XML variants of their existing file formats, and to support standard XML grammars produced by the W3C.
This move toward XML grammars has in turn created a demand for binary to XML conversion software, which is the subject of the current paper.
Saving out an XML variant of your application file format only goes part way toward making it Web-publishable. Without instructions on how to render each format, Browser software is unable to display published files in the intended manner. These instructions can take various common forms:
Style sheets that accompany the XML files
ECMAScript that accompanies the XML files
Plug-ins or applets that extend Browser functionality
Browser-native code
Each of these have limitations. The commonly-supported style sheet standards Cascading Style Sheets [CSS2] and Extensible Stylesheet Language [XSL] assume a formatting model based on upright rectangular regions, which is typical of text-based publications. Formatters using CSS or XSL do not natively handle more advanced graphical features; nor can they provide animation or interactivity on their own. ECMAScript can provide animation and interactivity, but it needs to operate on a DOM for a Browser-supported XML grammar, so it is not a complete solution on its own. Plug-ins or applets require the user to agree to installation, or approval of a digital signature, or at least a large download, all of which are potential barriers when a user just wants to view one file. Finally, Browser-native code tends to be Browser-specific with the exception of a very small number of formats, HTML being the most popular of these. HTML on its own is not great as a presentation format, since it has very limited and incompletely specified formatting.
Many XML grammars are for special-purpose use, and for this reason do not tend to be supported by Browsers, even in the case of industry-standard grammars. For example, as of this writing MathML is not directly supported by Internet Explorer or Netscape despite being a W3C Recommendation since April 7, 1998 [MathML1].
By contrast, Scalable Vector Graphics [SVG] is a generalized vector graphic XML language that is rapidly gaining Browser support, both natively and via widely-distributed plug-ins. SVG effectively provides a full-featured interface for drawing arbitrary shapes on the screen, complete with support for text, animation, interactivity and style sheets. This single format can serve as a uniform target for publishing arbitrary content to the Web in a broadly-accessible manner.
The conversion to SVG can either be directly through "Save as SVG" desktop application functionality, or indirectly through "Save as XML" with application-specific XML as the target, followed by XSL Transformations [XSLT] or other code to convert to SVG. By using SVG uniformly throughout your Web applications, you can easily merge content from many different sources and semantic origins. These include specialized file formats as well as data formats. For an approach toward dynamically generating SVG graphs, schematics and other diagrams from XML data sources using XSLT, see the Graphical Stylesheets paper [GS] and presentation on the associated software tool [Catwalk], which can be downloaded from http://www.schemasoft.com/catwalk/. For information on providing highly-interactive GIS solutions by dynamically translating Geography Markup Language [GML] to scripted SVG, see [SVGMaps] and the associated "Cleopatra" software download http://www.schemasoft.com/cleopatra/.
Consider the general problem of creating many file format translators, whether between binary formats, XML formats or otherwise. Whenever engaging in multiple, similar software projects, one always tries to maximize code re-use. Code re-use can take many forms, including:
a library of common functions
a collection of base classes to inherit from
parameterized code in the form of C++ templates
parameterized code in the form of preprocessing macros
implementation of a generic intermediate object model
Existing generalized filter frameworks tend to take advantage of a number of the above strategies. In particular, they tend to rely on a generic intermediate object model. Any translation from format A to format B is done in two steps: an import filter from A to the intermediate object model, followed by an export filter from the intermediate object model to B.
This design scales really well: to support all possible mappings between N formats, you only need 2N filters, rather than the N2-N that would be needed if an intermediate format were not used. However, this design has at least two major drawbacks:
The double conversion is inefficient. There is usually a more direct mapping that avoids the intermediate format.
There tends to be unnecessary loss of information in the conversion, since the intermediate format tends to be a "lowest common denominator".
We seek an alternate design that avoids these problems yet still achieves at least the same level of scalability and code re-use. Fundamentally, the re-use comes from identifying commonalities among all formats and representing these in code. An intermediate format achieves this by identifying sets of similar features across all file formats (e.g. the set of all "path" object models), and creating a prototypical representation of each set of features (e.g. one particular "path" object model). The problem is that this prototypical representation is just another instance, and does not capture the full generality of the set.
What we need, then, is a way to represent entire sets of similar features from different file formats. This suggests some sort of parameterization, or template for the features. However, it is clear that neither C++ templates nor preprocessing macros are sufficient to achieve the required degree of generality. What we need is code generators: programs that take a parameterized description of a feature as input, and output the code that implements that feature, using the full power of a programming language to determine what code is produced by the given parameters.
How, exactly could the generated code implement a feature? Naively, the code generator would write out the class definitions. But surely we have the opportunity to do more. Patterns can be found in the translations between corresponding features, not just in the representation of features. Therefore we seek a formal language for specifying file format translation patterns, and a mechanism for generating translator code from instances of those patterns.
The primary objectives of this paper would be met by a single "universal translator" program, if such a thing were possible. Commonly featured in science fiction novels, the notion of a universal translator seems too good to be true. Perhaps so, but we will do our best to approximate this ideal nontheless.
To begin with, the idea of a universal translator seems much more feasible if we allow input data to be accompanied by inputs that specify the format of the incoming data and the manner in which it is to be interpreted. These additional inputs would ideally be just-in-time-compiled to specific translator programs. However, a first approximation to this ideal is to combine automation with human intervention in the process of generating specific translator programs from general specifications. We will attempt to flesh out and formalize this idea next.
Throughout the discussion, format will be defined to mean any information representation or schema. This could include schemas for streams, database tables, object models, trees, graphs or other base data structures. The physical manifestation of the information could be in computer memory, on disk, as a signal over Internet wires and optical fibers, etc. A translator between formats will be defined as any piece of software that takes information in a source format as input and produces corresponding information in another, target format as output. Parsers, mappers and serializers are regarded as special cases of translators: a parser translates from a stream-based format into an object model, a mapper converts between object models, and a serializer translates from an object model to a stream-based format.
Translators can be connected in sequence to make other translators. Since file formats are stream-based, a file format translator takes streams to streams. In principle, this can be achieved with a parser connected to any number of successive mappers connected to a serializer. In practise, we attempt to optimize performance by utilizing a single mapper stage as in Section 3.
Many examples in this paper use XML to present example objects, XML Schema [Schema] to present the object model (class definitions), and XSLT to present the transformation from source to target format. It is important to recognize that these are strictly a means of presenting or storing the information, and they are not necessarily the encodings actually being proposed for efficient file format translator code. To begin with, in-memory objects do not always need to be serialized as XML—they might represent a filter API or the output of a parser, for example. Secondly, XSLT processors expect XML as input - either serialized, as SAX events, or as an XML DOM implementation. None of these options is necessarily the most efficient interface with a binary format parser, which can create tailored, format-specific events or objects in memory rather than generic SAX events or XML DOM nodes. Although the XML DOM has been implemented in a format-specific way [BizDOM], this requires a (less efficient) interpreted language because the class definitions are effectively constructed at runtime. Furthermore, XSLT is most useful only for the tree-walking aspects of transformation, not the raw mathematical computation aspects. So, at the very least an XSLT solution should be cross-compiled with code written in a more traditional language, such as Java (using XSLTC from http://xml.apache.org/xalan-j/xsltc_usage.html), or C# (using VisualStudio.NET).
Before specializing the analysis to parser, mapper or serializer, we will introduce a formalism applicable to translators in general. The formalism is for describing translation patterns—commonalities discovered among many real-world translation problems. The encoding of these patterns is a crucial step toward automating translator software development.
Translation patterns are specific to target format but independent of source format. They represent commonalities among many translation problems with a given target format. Therefore to find translation patterns, begin by classifying data structures seen in instances of the target format. Each such structure will have a corresponding translation pattern, and the whole problem of translating to a given target format is thereby subdivided into a number of separate translation patterns.
The target structures of interest are those that occur in typical translator output; they cannot be deduced purely from the target format schema. For example, translating a well-structured document format to XHTML leads to an arrangement of headings <h1> through <h6> such that any heading number is at most one more than the previous heading number. This could be expressed by redefining the <body> content model in the XHTML 1.1 DTD with
<!-- The following entity represents all block content except headings -->
<!ENTITY % cont "%Block.class; | %List.class; %Misc.class;">
<!ELEMENT body
((%cont;)*,
(h1,(%cont;)*,
(h2,(%cont;)*,
(h3,(%cont;)*,
(h4,(%cont;)*,
(h5,(%cont;)*,
(h6,(%cont;)*)*
)*
)*
)*
)*
)*
)
>as a result of which the following sample XHTML 1.1 file would no longer validate:
<html>
<head>
<title>Demonstration of header pattern</title>
</head>
<body>
<h1>OK</h1>
<p>content</p>
<h2>OK</h2>
<h3>OK</h3>
<p>content</p>
<h1>OK</h1>
<p>content</p>
<h3>Invalid!</h3>
</body>
</html>The point of this example is not to re-define XHTML 1.1. It is to show that there is a typical target structure that goes beyond anything encoded in the XHTML 1.1 DTD. Associated with this structure is the pattern of translations that give rise to it.
This begs the question: What else is needed to represent a translation pattern, other than its target structure? Our proposed answer takes the form of an object model together with procedures for processing each instance of the object model into a particular unit of translation.
Specifically, for each identified pattern, our goal is to create the following:
An object model for the information represented in the pattern
A code generator that turns instances of the object model into translator code which, in turn, outputs the target structure
A schema for the target structure
In the previous example, the rule about header numbers arises because headers are normally associated with document divisions such as chapters or sections, and the header number represents the depth to which the divisions have been nested. For example, suppose the source to be translated is the following XML file, presentationTips.xml :
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE section [
<!ELEMENT title (#PCDATA)>
<!ELEMENT content (#PCDATA)>
<!ELEMENT section (title,content*,section*)>
]>
<section>
<title>Presentation Tips</title>
<section>
<title>Preparation</title>
<section>
<title>Do...</title>
<content>rehearse</content>
<content>check your zipper</content>
</section>
<section>
<title>Don't...</title>
<content>make 300 slides for a 15-minute presentation</content>
<content>show up late</content>
</section>
</section>
<section>
<title>Delivery</title>
<section>
<title>Do...</title>
<content>speak into the microphone</content>
<content>look at the audience</content>
<content>ask questions</content>
</section>
<section>
<title>Don't...</title>
<content>eat</content>
<content>yawn</content>
<content>mumble about minutiae</content>
</section>
</section>
</section>Although there is only one <title> element type, we can tell whether one title is a subtitle of another by whether its section is a child of the other's section. This leads to a natural translation to XHTML in which the header number corresponding to a given <title> element is the count of <section> ancestors of that element:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html>
<head>
<title>Presentation Tips</title>
</head>
<body>
<h1>Presentation Tips</h1>
<h2>Preparation</h2>
<h3>Do...</h3>
<ul>
<li>rehearse</li>
<li>check your zipper</li>
</ul>
<h3>Don't...</h3>
<ul>
<li>make 300 slides for a 15-minute presentation</li>
<li>show up late</li>
</ul>
<h2>Delivery</h2>
<h3>Do...</h3>
<ul>
<li>speak into the microphone</li>
<li>look at the audience</li>
<li>ask questions</li>
</ul>
<h3>Don't...</h3>
<ul>
<li>eat</li>
<li>yawn</li>
<li>mumble about minutiae</li>
</ul>
</body>
</html>In the source file, the <section> ancestor count of successive <title> tags can go up by at most one at a time. This is just a property of ancestry: you cannot have a grandchild without having a child. Consequently, in the translated file, the header numbers of successive header tags can go up by at most one at a time. This means the translated file contains an instance of the target structure.
The source format in the above example was described by a simple three-line DTD internal to the file:
<!ELEMENT title (#PCDATA)> <!ELEMENT content (#PCDATA)> <!ELEMENT section (title,content*,section*)>
The example is prototypical, and suggests how to generalize our translation pattern. The <section> elements could represent any hierarchy of titled objects, the <title> element content represents the actual header text that needs to be filled in when constructing <h1> to <h6>, and the <content> element is a placeholder for any further content of the titled objects. This establishes three roles for the source elements:
as objects that participate in this translation pattern
as input parameters for those objects, and
as objects that participate in a pattern defined elsewhere.
A DTD snippet for representing these notions in general might look as follows:
<!ELEMENT translation (object)> <!ATTLIST translation uri CDATA #REQUIRED version CDATA #IMPLIED > <!ELEMENT object (object | param)*> <!ATTLIST object name CDATA #IMPLIED id ID #IMPLIED href CDATA #IMPLIED > <!ELEMENT param (#PCDATA)> <!ATTLIST param name CDATA #IMPLIED type CDATA #IMPLIED default CDATA #IMPLIED >
Utilizing this on our example, we would get the following structure:
<translation uri="http://www.schemasoft.com/translations/headers">
<object name="list" id="sections">
<param name="for" type="nodeset"/>
<object name="section">
<param name="title" type="string"/>
<object name="content" href="./content.xml"/>
<object name="list" href="#sections"/>
</object>
</object>
</translation>Three object types appear here: the section object that is being specified herein, a content object that is specified elsewhere, at a relative URI given in the href attribute, and a list object used for aggregating sibling section objects. The input parameters are the title parameters that will be turned into XHTML header text and the for parameters that determine the nodesets over which the list objects iterate. Observe that the list object is defined once, then referenced a second time via URI. This allows the section tags to recursively nest.
The objects corresponding to the elements in presentationTips.xml are arranged in a tree data structure, but their classes are not, since the class of section objects have list members that have section members, a graph cycle. In general, our object models for formats will be trees at the object level but not necessarily at the class level. As you will see in Section 3, these object trees are important for determining a format's default read order and object dependencies. Data access can happen between any two objects using public member functions, so the tree structure is not a restriction on data access. The XML Schema representation for these member function calls will be detailed in Section 2.5.
Although we have found a way to model the translation pattern effectively, we have not provided any indication of how the parameters of the model will be filled in. This brings us to the next topic of discussion: specifying a translator instance.
For each source format and each way of mapping it, there will be a unique translator, formed as a unique instance of the translator pattern. In general, specifying an instance will require a combination of three basic operations: setting data, getting data and manipulating data. We will introduce an XML element for specifying each of these operations:
<const> represents constants for setting data
<query> represents queries for getting data from the source file
<func> represents functions for manipulating data
Each individual parameter in the object model for a translation pattern can be filled in with a constant, a query or a function. In XML terms, "filling in" means placing content between each <param>, </param> pair. If a function is used, then each of its arguments can be filled in with a constant, a query or another function, and so on. Continuing until all arguments are filled, a tree structure emerges, with interior nodes that are functions and leaf nodes that are queries or constants.
As an example, let's specify an instance of the header pattern that reads files with the grammar used in presentationTips.xml . Here is the object model filled in using our new elements:
<?xml version="1.0" encoding="UTF-8"?>
<translation xmlns="http://www.schemasoft.com/schema/translation.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.schemasoft.com/schema/translation.xsd
translation.xsd" uri="http://www.schemasoft.com/translations/headers">
<object name="list" id="sections">
<param name="for">
<query xsi:type="xpathQueryType" xpath="./section"
type="nodeset"/>
</param>
<object name="section">
<param name="title">
<query xsi:type="xpathQueryType" xpath="./title/text()"
type="string"/>
</param>
<object name="content" href="./content.xml"/>
<object name="list" href="#sections"/>
</object>
</object>
</translation>Since the source format is an XML grammar, the queries are done with XPath expressions. The relative XPath expression "./section" is used to select the list of all section children of the current node, and the relative XPath expression "./title/text()" gets the title text from each of the sections.
To show the versatility of translator patterns, we will use the same pattern to make a translator for a completely different source format.
The format is for taxonomy, and is described by the following XML Schema file, taxonomy.xsd:
<?xml version="1.0" encoding="UTF-8"?>
<schema targetNamespace="http://www.schemasoft.com/schema/taxonomy.xsd"
xmlns="http://www.w3.org/2001/XMLSchema"
xmlns:tax="http://www.schemasoft.com/schema/taxonomy.xsd"
elementFormDefault="qualified" attributeFormDefault="unqualified">
<element name="taxonomy">
<annotation>
<documentation>Classification of organisms</documentation>
</annotation>
<complexType>
<sequence>
<element name="title" type="string"/>
<element name="summary" minOccurs="0">
<complexType>
<sequence>
<element ref="tax:note" maxOccurs="unbounded"/>
</sequence>
</complexType>
</element>
<element ref="tax:kingdom" minOccurs="0" maxOccurs="unbounded"/>
</sequence>
</complexType>
</element>
<element name="note" type="string"/>
<element name="kingdom">
<complexType>
<sequence>
<element ref="tax:note" minOccurs="0" maxOccurs="unbounded"/>
<element ref="tax:phylum" minOccurs="0" maxOccurs="unbounded"/>
</sequence>
<attribute name="name" type="string" use="required"/>
</complexType>
</element>
<element name="phylum">
<complexType>
<sequence>
<element ref="tax:note" minOccurs="0" maxOccurs="unbounded"/>
<element ref="tax:class" minOccurs="0" maxOccurs="unbounded"/>
</sequence>
<attribute name="name" type="string" use="required"/>
</complexType>
</element>
<element name="class">
<complexType>
<sequence>
<element ref="tax:note" minOccurs="0" maxOccurs="unbounded"/>
<element ref="tax:order" minOccurs="0" maxOccurs="unbounded"/>
</sequence>
<attribute name="name" type="string" use="required"/>
</complexType>
</element>
<element name="order">
<complexType>
<sequence>
<element ref="tax:note" minOccurs="0" maxOccurs="unbounded"/>
<element ref="tax:family" minOccurs="0" maxOccurs="unbounded"/>
</sequence>
<attribute name="name" type="string" use="required"/>
</complexType>
</element>
<element name="family">
<complexType>
<sequence>
<element ref="tax:note" minOccurs="0" maxOccurs="unbounded"/>
<element ref="tax:genus" minOccurs="0" maxOccurs="unbounded"/>
</sequence>
<attribute name="name" type="string" use="required"/>
</complexType>
</element>
<element name="genus">
<complexType>
<sequence>
<element ref="tax:note" minOccurs="0" maxOccurs="unbounded"/>
<element ref="tax:species" minOccurs="0" maxOccurs="unbounded"/>
</sequence>
<attribute name="name" type="string" use="required"/>
</complexType>
</element>
<element name="species">
<complexType>
<sequence>
<element ref="tax:note" minOccurs="0" maxOccurs="unbounded"/>
</sequence>
<attribute name="name" type="string" use="required"/>
</complexType>
</element>
</schema>Here is an example of a valid taxonomy file, taxonomy.xml :
<?xml version="1.0" encoding="UTF-8"?>
<taxonomy xmlns="http://www.schemasoft.com/schema/taxonomy.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.schemasoft.com/schema/taxonomy.xsd
taxonomy.xsd">
<title>Domestic Zoology</title>
<summary>
<note>A selection of animals sharing your home.</note>
</summary>
<kingdom name="Animalae">
<phylum name="Arthropoda">
<note>The bodies of arthropoda are segmented.</note>
<class name="Insecta">
<order name="Dictyoptera">
<family name="Blattidae">
<genus name="Blatta">
<species name="Blatta orientalis">
<note>Oriental cockroach.</note>
</species>
</genus>
</family>
</order>
<order name="Diptera">
<family name="Muscidae">
<genus name="Musca">
<species name="Musca domestica">
<note>House fly.</note>
</species>
</genus>
</family>
</order>
</class>
<class name="Arachnida">
<order name="Phalangida">
<note>Daddy-long-legs.</note>
</order>
<order name="Araneida">
<note>Spider.</note>
</order>
</class>
<class name="Chilopoda">
<note>Centipede.</note>
</class>
</phylum>
<phylum name="Chordata">
<note>Most pets are chordata.</note>
<class name="Aves">
<order name="Psittaciformes">
<family name="Psittacidae">
<genus name="Melopsittacus">
<species name="Melopsittacus undulatus">
<note>Budgie.</note>
</species>
</genus>
</family>
</order>
</class>
<class name="Mammalia">
<order name="Carnivora">
<family name="Felidae">
<genus name="Felis">
<species name="Felis domestica">
<note>House cat.</note>
</species>
</genus>
</family>
<family name="Canidae">
<genus name="Canis">
<species name="Canis familiaris">
<note>Domesticated dog.</note>
</species>
</genus>
</family>
</order>
<order name="Primates">
<family name="Hominidae">
<genus name="Homo">
<species name="Homo sapiens">
<note>You and me.</note>
</species>
</genus>
</family>
</order>
</class>
</phylum>
</kingdom>
</taxonomy>The taxonomy format can be translated using the following instance of the header translation pattern:
<?xml version="1.0" encoding="UTF-8"?>
<translation xmlns="http://www.schemasoft.com/schema/translation.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.schemasoft.com/schema/translation.xsd
translation.xsd" uri="http://www.schemasoft.com/translations/headers">
<object name="list" id="sections">
<param name="for">
<query xsi:type="xpathQueryType"
xpath="kingdom|phylum|class|order|family|genus|species"
type="nodeset"/>
</param>
<object name="section">
<param name="title">
<func name="concatenate" type="string">
<query xsi:type="xpathQueryType" xpath="local-name()"
type="string"/>
<const value=": " type="string"/>
<query xsi:type="xpathQueryType" xpath="@name"
type="string"/>
</func>
</param>
<object name="content" href="./content.xml"/>
<object name="list" href="#sections"/>
</object>
</object>
</translation>Note that the translation explicitly prepends the relevant tag name (kingdom, phylum, etc.) to the value of each name attribute in the source file. This is one choice for how to translate the file to the header pattern, but not the only choice. Thus translation pattern instances encode more than just source format information; they also encode the chosen nature of the translation.
Now that we have specified a translator pattern instance, our intent is to generate translator code from that specification. The code could be in one of a number of programming languages. Here, for example is XSLT code to process taxonomy.xsd files into XHTML as specified by the translation pattern instance:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"
doctype-public="-//W3C//DTD XHTML 1.1//EN"
doctype-system="http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"/>
<xsl:template match="taxonomy">
<html>
<head>
<title>
<xsl:value-of select="./title/text()"/>
</title>
<style type="text/css">
.intro {margin-left:36pt; margin-right:36pt; font-style:italic}
.subhead {font-size:smaller; text-decoration:underline}
</style>
</head>
<body>
<xsl:apply-templates/>
</body>
</html>
</xsl:template>
<xsl:template match="title"/>
<xsl:template match="summary">
<div class="intro"><xsl:apply-templates/></div>
</xsl:template>
<xsl:template match="note">
<p><xsl:apply-templates/></p>
</xsl:template>
<xsl:template match="kingdom">
<h1><xsl:call-template name="headerText"/></h1>
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="phylum">
<h2><xsl:call-template name="headerText"/></h2>
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="class">
<h3><xsl:call-template name="headerText"/></h3>
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="order">
<h4><xsl:call-template name="headerText"/></h4>
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="family">
<h5><xsl:call-template name="headerText"/></h5>
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="genus">
<h6><xsl:call-template name="headerText"/></h6>
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="species">
<p class="subhead"><xsl:call-template name="headerText"/></p>
<xsl:apply-templates/>
</xsl:template>
<xsl:template name="headerText">
<xsl:value-of select="local-name()"/>
<xsl:text>: </xsl:text>
<xsl:value-of select="@name"/>
</xsl:template>
</xsl:stylesheet>When the above XSLT is run on our example file taxonomy.xml , the result is the following XHTML:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html>
<head>
<title>Domestic Zoology</title>
<style type="text/css">
.intro {margin-left:36pt; margin-right:36pt; font-style:italic}
.subhead {font-size:smaller; text-decoration:underline}
</style>
</head>
<body>
<div class="intro">
<p>A selection of animals sharing your home.</p>
</div>
<h1>kingdom: Animalae</h1>
<h2>phylum: Arthropoda</h2>
<p>The bodies of arthropoda are segmented.</p>
<h3>class: Insecta</h3>
<h4>order: Dictyoptera</h4>
<h5>family: Blattidae</h5>
<h6>genus: Blatta</h6>
<p class="subhead">species: Blatta orientalis</p>
<p>Oriental cockroach.</p>
<h4>order: Diptera</h4>
<h5>family: Muscidae</h5>
<h6>genus: Musca</h6>
<p class="subhead">species: Musca domestica</p>
<p>House fly.</p>
<h3>class: Arachnida</h3>
<h4>order: Phalangida</h4>
<p>Daddy-long-legs.</p>
<h4>order: Araneida</h4>
<p>Spider.</p>
<h3>class: Chilopoda</h3>
<p>Centipede.</p>
<h2>phylum: Chordata</h2>
<p>Most pets are chordata.</p>
<h3>class: Aves</h3>
<h4>order: Psittaciformes</h4>
<h5>family: Psittacidae</h5>
<h6>genus: Melopsittacus</h6>
<p class="subhead">species: Melopsittacus undulatus</p>
<p>Budgie.</p>
<h3>class: Mammalia</h3>
<h4>order: Carnivora</h4>
<h5>family: Felidae</h5>
<h6>genus: Felis</h6>
<p class="subhead">species: Felis domestica</p>
<p>House cat.</p>
<h5>family: Canidae</h5>
<h6>genus: Canis</h6>
<p class="subhead">species: Canis familiaris</p>
<p>Domesticated dog.</p>
<h4>order: Primates</h4>
<h5>family: Hominidae</h5>
<h6>genus: Homo</h6>
<p class="subhead">species: Homo sapiens</p>
<p>You and me.</p>
</body>
</html>The astute reader may have noticed that our example translation pattern instances both referenced the XML Schema file translation.xsd . This schema encodes parameterized object models for translation patterns as well as the function-query-constant trees used to specify instances. translation.xsd is listed below:
<?xml version="1.0" encoding="UTF-8"?>
<schema targetNamespace="http://www.schemasoft.com/schema/translation.xsd"
xmlns="http://www.w3.org/2001/XMLSchema"
xmlns:trans="http://www.schemasoft.com/schema/translation.xsd"
elementFormDefault="qualified" attributeFormDefault="unqualified">
<element name="translation">
<annotation>
<documentation>For translation patterns and instances</documentation>
</annotation>
<complexType>
<sequence>
<element name="object" type="trans:objectType"/>
</sequence>
<attribute name="uri" type="anyURI" use="required"/>
<attribute name="version" type="decimal"/>
</complexType>
</element>
<complexType name="objectType">
<choice minOccurs="0" maxOccurs="unbounded">
<element name="object" type="trans:objectType"/>
<element name="param" type="trans:paramType"/>
</choice>
<attribute name="name" type="Name"/>
<attribute name="id" type="ID"/>
<attribute name="href" type="anyURI"/>
</complexType>
<complexType name="paramType" mixed="true">
<choice minOccurs="0">
<element name="func" type="trans:funcType"/>
<element name="query" type="trans:queryType"/>
<element name="const" type="trans:constType"/>
</choice>
<attribute name="name" type="Name"/>
<attribute name="type" type="string"/>
<attribute name="default" type="string"/>
</complexType>
<complexType name="funcType">
<choice minOccurs="0" maxOccurs="unbounded">
<element name="func" type="trans:funcType"/>
<element name="query" type="trans:queryType"/>
<element name="const" type="trans:constType"/>
</choice>
<attribute name="ns" type="anyURI"/>
<attribute name="class" type="Name"/>
<attribute name="xpath" type="string"/>
<attribute name="name" type="Name" use="required"/>
<attribute name="type" type="string"/>
</complexType>
<complexType name="queryType">
<annotation>
<documentation>Subclassed for various query sources</documentation>
</annotation>
<simpleContent>
<extension base="string">
<attribute name="type" type="string"/>
</extension>
</simpleContent>
</complexType>
<complexType name="constType">
<attribute name="type" type="string"/>
<attribute name="value" type="string" use="required"/>
</complexType>
<complexType name="xpathQueryType">
<simpleContent>
<extension base="trans:queryType">
<attribute name="xpath" type="string" use="required"/>
</extension>
</simpleContent>
</complexType>
<complexType name="sqlQueryType">
<simpleContent>
<extension base="trans:queryType"/>
</simpleContent>
</complexType>
<complexType name="regExpQueryType">
<simpleContent>
<extension base="trans:queryType">
<attribute name="regExp" type="string" use="required"/>
<attribute name="encoding" type="string" default="UTF-8"/>
<attribute name="index" type="nonNegativeInteger" default="1"/>
<attribute name="length" type="trans:nonNegativeCountable"
default="unbounded"/>
<attribute name="overlap" type="boolean" default="true"/>
</extension>
</simpleContent>
</complexType>
<complexType name="binQueryType">
<complexContent>
<extension base="trans:queryType">
<sequence>
<element name="func" type="trans:funcType" minOccurs="0"/>
<element name="range" type="trans:rangeType" minOccurs="0"
maxOccurs="unbounded"/>
</sequence>
<attribute name="endian" default="little">
<simpleType>
<restriction base="NMTOKEN">
<enumeration value="little"/>
<enumeration value="big"/>
</restriction>
</simpleType>
</attribute>
</extension>
</complexContent>
</complexType>
<complexType name="rangeType">
<sequence>
<element name="skip" type="trans:delimiterType" minOccurs="0"
maxOccurs="unbounded"/>
<element name="start" type="trans:delimiterType"/>
<element name="skip" type="trans:delimiterType" minOccurs="0"
maxOccurs="unbounded"/>
<element name="end" type="trans:delimiterType"/>
<element name="range" type="trans:rangeType" minOccurs="0"
maxOccurs="unbounded"/>
<element name="count" type="trans:inputType" minOccurs="0"/>
</sequence>
</complexType>
<complexType name="delimiterType">
<sequence>
<element name="offset" type="trans:inputType" minOccurs="0"/>
<element name="repeat" type="trans:inputType" minOccurs="0"/>
</sequence>
<attribute name="align" type="positiveInteger" default="1"/>
<attribute name="bitsPerUnit" type="positiveInteger" default="8"/>
<attribute name="ref" default="relative">
<simpleType>
<restriction base="NMTOKEN">
<enumeration value="relative"/>
<enumeration value="current"/>
<enumeration value="absolute"/>
</restriction>
</simpleType>
</attribute>
</complexType>
<complexType name="compareDelimiterType">
<complexContent>
<extension base="trans:delimiterType">
<sequence>
<element name="compareTo" type="trans:inputType"/>
</sequence>
<attribute name="size" type="positiveInteger" default="1"/>
<attribute name="interval" type="positiveInteger"/>
<attribute name="relation" default="equal">
<simpleType>
<restriction base="NMTOKEN">
<enumeration value="less"/>
<enumeration value="lessOrEqual"/>
<enumeration value="equal"/>
<enumeration value="greaterOrEqual"/>
<enumeration value="greater"/>
<enumeration value="notEqual"/>
</restriction>
</simpleType>
</attribute>
</extension>
</complexContent>
</complexType>
<complexType name="regExpDelimiterType">
<complexContent>
<extension base="trans:delimiterType">
<attribute name="regExp" type="string" use="required"/>
<attribute name="encoding" type="string" default="UTF-8"/>
<attribute name="matchedChar" default="first">
<simpleType>
<restriction base="NMTOKEN">
<enumeration value="first"/>
<enumeration value="last"/>
</restriction>
</simpleType>
</attribute>
</extension>
</complexContent>
</complexType>
<complexType name="inputType" mixed="true">
<choice minOccurs="0">
<element name="func" type="trans:funcType"/>
<element name="query" type="trans:queryType"/>
<element name="const" type="trans:constType"/>
</choice>
<attribute name="type" type="string"/>
</complexType>
<simpleType name="nonNegativeCountable">
<union>
<simpleType>
<restriction base="nonNegativeInteger"/>
</simpleType>
<simpleType>
<restriction base="string">
<enumeration value="unbounded"/>
</restriction>
</simpleType>
</union>
</simpleType>
</schema>Below is an explanation of the main element types defined in this schema:
<translation>
The <translation> element is outermost, with a uri attribute to uniquely identify the pattern and a version attribute to keep track of ongoing updates to the pattern.
<object>
The pattern's object model is written as a hierarchy of <object> elements, each of which can have a name attribute to identify its class, as well as an href attribute to point to an external definition of the object, or an id attribute to allow others to point to this object definition.
<param>
The object model is parameterized with <param> elements with name attribute to identify the parameter, type attribute to indicate data type expected for the parameter, and default attribute giving the value used when the parameter is not filled in. Parameters are filled in by providing a <func>, <query> or <const> child element.
<func>
Input data is manipulated with functions represented by <func> elements. These are identified by their name attribute, and the return type is as specified by their type attribute. Functions that are methods of objects in the translator's output object tree are indicated by providing an xpath attribute to point to the object, or a class attribute to identify the class of the object. In the latter case, the first argument of the function is the "this" argument that determines which object of the given class is being referenced, unless the function is static, in which case no "this" argument is needed. The ns attribute can be used to identify function libraries or packages with a namespace. When referring to member functions of objects generated by another translation pattern instance, this must match the uri attribute of the corresponding <translation> element.
Each of the child elements of a <func> element describes how to fill one of the function inputs. The allowed child elements are the same as those used to fill in parameters: <func>, <query> or <const>. Queries can return collections, such as the node sets returned by XPaths. Collections are interpreted as arrays for purposes of function input. When an array of objects is specified for an input which is not typed as an array, the interpretation is that the function is to be applied to each element of the array individually, and an array of results is to be returned. This can lead to arrays of arrays being returned if more than one argument is specified in this manner. The interpretations described here are enforced by default in the code generators for translation patterns.
<query>
Each <query> element specifies how to get a unit of information from the source file, and its type attribute indicates the data type of the information once gotten. This attribute is not to be confused with the xsi:type attribute for indicating the inherited type of <query> element used in instance documents. There are four inherited types of <query> element defined in translation.xsd , all by extension of the base queryType. These are:
xpathQueryType for XPath queries on XML document sources
sqlQueryType for SQL queries on database sources
regExpQueryType for regular expression queries on text file sources, using the syntax of [Schema] part 2
binQueryType for queries on binary source files
Each of these has its own content model and attributes specific to the query type. The forthcoming sections will explain xpathQueryType and binQueryType in greater detail. One can add other query types as needed; for example, an XQuery type might be added in future.
<const>
Constants are set with the <const> element by providing the data type in its type attribute and the data value in its value attribute.
A binary to XML translator ultimately takes one stream to another stream. If the data structures of source and target map directly across and happen to be stored in the same sequence, then one can basically follow each read operation with a corresponding write operation, and very little needs to be stored in memory. However, we are rarely ever this lucky. Realistic file format mapping problems generally require an object model for the source format and another for the target format.
These object models are more than just nice object-oriented representations: they improve efficiency by setting up optimal addressing mechanisms for data that is stored and re-used during the mapping process. Furthermore, good design can ensure that only needed information is kept: objects are created, read in and destroyed as needed, and the full set of objects representing a document may never be simultaneously stored. In some circumstances, it is optimal for the objects to only store pointers or references to data rather than the data itself. As long as their member functions are capable of returning translated data, pointers are enough. Since complex translation effectively involves addressing into both a source and a target object model, two object models are generally optimal.
If two successive object models separate the source stream from the target stream, then according to the definitions of Section 2, the translator must have a parser stage followed by a mapper stage followed by a serializer stage. Here is the basic architecture planned for each stage:
An instantiated parser consists of a tree of target objects under class membership. In other words, the children of a given object are among its member objects. The object model is defined by wiring together a collection of parsing patterns. The "wiring" is achieved by setting href attributes on <object> elements of files adhering to the translation.xsd schema, as explained in Section 2.5.
Each target object implements a parse interface, including a method to read itself in. The code to do this is determined by the function-query-constant trees assigned to the pattern object model's parameters. A binary file format parser uses the binQueryType of <query> element to specify file access.
It is important not to confuse the translation pattern object model with the target object model. The translation pattern objects exist at design time only, are represented by <object> elements, and contain the parameter inputs to the code generators. The target objects exist at run time only, include member functions that can be called using <func> elements, and are created by the generated code. The two are related, but in general they are different, as was apparent from the examples of Section 2.
The mapper architecture is very similar to the parser architecture, except that the mapper's source is the parser's target, and it is queried with the xpathQueryType of <query> element. Although the queries are of in-memory objects, not XML, the xpathQueryType is used to specify the queries nonetheless. The idea is that there exists a prescription for serializing the mapper's source objects as XML, and it is convenient to describe object queries in terms of queries on the prescribed XML serialization of the objects, even if that serialization is not ever performed.
The most efficient way to do the final serialization step is via an interface implemented by each of the objects in the mapper target format. This is optimized so that each mapper target object serializes itself as soon as all its information has been read in from the parser's target object tree.
Serialization is an optional part of the architecture, since it is not needed for import filters. On the other hand, serialization is useful for test purposes during development. Therefore both the parser and mapper target objects will implement the serialization interface in debug builds.
The high-level view is that data flows from the input file to the parser's tree of target objects, to the mapper's tree of target objects, to the output file. Within the parser and the mapper, individual pieces of data flow from leaf to root in function-query-constant trees, or more precisely in corresponding generated code structures associated with target objects. The details are revealed by a closer look at the element types defined in translation.xsd —particularly the <query> element types, since this is where the data access occurs.
The binQueryType content model includes <range> elements used to specify byte ranges. <range> elements can contain other <range> elements, so that the final data query can be arrived at via a process of refinement in which you identify ranges, sub-ranges, sub-sub-ranges, etc.
Furthermore, the binQueryType content model allows an optional <func> element. If present, this means that the query is to be performed on the stream output of the function identified by the <func> element rather than on the source file stream. That function can of course be a member function of a parser target object, and its inputs can be specified as file stream read operations and/or further function calls. The implications for data flow are that parts of the file stream can be processed in stages, each intermediate stage being another stream. The formalism and code generator are capable of handling these intermediate streams in the same manner as the original file stream. This is useful when the source format has any of the following features:
Structured storage, such as OLE DocFile or RIFF (Resource Interchange File Format)
Compression, such as Huffman, Lempel-Ziv-Welch, arithmetic coding, run-length coding, discrete cosine transform, or fractal
Encoding, such as Base64, UUencode, BinHex, Quoted-Printable or yEnc
Encryption, such as RSA, Diffie-Hellman, DES, Blowfish, IDEA, or RC4
Embedded formats
Proprietary obfuscation algorithms
Binary queries can access arrays of data just as XPath queries can. In particular, multiple sibling <range> elements are interpreted as an array of values. Furthermore, an optional <count> child of the <range> element can be used to access repetitive byte sequences as arrays. This <count> element specifies the maximum number of times to repeat the range specification; the end of the stream or sub-stream might be reached first.
Often, the <count> data is something that must be read from file, rather than a constant. The same is true of other elements used to express binary queries; namely, <offset>, <repeat> and <compareTo> (these elements will be further explained below). For this reason, all four elements can have separate function-query-constant trees descending from them (they all have type inputType ). We have already seen that <range> nodes can have <func> children. Thus binQueryType <query> elements have the unique feature that further <query> elements can be descendants to any depth, and there are many ways in which this can occur.
Each <range> element has <start> and <end> children to delimit the start and end of a byte sequence or bit sequence that is to be read, and these delimiter elements can be interspersed with any number of <skip> elements to indicate ranges that are skipped over, not read. <start>, <end> and <skip> elements are all of type delimiterType , and allow you to specify stream position by any of the following kinds of operations, in the following order:
Match based on stream content. The details depend on the inherited sub-type of delimiterType:
compareDelimiterType to match based on numerical equality or inequality. The comparison is with a <compareTo> child, which is of type inputType, and the nature of the comparison is given by the attribute relation . The compared stream content is of size given by the attribute size , and the address is incremented by an amount given by the attribute interval after each unsuccessful comparison. The start of the first successful comparison is the matched address.
regExpDelimiterType to do string matching using a regular expression. The regExp attribute provides a regular expression in the syntax of [Schema], the encoding attribute determines how byte sequences are to be interpreted as strings, and the matchedChar attribute determines whether the matched address is that of the first or last character of the matching content.
Apply an offset to the address with an <offset> child element, which is of type inputType.
Align to the next nearest absolute address that is the specific multiple of bytes or bits given by the align attribute value.
Repeat the above operations a number of times given by the <repeat> child element, which is of type inputType, culminating with the actual delimiter position at the end of the last operation.
delimiterType elements also share the following attributes that determine the meaning of numerical address values:
bitsPerUnit provides the number of bits per address increment assumed for the <offset>, align, size and interval data; e.g. set bitsPerUnit="1" for bit counts, or bitsPerUnit="8" for byte counts
ref determines whether the delimiter definition is "absolute", "relative" or "current", where the meaning of these terms is analogous to their meaning in the XPath specification. Absolute addresses are file stream positions starting at zero, relative addresses are the amount that has to be added to the prior delimiter element's position (among previous sibling or ancestor <range> elements within a common <query> subtree), and current addresses are the amount that has to be added to the current context position, determined by the nearest-ancestor <query> element.
These attributes were chosen to accommodate the typical forms of binary file format data access needed by a parser. Binary data access has been modelled as a set of pointers into the file stream, making it easier to generate code that stores pointers, then fetches the data only when it is needed, and without needlessly making intermediate copies of the data.
The xpathQueryType of <query> element has a required xpath attribute that encodes its data as a standard XPath. The XPath is interpreted as acting on the source object model, where objects are interpreted as nodes and member objects are interpreted as child nodes. The context for the XPath is set by the code generated from the mapper pattern, starting with a context node assignable to each mapper pattern object. For example, in Section 2.3, the "list" object had a "for" parameter that determined the context of each of its children.
XPath expressions that return node sets are normally turned into code that returns arrays as defined in the target language of the code generator.
Control begins at the root of the mapper target object tree. In the case of a translator that serializes its output, control begins with a call to the root object's "write" member function. As with other objects in the target object tree, this will trigger calls into the parser target object tree to access its member data, which will be written out, interspersed with calls to the "write" functions of child objects (which are created as needed).
When parser target objects are queried, they will read their data in, create any other objects on which they depend (if not already created), and query any other objects on which they depend. Generally, dependence happens along lines of ancestry.
Much of this paper is focussed on a grammar for describing and abstracting format translation problems. The essential practical significance of such a grammar is that it can be used as the basis for code generators: it defines a code generator input format, and it determines much about the architecture of the code generators. Having dealt with the architecture of generated code in the previous chapter, we will now briefly touch upon the architecture of the code generators themselves—including parser generators and mapper generators for each individual pattern.
In the case that our code generators are to produce filters for desktop applications, it is important that the format mappings be as efficient as hand-coded, natively compiled C++. Therefore in these situations we choose C++ as the programming language of the generated code. The code generator itself is used at design time and has no such efficiency constraints. It consumes XML written in the grammar of translation.xsd and it performs a very general templating function driven by the input XML. XSLT is ideal for such a task, so we choose it as the programming language of the code generator. Note that this XSLT has C++ as its target, and it is run once for each translation pattern.
A file format translator consists of general code infrastructure into which specific, generated code components plug in. The general code takes the form of a set of libraries that are statically linked, and is needed to look after things like setting context variables, managing input and output streams, and managing information that permits streaming, progressive rendering and memory use optimizations. The specific code components plug in by inheriting from library base classes and implementing library interfaces. These interfaces include handlers for the various parsing, mapping and serializing tasks, and they provide access to runtime variables affecting the specific code.
There are "default" code-generators; master XSLT files that are imported into each individual pattern's XSLT file, and whose individual code-generating templates can be overridden. The default code-generators will generate C++ classes and member variables with naming and membership corresponding to the translation pattern's object model, and they will provide a generic treatment for function-query-constant trees. Pattern-specific overrides can, for example, create a target object model different from the pattern object model and can add member functions specific to each object.
Parser pattern overrides tend to leave the default target object model code generation in place (but will change other things), whereas mapper pattern overrides tend to replace the default with code generation for target objects that is unique to each pattern. This reflects the assertion that a parser's job is to faithfully construct an in-memory object representation of the input information without fundamentally changing or re-interpreting that information, whereas a mapper's job is to apply whatever re-interpretation is necessary to optimally map to target format constructs.
To rapidly develop a file format parser, the software developer would complete the following sequence of steps:
Discover or hypothesize a parsing pattern used by the source file format
Create this pattern if it does not already exist (create a parameterized object model, a code generator and a schema for the target object model)
Add this pattern to the parsing pattern collection, and include the appropriate URLs to wire it in
Fill in the parameters with function-query-constant trees
Generate a new parser
Parse example input files, and serialize the parser output
Verify the output
Modify the existing pattern and/or parameter inputs and repeat steps 3-7 until the output is as expected
Repeat steps 1-8 until a complete parser has been created
Part of the ongoing research program at SchemaSoft is the creation of a GUI tool to facilitate this rapid parser development workflow.
Mappers are rapidly created according to a similar prescription, except that the mapper patterns are chosen and filled out based on an analysis of the comparative semantics of source and target formats, whereas parser patterns are chosen and filled out based on incremental file format analysis. Consequently, there is a different GUI tool for rapid mapper development.
For illustrative purposes, we will invent a binary file format called PolyDraw, which we imagine to be used for simple GIS and CAD drawings. We'll only specify enough of the format for illustrative examples, so don't start writing applications for PolyDraw yet!
The Map shown in Figure 1 is an example of the sort of content that can be encoded in PolyDraw.
This map can be represented as an area which is subdivided into a number of immediately adjacent regions, as is typical of maps. Map regions are typically polygons, since they are normally defined by a finite set of boundary points obtained from survey data, and polygons interpolate the survey data in the simplest possible manner—linearly. These regions share polygonal boundary segments, and the same vertex data is repeatedly used in constructing all the regions. In the Great Lakes example, country boundaries share some vertices with province/state boundaries and with lake boundaries, while each province/state shares vertices with its neighbouring provinces and states. To avoid unnecessary redundancy, the source format might store a master list of points, with vertices recorded as indexes into that list.
The individual regions would normally also have some sort of styling information associated with them. For simplicity, we will limit allowed styling to colours that can be applied either to the interior or boundary of each region. In Figure 1, Canada is given a light green interior, U.S.A. is given a dark green interior, all states and provinces are given a red boundary, and all lakes are given a light blue interior. The PolyDraw format might internally reference these rendering styles by looking up a unique ID in a stored style table. This allows us to demonstrate a second, distinct lookup mechanism within the same example.
Similar principles apply to planar projections of facetted shapes in CAD formats, which is why PolyDraw doubles as a simplistic CAD format. If you start with a 3D scene consisting of a set of polyhedra, apply a lighting or shading model to determine the colour of each facet, and project the scene into a 2D view, then the result is a collection of coloured polygons with shared vertices. Real-world CAD formats typically store such 2D views along with the full 3D model. See Figure 2 for a simple cube example.
Here is a proposed "Subdivided Area" target object model for the styled polygons with shared boundaries included in the PolyDraw file format. The object model is depicted in Figure 3 as a UML diagram, following which is a corresponding XML Schema file:
<?xml version="1.0" encoding="UTF-8"?> <schema xmlns="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.schemasoft.com/schema/subdividedArea.xsd" xmlns:area="http://www.schema