XML Europe 2002 logo

Generalised XML extraction from relational databases based on ad-hoc XML schema.

Abstract

Relational databases are a mature technology that dominates the implementation of database systems. XML is a new technology that is attracting a great deal of interest and there are indications that it may dominate some areas of information system development, particularly amongst distributed systems. The role of XML within information systems and legacy information systems is still to be precisely determined however.

This paper describes a generalised mapping between relational databases and XML documents that is based on the tree structure of the Document Object Model (DOM). It also describes XR2 (XML to Relational Translation), an implementation of the solution we describe

XR2 automates the major functions involved in the solution proposed, which achieves a mapping from a relational database to an XML format. These are:

1. Facilitate the specification of an XML schema that maps from the metadata of a relational database to the XML output structure that a user requires.

Given the XML schema produced in 1.,

a. XR2 will construct queries to extract the data required.

b. XR2 generates output that places XML formatting appropriate to the data retrieved by the queries.

We have investigated the use of two types of XML schema for the purpose of database extraction, namely Document Type Definitions (DTD) and the schema specified by the XML-Data proposal of the World Wide Web Consortium. XR2 generates both DTD and XML schema. XR2 will also extract data from a database based on a DTD or XML schema. While it is possible to achieve an extraction process using either a DTD or XML schema, it is simpler to use an XML schema, the reasons for which will be explained in subsequent sections.

We describe a general solution to the problem of extracting XML formatted data from a database. The solution described is to a generalised problem, that is, only natural assumptions are made on the structure of the database and on the structure of the XML documents to be produced. A natural assumption is taken here to mean that it is possible to create queries on the database that will result in the production of the XML document.

The solution we describe can be applied independently of the database platform, by conforming to standard relational database mechanisms (SQL). The current implementation of the solution operates external to the database, examining database structure and querying the database when necessary. The extraction process can be based on ad-hoc XML schema. That is, the extraction can operate on a portion of the database defined in a given XML schema. This allows the needs of any particular extraction to be met through the schema definition. The database structure need not be altered.

Keywords


Table of Contents

1. Introduction
2. Background
3. Problem and Solution
3.1. Description of the problem
4. Extracting data from the database
4.1. Determining the structure of the extract
4.2. Output generation based on the DOM
4.2.1. Constructing a node query
5. XR2
5.1. Specification of DTD
6. Conclusion
Bibliography
Biography

1. Introduction

Relational databases are a mature technology that dominates the implementation of database systems. XML is a new technology that is attracting a great deal of interest and there are indications that it may dominate some areas of information system development, particularly amongst distributed systems. The role of XML within information systems and legacy information systems is still to be precisely determined however.

This paper describes a generalised mapping between relational databases and XML documents that is based on the tree structure of the Document Object Model (DOM). It also describes XR2 (XML to Relational Translation).

XR2 automates the major functions involved in the solution proposed, which achieves a mapping from a relational database to an XML format. These are:

2. Background

The issue of developing a data store for XML documents is an active area of investigation within the XML development community. Some of these efforts are concerned with determining the role of a relational database within a storage architecture [WI01]. A related consideration in the investigation is the use of XML as a technology for data transmission or for data storage [CL01]. This has also been stated as the document-centric and data-centric views of XML. The procedure described in this paper is of principal interest when XML is viewed as a technology for data transmission.

When viewing XML as a technology for data storage, the design of systems is from the document-centric point of view. The result of this is to reduce the role of relational database technology and to preserve the tree structured hierarchy that is central to XML. This would have certain advantages where that tree structure is an appropriate structure to use as the basis of program procedures. It will have disadvantages where that tree structure is not appropriate. The aim of research at this stage is to identify in what cases it is appropriate.

Each of the major database vendors has developed means of extracting XML formatted data from databases. While the solution we describe is closest to that offered by Microsofts SQL server, our solution has three additional advantages. The first is that the solution is generic, thus it applies to any SQL database regardless of platform. The second is that nested elements can be specified in the mapping, through the use of a recursive extraction mechanism. The third advantage is that a flexible mapping is possible through the use of an additional schema attribute, sql:content.

The generalised approach we have described allows a database structure to be designed that is independent of XML output requirements. That is, the database structure can be developed without considering that there will be a need for retrieval of data in an XML format. This means that the application is dependent only on a domain problem analysis and so standard database design techniques can be applied. It also means that legacy database systems can produce output in an XML format without significant redesign.

3. Problem and Solution

We describe a general solution to the problem of extracting XML formatted data from a database. The solution described is to a generalised problem, that is, only natural assumptions are made on the structure of the database and on the structure of the XML documents to be produced. A natural assumption is taken here to mean that it is possible to create queries on the database that will result in the production of the XML document.

The solution we describe can be applied independently of the database platform, by conforming to standard relational database mechanisms (SQL). The current implementation of the solution operates external to the database, examining database structure and querying the database when necessary.

3.1. Description of the problem

Consider the architecture depicted in Figure 1. We are considering the extraction process depicted. The same general approach could be taken to achieve the input/update process. Extensions to XR2 to this end are the subject of active and ongoing research.

Figure 1.

click image for full size view

General architecture

We describe the problem using an example. Figure 2 depicts a simple UML class diagram. Although it is not an entity-relationship diagram, it describes a simple relational schema for a document. This can be seen as a part of a larger relational schema.

Figure 2.

click image for full size view

Example domain

This simple example can be structured as a Document Object Model (DOM) in a DTD, resulting in the tree structure of Figure 3 (note the recursive structure).

Figure 3.

click image for full size view

Example domain

Given the DTD and the relational database, we require a means of generating XML formatted output that contains the data within the database, formatted according to the schema specification.

The XML fragment given below is a sample of the kind of output we would like to be able to extract (ellipses indicate where the extract is truncated).

A relational schema has no indication of directionality. This means that (without a general solution) extracting a set of data from a relational database formatted with XML markup is only possible through the development of a programmatic procedure designed for the purpose. The procedure would specify which tables were to be queried for data and which fields of the tables were to be used to join tables. This is not a flexible solution.

An XML schema and DTD, however, describes a tree structure, i.e., XML schema and DTD specify the structure of a DOM. We use the DOM as a means of specifying how data is to be extracted from the relational database and how XML markup is to be formatted. Using this structure to create a general solution to the problem, we do not have to produce a procedure each time we wish to extract a part of the database.

To extract part of the database, we need only to define a DOM that specifies what part of the database we require. The DOM can then be used to guide the automated extraction process.

Figure 4.

click image for full size view

XML fragment extracted from a relational database

4. Extracting data from the database

The extraction of data from a database relies on a mapping between the DOM and the respective portions of the database schema. Having specified which portions of the database are required, an extraction process can be carried out. The extraction process consists of generating appropriate SQL queries and output the results of the queries in an XML format.

4.1. Determining the structure of the extract

The constructs in relational database metadata do not indicate a tree structure of any kind. Such a structure could be forced, for example through a naming convention. However, it seems more appropriate to use the tree structure specified in a DOM, either from a DTD or from an XML schema. For our example, the tree structure is given by:

(Editor's note: this figure was missing at the time of production of the proceedings. Please check the XML Europe web site, where a corrected version of the paper will be posted as soon as is possible.)

This defines the tree structure depicted in Figure 3 (above). Slightly less obviously, an XML schema can be interpreted to define the same tree structure.

The general concept mapping specified in the DTD and XML schema fragments results in a conceptual arrangement of the relational tables into a tree structure. Thus, we can think of each node in the DOM being associated with a table, and hence all tables (except the root table) as having parent tables. In other words, we specify a tree of database tables.

4.2. Output generation based on the DOM

A query is needed which extracted the desired portions of the database. The user would indicate which data is to be extracted. For our example, the query would specify which chapters are to be extracted.

Aside from the root node, links between elements are produced through the generation of a linking query. In order for this linking to be successful, the DTD should only specify links between tables that can be linked.

Figure 5 depicts an example fragment of a mapping between a DOM structure as indicated in an XML schema and the SQL statements that are necessary to extract the data specified in the schema.

Figure 5.

click image for full size view

DOM structure: SQL query mapping

The general procedure steps through the DOM and sets a query at each node. For each record retrieved in the query, the data is output as XML elements and attributes. During that output, any children of the current node are processed recursively using the same procedure.

In the example shown in Figure 3, the root query extracts all the required CHAPTER records. As this is the root of the tree, user parameters would indicate which chapters are requested, eg.

Figure 6.

click image for full size view

User specified SQL

4.2.1. Constructing a node query

Aside from the root node, all other nodes have a parent node. The query constructed at a given node has two main functions:

Determining the means of joining the table associated with the node and the table associated with its parent node. This is indicated in the mapping.

Specifying that the data to be retrieved is only that which corresponds to the current record of the current parent table.

5. XR2

We have implemented the XML extraction procedure described above in an information system that manages the content of a set of medical information. XR2 has two main functions.

The first function is the specification of a DOM in an XML schema or DTD through a user interface.

The second function is the extraction of data in XML format that conforms to the XML schema and DTD.

Figure 7.

click image for full size view

DOM structure: SQL query mapping

5.1. Specification of DTD

A DTD is constructed by selecting tables from the database and placing them in a tree structure. This is simply a matter of listing all available tables and having a mechanism available that allows the addition and deletion of nodes. When a node is added to the tree, the element it represents is constructed based on the structure of the table on which it is based XR2 also enables the editing of element and attribute information.

XR2 formats a DTD and an XML schema based on the contents of the tree as the user creates it. The screenshot of Figure 7 demonstrates this. The CHAPTER element is selected in the tree and its attributes are displayed in the bottom right pane. The XML schema that XR2 has produced (in the top right pane) shows how the element structure and the attributes definitions are represented.

Figure 8 shows the XML output that is produced when this DOM is applied to the extraction process defined above.

Figure 8.

click image for full size view

XR2 - XML output

The final partial screenshot (Figure 9) shows the SQL statement that is generated for the TOPIC element as part of the extraction process.

Figure 9.

click image for full size view

Generated SQL statement

6. Conclusion

We have presented a generalised means of extracting data from a relational database. The main features of this extraction mechanism are:

- Production of a XML formatted output is based on the database structure and specified in a schema, expressed either in an XML schema language or a DTD.

- No alterations to the database structure are required.

- No additional programming is required.

- No alterations to the database are required.

- The solution is applicable to any relational database platform.

- XML schema are a better mechanism for this process than is a DTD.

We have also presented an implementation of the extraction mechanism, XR2. The main features of XR2 are:

- Allows specification of a DTD and an XML schema without user knowledge of XML syntax or structures.

- Imports and parses both XML schema and DTD files, allowing for the ongoing modification of the database mapping.

- Ensures that the XML schema and DTD produced will allow correctly formatted output

- Automates extraction of data that will form the content of the XML output.

- Automates XML formatting of the extracted data.

The implementation of XR2 is a proof of concept. Extraction efficiency and speed have yet to be significantly tested and compared to alternative mechanisms. Logical extensions of XR2 will allow the insertion of XML data, which is basically the reverse of the output process.

Bibliography

[CL01] Clarke, K. (2001). Medlane/XMLMARC Update: From MARC to XML Databases. MLA 2001 national conference 2001.

[WI01] Williams, K., M. Brundage, et al. (2000). XML design for data. Professional XML databases. Birmingham, UK, Wrox Press: 11-45.

Biography

Bryn Lewis is an academic researcher with interests in health informatics and artificial intelligence. His current research topics include data modelling of the health field, decision support and knolwedge representation.