XML Europe 2003

Restoring the Primacy of PCDATA

Abstract

Current markup languages and processing tools assume, and even impose, a hierarchical, tree-based approach to the data encoded in documents. This paper explores what is gained by processing documents marked up in XML syntax as definitions of sets and of the relations between them. This change of understanding has implications for the relationship between data and metadata, yet it requires neither a new syntax nor a new set of processing tools.

The use of milestones, or empty elements, in XML documents has traditionally been advocated as a solution to the problem of multiple and potentially overlapping structures. One of the difficulties of processing milestones with tree-based tools is that it requires the extraction of a node (an element with child PCDATA) from a flat representation in which the milestones are siblings of the PCDATA. This procedure is made more difficult by the presence of intervening elements and structures. A set-based understanding of markup syntax treats all elements as milestones, thereby 'flattening' the document and raising the PCDATA to the primary level. The virtual milestones function to mark the boundaries of a set.

The following is an example of this approach:

<p> This <u>is</u> <i>italic <b> bold </i> bold </b> text </p>

The advocated set understanding would view this example as:

<pStart/> This <uStart/> is <uEnd/> <iStart/> italic <bStart/> bold <iEnd/> bold <bEnd/> text <pEnd/>

with the following set enumerations and relations:

p = {This, is, italic, bold, bold, text}

u = {is}

i = {italic, bold}

b = {bold, bold}

u ⊆ p

i ⊆ p

b ⊆ p

i ∩ b = {bold}

u ∩ i = ∅

This example illustrates: (1) the primacy of the data (the members of p are all PCDATA; p has no child elements) and (2) the ability to process overlapping structures (the start and end points of the italic and bold spans do not coincide).
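The flattening and the set enumerations above can be reproduced mechanically. The following Python sketch is an illustration only, not the authors' implementation: the function names `to_milestones` and `sets_from_milestones` are hypothetical, the regular expressions assume attribute-free tags as in the example, and each word is paired with its position in the text (an assumption of this sketch) so that the two occurrences of "bold" remain distinct members.

```python
import re

SOURCE = "<p> This <u>is</u> <i>italic <b> bold </i> bold </b> text </p>"

# Matches attribute-free start and end tags only (an assumption of this sketch).
TAG = re.compile(r"<(/?)(\w+)>")
MILESTONE = re.compile(r"<(\w+?)(Start|End)/>")


def to_milestones(markup):
    """Rewrite every start tag and end tag as an empty milestone element."""
    def rewrite(match):
        suffix = "End" if match.group(1) else "Start"
        return "<" + match.group(2) + suffix + "/>"
    return TAG.sub(rewrite, markup)


def sets_from_milestones(flat):
    """Scan the flat stream once; each word joins every set that is open."""
    sets = {}
    open_names = []
    position = 0
    # The capturing group keeps the milestones as separate pieces.
    for piece in re.split(r"(<\w+/>)", flat):
        milestone = MILESTONE.fullmatch(piece)
        if milestone:
            name, kind = milestone.groups()
            if kind == "Start":
                sets.setdefault(name, [])
                open_names.append(name)
            else:
                open_names.remove(name)
        else:
            for word in piece.split():
                for name in open_names:
                    sets[name].append((position, word))
                position += 1
    return sets


flat = to_milestones(SOURCE)
sets = sets_from_milestones(flat)
p, u, i, b = (set(sets[name]) for name in "puib")

print(u <= p, i <= p, b <= p)  # True True True
print(i & b)                   # {(3, 'bold')}
print(u & i)                   # set()
```

No stack of open elements has to be unwound in a fixed order here: the end of i arrives while b is still open, and the scan simply stops adding words to i.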

Interpretation of markup becomes a choice made at processing time, and the user is not bound by choices made by the encoder of the document. Further, by adding new milestones or out-of-line pointers to an existing document, it is possible to add new relationships and structures.
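One way to picture out-of-line pointers is as ranges over word positions that are stored apart from the text. The Python sketch below assumes that representation; the `structures` table, the helper names and the added `emphasis` structure are all hypothetical and serve only to show a new set being declared without touching the PCDATA or the existing sets.

```python
# The PCDATA of the running example, one word per position.
tokens = ["This", "is", "italic", "bold", "bold", "text"]

# Hypothetical out-of-line pointers: name -> (start, end), end exclusive.
structures = {
    "p": (0, 6),
    "u": (1, 2),
    "i": (2, 4),
    "b": (3, 5),
}


def members(name):
    """Return the set of word positions that a structure points at."""
    start, end = structures[name]
    return set(range(start, end))


def text_of(name):
    """Return the words of a structure in document order."""
    return [tokens[k] for k in sorted(members(name))]


# A later reader declares a new structure; the text and the
# existing pointers are left exactly as they were.
structures["emphasis"] = (2, 5)

print(text_of("emphasis"))
print(members("i") | members("b") == members("emphasis"))
```

Because the pointers live outside the text, two encoders could maintain different `structures` tables over the same `tokens` and compare them with ordinary set operations.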

The demonstrations for this proposal will include the use of an XSLT stylesheet to produce a traditional tree view of sample documents and the denotation of sets over the same documents using SVG. Additionally, the use of set processing will be demonstrated for querying multiple structures within a single document.

One of the more important implications of this approach is that PCDATA can be 'moved' without first finding the child elements, selecting the content of each child, and then gathering all such strings together. By declaring a set relationship between elements, the PCDATA simply becomes a member of the larger set. Extraction of relevant data becomes easier with this approach and is less likely to require specialized assistance.
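As a rough illustration of this kind of extraction with an existing tree-based tool, the Python sketch below parses the milestone form of the running example with the standard library's `xml.etree.ElementTree`. The wrapping `doc` element is an assumption added here because an XML document needs a single root element, and the function name `pcdata_of` is hypothetical. Since every milestone is a sibling, the PCDATA of any set is gathered in one pass over the children, with no descent into child elements.

```python
import xml.etree.ElementTree as ET

# Milestone form of the running example, wrapped in an assumed root element.
FLAT = (
    "<doc><pStart/> This <uStart/> is <uEnd/> <iStart/> italic <bStart/> "
    "bold <iEnd/> bold <bEnd/> text <pEnd/></doc>"
)


def pcdata_of(root, name):
    """Collect the words between <nameStart/> and <nameEnd/> in one pass."""
    words = []
    inside = False
    for milestone in root:
        if milestone.tag == name + "Start":
            inside = True
        elif milestone.tag == name + "End":
            inside = False
        # In ElementTree the text following an element is its 'tail'.
        if inside and milestone.tail:
            words.extend(milestone.tail.split())
    return words


root = ET.fromstring(FLAT)
print(pcdata_of(root, "i"))
print(pcdata_of(root, "b"))
print(pcdata_of(root, "p"))
```

The same function serves every structure in the document, including the overlapping i and b, because other milestones encountered along the way are simply ignored.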


The full paper was not available at the time the proceedings were created. Please check the conference web site, http://www.xmleurope.com, to find an updated version of this paper.

Biography

Patrick Durusau is the Director of Research and Development for the Society of Biblical Literature (SBL). His primary research interests include the use of XML for the encoding and analysis of biblical and Ancient Near Eastern texts.

Matthew Brook O'Donnell is Director of Research and Development for OpenText.org, an initiative to develop XML-based tools and resources for linguistic analysis. His research interests include corpus linguistics, text encoding and the linguistic analysis of ancient Greek.