XML 2001

Regular Fragmentations: Treating Complex Textual Content as Markup

Simon St.Laurent

1. Processing Atoms and Molecules

XML provides developers with a set of syntactic tools for structuring and labeling information. There is nothing in XML 1.0 that specifies how fine or coarse a structure developers must create, and best practices for XML often recommend choosing a level of granularity appropriate to the task at hand. While that usually gets developers through the first set of problems they encounter, repurposing XML documents can be quite difficult if the structures initially chosen to solve one set of problems are too large (or perhaps too small) to address the needs of another set of applications.

The W3C's approach to these issues has been to standardize a set of datatypes which have known granularities. Known, that is, if developers choose to work within the ponderous superstructure of W3C XML Schema. W3C XML Schema provides some mechanisms for controlling the information stored in these types, as well as for creating custom types, but the descriptive approach of W3C XML Schema has some problems. Apart from the processing overhead and learning curve involved in W3C XML Schema, the datatypes it provides are designed to solve a certain set of problems in terms which make most sense for a relatively small class of applications: those exchanging information between databases using a set of core types to store information. While developers can repurpose these types for other kinds of applications, doing so may require additional code for understanding these types or the use of a full-blown (and thus far nonexistent) Post-Schema Validation Infoset (PSVI) processor.

The alternative approach presented here uses a lightweight and readily understood processing model in place of the descriptive approach of the PSVI. Instead of creating a new understanding of a particular markup vocabulary, Regular Fragmentations uses existing markup and a set of regular expression-based rules to transform the document into a new document which contains additional information about its content. That document may exist only as a passing stream of SAX events or a DOM tree, or it may be re-serialized and passed to other processing systems. The result of Regular Fragmentations processing is simply more XML, reducing or eliminating the need for developers to bind their code tightly to a particular API for working with W3C XML Schema processors.

2. Regular Expressions and Textual Content

Regular expressions are a well-known and well-understood technology and have long simplified processing of textual content. XML 1.0 itself cannot be processed easily with regular expressions, because of features like default attributes, general entities, and the interpretation of XML namespaces. Regular expressions remained a tool for programmers working with XML, but received relatively little attention as a tool fundamental to processing XML. W3C XML Schema, to some extent, revived an interest in the combination of regular expressions and XML, using its own subset of regular expressions as a pattern facet on customized datatypes. Developers could use the simplest functionality provided by regular expressions - binary yes/no matching - to create additional limitations on conformant content.

Binary matching suits the needs of a descriptive system designed to return validation information, but is only a tiny fragment of the potential of regular expressions. Regular expressions are more typically used as part of a processing pipeline, not merely for pattern matching. While W3C XML Schema reintroduces regular expressions to XML, and does so in the much more appropriate context of post-parse text processing, it doesn't consider applications which might conflict with its own understanding of type identification and validation.

Once an XML document has been parsed, the barriers which blocked the use of regular expressions are removed. XML parsers have already done the hard work of supplying referenced information and normalizing textual content. The regular expressions are now free to operate on just the character content of the elements and attributes, not the markup which defines those structures. In concert with XML tools which understand the supporting structures, regular expressions can be used to enrich the post-parse content of XML documents by using textual cues to create new markup structures.
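As a small illustration of this point, the sketch below uses Python's standard library (chosen here only for illustration; it is not the implementation described in this paper). The sample element and the pattern are assumptions made up for the example. A character reference hides the dash from a pattern applied to the raw markup, while the same pattern succeeds once the parser has resolved the reference.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: the dash is written as a character reference.
RAW = "<date>1975&#45;11</date>"

# Illustrative pattern: four digits, a dash, two digits.
PATTERN = re.compile(r"(\d{4})-(\d{2})")


def match_raw(raw):
    """Apply the pattern directly to the unparsed markup."""
    return PATTERN.search(raw)


def match_parsed(raw):
    """Parse first, then apply the pattern to the element's character content."""
    text = ET.fromstring(raw).text
    return PATTERN.fullmatch(text)


raw_result = match_raw(RAW)        # no match: the raw text holds "&#45;", not "-"
parsed_result = match_parsed(RAW)  # match: the parser has resolved the reference
```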

Many markup structures contain additional structures which are not explicitly described by the markup. To some extent, the finest-grained markup would refer to each character individually, or perhaps even to parts of compound characters, but this level of markup is rarely used. (Some projects do require it.) Similarly, words or sentences are indicated with lexical conventions like whitespace and punctuation, though this level of markup is also rare. On the data side, however, there are strong lexical conventions used by W3C XML Schema itself which are ripe for textual processing. For example:

1975-11

is a gYearMonth which contains a year (itself fragmentable into century and year identifiers), a dash which is purely for legibility to both humans and computers, and a number indicating a month. (From a computer perspective, the dash is necessary because both the year and the month may have a variable number of digits, so the dash avoids ambiguity.) W3C XML Schema defines more complicated basic types, notably ISO 8601 dates and times:

1999-05-31T13:20:00+05:00
1999-05-31T13:20:00Z

Like the simpler gYearMonth example, these constructs use lexical markers to identify their parts, though not XML-style angle brackets. Dashes, pluses, Ts, and colons provide separation, while Z indicates Zulu (Universal) Time. This lexical notation is a prime target for regular expressions.
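To make the idea concrete, here is a minimal sketch in Python (used only for illustration) of a pattern that picks the two values above apart. The pattern and the group names are assumptions of this sketch, and the pattern deliberately covers only the forms shown here, not everything the datatype permits.

```python
import re

# Illustrative pattern only: it handles the two sample values above and does
# not attempt negative years, fractional seconds, or other permitted forms.
DATETIME = re.compile(
    r"(?P<year>\d{4,})-(?P<month>\d{2})-(?P<day>\d{2})"
    r"T(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2})"
    r"(?P<zone>Z|[+-]\d{2}:\d{2})?"
)


def parts(value):
    """Return the named parts of a dateTime string, or None if it does not match."""
    match = DATETIME.fullmatch(value)
    return match.groupdict() if match else None


offset_parts = parts("1999-05-31T13:20:00+05:00")
zulu_parts = parts("1999-05-31T13:20:00Z")
```

Each lexical marker (dash, T, colon, plus, Z) appears literally in the pattern, and the material between markers becomes a named part.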

3. Converting Lexical Notation to Markup

Regular Fragmentations takes a rule-based approach to converting textual content to a combination of explicit markup and content. Rules include a list of which elements or attributes they apply to, a regular expression, the way in which that expression is to be applied, and a list of results. The results reflect the parts generated by the regular expression and specify how they should be represented.

Regular Fragmentations currently supports two approaches to regular expression processing. The first is a simple pattern match, where parentheses indicate results. For example, the regular expression:

(\d{1,5})(\d{2})-(\d{2})-(\d{1,2})

would produce four results, containing sets of digits. Applied to a string, like:

1989-12-4

the four results would be 19, 89, 12, and 4.
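The split between 19 and 89 comes from backtracking: the first group may take up to five digits, but it has to give digits back until the second group can claim exactly two before the dash. A short Python sketch (Python's re module stands in for whatever regular expression engine a processor uses) shows the same result:

```python
import re

# The pattern from the text above, unchanged.
DATE = re.compile(r"(\d{1,5})(\d{2})-(\d{2})-(\d{1,2})")


def fragment(value):
    """Return the parenthesized results for a full match, or None."""
    match = DATE.fullmatch(value)
    return list(match.groups()) if match else None


results = fragment("1989-12-4")
print(results)  # ['19', '89', '12', '4']
```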

Some situations are easier to address using split patterns. Rather than providing a pattern which describes the entire content of an element or attribute, these patterns identify delimiters. Those delimiters are then used to break down the content into a series of result elements. For example, the delimiter:

-

applied to a string like:

2100-b4a50-999F

would produce the results 2100, b4a50, and 999F. While it is typically more difficult to guess what the results of a split will produce, Regular Fragmentations allows developers to capture the results of that split in a series of elements or attributes. This makes marking sentences or words much simpler, and also permits the conversion of CSV content stored inside XML documents to a purer XML form.
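A split of this kind can be sketched in a few lines of Python (again only an illustration, not the paper's implementation). The function name and the comma-separated sample are assumptions added for the example.

```python
import re


def split_fragments(value, delimiter="-"):
    """Break a string into parts wherever the delimiter pattern matches."""
    return re.split(delimiter, value)


parts = split_fragments("2100-b4a50-999F")
print(parts)  # ['2100', 'b4a50', '999F']

# A hypothetical CSV-like value: the delimiter is a comma with optional whitespace.
colors = split_fragments("red, green,blue", r"\s*,\s*")
print(colors)  # ['red', 'green', 'blue']
```

Because the delimiter is itself a regular expression, one rule can absorb variations such as optional whitespace around a comma.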

Regular Fragmentations does permit recursive processing, allowing developers to create further fragments from fragments. Also, I'm considering adding support for pattern sequences, creating something like a lex processor, but haven't had time to explore that possibility fully.
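The idea of fragmenting fragments can be sketched as two passes, shown here in Python for illustration only. The patterns, the part names, and the nested dictionary are assumptions of this sketch rather than features of the processor: a first pass splits a dateTime at the T, and a second pass applies a separate pattern to each piece.

```python
import re

# Second-pass patterns (illustrative; they cover only simple dateTime values).
DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2})(Z|[+-]\d{2}:\d{2})?")


def fragment_datetime(value):
    """Split at 'T', then fragment the date and time pieces separately."""
    pieces = re.split("T", value, maxsplit=1)   # first pass: two coarse fragments
    if len(pieces) != 2:
        return None
    date = DATE.fullmatch(pieces[0])            # second pass on the first fragment
    time = TIME.fullmatch(pieces[1])            # second pass on the second fragment
    if date is None or time is None:
        return None
    return {
        "date": dict(zip(("year", "month", "day"), date.groups())),
        "time": dict(zip(("hour", "minute", "second", "zone"), time.groups())),
    }


nested = fragment_datetime("1999-05-31T13:20:00Z")
```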

4. Rules and Results

The following examples show a little bit of what Regular Fragmentations can do. The current implementation is a SAX filter, which reads information from a SAX-based XML parser, and modifies those events to produce a fragmented stream. For these examples, those events are written back out to XML documents, making it easy to compare the original and the results. For complete details on creating rules, see http://simonstl.com/projects/fragment/docs/overview-summary.html.
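The filter arrangement can be sketched with Python's xml.sax module. This is not the implementation described here; the class name, constructor arguments, and sample document are all hypothetical. The sketch also simplifies heavily: it matches on raw element names without namespace processing, and it assumes target elements contain only text.

```python
import io
import re
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl


class MatchFilter(XMLFilterBase):
    """Replaces the text of target elements with one child element per group."""

    def __init__(self, parent, targets, pattern, names):
        super().__init__(parent)
        self.targets = targets            # element names to fragment
        self.pattern = re.compile(pattern)
        self.names = names                # child element name for each group
        self.buffer = None                # collects text while inside a target

    def startElement(self, name, attrs):
        super().startElement(name, attrs)
        if name in self.targets:
            self.buffer = []              # assumption: targets hold text only

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(content)   # hold text back until the end tag
        else:
            super().characters(content)

    def endElement(self, name):
        if self.buffer is not None:
            text = "".join(self.buffer)
            self.buffer = None
            match = self.pattern.fullmatch(text.strip())
            if match:
                # Emit one new element per parenthesized result.
                for child, value in zip(self.names, match.groups()):
                    super().startElement(child, AttributesImpl({}))
                    super().characters(value)
                    super().endElement(child)
            else:
                super().characters(text)  # no match: pass the text through
        super().endElement(name)


def fragment_document(xml_text):
    """Run the filter over a document and return the re-serialized result."""
    out = io.StringIO()
    flt = MatchFilter(xml.sax.make_parser(), {"gYearMonth"},
                      r"(\d{1,5})(\d{2})-(\d{2})", ["century", "year", "month"])
    flt.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    flt.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()


result = fragment_document("<doc><gYearMonth>1970-11</gYearMonth></doc>")
```

Downstream handlers see only ordinary SAX events, so the generator at the end of the chain could be swapped for any other content handler.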

The first example will fragment a gYearMonth as specified by W3C XML Schema Datatypes. For demonstration, the example includes two elements of type gYearMonth, in slightly different circumstances:

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello! This document contains a gYearMonth.</message>
<gYearMonth>1970-11</gYearMonth>
<myYearMonth>1970-11</myYearMonth>
</test>

The rules for this test document will apply to both the gYearMonth and myYearMonth elements, without needing any understanding that they share a common XML Schema type. The rule looks like:

<?xml version="1.0" encoding="UTF-8"?>
<fragmentRules xmlns="http://simonstl.com/ns/fragments/">
<fragmentRule pattern="(\d{1,5})(\d{2})-(\d{2})">
<applyTo>
<element nsURI="http://simonstl.com/ns/test/" localName="gYearMonth"/>
<element nsURI="http://simonstl.com/ns/test/" localName="myYearMonth"/>
</applyTo>
<produce>
<element nsURI="http://simonstl.com/ns/types/" localName="century" prefix="type" />
<element nsURI="http://simonstl.com/ns/types/" localName="year" prefix="type" />
<element nsURI="http://simonstl.com/ns/types/" localName="month" prefix="type" />
</produce>
</fragmentRule>
</fragmentRules>

The root element, fragmentRules, just indicates that this is a set of rules for the Regular Fragmentations processor. The fragmentRule element contains the instructions for fragment processing, and there can be many of them in a rules document. The pattern attribute specifies the regular expression that will be used. The applyTo element contains child elements specifying which elements in which namespaces are the targets to which the rule should be applied, while the produce element identifies containers for the result of that processing.

Applying these rules to the test document produces a modified document:

<?xml version="1.0" standalone="yes"?>
<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a gYearMonth.</message>
<gYearMonth>
<type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century>
<type:year xmlns:type="http://simonstl.com/ns/types/">70</type:year>
<type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month>
</gYearMonth>

<myYearMonth>
<type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century>
<type:year xmlns:type="http://simonstl.com/ns/types/">70</type:year>
<type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month>
</myYearMonth>
</test>

Instead of containing the lexical version of the date, the elements now contain a markup version of the date's components.
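The effect of the rule can be approximated in a few lines of Python using ElementTree. This is a tree-based illustration only, not the SAX-based processor, and the RULE dictionary is a hypothetical stand-in for the fragmentRule element shown above.

```python
import re
import xml.etree.ElementTree as ET

TEST_NS = "http://simonstl.com/ns/test/"
TYPES_NS = "http://simonstl.com/ns/types/"

# Hypothetical plain-dict version of the fragmentRule shown above.
RULE = {
    "pattern": r"(\d{1,5})(\d{2})-(\d{2})",
    "applyTo": [(TEST_NS, "gYearMonth"), (TEST_NS, "myYearMonth")],
    "produce": [(TYPES_NS, "century"), (TYPES_NS, "year"), (TYPES_NS, "month")],
}

DOC = """<test xmlns="http://simonstl.com/ns/test/">
<message>Hello! This document contains a gYearMonth.</message>
<gYearMonth>1970-11</gYearMonth>
<myYearMonth>1970-11</myYearMonth>
</test>"""


def apply_rule(root, rule):
    """Replace the text of each target element with one child per result."""
    pattern = re.compile(rule["pattern"])
    targets = {"{%s}%s" % pair for pair in rule["applyTo"]}
    for element in list(root.iter()):     # snapshot: the tree is modified below
        if element.tag not in targets or element.text is None:
            continue
        match = pattern.fullmatch(element.text.strip())
        if match is None:
            continue                      # leave non-matching content untouched
        element.text = None
        for (ns, local), value in zip(rule["produce"], match.groups()):
            child = ET.SubElement(element, "{%s}%s" % (ns, local))
            child.text = value
    return root


root = apply_rule(ET.fromstring(DOC), RULE)
year_month = root.find("{%s}gYearMonth" % TEST_NS)
values = [child.text for child in year_month]
```

The targets are selected purely by namespace and local name, which is why the sketch, like the rule, needs no knowledge that the two elements share a schema type.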

Similar work can be performed with rules for splitting content. The test document is somewhat different:

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a splittable piece.</message>
<threepart>2100-b4a50-999F</threepart>
</test>

The rules for splitting have a slightly different set of components from the rules for matching:

<?xml version="1.0" encoding="UTF-8"?>
<fragmentRules xmlns="http://simonstl.com/ns/fragments/">

<fragmentRule pattern="(-)" type="split" skipFirst="false" repeat="false">
<applyTo>
<element nsURI="http://simonstl.com/ns/test/" localName="threepart"/>
</applyTo>
<produce>
<element nsURI="http://simonstl.com/ns/types/" localName="one" prefix="type" />
<element nsURI="http://simonstl.com/ns/types/" localName="two" prefix="type" />
<element nsURI="http://simonstl.com/ns/types/" localName="three" prefix="type" />
</produce>
</fragmentRule>

</fragmentRules>

The fragmentRule element identifies a delimiter, and specifies a type of split. The pattern attribute identifies the delimiter. The skipFirst attribute, set to false here, tells the processor to include all results returned by the regular expression: matching expressions return the entire value of the matched string as the first value, but splitting operations do not, so there is nothing to skip. A repeat attribute, which takes the values true or false, provides an option for cases where the number of results is unknown.
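The difference between the two kinds of results can be seen in most regular expression libraries. The Python sketch below is only an analogy, since the library the processor relies on is not named here, and the matching pattern is an assumption made for the comparison.

```python
import re

value = "2100-b4a50-999F"

# Matching: result 0 is the whole matched string, followed by the captured parts.
match = re.fullmatch(r"(\w+)-(\w+)-(\w+)", value)
match_results = [match.group(i) for i in range(match.re.groups + 1)]

# Splitting: only the parts come back, so no first result needs to be skipped.
split_results = re.split("-", value)

print(match_results)  # ['2100-b4a50-999F', '2100', 'b4a50', '999F']
print(split_results)  # ['2100', 'b4a50', '999F']
```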

When applied to the target document, the Regular Fragmentations processor produces:

<?xml version="1.0" standalone="yes"?>

<test xmlns="http://simonstl.com/ns/test/">

<message>Hello!  This document contains a splittable piece.</message>

<threepart>
<type:one xmlns:type="http://simonstl.com/ns/types/">2100</type:one>
<type:two xmlns:type="http://simonstl.com/ns/types/">b4a50</type:two>
<type:three xmlns:type="http://simonstl.com/ns/types/">999F</type:three>
</threepart>
</test>

Regular Fragmentations also permits attribute processing, including the creation of attributes from element content and vice versa. Additional options permit you to skip content that should be deleted, and to add text before and after element content should you need to add labels.

5. Implications

Regular Fragmentations is fundamentally just a handy toolkit for processing complex content, but the implications of its style of processing go well beyond convenience. Where XML and XML Schema have 'enriched' our understanding of the textual form of XML documents with type and content information from descriptive sources, Regular Fragmentations enriches documents by modifying their markup directly. Such modifications are far simpler to pass between stages in a pipeline, and require very little understanding on the part of developers or programs. The results of Regular Fragmentations processing are XML documents - no more, no less. No processor other than the Regular Fragmentations processor needs to have an understanding of the rules used in its processing, and those rules are both simple and testable.

Regular Fragmentations also opens the way to looking at non-XML markup as XML structures. Although the current implementation focuses squarely on content inside of XML documents, there's not much fundamental reason for limiting such processing to that kind of content. As work proceeds on both Regular Fragmentations and Markup Object Events (a foundation model derived from Regular Fragmentations), one goal will be the erasure of the distinction between XML markup and other forms of textual markup. XML is easy to work with and process, but there is no fundamental reason for XML documents to be the beginning or conclusion of information processing.

Biography

Simon St.Laurent
Assistant Editor
O'Reilly and Associates
U.S.A.

Simon St.Laurent is an Associate Editor at O'Reilly & Associates, focusing on XML titles. Prior to that, he wrote XML: A Primer (3rd Edition), XML: Elements of Style, Programming Web Services with XML-RPC, Building XML Applications, and Cookies. In his spare time, he writes open source projects for processing XML with Java.