There have been a number of formal models proposed for XML. Almost all of these models focus on treating XML as semistructured data, typically as edge-labeled graphs, or as a tree of nodes. While this approach is fine for more traditional database applications, or situations where structure is of paramount importance, it fails to deal with the issues found in the use of XML for text. In particular, unless special functions are introduced, queries that involve text spanning node boundaries are not closed over a set of nodes, and likewise, the returned sets of nodes are not necessarily well-formed XML documents. This paper presents an algebra for data that, when applied to XML, is not only closed over the entire set of operations permissible in more traditional XML query languages, but over the operations that involve text and XML fragments that are not in themselves well-formed.
| XML Source | PDF (for print) | Author Package | Typeset PDF |
There are many situations where a sequence1 of data is given, and some structure is to be inferred such that the data might be further manipulated. Examples of this include traditional flat file data formats, DNA Sequences, and plain textual data. As an example, take the following piece of text.
This is a piece of italic text.
This is a sequence of characters, but there is some internal structure in the italicized portion that could possibly be manipulated. In languages such as XML, the structure of the data is made explicit via embedded markup, as shown below.
<doc>This is a piece of <i>italic</i> text.</doc>
This is a convenient way to make the implicit structure of the data explicit, and hence simplifies many forms of processing. Such markup typically needs to be nested with no overlap, though it is acknowledged that real text is often not so conveniently structured, as evidenced by papers dealing[durandMarkup][sperbergDesign1] with the issues of using SGML or XML to markup texts in the humanities.
Given the typical nesting structure, the markup lends itself to interpretation as a tree, so it is hardly surprising to see the tools and APIs supporting embedded markup to model it as such[w3cdom][w3cxsl], though there are some exceptions[adler]. One of the questions raised by using embedded markup is whether the markup is part of the text, or whether it is simply metadata about the text. In other words, is the tree the real document, or is the sequence of characters all there is? For the XML-savvy, the answer is generally that the tree is the canonical form because it represents the inherent structure of the document. An alternate, and possibly older view of documents, called parallel markup [nelsonLit] [nelsonEmbed], asserts that the markup is not part of the text, but rather an interpretation imposed over a range of characters.
This paper defines an algebra called Core Range Algebra that provides a formalism for the use of ranges as a means of dealing with the structure in sequences of data. The motivation behind Core Range Algebra is to provide a basis for a complete formalism bridging the worlds of structured data, such as XML, and semistructured data such as memos.
This paper introduces Core Range Algebra, an algebra of ranges. Core Range Algebra alone is not powerful enough to describe the markup structures possible with XML, or indeed of any nested embedded markup structure. That said, it is capable of modeling overlapping ranges of data, which XML and most typical markup languages disallow.
Core Range Algebra is defined as an algebra over a sequence(S) of items(I) from a finite set forming an alphabet (A) and is closed over all operations on A. At the heart of CRA are the abstract data types sequence, and range described below. In the descriptions, a period is used to denote access to an attribute of a given object, so foo.bar means “the bar attribute of foo”.
The sequence data type is one of the fundamental abstract data types in computer science: it is a completely ordered finite set of items. We use the following definitions:
| itemI |
A member of the alphabet(A), or |
| alphabetA |
A finite set of items(I). |
| sequenceS |
An ordered list of items(I) from an
alphabet(A) such that |
| lengthL |
The number of items held by the sequence. |
| positionP |
An index into the sequence which identifies an item. The position will be in 0..n-1 where n is the length of the sequence. |
Empty: sequence
IsEmpty: sequence → boolean
Contains: sequence
item → boolean
Equals: sequence
sequence → boolean
Diff: sequence
sequence → sequence
Union: sequence
sequence → sequence
Intersect: sequence
sequence → sequence
Concat: sequence
sequence → sequence
Insert: sequence
position
item → sequence
Delete: sequence
position → sequence
ItemAt: sequence
position → item or null
Last: sequence → item or null
First: sequence → item or null
| IsEmpty(S) |
Tests whether S contains any items, or S.length > 0. |
| Contains(S,I) |
Tests whether S contains an item equal to I. |
| Equals(S,T) |
Tests whether S and T are each of the same length, and that each item at a given position in S is equal to the item at the same position in T. |
| Diff(S,T) |
Returns a sequence containing the items in S that are not contained in T. |
| Union(S,S) |
Returns a sequence containing the union of S and T. |
| Intersect(S,T) |
Returns a new sequence holding the intersection of S and T. |
| Concat(S,T) |
Concatenate S and T and return a new sequence comprised of the members of S and T in the order (S1...Sn,T1...Tn). |
| Insert(S,P,I) |
Insert I into S at the position P and return a new sequence. If P is out of bounds no change occurs (this could also be raised as an error condition). This operation could be defined in terms of insertion (copying) into an empty sequence. |
| Delete(S,P) |
Delete the item at position P from the sequence S thereby creating a new sequence. If P is out of bounds no change occurs (this could also be raised as an error condition). Note that this could also be defined in terms of insertion (copying) into an empty sequence. |
| ItemAt(S,P) |
Return the item at position P within the sequence S, or null if P is out of bounds (this could also be raised as an error condition). |
| Last(S) |
Return the last item from S, or if S is empty, null. |
| First(S) |
Return the first item from S, or if S is empty, null. |
Perhaps the most important operator over sequences is the concatenation operator. Given 2 sequences, a third sequence is created by joining the 2 sequences together. There is a monoid relationship in this operator given that the empty sequence represents the unity item, and it also provides us with the means to derive the Kleen closure of a subset, which is critical to regular expressions.
The range data type is defined in terms of a sequence S. A range is essentially a marker for a sub-sequence of the sequence S, and consists of a position within S, and the length of the sub-sequence. We use the following definitions:
| range |
A {start, length} pair, where start is a position and length is the number of items included in the range.These can be accessed using the dot operator. For example, if we have a range R, we would write R.start or R.length. |
StartsBefore: sequence sequence → boolean
StartsAfter: sequence sequence → boolean
StartsWithin: sequence sequence → boolean
EndsBefore: sequence sequence → boolean
EndsAfter: sequence sequence → boolean
EndsWithin: sequence sequence → boolean
Within: sequence sequence → boolean
Same: sequence sequence → boolean
Create2
: position
length → range
Flatten: range
sequence → sequence
EndOf: range → position
Move: range
position → range
Resize: range
length → range
The following predicates are defined for ranges, where each takes two ranges, a and b. Note that all predicates are defined in terms of relationships between the start and length of ranges, hence the operations should be very efficient.
| StartsBefore(a, b) |
Tests whether a starts before b, or |
| StartsAfter(a, b) |
Tests whether a starts after b, or |
| StartsWithin(a, b) |
Tests whether a starts within b, or |
| EndsBefore(a, b) |
Tests whether a ends before b, or |
| EndsAfter(a, b) |
Tests whether a ends after b, or |
| EndsWithin(a, b) |
Tests whether a ends within b, or |
| Within(a, b) |
Tests whether a contains b, or |
| Same(a, b) |
Tests whether the start and length of
a are the same as the start and length
of b, or |
| Create(P,L) |
Given a position P and a length L, create a new range. Here both P and L are positive integers. |
| Flatten(R,S) |
Given a sequence S, and a range R, create a new sequence holding the sub-sequence of S identified by the position and length of R. This is essentially the same as copying all the items in the sub-sequence into a newly created sequence using Insert. |
| EndOf(R) |
Calculate the last position identified by
the range
R, or |
| Move(R, n) |
Move the range R forward of backward by n,
or |
| Resize(R, n) |
Make the range R larger or smaller by n items,
or |
In addition to the fundamental data types sequence and range, Core Range Algebra makes use of sequences of ranges. Functions defined in terms of sequences of ranges can be used to infer structure based on the patterns of containment and other such relationships between ranges.
StartOrder: sequence → sequence
EndOrder: sequence → sequence
Unique: sequence → sequence
Ancestor: range
sequence → sequence
Descendant: range
sequence → sequence
Parent: range
sequence → range
Child: range
sequence → range
| StartOrder(S) |
Given a sequence of ranges S, return a new sequence such that the ranges are ordered in increasing order based on the start of the ranges. If more than one range has the same start, these ranges are sorted by length.
|
| EndOrder(S) |
Given a sequence of ranges S, create a new sequence such that the ranges are ordered in increasing order based on the EndOf function on the ranges. If more than one range has the same EndOf value, these ranges are sorted in increasing order of length.
|
| Unique(S) |
Given a sequence of ranges S, create a new sequence such that no two ranges have the same start and length. |
| Ancestor(R, S) |
Given a sequence of ranges S, and a range R, return all ranges in S that are ancestors of R, or in other words, all ranges that completely contain R.
|
| Descendant(R, S) |
Given a sequence of ranges S, and a range R, return all ranges in S that are descendants of R, or in other words, all ranges that are completely contained by R.
|
| Parent(R, S) |
Given a sequence of ranges S, and a range R, determine which, if any, ranges in S directly contain R such that their start is closest to R.start.
|
| Child(R, S) |
Given a sequence of ranges S, and a range R, determine which, if any, ranges in S are contained by R, and by no other range whose start is greater than the start of R. In other words, determine which, if any, ranges are contained by R, and whose parent is R.
Note that the list of children can be ordered by applying order to the result of this function. |
The following grammar defines the expression language or Core Range Algebra; a very typical functional language. Expressions are evaluated in a dynamic environment with lexical scoping.
| name | Name | |
| function name | Function | |
| variable | Var | |
| constant | Const | |
| comparison operator | Op eq ::= = | != | < | > | <= | >= | |
| arithmetic operator | Op bin ::= + | - | * | / | % | |
| boolean operator | Op bool ::= && | || | |
| binary operator | BinaryOp ::= Op eq | Op bin | Op bool | |
| unary operator | UnaryOp ::= + | - | not | |
| expression | Expr ::= Const | constant expression |
| | Expr BinaryOp Expr | binary expression | |
| | UnaryOp Expr | unary expression | |
| | if Exp then Exp else Exp | conditional expression | |
| | let Var = Exp do Exp | local binding | |
| | Function(Expr; ... Expr) | function application | |
| | for Var in Expr do Expr | iteration expression | |
| | error | error expression |
In addition to the core expression languages, the functions defined above for the sequence and range data types can be used in expressions, as can the dot notion.
Core Range Algebra, as defined in the paper, is a simple algebra that operates over sequences, ranges, and sequences of ranges. The intent of Core Range Algebra is to provide a basis for manipulating sequences of data, especially text, as a sequence of items, and to provide means for layering higher-level interpretations over that substrate such that operations at the higher level can be defined in terms of the underlying model.
What is woefully missing in the existing specification is a means for constructing sequences of ranges from an underlying sequence. Given the Kleen closure, we have a formal basis for creating ranges based on regular expressions. This will not only provide a powerful means for creating ranges over unstructured texts, but also gives a means for inferring structurally recursive data structures, or to infer higher level structures over previously inferred structures.
One obvious further extension is to expand the Core Range Algebra to fully cover XML. An approach much like REX or Regular Fragmentation can be used to build the structures of XML, from the underlying sequences, and with additional functions for range construction such that elements, attributes, etc., can be constructed, a full query language can be defined over the Core Range Algebra defined above.
There is a fairly large body of work available to support aspects of Core Range Algebra. In terms of data model Alignment Calculus[grahneSafety], a declarative string database query language is in many ways similar, as is SRS[coupayeSRS] a very widely used system in the gene research field. Of course, the basic idea of parallel markup came from Ted Nelson [nelsonLit].
For querying and algebra for semistructured data or XML, there is an large and growing amount of literature. In Buneman[unQL] a complete query language and algebra based on structural recursion is presented. In Fernandez[xmlMonad] a proposal for a formal algebra for XML Query is presented which might become the standard for node-centric processing. There are many other papers related to retrieval of semi-structured data, including SAL[beeriSAL], XIRQL[fuhrXIRQL] and, of course, XQuery[wc3xquery] itself.
For the notion of parsing XML using regular expressions, regular fragmentations from Simon St. Laurent[regularFrag] and REX[cameron98] are especially applicable. Core Range Algebra provides a formal basis for both approaches.
|
Here “sequence” is not limited to textual data, but is closer to the mathematical sense. |
|
|
We will expand upon the notion of range creation later in this paper. |
[adler] Adler, Sharon C, DSSSL- Document Style Semantics and Specification Language, 1989
[beeriSAL] Beeri, Catriel, and Yariv, Tzaban, SAL, An Algebra for Seminstructured Data and XML
[cameron98] Cameron, Robert D., REX: XML Shallow Parsing Using Regular Expressions, 1998
[coupayeSRS] Coupaye, Thierry, Extracting and Converting Data from Semistructured Biological Databanks with SRS, 1996
[durandMarkup] Durand, David G., DeRose, Steven J., and Mylonas, Elli, 1996
[fuhrXIRQL] Fuhr, Norbert, and Grobjohann, Kai, XIRQL: A query language for information retrieval in XML documents
[grahneSafety] Grahne, Gosta and Nykanen, Matti, 1997
[nelsonEmbed] Nelson, Theodor Holm, Embedded Markup Considered Harmful, Fall 1997
[nelsonLit] Ted Nelson, Literary Machines, 1998
[regularFrag] St. Laurent, Simon, Regular Fragmentations, 2001
[sperbergDesign1] Sperberg-McQueen, C. M., and Burnard, Lou, The Design of the TEI Encoding Scheme, 1995
[unQL] Buneman, Peter, Fernandez, Mary, and Suciu, Dan, A Query Language and Algebra for Semistructured Data Based on Structural Recursion
[w3cdom] W3C DOM Working Group, The Document Object Model (DOM) Level 1, 1998
[w3cxsl] W3C XSL Working Group, Extensible Stylesheet Language
[wc3xquery] W3C XQuery Working Group, XQuery 1.0: An XML Query Language
[xmlMonad] Fernandez, Mary, Simeon, Jerome, and Wadler, Philip, Query Language for Information Retreival in XML Documents