Describes how the authors applied an XSLT engine (4Suite's Python XSL package) to the processing of arbitrary groves and abstract hyperdocuments managed by a generic link management system developed by DataChannel. This implementation experience demonstrates that it is both easy and useful to bind XSL processing not just to XML DOMs but to groves of any sort, as well as to more abstract business objects, in this case abstract hyperdocuments.
Discusses the grove- and hyperlinking-specific XSL and XPath extensions created, how XPath expressions were bound to groves and hyperdocuments, and the details of how the implementation was accomplished. Discusses possible future directions and potentials. Provides samples of working hyperdocuments, style sheets, and the resulting output.
Keywords: XSLT; XPath; Python; DOM; Processing
The XSL recommendation was specifically designed to enable the transformation of XML documents through the use of largely declarative “style sheets”. However, there is no reason in theory why XSL need be limited strictly to the processing of XML documents. Because XSL templates and XPath expressions operate on nodes with properties, it should be possible to apply XSL transformations to any data objects that can be interpreted as nodes with properties. In particular, it should be possible to apply XSL processing to arbitrary groves as defined in ISO/IEC 10744:1997, HyTime, and ISO/IEC 10179, DSSSL. Given an XSL implementation that is not too tightly bound to an underlying DOM implementation, it should be relatively easy to rebind the XSL processor to a grove implementation or other object models. Our experience is that it is in fact easy.
DataChannel is currently developing a grove-based hyperlink management system, code named Bonnell. One of the requirements for this system is to provide some form of built-in transformation and page composition technology. Given the choices available, the only standardized technologies are DSSSL and XSL. DSSSL suffers from a syntax that, while powerful, is difficult for many people to learn and use effectively. It also suffers from lack of commercial support (although there is at least one commercial DSSSL-based system). XSL has the advantage that it is easier for people to learn and use and has a wide range of support, both commercial and open source. With the release of the XSL-FO implementation from the Apache product (part of the Xerces package), there is reasonable page composition functionality at a reasonable price. Thus, it was clear that the best approach would be to integrate an XSL processor with the larger grove-based link management system.
The Bonnell system is inherently grove based. The grove mechanism can be thought of as a more generic DOM: it provides a standard for representing data of any sort as a collection of nodes and properties. Because the Bonnell system is not in any way XML-specific, we had to provide more than just DOM-based processing, thus our use of the grove specification. By using the GroveMinder system from Epremis Corp., we have ready access to an industrial-strength grove implementation that makes it practical for us to build the rest of the system. However, the use of the GroveMinder product is not a prerequisite for this approach: any grove implementation would serve and, as is shown later, the technique can be applied to objects of any type, not just groves.
We knew that it would be possible to bind XSL to groves—the DOM can be thought of as a specific kind of grove. However, we were not sure how easy it would be. We thought that it might require significant effort to rebind an XSL processing engine from a DOM-based process to a grove-based process. However, as it happened, the binding turned out to be much easier than we expected. This may partly be a side effect of our implementation language (Python) and the architecture of the XSL engine we chose (4XSLT, part of the 4Suite package from Fourthought, Inc.). In particular, we discovered that we did not need to change XPath syntax in order to access properties of arbitrary grove nodes—it was sufficient to treat properties of nodes as attributes using the “@name” syntax for selecting attributes and their values.
Because the Bonnell system is a hyperdocument management system, it is not sufficient to apply XSLT style sheets to single documents or groves. It must be possible to process documents in the context of a larger hyperdocument. In particular, it must be possible for the style sheet to style nodes based on their use as anchors of hyperlinks. This is needed to implement both transclusion (for example, to render compound documents composed of parts from many other documents) and navigational hyperlinking. Thus we had to somehow extend XSL to give access to the link-related properties of nodes. This required providing XSL extensions and XPath functions that provide access to the hyperlink information provided by the Bonnell system. This also turned out to be easier than we expected.
Once the binding of XSLT processing to groves and hyperdocuments has been achieved, adding in support for things like XSL-FO is simply a matter of applying the output of the XSLT process to the XSL-FO processor. At the time of writing we have not created any specific XSL-FO extensions for working with hyperdocuments in the FO domain.
The following subsections present two examples of using the XSL-to-grove binding: the first renders a Word document to HTML through its grove representation, the second adds hyperlinking to show how a document can be presented in the context of a complex hyperdocument. These examples are not explained in detail. The mechanisms used and their implementation are then covered in detail in the implementation sections.
The GroveMinder product comes with a very simple grove constructor for Word documents. It produces a grove in which a Word document is represented as a sequence of “Para” nodes. Each Para node has a “Text” property whose value is the text content of the paragraph. Obviously this grove does not provide a complete representation of the information content of Word documents, but it is sufficient to demonstrate grove construction from non-XML data. A production-quality Word grove would reflect most or all of the information in a Word document, including styling information, bookmarks, metadata, and so on. Making such a grove constructor is literally a simple matter of programming.
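As a minimal sketch of the shape of this grove (not the GroveMinder constructor, whose API is not shown here), the following Python models a Word document as a root node holding a sequence of Para nodes, each with a Text property. The class `Node`, the root class name `WordDocument`, the property name `Content`, and the function `build_word_grove` are assumptions made purely for illustration.

```python
class Node:
    """A hypothetical grove node: a class name plus a dictionary of properties."""

    def __init__(self, class_name, **properties):
        self.class_name = class_name
        self.properties = dict(properties)


def build_word_grove(paragraph_texts):
    """Build a root node whose 'Content' property lists one Para node per paragraph."""
    paras = [Node("Para", Text=text) for text in paragraph_texts]
    return Node("WordDocument", Content=paras)


# Empty paragraphs are kept: each one still becomes a Para node.
grove = build_word_grove(["First paragraph.", "", "Third paragraph."])
for para in grove.properties["Content"]:
    print(para.class_name, repr(para.properties["Text"]))
```

A fuller constructor would add further properties to each node (styling, bookmarks, metadata) without changing this basic shape.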
In this example, a simple style sheet translates the Word document to HTML through its grove representation. Because there is so little data in the grove, there's not much for the style sheet to do. The Word source document is shown in 1. The Word document grove is shown in 2.
The style sheet for rendering this grove is shown in 3. It has two templates: one for the root node and one for the Para nodes. The mapping of XSL to groves treats node classes as though they were element types. Thus the template match on “Para” is matching the node class “Para”. Node properties are treated as though they were element attributes. Thus, the match “@Text” in the value-of statement selects the node property named “Text”.
<?xml version="1.0"?>
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:ext="http://datachannel.com/Bonnell/Transform"
  extension-element-prefixes="ext"
  version="1.0"
>
  <xsl:template match="/">
    <html>
      <head><title>Word Document</title>
      </head>
      <body>
        <h1>Word Document</h1>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>
  <xsl:template match="Para">
    <p>
      <xsl:value-of select="@Text"/>
    </p>
  </xsl:template>
</xsl:stylesheet>
The rendered result is shown in 4. There is nothing particularly interesting about it—it is presented here to contrast with the rendered result in the next example, where hyperlinks have been brought into play.
The point of this example is not that HTML has been generated from Word but that normal XSL processing has been applied to a grove that is not an XML document. The most exciting implications of this are in the realm of hyperlinking, where the Word grove can be combined with any other grove-based data from any source using a consistent set of simple functions and extension elements. The XSL processing, as the top layer of a fairly deep system of information processing components, affords the style sheet creator tremendous leverage at a fairly low cost of entry. A full-featured XSL processor provides a wide array of useful functions for accessing and organizing structured data. Being able to apply those functions to arbitrary data from any source provides an immensely powerful system that is, as much as possible, completely standards based.
There is nothing in these examples that cannot be done in other ways. However, the things done in these examples have never been done with this degree of ease using tools with as wide a user base or as deep a support community. XSL has proved itself to be both tremendously useful and sufficiently easy to learn and use that people can become proficient with it. It reflects decades of experience in the design of transformation and formatting languages. The goal of this paper is to demonstrate both the utility of applying XSL beyond the domain of XML-based data (and without first literally transforming existing data to XML or pretending that it is XML while it's being processed) and the ease with which such applications can be built.
This example takes the previous example to the next step: hyperlinking. In this scenario, four documents are involved: the previous Word document, an Excel spreadsheet, an XML document that annotates the Word document, and an XML document that establishes extended links between nodes in the Word document and nodes in the spreadsheet.
In addition to demonstrating the raw linking functionality available, this example also demonstrates an important separation of concerns between the style sheet and the hyperdocuments and hyperdocument processor. Because a style sheet language like XSL can directly address nodes in any source document, it is possible to implement the sort of linking semantics demonstrated here entirely in the style sheet itself, including defining the link instances purely as style rules applied to documents. However, this sort of “do it all in the style sheet” approach violates the principle of separation of concerns by binding both information presentation semantics and information representation semantics in a single object. By keeping the presentation semantics separate from both the declarative definition of the links (the linking elements in the XML documents) and the data processing needed to understand those links, each part of the system becomes independent of the other, thus protecting those components from changes in other parts.
This separation of concerns also helps keep each part as simple as possible, with the greatest complexity concentrated in the system component most hidden from end users, the hyperdocument processor. Because the XSL processing is entirely in terms of a generic API for hyperdocuments, it is independent of the details of the underlying data and processors, including the use or non-use of HyTime or of a particular hyperdocument manager implementation. We feel that this separation of concerns is of critical importance, especially as systems get larger and are applied to wider scopes of information sets and maintained for longer periods of time. Because our business focus is on building large-scale systems to manage long-lived complex document sets, part of our engineering focus is always on what best protects the investment of the system user in configuring the system, of which style sheets are a key part.
(In fact, these sample style sheets could probably be made significantly simpler by adding a few more built-in functions or extension elements—we expect to do that as we gain more practical experience with this technology.)
The rendered result of the Word document in the context of the hyperdocument is shown in 5.

This is exactly the same Word document, but now it has been augmented with a variety of links, including node-to-node transclusions, annotations imposed from another document, and node-to-node navigation links, all imposed through extended links onto an otherwise unmodified Word document.
The highlighted rows are paragraphs that are linked in some way. In the first paragraph, there is an “annotation” link between the paragraph and the comment (shown in the right-hand column). The asterisk in brackets is the link to the comment itself. The “More info...” is the link to the more info (nodes in the Excel spreadsheet) and demonstrates a typical presentation style of putting links in a column next to the data they are associated with.
The second highlighted row reflects a transclusion from the Word paragraph (which was an empty paragraph) to a cell in the spreadsheet. The right-hand column reflects the anchor role of the transclusion target (“used-node”) and the node class and data content of the target node (whose data content is also reflected where the original Word paragraph was).
The style sheet for this example is shown in 6. All of the “when” checks in the “Para” template produce the appropriate presentation results for nodes that are participating in hyperlinks (the extension functions and elements in this example are explained later). If a node participates in any links (is-anchored-object()), the template does whatever is appropriate for a particular type of link or anchor. It is not sufficient to just blindly convert links to HTML A elements; each link type and anchor role has a unique semantic that, in this case, requires different presentation results.
The style sheet is also complicated by the need to present the linking details in addition to producing the normal presentation. The linking details are provided for information for the purposes of this example—it's not necessarily a typical presentation choice.
<?xml version="1.0"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:ext="http://datachannel.com/Bonnell/Transform"
extension-element-prefixes="ext"
version="1.0"
>
<xsl:template match="/">
<html>
<head><title>Word Document</title>
</head>
<body>
<h1>Word Document</h1>
<table width="100%">
<tr bgcolor="yellow">
<td width="60%"><b>Paragraphs in Word Document</b></td>
<td width="40%"><b>Traversal Target Details</b></td>
</tr>
<xsl:apply-templates/>
</table>
</body>
</html>
</xsl:template>
<xsl:template match="Para">
<tr>
<!-- This first choose handles the presentation of the base paras.
It checks for any node-to-node transclusions and resolves them.
-->
<xsl:variable name="para-node" select="."/>
<xsl:choose>
<xsl:when test="ext:is-anchored-object()">
<td width="60%" valign="top" bgcolor="yellow">
<p>
<!-- First see if the node is transcluded and if so, get the transcluded value: -->
<xsl:choose>
<xsl:when test="ext:has-target-of-role('used-node')">
<xsl:for-each select="ext:get-traversal-targets('transclusion','used-node')[1]">
<xsl:apply-templates select="ext:get-property-value('traversals')"/>
</xsl:for-each>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="ext:get-object-property($para-node, 'Text')"/>
</xsl:otherwise>
</xsl:choose>
<!-- Now handle any non-transclusion links: -->
<xsl:for-each select="ext:get-traversal-targets()">
<xsl:variable name="rolename"
select="ext:get-object-property(ext:get-property-value('anchor'),
'getRoleName')"/>
<xsl:choose>
<xsl:when test="$rolename = 'used-node'">
<!-- Already handled above -->
</xsl:when>
<xsl:when test="$rolename = 'comment'">
<xsl:text>[</xsl:text>
<xsl:element name="a">
<xsl:attribute name="href">
<xsl:value-of select="ext:get-fragmentid-for-node()"/>
</xsl:attribute>
<xsl:text>*</xsl:text>
</xsl:element>
<xsl:text>]</xsl:text>
</xsl:when>
</xsl:choose>
</xsl:for-each>
</p>
</td>
</xsl:when>
<xsl:otherwise>
<td width="60%" valign="top">
<p><xsl:value-of select="@Text"/></p>
</td>
</xsl:otherwise>
</xsl:choose>
<!-- This third choose populates the second column of our table, which reflects
every traversal target regardless of presentation semantic.
-->
<xsl:choose>
<xsl:when test="ext:is-anchored-object()">
<td width="40%" bgcolor="yellow" valign="top">
<xsl:for-each select="ext:get-traversal-targets()">
<xsl:variable name="rolename"
select="ext:get-object-property(ext:get-property-value('anchor'),
'getRoleName')"/>
<xsl:choose>
<xsl:when test="$rolename = 'more-info'">
<xsl:for-each select="ext:get-property-value('traversals')[1]">
<xsl:element name="a">
<xsl:attribute name="href">
<ext:traversaltargetnode
outputdir="./website"
outputsuffix=".html"
style="link"/>
</xsl:attribute>
<xsl:text>More info...</xsl:text><br/>
</xsl:element>
</xsl:for-each>
</xsl:when>
<xsl:otherwise>
<xsl:text>[</xsl:text><xsl:value-of select="$rolename"/><xsl:text>] </xsl:text>
<xsl:for-each select="ext:get-property-value('traversals')">
<xsl:value-of select="@ClassName"/><xsl:text>: </xsl:text>
The sample hyperdocument consists of four documents: the Word document, a trivial Excel spread sheet, an XML document containing “annotation” links, and an XML “link set” document that serves as the hub document and contains “transclusion” and “more-info” links among nodes in the Word and Excel documents.
The Excel document is shown in 7. It contains just a few cells to demonstrate the ability to construct groves from spreadsheets. The property set used here for Excel documents is what you would expect: Spreadsheet contains Row nodes. Each Row node has a “Cells” property whose value is a list of “Cell” nodes. Each Cell node has a string property containing its text.
The “link set” document is shown in 8. It is a HyTime hyperdocument. It declares as unparsed entities the other documents in the hyperdocument, thus establishing the “bounded object set” of the hyperdocument. The “linkset” element contains two links. The “transclusion” link connects a node in the Word document (the 9th paragraph) to a node in the Excel spreadsheet (the first cell of the second row). The “more-info” link relates the 6th node in the Word document to another node in the Excel spreadsheet. The intent of this link is to enable navigation from the base information (where the reader is) to information that provides more detail on the subject of the base information.
<?xml version="1.0" ?>
<!DOCTYPE linkset PUBLIC "urn:datachannel:samples:linkset DTD" [
<!ENTITY ottaviano
PUBLIC "urn:datachannel:samples:non-sgml:ottaviano.doc"
NDATA Word
>
<!ENTITY excel001
PUBLIC "urn:datachannel:samples:non-sgml:excel001.xls"
NDATA Excel
>
<!ENTITY myAnnotations
PUBLIC "urn:datachannel:samples:non-sgml:myannotations.xml"
NDATA sgml
>
]>
<linkset>
<transclusion using-node="1 9" using-doc="ottaviano"
used-node="1 2 1" used-node-doc="excel001"/>
<more-info base-info="1 6" base-doc="ottaviano"
more-info="1 1 2" more-info-doc="excel001"/>
</linkset>
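The address values in these links ("1 9", "1 2 1", "1 1 2") can be read as tree locations: the first number selects the root and each later number selects the nth child of the node selected so far. That reading matches the description above (for example, "1 2 1" is the first cell of the second row), but it is an assumption here, and the sketch below is not the Bonnell resolver. The function `resolve_treeloc` and the sample spreadsheet data are hypothetical.

```python
def resolve_treeloc(root, address):
    """Resolve a space-separated, 1-based tree location against nested lists.

    The first step must be 1 and selects the root itself; every later step
    selects the nth child of the current node.
    """
    steps = [int(step) for step in address.split()]
    if not steps or steps[0] != 1:
        raise ValueError("a tree location must start at the root (1)")
    node = root
    for step in steps[1:]:
        if not isinstance(node, list) or not 1 <= step <= len(node):
            raise IndexError("no child %d at this level" % step)
        node = node[step - 1]
    return node


# Hypothetical spreadsheet: a list of rows, each a list of cell strings.
spreadsheet = [["Name", "Role"], ["Steve", "Editor"]]
print(resolve_treeloc(spreadsheet, "1 2 1"))  # first cell of the second row
```

Because the address depends only on child positions, the same function works for any data that can be viewed as a tree, which is what lets one link syntax address both the Word and the Excel groves.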
The final document in the set, “myAnnotations.xml”, contains a set of “annotation” links that have the semantic of applying comments to nodes. This is the sort of thing online reviewers might do, for example. The document is shown in 9. This document is also a HyTime hyperdocument and also declares as an unparsed entity the document it is linking to. It contains one “annotation” link, which binds the 6th paragraph in the Word document to the “comment” element.
Note that both this link and the more-info link point to the 6th paragraph of the Word document. This is an example of how the use of extended linking can make a node a member of multiple links. This is one reason that the style sheet in this case is as complex as it is: it has to account for all the possible link types and anchor roles that might be associated with a node. In most cases the style sheet writer knows what link types are available to document authors—link types are normally designed and managed with the same care as base document types. In this case we have written the style sheet to provide default behaviors for any unaccounted for link types or anchor roles (the “otherwise” clause in the last choice group in the style sheet).
<?xml version="1.0" ?>
<!DOCTYPE annotations PUBLIC "urn:datachannel:samples:annotation hytime DTD" [
<!ENTITY ottaviano
PUBLIC "urn:datachannel:samples:non-sgml:ottaviano.doc"
NDATA Word
>
]>
<annotations>
<annotation target-doc="ottaviano"
target="1 6">
<comment>
<p>Steve is one of the editors of the HyTime standard and is also
a co-chair of this conference.
</p>
</comment>
</annotation>
</annotations>
This example has shown a number of interesting things: applying XSL style sheets to non-XML data, using XSL style sheets to condition presentation based on the linking status of nodes, and the use of extended links to impose a variety of relationships onto data that is otherwise largely incapable of doing sophisticated linking (and is, in any case, incapable of doing it in non-proprietary ways). These abilities have an almost limitless scope of applicability and enable the satisfaction of a number of important information management and presentation requirements that have, to date, been prohibitively expensive to satisfy for all but a highly motivated few.
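To make the link-related queries concrete, here is a minimal sketch of a link index that answers the two questions the style sheet asks of each node: does it participate in any link, and what are its traversal targets, optionally filtered by link type and anchor role. The Python names (`Link`, `LinkIndex`, `is_anchored`, `traversal_targets`), the string node addresses, and the role names given to the annotation link are hypothetical; they only loosely mirror the extension functions is-anchored-object() and get-traversal-targets() and are not the Bonnell API.

```python
from collections import namedtuple

# A link has a type and maps each anchor role to the node address it points at.
Link = namedtuple("Link", ["link_type", "anchors"])


class LinkIndex:
    """Hypothetical index from node addresses to the links that anchor them."""

    def __init__(self, links):
        self.links = list(links)

    def is_anchored(self, node):
        """True if the node is an anchor of at least one link."""
        return any(node in link.anchors.values() for link in self.links)

    def traversal_targets(self, node, link_type=None, role=None):
        """Return (role, target) pairs reachable from node through its links."""
        targets = []
        for link in self.links:
            if node not in link.anchors.values():
                continue
            if link_type is not None and link.link_type != link_type:
                continue
            for anchor_role, target in link.anchors.items():
                if target == node:
                    continue  # do not traverse back to the starting anchor
                if role is None or anchor_role == role:
                    targets.append((anchor_role, target))
        return targets


index = LinkIndex([
    Link("transclusion", {"using-node": "word:1 9", "used-node": "excel:1 2 1"}),
    Link("more-info", {"base-info": "word:1 6", "more-info": "excel:1 1 2"}),
    Link("annotation", {"target": "word:1 6", "comment": "annotations:comment"}),
])
print(index.traversal_targets("word:1 6"))
```

The 6th paragraph is an anchor of two links, so a query without filters returns two targets with different roles; the presentation decision for each role stays with the caller, as it does in the style sheet.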
The challenge of using XSL in the context of generalized link management has two parts: First you must bind XSL processing to groves (because groves are the standard by which all data content is represented for the purposes of linking and addressing in the link management system). Second you must map XSL to hyperdocuments so that individual documents can be processed with respect to the links among the components of the documents. This section describes how the binding of groves to XSL was accomplished. The next section describes the binding of XSL to hyperdocuments.
The basic data model for groves is quite simple: a grove consists of one or more nodes. One key difference between groves and the DOM is that the DOM has a fixed set of node types reflecting the data model for XML documents. The grove mechanism is more general, allowing the definition of arbitrary node types, which allows a given grove to represent data of any type. Using this general mechanism, the HyTime standard defines a specific grove type for SGML (and by extension, XML) documents, the “SGML property set”. Groves that conform to the SGML property set are directly analogous to XML DOMs and have essentially the same information content. Of course, the SGML property set, which predates XML, does not directly reflect some XML-specific concepts, such as name spaces.
The grove mechanism was defined in order to provide a common data model for use in both linking applications (e.g., HyTime) and transformation and styling applications (e.g., DSSSL). The grove mechanism had to be generic over all possible data types so that hyperdocuments could involve data of any type, not just XML data (and without requiring that non-XML data first be converted to XML). When using groves, all data, regardless of its specific semantics, is represented using a consistent underlying data model. This enables generic addressing using common semantics and syntaxes. By contrast, in the current Web world, every data type must define its own addressing semantics and syntax.
A grove is a directed graph of nodes. Each node has a specific class and a set of properties, as defined in the grove's “property set” (the schema for the grove). The properties of a node may be primitive data types (string, integer, Boolean, etc.), lists of primitives, singleton nodes, lists of nodes, or dictionaries of nodes (“named node lists”). Unlike DOMs, which are strict trees, groves can represent arbitrary directed graphs because any node's properties may point to any other nodes in the same grove. In addition, groves can be connected to each other when a property of a node in one grove points to a node or nodes in another grove (an “unrestricted reference”).
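A minimal sketch of this data model in Python follows. It is not the GroveMinder API; the class `GroveNode`, the function `reachable`, and the sample property names are assumptions chosen for illustration. It shows properties holding a primitive, a singleton node, a node list, and a named node list, plus a property that points back at an ancestor so that the structure is a graph rather than a tree.

```python
class GroveNode:
    """Hypothetical grove node: a class name and named properties."""

    def __init__(self, class_name, **properties):
        self.class_name = class_name
        self.properties = dict(properties)


def reachable(start):
    """Collect every node reachable from start by following node-valued properties."""
    seen_ids, found, stack = set(), [], [start]
    while stack:
        node = stack.pop()
        if id(node) in seen_ids:
            continue  # already visited: groves may contain cycles
        seen_ids.add(id(node))
        found.append(node)
        for value in node.properties.values():
            if isinstance(value, GroveNode):
                stack.append(value)
            elif isinstance(value, (list, dict)):
                members = value.values() if isinstance(value, dict) else value
                stack.extend(m for m in members if isinstance(m, GroveNode))
    return found


cell_a = GroveNode("Cell", Text="42")                    # primitive property
cell_b = GroveNode("Cell", Text="7", SeeAlso=cell_a)     # singleton node property
row = GroveNode("Row", Cells=[cell_a, cell_b])           # node list property
sheet = GroveNode("Spreadsheet", Rows=[row],
                  ByName={"first": cell_a})              # named node list
cell_a.properties["Owner"] = sheet                       # back reference: a cycle

print(sorted(node.class_name for node in reachable(sheet)))
```

The traversal has to track visited nodes because, unlike a DOM tree walk, following properties in a grove can lead back to a node already seen.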
Groves can be viewed in such a way that they are strict trees. This design was specifically geared toward the needs of representing SGML documents and thus provides convenience features that make working with things like XML documents as easy and intuitive as possible. When an XML document grove is viewed as a strict tree, the structural organization of the grove is essentially the same as an XML DOM.
The grove mechanism provides several convenience features, the first being the ability to view the grove as a strict tree.
Two other key conveniences are “content” properties and “data” properties. Content properties are properties that have been defined in the grove's property set as containing the content of the node. For example, in an XML element node, the property named “children”, which contains the nodes constructed from the syntactic content of the element, is declared to be the “content” property of the node. By contrast, the property named “attributes”, which contains the attribute nodes for the element, is not content of the node. The “data” property is that property that has been defined as holding the “data” for the node. Thus, for any node you can ask for its “content” or its “data” and will get back the value of whatever property has been identified as the data or content property, if any (there is no requirement that a node have a data or content property). The content and data properties correspond to the “childNodes” and “value” properties in DOM nodes.
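The sketch below illustrates content and data properties with a hypothetical property set that records, per node class, which property (if any) plays each role. The names `PROPERTY_SET`, `GroveNode`, `content_of`, and `data_of`, and the property names used in the table, are illustrative assumptions and not part of any grove API.

```python
# Hypothetical property set: for each node class, name the property declared
# as its content property and the one declared as its data property.
PROPERTY_SET = {
    "element": {"content": "children", "data": None},
    "datachar": {"content": None, "data": "char"},
    "Para": {"content": None, "data": "Text"},
}


class GroveNode:
    def __init__(self, class_name, **properties):
        self.class_name = class_name
        self.properties = dict(properties)


def _role_value(node, role):
    """Look up the property that the property set assigns to the given role."""
    prop_name = PROPERTY_SET.get(node.class_name, {}).get(role)
    if prop_name is None:
        return None  # a node need not have a content or data property
    return node.properties.get(prop_name)


def content_of(node):
    return _role_value(node, "content")


def data_of(node):
    return _role_value(node, "data")


text = GroveNode("datachar", char="x")
elem = GroveNode("element", children=[text], attributes=[])
print(content_of(elem) == [text], data_of(text), content_of(text))
```

Callers can ask for the content or data of any node without knowing its class; the "attributes" property of the element is never returned as content because the property set does not declare it as such.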
It should be stressed that there is no magic to groves. They are simply an application of the general concept of nodes with properties that has been specialized to meet the specific requirements reflected in the HyTime and DSSSL standards, that is, the requirements of generalized hyperlinking and transformation systems. Hyperlinking and styling require a consistent fundamental data model that their processing semantics can be defined in terms of. Groves provide such a model but are not the only possible model. It is likely that the same forces that led to the development of XSL as a refinement of DSSSL and other style languages will lead to a refinement of the grove concept once the community at large realizes the need for a data model that is more general than the XML DOM.
The mapping of XSL to groves is straightforward. First, we treat SGML document groves essentially as though they were DOM trees, so that all the normal XSL operations that are bound to XML constructs (elements, attributes, data characters) work in exactly the same way for SGML document groves. Note that SGML document groves can be constructed from XML documents because XML is SGML. The only potential problem is the use of name spaces—the SGML property set predates the XML name space specification and therefore has no specific support for it. However, this is the same problem as for DOM 1, and can be solved simply by enhancing the grove-to-XSL binding to add the necessary name space support, if required. For a given XML document, an XSL style sheet should produce identical results for both a DOM-based and grove-based implementation.
For all other grove types, the “apply templates” and “for each” operations of XSL iterate over node lists of grove nodes. All nodes are treated as though they were Element nodes in a DOM. Nodes can be selected by class name (instead of element type name). Node properties are selected as though they were attributes. All primitive values are converted to strings.
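A minimal sketch of these rules, assuming a hypothetical `GroveNode` class (this is not the authors' grove utility module): a name test compares against the node's class name, an attribute step looks up a property, and primitive values are converted to strings. The helper names `matches_name_test` and `select_attribute`, and the XPath-style rendering of Booleans, are assumptions.

```python
class GroveNode:
    def __init__(self, class_name, **properties):
        self.class_name = class_name
        self.properties = dict(properties)


def matches_name_test(node, name):
    """Treat the node class as if it were an element type name."""
    return name == "*" or node.class_name == name


def select_attribute(node, name):
    """Treat a primitive-valued property as if it were an attribute.

    Returns a string; a missing or non-primitive property yields the empty
    string, as selecting a missing attribute would.
    """
    value = node.properties.get(name)
    if value is None or isinstance(value, (GroveNode, list, dict)):
        return ""
    if isinstance(value, bool):
        # Assumption: render Booleans the way XPath's string() does.
        return "true" if value else "false"
    return str(value)


paras = [GroveNode("Para", Text="Hello"), GroveNode("Para", Text="", Level=2)]
for node in paras:
    if matches_name_test(node, "Para"):                       # like match="Para"
        print("<p>%s</p>" % select_attribute(node, "Text"))  # like select="@Text"
```

With these two helpers in place, a pattern such as "Para" and an expression such as "@Text" need no new syntax, which is the property of the mapping that the style sheets above rely on.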
The Bonnell system is implemented primarily in Python. Thus we selected a Python-based XSLT engine. The 4Suite package from Fourthought, Inc. was the most complete package we could find and had no licensing constraints on its use. The GroveMinder system also provides a Python API, so there was no barrier to combining the two tools.
Because Python is a dynamically typed, interpreted language, modifying the XSLT engine to act on grove nodes instead of DOM nodes was much easier than it would be in a statically typed language. The advantage of Python in this case is that XSLT implementation methods that take or return nodes or node lists are perfectly happy to directly accept or return grove nodes (or, in fact, any object). In a Java or C++ implementation, we would be forced either to modify the class hierarchy of the XSLT implementation to accommodate grove nodes in addition to DOM nodes or to wrap grove nodes in classes that implement the DOM API.
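For contrast, the wrapping approach can be sketched even in Python. The adapter below presents a grove node through a few DOM-style members (`nodeType`, `nodeName`, `childNodes`, `getAttribute`). The `GroveNode` class, the `content_property` argument, and the adapter itself are hypothetical; they illustrate the alternative, not anything the authors built.

```python
ELEMENT_NODE = 1  # same numeric value as the DOM's Node.ELEMENT_NODE


class GroveNode:
    def __init__(self, class_name, **properties):
        self.class_name = class_name
        self.properties = dict(properties)


class DomStyleAdapter:
    """Expose a grove node through a small, read-only, DOM-like interface."""

    nodeType = ELEMENT_NODE  # every grove node is presented as an element

    def __init__(self, grove_node, content_property="Content"):
        self._node = grove_node
        self._content_property = content_property

    @property
    def nodeName(self):
        return self._node.class_name

    @property
    def childNodes(self):
        children = self._node.properties.get(self._content_property, [])
        return [DomStyleAdapter(child, self._content_property) for child in children]

    def getAttribute(self, name):
        value = self._node.properties.get(name)
        if value is None or isinstance(value, (GroveNode, list, dict)):
            return ""  # DOM getAttribute returns "" for a missing attribute
        return str(value)


doc = GroveNode("WordDocument", Content=[GroveNode("Para", Text="Hello")])
wrapped = DomStyleAdapter(doc)
first = wrapped.childNodes[0]
print(first.nodeName, first.getAttribute("Text"))
```

The cost of this approach is visible even in a sketch: every node access allocates a wrapper, and each DOM member the engine touches must be mapped by hand.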
Implementing the XSL-to-grove binding required the following modifications to the 4Suite tools:
This required us to make modifications in 9 separate Python modules in the 4Suite package (out of scores of modules) and to add a grove-specific utility module that provides the functions needed to implement XSL semantics, such as node match, for groves. A team of two programmers working as a pair spent about 4 days getting this initial binding implemented and tested, including finding and fixing bugs in the 4Suite code itself. The total amount of existing 4Suite code modified or added is about 100 lines. The utility module is about 300 lines (of which 100 is a brute-force parser for formal system identifiers that we already had lying about and that didn't warrant the attention needed to make it smaller).
The use of a different grove implementation would simply require reworking the grove utility module to use the new implementation's API (there is no official standard API for groves, although the GroveMinder API serves somewhat as a de-facto standard).
A comparable implementation in other languages should be of roughly the same magnitude.
A C++ implementation would be complicated by the need to do more with either the base implementation's class hierarchy to provide a common superclass over DOM nodes and grove nodes or wrapping of grove nodes in a DOM API.
A Java implementation would require a Java grove implementation because GroveMinder does not have a Java binding. A Java grove implementation would not be difficult to develop (Alex Milowski demonstrated one at SGML '97 but subsequently lost the rights to it through corporate acquisition) but we are not aware of any available Java grove implementations that are not themselves DOM based. Given a grove implementation, a Java XSL-to-grove binding would have the same additional complications as a C++ implementation.
To implement the mapping of XSL and XPath to groves, we had to modify the base 4Suite XSLT and XPath code to select a DOM or grove processing path based on the type of node being processed. We also had to implement the grove-specific processing logic needed to provide XSL and XPath semantics for grove nodes.
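The dispatch pattern described here can be sketched as follows. The sketch uses the standard library's `xml.dom.minidom` in place of the 4Suite DOM and a hypothetical `GroveNode` class in place of GroveMinder nodes; `is_grove_node`, `string_value`, and the use of a "Text" property as the node's data are illustrative assumptions, not 4Suite or GroveMinder functions.

```python
from xml.dom import minidom


class GroveNode:
    def __init__(self, class_name, **properties):
        self.class_name = class_name
        self.properties = dict(properties)


def is_grove_node(node):
    """Decide which processing path applies to this node."""
    return isinstance(node, GroveNode)


def string_value(node):
    """Return the text of a node, choosing the grove or the DOM path by node type."""
    if is_grove_node(node):
        return _grove_string_value(node)
    return _dom_string_value(node)


def _grove_string_value(node):
    # Assumption: the 'Text' property holds the node's data.
    return str(node.properties.get("Text", ""))


def _dom_string_value(node):
    # Concatenate all descendant text, as XPath's string value of an element does.
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(_dom_string_value(child) for child in node.childNodes)


dom = minidom.parseString("<doc><p>from a <b>DOM</b></p></doc>")
print(string_value(dom.documentElement))
print(string_value(GroveNode("Para", Text="from a grove")))
```

Keeping the type check at the top of each public function leaves the existing DOM code path untouched, which is consistent with the small number of lines the authors report changing.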
This section reflects our modifications made to the 0.11.1a1 version of the 4Suite package. Our modifications were initially written against an earlier version of the 4Suite code and then were migrated to the latest released version (at the time of writing) when we moved from Python 1.5.2 to 2.1.
The XSLT and XPath implementations comprise about 80 separate Python modules. Of these 80 we had to modify 9: 6 for XSLT and 3 for XPath.
To hook grove-specific processing into the XSLT processor, we had to initially modify the following modules:
Most of the modifications are, not surprisingly, in Processor.py, which implements the main processing loop that iterates over the input document tree. The modifications to the other modules are minor, being checks to see if the processing is being applied to a DOM node or a grove node.
Note: Only the methods that have been changed from the base 4Suite code are presented here. We presume that interested readers can get a copy of the 4Suite code for reference if needed.
The Processor.py module defines the Processor class, which provides the API for applying XSLT processing to documents.
The main addition to Processor is the runGroveNode() method, shown in 10. This method parallels the normal runNode() method but applies it to groves. It sets up the grove for processing by first using the maximum “grove plan”, which ensures in this case that any processing instruction nodes in the document are visible (by default, processing instruction nodes are not exposed in an SGML grove). If there are processing instructions, it looks for any style sheet PIs. The call to the grove-specific grove_execute() method is a workaround for a problem in the latest version of the 4Suite code that is exposed by hyperdocument-specific processing.
The __checkGroveStyleSheetPis() method called from runGroveNode() simply implements the same business logic as the normal __checkStyleSheetPis() but against SGML document groves.
def runGroveNode(self, node, ignorePis=0, topLevelParams=None, writer=None,
                 baseUri='', outputStream=None, startAndEndDocument = 1 ):
    """
    Run the stylesheet processor against the given grove node with the
    stylesheets that have been registered.
    If writer is None, use the TextWriter, otherwise, use the supplied writer.
    """
    topLevelParams = topLevelParams or {}
    rootNode = node.GroveRoot
    if ignorePis == 0:
        rootNode = GroveUtility.assignMaxGrovePlan( "SGML", rootNode )
        if ( GroveUtility.hasPIs(rootNode) ):
            self.__checkGroveStylesheetPis( rootNode, baseUri )
    result = self.grove_execute(node, ignorePis, topLevelParams, writer,
                                baseUri, outputStream, startAndEndDocument)
    return result
Figure 11 shows the applyBuiltins() and _applyGroveBuiltins() methods. applyBuiltins() is the original method. It has been modified to add a check to see if the node being processed is a grove node. If it is, processing is redirected to the _applyGroveBuiltins() method. The _applyGroveBuiltins() method differs from applyBuiltins() in the way that the node content is accessed and in that it omits the check for attribute nodes.
def applyBuiltins(self, context, mode):
    if GroveUtility.isGroveNode( context.node ):
        self._applyGroveBuiltins( context, mode)
        return
    if context.node.nodeType == Node.TEXT_NODE:
        self.writers[-1].text(context.node.data)
    elif context.node.nodeType in [Node.ELEMENT_NODE, Node.DOCUMENT_NODE]:
        origState = context.copyNodePosSize()
        node_set = context.node.childNodes
        size = len(node_set)
        pos = 1
        for node in node_set:
            context.setNodePosSize((node,pos,size))
            self.applyTemplates(context, mode)
            pos = pos + 1
        context.setNodePosSize(origState)
    elif context.node.nodeType == Node.ATTRIBUTE_NODE:
        self.writers[-1].text(context.node.value)
    return

def _applyGroveBuiltins(self, context, mode):
    # applyBuiltins -- calls this function if it determines
    # we are in grove land instead of the DOM.
    nodeContent = GroveUtility.getNodeContent( context.node )
    if type( nodeContent ) == types.StringType:
        self.writers[-1].text(nodeContent)
    elif nodeContent:
        origState = context.copyNodePosSize()
        size = len(nodeContent)
        pos = 1
        for node in nodeContent:
            context.setNodePosSize((node,pos,size))
            self.applyTemplates(context, mode)
            pos = pos + 1
        context.setNodePosSize(origState)
    #elif context.node.nodeType == Node.ATTRIBUTE_NODE:
    #    pass
    return
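The grove built-in rule reduces to a small recursion: string content is written out, and node-list content is visited with a 1-based position and the list size. The following stand-alone sketch illustrates that control flow in current Python 3 rather than the Python 2.1 of the listings. StubNode, apply_builtins, and the content attribute are illustrative stand-ins, not part of 4Suite or GroveMinder.

```python
class StubNode:
    """Illustrative stand-in for a grove node; content is a str or a list of StubNode."""

    def __init__(self, content):
        self.content = content


def apply_builtins(node, out, trace):
    """Emulate the built-in rule: emit string content, recurse into node lists."""
    content = node.content
    if isinstance(content, str):
        out.append(content)  # character data is copied to the output
    elif content:
        size = len(content)
        for pos, child in enumerate(content, start=1):
            trace.append((pos, size))  # what setNodePosSize() would record
            apply_builtins(child, out, trace)


out, trace = [], []
doc = StubNode([StubNode("Hello, "), StubNode([StubNode("grove")])])
apply_builtins(doc, out, trace)
print("".join(out))
```

The trace list only exists to make the position and size bookkeeping visible; the real method stores that state on the XSLT context and restores it afterwards.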
Figure 12 shows the __checkGroveStylesheetPis() method called by runGroveNode(). It simply does the same processing as the DOM-based __checkStyleSheetPis() method against PIs as represented in SGML document groves. The _lookupUrlForPath() method translates the system paths GroveMinder maintains for grove source documents into the URLs expected by the XSLT processor.
def __checkGroveStylesheetPis(self, node, baseUri):
    #
    # Note: A Stylesheet PI can only be in the prolog, acc to the NOTE
    #
    # http://www.w3.org/TR/xml-stylesheet/
    # NOTE: If the xml-stylesheet processing instruction occurs in the
    # external DTD subset or in a parameter entity, it is possible
    # that it may not be processed by a non-validating XML processor
    #
    pis_found = 0
    piSet = GroveUtility.getPIs( node )
    for child in piSet:
        if string.find( child.SystemData, 'xml-stylesheet') != -1:
            data = child.SystemData
            if data[-1] == '?':
                data = data[:-1]
            data = string.splitfields(data,' ')
            sty_info = {}
            for d in data:
                seg = string.splitfields(d, '=')
                if len(seg) == 2:
                    sty_info[seg[0]] = seg[1][1:-1]
            if sty_info.has_key('href'):
                if not sty_info.has_key('type') or sty_info['type'] in XSLT_IMT:
                    path = self._lookupUrlForPath(node, sty_info['href'])
                    if path:
                        self.appendStylesheetUri(path)
                        pis_found = 1
                    else:
                        print "Unable to load stylesheet: %s" % sty_info['href']
    return pis_found

def _lookupUrlForPath( self, rootGroveNode, path):
    """
    Returns a URL for the given style sheet path, or None if the
    path cannot be resolved.
    """
    if os.path.isfile( path ):
        return GroveUtility.filename2url(os.path.abspath(path))
    uriResolver = Ft.Lib.Uri.BaseUriResolver()
    try:
        uriResolver.resolve(path)
        return path
    except:
        pass
    source = GroveUtility.soi( rootGroveNode.groveDefinition().sourceData() )
    absPath = os.path.join(os.path.dirname(source),path)
    if os.path.isfile(absPath):
        return GroveUtility.filename2url(os.path.abspath(absPath))
    return None
The grove_execute() method of Processor, shown in Figure 13, applies XSLT processing to grove nodes instead of to DOM nodes. The main difference here is the addition of a hyperdocument object to the constructed context. The hyperdocument object is the direct binding of the base XSLT processor to the Bonnell hyperdocument management system. This binding is explained in “Binding XSL To Abstract Hyperdocuments”.
def grove_execute(self, node, ignorePis=0, topLevelParams=None, writer=None,
                  baseUri='', outputStream=None, startAndEndDocument = 1 ):
    """
    Run the stylesheet processor against the given grove node with the
    stylesheets that have been registered.
    If writer is None, use the TextWriter, otherwise, use the supplied writer.
    """
    if len(self._stylesheets) == 0:
        raise XsltException(Error.NO_STYLESHEET)
    self._outputParams = self._stylesheets[0].outputParams
    if writer:
        self.writers.append(writer)
    else:
        self.addHandler(self._outputParams, outputStream, 0)
    self._namedTemplates = {}
    tlp = topLevelParams.copy()
    for sty in self._stylesheets:
        sty.processImports(node, self, tlp)
        named = sty.getNamedTemplates()
        for name,template_info in named.items():
            if not self._namedTemplates.has_key(name):
                self._namedTemplates[name] = template_info
    for sty in self._stylesheets:
        tlp = sty.prime(node, self, tlp)
    #Run the document through the style sheets
    if startAndEndDocument == 1:
        self.writers[-1].startDocument()
    if node.ClassName == "SgmlDocument":
        node = node.DocumentElement
    context = XsltContext.XsltContext(node, 1, 1, None,
                                      hyperdoc=self._getHyperDocument())
    self.applyTemplates(context, None)
    if startAndEndDocument == 1:
        self.writers[-1].endDocument()
    Util.FreeDocumentIndex(node)
    result = self.writers[-1].getResult()
    if startAndEndDocument == 1:
        self._reset()
    context.release()
    return result
Stylesheet.py defines the top-level classes for style sheets, in particular, StyleSheetElement. To accommodate groves, we modified the prime() method of StyleSheetElement to get the grove root rather than the owner document, as is done for DOM processing.
def prime(self, contextNode, processor, topLevelParams):
    #######################################################
    # Grove impl -- changes
    #
    if GroveUtility.isGroveNode( contextNode ):
        # No trailing comma here: it would make primingContext a 1-tuple.
        primingContext = contextNode.GroveRoot
    else:
        primingContext = contextNode.ownerDocument or contextNode
    self._primedContext = context =\
        XsltContext.XsltContext(primingContext,
                                1,
                                1,
                                processorNss=self.namespaces,
                                stylesheet=self,
                                processor=processor)
    # (rest of method is unchanged)
The XPatterns.py module defines classes that implement the processing of patterns used in template matches. The only modification needed here was to the match() method of the DocumentNodeTest class to redirect match processing for grove nodes to the groveNodeMatch() function in the grove utility module.
The AttributeValueTemplate.py module defines the AttributeValueTemplate class. The only modification required here was to modify the evaluate() method to handle the processing of attribute values from grove nodes.
def evaluate(self, context):
    if GroveUtility.isGroveNode( context.node ):
        expansions = []
        for pPart in self._parsedParts:
            returnValue = GroveUtility.getStringValue( pPart.evaluate( context ))
            expansions.append( returnValue )
    else:
        expansions = map(
            lambda x, c=context: Conversions.StringValue(x.evaluate(c)),
            self._parsedParts
            )
    # (rest of method is unchanged)
The ApplyTemplatesElement.py module defines the ApplyTemplatesElement class. The only modification required here was to modify the instantiate() method to get the child nodes from the context grove node instead of from a DOM node.
def instantiate(self, context, processor):
    origState = context.copy()
    context.setNamespaces(self._nss)
    params = {}
    mode = self._instantiateMode(context)
    for param in self._params:
        (name, value) = param.instantiate(context, processor)[1]
        params[name] = value
    if self._expr:
        node_set = self._expr.evaluate(context)
    else:
        #############################################################
        # Grove impl -- changes
        if GroveUtility.isGroveNode( context.node ):
            node_set = GroveUtility.getChildren( context.node )
        else:
            node_set = context.node.childNodes
    # (rest of method is unchanged)
The ValueOfElement.py module defines the ValueOfElement class. The only modification required here was to the instantiate() method. The ValueOf element outputs the text value of the result of evaluating its select expression. The grove-specific modification simply implements the logic needed to get the text value of the different result and node types.
def instantiate(self, context, processor):
    original = context.copy()
    context.processorNss = self._nss
    text = ""
    result = self._expr.evaluate(context)
    if GroveUtility.isGroveNode( context.node ):
        if type( result ) == types.StringType:
            text = result
        elif ((type(result) == types.ListType) or \
              (type(result) == types.TupleType)) and \
              len(result) > 0:
            for singleReturn in result:
                if type(singleReturn) == types.StringType:
                    text = text + GroveUtility.getStringValue(context.node,
                                                              singleReturn)
                else:
                    text = text + GroveUtility.getStringValue(singleReturn)
    else:
        text = Conversions.StringValue(result)
    # (rest of method is unchanged)
In order to apply XPath to groves, we had to modify three modules in the XPath implementation: ParsedNodeTest.py, ParsedAbsoluteLocationPath.py, and ParsedAxisSpecifier.py. As with the XSL processing, these modifications either redirect to grove-specific processing logic or directly implement the XPath semantics for grove nodes.
The parsed node test module required modifications to the match() methods of two classes: NodeTest and NodeNameTest. In both cases, the match method checks to see if the node is a grove node and, if it is, redirects to the groveNodeMatch() function from the grove utility module. Figure 19 shows the NodeNameTest version.
class NodeNameTest(NodeTestBase):
    def match(self, context, node, principalType=Node.ELEMENT_NODE):
        if GroveUtility.isGroveNode(node):
            return GroveUtility.groveNodeMatch( context, node, self, principalType )
        if node.nodeType == principalType:
            return node.nodeName == self._nodeName
        return 0
The parsed axis specifier module required modification to the select() methods of the ParsedAncestorOrSelfAxisSpecifier, ParsedAttributeAxisSpecifier, ParsedChildAxisSpecifier, and ParsedPrecedingSiblingAxisSpecifier classes. In each case, the only differences between the DOM and grove paths are those caused by differences of detail between the DOM and GroveMinder APIs. Otherwise, the processing logic is identical. Note that the GroveMinder API predates the DOM by at least two years. However, there's no reason a grove implementation couldn't emulate the DOM API. (Note: at the time of writing the modifications had not been exhaustively tested over all the XPath axes. It is likely that a few more modules or methods will need to be modified to account for the remaining axes. Because we follow the Extreme Programming principle of use-case-driven implementation and because we had not had time to develop an extensive set of test cases, we had only tested those axes actually used in our test style sheets.)
class ParsedAncestorOrSelfAxisSpecifier(AxisSpecifier):
    def select(self, context, nodeTest):
        """Select all of the ancestors including ourselves through the root"""
        node = context.node
        if nodeTest(context, node, self.principalType):
            nodeSet = [node]
        else:
            nodeSet = []
        #################################
        # Grove impl
        if GroveUtility.isGroveNode( node ):
            parent = node.Parent
            while parent:
                if nodeTest(context, parent, self.principalType):
                    nodeSet.append(parent)
                parent = parent.Parent
        else:
            parent = ((node.nodeType == Node.ATTRIBUTE_NODE) and
                      node.ownerElement or node.parentNode)
            while parent:
                if nodeTest(context, parent, self.principalType):
                    nodeSet.append(parent)
                parent = parent.parentNode
        nodeSet.reverse()
        return (nodeSet, 1)

class ParsedAttributeAxisSpecifier(AxisSpecifier):
    principalType = Node.ATTRIBUTE_NODE
    def select(self, context, nodeTest):
        """Select all of the attributes from the context node"""
        ###################################
        # Grove impl
        if GroveUtility.isGroveNode(context.node):
            attrs = GroveUtility.getAttributes( context.node )
            rt = filter(lambda attr, test=nodeTest,
                        context=context, pt=self.principalType:
                        test(context, attr, pt),
                        attrs or [])
        else:
            attrs = context.node.attributes
            rt = filter(lambda attr, test=nodeTest,
                        context=context, pt=self.principalType:
                        test(context, attr, pt),
                        attrs and attrs.values() or [])
        return (rt, 0)

class ParsedPrecedingSiblingAxisSpecifier(AxisSpecifier):
    def select(self, context, nodeTest):
        """Select all of the siblings that precede the context node"""
        result = []
        ###################################
        # Grove impl
        if GroveUtility.isGroveNode(context.node):
            # Siblings are visited in document order, so no reversal is needed.
            parent = context.node.Parent
            if parent:
                siblings = GroveUtility.getChildren( parent )
                for sibling in siblings:
                    if context.node == sibling:
                        break
                    if nodeTest(context, sibling, self.principalType):
                        result.append(sibling)
        else:
            sibling = context.node.previousSibling
            while sibling:
                if nodeTest(context, sibling, self.principalType):
                    result.append(sibling)
                sibling = sibling.previousSibling
            # Put the list in document order
            result.reverse()
        return (result, 1)
The GroveUtility.py module implements the functions used by the preceding grove-specific code. These utilities primarily serve to provide functions analogous to generic DOM facilities such as childNodes.
The module prolog imports the base requirements and the grove implementation, in this case, GroveMinder. It also defines some constants. The __groveNodeTypes list is used to do type comparisons on GroveMinder grove nodes.
import sys, os, string
import types
import GroveMinder

__groveNodeTypes = [
    "<type 'GroveNode'>",
    "<type 'GroveNodeList'>",
    "<type 'GroveNodeNamedNodeList'>",
    "<type 'GroveStringNamedNodeList'>"
    ]
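The strings in __groveNodeTypes are the form in which Python 2 prints extension types when str(type(obj)) is called. Under current Python 3 the same call yields a string beginning with "<class", so a port would more likely compare the type's __name__. The sketch below shows that variant. The two stub classes stand in for the GroveMinder extension types, and is_grove_type and is_grove_node are hypothetical names, not the module's actual functions.

```python
# Names of the grove types, without the Python 2 "<type '...'>" wrapper.
GROVE_TYPE_NAMES = frozenset([
    "GroveNode",
    "GroveNodeList",
    "GroveNodeNamedNodeList",
    "GroveStringNamedNodeList",
])


class GroveNode:
    """Stand-in for the GroveMinder node type (assumption, not the real class)."""


class GroveNodeList:
    """Stand-in for the GroveMinder node list type (assumption)."""


def is_grove_type(obj):
    """Return True if obj is an instance of any of the grove types."""
    return type(obj).__name__ in GROVE_TYPE_NAMES


def is_grove_node(obj):
    """Return True only for single grove nodes, not for node lists."""
    return type(obj).__name__ == "GroveNode"


print(is_grove_type(GroveNodeList()), is_grove_node(GroveNodeList()))
```

Comparing names rather than classes keeps the check independent of how the extension module exposes its types, which is the property the string comparison in the original relies on.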
The assignMaxGrovePlan() function is a convenience method that ensures that the grove node is being viewed with respect to all the properties the grove implementation can make available. One feature of groves is the ability to hide selected classes or properties of a grove. In the case of SGML, there are many properties, such as the original markup, that many applications do not care about and that are hidden by default.
def assignMaxGrovePlan( builderType, node):
    """
    Assigns the system maximum grove plan to the given node.
    """
    groveBuilder = GroveMinder.makeGroveBuilder( builderType )
    grovePlan = groveBuilder.systemMaximumGrovePlan()
    return node.withGrovePlan( grovePlan )
The isGroveType() and isGroveNode() functions perform basic checks needed to delegate to the appropriate type of processor.
def isGroveType( node ):
    """
    Returns 1 if the given node's type is one of the valid grove node types.
    Returns 0 otherwise.
    """
    # we could ask:
    # hasattr( node, "ClassName" )
    # -- which is an intrinsic property... or ?
    nodeType = str( type( node ) )
    if nodeType in __groveNodeTypes:
        return 1
    return 0

def isGroveNode( node ):
    """
    Returns 1 if the given node's type is GroveNode.
    Returns 0 otherwise.
    """
    if str( type( node ) ) == "<type 'GroveNode'>":
        return 1
    return 0
The groveNodeMatch() function is the heart of the grove-specific XPath processing, implementing the various match semantics of XPath. The __normalizeName() function handles the fact that in SGML, names in markup may or may not be case sensitive, depending on the specific syntax rules in effect for a given document. In SGML groves, the root node may have a RefSyntax property that contains a Syntax node. The Syntax node has properties indicating the case substitutions in effect for general names (elements, attributes, IDs, notations) and for entities. In groveNodeMatch() only general names are matched on, so the code only looks at the general name substitution setting. In XML, all names are case sensitive, so this issue doesn't arise.
As a side note, one subtlety in processing XML documents with SGML tools is the use of SGML declarations that turn off case sensitivity. For example, with GroveMinder, an XML document processed with such an SGML declaration will have all of its names normalized. This often leads to unexpected results, either within the SGML tool context (such as matches of names that have different case in the source) or when pure XML processing is used (such as matches failing that did not fail in the SGML context because the case was being normalized). To maintain sanity, SGML environments should turn case sensitivity on.
def groveNodeMatch( context, node, pattern, principalType="Element"):
    """
    This method is used by Stylesheet.applyTemplates() to perform a grove node match.
    """
    if ( node == node.GroveRoot ):
        return 1
    normcase = 0 # XML default. May result in false negatives in some cases
    if hasattr(node.GroveRoot, "RefSyntax"):
        syn = node.GroveRoot.RefSyntax
        if hasattr(syn, "SubstGeneralNames"):
            normcase = node.GroveRoot.RefSyntax.SubstGeneralNames
    if pattern and ( pattern.__class__.__name__ == "DocumentNodeTest" ):
        if node.ClassName == "Element" and\
           __normalizeName( node.Gi, normcase ) == \
           __normalizeName( node.PrincipalTreeRoot.Gi, normcase ):
            return 1
        if ( node.ClassName != "Element" ):
            pTreeRoot = node.PrincipalTreeRoot
            if pTreeRoot and ( node.ClassName == pTreeRoot.ClassName ):
                return 1
            if ( node.ClassName == node.GroveRoot.ClassName ):
                return 1
    if pattern and ( pattern.__class__.__name__ == "NodeNameTest" ):
        if node.ClassName == "Element" and\
           __normalizeName( pattern._nodeName, normcase ) == \
           __normalizeName( node.Gi, normcase ):
            return 1
        elif node.ClassName == "AttributeAssignment" and\
             __normalizeName( node.Name, normcase ) == \
             __normalizeName( pattern._nodeName, normcase ):
            return 1
        elif type( node ) == types.StringType and \
             hasattr( context.node, node ):
            return 1
        if pattern._nodeName == node.ClassName:
            return 1
    return 0

def __normalizeName( name, normcase = 0 ):
    """
    Returns name converted to lowercase if normcase is 1.
    Returns name unchanged otherwise.
    """
    if normcase == 1:
        return string.lower( name )
    return name
The getStringValue() function returns the string value of a node. In SGML groves, character data content is not held as a string as it is in the DOM but as a sequence of character nodes. Likewise, tokenized attribute values are held as a sequence of token nodes, not as a character string (but character data attribute values are held as strings). The data() method of grove nodes returns the string value of the “data” of the node, if any. In a grove's property set, you can designate one property to be the “content” of the node. Given such a property, if you ask the node for its data, you will get the concatenation of the data values of all of the nodes in the content property (if the property is nodal) or simply the value of the data property (if it is a primitive value). Likewise, if you ask the node for its children, you will get the value of the content property if it is nodal (if it is not nodal, then the node has no children). This bit of indirection allows different property sets to use whatever name they want for properties that are semantically the content of the node, instead of requiring them to be named “content” or “children”. Also, the grove specification is defined only in terms of named properties, not methods (such as childNodes, as in the DOM). However, the fact that a property set can designate properties as content properties implies the need for the methods data() and children(), which GroveMinder provides.
def getStringValue( node, attributeName = None ):
    """
    Returns the string value data of the given node.
    """
    text = ''
    if type( node ) == types.StringType:
        text = node
    elif node and attributeName:
        text = getattr( node, attributeName )
    elif type( node ) == types.ListType:
        for obj in node:
            text = text + getStringValue( obj, attributeName )
    elif node and ( ( not attributeName ) or ( attributeName == '' ) ):
        text = node.data()
    return text
The getNodeContent() function returns the appropriate “content” value based on the node type. Here grove node processing is complicated by the fact that the DOM API distinguishes between node lists and strings where the SGML grove does not. In particular, character data content in the DOM is represented by text nodes, whereas in groves, character data content is represented as node lists of character data nodes. The GroveMinder API provides an optimization class, CharData (as opposed to the standard-defined DataChar node, which represents a single character), that acts more like DOM text nodes, that is, a single object that holds a single string. Note that the SGML grove design decision to use individual nodes for each character was driven by the need to address individual nodes for linking and styling purposes. The GroveMinder approach appears to be a reasonable compromise that does not eliminate the ability to treat strings as node lists of characters but provides the convenience that the DOM provides and that is all that most processing applications need. In any case, a grove implementation need not literally produce nodes for characters until they are specifically asked for.
The special case for the PseudoElement node class, which is used in SGML groves, optimizes the default processing that would otherwise occur. PseudoElement nodes are directly analogous to DOM Text nodes in that they always represent contiguous strings of data characters in element content, but unlike Text nodes, their content is a node list of DataChar nodes. PseudoElement nodes are specific to the SGML property set and are not a generic node type. However, in this case we can shortcut the processing of each individual node by calling the data() method directly, providing a significant performance improvement over the default behavior.
def getNodeContent( node ):
    """
    Returns the children or data for the input node.
    """
    if node == node.GroveRoot and node.PrincipalTreeRoot:
        node = node.PrincipalTreeRoot
    # NOTE: We have to special-case PseudoElement because the default behavior
    # would cause us to iterate over a bunch of datachar nodes when we can
    # return the data() directly.
    if node.ClassName == "PseudoElement":
        return node.data()
    if node.DataPropertyName:
        return node.data()
    if node.ChildrenPropertyName:
        return node.children()
    return ""
The hasPIs() and getPIs() functions are helpers that first determine if a node might have processing instructions and then gather them up for easy processing.
def hasPIs( node ):
    """
    Returns 1 if the node is an SgmlDocument, or 0 if not.
    """
    if node and node.ClassName == 'SgmlDocument':
        return 1
    return 0

def getPIs( node ):
    """
    Returns the list of PI's for the given node.
    Returns an empty list if there are none.
    """
    piList = []
    if node:
        #groveBuilder = GroveMinder.makeGroveBuilder( "SGML" )
        #grovePlan = groveBuilder.systemMaximumGrovePlan()
        #node.withGrovePlan( grovePlan )
        for child in node.Prolog:
            if child.ClassName == "Pi":
                piList.append(child)
    return piList
The hasAttributes() and getAttributes() functions simply determine if a node has attributes and, if so, return them.
def hasAttributes( node ):
    """
    Returns 1 if the given node has an Attributes attribute, or 0 otherwise.
    """
    if hasattr( node, "Attributes" ):
        return 1
    return 0

def getAttributes( node ):
    """
    Returns the list of the given node's attributes.
    Returns an empty list if there are no attributes.
    """
    attList = []
    if hasattr( node, "Attributes" ):
        attList = node.Attributes
    return attList
The getChildren() function emulates the childNodes property of the DOM. If the ChildrenPropertyName property evaluates to true (the value is actually the name of the property), then in the GroveMinder API there will be a children() method that returns the value of the property designated in the property set as the content property.
def getChildren( node ):
    """
    Returns a list of the children of the given node.
    Returns an empty list if there are no children.
    """
    if node.ClassName == "SgmlDocument" and node.DocumentElement:
        node = node.DocumentElement
    if node.ChildrenPropertyName:
        return node.children()
    else:
        return []
The soi() function (storage object identifier) is a brute-force parser for turning formal system identifiers into normal file path strings. James Clark's SP parser, on which GroveMinder is based, normalizes all system identifiers into formal system identifiers. Formal system identifiers are part of the SGML Extended Facilities, defined in Annex A of ISO/IEC 10744:1997. There's probably a more efficient or compact way to write this parser, but we haven't had a need to optimize it. This code was just lying about, so we used it as is.
def soi(fsistr):
    """Given an FSI string in SP/GroveMinder form, return the SOI (filename) part"""
    ncro = None # Numeric character reference open. Will be set in start tag.
    inncr = 0
    intag = 0
    ingi = 0
    inattspec = 0
    inattname = 0 # Initialized so the flag is always defined before it is tested
    inattval = 0
    lit = ""      # Literal delimiter that opened the current attribute value
    gi = ""
    attname = ""
    attval = ""
    ncr = ""
    rsoi = ""
    for i in range(0, len(fsistr)):
        c = fsistr[i]
        if c == "<":
            intag = 1
            ingi = 1
        elif c == ">":
            intag = 0
            ingi = 0
        elif c == ncro:
            if not intag:
                inncr = 1
                ncr = ""
            else:
                rsoi = rsoi + c
        elif c == "'":
            if intag:
                if inattspec:
                    if not inattval:
                        inattval = 1
                        lit = c
                    else: # Must be in attval
                        if lit == c:
                            inattval = 0
                            inattspec = 0
                            if attname == "SMCRD":
                                ncro = attval
                        else:
                            attval = attval + c
                else:
                    errmsg(__name__, "W",
                           "Lita (') found where not allowed in tag in FSI at character %d in FSI '%s'" %
                           (i, fsistr))
            else:
                rsoi = rsoi + c
        elif c == chr(34): # Lit: the double quote character
            if intag:
                if inattspec:
                    if not inattval:
                        inattval = 1
                        lit = c
                    else: # Must be in attval
                        if lit == c:
                            inattval = 0
                            inattspec = 0
                        else:
                            attval = attval + c
                else:
                    debug("Lit %s found where not expected in tag in FSI at character %d in FSI '%s'" %
                          (chr(34), i, fsistr))
            else:
                rsoi = rsoi + c
        elif c == "=":
            if intag:
                if inattval:
                    attval = attval + c
                else:
                    if inattspec:
                        inattname = 0
                    else:
                        debug("Value indicator (=) found where not expected in tag in FSI at character %d in FSI '%s'" %
                              (i, fsistr))
            else:
                rsoi = rsoi + c
        elif c == " ":
            if intag:
                ingi = 0
                if inattspec:
                    if inattname:
                        inattname = 0
                    if inattval:
                        attval = attval + c
            else:
                rsoi = rsoi + c
        else:
            if inncr:
                if c == ";":
                    c = chr(int(ncr))
                    ncr = ""
                    inncr = 0
                else:
                    ncr = ncr + c
                    c = ""
            if intag:
                if inattspec:
                    if inattval:
                        attval = attval + c
                    if not inattval and not inattname:
                        inattval = 1
                        attval = c
                    if inattname:
                        attname = attname + c
                else: # must not be in attspec
                    if ingi:
                        gi = gi + c
                    if (not ingi) and (not inattspec):
                        inattspec = 1
                        inattname = 1
                        attname = c
            else: # Must be in rsoi
                rsoi = rsoi + c
    return rsoi
The filename2url() function just serves to turn a filename (such as that returned by the soi() function) into a URL, as required by certain parts of the XSLT processor.
def filename2url(filename):
    """
    Given a filename to a local file, returns the equivalent
    'file:' URL.
    This function ensures that filenames are consistently
    converted to URLs as there seems to be some inconsistency
    in how the different URL-related packages do this.
    """
    # Slicing avoids an IndexError on one-character filenames.
    if filename[1:2] == ":":
        filename = "%s|%s" % (filename[0], filename[2:])
    # >= 0 so that a backslash in the first position is also converted.
    if string.find(filename, "\\") >= 0:
        filename = string.replace(filename, "\\", "/")
    return "file:/" + filename
It should not be surprising that the conceptual mapping of XSL and XPath to groves is fairly straightforward. The DOM and groves are both based on the same basic idea of nodes with properties. As one would expect, the DOM and SGML groves have very similar data structures. The challenges are mostly in the handling of strings, which are pre-optimized in the DOM but left to implementations to optimize in groves. In addition, the greater generality of the grove approach, coupled with SGML's larger set of choices for things like case normalization, adds some complexity to the mapping, but not much.
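The string handling described above comes down to one indirection: the property set names a content property, and data() either concatenates the data of the nodes held there or returns a primitive value as is. The following Python 3 sketch models just that behavior under simplifying assumptions. CharNode, PropertyNode, and the property names Content and Value are illustrative and are not the GroveMinder API.

```python
class CharNode:
    """Stand-in for a node that represents a single data character."""

    def __init__(self, char):
        self.char = char

    def data(self):
        return self.char


class PropertyNode:
    """Stand-in node whose property set designates one property as its content."""

    def __init__(self, content_property, **properties):
        self.content_property = content_property  # name of the designated property
        self.properties = properties

    def children(self):
        """Return the content property's nodes, or [] if it is not nodal."""
        value = self.properties.get(self.content_property)
        return list(value) if isinstance(value, (list, tuple)) else []

    def data(self):
        """Concatenate child data for nodal content, else return the primitive value."""
        value = self.properties.get(self.content_property)
        if isinstance(value, (list, tuple)):
            return "".join(child.data() for child in value)
        return "" if value is None else str(value)


para = PropertyNode("Content", Content=[CharNode(c) for c in "grove"])
attr = PropertyNode("Value", Value="cdata value")
print(para.data(), len(para.children()), attr.data(), attr.children())
```

A real implementation would avoid materializing one object per character until a caller asks for individual nodes, which is the optimization the DOM builds in and groves leave to the implementation.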
Even with these obvious similarities, we expected the implementation task to be more difficult than it was. Our initial assumption was that we would have to wrap a DOM API around the grove API in order to plug our objects into the existing DOM-based processing framework. However, we discovered that Python's dynamic typing plus a few well-placed redirections allowed us to use grove nodes directly. The GroveUtility module provides as much DOM API mapping as we needed, being nothing more than convenience functions that concentrate the details of accessing grove-based data in a DOM-like way.
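As a hedged illustration of why no DOM wrapper was needed, the sketch below runs the same calling code over a real DOM tree (from the standard library's xml.dom.minidom) and over a grove-like stub, with two small helper functions absorbing the API differences. It is written for Python 3. GroveElementStub and the helper names are assumptions made for this example; only the ClassName and Gi properties and the children() method are taken from the listings above.

```python
from xml.dom import minidom


class GroveElementStub:
    """Hypothetical grove-like element: capitalized properties plus children()."""

    def __init__(self, gi, children=()):
        self.ClassName = "Element"
        self.Gi = gi
        self._children = list(children)

    def children(self):
        return self._children


def is_grove_node(node):
    # Assumption: grove nodes are recognized by their intrinsic ClassName property.
    return hasattr(node, "ClassName")


def get_children(node):
    """Return child nodes through whichever API the node supports."""
    return node.children() if is_grove_node(node) else list(node.childNodes)


def get_name(node):
    """Return the element name through whichever API the node supports."""
    return node.Gi if is_grove_node(node) else node.nodeName


def child_names(node):
    """Calling code that never needs to know which kind of node it was given."""
    return [get_name(child) for child in get_children(node)]


dom_root = minidom.parseString("<doc><title/><para/></doc>").documentElement
grove_root = GroveElementStub("doc", [GroveElementStub("title"), GroveElementStub("para")])
print(child_names(dom_root), child_names(grove_root))
```

Because nothing in child_names() declares a node type, the two object families never have to share a class hierarchy.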
Implementing this same functionality in a statically typed language such as Java would require more work to map the grove API to the DOM API or significant reworking of the XSL implementation's class hierarchy to allow grove nodes to be used with DOM nodes. As our performance requirements increase, we will likely be forced to move to a different implementation language, probably C++. Another alternative would be to implement our own grove system in Java (GroveMinder has no Java binding), at which point we could have the grove implementation emulate the DOM API directly.
The implementation of the XSL-to-grove mapping gave us the first part of what we needed: the ability to apply XSL style sheets to arbitrary groves regardless of their data type (SGML, XML, Word, etc.). However, we still needed to be able to apply XSL style sheets to entire hyperdocuments, not just single documents. For example, this would enable style sheets that act on compound documents composed of elements (or other node types) used by reference (transcluded) from many individual documents.
Having bound XSL processing to generic groves, the next challenge was to expose the hyperlinking information provided by the hyperdocument manager component of the Bonnell system so that style sheets can act on it. In particular, the following actions needed to be enabled:
Given this set of functions it would be possible to produce whatever output result is desired based on the linking properties of nodes.
The hyperdocument provided by the Bonnell hyperdocument manager is an abstract hyperdocument that conforms to the data model shown in Figure 21. This data model is an abstraction over most reasonable ways to express hyperdocuments, including XLink, HyTime, HTML, proprietary or purpose-built hyperdocuments, and information systems that can be viewed as hyperdocuments (e.g., Microsoft Project). This data model is exposed through an API that provides a number of convenience functions for interrogating the hyperdocument, such as “getAllTraversalsFromNode()”. The intent with this API is to provide generic hyperdocument access to business logic such that the business logic is protected from the details of hyperdocument storage, syntactic representation, and management. This API also enables the direct programmatic creation of hyperdocuments.
Figure 21 is a UML diagram that reflects the fundamental hyperdocument data model. Each box represents a type or class. The lines represent relations between the types. The numbers and stars indicate the repeatability of the values. Reading from the upper left, a hyperdocument consists of zero or more hyperlinks (the black diamond represents containment or ownership). A hyperlink has two or more anchors as well as exactly one link type. An anchor has a role, which is defined within the context of a link type (and must be unique within the link type). An anchor aggregates (open diamond) zero or more members, which are “information objects”. At this level of abstraction, an information object can be anything you can point to (i.e., a “resource” in XLink terms). In the Bonnell implementation, information objects are grove nodes, following the HyTime model.
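The cardinalities just described can be written down as a few Python 3 dataclasses. This is only a sketch of the data model as the text states it, not the Bonnell API. The class and field names are assumptions, and the check that each anchor's role is declared by the link type is one possible reading of “defined within the context of a link type”.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class LinkType:
    name: str
    roles: List[str]

    def __post_init__(self):
        if len(set(self.roles)) != len(self.roles):
            raise ValueError("anchor roles must be unique within a link type")


@dataclass
class Anchor:
    role: str
    # Zero or more information objects (grove nodes in the Bonnell implementation).
    members: List[Any] = field(default_factory=list)


@dataclass
class Hyperlink:
    link_type: LinkType  # exactly one link type
    anchors: List[Anchor]

    def __post_init__(self):
        if len(self.anchors) < 2:
            raise ValueError("a hyperlink has two or more anchors")
        for anchor in self.anchors:
            if anchor.role not in self.link_type.roles:  # assumed constraint
                raise ValueError("role not defined by link type: %s" % anchor.role)


@dataclass
class Hyperdocument:
    links: List[Hyperlink] = field(default_factory=list)  # zero or more hyperlinks


xref = LinkType("xref", ["source", "target"])
link = Hyperlink(xref, [Anchor("source", ["node-1"]), Anchor("target", ["node-2", "node-3"])])
hyperdoc = Hyperdocument([link])
print(len(hyperdoc.links), [a.role for a in link.anchors])
```

Members are typed as Any because, at this level of abstraction, an information object can be anything that can be pointed to.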
A hyperdocument is constructed from the data in a set of one or more documents, called the “bounded object set”, emphasizing the fact that the set of documents involved is finite. The use of a bounded set of source documents is a prerequisite for processing unconstrained extended links. This is for the simple reason that you cannot completely enable traversal from a node until you know what anchors it is a member of. Thus, in the general case, you must be able to fully resolve all the links before fully-functional use of the hyperdocument can be made. The links can be fully resolved in a useful amount of time only if the set of documents involved is finite and invariant over the course of the processing.
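To illustrate why every link must be resolved first, the following sketch builds a reverse index from nodes to the anchors that contain them. It assumes, purely for illustration, that each resolved link is a dictionary with a `"type"` and an `"anchors"` mapping from role name to member nodes; the function names are hypothetical and not part of the Bonnell API.

```python
from collections import defaultdict


def build_membership_index(hyperlinks):
    """Map each node to the (link type, role) pairs of the anchors it is in.

    The index is complete only after every link in the bounded object set
    has been visited, which is why the set must be finite.
    """
    index = defaultdict(list)
    for link in hyperlinks:
        for role, members in link["anchors"].items():
            for node in members:
                index[node].append((link["type"], role))
    return index


def traversal_targets(node, hyperlinks):
    """Return the members of all other anchors of links that include node."""
    targets = []
    for link in hyperlinks:
        anchors = link["anchors"]
        own_roles = [role for role, members in anchors.items() if node in members]
        if not own_roles:
            continue  # node is not a member of any anchor of this link
        for role, members in anchors.items():
            if role not in own_roles:
                targets.extend(members)
    return targets


links = [
    {"type": "see-also",
     "anchors": {"from": ["doc1#p1"], "to": ["doc2#s3"]}},
]
index = build_membership_index(links)
back = traversal_targets("doc2#s3", links)
```

In this example the node `doc2#s3` never declares a link itself, yet once the whole set has been scanned it can be traversed back to `doc1#p1`.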
Note that this is a fundamentally different model from that used for the Web, where it is taken as a given that the set of documents involved is essentially unbounded. Thus, HTML links are one-way simple links and there is no general expectation that one can navigate from the target of an HTML link to the start of the link without having first navigated from the link to the target. By contrast, the hyperlinking model used by Bonnell reflects the requirements of more-or-less closed information systems, such as technical documentation authoring and delivery systems, where the value of extended linking outweighs the cost of the systems that enable it. In particular, there are problems of managing hyperlinks among components of versioned documents that cannot be solved satisfactorily without this type of closed-system linking.
Note the subtle shift in the terminology of links and anchors from that normally used in discussing HTML. Because a given anchor may contain many nodes, it is not sufficient for “anchors” to mean “elements that are anchors of hyperlinks”. Rather, a given element (or arbitrary node) may be a member of any number of anchors. Because both HyTime and XLink support extended linking, any node may be an anchor member regardless of whether or not it was originally intended for linking. This also means that there is not always a direct relation between the syntactic representation of a hyperlink and the node-based representation of it.
For example, an XLink simple link is a single element that defines a two-anchor link. When translated into the abstract hyperdocument data model, the simple link element becomes a node that is the only member of one anchor of the simple link. The target resources of the “href” attribute are the members of the other anchor. When an XPointer is resolved it may end up addressing multiple nodes at the grove or DOM level. Because the simple link element defines a hyperlink, it also results in the creation of a Hyperlink object in the abstract hyperdocument, which maintains a pointer back to the original simple link element as its “data source”.
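The translation just described might look roughly like the following. This is a sketch under stated assumptions: the link element is a plain dictionary standing in for a grove node, `resolve_href` is a hypothetical callable that performs address resolution, and the role names `"source"` and `"target"` and the default type name `"simple"` are invented for the example.

```python
def simple_link_to_hyperlink(link_element, resolve_href):
    """Translate one simple link element into an abstract two-anchor link.

    link_element: mapping with at least an "href" entry.
    resolve_href: callable returning the nodes that the href addresses;
                  a single pointer may resolve to several nodes.
    """
    targets = list(resolve_href(link_element["href"]))
    return {
        # Hypothetical: the link type name is read from the element if given.
        "type": link_element.get("type_name", "simple"),
        # Pointer back to the element that defined the link.
        "data_source": link_element,
        "anchors": {
            # The link element itself is the sole member of one anchor.
            "source": [link_element],
            # Every resolved node becomes a member of the other anchor.
            "target": targets,
        },
    }


element = {"name": "xref", "href": "manual.xml#xpointer(//step)"}
resolved = {"manual.xml#xpointer(//step)": ["step-1", "step-2", "step-3"]}
link = simple_link_to_hyperlink(element, lambda href: resolved.get(href, []))
```

Here one syntactic element yields a hyperlink whose target anchor has three members, which shows why the node-based representation need not mirror the markup one-to-one.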
One of the motivating reasons for using XML is that XML markup is descriptive: tags and attributes add descriptive information to otherwise unstructured and undifferentiated text. This descriptive markup enables declarative style sheets, such as those written in XSL, to produce a variety of output forms from a single input source. This is XML motherhood.
Hyperlinks as provided by the above hyperlink model also provide descriptive information that can be used by style sheets to further qualify presentation, in addition to the value of simply having different information components connected together. That is, hyperlinks can do much more than simply enable navigation or aggregation: they can add a whole other layer of semantic characterization to information sets.
Most importantly, hyperlinks can impose this additional layer of descriptive information onto existing information unilaterally, leaving the original data untouched. Existing bodies of data can be annotated using hyperlinks to whatever additional degree of detail is desired. In extreme cases, links can even be used to impose structure onto data to which markup cannot be added directly, so-called “standoff markup”.
Hyperlinks provide two main ways to further classify information components: anchor roles and link types. Hyperlink anchors have named anchor roles that are unique within a given hyperlink. Hyperlinks have hyperlink types. The anchor role names serve to further classify their member nodes and provide an important selector for qualifying presentation style. For example, a given element type might have different presentation characteristics depending on what anchor role it is a member of or even whether or not it is a member of an anchor. Anchor roles can play the same role as element types. For example, an otherwise generic element type like “paragraph” might have different presentation styles based on what anchors it is a member of, as if it were a different element type.
Link types provide additional qualification information that may affect presentation style or that may need to be reflected in the rendered output (for example, in a dialog or intermediate page that enables traversal to multiple targets from a single anchor member).
Thus, anchor roles and link types provide two additional possible dimensions of qualification in addition to the normal element in context.
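As a rough illustration of these extra dimensions, the sketch below chooses a presentation style from an element type together with the anchor roles and link types the element participates in. The lookup table, the key layout, and the style names are all hypothetical; a real style sheet would express the same selection with templates and extension functions.

```python
def select_style(element_type, memberships, styles, default="plain"):
    """Pick a style for an element given its anchor memberships.

    memberships: list of (link_type, role) pairs for the element.
    styles: dict keyed by (element_type, role, link_type); None in the
            role or link_type position acts as a wildcard.
    """
    for link_type, role in memberships:
        # Most specific first: role within a particular link type,
        # then the role regardless of link type.
        for key in ((element_type, role, link_type),
                    (element_type, role, None)):
            if key in styles:
                return styles[key]
    # Fall back to the ordinary element-type style.
    return styles.get((element_type, None, None), default)


styles = {
    ("paragraph", None, None): "body",
    ("paragraph", "definition", None): "boxed",
    ("paragraph", "definition", "glossary"): "glossary-entry",
}
chosen = select_style("paragraph", [("glossary", "definition")], styles)
```

The same generic “paragraph” receives a different style depending on its role and link type, as if it were a different element type.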
The implementation consisted of XSLT extension elements and a set of XPath extension functions that operate directly on hyperdocuments. These were organized into the following Python modules:
These four packages represent about 400 total lines of active code, much of which is simply implementing the 4Suite-defined API for extension elements. The actual business logic reflected in this code is pretty simple. All of the hard work of interacting with and getting access to the information in the hyperdocument is done by the underlying hyperdocument management system exposed through the Bonnell hyperdocument API. Thus these extensions represent a very thin layer integrating the underlying information management system to the XSLT processor.
In addition to these new modules, we also extended the built-in 4Suite “context” object to take a hyperdocument object so that our extensions can get access to the hyperdocument. This addition can be seen in 13. This addition of the Bonnell hyperdocument object to the XSLT context object is the integration point that the rest of the hyperlinking extensions depend on. This form of extension could probably be further generalized in the 4Suite implementation but we did not pursue any such generalizations. (Although in the writing of this paper we started experimenting with a “helper” object that we attach to the Processor object at run time, with an eye towards refactoring away from direct modification of the Processor constructor.)
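The integration point can be pictured with stand-in classes. The sketch below does not use or reproduce 4Suite's actual context class: `LinkAwareContext`, `StubHyperdocument`, and `traversal_count` are hypothetical names, and the signature assumed for `getAllTraversalsFromNode` (one node in, a list out) is a guess made for the example.

```python
class StubHyperdocument:
    """Stand-in for the hyperdocument object carried by the context."""

    def __init__(self, traversals):
        self._traversals = traversals  # node -> list of target nodes

    def getAllTraversalsFromNode(self, node):
        # Assumed signature; only the method name appears in the text.
        return list(self._traversals.get(node, []))


class LinkAwareContext:
    """Hypothetical processing context extended with a hyperdocument."""

    def __init__(self, node, hyperdocument=None):
        self.node = node                    # current context node
        self.hyperdocument = hyperdocument  # the added integration point


def traversal_count(context):
    """Extension-function sketch: count traversals from the context node."""
    if context.hyperdocument is None:
        return 0
    return len(context.hyperdocument.getAllTraversalsFromNode(context.node))


hyperdoc = StubHyperdocument({"p1": ["s3", "s4"]})
count = traversal_count(LinkAwareContext("p1", hyperdoc))
```

Because every extension receives the context, attaching the hyperdocument there gives all of them access without any global state.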
The techniques used here could be similarly applied to other information systems with similar ease (assuming a comparable API is provided by the underlying system).
Note that the hyperdocument provided by the hyperdocument manager is not a grove. One implementation approach would be to wrap a grove API around the hyperdocument nodes. However, this turned out to be unnecessary. The 4Suite implementation can process any Python object. All we had to do was extend the “getProperty” processing to interrogate the properties of any Python object (including methods that take no arguments). Given this very simple extension, it becomes possible to apply XSL style sheets to any set of Python objects, not just grove nodes. Thus, a style sheet can apply templates and XPath expressions to the abstract hyperdocument objects as though they were grove nodes. Given this, all things are possible through the style sheet. The only other requirement may be to provide XPath extension functions that perform complex queries (e.g., GetTraversalTargets) or additional parameterless object methods that can be interrogated through XPath property checks.
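The kind of property lookup described above can be approximated as follows. This is a minimal sketch, not the 4Suite or Bonnell code: the function name `get_property` and the `DemoAnchor` class are assumptions, and the rule for deciding that a method "takes no arguments" (no required parameters according to `inspect.signature`) is one reasonable interpretation.

```python
import inspect

_MISSING = object()


def get_property(obj, name, default=None):
    """Return obj's attribute, calling it if it is a parameterless method."""
    value = getattr(obj, name, _MISSING)
    if value is _MISSING:
        return default
    if not callable(value):
        return value
    try:
        signature = inspect.signature(value)
    except (TypeError, ValueError):
        return default  # signature unavailable; treat as not a property
    required = [
        p for p in signature.parameters.values()
        if p.default is inspect.Parameter.empty
        and p.kind in (inspect.Parameter.POSITIONAL_ONLY,
                       inspect.Parameter.POSITIONAL_OR_KEYWORD,
                       inspect.Parameter.KEYWORD_ONLY)
    ]
    # Methods that need arguments cannot be read as simple properties.
    return default if required else value()


class DemoAnchor:
    """Hypothetical business object with a data attribute and two methods."""

    def __init__(self):
        self.role = "target"
        self._members = ["n1", "n2"]

    def getMembers(self):
        return list(self._members)

    def getMember(self, position):
        return self._members[position]


anchor = DemoAnchor()
```

With this in place, `get_property(anchor, "role")` and `get_property(anchor, "getMembers")` both behave like node property reads, while `getMember`, which needs an argument, is left to an explicit extension function.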
This package provides the following five XPath extension functions: