Abstract
We tend to look the other way when content gets replicated, thinking nothing of duplicating assets in our content management system in the name of content re-use. And yet this same practice would be ridiculed in software engineering circles, where the technique of refactoring was developed precisely to address gratuitous duplication [Fowler, 1999].
This paper discusses methods of content re-use that lead to fewer cases of content duplication. The methods are presented using the technical reference documentation for a function library as a case study. The case study demonstrates that a careful application of these methods requires a certain amount of user re-education, but results in less content duplication, thereby side-stepping many of the issues associated with content anomalies. This can lead to a more consistent documentation set and a reduced workflow, and can leave the editorial staff free to do editorial work instead of enforcing either content or presentational continuity.
Recently, there has been some interest in the industry in maximizing the return on investment in content, be it newly created or legacy (for example, see [BELLE]). Content re-use and content single-sourcing are two approaches which, when combined, can greatly increase data synchronicity, and improve overall work flow and throughput.
However, content re-use can be pushed much further than the industry currently seems willing to push it. This reluctance is shortsighted, since it limits content re-use to the adaptation of content to different presentation media. In fact, consideration should be given to the adaptation of content for different contexts, be they presentational or semantic. This technique is referred to as content re-purposing: literally giving the content new purpose.
The technical reference documentation of a programming library motivates two aspects of fundamental importance to content re-purposing: single-sourcing through content normalization, and content dependences and their resolution. These aspects can reduce significantly the amount of content duplication that occurs in this sort of authoring environment.
Content re-use is the use of document components in multiple instances: for example, XML source documents could be processed in many different ways to generate output for different media. (Document components, or simply components, are portions of a complete document. For the purposes at hand, a complete document is itself another component.) How the re-use is to occur is not specified; only that it is to occur. File system copies, copy and paste, XML entities, and brute-force re-entry are all examples of content re-use, with some of them being more effective and/or maintainable than others.
As its name implies, a content re-use strategy facilitates the re-use of content. This is an aid to content creators: if content can be re-used, it does not have to be regenerated from scratch. However, a content re-use strategy can only help reduce duplication: it cannot eliminate it. There is nothing inherent in the system to prohibit a content creator from reproducing content.
Single-sourcing, on the other hand, attempts to ensure that every piece of content is obtained from one and only one source. The greatest strength of a single-sourcing system is the reduction of content anomalies. Essentially, a pure single-sourcing system cannot suffer from
update anomalies,
deletion anomalies, and
insertion anomalies.
These are in complete analogy with the standard data anomalies from relational database theory [Korth and Silberschatz, 1991]. Update anomalies occur when changes in one component leave identical content in another component unsynchronized. Deletion anomalies occur when the removal of one or more components results in a net information loss, despite accounting for the content of the removed components; the most common way this can occur is when components are orphaned, leaving no reference to them despite their legitimate existence. Insertion anomalies occur when it is impossible to insert content due to the choice of structure for the components.
Content anomalies result from a combination of content replication and improper content normalization. Single-sourcing helps avoid content anomalies by eliminating content replication.
In a pure single-sourcing environment, the content is never replicated. Instead, components are converted from one structure to another to yield the desired output, as the need arises. In this environment, it is the content transformations that are maintained, rather than the results of these transformations: content management becomes the management of both transformations and components. For instance, if an XSLT stylesheet can extract the desired information from a component (i.e., if it can transform a component from a source type to the desired target type), then it is the stylesheet that is stored in the content management system, rather than the output of its application to one or many components. Once content is properly stored in a single-sourcing environment, it must all necessarily be synchronous, since the data appears only once.
Single-sourcing is not a property of a content management system, but rather a goal to be achieved via a content management system. It is a result of the proper and consistent application of the single-sourcing methodology. These methods include designing a correctly normalized representation of the content from the outset, as well as enforcing it once the system is in place. Just as in database theory, the content management system can only go so far in enforcing single-sourcing: content has semantic value, and this semantic value is beyond the grasp of any content management system. In the case of a database system, using different spellings to represent identical pieces of information could circumvent a relation's uniqueness criteria: the database system is not capable of identifying different spellings as equivalent, unless it is augmented with application-specific rules (in this case, a dictionary) that supply it with additional constraints. Similarly, the content management system is incapable of comparing content based on its semantic value and drawing any conclusions. It is the responsibility of the content creators to ensure that content is not surreptitiously duplicated.
To illustrate the single-sourcing design methods involved, consider the prototypical example of the reference documentation for a programming library or API. In a topic-based authoring environment, such a documentation set would contain a collection of short document components, each of which describes, say, a single function. For example, the documentation for the tcp-accept-connection external function in OmniMark's OMTCP external function library might resemble the following (considerably abbreviated) document:
Function: tcp-accept-connection
Available in: Professional, Enterprise
Library: omtcp - TCP/IP client and server support
Return type: tcp-connection
Include file: omtcp.xin
Usage note:
You must include the following line at the beginning of your script:
include "omtcp.xin"
Purpose:
Use tcp-accept-connection in a server program to accept incoming
client-initiated service requests.
Traditionally, the XML source for this document might look something like
<?xml version="1.0" encoding="iso-8859-1"?>
<function-topic>
<name> tcp-accept-connection </name>
<packaged-with>
<level> Professional </level>
<level> Enterprise </level>
</packaged-with>
<library> omtcp - TCP/IP client and server support </library>
<return-type> tcp-connection </return-type>
<include-file> omtcp.xin </include-file>
<usage-note>
<p> You must include the following line at the beginning of your
script: </p>
<codeblock>
include "omtcp.xin"
</codeblock>
</usage-note>
<purpose>
<p> Use tcp-accept-connection in a server program to accept
incoming client-initiated service requests. </p>
</purpose>
</function-topic>
Much of the information present in this document component is not information about the tcp-accept-connection function itself, but rather information about the OMTCP library. In a single-sourcing environment, these common pieces of data are factored out into a new component, leaving behind only the data that truly describes tcp-accept-connection: for the component describing the function, there remains
<function>
<name> tcp-accept-connection </name>
<contained-in> omtcp </contained-in>
<return-type> tcp-connection </return-type>
<purpose>
<p> Use tcp-accept-connection in a server program to accept
incoming client-initiated service requests. </p>
</purpose>
</function>
while the new component supplies the content that is common:
<library>
<name> omtcp </name>
<packaged-with>
<level> Professional </level>
<level> Enterprise </level>
</packaged-with>
<gloss> TCP/IP client and server support </gloss>
<include-file> omtcp.xin </include-file>
<purpose>
<p> The omTCP library is a set of functions that allow TCP
connections to and from other processes and machines. </p>
</purpose>
</library>
This component, referred to as a library description, does not reference the components of the functions that belong to the library. Rather, these are specified via the metadata: in the context of processing the content, the value of the node at XPath /function/contained-in is not actually content, but rather metadata about the function named by /function/name, indicating which library the function is contained in. In fact, this node plays a dual role of content and metadata: its value appears in the field labeled Library: in the formatted document. Furthermore, there is nothing in the model prohibiting the use of a sequence of /function/contained-in nodes, which would allow a function to appear in the documentation set of more than one library. On the other hand, the traditional breakdown shown earlier would have required duplicating the entire function description document and making minor modifications.
The single-sourcing process leaves the content repository in a radically different state than it was before: the individual components no longer form complete documents. Rather, the pieces must be combined to form more usable components. This process is known as document synthesis. With the identification of the content and the metadata, it is a simple matter to re-construct the full document component.
The first step in the process is to use the metadata as join fields. This yields a composite component which, for all intents and purposes, consists of the concatenation of all the single-sourced components that are related to each other. (In practice, a subsidiary component, aptly named the process document, is used to control which library documentation is to be processed:
<documentation> <include-library name="omtcp"/> </documentation>
The value of /documentation/include-library/@name is used to determine which library documentation to produce.)
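The join step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system used for the case study: the repository is stood in for by a plain list of XML strings, and the names synthesize and text_of are hypothetical. The composite component is given a documentation root element so that paths such as /documentation/function resolve against it.

```python
import xml.etree.ElementTree as ET

def text_of(element, tag):
    """Return the whitespace-trimmed text of a child element, or None."""
    child = element.find(tag)
    return child.text.strip() if child is not None and child.text else None

def synthesize(process_doc, components):
    """Join components on their metadata to build one composite component.

    process_doc: XML text of the process document.
    components:  list of XML strings (an assumed stand-in for the repository).
    """
    wanted = [inc.get("name")
              for inc in ET.fromstring(process_doc).iter("include-library")]
    parsed = [ET.fromstring(c) for c in components]
    composite = ET.Element("documentation")
    for name in wanted:
        for comp in parsed:
            if comp.tag == "library":
                # A library description joins on its own name.
                keys = [text_of(comp, "name")]
            else:
                # Other components join on each contained-in node.
                keys = [c.text.strip() for c in comp.findall("contained-in") if c.text]
            if name in keys:
                composite.append(comp)
    return composite
```

Because the join reads every contained-in node, a function that lists more than one library would be picked up for each of them without being duplicated in the repository.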
Next in the document synthesis process, the following XSLT stylesheet is used to perform the necessary operations on the resulting composite component:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output encoding="iso-8859-1" indent="yes" omit-xml-declaration="no"/>
<xsl:template match="/">
<xsl:apply-templates select="/documentation/function"/>
</xsl:template>
<xsl:template match="function">
<xsl:element name="function-topic">
<xsl:copy-of select="name"/>
<xsl:copy-of select="//library/packaged-with"/>
<xsl:element name="library">
<xsl:value-of select="//library/name"/>
<xsl:text>-</xsl:text>
<xsl:value-of select="//library/gloss"/>
</xsl:element>
<xsl:copy-of select="return-type"/>
<xsl:copy-of select="//include-file"/>
<xsl:element name="usage-note">
<xsl:element name="p">
<xsl:text> You must include the following line </xsl:text>
<xsl:text> at the beginning of your script: </xsl:text>
</xsl:element>
<xsl:element name="codeblock">
<xsl:text> include "</xsl:text>
<xsl:value-of select="//include-file"/>
<xsl:text>" </xsl:text>
</xsl:element>
</xsl:element>
<xsl:copy-of select="purpose"/>
</xsl:element>
</xsl:template>
</xsl:stylesheet>
Conceptually, this simple transformation is nothing more than a type conversion from an object of type function^n × library to one of type array of function-topic (for a library that contains n functions). This conversion function can be used in any context where an array of components of type function-topic is required, but only one of type function^n × library is available. Data types and conversion functions are discussed in some detail in [Cleaveland, 1986]; their applicability to the domain of content management is the subject of current research.
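To make the type-conversion reading concrete, the same transformation can be written as an ordinary function whose signature is, informally, (function^n × library) to list of function-topic. The Python sketch below mirrors the stylesheet rather than replacing it. The name to_function_topics is hypothetical, and the sketch assumes that every element used in the example components is present and that the composite holds exactly one library.

```python
import copy
import xml.etree.ElementTree as ET

def to_function_topics(composite):
    """Convert a composite (n functions plus one library) into n function-topics."""
    library = composite.find("library")
    lib_name = library.findtext("name").strip()
    gloss = library.findtext("gloss").strip()
    include_file = library.findtext("include-file").strip()
    topics = []
    for function in composite.findall("function"):
        topic = ET.Element("function-topic")
        topic.append(copy.deepcopy(function.find("name")))
        topic.append(copy.deepcopy(library.find("packaged-with")))
        # The library field is rebuilt from the library's name and gloss.
        ET.SubElement(topic, "library").text = f"{lib_name} - {gloss}"
        topic.append(copy.deepcopy(function.find("return-type")))
        topic.append(copy.deepcopy(library.find("include-file")))
        # Boilerplate usage note generated by the conversion, not stored.
        note = ET.SubElement(topic, "usage-note")
        ET.SubElement(note, "p").text = (
            "You must include the following line at the beginning of your script:")
        ET.SubElement(note, "codeblock").text = f'include "{include_file}"'
        topic.append(copy.deepcopy(function.find("purpose")))
        topics.append(topic)
    return topics
```

The function components and the library component are left untouched: only copies are placed in the output, so the conversion can be re-applied whenever the sources change.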
In the previous section, it was argued that the adoption of a single-sourcing policy results in documents that are more stable, due to the elimination of content anomalies. The price of this stability is that the total number of components stored in the content management system will typically grow. In some cases, this growth can be a significant fraction (if not multiple) of the original repository size. In the example of the OMTCP library documentation, the increase in the number of components was actually quite small: 1/n, for a library of n functions. However, there is no reason to expect such a small increase in a more realistic case.
The normalization of the content in the repository introduces a fundamental difficulty in the document synthesis process: there is now a dependence between every function component and its associated library component. The reason is quite simple: each function component is no longer a complete document, but rather depends on the presence of the library component. The treatment of this sort of dependence has been studied in software engineering circles, and is handled very effectively by Unix make-like tools; see [Jørgensen, 1999] for a detailed discussion. From the perspective of content management, this dependence is evident and somewhat trivial.
For the purposes at hand, and for those of content management in general, the dependence that sibling components might have on each other is more interesting, since such dependences are indirect.
Consider the case of the component describing tcp-accept-connection. This function returns a value of type tcp-connection:
<type>
<name> tcp-connection </name>
<contained-in> omtcp </contained-in>
<purpose>
<p> The tcp-connection OMX component represents a connection to a
remote machine using the TCP/IP (Transmission Control/Internet)
protocol. </p>
</purpose>
</type>
Consequently, the final documentation set must contain a component describing this type, or else the documentation will be incomplete. For instance, if linking is used to connect topics within a documentation set, this particular link will be broken unless the component is present. Therefore, there is a dependence between the tcp-accept-connection and tcp-connection components. This dependence is said to be implicit because of the indirection involved: the node /function/return-type in one component generates a dependence on the node /type/name in another component. Another way of looking at it is that the dependence exists without the content creator ever being aware of it. Rather, it falls upon another party to decree that every type appearing in a /function/return-type node should be described by a separate component having a /type/name.
Dependences can be separated into two categories: positive and negative. A component depends positively on another if its presence requires the presence of the second. Similarly, a component depends negatively on another if its presence excludes the presence of the second. These can be referred to as inclusion requirements and exclusion requirements, respectively.
A priori, there is no difficulty in defining content dependences, although the content management system cannot help in the matter: the task of defining dependences falls squarely upon the shoulders of an outside party. The tedium lies in determining whether the selected components are internally compatible with each other, a content management task that can be handled by the content management system. A set of components (assembled via a document synthesis process, for instance) is said to be internally compatible if all of the inclusion and exclusion requirements of the components are satisfied. The process of verifying whether a set of components is internally compatible is referred to as an internal compatibility process.
Inclusion and exclusion requirements can be extracted from components as they are being assembled by the document synthesis process. For instance, the inclusion requirements for the function components of the OMTCP library consist of the components for the return types. These can be collected using
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output encoding="iso-8859-1" indent="no" omit-xml-declaration="yes"/>
<xsl:template match="/">
<xsl:apply-templates select="/function"/>
</xsl:template>
<xsl:template match="function">
<xsl:element name="type">
<xsl:element name="name">
<xsl:value-of select="return-type"/>
</xsl:element>
</xsl:element>
</xsl:template>
</xsl:stylesheet>
For tcp-accept-connection, there results:
<type><name> tcp-connection </name></type>
This is the dependence of tcp-accept-connection on the type tcp-connection that was mentioned earlier. The dependence should be read as "a type whose name is tcp-connection". The use of the indefinite article "a" rather than the definite article "the" is deliberate. Whatever other properties the type might have are irrelevant: the dependence is on the name. Evidently, the component for tcp-connection given earlier satisfies this requirement.
Other implicit dependences (say, on argument types) would be handled similarly, by adding templates to the stylesheet.
A document set can be checked for compatibility using a brute-force algorithm. In essence, inclusion requirements are verified as follows: for each requirement, verify that there exists at least one component that satisfies the requirement. Exclusion requirements are verified in a similar manner: for each component in the document set, verify that there does not exist an exclusion against that component. In both cases, the iteration process can terminate as soon as a component satisfying the requirement is found: only one component is needed to verify the requirement. However, even if the document set has been demonstrated to be internally incompatible, there is some value in continuing the search process, since this might uncover other inconsistencies that can be repaired in the same pass.
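The brute-force check can be sketched as follows. This is a minimal Python illustration, not the OmniMark implementation: the name check_compatibility is hypothetical, and the comparison itself is left to a caller-supplied predicate satisfies(component, requirement). The inner search for an inclusion stops at the first satisfying component, while the outer loops run to completion so that every inconsistency is reported in one pass.

```python
def check_compatibility(components, inclusions, exclusions, satisfies):
    """Return (unmet_inclusions, violated_exclusions) for a document set.

    components, inclusions, exclusions: dicts mapping an identifying key
    to a value (assumed representation).
    satisfies(component, requirement): caller-supplied predicate.
    """
    unmet = []
    for req_key, requirement in inclusions.items():
        # any() stops as soon as one satisfying component is found.
        if not any(satisfies(c, requirement) for c in components.values()):
            unmet.append(req_key)

    violated = []
    for comp_key, component in components.items():
        for exc_key, exclusion in exclusions.items():
            # A component matching an exclusion makes the set incompatible.
            if satisfies(component, exclusion):
                violated.append((comp_key, exc_key))
    return unmet, violated
```

The document set is internally compatible exactly when both returned lists are empty.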
The implementation of requirement verification is particularly simple. What is needed is a way of comparing two XML components, to determine if one contains the other as one or more branches. For the purposes at hand, a simple scheme is sufficient. This simple scheme begins by walking through the first component (using any standard XML parser) and storing the value of the node into an associative array, indexed by the value of the XPath expression that points from the root to the node; if a node has multiple children, the value of the child's position can be used to generate a unique index. The process is repeated for the second component, using a second associative array. Then, if the second array contains the first (i.e., if every item in the first array is present in the second array, equal for both index and item values), then the second component contains the first as one or more branches. An implementation of this scheme is given in the Appendix.
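For readers unfamiliar with OmniMark, the simple scheme can also be sketched in Python. The names encode_tree and contains are hypothetical, and the sketch makes two simplifying assumptions of its own: a node's position is counted among siblings of the same name (as in an XPath positional predicate), and only the text that immediately follows a start tag is recorded, so mixed content is not fully handled.

```python
import xml.etree.ElementTree as ET

def encode_tree(xml_text):
    """Map each text-bearing node's positional path to its trimmed text."""
    table = {}

    def walk(element, path):
        text = (element.text or "").strip()
        if text:
            table[path] = text
        counts = {}
        for child in element:
            # Position among same-named siblings keeps every index unique.
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    root = ET.fromstring(xml_text)
    walk(root, f"/{root.tag}[1]")
    return table

def contains(component, requirement):
    """True if every (path, value) pair of the requirement is in the component."""
    return encode_tree(requirement).items() <= encode_tree(component).items()
```

With this encoding, the requirement "a type whose name is tcp-connection" reduces to a single path and value pair, and containment is an ordinary subset test on the two associative arrays.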
The simple scheme outlined above suffices for the component sets considered herein, but can fail in more general cases. For instance, if the components can contain repeated elements at a node, the component
<a> <b>1</b> <b>2</b> </a>
could be equivalent to that obtained by interchanging the two sub-elements, and yet the above method will conclude that they are not equivalent, since it assumes that the order of sibling nodes is normalized. The simple scheme also fails to account for attribute values. These and other issues can be resolved, however. For instance, the problem of node multiplicities can be handled by eliminating the use of the position value in the XPath, and moving from an associative array to a multi-associative array, which allows for duplicated index values (for instance, see [Musser et al., 2001], section 7.2). Attributes can be handled with minor code complications. Suffice it to say that it is possible to compare two components for containment.
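One way to realize the multi-associative variant is sketched below in Python, with collections.Counter standing in for the multi-associative array. The names encode_bag and contains_unordered are hypothetical. Note the trade-off this sketch accepts: once positions are dropped, the encoding no longer records which repeated parent a given child belongs to, so deeply nested repeated structures can compare as contained when they are not.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def encode_bag(xml_text):
    """Count (path, value) pairs, using paths without sibling positions."""
    bag = Counter()

    def walk(element, path):
        text = (element.text or "").strip()
        if text:
            bag[(path, text)] += 1
        # Attributes are recorded under an XPath-like @name step.
        for name, value in element.attrib.items():
            bag[(f"{path}/@{name}", value)] += 1
        for child in element:
            walk(child, f"{path}/{child.tag}")

    root = ET.fromstring(xml_text)
    walk(root, f"/{root.tag}")
    return bag

def contains_unordered(component, requirement):
    """True if the component has at least as many of each pair as required."""
    need, have = encode_bag(requirement), encode_bag(component)
    return all(have[pair] >= count for pair, count in need.items())
```

Under this encoding the two orderings of the sub-elements of the component above produce the same bag, so each is found to contain the other.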
The strength of the single-sourcing approach is the stability that it brings to the content being managed: documents become less affected by the impact of change. For example, if the development team decides to change the name of the include file for the OMTCP library from omtcp.xin to simply tcp.xin, the technical writers do not have to be concerned with changing every function component, a potentially error-prone task. Rather, a single change to the library description needs to be made. Similarly, boilerplate text, such as the usage note in the example above, can be generated by the conversion function rather than manually entered, thereby guaranteeing a certain level of consistency. As a result, the editor does not have to manually verify that the individual pieces of the final document set have been updated: they necessarily have.
There are aspects of the dependence systems discussed in the previous section that remain to be explored. For instance, it is possible to view the entire document synthesis process as an extension of the internal compatibility process discussed earlier. Rather than simply verifying whether inclusion requirements are satisfied, the system actively seeks to satisfy them by repeatedly adding components to the document set. This procedure is referred to as active compatibility, in contrast to internal compatibility. In fact, the previous section described an initial join on the metadata to select out the candidate components for the composite component: this is nothing more than an instance of active compatibility seeking to satisfy the inclusion requirement(s) of the so-called process document.
This investigation has limited the use of active compatibility to the process document, and then used (passive) internal compatibility to ensure that the resulting set is internally consistent. On the other hand, there is reason to believe that giving the active compatibility process greater leeway would allow for the creation of greater document sets, since it could repair inconsistencies, whereas internal compatibility could only flag them. In such a case, however, issues such as circular dependences, their detection, and their handling, become a primary concern. Also, it remains to be demonstrated that an active compatibility process seeded on a process document could necessarily generate any document set.
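A rough idea of what a more permissive active compatibility process might look like is given by the following Python sketch. It is speculative and goes beyond what was implemented for this investigation: the name actively_complete, the dictionary-based repository, and the two caller-supplied functions are all assumptions made for illustration. A requirement that is already met by a selected component is skipped, which is also what keeps a circular dependence from causing an endless loop.

```python
def actively_complete(seed, repository, requirements_of, satisfies):
    """Grow a document set until each inclusion is met or found unmeetable.

    seed:            keys of the starting components
    repository:      dict mapping key -> component (assumed representation)
    requirements_of: function(component) -> list of inclusion requirements
    satisfies:       function(component, requirement) -> bool
    Returns (selected keys in the order added, requirements nothing satisfies).
    """
    selected = list(seed)
    pending = list(seed)
    unresolved = []
    while pending:
        key = pending.pop(0)
        for requirement in requirements_of(repository[key]):
            if any(satisfies(repository[k], requirement) for k in selected):
                continue  # already met, possibly by a component that depends on us
            match = next((k for k, c in repository.items()
                          if satisfies(c, requirement)), None)
            if match is None:
                unresolved.append(requirement)  # can only be flagged, not repaired
            else:
                selected.append(match)
                pending.append(match)  # its own requirements are checked later
    return selected, unresolved
```

Each component is added at most once, so the loop terminates for any finite repository. Exclusion requirements are not considered here and would still need the passive check.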
The concept of component transformations as an analogy for data type conversions was only touched upon in this work. Data type conversions are commonplace in programming language theory and practice. Almost all programmers take for granted that a language implementation will automatically insert conversion functions when the need arises: a compiler will insert a conversion from an integer type to a real type if the context demands it. The possibility of content management systems making equivalent use of conversion functions to adapt components as the context demands is exciting. As it stands, there does not seem to be an XML parser capable of supporting such an implementation in user-code; therefore, this would seem to require modifications to the parsers themselves. Future work will investigate the impact of such modifications on the usability of a content management system, as well as on the performance of the XML parser.
The code that follows is an implementation of the internal compatibility check, using the simple algorithm discussed in the main text. The processing involved in implementing even this simple scheme makes the use of XSLT somewhat inappropriate. This code sample is therefore written using the OmniMark language.
The program assumes that the inclusions, the exclusions, and all of the components are stored in the keyed shelves included, excluded, and have, respectively. The shelf keys identify the origin of the shelf value: for instance, a component's public identifier might be used as a key value for the component's value. The initialization of these shelves has been excised for the sake of brevity. The program is, otherwise, self-explanatory.
define function check-exclusions value stream have
against read-only stream exclusions elsewhere
define function check-inclusions value stream inclusion
against read-only stream have elsewhere
define function encode-tree value stream tree-source
into modifiable stream tree-target elsewhere
declare catch excluded-by value stream component
declare catch satisfied-by value stream component
process
repeat over included
check-inclusions included against have
catch satisfied-by component
put #error key of included
|| " is satisfied by " || component || "%n"
again
repeat over have
check-exclusions have against excluded
catch excluded-by component
put #error key of have
|| " is excluded by " || component || "%n"
again
declare catch not-satisfying
define function check-inclusions value stream inclusion
against read-only stream have
as
local stream i-tree variable
encode-tree inclusion into i-tree
repeat over have
local stream h-tree variable
encode-tree have into h-tree
repeat over i-tree
throw not-satisfying when h-tree hasnt key key of i-tree
again
throw satisfied-by key of have
catch not-satisfying
again
define function check-exclusions value stream have
against read-only stream exclusions
as
local stream h-tree variable
encode-tree have into h-tree
repeat over exclusions
local stream e-tree variable
encode-tree exclusions into e-tree
repeat over h-tree
throw excluded-by key of exclusions when e-tree has key key of h-tree
again
again
global stream tree variable
define function encode-tree value stream tree-source
into modifiable stream tree-target
as
do xml-parse scan tree-source
output "%sc"
done
copy-clear tree to tree-target
element #implied
local stream s
open s as buffer
using output as s
repeat over current elements as e
output name of current element e
output "[" || "d" % children of parent || "]"
unless number of current elements = 1
output "/" unless #last
again
close s
set new tree{s} to "%sc"
I am very grateful for the countless discussions endured by my colleagues at OmniMark Technologies and Stilo International, as well as for the innumerable ideas and suggestions that resulted from these discussions. I would particularly like to thank OmniMark Technologies for giving the time, resources, and encouragement needed to perform this and related investigations.
[Fowler, 1999] Martin Fowler, Refactoring: Improving the Design of Existing Code, Addison Wesley Longman, Inc., Reading, 1999.
[Korth and Silberschatz, 1991] Henry F. Korth and Abraham Silberschatz, Database System Concepts, second ed., McGraw-Hill, Inc., New York, 1991.
[Cleaveland, 1986] J. Craig Cleaveland, An Introduction to Data Types, Addison-Wesley Publishing Company, Boston, 1986.