Keywords: Application architecture, Content Management, Content Repurposing, Conversion, DocBook, Electronic Publishing, Integration, Java, Knowledge Management, Ontology, PDF, Publishing, SVG, XML, XSL-FO, XSLT
Biography
After receiving his Ph.D. in Mathematical Physics from Yale University in 1989, Philip spent a year as Assistant Professor of Physics at Knox College, followed by four years as Assistant Professor of Mathematics at the University of Toronto. His background in Differential Geometry and in computer modelling of physical phenomena served as unorthodox preparation for his subsequent move into industry as a Software Engineer with an emphasis on Computer Graphics. By 1997 Philip was in charge of a software research team creating early Web technologies based on HTML, XML, CSS and Java. Philip now lives and works in Vancouver, Canada, where he is President of SchemaSoft (http://www.schemasoft.com/), a software development consulting company he co-founded in 1999. He is an Advisory Committee Representative of the World Wide Web Consortium (http://www.w3.org/), and has been a member of the W3C Scalable Vector Graphics Working Group (http://www.w3.org/Graphics/SVG/) since its inception in 1998. Philip is Chair of the BC Advanced Systems Institute International Scientific Advisory Board (http://www.asi.bc.ca/). He is also a Director of the Vancouver XML Developers Association (http://www.vanx.org/), an organization that he co-founded in 2000. He regularly writes and lectures on topics related to software engineering, XML and SVG.
Biography
Yuri Khramov has more than 20 years of experience in the software industry, and has been involved in XML and other Web technologies for more than 5 years. He is one of the founding partners of SchemaSoft. Prior to that, he worked at Paradigm Development Corp. in Vancouver, Canada, Graphica Corp. in Tokyo, and several industrial and academic institutions in Moscow. He holds a Ph.D. in Computer Science from Moscow Management Institute. Yuri is a co-director of the Vancouver XML Developers Association.
Biography
Ahmet Gurcan completed his BSc. in Electrical and Computer Engineering at Istanbul Technical University in 1997, and his M.A.Sc. in the same field at the University of British Columbia in 2000. While pursuing his Master's degree, he also worked as a research and teaching assistant. His main interest at that time was real-time operating systems, and he built a dual-processor real-time system mainly used to control machines and robots. Since graduating, he has worked for SchemaSoft, a Vancouver-based software company and a leader in file converters and XML technologies. He has worked on several projects that involve manipulation of the PDF file format, its integration with XML, and its conversion to SVG.
Widespread, consistent use of XML to encode documents throughout an organization has well-understood advantages: content can be re-purposed, re-styled, searched, combined, transformed, rendered or otherwise processed with ease, and pre-existing software can be highly leveraged when providing a solution.
However, in the real world people do not consistently adhere to imposed standards on the use of software, and consistent software is not installed throughout large organizations. It is unrealistic to expect that an organization's documents will be widely accessible as XML. To make matters worse, many standard tools for conversion to XML operate at inconsistent semantic levels, or encode an inappropriate semantic level. To illustrate this point, one could easily convert all electronic documents to XML with a common schema by opening the documents in their originating applications, taking screen shots, and encoding these bitmaps using a long list of <pixel> elements. However, this encoding would be useless for a content management system, or for anything other than re-rendering the same view of the same documents on the same device, for that matter.
We address the problem of large-scale conversion of heterogeneous content to XML, with an emphasis on the conversions needed to produce useful and compatible semantic representations from source files in various formats. We then discuss the applicability of these ideas to content management systems and electronic publication workflows.
An improved architecture for content management systems arises from the discussion. A wide variety of content management applications can be achieved simply by building appropriate pipelines of converters to, from and between XML languages. This is a lightweight, flexible approach that does not depend on a proprietary content management server because it leverages XML-processing functionality already included in operating systems and Web servers. To make this approach practical, one needs a large and varied toolkit of converters as well as a standard pipeline architecture with which to connect them. We will provide a demonstration of such a rich toolkit that takes advantage of Apache Cocoon as the pipeline architecture. As an example, we will show how our toolkit was applied to the paper publication workflows of this conference.
Cocoon pipeline definitions are created in XML "sitemap" files, allowing new processes to be configured without writing code. In order to determine what converters are needed to make useful pipelines, it is helpful to regard each standard document-processing function as a conversion. In particular,
• retrieval of content from a standard desktop application file is a binary to XML conversion
• reconstruction of semantics implied by styling is possible with a profile-driven XML to XML conversion
• data-driven graphs, charts and maps are made possible by XML to SVG conversions
• documents can be prepared for print via XML to XSL-FO to PDF conversions
Similarly, database query, Web services, content aggregation, search, classification, Web publishing and e-book publishing can all be regarded as arising from specific converters in the content management toolkit. When effectively combined, such converters can produce powerful and sophisticated solutions to real-world problems.
1. XML in Content Management and Publishing
2. Single-format Fantasy
3. Up-Conversion
4. Multi-format Solutions
4.1 Requirements
4.2 Use Cases
4.2.1 Data-Driven Graphics
4.2.2 Online Newspapers
4.2.3 Conference Proceedings
4.3 Architecture
4.3.1 XML Pipelines
4.3.2 Conversion Components
4.3.3 Processing Phases
4.3.3.1 Extract Patterns
4.3.3.2 Synthesize Patterns
4.3.3.3 Publish Patterns
5. Results
5.1 Data-Driven Graphics
5.2 Online Newspapers
5.3 Conference Proceedings
Appendix 1. Indexing XML Content
Acknowledgements
Bibliography
Footnotes
There are many reasons to use XML in content management and publishing, among them the following:
Since XML is basically just a set of conventions for encoding information, you might wonder how it can offer so many advantages. This has much more to do with the large community of users than with the specific choice of syntax with which to encode information. Along with a large community of users comes a large amount of software that follows the XML information-encoding conventions, and this is the basis for many of the listed advantages of XML.
Two of the traits that have allowed XML to achieve a large community of users are:
Ironically, these same two traits determine limits on the utility of XML-processing software:
The current paper addresses the problem of overcoming these limitations in XML-based content management and publishing systems.
A naïve information systems manager, enamoured with the advantages of XML, might think:
If I can just get my whole organization using the same XML-based authoring tools in the same way, then I will be able to build the ultimate enterprise-wide content management solution.
However, it is difficult to find any examples of success with this approach. The problem is, you can overhaul software installations but you cannot overhaul people.
Every division of a large corporation tends to act autonomously, making its own technological choices on its own schedule. Even within a division, there is the problem of getting employees to act in unison by following rigid authoring guidelines. People do not always read instructions, let alone follow them. And if software is really supposed to increase their productivity, then it should make things easier for them, not harder.
Furthermore, native authoring tool formats continue to be binary, not XML. This is because tool vendors are motivated to preserve market share by making it difficult to migrate, and proprietary binary formats tie users to the tool that knows how to read and write them.
Finally, there is the problem of rapidly-advancing software technology. By the time an enterprise has implemented a technology overhaul, there is already something better to replace it. Forward-thinking solutions should anticipate this; one aspect of doing so is to accommodate the future inclusion of as-yet-unknown file formats.
In addition to the problem of dealing with many source formats, there are often difficulties with the information being represented by these source formats. Popular document formats do not always encode structure at a useful semantic level. For example, an HR department wants name, address, past jobs, degrees, publications, etc. from résumés. Yet these categories of text are indistinguishable in the submitted word-processing documents. Likewise, spreadsheet document formats may encode row and column information, but not categories like cost, revenue, assets, date and company name for a quarterly financial report; or categories like transportation, lodging and per-diem for a travel expense report. PDF documents contain instructions for drawing absolutely-positioned text and figures, but do not specify what collection of text and figures makes up a single article in a magazine; what is a title, subtitle, author, side note, glossary term, vignette or reference; or what collections constitute an advertisement, editorial or table of contents.
Up-conversion is needed. This is independent of binary to XML conversion. Binary to XML conversion is typically just a change of syntax, in which binary-encoded objects become XML elements, attributes and text. However, up-conversion is a re-construction of semantic structure, in which content is tagged with higher-level semantic categories than were available in the original markup. Automatic up-conversion is often feasible within a given collection of documents (such as résumés, financial reports, expense reports or magazines in the above examples), but since there are so many different kinds of document collections, it is important to have a general way of profiling a given collection of documents, or encoding the rules for up-converting that collection.
To solve the problems discussed in the foregoing sections, it will be necessary to meet these high-level requirements:
Here are some use cases to bear in mind while coming up with a general architecture for content management. All are projects recently completed by SchemaSoft.
A business reporting system requires graphs and charts to be drawn on the fly from current data. Data is variously available as Microsoft Excel files, database tables and XML. The structure or schema of the data is also variable within each of those formats. It has to be possible to quickly define and hook up a new data source without additional programming or modifications to the source code.
Articles from newspapers around the world are to be automatically published as HTML pages. Newspapers are available in PDF format. Sections, articles, article continuations across pages, titles, bylines, figure captions, advertisements, etc. are to be identified in the source PDF. Each PDF file is to give rise to many HTML files, one for each article. The PDF file is to be augmented with hyperlinks to the HTML page corresponding to each article. The articles are to be indexed by title, author, section, etc.
The XML 2004 conference papers are to be published as HTML, PDF and SVG. Source documents are Microsoft Word files and DocBook [DB] XML files with links to SVG, PNG, GIF, JPEG and BMP files. HTML index pages are to be constructed listing the papers by author, city, country, keyword, organization, time, title and track. Author biography pages, paper abstract pages and other conference information pages are to be derived from data. Cross-reference hyperlinks are to be constructed wherever applicable — for example, from index entries to paper abstract pages; from paper abstract pages to papers, author biographies and companies; from author biographies to companies and abstracts of papers written by the author; etc.
Our approach is to assemble content management solutions from a toolkit of useful components, rather than by configuring a monolithic application. As long as component APIs permit virtual plug-and-play, this approach is inherently more flexible, and better able to accommodate frequent change in data sources, formats, schema, content management functions, publishing targets, and publication styling.
A pipeline architecture is used to manage data flow and order of execution of components. Pipeline definitions determine how the output stream(s) of one component feed the input stream(s) of other components. XML is normally passed between components, although the schema of each XML stream is dependent on the nature of the components that pass it.
The primary function of a component is to convert data from one form to another. For example, a binary document might be parsed and mapped to corresponding XML, an XML data set might be sorted without changing the schema, or an XML document might be up-converted to a schema with more specialized structure.
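The following is a minimal Python sketch of this idea, not the toolkit described in this paper. It treats each component as a function and a pipeline as their composition. The function names, the comma-separated input and the `data`/`row` vocabulary are assumptions made for illustration; a plain-text source stands in for a binary document.

```python
import xml.etree.ElementTree as ET
from functools import reduce

def generate(text):
    """Generator (any-to-XML): parse 'name,value' lines into a <data> tree."""
    root = ET.Element("data")
    for line in text.strip().splitlines():
        name, value = line.split(",")
        ET.SubElement(root, "row", name=name.strip(), value=value.strip())
    return root

def sort_rows(root):
    """Transformer (XML-to-XML): sort rows by name; the schema is unchanged."""
    root[:] = sorted(root, key=lambda row: row.get("name"))
    return root

def serialize(root):
    """Serializer (XML-to-any): emit the tree as a text string."""
    return ET.tostring(root, encoding="unicode")

def pipeline(*stages):
    """Connect stages so that each one's output feeds the next one's input."""
    return lambda source: reduce(lambda data, stage: stage(data), stages, source)

convert = pipeline(generate, sort_rows, serialize)
print(convert("pears, 3\napples, 5"))
```

Because every stage has the same shape (one input, one output), a new format can be supported by swapping a single stage without touching the others.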
The Apache Cocoon project provides one possible framework for pipelining conversion components. Cocoon is specially tailored for Web publishing, since the pipeline implementation is integrated with the Web server. Specifically, Cocoon is implemented as a Tomcat servlet.
Pipelines are defined in sitemap files, which are written in an XML grammar. A useful feature of Cocoon is that pipelines can be triggered by URL wildcards. For example, one can make a rule that if the URL requested by a client Web browser ends in .doc, then that document is sent through a pipeline that first converts it from Microsoft Word to XML, then styles it using XSLT to HTML+CSS. Much more complex pipelines are also possible, such as ones that depend on user profile, session information, or Web service calls.
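The wildcard rule just described can be illustrated with a short Python analogy. This is not Cocoon sitemap syntax, and the stage names and stylesheet file names below are hypothetical; the sketch only shows how a requested URL might select a pipeline.

```python
from fnmatch import fnmatchcase

# Hypothetical routes: (URL wildcard, ordered list of stage names).
# In Cocoon the stages would be generator, transformer and serializer
# components declared in a sitemap file.
ROUTES = [
    ("*.doc", ["word-to-xml", "xslt:doc2html.xsl", "html-serializer"]),
    ("*.xml", ["xml-file-reader", "xslt:docbook2html.xsl", "html-serializer"]),
]

def select_pipeline(url):
    """Return the stages of the first route whose wildcard matches the URL."""
    for pattern, stages in ROUTES:
        if fnmatchcase(url, pattern):
            return stages
    return None  # no pipeline is configured for this URL

print(select_pipeline("papers/report.doc"))
```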
Components are classified into those that can generate, transform or serialize XML. In data conversion terminology, this means any-to-XML, XML-to-XML, and XML-to-any conversions, respectively. In general, Cocoon components are defined in Java classes, but they may take parameters that utilize other languages. Of particular interest is the XSLT transformer, since it can take an XSLT stylesheet as a parameter. When designing a pipeline, one's strategy is usually to convert incoming documents and data to XML at the first possible opportunity, and to do XML to XML transformations thereafter. The XML to XML transformations are normally done with the XSLT transformer.
NOTE: A Cocoon-based content management system called Lenya is also available from the Apache Software Foundation. Although we used Cocoon for the pipelines in one of the examples of this paper, we did not use Lenya.
Various other pipeline technologies are possible, each appropriate for a different kind of application. A lightweight but platform-specific approach is to use batch files. A developer-centric approach is to use Ant build files, which are XML files that encode build instructions for each target. Since pipeline definitions can themselves be produced by running part of the pipeline, it is useful to choose an XML syntax.
To meet the objective of supporting a range of formats and schemas, it is good to start with a collection of ready-made readers, writers and translators of popular file formats. To handle a range of intermediate processing tasks, one needs a collection of utility XSLT programs as well. An example of such a utility XSLT program is given in Appendix 1. However, the real power comes from being able to rapidly produce new components that handle new formats or new intermediate processing tasks, in order to deal with typical scenarios in which technology and industry requirements change often. This requires rapid application development kits tuned to the problems of format translation and XSLT development. Such RAD kits have been presented in previous papers [Trans] and [Cat], and have been used to implement the content management solutions discussed herein.
XSL (XSLT + XSL-FO) is an effective language for specifying page layout in print or e-book publishing solutions. This is the standard use of XSLT, and can be implemented by connecting an XSLT translator component to a serializer component that does the formatting, with an appropriate target serialization such as Adobe PDF or IBM AFP.
However, XSLT translators are capable of much more: they can generate arbitrary data visualizations. In a previous paper [GS], we introduced the notion of Graphical Stylesheets; XSLT programs to draw data as SVG. More recently [SVG-XAML] we discussed strategies to target multiple vector graphic output formats, including Microsoft's XAML, from the same Graphical Stylesheets. Starting with XML formats as varied as MathML (Mathematical Markup Language), XBRL (eXtensible Business Reporting Language), GML (Geographic Markup Language) or X3D (eXtensible 3D), we have used Graphical Stylesheets to render diagrams of the data. Specific examples are elaborated in [DWGraphs], [SVGMaps] and [3D-SVG].
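To make the idea of drawing data as SVG concrete, here is a small Python sketch. It is not a Graphical Stylesheet, which would be written in XSLT; it performs the same kind of data-to-SVG mapping on an assumed input vocabulary (`data`/`row` with a `value` attribute) and assumes at least one row with a positive value.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

def bar_chart(data_xml, bar_width=40, gap=10, height=100):
    """Draw each <row value="..."> of the input as one SVG <rect> bar."""
    rows = ET.fromstring(data_xml).findall("row")
    peak = max(float(r.get("value")) for r in rows)  # assumes a positive peak
    width = len(rows) * (bar_width + gap)
    svg = ET.Element("svg", xmlns=SVG_NS, width=str(width), height=str(height))
    for i, row in enumerate(rows):
        # Scale so that the tallest bar fills the full chart height.
        bar = height * float(row.get("value")) / peak
        ET.SubElement(svg, "rect",
                      x=str(i * (bar_width + gap)),
                      y=f"{height - bar:g}",  # the SVG y axis points down
                      width=str(bar_width),
                      height=f"{bar:g}")
    return ET.tostring(svg, encoding="unicode")

sample = '<data><row name="Q1" value="50"/><row name="Q2" value="100"/></data>'
print(bar_chart(sample))
```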
In a typical content management solution, data flow pipelines can be roughly divided into the following phases of processing:
The next three sections will show the most common patterns that occur in data flow diagrams at each of these phases. For this purpose, we will use the symbology shown in Table 1. The full data flow diagram for any particular content management solution will normally contain many of these patterns.
| Category | Symbol meaning |
|---|---|
| Data Flow | Run-time data flow |
| Data Flow | Design-time data flow |
| Data Flow | Design-time compilation |
| Content Sources | Binary format document |
| Content Sources | XML format document |
| Content Sources | Database |
| Translation Components | XML generator |
| Translation Components | XML transformer |
| Translation Components | XML serializer |
| Translation Components | Batch process |

Table 1
Information is often available in binary format documents or database tables, and must be converted to XML first in order to participate in the XML conversion pipeline. Extract Pattern #1 and Extract Pattern #2 are the simple patterns in which a generator component extracts the information.

XML is generated from a database by a generator that takes a query parameter such as XQuery.
Figure 2: Extract Pattern #2
As discussed in Section 3, up-conversion is a crucial step in recovering the valuable information needed to drive content processing pipelines. Automating this process requires programmatic reconstruction of semantics from styling. This is possible for content created from a common style template, as discussed in [PDF2XML]. The rules that associate style with semantics are encoded in an XML file called a profile, and semantic reconstruction is done by a transformer that reads in both the input XML stream and the profile. This is Synthesize Pattern #1. Another application of semantic reconstruction is reported in [TestSpec].

XML is up-converted using a transformer that takes a profile as parameter.
Figure 3: Synthesize Pattern #1
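A rough Python sketch of Synthesize Pattern #1 follows. The profile vocabulary shown here (`rule` elements whose attributes are styling conditions plus a `tag` attribute naming the semantic category) is invented for illustration and is not the profile format of [PDF2XML]; the input `text` elements and their styling attributes are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile: each rule maps a combination of styling
# attributes to the semantic element name given in its "tag" attribute.
PROFILE = """<profile>
  <rule font-size="24" tag="title"/>
  <rule font-size="12" font-style="italic" tag="byline"/>
</profile>"""

def up_convert(doc_xml, profile_xml, default_tag="para"):
    """Re-tag styled <text> runs with semantic names taken from the profile."""
    rules = ET.fromstring(profile_xml).findall("rule")
    article = ET.Element("article")
    for run in ET.fromstring(doc_xml).findall("text"):
        tag = default_tag
        for rule in rules:
            conditions = {k: v for k, v in rule.attrib.items() if k != "tag"}
            # The first rule whose conditions all hold decides the tag.
            if all(run.get(k) == v for k, v in conditions.items()):
                tag = rule.get("tag")
                break
        ET.SubElement(article, tag).text = run.text
    return article

DOC = """<doc>
  <text font-size="24">XML Pipelines</text>
  <text font-size="12" font-style="italic">By A. Author</text>
  <text font-size="10">Body text.</text>
</doc>"""
print(ET.tostring(up_convert(DOC, PROFILE), encoding="unicode"))
```

Keeping the rules in a separate profile means that a new document collection needs a new profile, not a new transformer.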
Since XSLT is frequently needed for translation components, it is useful to have a GUI application for rapidly specifying and generating the XSLT. An example of such an application is Catwalk, as described in [Cat]. Catwalk has been deployed successfully to generate Graphical Stylesheet transformations, B2B transformations, and HTML reports. Synthesize Pattern #2 is the pattern for a generated XSLT transformer.

A generator component generates XSLT used by a transformer component. The generator reads in sample input XML files used at design time to specify the mapping.
Figure 4: Synthesize Pattern #2
XSLT is neither suitable nor efficient for transforming the DOM generated by typical binary format readers. Nonetheless, it is possible to rapidly develop C++ transformers by compiling a translation specification as described in [Trans]. The translation specification is XML adhering to the schema translation.xsd of that paper. Synthesize Pattern #3 shows an XML translation specification compiled into a transformer.
Often the pipeline itself can be determined from data. An example is a batch process in which the URLs of the files to be processed are available as data. Synthesize Pattern #4 consists of a transformer generating a pipeline definition that converts a given collection of XML documents to a binary format.
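As an illustration of a pipeline definition produced from data, the Python sketch below turns a list of document URLs into an XML batch definition. The `files`, `batch`, `job`, `transform` and `serialize` element names, and the stylesheet file name, are hypothetical; they do not correspond to Cocoon, Ant or any other real pipeline grammar.

```python
import xml.etree.ElementTree as ET

def make_pipeline_definition(filelist_xml, stylesheet="to-fo.xsl"):
    """Emit one <job> per listed document: transform it, then write a PDF."""
    batch = ET.Element("batch")
    for entry in ET.fromstring(filelist_xml).findall("file"):
        url = entry.get("url")
        job = ET.SubElement(batch, "job", source=url)
        ET.SubElement(job, "transform", stylesheet=stylesheet)
        # Derive the output name by replacing the extension with .pdf.
        target = url.rsplit(".", 1)[0] + ".pdf"
        ET.SubElement(job, "serialize", format="pdf", target=target)
    return batch

FILES = '<files><file url="papers/a.xml"/><file url="papers/b.xml"/></files>'
print(ET.tostring(make_pipeline_definition(FILES), encoding="unicode"))
```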
Once a document has been assembled and styled, it can be published in print form or for the Web. Publish Pattern #1 shows the output of an XSLT stylesheet being passed to an XSL formatter to produce PDF for print, and Publish Pattern #2 shows the output of another XSLT stylesheet being serialized as an XHTML or SVG file for the Web.

XSLT transformer generates XSL-FO which is formatted to PDF by a serializer.
Figure 7: Publish Pattern #1
Below are Web references to the results of each of the three use cases introduced in Section 4.2.
Our Catwalk application was used to generate the XSLT transformers from XML data to SVG graphs. Using Cocoon, we were able to extract this XML data from our Microsoft Excel generator component and a database query component. Extract Pattern #1, Extract Pattern #2, Synthesize Pattern #2 and Publish Pattern #2 were utilized.
The online newspaper publishing system is available in modified form from NewspaperDirect. The key challenge of this project was to perform profile-driven up-conversion on newspaper documents available as PDF. Thus, the solution makes critical use of Synthesize Pattern #1.
The initial phases of content conversion for the XML 2004 conference are done by authoring tools made available to conference paper authors. For example, SchemaSoft provides a free Word to DocBook Converter Web service that extracts content from the Microsoft Word .doc binary format and synthesizes DocBook XML for conference submission. There are many other synthesize and publish steps leading to the indexed proceedings, such as the HTML paper publishing step achieved with our DocBook Styler XSLT.
The final result can be viewed at the IDEAlliance XML 2004 Proceedings site, including this paper in XML, XHTML, PDF and SVG forms.
Suppose you are constructing a typical index for a book. You would assign index terms to items such as pages or sections, and then list the terms at the end, in alphabetical order. Each term would be followed by a list of references to the places in which it occurred, such as page numbers. In the more general problem of indexing, the items can be anything (plant inventory, train trips, mayors, etc.) and the terms can be any properties of those items (available colours, departure times, hobbies, etc.). In the use case of Section 4.2.3, the items were papers and there were eight indices, with the terms author, city, country, keyword, organization, time, title and track.
Example 1 is a minimal DTD illustrating this idea. The XML contains an items element with item children, and an index element with term children. Each item element has a unique id attribute as well as any number of child elements that refer to the id attributes of its associated terms. Likewise, each term element has a unique id attribute as well as any number of child elements that refer back to the id attributes of its associated items.
<!ELEMENT crossref (items, index)>
<!-- items with reference to index terms -->
<!ELEMENT items (item*)>
<!ELEMENT item (termref*, content)>
<!ATTLIST item
  id ID #REQUIRED
>
<!ELEMENT termref EMPTY>
<!ATTLIST termref
  ref IDREF #REQUIRED
>
<!ELEMENT content (#PCDATA)>
<!-- index terms cross-referenced to items -->
<!ELEMENT index (term*)>
<!ELEMENT term (itemref*)>
<!ATTLIST term
  id ID #REQUIRED
  name CDATA #REQUIRED
>
<!ELEMENT itemref EMPTY>
<!ATTLIST itemref
  ref IDREF #REQUIRED
>
Example 1: Cross-reference DTD
We will discuss the problem of writing XSLT to construct such a cross-referenced index from raw data. For our actual solutions, we have generalized the software so that it can handle multiple indices on input with arbitrary DTD, by parameterizing the XPaths used to fetch things like items and terms. However, for illustration purposes we will assume one index and the fixed DTD given.
The steps are to read in the raw data, assign IDs to items and every instance of a term in an item, separate the terms into an index table, construct the references and cross-references, sort the terms (which brings duplicate terms next to each other), and finally remove duplicate terms. These are multiple steps in a pipeline, and to keep things simple, we will restrict our attention to the last step only.
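For orientation, the earlier steps can be sketched in Python as follows. This is an illustrative stand-in for the XSLT stages of our pipeline, not their actual implementation; the raw-data shape (a list of content and term-name pairs) and the ID naming scheme are assumptions. The output follows the DTD of Example 1, with one term element per term instance, so duplicates are still present but adjacent after sorting.

```python
import xml.etree.ElementTree as ET

def build_crossref(raw):
    """Build a <crossref> tree from a list of (content, [term names]) pairs."""
    crossref = ET.Element("crossref")
    items = ET.SubElement(crossref, "items")
    terms = []
    for i, (content, names) in enumerate(raw, start=1):
        item = ET.SubElement(items, "item", id=f"i{i}")
        for j, name in enumerate(names, start=1):
            term_id = f"t{i}.{j}"  # one ID per term instance
            ET.SubElement(item, "termref", ref=term_id)
            term = ET.Element("term", id=term_id, name=name)
            ET.SubElement(term, "itemref", ref=f"i{i}")  # back reference
            terms.append(term)
        ET.SubElement(item, "content").text = content
    index = ET.SubElement(crossref, "index")
    # Sorting by name brings duplicate terms next to each other.
    index.extend(sorted(terms, key=lambda t: t.get("name")))
    return crossref

RAW = [("Paper A", ["xml", "svg"]), ("Paper B", ["xml"])]
print(ET.tostring(build_crossref(RAW), encoding="unicode"))
```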
When eliminating duplicate terms, each termref IDREF has to be fixed up to point to the single term element that remains, and the itemref children of all duplicate terms have to be combined as children of the single term element that remains. Example 2 is the XSLT that eliminates duplicate terms according to this prescription.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- This stylesheet removes duplicate terms in the index, combines the
       item references of duplicate terms, and resolves each IDREF to the
       ID of the corresponding retained term -->
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <!-- Start by copying the existing content -->
  <xsl:template match="/">
    <xsl:apply-templates mode="copy"/>
  </xsl:template>
  <xsl:template match="node()|@*" mode="copy">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*" mode="copy"/>
    </xsl:copy>
  </xsl:template>
  <!-- Replace each IDREF with the unique retained target term ID -->
  <xsl:template match="item/termref" mode="copy">
    <xsl:copy>
      <xsl:attribute name="ref">
        <xsl:apply-templates select="id(@ref)" mode="lookup"/>
      </xsl:attribute>
    </xsl:copy>
  </xsl:template>
  <!-- Look up the ID of the first term among duplicates -->
  <xsl:template match="term" mode="lookup">
    <xsl:if test="not(@name=preceding-sibling::term[1]/@name)">
      <xsl:value-of select="@id"/>
    </xsl:if>
    <xsl:apply-templates mode="lookup"
        select="preceding-sibling::term[@name=current()/@name][last()]"/>
  </xsl:template>
  <!-- Retain only the first term among duplicates -->
  <xsl:template match="term" mode="copy">
    <xsl:if test="not(@name=preceding-sibling::term[1]/@name)">
      <xsl:copy>
        <xsl:apply-templates select="node()|@*" mode="copy"/>
        <xsl:apply-templates mode="combine"
            select="following-sibling::term[@name=current()/@name]"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>
  <!-- Combine the content of duplicates, including item back references -->
  <xsl:template match="term" mode="combine">
    <xsl:apply-templates mode="copy"
        select="itemref[not(@ref=../preceding-sibling::*[1]/itemref/@ref)]"/>
  </xsl:template>
</xsl:stylesheet>
Example 2: XSLT to Make Index Terms Unique
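For readers who want to cross-check the prescription outside XSLT, here is a Python sketch of the same step. It is not a port of Example 2: it keys on term names with a dictionary, so it does not rely on the terms being sorted, and it removes repeated item references across all duplicates of a term. It assumes that every termref points at an existing term ID.

```python
import xml.etree.ElementTree as ET

def make_terms_unique(crossref):
    """Merge duplicate <term> elements in place and fix up every termref."""
    index = crossref.find("index")
    retained = {}  # term name -> the <term> element that is kept
    remap = {}     # every term id -> id of the retained term
    for term in list(index):
        first = retained.setdefault(term.get("name"), term)
        remap[term.get("id")] = first.get("id")
        if first is not term:
            # Move item references that the retained term does not have yet.
            seen = {r.get("ref") for r in first.findall("itemref")}
            for ref in term.findall("itemref"):
                if ref.get("ref") not in seen:
                    first.append(ref)
                    seen.add(ref.get("ref"))
            index.remove(term)
    for termref in crossref.iter("termref"):
        termref.set("ref", remap[termref.get("ref")])
    return crossref

SAMPLE = (
    '<crossref><items>'
    '<item id="i1"><termref ref="t1"/><content>A</content></item>'
    '<item id="i2"><termref ref="t2"/><content>B</content></item>'
    '</items><index>'
    '<term id="t1" name="xml"><itemref ref="i1"/></term>'
    '<term id="t2" name="xml"><itemref ref="i2"/></term>'
    '</index></crossref>'
)
print(ET.tostring(make_terms_unique(ET.fromstring(SAMPLE)), encoding="unicode"))
```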
Text formats are defined as character streams that utilize pre-existing conventions for character encoding.
XHTML rendition made possible by SchemaSoft's Document Interpreter™ technology.