Consistent Electronic Publishing from Inconsistent Sources

Track: Publishing, Knowledge Management, Integration

Audience Level: High Level/Technical View

Time: Wednesday, November 17 at 11:45

Author: Dr. Philip Mansfield, President, SchemaSoft

Author: Dr. Yuri Khramov, Director of Development, SchemaSoft

Author: Ahmet Gurcan, Senior Developer, SchemaSoft

Keywords: Application Architecture, Content Management, Content Repurposing, Conversion, DocBook, Electronic Publishing, Integration, Java, Knowledge Management, Ontology, PDF, Publishing, SVG, XML, XSL-FO, XSLT

Abstract:

Widespread, consistent use of XML to encode documents throughout an organization has well-understood advantages: content can be re-purposed, re-styled, searched, combined, transformed, rendered or otherwise processed with ease, and existing software can be reused extensively when building a solution.

However, in the real world, people do not consistently adhere to standards imposed on their use of software, and large organizations do not have the same software installed everywhere. It is unrealistic to expect that an organization's documents will be widely accessible as XML. To make matters worse, many standard tools for conversion to XML operate at inconsistent semantic levels, or encode an inappropriate semantic level. To illustrate this point, one could easily convert all electronic documents to XML with a common schema by opening the documents in their originating applications, taking screen shots, and encoding these bitmaps using a long list of <pixel> elements. However, this encoding would be useless for a content management system, or indeed for anything other than re-rendering the same view of the same documents on the same device.

We address the problem of large-scale conversion of heterogeneous content to XML, with an emphasis on the conversions needed to produce useful and compatible semantic representations from source files in various formats. We then discuss the applicability of these ideas to content management systems and electronic publication workflows.

An improved architecture for content management systems arises from the discussion. A wide variety of content management applications can be achieved simply by building appropriate pipelines of converters to, from and between XML languages. This is a lightweight, flexible approach that does not depend on a proprietary content management server because it leverages XML-processing functionality already included in operating systems and Web servers. To make this approach practical, one needs a large and varied toolkit of converters as well as a standard pipeline architecture with which to connect them. We will demonstrate such a toolkit, which uses Apache Cocoon as the pipeline architecture. As an example, we will show how our toolkit was applied to the paper-publication workflows of this conference.
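The idea of a pipeline of converters can be sketched outside Cocoon with the XSLT support built into the Java platform. The following is only an illustration under stated assumptions, not the toolkit described here: the class name ConverterPipeline, the two stylesheets, and the input vocabulary (doc, p, style) are all hypothetical. Each stage is an XSLT stylesheet, and the output of one stage is fed to the next.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: a pipeline is an ordered list of XSLT converters.
final class ConverterPipeline {

    // Assumed stage 1: turn style-tagged paragraphs into semantic elements.
    static final String STYLED_TO_SEMANTIC =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='doc'><article><xsl:apply-templates select='p'/></article></xsl:template>"
      + "<xsl:template match='p[@style=\"Heading1\"]'><title><xsl:value-of select='.'/></title></xsl:template>"
      + "<xsl:template match='p'><para><xsl:value-of select='.'/></para></xsl:template>"
      + "</xsl:stylesheet>";

    // Assumed stage 2: render the semantic elements as simple markup for the Web.
    static final String SEMANTIC_TO_HTML =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='article'><section><xsl:apply-templates/></section></xsl:template>"
      + "<xsl:template match='title'><h1><xsl:value-of select='.'/></h1></xsl:template>"
      + "<xsl:template match='para'><p><xsl:value-of select='.'/></p></xsl:template>"
      + "</xsl:stylesheet>";

    // Apply a single XSLT converter to an XML string.
    static String runStage(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Converter failed", e);
        }
    }

    // Run the converters in order, feeding each result to the next stage.
    static String run(String xml, List<String> stylesheets) {
        String current = xml;
        for (String xslt : stylesheets) {
            current = runStage(current, xslt);
        }
        return current;
    }

    public static void main(String[] args) {
        String input = "<doc><p style='Heading1'>Report</p><p style='Body'>Text.</p></doc>";
        System.out.println(run(input, List.of(STYLED_TO_SEMANTIC, SEMANTIC_TO_HTML)));
    }
}
```

Passing intermediate results as strings keeps the sketch short; a production pipeline would more likely stream events between stages, which is the role a framework such as Cocoon plays.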

Cocoon pipeline definitions are created in XML "sitemap" files, allowing new processes to be configured without writing code. To determine which converters are needed to make useful pipelines, it is helpful to regard each standard document-processing function as a conversion. In particular:

• retrieval of content from a standard desktop application file is a binary-to-XML conversion

• reconstruction of semantics implied by styling is possible with a profile-driven XML-to-XML conversion

• data-driven graphs, charts and maps are made possible by XML-to-SVG conversions

• documents can be prepared for print via XML-to-XSL-FO-to-PDF conversions
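As one illustration of the conversions listed above, an XML-to-SVG converter for a data-driven bar chart might look roughly like the following. This is a minimal sketch under assumptions: the class name XmlToSvgChart, the input vocabulary (data, point, value), and the fixed bar dimensions are hypothetical and chosen only for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: convert <data><point value='N'/>...</data> into an SVG bar chart.
final class XmlToSvgChart {

    static final int BAR_WIDTH = 40;   // width of each bar, in SVG user units
    static final int GAP = 10;         // horizontal space after each bar
    static final int SCALE = 10;       // user units per data unit
    static final int HEIGHT = 100;     // total chart height

    static String toSvg(String dataXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(dataXml)));
            NodeList points = doc.getElementsByTagName("point");

            int width = points.getLength() * (BAR_WIDTH + GAP);
            StringBuilder svg = new StringBuilder();
            svg.append("<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"")
               .append(width).append("\" height=\"").append(HEIGHT).append("\">");

            for (int i = 0; i < points.getLength(); i++) {
                Element point = (Element) points.item(i);
                int barHeight = Integer.parseInt(point.getAttribute("value")) * SCALE;
                // SVG's y axis points down, so a bar starts at HEIGHT - barHeight.
                svg.append("<rect x=\"").append(i * (BAR_WIDTH + GAP))
                   .append("\" y=\"").append(HEIGHT - barHeight)
                   .append("\" width=\"").append(BAR_WIDTH)
                   .append("\" height=\"").append(barHeight).append("\"/>");
            }
            return svg.append("</svg>").toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not convert data to SVG", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toSvg("<data><point value='3'/><point value='7'/></data>"));
    }
}
```

Because the converter takes XML in and produces XML out, it could serve as one stage in a longer pipeline, for example between a database query and a print-formatting step.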

Similarly, database query, Web services, content aggregation, search, classification, Web publishing and e-book publishing can all be regarded as arising from specific converters in the content management toolkit. When effectively combined, such converters can produce powerful and sophisticated solutions to real-world problems.