Consistent Electronic Publishing from Inconsistent Sources
Track: Publishing, Knowledge Management, Integration
Audience Level: High Level/Technical View
Time: Wednesday, November 17 at 11:45
Keywords: Application Architecture, Content Management, Content Repurposing, Conversion, DocBook, Electronic Publishing, Integration, Java, Knowledge Management, Ontology, PDF, Publishing, SVG, XML, XSL-FO, XSLT
Abstract:
Widespread, consistent use of XML to encode documents throughout an organization has well-understood advantages: content can be re-purposed, re-styled, searched, combined, transformed, rendered or otherwise processed with ease, and existing software can be reused extensively when building a solution.
However, in the real world people do not consistently follow mandated standards for software use, and large organizations rarely have the same software installed everywhere. It is unrealistic to expect that an organization's documents will be widely accessible as XML. To make matters worse, many standard tools for conversion to XML operate at inconsistent semantic levels, or encode content at an inappropriate semantic level. To illustrate this point, one could easily convert all electronic documents to XML with a common schema by opening the documents in their originating applications, taking screen shots, and encoding these bitmaps as long lists of <pixel> elements. However, this encoding would be useless for a content management system, or indeed for anything other than re-rendering the same view of the same documents on the same device.
We address the problem of large-scale conversion of heterogeneous content to XML, with an emphasis on the conversions needed to produce useful and compatible semantic representations from source files in various formats. We then discuss the applicability of these ideas to content management systems and electronic publication workflows.
An improved architecture for content management systems arises from the discussion. A wide variety of content management applications can be built simply by assembling appropriate pipelines of converters to, from and between XML languages. This is a lightweight, flexible approach that does not depend on a proprietary content management server because it leverages XML-processing functionality already included in operating systems and Web servers. To make this approach practical, one needs a large and varied toolkit of converters as well as a standard pipeline architecture with which to connect them. We will demonstrate such a toolkit, which uses Apache Cocoon as its pipeline architecture. As an example, we will show how our toolkit was applied to the paper publication workflows of this conference.
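The pipeline idea can be sketched in a few lines of Python. This is only an illustration of composing converters, not the toolkit or Cocoon itself: the `pipeline` helper, the three converter functions, and the `doc`, `article`, `title` and `para` element names are all assumptions made up for this example.

```python
import xml.etree.ElementTree as ET
from functools import reduce
from typing import Callable

# A converter takes one XML tree and returns another (hypothetical type alias).
Converter = Callable[[ET.Element], ET.Element]


def pipeline(*stages: Converter) -> Converter:
    """Compose converters left to right into a single converter."""
    def run(doc: ET.Element) -> ET.Element:
        return reduce(lambda current, stage: stage(current), stages, doc)
    return run


def text_to_xml(raw: str) -> ET.Element:
    """Source stage: wrap blank-line-separated text blocks in <para> elements."""
    root = ET.Element("doc")
    for block in raw.split("\n\n"):
        if block.strip():
            # Collapse internal line breaks and runs of spaces.
            ET.SubElement(root, "para").text = " ".join(block.split())
    return root


def first_para_to_title(doc: ET.Element) -> ET.Element:
    """XML-to-XML stage: treat the first paragraph as the document title."""
    out = ET.Element("article")
    for index, para in enumerate(doc.findall("para")):
        ET.SubElement(out, "title" if index == 0 else "para").text = para.text
    return out


def to_html(doc: ET.Element) -> ET.Element:
    """Output stage: map semantic elements to HTML elements."""
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    for child in doc:
        ET.SubElement(body, "h1" if child.tag == "title" else "p").text = child.text
    return html


# Build one publishing pipeline from independent converters.
publish = pipeline(first_para_to_title, to_html)
source = "Quarterly Report\n\nSales rose in\nthe third quarter."
result = ET.tostring(publish(text_to_xml(source)), encoding="unicode")
print(result)
```

Because each stage only consumes and produces XML, a different output (for example a print-oriented one) would be obtained by swapping the last stage rather than rewriting the whole chain.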
Cocoon pipeline definitions are created in XML "sitemap" files, allowing new processes to be configured without writing code. To determine which converters are needed to make useful pipelines, it is helpful to regard each standard document-processing function as a conversion. In particular,
• retrieval of content from a standard desktop application file is a binary to XML conversion
• reconstruction of semantics implied by styling is possible with a profile-driven XML to XML conversion
• data-driven graphs, charts and maps are made possible by XML to SVG conversions
• documents can be prepared for print via XML to XSL-FO to PDF conversions
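Two of the conversions listed above can be illustrated with a minimal Python sketch. It is not the toolkit described here: the `PROFILE` mapping, the `<p style="...">` input format, the `<series>`/`<point>` data format, and both function names are hypothetical and chosen only for the example. The target element names `title`, `para` and `programlisting` are borrowed from DocBook.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile: maps a presentational style name to a semantic element.
PROFILE = {"Heading 1": "title", "Body Text": "para", "Code": "programlisting"}


def apply_profile(styled: ET.Element, profile: dict, default: str = "para") -> ET.Element:
    """Profile-driven XML-to-XML conversion: recover semantics implied by styling."""
    out = ET.Element("article")
    for p in styled.iter("p"):
        # Styles missing from the profile fall back to a plain paragraph.
        tag = profile.get(p.get("style", ""), default)
        ET.SubElement(out, tag).text = p.text
    return out


def bars_to_svg(data: ET.Element, bar_width: int = 20, gap: int = 5,
                height: int = 100) -> ET.Element:
    """XML-to-SVG conversion: draw one <rect> per <point>, scaled to the largest value."""
    values = [float(point.get("value")) for point in data.findall("point")]
    top = max(values, default=0.0) or 1.0  # avoid dividing by zero
    svg = ET.Element("svg", {
        "xmlns": "http://www.w3.org/2000/svg",  # written out as a plain attribute
        "width": str(len(values) * (bar_width + gap)),
        "height": str(height),
    })
    for index, value in enumerate(values):
        bar = round(height * value / top)
        ET.SubElement(svg, "rect", {
            "x": str(index * (bar_width + gap)),
            "y": str(height - bar),  # SVG y grows downward
            "width": str(bar_width),
            "height": str(bar),
        })
    return svg


styled = ET.fromstring(
    '<doc><p style="Heading 1">Intro</p>'
    '<p style="Body Text">Hello.</p>'
    '<p style="Quote">Unmapped.</p></doc>'
)
semantic = apply_profile(styled, PROFILE)
chart = bars_to_svg(ET.fromstring('<series><point value="5"/><point value="10"/></series>'))
print(ET.tostring(semantic, encoding="unicode"))
print(ET.tostring(chart, encoding="unicode"))
```

Keeping the style-to-semantics mapping in a separate profile means a new house style can be supported by supplying a different mapping, without changing the converter.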
Similarly, database query, Web services, content aggregation, search, classification, Web publishing and e-book publishing can all be regarded as arising from specific converters in the content management toolkit. When effectively combined, such converters can produce powerful and sophisticated solutions to real-world problems.