XML Europe 2002

Word and YAWC: A Poor Man's XML Publishing Environment

Abstract

Buying, assembling and deploying an XML content publishing system is expensive and difficult. This presentation demonstrates how Microsoft Word and YAWC, a Word to XML converter with some extra features, together provide the basis of a cheap and cheerful alternative.

The benefits actually realised from full-scale Content Management Systems often do not justify the cost, in either time or money, of implementing them. Rather than taking this route, many organisations may be better served by the more limited goal of implementing a simple content publishing system, without all the management overhead. This goal is also more appropriate given the relative immaturity of the current CMS marketplace. A simple content publishing system can be implemented with Microsoft Word, XML, and a Word to XML converter.

Many of the cheaper Word to XML converters treat Word documents as once-off sources of XML content, rather than as ongoing master source documents. Word, both the document format and the editor, is considered part of a conversion problem to solve, rather than part of a content maintenance solution. YAWC Pro (http://www.yawcpro.com/) takes a different approach, treating Word as part of the content editing and maintenance solution for XML information. In addition to conversion, YAWC offers both a built-in authoring assistance interface for the Word editor, and an XML publishing solution based on XSLT. Together, these three components, all integrated within Word, offer a powerful but cheap and easy-to-use XML content maintenance and publishing solution.

Table of Contents

1. Overview
2. Authoring
3. Conversion
4. Publication
5. Summary
Glossary
Biography

Organisations that publish large amounts of information, whether online or in print, face significant information management challenges. Volume is increasing, but budgets are not; publishing cycles are shorter, but resources are tighter. Public sector organisations in particular must publish both in print and on the web, and many governments have set tough targets for making information available online in a timely way, and for ensuring equal access for all citizens through the WAI Accessibility Guidelines.

Content Management Systems (CMSs) are often sold as the magic bullet solution to all publishing problems, and it would be wonderful if this were so. It is very tempting to believe that a CMS, once installed, will make your life as a publisher easier. In reality, CMSs, like much else, have been oversold and have under-delivered. There are a variety of reasons for this, both technical and non-technical. It is now recognised that a CMS is really a suite of tools which can be used to build a solution tailored to a particular organisation's needs, rather than an out-of-the-box solution in its own right. The cost, complexity and time required for a CMS installation are very high, and the direct benefits may not justify the effort and pain involved.

In addition, choosing a CMS is very difficult in the current immature state of the market. Quite apart from the technical merits of the many competing offerings, you are also choosing a relationship with a particular product vendor and integrator. There is no guarantee that the provider and solution you choose will still be around and supported in one year's time, never mind the five-to-ten-year timescale that you should be planning on.

Many of the biggest benefits of implementing a CMS solution are not directly related to the actual technology at all, but are side-effects of the process of choosing and implementing the system. The typical stages in such a project include needs analysis, requirements definition, system specification, learning about new technologies, evaluating systems, and defining processes, among many other things. This process, which should involve many publishing staff in an organisation, is highly beneficial, as it educates, raises awareness, prepares people for change, and allows inputs, processes and outputs to be reviewed and re-evaluated, leading to a revised and streamlined set of well-defined, well-understood processes and results. For example, one client reduced the number of different types of publication it produces from seven to two as a result of the preparation process.

How, therefore, can organisations realise the softer benefits of defining a formal information management strategy, without incurring the harder costs, problems and worries of choosing and implementing a formal Content Management System solution? I believe that the answer is to adopt a very long-term strategy, focusing on implementing incremental improvements using cheap technologies in the short term (1-2 years), with the option to implement a more formal CMS in the medium and long term (3-5 years). Using a combination of Microsoft Word as the information creation and editing tool, and XML as the underlying information management and publishing technology, is a very cheap, but quite powerful and flexible starting point for such a strategy. This paper describes how such a system can be implemented.

1. Overview

The basic premise of using Word in combination with XML is that authors can be trusted to create highly structured content in Word, provided they get the appropriate assistance and training. If you believe that authors cannot do so, then this strategy is not for your organisation. In addition, authors would rather use a tool they are familiar with than learn something new, and organisations would rather use tools they already have than buy something new.

If you can create structured content in Word, then there are a variety of tools that can convert it into XML in an automated way. One such tool is UpCast (http://www.infinity-loop.de/), which the previous speaker has described. Once the content is in XML, XSLT programs can be used to convert it into HTML for web publication, and into XSL-FO for low-quality print publication. Most typesetting applications, such as QuarkXPress, FrameMaker, and 3B2, can import either Word or XML content and typeset it to an acceptable standard for high-quality print publication.

Typically, however, there have been a number of problems with this scenario.

  • Providing a structured environment within Word to support the authoring process is a significant task in itself, involving the development of Word templates with custom styles, VBA macros, dialog boxes, etc.

  • Converting Word into XML requires mapping styles into XML elements and attributes, according to a custom or industry standard DTD, which may involve a significant amount of programming, depending on the particular tool you choose.

  • Finally, publishing your XML content on various dissemination media, web, mobile phone, email, and print (using presentation technologies such as HTML, WML, PDF or ASCII text) requires further substantial programming effort.

These three issues (content creation, conversion, and publication) have normally been considered as separate tasks, to be carried out using different tools, and by different people. Yet Another Word Converter (YAWC) is a 3-in-1 tool, integrating all three steps fairly seamlessly into the Microsoft Word editing environment. This approach greatly reduces the initial setup time and cost for an XML-based publishing environment. It does not address the more complex issues of workflow, version and access control, management reporting, searching, and the host of other features of a full CMS. It does, however, offer a quick, cheap, and relatively painless introduction to the rigours of a formal information management process. It also allows authors to gain experience and expertise in markup, and to see the immediate results of their work as published documents, without the need for intermediaries such as webmasters or even typesetters.

2. Authoring

The YAWC default authoring interface offers a few simple features.

  • A toolbar, menu and shortcut keys for applying paragraph styles such as Titles, Headings and Lists, in addition to the standard character-level styles such as bold, italic and hyperlink

  • A simple verification tool to check that the overall hierarchical structure of a document is correct

  • A dialog box to apply Dublin Core metadata, if required

  • Simple commands to display markup (Style area) and structure (Document map), which most authors are unaware of
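
The kind of structural check mentioned in the list above can be sketched in a few lines. This is only an illustration in Python, not YAWC's actual implementation (which runs inside Word): the function name `check_heading_levels` and the rule it enforces (a heading may go at most one level deeper than the heading before it) are assumptions made for the example.

```python
def check_heading_levels(levels):
    """Return a list of problems found in a sequence of heading levels.

    `levels` holds the level of each heading in document order,
    e.g. [1, 2, 3, 2] for Heading 1, Heading 2, Heading 3, Heading 2.
    The rule checked here is an assumption for illustration only.
    """
    problems = []
    previous = 0  # treat the start of the document as level 0
    for position, level in enumerate(levels, start=1):
        # Going deeper is only allowed one level at a time.
        if level > previous + 1:
            problems.append(
                f"heading {position}: level {level} follows level {previous}"
            )
        previous = level
    return problems


# A well-formed outline produces no problems; a skipped level is reported.
print(check_heading_levels([1, 2, 3, 2]))
print(check_heading_levels([1, 3]))
```

A check of this sort catches the most common authoring slip, a Heading 3 placed directly under a Heading 1, without requiring any knowledge of the target DTD.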

Additional customisations can be easily made using VBA in the normal way, but the basic features go a long way towards encouraging and assisting authors in creating very well marked up content. It is still possible to create a document that does not conform to the target XML DTD required, because there is no formal XML validation of the document. The authoring environment is not really XML-aware, unlike such tools as WorX SE or S4/Text. However, we have found that the quality of the markup is surprisingly good in practice.

3. Conversion

YAWC contains a built-in Word to XML converter. For authors, converting a document is a single step. There is no need to start a different application or save the document into a different format. For customisers, the process of mapping the Word content into XML structures is about as simple as it can be. The 80/20 principle applies: 80% of the conversion can be handled by means of a simple, direct mapping between Word styles and XML elements; the remaining 20% of more difficult constructs can be handled using an XSLT script to re-organise the relevant information into the required XML structure.
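
To make the idea of a direct style-to-element mapping concrete, here is a minimal sketch in Python. It is not YAWC's configuration format or code: the style names, the element names, the `STYLE_TO_ELEMENT` table and the `paragraphs_to_raw_xml` function are all hypothetical, and the input is assumed to be a simple list of (style, text) pairs already extracted from the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Word paragraph style names to XML element names.
STYLE_TO_ELEMENT = {
    "Title": "title",
    "Heading 1": "h1",
    "Normal": "para",
    "List Bullet": "listitem",
}


def paragraphs_to_raw_xml(paragraphs, root_name="document"):
    """Convert (style, text) pairs into a flat 'raw' XML tree."""
    root = ET.Element(root_name)
    for style, text in paragraphs:
        name = STYLE_TO_ELEMENT.get(style)
        if name is None:
            # Unmapped styles fall back to a generic element that keeps
            # the style name, so a later step can still deal with them.
            element = ET.SubElement(root, "para", {"style": style})
        else:
            element = ET.SubElement(root, name)
        element.text = text
    return root


raw = paragraphs_to_raw_xml([
    ("Title", "Annual Report"),
    ("Normal", "Introductory text."),
    ("Quote", "An unmapped style."),
])
xml_text = ET.tostring(raw, encoding="unicode")
print(xml_text)
```

The table covers the easy 80%; anything it cannot express, such as regrouping elements, is left for a later transformation step.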

Although Word, like HTML, is effectively a linear structure with no hierarchy, the conversion process deduces the hierarchy based on the Heading styles used, and makes this explicit in the XML output. Tables are converted into the HTML table model, retaining cell spanning and alignment information. Nested lists are supported, although they must be carefully marked up to convert correctly, and custom document properties and form fields are also converted into XML.
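
Deducing a hierarchy from a flat sequence of headings is commonly done with a stack of open sections. The following Python sketch shows one way to do it; it is an illustration of the general technique, not YAWC's own algorithm, and the function name `nest_by_headings`, the `section`/`title`/`para` element names and the (level, text) input format are assumptions.

```python
import xml.etree.ElementTree as ET


def nest_by_headings(items):
    """Build nested <section> elements from a flat list of items.

    Each item is (level, text): level 1..n for a heading of that level,
    or level 0 for an ordinary paragraph.
    """
    root = ET.Element("document")
    # Stack of (heading level, element); the root counts as level 0.
    stack = [(0, root)]
    for level, text in items:
        if level == 0:
            # Body text belongs to the innermost open section.
            ET.SubElement(stack[-1][1], "para").text = text
            continue
        # Close every open section at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        section = ET.SubElement(stack[-1][1], "section")
        ET.SubElement(section, "title").text = text
        stack.append((level, section))
    return root


tree = nest_by_headings([
    (1, "Overview"),
    (0, "Intro"),
    (2, "Details"),
    (0, "More"),
    (1, "Summary"),
])
print(ET.tostring(tree, encoding="unicode"))
```

Here "Details" ends up as a section nested inside "Overview", while "Summary" closes both and becomes a second top-level section.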

Some Word artifacts are deliberately not converted, in order to keep things simple. Colours, fonts, point sizes, headers, footers, etc. are all ignored.

Two output target DTDs are supported by default: HTML and DocBook. These are the most widely used DTDs, and can be used as the basis for developing custom output DTDs if required.

As an aside, many organisations tend to engage in a long and tedious process of document modelling as part of the move to a more structured information management strategy. I would suggest that ignoring or postponing this step, and simply choosing to start with DocBook, allows you to get going in a very short time, which is worth much more than the possible slight loss of precision from using a generic document model.

4. Publication

Publishing content is sometimes just a simple matter of converting Word into HTML. More often, however, multiple presentation outputs are required, such as PDF, HTML, and even WML or text. YAWC supports the generation of multiple outputs in a single operation. The publishing process involves a number of steps.

  1. Create structured content in Word.

  2. Convert content to 'raw' XML format using style to element mapping.

  3. Convert 'raw' XML to target XML DTD (e.g. DocBook) using XSLT post-processing script.

  4. Convert target XML to output presentation format (HTML, WML, text, PDF) using further XSLT scripts.

Any number of XSLT post-processing steps are supported, so the conversion process doesn't just deliver the target XML you want, but also any further downstream outputs you need. For example, you can generate DocBook XML, and publish it in HTML and PDF format using Norm Walsh's DocBook XSL stylesheets, all in a single step.
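
The shape of this multi-output process can be sketched as one conversion to the target format followed by a fan-out to each presentation format. In the Python sketch below, ordinary functions stand in for the XSLT scripts; `run_pipeline` and the three toy conversion functions are hypothetical and do nothing more than string replacement, so this shows only the control flow, not a real transformation.

```python
def run_pipeline(raw_xml, to_target, output_steps):
    """Apply one target-format step, then every output step to its result.

    `to_target` and each value in `output_steps` are callables that take
    and return a string; here they stand in for XSLT scripts.
    """
    target_xml = to_target(raw_xml)
    outputs = {"target": target_xml}
    for name, step in output_steps.items():
        outputs[name] = step(target_xml)
    return outputs


# Toy stand-ins for real stylesheets (purely illustrative).
def raw_to_docbook(xml):
    return xml.replace("<h1>", "<title>").replace("</h1>", "</title>")


def docbook_to_html(xml):
    return xml.replace("<title>", "<h1>").replace("</title>", "</h1>")


def docbook_to_text(xml):
    return xml.replace("<title>", "").replace("</title>", "\n")


results = run_pipeline(
    "<h1>Overview</h1>",
    raw_to_docbook,
    {"html": docbook_to_html, "text": docbook_to_text},
)
for name, content in results.items():
    print(name, repr(content))
```

Because every output is derived from the same target XML, adding a new presentation format means adding one more entry to `output_steps` rather than another conversion from Word.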

Some additional features of the HTML publication stage are worth a mention.

  • The default HTML output is compliant with the WAI Accessibility Guidelines Level 2 (http://www.w3.org/WAI/). This ensures that everyone can read your web pages, which is particularly important for public sector websites in Europe, as the eEurope Action Plan (http://europa.eu.int/information_society/eeurope/index_en.htm) specifies that all EU government websites comply with WAI by 2002.

  • The Dublin Core Metadata Element Set (http://dublincore.org/documents/1999/07/02/dces/) is supported, and authors can maintain the relevant fields in a dialog box in Word. This improves the searchability of information, as search engines use DC elements when indexing web pages.

  • True one-click web publishing is supported. Using the technique of style-free stylesheets (http://www.xml.com/pub/a/2000/07/26/xslt/xsltstyle.html), YAWC automatically places an HTML template containing your specific site navigation furniture around the content, so that it is ready for immediate publication on a website. FTP support is also built in, so YAWC can upload the page directly to its final destination on a webserver. YAWC uses the DC.Identifier field to calculate the exact destination folder for a given document.

These benefits are realisable because the information is first converted to XML before generating HTML. The XSLT stylesheets packaged with YAWC do all the work of ensuring WAI compliance and placing DC metadata in the HTML head element. Although many Word to HTML converters exist, they generally do not support these particular features. Because they do not use XML as the intermediary, they focus narrowly on web publication, which is OK for small corporate websites, but not for larger information publishers.
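
As an illustration of what placing DC metadata in the HTML head element amounts to, the Python sketch below renders a set of fields using the common convention of `DC.`-prefixed `meta` names plus a `schema.DC` link. It is not taken from the YAWC stylesheets, which are written in XSLT; the function name `dublin_core_meta` and the sample field values are assumptions.

```python
from html import escape


def dublin_core_meta(fields):
    """Render Dublin Core fields as HTML <meta> elements.

    `fields` maps element names such as "Title" to their values.
    """
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for name, value in fields.items():
        # Escape both parts so the values are safe inside attributes.
        lines.append(
            f'<meta name="DC.{escape(name)}" content="{escape(value)}">'
        )
    return "\n".join(lines)


head_fragment = dublin_core_meta({
    "Title": "Word & YAWC",
    "Creator": "Eoin Campbell",
})
print(head_fragment)
```

The important point is that the values come from the fields the author maintains in Word, so the metadata in the published page cannot drift out of step with the source document.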

5. Summary

CMSs are expensive, difficult to implement, and involve taking a gamble about which vendors will still be around in 5 years. They do not always deliver the benefits we are led to expect. A low-risk and low-cost alternative is to use Word for content creation and XML for content publication, in order to learn about the information management process, and develop valuable experience in managing and publishing XML information. After all, the real goal for most organisations is to be able to publish efficiently and cost-effectively.

YAWC, a plug-in for Word, is designed to address the need to publish quickly and cheaply, and is already used by a variety of publishers, including newspapers, government departments, research institutes, and academic publishers.

Glossary

CMSs

Content Management Systems

YAWC

Yet Another Word Converter

Biography

Eoin Campbell