Abstract
Buying, assembling and deploying an XML content publishing system is expensive and difficult. This presentation demonstrates how Microsoft Word and YAWC, a Word to XML converter with some extra features, together provide the basis of a cheap and cheerful alternative.
The actual realised benefits of full-scale Content Management Systems often do not justify the cost in either time or money that goes into implementing them. Rather than choose this route, a more limited goal of implementing a simple content publishing system, without all the management overhead, may be more suitable for many organisations, and more appropriate given the relative immaturity of the current CMS marketplace. A simple content publishing system can be implemented with Microsoft Word, XML, and a Word to XML converter.
Many of the cheaper Word to XML converters treat Word documents as once-off sources of XML content, rather than as ongoing master source documents. Word, both the document format and the editor, is considered part of a conversion problem to solve, rather than part of a content maintenance solution. YAWC Pro (http://www.yawcpro.com/) takes a different approach, treating Word as part of the content editing and maintenance solution for XML information. In addition to conversion, YAWC offers both a built-in authoring assistance interface for the Word editor, and an XML publishing solution based on XSLT. Together, these three components, all integrated within Word, offer a powerful but cheap and easy-to-use XML content maintenance and publishing solution.
The basic premise of using Word in combination with XML is that authors can be trusted to create highly structured content in Word, provided they get the appropriate assistance and training. If you believe that authors cannot do so, then this strategy is not for your organisation. In addition, authors would rather use a tool they are familiar with than learn something new, and organisations would rather use tools they already have than buy something new.
If you can create structured content in Word, then there are a variety of tools that can convert it into XML in an automated way. One such tool is UpCast (http://www.infinity-loop.de/), which the previous speaker has described. Once the content is in XML, XSLT programs can be used to convert it into HTML for web publication, and into XSL-FO for low-quality print publication. Most typesetting applications, such as QuarkXPress, FrameMaker, and 3B2, can import either Word or XML content and typeset it to an acceptable standard for high-quality print publication.
Typically, however, there have been a number of problems with this scenario.
Providing a structured environment within Word to support the authoring process is a significant task in itself, involving the development of Word templates with custom styles, VBA macros, dialog boxes, etc.
Converting Word into XML requires mapping styles into XML elements and attributes, according to a custom or industry standard DTD, which may involve a significant amount of programming, depending on the particular tool you choose.
Finally, publishing your XML content on various dissemination media, web, mobile phone, email, and print (using presentation technologies such as HTML, WML, PDF or ASCII text) requires further substantial programming effort.
These three issues (content creation, conversion, and publication) have normally been considered as separate tasks, to be carried out using different tools, and by different people. Yet Another Word Converter (YAWC) is a 3-in-1 tool, integrating all three steps fairly seamlessly into the Microsoft Word editing environment. This approach greatly reduces the initial setup time and cost for an XML-based publishing environment. It does not address the more complex issues of workflow, version and access control, management reporting, searching, and the host of other features of a full CMS. It does, however, offer a quick, cheap, and relatively painless introduction to the rigours of a formal information management process. It also allows authors to gain experience and expertise in markup, and to see the immediate results of their work as published documents, without the need for intermediaries such as webmasters or even typesetters.
The YAWC default authoring interface offers a few simple features.
A toolbar, menu and shortcut keys for applying styles such as Titles, Headings and Lists, in addition to the standard character level styles such as bold, italic and hyperlink
A simple verification tool to check that the overall hierarchical structure of a document is correct
A dialog box to apply Dublin Core metadata, if required
Simple commands to display markup (Style area) and structure (Document map), which most authors are unaware of
Additional customisations can be easily made using VBA in the normal way, but the basic features go a long way towards encouraging and assisting authors in creating very well marked up content. It is still possible to create a document that does not conform to the target XML DTD required, because there is no formal XML validation of the document. The authoring environment is not really XML-aware, unlike such tools as WorX SE or S4/Text. However, we have found that the quality of the markup is surprisingly good in practice.
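The kind of check a simple structure verification tool might perform can be sketched in a few lines. This is an illustration only, not YAWC's implementation (which lives inside Word): it assumes a document has already been reduced to a list of paragraph style names, and the function name `check_heading_structure` is hypothetical. It reports any heading that skips a level, such as a Heading 3 directly under a Heading 1.

```python
import re


def check_heading_structure(styles):
    """Return a list of problems found in a sequence of paragraph style names."""
    problems = []
    previous_level = 0  # 0 means "no heading seen yet"
    for position, style in enumerate(styles, start=1):
        match = re.fullmatch(r"Heading (\d)", style)
        if match is None:
            continue  # body text, lists, etc. do not affect the hierarchy
        level = int(match.group(1))
        if level > previous_level + 1:
            problems.append(
                f"paragraph {position}: {style} follows level {previous_level}"
            )
        previous_level = level
    return problems


# Hypothetical style sequences, one well formed and one not
good = ["Title", "Heading 1", "Normal", "Heading 2", "Heading 1"]
bad = ["Title", "Heading 2", "Normal", "Heading 1", "Heading 3"]

for problem in check_heading_structure(bad):
    print(problem)
```

A check like this catches structural slips early, but it is not a substitute for validation against the target DTD.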
YAWC contains a built-in Word to XML converter. For authors, converting a document is a single step. There is no need to start a different application or save the document into a different format. For customisers, the process of mapping the Word content into XML structures is about as simple as it can be. The 80/20 principle applies: 80% of the conversion can be handled by means of a simple, direct mapping between Word styles and XML elements; the remaining 20% of more difficult constructs can be handled using an XSLT script to re-organise the relevant information into the required XML structure.
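The direct style-to-element mapping can be pictured as a simple lookup table. The sketch below is a hedged illustration rather than YAWC's actual mapping format: the style names, the element names, and the `unmapped` fallback element are all assumptions made for the example. Paragraphs are represented as (style, text) pairs, and anything without a direct mapping is kept and flagged so that a later post-processing step could deal with it.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Word style names to XML element names
STYLE_TO_ELEMENT = {
    "Title": "title",
    "Normal": "para",
    "List Bullet": "listitem",
}


def to_raw_xml(paragraphs, root_name="document"):
    """Convert (style, text) pairs into a flat 'raw' XML tree."""
    root = ET.Element(root_name)
    for style, text in paragraphs:
        name = STYLE_TO_ELEMENT.get(style)
        if name is None:
            # No direct mapping: keep the content and record the style,
            # leaving the harder cases for a later transformation step
            element = ET.SubElement(root, "unmapped", {"style": style})
        else:
            element = ET.SubElement(root, name)
        element.text = text
    return root


paragraphs = [
    ("Title", "Annual Report"),
    ("Normal", "Sales rose."),
    ("Pull Quote", "A good year."),
]
raw = to_raw_xml(paragraphs)
print(ET.tostring(raw, encoding="unicode"))
```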
Although Word, like HTML, is effectively a linear structure with no hierarchy, the conversion process deduces the hierarchy based on the Heading styles used, and makes this explicit in the XML output. Tables are converted into the HTML table model, retaining cell spanning and alignment information. Nested lists are supported, although they must be carefully marked up to convert correctly, and custom document properties and form fields are also converted into XML.
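One common way to deduce a hierarchy from a flat run of headings is to keep a stack of open sections. The following sketch shows that general technique; it is not taken from YAWC, and the `section`, `title` and `para` element names are assumptions chosen to resemble DocBook. Each new heading closes every open section at the same or a deeper level, then opens a new section inside whatever remains.

```python
import re
import xml.etree.ElementTree as ET


def build_hierarchy(paragraphs):
    """Nest a flat list of (style, text) pairs into sections by heading level."""
    root = ET.Element("document")
    stack = [(0, root)]  # (heading level, element) pairs, innermost last
    for style, text in paragraphs:
        match = re.fullmatch(r"Heading (\d)", style)
        if match:
            level = int(match.group(1))
            # Close every open section at the same or a deeper level
            while stack[-1][0] >= level:
                stack.pop()
            section = ET.SubElement(stack[-1][1], "section")
            ET.SubElement(section, "title").text = text
            stack.append((level, section))
        else:
            # Non-heading content belongs to the innermost open section
            ET.SubElement(stack[-1][1], "para").text = text
    return root


flat = [
    ("Heading 1", "Introduction"),
    ("Normal", "Some text."),
    ("Heading 2", "Background"),
    ("Normal", "More text."),
    ("Heading 1", "Method"),
]
tree = build_hierarchy(flat)
print(ET.tostring(tree, encoding="unicode"))
```

In this example "Background" ends up nested inside "Introduction", while "Method" starts a new top-level section.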
Some Word artifacts are deliberately not converted, in order to keep things simple. Colours, fonts, point sizes, headers, footers, etc. are all ignored.
Two output target DTDs are supported by default: HTML and DocBook. These are among the most widely used DTDs, and can be used as the basis for developing custom output DTDs if required.
As an aside, many organisations tend to engage in a long and tedious process of document modelling as part of the move to a more structured information management strategy. I would suggest that ignoring or postponing this step, and simply choosing to start with DocBook, allows you to get going in a very short time, which is worth much more than the possible slight loss of precision from using a generic document model.
Publishing content is sometimes just a simple matter of converting Word into HTML. More often, however, multiple presentation outputs are required, such as PDF, HTML, and even WML or text. YAWC supports the generation of multiple outputs in a single operation. The publishing process involves a number of steps.
Create structured content in Word.
Convert content to 'raw' XML format using style-to-element mapping.
Convert 'raw' XML to target XML DTD (e.g. DocBook) using XSLT post-processing script.
Convert target XML to output presentation format (HTML, WML, text, PDF) using further XSLT scripts.
Any number of XSLT post-processing steps are supported, so the conversion process doesn't just deliver the target XML you want, but also any further downstream outputs you need. For example, you can generate DocBook XML, and publish it in HTML and PDF format using Norm Walsh's DocBook XSL stylesheets, all in a single step.
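The shape of the publishing steps above, one conversion to the target XML followed by several renderings of it, can be sketched as follows. This is only an outline of the flow under stated assumptions: plain Python functions stand in for the XSLT scripts, and every name here (`publish`, `to_target`, `to_html`, `to_text`) is hypothetical rather than part of YAWC.

```python
import xml.etree.ElementTree as ET
from html import escape


def to_target(raw):
    """Stand-in for the post-processing script: re-root the content as an article."""
    target = ET.Element("article")
    target.extend(list(raw))
    return target


def to_html(target):
    """Stand-in for an HTML presentation stylesheet."""
    body = "".join(f"<p>{escape(p.text)}</p>" for p in target.iter("para"))
    return f"<html><body>{body}</body></html>"


def to_text(target):
    """Stand-in for a plain text presentation stylesheet."""
    return "\n\n".join(p.text for p in target.iter("para"))


def publish(raw, post_process, renderers):
    """Build the target XML once, then produce every requested output from it."""
    target = post_process(raw)
    return {name: render(target) for name, render in renderers.items()}


raw = ET.fromstring("<document><para>Hello</para><para>World</para></document>")
outputs = publish(raw, to_target, {"html": to_html, "text": to_text})
print(outputs["html"])
```

Because every output is derived from the same target XML, adding a new presentation format means adding one more renderer rather than another conversion from Word.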
Some additional features of the HTML publication stage are worth a mention.
The default HTML output is compliant with the WAI Accessibility Guidelines Level 2 (http://www.w3.org/WAI/). This ensures that everyone can read your web pages, which is particularly important for public sector websites in Europe, as the eEurope Action Plan (http://europa.eu.int/information_society/eeurope/index_en.htm) specifies that all EU government websites comply with WAI by 2002.
The Dublin Core Metadata Element Set (http://dublincore.org/documents/1999/07/02/dces/) is supported, and authors can maintain the relevant fields in a dialog box in Word. This improves the searchability of information, as search engines use DC elements when indexing web pages.
True one-click web publishing is supported. Using the technique of style-free stylesheets (http://www.xml.com/pub/a/2000/07/26/xslt/xsltstyle.html), YAWC automatically places an HTML template containing your specific site navigation furniture around the content, so that it is ready for immediate publication on a website. FTP support is also built in, so YAWC can upload the page directly to its final destination on a webserver. YAWC uses the DC.Identifier field to calculate the exact destination folder for a given document.
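The two ideas in this step, wrapping content in a site template and deriving an upload location from the identifier, can be illustrated with a small sketch. Both parts are assumptions: the source does not say how YAWC computes the folder from DC.Identifier, so the example simply assumes the identifier is the page's final URL, and a string placeholder stands in for the XSLT-based template technique. The template text and function names are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical site template; a real one would carry the full navigation furniture
TEMPLATE = (
    "<html><body>"
    "<div id='nav'>Home | News</div>"
    "<div id='content'>{content}</div>"
    "</body></html>"
)


def wrap_in_template(content_html, template=TEMPLATE):
    """Place the converted content inside the site template."""
    return template.replace("{content}", content_html)


def destination_for(identifier):
    """Split a URL-style identifier into (remote folder, file name).

    Assumes the identifier is the final URL of the page.
    """
    path = urlparse(identifier).path
    folder, filename = posixpath.split(path)
    return folder or "/", filename or "index.html"


page = wrap_in_template("<p>Product launch announced.</p>")
folder, filename = destination_for("http://www.example.org/news/2002/launch.html")
print(folder, filename)
```

With the folder and file name in hand, an upload step could then send the wrapped page to the server, for instance over FTP.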
These benefits are realisable because the information is first converted to XML before generating HTML. The XSLT stylesheets packaged with YAWC do all the work of ensuring WAI compliance and placing DC metadata in the HTML head element. Although many Word to HTML converters exist, they generally do not support these particular features. Because they do not use XML as the intermediary, they focus narrowly on web publication, which is OK for small corporate websites, but not for larger information publishers.
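As an illustration of what placing DC metadata in the HTML head element amounts to, the sketch below generates `meta` elements using the common `DC.` name prefix together with a `schema.DC` link. It is a Python stand-in, not the XSLT that ships with YAWC; the function name and the sample field values are hypothetical.

```python
from html import escape


def dc_meta_tags(metadata):
    """Render Dublin Core fields as lines for an HTML head element."""
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for field, value in metadata.items():
        # escape() also escapes quotes, so values are safe inside attributes
        lines.append(f'<meta name="DC.{field}" content="{escape(value)}">')
    return "\n".join(lines)


tags = dc_meta_tags({"Title": "Word & XML", "Creator": "A. Author"})
print(tags)
```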
CMSs are expensive, difficult to implement, and involve taking a gamble about which vendors will still be around in 5 years. They do not always deliver the benefits we are led to expect. A low-risk and low-cost alternative is to use Word for content creation and XML for content publication, in order to learn about the information management process, and develop valuable experience in managing and publishing XML information. After all, the real goal for most organisations is to be able to publish efficiently and cost-effectively.
YAWC, a plug-in for Word, is designed to address the need to publish quickly and cheaply, and is already used by a variety of publishers, including newspapers, government departments, research institutes, and academic publishers.
Design & Development by deepX Ltd. 2002