XML 2003

WordML for Typesetters, Compositors, and Stylesheet Designers

Abstract

WordML is the W3C XML Schema behind Microsoft Word 2003. More than just a style description language, it encompasses every aspect of a Microsoft Word document and has become the native format for Word. If you are planning to implement an XML solution in Office 2003, you'll need to learn WordML. Although WordML resembles other style description languages, creating a fully functional XML authoring environment in Word 2003 requires XSL transforms that merge your XML instance with WordML markup. Multiple transforms can be created to show different "views," each applying a specific set of styles. In this session we'll talk about what can and can't be done using WordML, review each of the major element groups, and discuss in detail paragraph styles, character styles, and tables.

Table of Contents

1. Overview of WordML
2. Core Structures
2.1. w:lists
2.2. w:styles
2.3. w:body
2.3.1. w:p
2.3.1.1. w:pPr
2.3.1.2. w:r
2.3.1.3. w:t
2.3.2. w:tbl
2.3.3. w:sectPr
3. Pitfalls
Acknowledgements
Bibliography
Biography

1. Overview of WordML

WordML, whose namespace is http://schemas.microsoft.com/office/word/2003/wordml, and its sibling, the Office namespace urn:schemas-microsoft-com:office:office, codify virtually every piece of information related to a Microsoft Word 2003 document instance. The easiest way to learn about the structure of a Word 2003 document is to save it as XML and then view it in a text editor. Be forewarned, however: there is a significant amount of overhead associated with each file. If you have ever examined the RTF version of a Word document, or saved a Word document as HTML, you may remember the enormous size of a file that contains only a line or two of text. The overhead is what makes it possible for another user to open your Word document and see it exactly as it appears on your screen; while unwieldy, it has its purpose.

There are ten basic structures contained within the w:wordDocument root-level element (out of sixteen possible child elements). All but the last can be considered metadata; that is, they contain information about the document rather than the document content. The main structures are as follows:

o:DocumentProperties

part of the office namespace, it includes all of the standard document properties such as title, author, creation date, statistics, etc.

o:CustomDocumentProperties

part of the office namespace, it includes custom document properties as defined on the custom properties tab of the document properties panel.

w:fonts

available fonts

w:lists

list definitions - contains each of the possible list styles as well as any modifications made by the stylesheet designer or end user

w:styles

style definitions - contains all of the default styles as well as any document-specific styles created by the stylesheet designer or end user; includes paragraph, character, and table styles

w:divs

w:docOleData

storage for OLE objects

w:docSuppData

toolbar customizations, VBA

w:docPr

storage for settings in each of the various options panels such as view, print, save, page setup, etc.

w:body

the actual content of the instance
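As a quick way to see which of these structures a given file actually contains, a short script can list the children of the root element. The following is a minimal sketch in Python using only the standard library; the SAMPLE string is a hypothetical, heavily trimmed stand-in for a file saved from Word (it is not schema-valid), and the function name root_children is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs quoted in the overview; the prefixes are the conventional ones.
PREFIXES = {
    "http://schemas.microsoft.com/office/word/2003/wordml": "w",
    "urn:schemas-microsoft-com:office:office": "o",
}


def root_children(xml_text):
    """Return the prefixed names of the root element's children, in order."""
    root = ET.fromstring(xml_text)
    names = []
    for child in root:
        if child.tag.startswith("{"):
            # ElementTree stores qualified names as "{namespace-uri}local".
            uri, _, local = child.tag[1:].partition("}")
            names.append(f"{PREFIXES.get(uri, '?')}:{local}")
        else:
            names.append(child.tag)
    return names


# Hypothetical, trimmed stand-in for a document saved as XML from Word.
SAMPLE = """
<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:o="urn:schemas-microsoft-com:office:office">
  <o:DocumentProperties><o:Title>Sample</o:Title></o:DocumentProperties>
  <w:fonts/>
  <w:styles/>
  <w:docPr/>
  <w:body><w:p><w:r><w:t>One line of text.</w:t></w:r></w:p></w:body>
</w:wordDocument>
""".strip()

print(root_children(SAMPLE))
```

Run against a real file, the same function gives a compact picture of how much of the document is metadata and how little is body content.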

2. Core Structures

The areas of most concern when migrating an existing XML instance into Word 2003, or when transforming a Word 2003 document instance into another vocabulary (such as that of a composition system or a particular XML schema), are lists, styles, and body content.

When writing a transform to WordML, you may find it easier to first create your template document in Word. Once you have set up all of the styles, lists, and other document settings you want in your result document, save the template as an XML file. All of that information will then have been created in WordML for you.
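The template approach can be sketched outside XSL as well. The Python example below is an illustration only, not the transform discussed in this paper: it assumes a simplified template in which paragraphs are direct children of w:body (files saved by Word can wrap body content in additional auxiliary elements, which this sketch does not handle). The names fill_template and paragraph, and the "Title" style, are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W)


def q(local):
    """Qualify a local name with the WordML namespace."""
    return f"{{{W}}}{local}"


def paragraph(text, style_id):
    """Build a w:p that references a paragraph style and holds one run."""
    p = ET.Element(q("p"))
    p_pr = ET.SubElement(p, q("pPr"))
    ET.SubElement(p_pr, q("pStyle"), {q("val"): style_id})
    r = ET.SubElement(p, q("r"))
    ET.SubElement(r, q("t")).text = text
    return p


def fill_template(template_xml, items):
    """Replace the template's body paragraphs with (text, style_id) items.

    Everything outside w:body (styles, lists, settings) is left untouched,
    and any trailing w:sectPr stays at the end of the body.
    """
    root = ET.fromstring(template_xml)
    body = root.find(q("body"))
    for old in body.findall(q("p")):  # direct children only
        body.remove(old)
    for position, (text, style_id) in enumerate(items):
        body.insert(position, paragraph(text, style_id))
    return ET.tostring(root, encoding="unicode")


# Hypothetical, trimmed template; a real one saved from Word is far larger.
TEMPLATE = """
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:styles>
    <w:style w:type="paragraph" w:styleId="Title"><w:name w:val="Title"/></w:style>
  </w:styles>
  <w:body>
    <w:p><w:r><w:t>Placeholder text</w:t></w:r></w:p>
    <w:sectPr/>
  </w:body>
</w:wordDocument>
""".strip()

result = fill_template(TEMPLATE, [("WordML for Typesetters", "Title")])
print(result)
```

The point of the design is that the transform only ever writes body content; the styles and list definitions come from the template unchanged.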

2.1. w:lists

2.2. w:styles

2.3. w:body

The body element contains the actual content of the document instance. It has three basic children: paragraphs, tables, and section break properties.

2.3.1. w:p

The paragraph element contains all of the information necessary to format a single block of text. WordML takes the approach (espoused by many XML developers) that mixed content should never exist. This is counterintuitive to most document-centric XML developers and users, and it results in a fairly substantial increase in the markup to be dealt with.

2.3.1.1. w:pPr

This is the properties element associated with a paragraph. Most importantly, it contains the w:pStyle element, which references a particular style as defined in the w:styles section of the document instance.
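Because w:pStyle is only a reference, a generated document can name a style that the w:styles section never defines. The following Python sketch shows one way to check for that; the function name unresolved_styles and the sample document (including its "Heading1" and "Abstract" style IDs) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}


def unresolved_styles(xml_text):
    """Return style IDs referenced by w:pStyle in the body but not defined."""
    root = ET.fromstring(xml_text)
    defined = {
        style.get(f"{{{W}}}styleId")
        for style in root.findall("w:styles/w:style", NS)
    }
    body = root.find("w:body", NS)
    used = {p_style.get(f"{{{W}}}val") for p_style in body.iter(f"{{{W}}}pStyle")}
    return sorted(used - defined)


# Hypothetical, trimmed document: one defined style, two referenced styles.
SAMPLE = """
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:styles>
    <w:style w:type="paragraph" w:styleId="Heading1">
      <w:name w:val="heading 1"/>
    </w:style>
  </w:styles>
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:t>Overview of WordML</w:t></w:r>
    </w:p>
    <w:p>
      <w:pPr><w:pStyle w:val="Abstract"/></w:pPr>
      <w:r><w:t>WordML is the schema behind Word 2003.</w:t></w:r>
    </w:p>
  </w:body>
</w:wordDocument>
""".strip()

print(unresolved_styles(SAMPLE))  # -> ['Abstract']
```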

2.3.1.2. w:r

The run element is the leaf container for data. While this might be straightforward in a pure WordML environment, when another schema is used for markup, the current run closes before each schema-specific start tag and a new run opens just after it. For instance, to incorporate an inline emphasis element, you might see the following:

... text </w:t></w:r><ns0:emphasis style="italic"><w:r><w:t>italic ...

2.3.1.3. w:t

While w:r is the leaf container, it actually allows several children, including w:rPr (run properties), w:footnote, and w:pict (picture). The most important of these children is w:t (text).
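The relationship between mixed content and runs can be made concrete with a small conversion. The Python sketch below assumes a hypothetical source vocabulary (a para element with an inline emphasis element in the made-up namespace urn:example:doc) and follows the pattern described for w:r: every stretch of text becomes its own w:r/w:t pair, and the schema-specific inline element is kept as a wrapper around its own run. It handles one level of inline markup only.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W)


def q(local):
    """Qualify a local name with the WordML namespace."""
    return f"{{{W}}}{local}"


def add_run(parent, text):
    """Append a w:r/w:t pair holding text; skip empty strings."""
    if text:
        r = ET.SubElement(parent, q("r"))
        ET.SubElement(r, q("t")).text = text


def para_to_wp(para):
    """Turn a mixed-content paragraph into a w:p with no mixed content."""
    p = ET.Element(q("p"))
    add_run(p, para.text)              # text before the first inline element
    for inline in para:
        # Keep the schema-specific element, but give its text its own run.
        wrapper = ET.SubElement(p, inline.tag, dict(inline.attrib))
        add_run(wrapper, inline.text)
        add_run(p, inline.tail)        # text following the inline element
    return p


# Hypothetical source instance with mixed content.
SOURCE = (
    '<para xmlns="urn:example:doc">Plain text, '
    '<emphasis style="italic">italic text</emphasis>'
    ', then plain text again.</para>'
)

wp = para_to_wp(ET.fromstring(SOURCE))
# The serializer assigns a generated prefix to the example namespace.
print(ET.tostring(wp, encoding="unicode"))
```

One sentence of source text with a single emphasized phrase becomes three runs, which is the increase in markup noted under w:p.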

2.3.2. w:tbl

The table element has two formatting child elements: w:tblPr, which defines table-wide properties, and w:tblGrid, which defines the individual column layouts. Each row element, w:tr, contains a properties child (w:trPr) and the individual cells (w:tc).
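To show how these pieces nest, the following Python sketch builds a small table from rows of strings. It is a simplified illustration: the function name table and the fixed column width are assumptions, the optional w:trPr is omitted, and each cell holds a single paragraph. Widths are given in twentieths of a point, so 4320 corresponds to three inches.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W)


def q(local):
    """Qualify a local name with the WordML namespace."""
    return f"{{{W}}}{local}"


def table(rows, col_width=4320):
    """Build a w:tbl from a list of equal-length rows of cell strings."""
    tbl = ET.Element(q("tbl"))

    # Table-wide properties: let Word size the table automatically.
    tbl_pr = ET.SubElement(tbl, q("tblPr"))
    ET.SubElement(tbl_pr, q("tblW"), {q("w"): "0", q("type"): "auto"})

    # One w:gridCol per column.
    grid = ET.SubElement(tbl, q("tblGrid"))
    for _ in rows[0]:
        ET.SubElement(grid, q("gridCol"), {q("w"): str(col_width)})

    for row in rows:
        tr = ET.SubElement(tbl, q("tr"))
        for cell_text in row:
            tc = ET.SubElement(tr, q("tc"))
            tc_pr = ET.SubElement(tc, q("tcPr"))
            ET.SubElement(
                tc_pr, q("tcW"), {q("w"): str(col_width), q("type"): "dxa"}
            )
            # Cell content is ordinary paragraph markup.
            p = ET.SubElement(tc, q("p"))
            r = ET.SubElement(p, q("r"))
            ET.SubElement(r, q("t")).text = cell_text
    return tbl


tbl = table([["Element", "Purpose"], ["w:tblGrid", "column layout"]])
print(ET.tostring(tbl, encoding="unicode"))
```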

2.3.3. w:sectPr

3. Pitfalls

While it is easy to get caught up in the complexities of WordML, only a minimal set of markup is required for Word to recognize a file as a valid Word 2003 XML instance.

Figure 1. 
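A minimal instance along these lines might look like the output of the following Python sketch. It is an illustration under stated assumptions rather than a reproduction of the original figure: the function name minimal_document is hypothetical, and the mso-application processing instruction is included because it is what associates the file with Word when it is opened from the desktop.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W)

# XML declaration plus the processing instruction that associates the file
# with Word.
PROLOG = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
    '<?mso-application progid="Word.Document"?>\n'
)


def q(local):
    """Qualify a local name with the WordML namespace."""
    return f"{{{W}}}{local}"


def minimal_document(text):
    """Return a one-paragraph WordML document: root, body, p, r, t."""
    root = ET.Element(q("wordDocument"))
    body = ET.SubElement(root, q("body"))
    p = ET.SubElement(body, q("p"))
    r = ET.SubElement(p, q("r"))
    ET.SubElement(r, q("t")).text = text
    return PROLOG + ET.tostring(root, encoding="unicode")


doc = minimal_document("Hello, World.")
print(doc)
```

None of the metadata structures (fonts, lists, styles, document properties) appear here; only the root element, the body, and one paragraph with a single run of text are built.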

When creating a transform that will take an existing XML instance and surround it with the appropriate WordML markup to enable formatting, particular attention must be paid to low-level markup within a paragraph.

Figure 2. 

The biggest difficulty when merging an existing XML instance with WordML is Word's lack of hierarchy. While WordML contains a section properties element, it is not a container; instead, it holds the options associated with a section and is stored at the point where that section ends. The lack of hierarchy is most evident at the paragraph level. It is common practice to consider a list part of a paragraph, which would most likely result in nested paragraph elements. WordML does not support this; instead, the structure must be flattened by treating each block of text as a separate WordML paragraph.
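The flattening step can be sketched as follows. This Python example assumes a hypothetical source vocabulary in which a para may contain a list of item elements, and it maps each block of text to its own w:p. The style IDs "BodyText" and "ListItem" are placeholders; a real transform would more likely attach list items to the definitions in w:lists.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W)


def q(local):
    """Qualify a local name with the WordML namespace."""
    return f"{{{W}}}{local}"


def flatten(para):
    """Split a para containing a nested list into (style_id, text) blocks."""
    blocks = []
    if para.text and para.text.strip():
        blocks.append(("BodyText", para.text.strip()))
    for child in para:
        if child.tag == "list":
            for item in child.findall("item"):
                blocks.append(("ListItem", (item.text or "").strip()))
        # Text after the nested list becomes a paragraph of its own.
        if child.tail and child.tail.strip():
            blocks.append(("BodyText", child.tail.strip()))
    return blocks


def to_wordml(blocks):
    """Emit one sibling w:p per block; nothing is nested."""
    paragraphs = []
    for style_id, text in blocks:
        p = ET.Element(q("p"))
        p_pr = ET.SubElement(p, q("pPr"))
        ET.SubElement(p_pr, q("pStyle"), {q("val"): style_id})
        r = ET.SubElement(p, q("r"))
        ET.SubElement(r, q("t")).text = text
        paragraphs.append(p)
    return paragraphs


# Hypothetical source paragraph with a list nested inside it.
SOURCE = (
    "<para>Intro text:"
    "<list><item>First</item><item>Second</item></list>"
    "Closing text.</para>"
)

blocks = flatten(ET.fromstring(SOURCE))
paragraphs = to_wordml(blocks)
for p in paragraphs:
    print(ET.tostring(p, encoding="unicode"))
```

One source paragraph becomes four sibling w:p elements, so any grouping that the source hierarchy expressed has to be carried by style names or by the retained schema-specific elements.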

Acknowledgements

I would like to thank the folks at Microsoft for helping me navigate through the nuances of WordML, particularly Jean Paoli and Brian Jones.

Bibliography

[CDK] Microsoft Word XML Content Development Kit Beta 2. Microsoft Word XML CDK

[WordML] Word 11 Beta 2 Schema. Microsoft Word XML CDK

Biography

Senior VP and Principal XML Technologist

Mary first learned to speak structured markup languages in 1992, while working for Butterworth Legal Publishers. Immersion is always the best way to learn, and with the guidance of a knowledgeable mentor, she learned the nuances of document analysis, DTD development, structured editors, content management systems, and the many personality types to be dealt with during an implementation. Today, as Vice President of XML Solutions and Principal XML Technologist for DMSi, she helps her clients avoid all the pitfalls that she first encountered as an early adopter through project management, needs analysis, requirements definition, product selection, schema development, application customization, training and support. Sandwiched in between, Mary was the Manager of Sales Support for Xyvision (now XyEnterprise), focusing on SGML/XML content management solutions. In her spare time, Mary is a textile artist and an editor for Quilting Arts Magazine.