Abstract
WordML is the XML vocabulary, defined as a W3C XML Schema, behind Microsoft Word 2003. More than just a style description language, it encompasses every aspect of Microsoft Word and has become the native XML format for Word. If you are planning to implement an XML solution in Office 2003, you'll need to learn WordML. Although WordML resembles other style description languages, a fully functional XML authoring environment in Word 2003 also requires XSL transforms that merge your XML instance with WordML markup. Multiple transforms can be created to show different "views," each applying a specific set of styles. In this session we'll talk about what can and can't be done using WordML, review each of the major element groups, and discuss paragraph styles, character styles, and tables in detail.
WordML, http://schemas.microsoft.com/office/word/2003/wordml, and its sibling, urn:schemas-microsoft-com:office:office, codify virtually every piece of information related to a Microsoft Word 2003 document instance. The easiest way to learn about the structure of a Word 2003 document is to save it as XML and then view it in a text editor. Be forewarned, however: there is a significant amount of overhead associated with each file. If you have ever examined the RTF version of a Word document, or saved a Word document as HTML, you may remember the enormous size of a file that contains only a line or two of text. That overhead is what makes it possible for another user to open your Word document and see it exactly as it appears on your screen; while unwieldy, it has its purpose.
There are ten basic structures contained within the w:wordDocument root-level element (there are sixteen total possible child elements). All but the last can be considered metadata; that is, they contain information about the document rather than the document content. The main structures are as follows:
| Element | Description |
|---|---|
| o:DocumentProperties | part of the office namespace, it includes all of the standard document properties such as title, author, creation date, statistics, etc. |
| o:CustomDocumentProperties | part of the office namespace, it includes custom document properties as defined on the custom properties tab of the document properties panel. |
| fonts | available fonts |
| lists | list definitions - contains each of the possible list styles as well as any modifications made by the stylesheet designer or end user |
| styles | style definitions - contains all of the default styles as well as any document-specific styles created by the stylesheet designer or end user; includes paragraph, character and table styles |
| divs | |
| docOleData | storage for OLE objects |
| docSuppData | toolbar customizations, VBA |
| docPr | storage for settings in each of the various options panels such as view, print, save, page setup, etc. |
| body | the actual content of the instance |
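One quick way to see which of these structures a given file actually contains is to list the children of the root element. The following is a minimal sketch using Python's standard library; the `SAMPLE` string is a hand-written stand-in (an assumption, not output captured from Word), and a real saved file would be far larger.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
O = "urn:schemas-microsoft-com:office:office"

# Hand-written stand-in for a document saved from Word as XML.
SAMPLE = f"""<?xml version="1.0" encoding="UTF-8"?>
<w:wordDocument xmlns:w="{W}" xmlns:o="{O}">
  <o:DocumentProperties><o:Title>Demo</o:Title></o:DocumentProperties>
  <w:fonts/>
  <w:styles/>
  <w:docPr/>
  <w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body>
</w:wordDocument>"""

def top_level_structures(xml_text):
    """Return the prefixed names of the root's children, in document order."""
    prefixes = {W: "w", O: "o"}
    root = ET.fromstring(xml_text.encode("utf-8"))
    names = []
    for child in root:
        # ElementTree stores qualified names as "{namespace-uri}local".
        uri, _, local = child.tag[1:].partition("}")
        names.append(f"{prefixes.get(uri, uri)}:{local}")
    return names

print(top_level_structures(SAMPLE))
```

Running the same function over a file saved from Word should show which of the metadata structures precede w:body in that particular document.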
When migrating an existing XML instance into Word 2003, or transforming a Word 2003 document instance into another vocabulary (such as a composition system or a particular XML schema), the areas of most concern are lists, styles, and body content.
When writing a transform to WordML, you may find it easier to simply create your template document in Word. Once you have set up all of the styles, lists, and any other document settings you want to appear in your result document, save the template as an XML file. Word will then have generated all of the corresponding WordML for you.
The body element contains the actual content of the document instance. It has three basic children: paragraphs, tables, and section break properties.

The paragraph element, w:p, contains all of the information necessary to format a single block of text. WordML takes the approach (espoused by many XML developers) that mixed content should never exist. This is counter-intuitive to most document-centric XML developers and users, and it results in a fairly substantial increase in the markup to be dealt with.

The w:pPr element holds the properties associated with a paragraph. Most importantly, it contains the w:pStyle element, which references a particular style as defined in the w:styles section of the document instance.
The run element, w:r, is the leaf container for data. While this is straightforward in a pure WordML environment, things change when another schema is used for markup: the current run must close before the schema-specific start tag, and a new run opens immediately after it. For instance, to incorporate an inline emphasis element, you might see the following: ... text </w:t></w:r><ns0:emphasis style="italic"><w:r><w:t>italic ...
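The run-splitting pattern can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source namespace `urn:example:doc` and the `para` and `emphasis` element names are hypothetical, and the function handles a single level of inline markup, not nested inline elements.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS0 = "urn:example:doc"  # hypothetical custom-schema namespace

def run(text):
    """Wrap a piece of text in w:r/w:t."""
    r = ET.Element(f"{{{W}}}r")
    t = ET.SubElement(r, f"{{{W}}}t")
    t.text = text
    return r

def para_to_wordml(para):
    """Turn one mixed-content paragraph into a w:p with no mixed content."""
    p = ET.Element(f"{{{W}}}p")
    if para.text:
        p.append(run(para.text))          # text before the first inline element
    for child in para:
        # Keep the schema-specific element, but move its text into its own run.
        wrapper = ET.SubElement(p, child.tag, dict(child.attrib))
        if child.text:
            wrapper.append(run(child.text))
        if child.tail:
            p.append(run(child.tail))     # text after the inline element
    return p

src = ET.fromstring(
    f'<para xmlns="{NS0}">Some <emphasis style="italic">italic</emphasis> text</para>'
)
wordml_para = para_to_wordml(src)
print(ET.tostring(wordml_para, encoding="unicode"))
```

Each text fragment ends up in its own run, and the custom element sits between runs instead of inside one, which matches the shape of the fragment quoted above.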

The table element, w:tbl, has two formatting child elements: w:tblPr, which defines table-wide properties, and w:tblGrid, which defines the individual column layouts. The row element, w:tr, contains a properties child (w:trPr) and the individual cells (w:tc), which hold the cell contents.
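The nesting described above can be made concrete with a small builder. This is a sketch, not a complete table: w:tblPr is left empty, the optional w:trPr is omitted, the `w:gridCol` element with a `w:w` width attribute (assumed to be in twentieths of a point) reflects the author's understanding of the schema, and the width value itself is arbitrary. Whether Word accepts this reduced form without further properties has not been verified here.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W)  # serialize with the conventional w: prefix

def w(local):
    """Qualified name in the WordML namespace."""
    return f"{{{W}}}{local}"

def build_table(rows, col_width=2880):
    """Build w:tbl from a list of equal-length rows of cell strings."""
    tbl = ET.Element(w("tbl"))
    ET.SubElement(tbl, w("tblPr"))             # table-wide properties (empty here)
    grid = ET.SubElement(tbl, w("tblGrid"))    # one grid column per cell in a row
    for _ in range(len(rows[0])):
        ET.SubElement(grid, w("gridCol"), {w("w"): str(col_width)})
    for row in rows:
        tr = ET.SubElement(tbl, w("tr"))
        for text in row:
            tc = ET.SubElement(tr, w("tc"))
            # Cell content is ordinary paragraph markup.
            p = ET.SubElement(tc, w("p"))
            r = ET.SubElement(p, w("r"))
            ET.SubElement(r, w("t")).text = text
    return tbl

table = build_table([["Element", "Purpose"], ["w:tblPr", "table-wide properties"]])
print(ET.tostring(table, encoding="unicode"))
```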

While it is easy to get caught up in the complexities of WordML, only a minimal set of markup is required for Word to recognize a file as a valid Word 2003 XML instance.
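As an illustration of how small that minimal set can be, the sketch below writes a one-paragraph document: a root w:wordDocument, a w:body, and a single paragraph with one run. The `mso-application` processing instruction is the one Word itself writes when saving as XML; it is included here on the assumption that it is what associates the file with Word when opened from the desktop. The helper name `minimal_wordml` is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

W = "http://schemas.microsoft.com/office/word/2003/wordml"

def minimal_wordml(text):
    """Return a one-paragraph WordML document as a string."""
    return (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
        '<?mso-application progid="Word.Document"?>\n'
        f'<w:wordDocument xmlns:w="{W}">\n'
        '  <w:body>\n'
        # escape() protects &, < and > in the supplied text.
        f'    <w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>\n'
        '  </w:body>\n'
        '</w:wordDocument>\n'
    )

doc = minimal_wordml("Hello, WordML & friends")

# Round-trip through a parser to confirm the result is well-formed.
root = ET.fromstring(doc.encode("utf-8"))
text_node = root.find(f"{{{W}}}body/{{{W}}}p/{{{W}}}r/{{{W}}}t")
print(root.tag)
print(text_node.text)
```

Everything listed earlier as metadata (properties, fonts, lists, styles, and so on) is absent from this sketch.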
When creating a transform that will take an existing XML instance and surround it with the appropriate WordML markup to enable formatting, particular attention must be paid to low-level markup within a paragraph.
The biggest difficulty when merging an existing XML instance with WordML is Word's lack of hierarchy. While WordML contains a section properties element, this is not a container; instead it holds the options associated with a section and is stored at the end of that section, at the section break. The lack of hierarchy is most evident at the paragraph level. Many schemas treat a list as part of a paragraph, which would most naturally map to nested paragraph elements. WordML does not support this, so the source structure must be flattened, with each block of text becoming a separate WordML paragraph structure.
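The flattening step can be sketched as follows. The source vocabulary (`para`, `list`, `item`) and the style names `BodyText` and `ListBullet` are hypothetical, and a paragraph style merely stands in for real list formatting, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W)

def w(local):
    """Qualified name in the WordML namespace."""
    return f"{{{W}}}{local}"

def styled_paragraph(text, style):
    """One flat w:p whose w:pPr/w:pStyle names the given style."""
    p = ET.Element(w("p"))
    ppr = ET.SubElement(p, w("pPr"))
    ET.SubElement(ppr, w("pStyle"), {w("val"): style})
    r = ET.SubElement(p, w("r"))
    ET.SubElement(r, w("t")).text = text
    return p

def flatten(para):
    """Turn a paragraph with an embedded list into sibling w:p elements."""
    out = []
    if para.text and para.text.strip():
        out.append(styled_paragraph(para.text.strip(), "BodyText"))
    for child in para:
        if child.tag == "list":
            for item in child.findall("item"):
                out.append(styled_paragraph((item.text or "").strip(), "ListBullet"))
        # Text following the list becomes its own paragraph as well.
        if child.tail and child.tail.strip():
            out.append(styled_paragraph(child.tail.strip(), "BodyText"))
    return out

source = ET.fromstring(
    "<para>Before the list"
    "<list><item>One</item><item>Two</item></list>"
    "After the list</para>"
)
paragraphs = flatten(source)
for p in paragraphs:
    print(ET.tostring(p, encoding="unicode"))
```

One nested source paragraph becomes four sibling paragraphs, and the original hierarchy survives only as a sequence of style references.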
I would like to thank the folks at Microsoft for helping me navigate through the nuances of WordML, particularly Jean Paoli and Brian Jones.