Abstract
WordML is the W3C XML schema behind Microsoft Word 2003. More than just a style description language, it encompasses every aspect of Microsoft Word and has become the native format for Word. If you are planning on implementing an XML solution in Office 2003, you'll need to learn WordML. Similar to other style description languages, in order to create a fully functional XML authoring environment in Word 2003 you'll need to create XSL transforms that will merge your XML instance with WordML markup. Multiple transforms can be created to show different "views," each applying a specific set of styles. In this session we'll talk about what can and can't be done using WordML, review each of the major element groups, and discuss in detail paragraph styles, character styles, and tables.
On October 21, 2003, the world of Microsoft Office changed. Microsoft released the Microsoft Office 2003 System, no longer a set of standalone applications but a powerful collection of integrated programs, servers, and services. The most significant enhancement, at least to any respectable xmlgeek, is the movement to XML as the underlying data structure. What this means is that every bit of information that Microsoft maintains and tracks about a Word document is accessible in XML. And with that, stylesheet designers, typesetters, and anyone else trying to expose the contents of a Word document can now work with the data using standard XML tools.
WordML is nothing more than Word in XML. That isn't meant to downplay the significance; it merely is meant to point out that Word is still Word. Unlike past problems in trying to understand RTF (rich text format), the WordML schema is published, documented, licensed, and freely available. The easiest way to learn about the structure of a Word 2003 document is to save it as XML then view it in a text editor. However, be forewarned—there is a significant amount of overhead associated with each file. If you have ever examined the RTF version of a Word document, or saved a Word document as HTML, you'll be familiar with the enormous size of a file that only contains a line or two of text. The overhead is what makes it possible for another user to open your Word document and see it exactly as it appears on your screen; while unwieldy, it has its purpose.
WordML is nothing more than Word in XML. It bears repeating. Word has not been morphed into a structured editor (although there is other cool technology that enables that), nor has the way in which Word creates and maintains information really changed. Only the vocabulary used to identify the infobits is new. There is no hierarchical structure and no nesting. It's still a flat structure.
Word documents (w:wordDocument) can be broken down into five distinct parts: Document Properties, Fonts, Lists, Styles and Body. The content of the document itself is contained in the w:body element and is the main focus of this article. First, an overview of each of the components and their contents.
NOTE: The best way to gain familiarity with WordML is through the use of the Word XML Toolbox. It's available for download from the MSDN Office Developer Center. You'll also want to grab a copy of the Microsoft Office 2003 XML Reference Schemas, which includes the actual schemas as well as documentation and overview articles. (The easiest way to locate the files is to go to www.microsoft.com/downloads and search for 02003xmlref.exe)
Microsoft Office Word 2003 takes advantage of several namespaces. For those of us who spend most of our time working with documents in editors like Epic Editor, XMetaL, or FrameMaker+XML, namespaces can be a bit overwhelming at first. Namespaces allow the mixing of data from disparate sources into a single instance. Identified by its prefix, each element follows the rules of its own namespace. Thanks to namespaces, Word is able to handle customer-specific XML. All of the information relevant to Word is maintained in the various Microsoft namespaces; your specific schema maintains its own namespace (or is given one by default). Since a Word document includes a number of items that are common to any Office document, there's a special namespace for those elements. The vast majority of elements within a Word document fall within the w namespace.
w—http://schemas.microsoft.com/office/word/2003/wordml
wx—http://schemas.microsoft.com/office/word/2003/auxHint
v—urn:schemas-microsoft-com:office:vml
o—urn:schemas-microsoft-com:office:office
aml—http://schemas.microsoft.com/aml/2001/core
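To make the prefix-to-namespace mapping concrete, here is a minimal Python sketch that reads a small instance and reports which namespace owns each element. The sample document and the customer namespace `urn:example:memo` are made up for illustration; only the five URIs in the `NS` table come from the list above.

```python
import xml.etree.ElementTree as ET

# Prefixes and URIs exactly as listed above.
NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
    "v": "urn:schemas-microsoft-com:office:vml",
    "o": "urn:schemas-microsoft-com:office:office",
    "aml": "http://schemas.microsoft.com/aml/2001/core",
}

# Hypothetical sample: Word markup mixed with a customer element (m:memo).
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:o="urn:schemas-microsoft-com:office:office"
    xmlns:m="urn:example:memo">
  <o:DocumentProperties><o:Title>Memo</o:Title></o:DocumentProperties>
  <w:body>
    <m:memo>
      <w:p><w:r><w:t>Hello</w:t></w:r></w:p>
    </m:memo>
  </w:body>
</w:wordDocument>"""


def prefix_for(element):
    """Return the prefix in NS that owns this element, or None for customer markup."""
    uri = element.tag[1:].split("}", 1)[0] if element.tag.startswith("{") else ""
    for prefix, known_uri in NS.items():
        if known_uri == uri:
            return prefix
    return None


root = ET.fromstring(SAMPLE)
title = root.find("o:DocumentProperties/o:Title", NS).text
paragraphs = root.findall(".//w:p", NS)
owners = [prefix_for(e) for e in root.iter()]
```

The same `NS` table can be passed to any `find` or `findall` call, so the prefixes in your code do not have to match the prefixes used in the file.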
The o:DocumentProperties element and its children belong to the Office (o) namespace and include information about the document itself, such as the title, author name, creation date, date last edited, number of pages, etc. This is the XML representation of the data that can be seen by viewing the Properties pop-up window (see image below). A second element, o:CustomDocumentProperties, contains the information found on the Custom tab within the Properties window. Note that specific element names are undefined; CustomDocumentProperties is defined with a content model of "any." This allows the application to assign element names based on the Custom Property name and use the value as the element content. It's not necessary to write these elements when creating a WordML instance; Word will automatically populate them when the new file is first opened in Word. Similarly, in most instances this information will be discarded when transforming a Word XML instance to another format.
The <w:fonts> element has two children: <w:defaultFonts> and <w:font>. The default font is basically the same as would come up if you were creating a new document based on the Normal template; i.e. Times New Roman. Each grouping of the <w:font> element contains details about a particular font used within the actual document instance. Again, it is not necessary to write these elements when creating a WordML instance; Word will automatically populate them when the new file is first opened in Word.
The <w:lists> element also has two children: <w:listDef> and <w:list>. These two are interrelated. Each listDef has an ID attribute (listDefId) that links it to the appropriate list element. The list definition element is described as referring to "base list definitions." They are not used directly, but are instead referred to by an individual list element, which is in turn referred to by a paragraph property. Sounds a bit confusing? That's because it is. Basically, the list properties of a paragraph carry an "ilfo" value that matches the ilfo attribute on an individual list element. The individual list element in turn carries an "ilst" value that matches the listDefId attribute of the listDef element.
If, instead, the list is created as part of a paragraph style, the style name is referenced in the listDef element hierarchy.
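The two-step lookup for directly formatted lists can be sketched in Python as follows. This is an illustration under assumptions, not a reference implementation: the sample markup is hand-written, and it assumes that the paragraph-level ilfo and the list-level ilst appear as child elements carrying a w:val attribute, which is how I understand saved Word 2003 files to look. Adjust the lookups if your files differ.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}


def w(name):
    """Build the qualified name ElementTree uses for a w: element or attribute."""
    return "{%s}%s" % (W, name)


# Hand-written sample (assumed layout): one list definition, one list, one paragraph.
SAMPLE = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:lists>
    <w:listDef w:listDefId="0">
      <w:lvl w:ilvl="0"><w:start w:val="1"/></w:lvl>
    </w:listDef>
    <w:list w:ilfo="1"><w:ilst w:val="0"/></w:list>
  </w:lists>
  <w:body>
    <w:p>
      <w:pPr><w:listPr><w:ilvl w:val="0"/><w:ilfo w:val="1"/></w:listPr></w:pPr>
      <w:r><w:t>First item</w:t></w:r>
    </w:p>
  </w:body>
</w:wordDocument>"""


def list_def_for(root, para):
    """Follow paragraph -> w:list -> w:listDef; return the listDef element or None."""
    ilfo = para.find("w:pPr/w:listPr/w:ilfo", NS)
    if ilfo is None:
        return None  # not a directly formatted list paragraph
    for lst in root.findall("w:lists/w:list", NS):
        if lst.get(w("ilfo")) != ilfo.get(w("val")):
            continue
        ilst = lst.find("w:ilst", NS)
        if ilst is None:
            return None
        for list_def in root.findall("w:lists/w:listDef", NS):
            if list_def.get(w("listDefId")) == ilst.get(w("val")):
                return list_def
    return None


root = ET.fromstring(SAMPLE)
para = root.find("w:body/w:p", NS)
found = list_def_for(root, para)
```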
The <w:styles> element consists of <w:style> children. Each style group contains all of the details about one of four specific style types: paragraph, character, list or table. While each set of child elements is particular to the type of style, there are values that represent each of the options available on the style panes. Like above, it's not necessary to write all of the details about a particular style when creating a new WordML document instance. It is, however, critical that you have at least the <w:style> element and the style name for each style referenced in the body of the document. Word will automatically pick up the rest of the information from the referenced template. Styles will be explained in more detail in the next section.
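Since every style referenced in the body needs at least a stub definition, a quick check before handing a generated file to Word can save some head-scratching. The Python sketch below is one way to do it, assuming paragraphs name their style through w:pStyle and runs through w:rStyle; the sample document and the style names in it are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}


def w(name):
    """Build the qualified name ElementTree uses for a w: element or attribute."""
    return "{%s}%s" % (W, name)


# Hypothetical sample: only "Normal" is defined, but the body uses two other styles.
SAMPLE = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:styles>
    <w:style w:type="paragraph" w:styleId="Normal"><w:name w:val="Normal"/></w:style>
  </w:styles>
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Title</w:t></w:r></w:p>
    <w:p>
      <w:pPr><w:pStyle w:val="Normal"/></w:pPr>
      <w:r><w:rPr><w:rStyle w:val="Emphasis"/></w:rPr><w:t>Body</w:t></w:r>
    </w:p>
  </w:body>
</w:wordDocument>"""


def missing_styles(root):
    """Return the sorted style IDs used in w:body that have no w:style definition."""
    defined = {s.get(w("styleId")) for s in root.findall("w:styles/w:style", NS)}
    body = root.find("w:body", NS)
    used = {
        e.get(w("val"))
        for e in body.iter()
        if e.tag in (w("pStyle"), w("rStyle"))
    }
    return sorted(used - defined)


root = ET.fromstring(SAMPLE)
missing = missing_styles(root)
```

Each ID in `missing` is a style your transform still needs to emit a stub for.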
The <w:body> element contains what is typically thought of as the document content. Everything that appears on a printed page is contained within the start and end body tags, including headers, footers, footnotes, images, and textboxes. It can get pretty wild in here with binary data, proofing errors, grammar errors, and change tracking interrupting the text runs. These are the areas most likely to cause problems when trying to convert a WordML document instance into something else.
Paragraph style definitions (<w:style w:type="paragraph">) have 5 predictable components that play a role in document transforms and conversions. These are: name, based on, next, paragraph properties and run properties. Each of these elements and their attributes and children are directly related to the style panels in Word as shown below. There are 8 unique sub-panels, each of which is focused on a particular area: font, paragraph, tabs, border, language, frame, numbering, and shortcut key.
Styles in Microsoft Word are inherited; that is, any of the style characteristics associated with the "based on" style name are automatically associated with the style unless overridden. If your application needs to replicate styles, you'll need to be able to navigate through the style hierarchy to ensure that you've captured each of the relevant characteristics.
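Walking the "based on" chain might look like the Python sketch below. It is deliberately simplified and rests on a few assumptions: the style definitions are hand-written, w:basedOn is taken to point at the w:styleId of the parent style, and each property is reduced to its w:val attribute (or "on" when there is none), so properties that carry other attributes, such as w:rFonts, would need extra handling.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}


def w(name):
    """Build the qualified name ElementTree uses for a w: element or attribute."""
    return "{%s}%s" % (W, name)


# Hand-written sample: Heading1 is based on Normal and overrides the font size.
SAMPLE = """<w:styles xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:style w:type="paragraph" w:default="on" w:styleId="Normal">
    <w:name w:val="Normal"/>
    <w:rPr><w:sz w:val="24"/></w:rPr>
  </w:style>
  <w:style w:type="paragraph" w:styleId="Heading1">
    <w:name w:val="heading 1"/>
    <w:basedOn w:val="Normal"/>
    <w:next w:val="Normal"/>
    <w:pPr><w:keepNext/></w:pPr>
    <w:rPr><w:b/><w:sz w:val="32"/></w:rPr>
  </w:style>
</w:styles>"""


def effective_props(styles, style_id, container):
    """Merge the children of pPr or rPr along the basedOn chain, base style first."""
    by_id = {s.get(w("styleId")): s for s in styles.findall("w:style", NS)}
    chain, seen = [], set()
    while style_id in by_id and style_id not in seen:  # 'seen' guards against cycles
        seen.add(style_id)
        style = by_id[style_id]
        chain.append(style)
        based_on = style.find("w:basedOn", NS)
        style_id = based_on.get(w("val")) if based_on is not None else None
    props = {}
    for style in reversed(chain):  # apply the base first so derived styles override it
        holder = style.find("w:" + container, NS)
        if holder is not None:
            for child in holder:
                props[child.tag.split("}", 1)[1]] = child.get(w("val"), "on")
    return props


styles = ET.fromstring(SAMPLE)
heading_run = effective_props(styles, "Heading1", "rPr")
heading_para = effective_props(styles, "Heading1", "pPr")
```

Here Heading1 keeps its own bold and size settings, and anything it does not set would fall through from Normal.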
The next element identifies what is to happen when the Enter key is pressed. By default, the new paragraph will take on the current style; by indicating a different style name here, however, the designer can have more control over the document's look and feel. For instance, if defining a heading style, the "next" style might be set to Normal.
The paragraph and run properties are the same elements that are used within the body of the document; any local settings (that is, settings within the body element) would override those in the style definition. There are more than two dozen child elements of the paragraph property element (<w:pPr>); the most common are listed here. Refer to the actual Word 2003 schema documentation for the complete list.
<w:keepNext/>
<w:keepLines/>
<w:pageBreakBefore/>
<w:widowControl/>
<w:listPr>
<w:tabs>
<w:spacing/>
<w:ind/>
<w:jc w:val="left/right/center/both"/>
The run properties are also the same as those used for character styles. See Styles—character below for an overview.
Rather than using the individual formatting codes available on the menu bar, it's possible to create named character styles (<w:style w:type="character">). This makes writing transforms and conversions much simpler, and allows multiple primitives to be combined into a single command. When the style panel is set to a style type of character, only half of the sub-panels are available: font, border, language and shortcut key.
Amazingly, there are more elements associated with the run property (<w:rPr>) than with the paragraph property, above. Over three dozen unique child elements are possible. Bold, italic, caps, small caps, strike through, underline, outline, shadow, emboss, color, size, and of course, font, are the ones you're most likely to encounter, unless you're working with Asian or right-to-left fonts.
<w:rFonts>
<w:b/>
<w:i/>
<w:caps/>
<w:smallCaps/>
<w:strike/>
<w:outline/>
<w:shadow/>
<w:emboss/>
<w:vanish/>
<w:color/>
<w:spacing/>
<w:kern/>
<w:sz>
<w:u/>
Of course, if any of these run properties are set as part of a paragraph style, their effects will be seen on the entire paragraph, rather than a portion of one. For instance, if you have a paragraph style associated with a heading, it's likely that the entire paragraph will have certain characteristics—such as bold and a larger point size—associated with it.
The body of a Word document (<w:body>) is, for the most part, made up of paragraph elements. Remember, there's no hierarchical structure in Word; a paragraph element (<w:p>) cannot contain another paragraph element. You'll also find tables (<w:tbl>), proofing errors (<w:proofErr>), and permissions (<w:permStart> and <w:permEnd>) intermingled among the paragraphs. The very last element encountered before the close body tag will be section properties (<w:sectPr>), which defines any headers, footers, and page layout specs.
Immediately following the start tag of a paragraph element will be a paragraph properties (<w:pPr>) element. Similar to the paragraph properties element used in the style model, in this case it most likely contains the child style (<w:pStyle>) element, which tells Word which paragraph style it is to use for this specific paragraph instance. If there is list formatting involved (without using an actual named list style), the style element will be followed by a list properties (<w:listPr>) element.
While you may be expecting to see some actual text, that won't appear until we get to the run element, below.
The run (<w:r>) element, similar to the paragraph element, has its own properties element associated with it. Run properties will associate a run of text with a particular character style, font, or formatting primitive such as bold or italic. One item to pay close attention to is the fact that each of these elements is empty; that is, the text that is to be displayed in bold or italic will not be surrounded by start and end tags. Instead, the property applies to the entire run, as designated by the run's start and end tags.
Finally, we get to the last line of defense—the text (<w:t>) element. This is where the actual text you see in a document is stored. You must remember that in many cases an entire paragraph consists of multiple text runs. This is due to interruptions by proofing errors, comments, change tracking, tables, pictures, font changes, or any of the other elements allowable within a paragraph element.
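To see how a single sentence ends up scattered across several runs, here is a short Python sketch that stitches the text back together and notes which runs are bold. The sample paragraph is hand-written: the misspelled word is there only to justify the proofing markers around it.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}

# Hand-written sample: one sentence split into four runs by proofing marks and bold.
SAMPLE = """<w:p xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:pPr><w:pStyle w:val="Normal"/></w:pPr>
  <w:r><w:t>The </w:t></w:r>
  <w:proofErr w:type="spellStart"/>
  <w:r><w:t>quik</w:t></w:r>
  <w:proofErr w:type="spellEnd"/>
  <w:r><w:t> brown </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>fox</w:t></w:r>
</w:p>"""


def runs_of(para):
    """Return (text, is_bold) for each run that is a direct child of the paragraph."""
    result = []
    for run in para.findall("w:r", NS):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        is_bold = run.find("w:rPr/w:b", NS) is not None  # empty element, applies to the whole run
        result.append((text, is_bold))
    return result


def paragraph_text(para):
    """Concatenate the text of every run in document order."""
    return "".join(text for text, _ in runs_of(para))


para = ET.fromstring(SAMPLE)
runs = runs_of(para)
text = paragraph_text(para)
```

Note that the bold flag is read from the run's properties, not from anything wrapped around the text itself.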
If converting a WordML document to some other structured format, it is recommended that you first open the document in Word and turn off grammar and spell checking, and resolve any change tracking. Then save the document. This will eliminate much of the extraneous markup that's not really needed for the task at hand.
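If opening every file in Word first isn't practical, the proofing markers can also be dropped programmatically before conversion. The Python sketch below is an alternative of my own, not part of the recommendation above, and it only removes w:proofErr elements. It does not attempt to resolve change tracking, which still needs to be handled in Word or by your own rules.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}
PROOF_ERR = "{%s}proofErr" % W

# Hand-written sample with one flagged word.
SAMPLE = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <w:r><w:t>A </w:t></w:r>
      <w:proofErr w:type="spellStart"/>
      <w:r><w:t>smal</w:t></w:r>
      <w:proofErr w:type="spellEnd"/>
      <w:r><w:t> test</w:t></w:r>
    </w:p>
  </w:body>
</w:wordDocument>"""


def strip_proofing(root):
    """Remove every w:proofErr element in place and return how many were removed."""
    removed = 0
    for parent in list(root.iter()):
        for child in list(parent):  # copy the child list so removal is safe while looping
            if child.tag == PROOF_ERR:
                parent.remove(child)
                removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed = strip_proofing(root)
remaining_text = "".join(t.text or "" for t in root.iter("{%s}t" % W))
```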
While it is easy to get caught up in the complexities of WordML, only a minimal set of markup is required for Word to recognize a file as a valid Word 2003 XML instance.
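As a rough illustration of how little is needed, the Python sketch below writes a one-paragraph document: a root w:wordDocument, a w:body, and a single paragraph, run, and text element. The mso-application processing instruction is what lets Windows hand the file to Word; the function name and sample text are mine, and I have not tried to establish which of these pieces Word treats as strictly mandatory.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W)  # serialize with the conventional "w" prefix


def w(name):
    """Build the qualified name ElementTree uses for a w: element."""
    return "{%s}%s" % (W, name)


def minimal_wordml(text):
    """Return a small WordML document containing one paragraph of plain text."""
    root = ET.Element(w("wordDocument"))
    body = ET.SubElement(root, w("body"))
    para = ET.SubElement(body, w("p"))
    run = ET.SubElement(para, w("r"))
    ET.SubElement(run, w("t")).text = text
    header = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
        '<?mso-application progid="Word.Document"?>\n'
    )
    return header + ET.tostring(root, encoding="unicode")


doc = minimal_wordml("Hello, WordML")
```

Write `doc` to a file with an .xml extension, encoded as UTF-8, and try opening it in Word 2003.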
When creating a transform that will take an existing XML instance and surround it with the appropriate WordML markup to enable formatting, particular attention must be paid to low-level markup within a paragraph.
The biggest difficulty when merging an existing XML instance with WordML is Word's lack of hierarchy. While WordML contains a section properties element, this is not a container; instead it contains the options associated with a section and is stored at the end of a section break. The lack of hierarchy is most evident at the paragraph level. It is common practice to consider a list part of a paragraph, which would most likely result in nested paragraph elements. WordML does not support this and instead the WordML structure must be flattened by considering each block of text as a separate WordML paragraph structure.
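The flattening step can be sketched as follows. In practice you would most likely do this in your XSL transform; Python is used here only to show the idea. The source vocabulary (section, para, list, item) is hypothetical, and the sketch treats every element as a block, so a real transform would need to tell inline elements apart and keep them inside their paragraph as runs.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}

# Hypothetical customer XML: a list nested inside a paragraph.
SOURCE = (
    "<section><para>Intro text"
    "<list><item>One</item><item>Two</item></list>"
    "closing words</para></section>"
)


def flatten(elem, blocks):
    """Collect each block of text in document order, ignoring how deeply it is nested."""
    text = (elem.text or "").strip()
    if text:
        blocks.append(text)
    for child in elem:
        flatten(child, blocks)
        tail = (child.tail or "").strip()  # text that follows the child inside its parent
        if tail:
            blocks.append(tail)
    return blocks


def to_body(blocks):
    """Wrap each block in its own w:p / w:r / w:t so no paragraph is nested in another."""
    body = ET.Element("{%s}body" % W)
    for block in blocks:
        para = ET.SubElement(body, "{%s}p" % W)
        run = ET.SubElement(para, "{%s}r" % W)
        ET.SubElement(run, "{%s}t" % W).text = block
    return body


blocks = flatten(ET.fromstring(SOURCE), [])
body = to_body(blocks)
```

The nested paragraph and its list come out as four sibling paragraphs, which is the shape Word expects.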
I would like to thank the folks at Microsoft for helping me navigate through the nuances of WordML, particularly Jean Paoli and Brian Jones.