Abstract
The XML industry is dominated by healthy competition between various standards and conventions, but without basic quality in the design of XML documents, none of the downstream considerations is of much importance. Regardless of schema language or query and transform technology, poorly designed XML leads to systems that are difficult and expensive to maintain. This paper and presentation focus on the specifics of good XML design.
The quality of design of XML documents in the field is rather poor. With individual documents, one learns not to take even well-formedness for granted. There is not much one can do about this except to advocate well-formedness checking early in the workflow. Even worse is the problem of poor schema design. Most of the XML produced and consumed in the real world is of a home-grown format rather than a global standard such as XHTML or DocBook. It is all too common to find XML that shows little evidence of analysis, never mind design.
The core problem appears to be that developers do not take XML design as seriously as they take the design of applications code. And there is actually not very much in the way of published professional standards for XML design that can be used to establish best practice.
This paper presents principles of XML design informed by practical experience with the most common shortcomings of XML design in the field. Rather than being too circumspect on matters such as how to choose between elements and attributes, container elements and the like, it provides very clear and practicable guidelines to help promote consistency in designs that adopt its principles. Attention is given to form and accessibility to human authors and editors as well as to considerations of processing models.
On the evidence of usage patterns in typical computer applications, not many developers pay attention to the quality of design of XML documents. It is getting easier to take well-formedness of documents for granted but even this is not a given. Software developers sometimes don't consider XML as a serious enough technology to require careful design, but nonchalance about XML design often exacts a steep price over time. A developer can string together almost any jumble of tags and content in order to create a functional format for some database dump or technical article, but if he doesn't pay attention to form, he ends up with data that is harder to process by machines as well as people, and is thus more expensive to maintain.
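Checking well-formedness early need not be elaborate. The following is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree parser is acceptable for the workflow; the function name is_well_formed and the sample strings are illustrative only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


good = "<customer><name>Gabriel Okara</name></customer>"
bad = "<customer><name>Gabriel Okara</customer>"  # <name> is never closed

print(is_well_formed(good))    # (True, None)
print(is_well_formed(bad)[0])  # False
```

A check of this kind only establishes well-formedness. It says nothing about validity against a schema, and nothing about the quality of the design.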
There are many aspects to XML design. How to develop a vocabulary? How to decide what to leave in and what to omit from the model? How to structure the content model of the main elements? How to choose between elements and attributes? At a minimum XML content deserves the same depth of analysis and design as applications code, and one can readily argue that XML requires more design because it is the sort of data that so often outlives applications code. This increases the overall risks associated with design flaws.
Good XML design is largely a matter of intuition that comes from practice, but some general guidelines and principles can help developers establish good design practices. Throughout this paper such principles are in boldface.
The most important stage for sound design practice is when developing the schema for XML documents. There are considerations, especially of form, that are important when authoring documents or writing code to generate them, but these are often secondary to the importance of well-designed schema, which is therefore the main focus of this paper.
Naming elements, attributes and controlled vocabulary terms in XML content is one of the most important exercises in design. The considerations to keep in mind when naming things seem a matter of common sense, but are neglected to a surprising extent in practice.
In some cases, the creators of XML formats never expect that humans will ever read those formats. Some commentators on Web Services Description Language (WSDL) reject the idea that anyone would ever actually read a WSDL file directly or edit it in a text or XML editor. "WSDL is just for the Web services toolkits" is a common refrain. But whether or not developers prefer to deal with XML as plain text, there are always times when the choice is removed. Toolkits break down, and if you have not made the XML readable, you may regret it while developing or debugging code to process the XML, or while communicating the format to other developers. Never assume that it is not important for XML to be readable.
Always use very explicit and unambiguous element and attribute names. Consider using hyphens or underscores in naming rather than "hump case", for example "first-name" rather than "FirstName". Try to group elements logically so that, when pretty printed, they stand out more clearly instead of forming endless runs of sibling elements.
When XML is developed by multiple contributors without shared formal standards, or when it comes from toolkits such as data bindings where varied conditions in execution are manifested in the XML, consistency often suffers. Similar constructs might use different conventions at different points in the XML format. One element might be called "business-name" and a sibling "biz-tax-id". Each instance of such inconsistency is but a minor blemish, but in my experience, if developers do not pay particular attention to consistency, this sort of blemish proliferates until the data becomes very confusing to follow.
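Part of this consistency can be checked mechanically. The sketch below, in Python, assumes a hypothetical house convention of lowercase words joined by hyphens and a document without namespaces; the pattern, the function name and the sample document are illustrative only. It catches departures from the lexical convention, not inconsistent abbreviations such as "business" versus "biz", which still need a human reviewer.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical house convention: lowercase words joined by single hyphens.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")


def names_breaking_convention(xml_text):
    """Return the sorted element and attribute names that break NAME_PATTERN."""
    root = ET.fromstring(xml_text)
    offenders = set()
    for element in root.iter():
        # element.attrib iterates over attribute names
        for name in [element.tag, *element.attrib]:
            if not NAME_PATTERN.match(name):
                offenders.add(name)
    return sorted(offenders)


sample = """
<business>
  <business-name>Acme</business-name>
  <BizTaxId>12-345</BizTaxId>
  <contact first_name="Ada"/>
</business>
"""
print(names_breaking_convention(sample))  # ['BizTaxId', 'first_name']
```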
The oldest question asked by adopters of XML is when to use elements and when to use attributes in XML design. In most cases there is no clear answer, but some broad guidelines may help with the decision. None of the guidelines are meant to be absolute; use them as rules of thumb and feel free to break the rules whenever your particular needs require it.
If you consider the information in question to be part of the essential material that is being expressed or communicated in the XML, put it in an element. For human-readable documents this generally means the core content that is being communicated to the reader. For machine-oriented records formats this generally means the data that comes directly from the problem domain. If you consider the information to be peripheral or incidental to the main communication, or purely intended to help applications process the main communication, use attributes. This will avoid cluttering up the core content with auxiliary material. For machine-oriented records formats, this generally means application-specific notations on the main data from the problem domain.
As an example, it is not uncommon to see XML formats, usually home-grown in businesses, where document titles are placed in an attribute. A title is such a fundamental part of the communication of a document that it should always be in element content. On the other hand, it is not uncommon to find, say, internal product identifiers thrown as elements into descriptive records of the product. In some of these cases, attributes are more appropriate because the specific internal product code would not be of primary interest to most readers or processors of the document, especially when the ID is of a very long or inscrutable format.
The principle of core content is sometimes couched in less deliberate language: data goes in elements, metadata in attributes.
If the information is expressed in a structured form, especially if the structure may be extensible, use elements. On the other hand: If the information is expressed as an atomic token, use attributes. Elements are the extensible engine for expressing structure in XML. Almost all XML processing tools are designed around this fact, and if you break down structured information properly into elements, you'll find that your processing tools complement your design, and that you thereby gain productivity and maintainability. Attributes are designed for expressing simple properties of the information represented in an element. If you work against the basic architecture of XML by shoehorning structured information into attributes you may gain some specious terseness and convenience, but you will probably pay in maintenance costs.
Dates are a good example: A date has fixed structure and generally acts as a single token, so it makes sense as an attribute (preferably expressed in ISO-8601). Representing personal names, on the other hand, is a case where I've seen this principle surprise designers. Names very commonly turn up in attributes, but personal names should be in element content. A personal name has surprisingly variable structure (in some cultures you can cause confusion or offense by omitting honorifics or assuming an order of parts of names). A personal name is also rarely an atomic token. As an example, sometimes you may want to search or sort by a forename and sometimes by a surname. It is just as problematic to shoehorn a full name into the content of a single element as it is to put it in an attribute. Thus:
<customer>
  <name>Gabriel Okara</name>
  <occupation>Poet</occupation>
</customer>
is not much better than:
<customer name="Gabriel Okara">
  <occupation>Poet</occupation>
</customer>
An example of a fully considered name format is that in DocBook. The following example is based on one in [TDG].
<author>
  <honorific>Mr</honorific>
  <firstname>Norman</firstname>
  <surname>Walsh</surname>
  <othername role='mi'>D</othername>
</author>
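To illustrate why the structured form pays off in processing, here is a minimal sketch in Python that sorts authors by surname. It assumes a hypothetical wrapper element authors, which is not part of the example above, and uses only the firstname and surname children; the function name is illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element <authors>; the children follow the name format above.
doc = """
<authors>
  <author><firstname>Norman</firstname><surname>Walsh</surname></author>
  <author><firstname>Leonard</firstname><surname>Muellner</surname></author>
</authors>
"""


def sorted_by_surname(xml_text):
    """Return (surname, firstname) pairs ordered by surname."""
    root = ET.fromstring(xml_text)
    pairs = [
        (author.findtext("surname", default=""),
         author.findtext("firstname", default=""))
        for author in root.iter("author")
    ]
    return sorted(pairs)


print(sorted_by_surname(doc))  # [('Muellner', 'Leonard'), ('Walsh', 'Norman')]
```

With the whole name held in a single attribute or element, the same task would need string splitting and a guess about the order of the parts of a name, which is exactly the kind of assumption cautioned against above.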
If the information is intended to be read and understood by a person, use elements. In general this guideline places prose in element content. If the information is most readily understood and digested by a machine, use attributes. In general this guideline means that information tokens that are not natural language go in attributes.
There are some cases where people can decipher the information being represented but need a machine to properly use it. URLs are a great example: people have learned to read URLs through exposure in Web browsers and e-mail messages, but a URL is usually not much use without the computer to retrieve the referenced resource. Some database identifiers are also quite readable (although established database management best practice discourages IDs that could have "business meaning") but such IDs are usually props for machine processing. For these reasons it is usually best to put URLs and IDs in attributes.
Use an element if you need its value to be modified by another attribute. XML establishes a very strong conceptual bond between an attribute and the element in which it appears. An attribute provides some property or modification of that particular element. Processing tools for XML tend to follow this concept and it is almost always a terrible idea to have one attribute modify another. For example, if you are designing a format for a restaurant menu and you include the portion sizes of items on the menu, you may decide that this is not really important to the typical reader of the menu format, so you apply the Principle of core content and make it an attribute. The first attempt is:
<menu>
  <menu-item portion="250 mL">
    <name>Small soft drink</name>
  </menu-item>
  <menu-item portion="500 g">
    <name>Sirloin steak</name>
  </menu-item>
</menu>
Following the Principle of structured information you decide not to shoehorn the portion measurement and units into a single attribute, but instead of using an element, you opt for:
<menu>
  <menu-item portion-size="250" portion-unit="mL">
    <name>Small soft drink</name>
  </menu-item>
  <menu-item portion-size="500" portion-unit="g">
    <name>Sirloin steak</name>
  </menu-item>
</menu>
The attribute portion-unit now modifies portion-size, which as I've mentioned is a bad idea. An attribute on the element menu-item should modify that element, and nothing else. The solution is to give in and use an element:
<menu>
  <menu-item>
    <portion unit="mL">250</portion>
    <name>Small soft drink</name>
  </menu-item>
  <menu-item>
    <portion unit="g">500</portion>
    <name>Sirloin steak</name>
  </menu-item>
</menu>
This involves a mix of the Principle of core content and the Principle of readability. This is one of those cases that are less cut and dried, and other schemes might be as suitable as mine. The solution also involves contradicting the original decision to put the portion size into an attribute based on the Principle of core content. It illustrates that sometimes general principles will lead to conflicting conclusions and it's still a matter of your own judgment to decide on each specific matter.
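One practical consequence of the final menu design is that processing code can treat the portion as a value with a single modifying property. The following is a minimal sketch in Python, assuming the amounts are always numeric; the function name portions is illustrative.

```python
import xml.etree.ElementTree as ET

menu_xml = """
<menu>
  <menu-item>
    <portion unit="mL">250</portion>
    <name>Small soft drink</name>
  </menu-item>
  <menu-item>
    <portion unit="g">500</portion>
    <name>Sirloin steak</name>
  </menu-item>
</menu>
"""


def portions(xml_text):
    """Return (name, amount, unit) for each menu item."""
    result = []
    for item in ET.fromstring(xml_text).iter("menu-item"):
        portion = item.find("portion")
        # The unit attribute modifies the portion element that carries it.
        result.append((item.findtext("name"), float(portion.text), portion.get("unit")))
    return result


print(portions(menu_xml))
# [('Small soft drink', 250.0, 'mL'), ('Sirloin steak', 500.0, 'g')]
```

With the first attempt, the same code would have to split a string such as "250 mL" and hope that the separator never varies.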
Choosing between elements and attributes may not be a trivial matter, but once you've made your choice, stick to the same convention in similar situations. One sometimes sees XML where "id" is an attribute on one element, and then a child element on another, with the same meaning in both cases. This makes it harder to write generic and reusable code for processing the XML.
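A rough check for this kind of inconsistency is easy to automate. The sketch below, in Python, lists names that occur both as an attribute and as an element in one document. The sample document and the function name are hypothetical, and a shared name does not prove a shared meaning, so the result is only a list of candidates for review.

```python
import xml.etree.ElementTree as ET


def mixed_usage_names(xml_text):
    """Return names that appear both as an attribute and as an element."""
    root = ET.fromstring(xml_text)
    element_names = set()
    attribute_names = set()
    for element in root.iter():
        element_names.add(element.tag)
        attribute_names.update(element.attrib)  # adds the attribute names
    return sorted(element_names & attribute_names)


sample = """
<catalog>
  <product id="p-1"><name>Widget</name></product>
  <supplier><id>s-9</id><name>Acme</name></supplier>
</catalog>
"""
print(mixed_usage_names(sample))  # ['id']
```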
Another matter of constant discussion is whether to use container elements. The employees element in the following listing is an example.
<company>
  <employees>
    <employee id='101'>
      <name>Ezra Pound</name>
    </employee>
    <employee id='102'>
      <name>T.S. Eliot</name>
    </employee>
  </employees>
</company>
Sometimes container elements are a pure contrivance. Use container elements only when they have a natural correspondence to some real-world entity. In other words, don't just throw in container elements because you think a run of child elements needs a holder. If you favor push-type processing of XML rather than pull-type processing, container elements will not significantly affect the ease of processing.
Namespaces solve a difficult problem and the W3C XML namespaces specification is a compromise. As with all compromises it falls short of addressing each user's needs. Namespaces have proven, even after all this time, very difficult to incorporate smoothly into XML information architecture, and lack of care with namespaces can cause a lot of complications for XML processing tools.
Don't repeat in local names information already inherent in the namespace itself. For example, there is no need to make the local name of the linking element in the XHTML namespace xhtml-link. Since it is already local to the XHTML namespace just link will do. For historical reasons the XHTML specifications themselves go against this guideline when naming the root element html; it could just as well have been renamed to document.
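The point about push-type processing can be illustrated with a SAX handler, which reacts to events for the elements it cares about and ignores the rest. The following is a minimal sketch in Python using the standard xml.sax module; the class and function names are illustrative, and the second sample document is simply the listing above with the employees container stripped out.

```python
import xml.sax


class EmployeeNames(xml.sax.ContentHandler):
    """Collect the text of <name> elements found inside <employee>."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_employee = False
        self._buffer = None

    def startElement(self, tag, attrs):
        if tag == "employee":
            self._in_employee = True
        elif tag == "name" and self._in_employee:
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, tag):
        if tag == "name" and self._buffer is not None:
            self.names.append("".join(self._buffer))
            self._buffer = None
        elif tag == "employee":
            self._in_employee = False


def employee_names(xml_text):
    handler = EmployeeNames()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.names


with_container = ("<company><employees>"
                  "<employee id='101'><name>Ezra Pound</name></employee>"
                  "<employee id='102'><name>T.S. Eliot</name></employee>"
                  "</employees></company>")
without_container = with_container.replace("<employees>", "").replace("</employees>", "")

print(employee_names(with_container) == employee_names(without_container))  # True
```

The handler never mentions the container, so the same code serves both shapes of document.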
Choosing namespace URIs is important. There is some debate over whether to use URLs or URNs. The former have the advantage of familiarity, but people often create namespace URLs that do not have any corresponding resource, that is, if you browse to the URL you get a 404 "not found" error. URNs have the advantage that they don't encourage people to try to look them up in browsers. Use URLs for namespaces if you are careful to place some sort of document at the URL that would be useful for a reader. Place a Resource Directory Description Language (RDDL)[RDDL] 1.0 document at URLs corresponding to namespaces, unless more specialized conventions apply. For example, in RDF/XML documents namespaces often lead to RDF schema documents when resolved as URLs. There are many classes of URNs (classes of URNs are formally called "namespaces", not to be confused with XML namespaces). If you don't wish to use URLs, use URNs if your organization has a means of managing and resolving a suitable class of URN. Examples of URN namespaces include oid (an ISO-sanctioned system for assigning numerically coded identifiers to network nodes) and publicid (formal public identifiers for entities as defined in SGML and XML).
Publish well-known prefixes for namespaces but never make any prefix mandatory. Prefer well-known prefixes for a namespace when creating documents but accept any chosen prefix for a namespace when reading documents.
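Accepting any prefix when reading is straightforward if the code matches on the namespace URI and local name rather than on the prefix. The following is a minimal sketch in Python with xml.etree.ElementTree, which exposes qualified names in the form {uri}local; the namespace URI, the prefixes and the element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, used only for illustration.
NS = "http://example.org/ns/menu"

# The same content, once with a prefix and once with a default namespace.
doc_a = f'<m:menu xmlns:m="{NS}"><m:item>Tea</m:item></m:menu>'
doc_b = f'<menu xmlns="{NS}"><item>Tea</item></menu>'


def item_texts(xml_text):
    """Find items by namespace URI plus local name, never by prefix."""
    root = ET.fromstring(xml_text)
    return [item.text for item in root.iter(f"{{{NS}}}item")]


print(item_texts(doc_a))  # ['Tea']
print(item_texts(doc_b))  # ['Tea']
```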
Care of design and elegance of form in XML are not luxuries. On the contrary, they help save costs. It is often subtle and intangible considerations that mark the difference between an XML design that lasts well, can be reused, and is inexpensive to process, and one that is difficult to maintain. After all the rules and guidelines and principles the most important test of your design is a softer matter. Print out a sample document and hold it up to a colleague for a quick read. If he or she develops a headache from trying to figure out what it means, consider this an omen of the pain that maintaining the format will cause in future.
[RDDL] XML Resource Directory Description Language (RDDL) http://www.rddl.org/ by Jonathan Borden and Tim Bray
[TDG] DocBook: The Definitive Guide http://www.docbook.org/tdg/en/html/docbook.html by Norman Walsh and Leonard Muellner
[PLEA] "A plea for Sanity" http://lists.xml.org/archives/xml-dev/200204/msg00170.html by Joe English
[XMLNS] "XML Namespaces" http://www.jclark.com/xml/xmlns.htm by James Clark
[EFFXML] Effective XML http://www.cafeconleche.org/books/effectivexml/ by Elliotte Rusty Harold
[EA1] "SGML/XML: Elements versus attributes, April 1998" http://xml.coverpages.org/elementAttr9804.html
[EA2] "SGML/XML: Using Elements and Attributes" http://xml.coverpages.org/elementsAndAttrs.html