Document Model Selection: Off-the-Shelf, Altered-to-Fit, or Bespoke?

Keywords: Content model, schema, application architecture

B. Tommie Usdin
President
Mulberry Technologies, Inc.
Rockville
Maryland
United States of America

Biography

B. Tommie Usdin is President of Mulberry Technologies, Inc., a consultancy specializing in XML and SGML. Ms. Usdin has been working with SGML since 1985 and has been a supporter of XML since 1996. She chairs IDEAlliance's Extreme Markup Languages conferences and was co-editor of "Markup Languages: Theory & Practice" published by the MIT Press. Ms. Usdin has developed DTDs, Schemas, and XML/SGML application frameworks for applications in government and industry. Projects include reference materials in medicine, science, engineering, and law; semiconductor documentation; and historical and archival materials. Distribution formats have included print books and journals, and both Web-based and media-based electronic publications.


Abstract


Document Model selection is a key success factor in XML. Approaches include: adopting an existing model, modifying a model to meet your needs, and creating one to meet your needs. Advantages and disadvantages of each are discussed.


Table of Contents


1. Role of the Document Model in XML
2. Don’t get Distracted by Choice of Constraint Language
3. Requirements First
     3.1 Application Goals
     3.2 Existing Document Standards
     3.3 What are You Going To Do with the Documents as Modeled?
     3.4 Document Creation
4. How Many Document Models?
5. Sources of Vocabularies
     5.1 Off-the-Shelf
          5.1.1 Off-the-Shelf advantages
          5.1.2 Permissions/licensing
          5.1.3 Finding Off-The-Shelf Document Models
          5.1.4 The Bottom Line
     5.2 Altered-to-Fit
          5.2.1 Why Off-the-shelf models may not fit
          5.2.2 Alterations
     5.3 Bespoke
          5.3.1 Off-the-shelf Modules in Bespoke Models
          5.3.2 Creating Document Models
               5.3.2.1 Document Analysis
               5.3.2.2 Select Constraint Language(s)
               5.3.2.3 Develop and Document the Document Model(s)
               5.3.2.4 Test the Document Model
     5.4 There is No Best Approach

Biography

Ms. Usdin has been working with XML and XSLT since their inception, and with SGML since 1985. She chairs the Extreme Markup Languages® conference, and formerly chaired Markup Technologies and the international SGML'XX conference. Ms. Usdin has been an active participant in SGML Open committee work since the consortium was founded in 1994 and was program chair for the Mid-Atlantic SGML Users Group in 1992 and 1993. She has spoken at meetings of the Association for Computing in the Humanities, Association for Computing Machinery, SGML Forum of New York, the Mid-Atlantic SGML Users Group, and the Northern California SGML Users Group; and at the Extreme Markup Languages®, Markup Technologies, XML One, XML DevCon, International Markup, TechDoc, Seybold, SGML'XX, SGML Europe, XML Europe, Internet World, and SGML Asia/Pacific conferences.

Ms. Usdin developed a document analysis methodology called Rapid DTD Development (RDD) based on the principles of Joint Application Development, which is now widely used throughout the SGML/XML industry. She has taught document analysis theory and techniques to staff members at Bell Laboratories, Sikorsky Aircraft, and Frame Development Corporation, and to public classes at TechDoc, SGML, and XML conferences.

1. Role of the Document Model in XML

A document model (also sometimes called an XML Vocabulary or XML Tag Set) is a description of a set of XML documents that says what tags may be used in the documents, how these tags relate to each other, and what the tags may contain. This set of rules, or constraints on the documents, is the basis of communication in an XML environment.

Document models are the heart of any XML application. If the model is unsuited to the information or its intended uses, the application will be at best awkward. If the model is appropriate to the intended use, it is possible to build systems and applications that meet both the technical and esthetic needs of the users. No document model can ensure a successful project, but using inappropriate or awkward model(s) can ensure continual difficulties, if not failure.

It is true that the XML specification specifically says that document models are optional. It is also true that well-formed XML (XML that meets the basic rules of XML syntax but that is either not associated with a document model or does not meet the rules of the model) is useful in some circumstances, but not very many. In real life, the situations in which it looks like people are usefully interchanging well-formed but not valid XML are generally situations in which some mechanism other than an XML model (for example a relational database or a sample document) is used to convey, and agree upon, a document model.

A set of tagged documents, without any indication of what the tags are, what they mean, how they are related, and how they are expected to be used, is much more difficult to deal with than a similar set of documents with a known document model. (If you are responsible for formatting a set of documents, wouldn’t you like to know whether you can expect to see tables inside footnotes or not? If you receive a purchase order in XML, wouldn’t you want to be sure which address is the delivery address and which is the billing address?)
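
The purchase-order question can be made concrete with a minimal sketch in Python, using the standard-library xml.etree.ElementTree parser. The element and attribute names (order, address, kind) are hypothetical, invented for this illustration. The parser checks only well-formedness, so it accepts both documents; only an agreed model tells the receiver which address is which.

```python
import xml.etree.ElementTree as ET

# Hypothetical purchase orders; element and attribute names are illustrative.
AMBIGUOUS = "<order><address>1 Main St</address><address>9 Dock Rd</address></order>"
EXPLICIT = (
    "<order>"
    "<address kind='billing'>1 Main St</address>"
    "<address kind='delivery'>9 Dock Rd</address>"
    "</order>"
)


def delivery_address(xml_text):
    """Return the delivery address, or None if the markup does not identify one."""
    root = ET.fromstring(xml_text)  # raises ParseError only if not well-formed
    for address in root.findall("address"):
        if address.get("kind") == "delivery":
            return address.text
    return None


print(delivery_address(AMBIGUOUS))  # both documents parse without error
print(delivery_address(EXPLICIT))
```

Both documents are well-formed, but only the second lets the receiving software answer the question without guessing.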

Document model functions include:

Communication
The primary use of a document model is as a tool to allow everyone working with a document set to agree on what the markup will be, independent of, or before, the creation of the documents. It enables the development of tools, before the documents are created or complete, that will work with many documents created by many people.
Software control
Some XML-based software, especially content creation tools, is driven by document models. Such tools use the model to guide the author in creating complete and valid documents. When a document model is appropriate to an authoring task it makes the job of the content creator much easier. (Conversely, an inappropriate document model for authoring can make this task much harder.)
Validation
The model is a verifiable contract between content creators and those who will be using, processing, or manipulating the content. It allows content creators to be sure that they have produced documents that others should be able to use, and allows receivers of XML documents to know whether the documents are complete and as expected.
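
As a rough illustration of the model as a verifiable contract, here is a minimal sketch in Python. The toy model below is my own invention for this example: it records only which child elements each element may and must contain, and it ignores order, repetition, attributes, and text, all of which real constraint languages can express. The element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A toy document model: for each element, the child elements it may contain
# and the ones it must contain. All names are hypothetical.
MODEL = {
    "article": {"allowed": {"title", "para", "footnote"}, "required": {"title"}},
    "title": {"allowed": set(), "required": set()},
    "para": {"allowed": {"footnote"}, "required": set()},
    "footnote": {"allowed": {"para"}, "required": set()},
}

GOOD = "<article><title>T</title><para>Text<footnote><para>Note</para></footnote></para></article>"
BAD = "<article><para>Text</para><footnote><table/></footnote></article>"


def validate(element, model=MODEL):
    """Return a list of human-readable violations found at or below `element`."""
    rules = model.get(element.tag)
    if rules is None:
        return [f"<{element.tag}> is not defined in the model"]
    problems = []
    child_tags = [child.tag for child in element]
    for tag in child_tags:
        if tag not in rules["allowed"]:
            problems.append(f"<{tag}> is not allowed inside <{element.tag}>")
    for tag in sorted(rules["required"] - set(child_tags)):
        problems.append(f"<{element.tag}> is missing required <{tag}>")
    # Only descend into children the model permits here; the others are
    # already reported above.
    for child in element:
        if child.tag in rules["allowed"]:
            problems.extend(validate(child, model))
    return problems


for sample in (GOOD, BAD):
    print(validate(ET.fromstring(sample)))
```

With this toy model, a formatter can rely on never seeing a table inside a footnote, and a creator learns before delivery that the title is missing.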

2. Don’t get Distracted by Choice of Constraint Language

The XML 1.0 and 1.1 specifications describe one constraint language: the DTD (Document Type Definition). Since then several other document constraint languages have been developed and standardized to various extents by various organizations. If you are at the beginning stages of an XML project, I suggest that you completely ignore this fact, along with all of the advice you will no doubt receive (from well-meaning people who know very little about your requirements) on what form of constraint language you should use. Well-meaning people will tell you that you shouldn’t even consider using DTDs because they aren’t in XML syntax. Other well-meaning people will tell you that you must use DTDs because that is the only constraint language in the XML specification and the only one with 20 years of experience behind it. Well-meaning people will tell you that you shouldn’t use W3C’s XML Schema (XSD) because the tools don’t interpret XSD schemas in the same way. Well-meaning people will tell you that you must use XSD because it has strong data typing and namespace support, as well as being in XML syntax. Well-meaning people will tell you that you must avoid RELAX NG because it isn’t a W3C specification. Well-meaning people will tell you that RELAX NG is the only sensible choice because it supports strong data typing and namespaces, has an XML syntax and a compact syntax, and is an ISO standard.

This is chaos, and a distraction. Until you know what constraints you want to express, you cannot know what constraint language will best express them. And, in any real XML environment, you should expect that some of the legitimate constraints on your documents will not be enforceable through software, no matter what constraint language you use.

After you know:

you will have enough information to make decisions about constraint language. Don’t even try to approach this decision before you know these things.

3. Requirements First

Before you can make any well-informed decisions about the source of your document model, you need to identify your requirements, in your own vocabulary.

3.1 Application Goals

Document not only what you are trying to do, but also what you are not trying to do. Set limits for the application, the XML, and thus the requirements that must be supported by the model. What functions might a person think you would include in your goals that are, in fact, outside the scope of the project, either because you have decided you don’t want to do them or because they have been postponed to a later stage of development? Set expectations. Note opportunities for future development without promising that they will be developed.

3.2 Existing Document Standards

It is likely that you have existing style guides, specifications, print or web design contracts or expectations for what is in existing documents and how it is arranged. Identify those standards and learn which of the “requirements” are absolute requirements (the presence and wording of legal disclaimers may be an absolute requirement) and which are preferences that can be ignored at will. Be aware that there may be no way to tell the difference between requirements and preferences by reading the guides, but if you try to introduce an XML application that enforces rules that are simple preferences you are likely to have considerable difficulty.

3.3 What are You Going To Do with the Documents as Modeled?

For each document you want to model (each type of document and each modeled stage in the document life-cycle), identify and prioritize the functions of the document.

What end-uses should the document support? Print? Print that looks exactly like your current print products? Print that looks similar to your current print products? Electronic products? With what user interface and/or what functions? Web pages? Electronic books? Voice synthesis? Customized views based on user ID, clearance level, or location? Integration with other information? Interchange with outside organizations? In what form? What content and what level of detail in the tagging is needed for that interchange? Do these documents need to be transformed into another form? What?

3.4 Document Creation

How will the documents be created? Will they be authored by people using an XML tool? Will they be created in an automated or semi-automated fashion? Will they be converted from existing content? If so, is the content structured or unstructured? How “clean” and consistent is the existing structure? Does the current content exist in a database?

4. How Many Document Models?

In some XML environments there is one document model; in others there may be suites of several document models that are used at various parts of the document’s life and for various purposes. One of the key decisions you need to make in deciding how to approach selection or creation of a document model is what that model is for. Remember that, like many other tools, the more functions a document model serves the less well it is likely to serve any of them. An object that is a knife, a can opener, a wine bottle opener, a toothpick, a cigarette lighter, a compass, and a telephone (if such a thing exists) is not going to be a well-balanced knife that fits well in the hand, a graceful wine bottle opener, a light-weight and disposable toothpick, an elegant cigarette lighter, and so on. Similarly, a document model cannot be simultaneously optimized for creation of new content (especially with a particular editing tool), conversion of existing unstructured content into XML, long-term storage, and presentation (especially with any particular rendering tool).

If, for each of these functions, you need a document model that makes them easy (and this is not an unreasonable requirement), then you need a set of document models and a set of transformations that convert the content from each of these forms to another — or at least that convert each of these forms into a “repository” form and vice versa.
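
The difference between converting every form to every other and routing everything through a repository form is simple arithmetic. The sketch below, in Python, counts one-directional transformations under the assumption that the repository form is itself one of the models; the form names and function names are hypothetical.

```python
def pairwise_transform_count(n_models):
    """Transformations needed to convert every model directly to every other."""
    return n_models * (n_models - 1)


def hub_transform_count(n_models):
    """Transformations needed when each other model converts to and from one repository model."""
    return 2 * (n_models - 1)


def route(source, target, hub="repository"):
    """List the transformation steps from one form to another by way of the hub."""
    if source == target:
        return []
    steps = []
    if source != hub:
        steps.append((source, hub))
    if target != hub:
        steps.append((hub, target))
    return steps


forms = ["authoring", "conversion", "repository", "presentation"]
print(pairwise_transform_count(len(forms)), hub_transform_count(len(forms)))
print(route("authoring", "presentation"))
```

For four forms this is twelve direct transformations against six through the repository form, and the gap widens as forms are added.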

5. Sources of Vocabularies

5.1 Off-the-Shelf

There is an ever-growing list of XML vocabularies, including document models, publicly available. They are being created, maintained, and promulgated by government, industry, interest, corporate groups, and a variety of “standards bodies”.

While there are exceptions to all of the following generalizations, on the whole, these vocabularies are:

5.1.1 Off-the-Shelf advantages

In addition to the very useful documentation, some of the published document models also provide:

Other advantages of using a published document model include the following:

5.1.2 Permissions/licensing

Off-the-shelf document models are published with a variety of ownership claims and permissions statements. Some are open source, some are licensed, with various claims. Be sure to read the licensing or ownership claims associated with any document model before you use it. I am not aware of any legal actions about the use of an XML DTD or schema in violation of its licensing terms, but that doesn’t mean it couldn’t happen, or hasn’t happened.

5.1.3 Finding Off-The-Shelf Document Models

In some industries or areas of interest there is one group or organization that is the natural place for people to work out ways of interchanging information. Perhaps it is a group that has been developing and promulgating information interchange specifications for a long time, or is the place where people in the industry or interest area collaborate on research areas of general interest. If such an organization exists in your area of interest, that is the first place you should look for a published document model. These may be trade associations, non-profit organizations that support your activity, or national or international standards bodies. In some areas, common tools such as document models are produced and promulgated, or endorsed, by major suppliers to the industry or by a major player in the industry. For example, a major player may say: “if you want to do business with me, we will interchange information this way – you may also use the model for interchange with others in the industry if you choose”. (This 800-pound gorilla source of document models is not unusual; sometimes the 800-pound gorilla writes the model, sometimes it endorses a model developed elsewhere.)

If you are part of a regulated industry anywhere in the world, it is becoming more and more likely that the agency regulating you will specify an XML format in which you must provide information to the regulator. Typically, in this case, you may either use the model they provide internally or use your own or an altered version of their model internally and transform to their model to send them information.
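
The second option, keeping your own model internally and transforming to the regulator's model for submission, might look something like the following sketch in Python. Every element name and the mapping itself are hypothetical, and the sketch assumes data-like documents in which a simple renaming is enough; a real transformation would usually need to do more.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from internal element names to a regulator's names.
INTERNAL_TO_REGULATOR = {
    "filing": "Submission",
    "firm": "RegisteredEntity",
    "amount": "ReportedAmount",
}

INTERNAL_DOC = (
    "<filing><firm>Acme</firm><amount>100</amount>"
    "<reviewer-note>check this</reviewer-note></filing>"
)


def to_regulator_form(element, mapping=INTERNAL_TO_REGULATOR):
    """Copy a tree, renaming mapped elements and dropping internal-only ones."""
    if element.tag not in mapping:
        return None  # internal-only information is not sent to the regulator
    out = ET.Element(mapping[element.tag], dict(element.attrib))
    out.text = element.text
    for child in element:
        converted = to_regulator_form(child, mapping)
        if converted is not None:
            converted.tail = child.tail
            out.append(converted)
    return out


submission = to_regulator_form(ET.fromstring(INTERNAL_DOC))
print(ET.tostring(submission, encoding="unicode"))
```

The internal model is free to carry information the regulator never sees, such as the reviewer's note here, as long as everything the regulator requires can be produced from it.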

There are several listings of XML projects and document models. Perhaps the largest, and certainly the oldest, is Robin Cover’s listing of XML applications, published by OASIS at: http://xml.coverpages.org/xmlApplications.html. In addition, there are lists at:

5.1.4 The Bottom Line

If you can find an off-the-shelf document model that meets your needs, you can save significant time and money by adopting it. Not only can you skip the step of developing the model itself, you are likely to have a good start on the documentation and support applications you need. In addition, it is possible that your vendors will be familiar with the model (or will learn it at their own cost because they feel they should know it), and you may be able to hire staff or contractors with experience with the model.

However, if it doesn’t meet your needs, you may find yourself modifying the way you do a significant part of your business to accommodate the model, or (more likely) doing duplicate work — using another mechanism to record and track information you need that is not accommodated in the model or even maintaining two sets of critical documents.

5.2 Altered-to-Fit

After hearing all of those advantages of using an off-the-shelf document model, why even consider any other options? Because off-the-shelf models are based on a series of compromises. They were, typically, developed by groups of people who balanced various competing needs as best they could, but not necessarily the way you would make those decisions for your environment.

5.2.1 Why Off-the-shelf models may not fit

XML can be introduced at many places in the document life cycle. The earlier in the life cycle it is introduced, the more important it is that the document model be compatible with your business rules. To take an extreme example: if you continue to develop, modify, and publish your documents as you did before you considered XML, and simply take snapshots (for example, printed publications) and add to your existing processes a step that makes XML versions of those documents, the only requirements for the document model are that it be practical and affordable to convert your documents to it and that it meet the needs of whomever you want to receive the XML. You don’t need to worry about how the XML model matches your content creation, editorial, or revision processes because it won’t be involved in those processes. (You won’t see any advantages from the capabilities of XML in those processes, either. But that’s another story.)

However, if you want to introduce XML early in the document lifecycle, it is critical that the document model(s) be supportive of your business processes. (One of the reasons that some businesses will not share/publish their document models is that the document models are not only accurate records of their business processes, in some cases they drive parts of the process.) In this scenario, the document model is key to the success of the XML application, and using a model that was developed for other purposes is very risky.

5.2.2 Alterations

Instead of adopting a model that was developed by a committee of outsiders, in which your organization may not even have participated, many XML users choose to start with an off-the-shelf document model and alter it to fit their needs. Among the advantages of this approach are:

There is also a subtle but real psychological advantage, known to all coin-flippers (who say that the minute the coin is tossed, you know how you wish it would land). There is no better way to know what you want and need than to review something that isn’t quite right.

If you are going to alter an off-the-shelf model to fit there are a few things you should look out for:

Permissions
Most public XML document models specify who owns them, who is allowed to use them, and how they may be altered. Restrictions on altering them may include that:
Be sure to read the permissions and restrictions on any off-the-shelf model before you decide to modify it. A few completely prohibit alterations; but this is quite rare.
Maintenance
Make sure that the mechanism you use to modify the published model will allow you to make changes as your needs change and will allow you to adopt new versions of the published model as it changes.
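
One way to meet the maintenance requirement is to record your alterations separately from the published model and apply them mechanically, so that they can be reapplied to each new release. The sketch below, in Python, uses plain dictionaries as a stand-in for a real constraint language, which would have its own customization mechanism. The model contents, the version names, and the added "part-number" element are hypothetical.

```python
import copy

# Hypothetical published model, release 1: element name -> allowed attributes.
PUBLISHED_V1 = {"table": {"frame", "colsep"}, "para": {"id"}}
# A later release of the published model adds an attribute to "table".
PUBLISHED_V2 = {"table": {"frame", "colsep", "rowsep"}, "para": {"id"}}

# Local alterations, recorded separately and never edited into the published files.
LOCAL_CHANGES = {
    "add_elements": {"part-number": {"id"}},
    "add_attributes": {"table": {"scale"}},
}


def apply_local_changes(published, changes):
    """Return a new model: the published model plus the local alterations."""
    model = copy.deepcopy(published)  # leave the published model untouched
    for element, attributes in changes["add_elements"].items():
        model[element] = set(attributes)
    for element, attributes in changes["add_attributes"].items():
        # A KeyError here means the published model no longer has the element,
        # which is exactly the kind of change you want to hear about.
        model[element] = model[element] | attributes
    return model


print(sorted(apply_local_changes(PUBLISHED_V1, LOCAL_CHANGES)["table"]))
print(sorted(apply_local_changes(PUBLISHED_V2, LOCAL_CHANGES)["table"]))
```

Because the published model is never edited in place, adopting release 2 is a matter of rerunning the same alterations against the new files and reviewing whatever no longer applies.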

5.3 Bespoke

Since the beginning of XML, and SGML before it, there have been published document models. Bespoke models, models that are designed, developed, documented, and maintained specifically to meet an organization’s requirements, are a major expense. And yet major applications have developed, and continue to develop, their own document models.

Bespoke document models allow you to realize one of the commonly discussed virtues of XML: you can identify everything that matters to you, calling it what you want to call it, and imposing rules on its shape that make sense to you. Among the benefits organizations see in bespoke document models are:

If you want a set of document models, each optimized for your content at a different place in the lifecycle, you will need either bespoke models or altered-to-fit public models. This is partly because few if any of the published models are available in more than one form, and because the published models are rarely optimized for any single function; they try to be all things to all users in their area of interest.

If you create a document model you take on the responsibility of ensuring that the model:

5.3.1 Off-the-shelf Modules in Bespoke Models

Most document models being developed today contain modules of existing, off-the-shelf models. This is true for models that are being developed for use in a particular environment as well as for those that are being developed for interchange within an industry. It does not mean that these incorporated modules are perfect, or even better than their competition. It means they are more useful and convenient because of widespread adoption and a developed software base.

For example, while document modeling committees may argue at length over whether they should adopt the OASIS version of the CALS table model or the XHTML table model, practically nobody considers developing a table model de novo. And they shouldn’t, unless there is a very good reason to. Table processing is complex, and developing tools for the authoring, manipulation, and display is one of the more difficult jobs in XML software. Tools exist to convert from spreadsheets into the two common XML table models, and to make authoring tables using them relatively easy. Stylesheets for display of XML in these models or import into non-XML tools are available for these table tag sets. It is occasionally useful to enhance one of them (for example, to add a “scale” attribute to the “table” element), but this still allows people throughout the document lifecycle to use existing table software.

Similarly, most people who want an XML encoding for mathematical expressions use MathML. Not because there is a regulation that says they have to, but because there is an increasingly well supported existing tag set with off-the-shelf software implementations, and creating a tag set for mathematical expressions would be an enormous amount of work (as would creating tools for its creation and rendering).

5.3.2 Creating Document Models

5.3.2.1 Document Analysis

Document models should be based on document analysis: the process of collecting sufficient information about the useful and relevant components of an information collection (as opposed to all possible components) to construct the model. During analysis, data elements are identified in current data repositories, hardcopy documents, and electronic products; the interrelationships between different portions of information are determined; and anticipated future requirements are evaluated. Such an analysis should be used to determine the information needed in any document model.

The people best able to name and describe these information components are the people who create, edit, and research the document’s content, those who support the current system, those who create presentations of the information, and those who will use, implement, or support future print and electronic information products. I have had very good results from collaborative Document Analysis meetings in which subject matter experts and publishing staff make design decisions, facilitated by XML experts who guide the discussion and record the results.

5.3.2.2 Select Constraint Language(s)

Only now, after you have decided that a custom document model is appropriate, and you know what constraints you want to enforce on the document collection, can you make an informed decision on constraint language. Remember that there is no rule that you may use only one, just as there is no rule that you have one document model for your entire document lifecycle. It is quite common, for example, to use a DTD to guide authoring and RELAX NG or Schematron to do additional validation on completed documents.
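
To illustrate why one constraint language may not be enough, the sketch below separates two kinds of check in Python. The first is the grammar-style rule a DTD can state (which elements must be present); the second covers relationships between values, which a DTD cannot state and which a rule-based language such as Schematron is designed for. The contract document, its element names, and both rules are hypothetical, and the Python functions only stand in for real schemas.

```python
import xml.etree.ElementTree as ET
from datetime import date

SAMPLE = "<contract status='final'><start>2024-06-01</start><end>2024-01-31</end></contract>"


def structural_problems(root):
    """Grammar-style check: the kind of rule a DTD can state."""
    problems = []
    if root.tag != "contract":
        problems.append("root element must be <contract>")
    for required in ("start", "end"):
        if root.find(required) is None:
            problems.append(f"<contract> is missing <{required}>")
    return problems


def business_rule_problems(root):
    """Rule-style check: relationships between values in the document."""
    problems = []
    start, end = root.findtext("start"), root.findtext("end")
    # Dates are assumed to be in ISO form (YYYY-MM-DD); anything else raises ValueError.
    if start and end and date.fromisoformat(end) < date.fromisoformat(start):
        problems.append("<end> must not be earlier than <start>")
    if root.get("status") == "final" and root.find("approver") is None:
        problems.append("a final contract must name an <approver>")
    return problems


root = ET.fromstring(SAMPLE)
print(structural_problems(root))
print(business_rule_problems(root))
```

The sample passes the structural check and fails both business rules, which is the situation a second validation pass on completed documents is meant to catch.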

5.3.2.3 Develop and Document the Document Model(s)

The mechanics of writing a document model, in any of the constraint languages, are fairly easy, but don’t let a beginner do it. The mechanics of writing a contract are fairly easy, but you wouldn’t let a beginner write a business-critical contract, which is what a document model is.

Design decisions made during the development of the document model affect how easy it will be to:

Documenting an XML document model is as important as developing it well. Tag set documentation is what makes it possible to have multiple people or organizations use the tags in the same way over time. No matter how carefully you name your elements, there will be times when it isn’t clear to a user or developer what was meant, and more dangerous yet, there will be times when two users, each sure they know what was meant, disagree with each other.

Good tag set documentation often takes longer, and costs more, than tag set development. And it is worth every minute and every penny. Good tag set documentation provides, for those who cannot or do not like to read the syntax of the document, the same information as the model and, in addition, tagged and formatted examples, structural diagrams, and definitions or explanations that would not fit within the structure of the model (in any of the constraint languages).

5.3.2.4 Test the Document Model

A cartoon that hung over my father’s desk for years shows an older man showing a paper to an embarrassed younger man and saying “There are two things you can’t delegate: proofreading and fatherhood”. I would add to that: testing critical systems components, such as document models. Delegating document model testing is delegating decisions about what is, and is not, important to you.

Document model testing by a user community is intended to:

Testers should try to address the following areas:

Comprehensiveness
Can all of the information you want to store or use be recorded in the tags defined by the model? In a straightforward way? (No chopping off a step-sister’s toes to fit the glass slipper!)
Completeness
Is it possible to supply all of the information that is required? At all? Easily?
Fitness
Does the “logic” of the model reflect the logic of existing materials? If not, is this because there was a decision to change the way you do business, and does the model reflect the way you want to start working?
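
Part of the comprehensiveness and completeness questions can be checked mechanically once testers have tagged some samples. The following is a minimal sketch in Python, assuming that testers were told to invent a tag whenever the model offered nothing suitable; the recipe tag set and the sample are hypothetical. Fitness is a matter of judgment and is not something a script like this can assess.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag set under test, and the elements it makes mandatory.
MODEL_TAGS = {"recipe", "title", "ingredient", "step", "yield"}
REQUIRED_TAGS = {"title", "yield"}

SAMPLES = [
    "<recipe><title>Soup</title><ingredient>Water</ingredient><warning>Hot</warning></recipe>",
]


def tags_used(samples):
    """Collect every element name that appears in the tagged sample documents."""
    used = set()
    for xml_text in samples:
        for element in ET.fromstring(xml_text).iter():
            used.add(element.tag)
    return used


def coverage_report(model_tags, required_tags, samples):
    """Compare a tag set with what testers actually needed and could supply."""
    used = tags_used(samples)
    return {
        # Comprehensiveness: content the testers needed but the model lacks.
        "not_in_model": sorted(used - model_tags),
        # Completeness: required elements that no sample could supply.
        "required_never_supplied": sorted(required_tags - used),
        # Elements no sample exercised; the sample set may be too small.
        "never_used": sorted(model_tags - used),
    }


print(coverage_report(MODEL_TAGS, REQUIRED_TAGS, SAMPLES))
```

A report like this is a prompt for discussion with the testers, not a verdict: an invented tag may point to a gap in the model or to a gap in its documentation.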

Among the techniques we suggest for tag set testing are:

Dry tagging
Taking a paper copy of an existing or mock-up document, circling everything of interest on it, and identifying which elements and attributes are associated with each circle. Note if there are any important structures, or any typographically distinct content, that cannot be identified using the tag set. Then check to see that all content that is required by the tag set has been, or can be, identified.
We find that editorial and management people can do very useful document model testing using dry tagging. This test is a good test of the tag set and a very good test of the tag set documentation!
Guided tagging
Use a context-sensitive XML editor to tag a set of sample documents. Allow the software to guide you through building the documents, filling in the information as the tool permits. Notice if there is any information in the sample document that cannot be included, required information that is not available, or awkward sequencing.
We find that this testing is best done by people familiar with XML and XML editing tools. If no one is available who is familiar with the documents, the organizational requirements, and XML editing tools, we suggest that an editor sit beside a techie to do this testing.
Sample transformation
While it is probably not practical to make any of your final information products from test files, it is usually practical, and informative, to write at least one transformation from the tag set you have just defined to another. In many cases, making an HTML display of XML content is an easy way to identify difficulties in nesting levels and containment.
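
A sample transformation of this kind can be very small. The sketch below, in Python with the standard-library ElementTree module, maps a hypothetical tag set to HTML and records a warning wherever the nesting has no clean HTML equivalent (HTML does not allow a list inside a paragraph or a heading). The tag names and the mapping are assumptions made for illustration; in practice this is often written in XSLT instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from the new tag set to HTML element names.
TO_HTML = {"article": "div", "title": "h1", "para": "p", "emph": "em", "list": "ul", "item": "li"}
# HTML elements that may not contain block-level content, and the block-level ones.
NO_BLOCKS_INSIDE = {"p", "h1"}
BLOCKS = {"div", "p", "ul", "h1"}

SAMPLE = (
    "<article><title>Setup</title>"
    "<para>Do this:<list><item>Step <emph>one</emph></item></list></para></article>"
)


def to_html(element, warnings):
    """Convert one element and its descendants, noting awkward nesting."""
    html = ET.Element(TO_HTML.get(element.tag, "span"))  # unmapped tags become <span>
    html.text = element.text
    for child in element:
        converted = to_html(child, warnings)
        converted.tail = child.tail
        if html.tag in NO_BLOCKS_INSIDE and converted.tag in BLOCKS:
            warnings.append(f"<{child.tag}> inside <{element.tag}> has no clean HTML equivalent")
        html.append(converted)
    return html


warnings = []
page = to_html(ET.fromstring(SAMPLE), warnings)
print(ET.tostring(page, encoding="unicode"))
print(warnings)
```

Here the transformation surfaces a containment decision (lists inside paragraphs) that the model's designers should confirm they intended.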

5.4 There is No Best Approach

In some situations a new, custom document model is the only option, or could be a far more comfortable fit than anything off-the-rack, even if altered to fit. In more cases, an XML application can save a lot of time and cost by starting with an existing model and altering it to fit. And in a small but increasing number of cases a document model that meets your needs as well as anything you could create is available for the taking.

The key to success is to identify your requirements and expectations and then select a document model source, and document model, that meets those needs.
