XML Europe 2002

XSD Schemas in Book and Journal Publishing

Abstract

Many commercial publishers were looking to XML Schema technologies to replace DTDs and to solve some of the quality and data typing problems inherent in large volumes of marked-up data. But do XML Schemas actually provide a better way to define full-text documents?

This presentation outlines the motivation behind the decision by Cambridge University Press to pilot development of an XSD Schema for use in its book and journal production.

The advantages of Schemas over DTDs were readily apparent, especially their more meaningful data typing and better XML Namespace support. Data typing would allow quality assurance checks to be embedded in the Schema, rather than devolved to external programs. Namespace support would allow vocabularies to mingle freely together without some of the hacks necessary with DTDs.

The main findings of the pilot can be summarized thus:

  1. Initial development was hampered by the (understandable) difficulty of finding tools whose behavior conformed properly to the Recommendation, though many vendors are promising Schema support ‘real soon now’.

  2. Tools that did manage to parse the Schema were many times slower than those which validated against an equivalent DTD. Moreover, the multi-entity disposition of a large Schema means access via HTTP over a slow network is tortuous.

  3. XSLT-based transformation of XML data is obviously a common activity, yet with XML instances valid to a Schema, this becomes a much more involved process.

  4. In-depth understanding of XML among suppliers to the publishing industry is still variable, and the introduction of an XSD Schema with its attendant rigors would be likely to cause expensive confusion.

  5. The data typing available through XSD Schemas falls well short of that necessary for use in full-text documents, being unable, for example, to allow a processor to determine something as fundamental as whether a table can be rendered.

  6. There are, however, some positive suggestions that emerge through Schema development, particularly concerning how a DTD can be split for modularisation.


Table of Contents

1. Background
2. DTDs and Their Shortcomings
3. Stopgap Solutions
4. The Promise of Schemas
5. Auguries
6. The Trouble with Documents
7. The Bleeding Edge
8. The Real World
9. Performance
10. Interoperability
11. Trouble Spots
12. The Knowledge Gap
13. Conclusions
Acknowledgements
Bibliography
Glossary
Biography

1. Background

For publishers, SGML and XML have held out the prospect of achieving greater consistency and integrity of content; in other words, greater quality. Most people who have worked with large volumes of human-created SGML or XML data, however, will confirm that consistency and integrity are the exception, rather than the rule. This is hardly surprising: SGML and XML have presented publishers with technological challenges which they are typically not equipped to handle. Management of digital information requires working practices and tools closer to those used in the software development industry than to those used in the publishing and print industry.

This paper is not primarily concerned with the working practices associated with SGML and XML, but with how software (rather than people) could help improve publishers' data quality. The W3C's XML Schema language (I will refer to this language as the XSD Schema language from now on) holds out the promise that Schemas will ‘allow machines to carry out rules made by people,’ i.e., that the software will take more of the strain of checking for consistency and integrity of content. If these goals can be realized, then the XSD Schema language will certainly be of interest to publishers.

Directly linked to the theoretical benefits of XSD Schemas, there are strong business drivers: inconsistent data and poor data integrity cost money, so, most obviously, an effective Schema language, carrying out the rules made by people, should be able to reduce costs. In addition to this is the vague but widespread feeling that Schemas are the ‘now’ of XML and are to be embraced: not to do so could leave a publisher in the wake of those embracing XML more conspicuously.

2. DTDs and Their Shortcomings

Before Schemas, there were, of course, DTDs. But while the DTD made a great stride in enabling ‘generalized markup,’ its shortcomings in publishing workflows are apparent to many. The problem is that a DTD (Document Type Definition) defines in only a very restricted sense the type of a document. As has often been said, a DTD is all syntax and no semantics. So, to take a non-publishing example, a document of type ‘credit card transaction’ might contain name, credit card number, expiry date, etc., and a DTD can certainly express this grammar adequately:

<!ELEMENT credit-card-details
 (holder-name, card-type, card-number, expiry-date, 
  issue-number?, holder-address)>
<!ELEMENT card-type (#PCDATA)>
<!ELEMENT holder-name (#PCDATA)>
<!ELEMENT card-number (#PCDATA)>
<!ELEMENT expiry-date (#PCDATA)>
<!ELEMENT issue-number (#PCDATA)>
<!ELEMENT holder-address (#PCDATA)>

But the real ‘type’ of the document is much more complex than can be expressed by grammar alone. The holder name has a certain minimum and maximum length; the card-number and expiry-date must conform to a number of pre-determined formats, and so on. Not even the grammar is rich enough: for certain kinds of debit card an issue number must be specified; for others no such number appears on the card. In other words, there is no semantic checking of content, and the grammar can only be crudely defined. It is perfectly possible to create a document valid to this DTD which has no content in any of the elements, and a parser will say ‘Yup, that's fine: this is a document of type credit card details.’

To translate this example into publishing terms, there is no way a DTD alone can prevent all sorts of problems in content that might parse against a DTD perfectly. To take some examples I have come across, ISBN elements with content of ‘don't know’; metadata headers where every element is blank; document instances that ignore all elements other than those which achieve the correct look; and tables that parse fine, but which can never be rendered.

3. Stopgap Solutions

DTDs are also not so good at using some of the adjunctive technologies that have emerged around the central XML 1.0 Recommendation. XML Namespaces, for example, require something of a sleight of hand to use at all, and their full usefulness is not realized when they are fixed onto element declarations from within a DTD.

Similar sleights of hand have been available to augment the grammar-based rules of DTDs so that software could check for a richer notion of document type.

One approach is to use defensive design techniques when authoring a DTD. Traditionally a DTD is authored following a document analysis which discovers a grammar within a document set: the grammar is encoded by the DTD. However, some of the optionality present in a document set may only be necessary in very rare cases. For example, in a particular journal, it may be possible to omit the heading of a section. Yet if the DTD services other journals, this heading element must be made optional because of the requirements of just one title. In the majority of cases this is inappropriate optionality and an invitation for tag abuse.

A defensively designed DTD would help to offset such potentially dangerous optionality by making the choice more explicit. For example, it might define an empty element noHeading which could be used instead of a heading, thus forcing the user of the DTD to make a positive choice about the type of section they are creating.
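As a minimal sketch of the defensive technique just described, the declarations below force a positive choice between a heading and an explicit marker for its absence. Apart from noHeading, which is named above, the element names (section, heading, para) are hypothetical and not drawn from any particular publisher's DTD.

```xml
<!-- Hypothetical names: section, heading and para are illustrative only. -->
<!-- A section must open with either a heading or an explicit noHeading;
     the heading can no longer be silently omitted. -->
<!ELEMENT section   ((heading | noHeading), para+)>
<!ELEMENT heading   (#PCDATA)>
<!ELEMENT noHeading EMPTY>
<!ELEMENT para      (#PCDATA)>
```

The optionality is still there, but omitting a heading now requires a deliberate act that a reviewer or a QA tool can search for.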

Such techniques can only make a slight improvement in the safety of a DTD. Most major publishers have stepped entirely beyond the grammar-based rules of DTDs and developed custom quality assurance (QA) tools which will attempt to validate document data not just against a DTD, but against content rules embedded in the QA application. For example, such a tool may well check that the format of an ISBN is valid, or that a reference to a journal article in a bibliography contains enough data to resolve to an online resource through CrossRef or other link resolution services.

Inevitably such tools, typically developed in a programming language such as Perl or Java, are proprietary and difficult for non-programmers to understand or extend.

4. The Promise of Schemas

The promise of Schemas is to provide such rules-based checking within a proper XML framework.

Unlike DTDs, they are not all syntax and no semantics, but instead have a richer conception of type which can be applied to elements. For example, a date element type need no longer be free-text #PCDATA, but can be assigned a data type of date which will signal to any processing application that its content should conform to a flavor of the ISO 8601 date format.
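To make this concrete, here is a sketch of how three of the elements from the credit card DTD in section 2 might be typed in an XSD Schema. The length limits and the digit range are assumed values chosen for illustration, and the use of the built-in gYearMonth type for the expiry date is one plausible choice rather than the only one.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Expiry date as a year and month, for example 2004-06 -->
  <xsd:element name="expiry-date" type="xsd:gYearMonth"/>

  <!-- Length limits are assumed values for illustration -->
  <xsd:element name="holder-name">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:minLength value="1"/>
        <xsd:maxLength value="60"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

  <!-- Digits only; the 13 to 19 digit range is an assumption -->
  <xsd:element name="card-number">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:pattern value="[0-9]{13,19}"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

</xsd:schema>
```

An empty holder-name or a card-number of ‘don't know’ would now be rejected by a conforming validator. Note, though, that a pattern constrains only the lexical form: it cannot verify a check digit, so that kind of rule would still fall to external software.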

Schemas also allow richer grammars than were available with DTDs, allowing such features as the setting of minimum and maximum number of occurrences for repeated elements, and permitting some of the content models that were definable in SGML back into XML: for example, the rule that elements must occur within a parent, but in any order.
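Both features can be sketched briefly. The element names below (author-group, pub-info and their children) and the limit of six authors are hypothetical, chosen only to show the constructs.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Occurrence bounds: between one and six authors (limit is assumed) -->
  <xsd:element name="author-group">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="author" type="xsd:string"
                     minOccurs="1" maxOccurs="6"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <!-- Any order: the children may appear in whatever sequence the
       author keys them, each at most once -->
  <xsd:element name="pub-info">
    <xsd:complexType>
      <xsd:all>
        <xsd:element name="isbn" type="xsd:string"/>
        <xsd:element name="publisher" type="xsd:string"/>
        <xsd:element name="pub-date" type="xsd:date" minOccurs="0"/>
      </xsd:all>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
```

In a DTD the first rule would need the author element written out up to six times with optional markers, and the second would need every permutation enumerated.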

As you would expect, the XSD Schema language handles better some of the XML adjunctive technologies that have emerged since XML 1.0. Unlike in DTDs, for example, XML Namespaces are integral to the design of XSD Schemas, and this holds out the promise of being able to combine different Schema fragments when assembling a new Schema without worrying about name collisions, as well as having some object-orientation-like features available, such as the ability to extend the type of an element. So, in practice, this would enable us to combine a pre-written table module with a document modeling Schema, and override the content model for its cell element to specialize it for our own element content. (This can of course be done, in a distinctly un-object-oriented way, using parameter entities in DTDs.)
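The type extension mentioned here might look something like the following sketch. To keep it in one file, the base type cellType is a stand-in for whatever a pre-written table module would supply; it, articleCellType, para and note are all hypothetical names.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Stand-in for a cell type that a table module might supply -->
  <xsd:complexType name="cellType">
    <xsd:sequence>
      <xsd:element name="para" type="xsd:string" maxOccurs="unbounded"/>
    </xsd:sequence>
    <xsd:attribute name="align" type="xsd:string"/>
  </xsd:complexType>

  <!-- Local specialization: everything in cellType, then optional notes -->
  <xsd:complexType name="articleCellType">
    <xsd:complexContent>
      <xsd:extension base="cellType">
        <xsd:sequence>
          <xsd:element name="note" type="xsd:string"
                       minOccurs="0" maxOccurs="unbounded"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

  <xsd:element name="cell" type="articleCellType"/>

</xsd:schema>
```

Extension can only append to the end of the base content model, which is a tighter discipline than redefining a parameter entity, where anything may be substituted.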

5. Auguries

All the signs are good. There is a general feeling that Schemas' time has come, which can be well summed up by this posting to the XML-dev mailing list in January 2001: ‘I get the feeling after reading this list and other sources that DTD is on the skids and likely to become obsolete and that Schema is generally considered the future’,[Atchley] or by papers such as one given at XML 2001 in Orlando, which concluded that ‘XML Schema in fact provides appealing and important capabilities for publishing applications.’[Houser]

Developer support is there too: the Apache XML Project's Xerces parser supports XSD Schemas;[Apache] Sun has support for XSD Schemas (and other Schema languages) in its Multi-Schema XML Validator;[Sun MSV] and Microsoft's MSXML 4.0 software[MSXML] promises ‘support of the World Wide Web (W3) Consortium's final recommendation for XML Schema.’

In addition to these developer-level tools there are more user-focused desktop applications, with relatively easy-to-use graphical user interfaces, for example SoftQuad's XMetaL 3.0, or Altova's XML Spy, which promises point-and-click Schema authoring.

6. The Trouble with Documents

Document modeling has always presented a distinct challenge to authors of SGML and XML DTDs, but one in which the technology, as far as it went, was in sympathy with the problem. SGML was designed as a language for document representation usable for publishing in its broadest definition.[SGML] Leaving aside the question of what the difference is between documents and data, it is clear with reference to our credit card example that DTDs alone failed to provide an adequate way of modeling data-like content. But this is only one of the challenges facing authors of document DTDs and Schemas. Other problems include:

  1. Size and complexity: A typical full text document DTD for, say, journal articles, can contain around 300 element type declarations. I have seen a publisher's full text DTD which declared over 600 element types.

  2. Modularity: Publishers often require extensibility mechanisms built into DTDs in the manner of the TEI DTD.[TEI] Such mechanisms can enable element type renaming, switching on and off of portions of the DTD, and parameterization of content models.

  3. Use of de facto standards: where possible it makes sense to re-use DTDs which have already been written. This not only saves development costs, but often has the added advantage that software already exists for processing these portions of content. A modern XML DTD will, for example, typically use at least XLink[XLink], the OASIS Exchange Table Model[SOEXTBLX] and MathML.[MathML]

7. The Bleeding Edge

The cutting edge is the bleeding edge, and so it still proves with Schema software. At the time of writing, there are no Schema versions of MathML, XLink, or the OASIS Exchange Table Model available. ‘No problem,’ you may think, ‘we'll use some software to convert from DTD to Schema.’ But there is a problem when the conversion produces an XSD Schema which deviates from the Recommendation. I understand jet aircraft manufacturers have three avionics software systems developed to an identical specification for each type of aircraft; each aircraft control decision is then voted on by the three systems. So it is with Schema software at the moment: it is a good idea to use several applications together and arrive at a consensus on whether a Schema is correct or not, and whether a document is valid or not to that Schema. So for our standard components, only after some head-scratching and handcrafting do we have XSD Schema equivalents.

I'm not sure it's appropriate to criticize the shortcomings of Schema software, since a lot of people are working hard to grapple with very difficult technical problems. But it is as well to be aware that there are now problems with much of the available Schema software and not just the free stuff.

To take a random example: Sun's ambitious Multi-Schema XML Validator[Sun MSV] complained of unimplemented features when it encountered the following attribute definition: <xsd:attribute ref='xlink:type' fixed='title'/>. Since the ability to fix the type of an XLink attribute is essential for XLink use, this rather spoils the fun. Of course Sun's software is offered as a technology preview, yet publishers must be aware that there is little choice when deciding on robust Schema processing software which will be up to handling document content, especially if they require a cross-platform solution.

8. The Real World

During my period as a research student, I would frequently encounter former students who had gone on to get a proper job. From time to time such people would refer to the ‘real world’ – somewhere I, as an academic, was presumed not to inhabit. After a while I drew the conclusion that the phrase ‘real world’ was only used by people who weren't having such a good time as oneself.

Now that I've crossed over into the commercial world, I see little reason to revise my view of the phrase.

A fair proportion of the Schema software I came across was fine with the sort of examples that are familiar from tutorials (menus, memos and the like), but show it the sort of Schema that would be typical in a document publishing workflow, and problems emerge. As a test I suggest that, when faced with a piece of validating software, or an XSLT script, or anything that claims to understand XSD Schemas, you see how it behaves when faced with the redefine feature: this sorts the sheep from the goats.
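For readers who want to try this test, a small redefine case could be sketched as follows. It assumes a hypothetical companion file, table-module.xsd, which has no target namespace and defines a named complex type called cellType; neither the file nor the type name comes from a real module.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- table-module.xsd is a hypothetical file assumed to define
       a complex type named cellType in no namespace -->
  <xsd:redefine schemaLocation="table-module.xsd">

    <!-- Inside redefine, a type is derived from itself: the base
         refers to the original definition in the redefined file -->
    <xsd:complexType name="cellType">
      <xsd:complexContent>
        <xsd:extension base="cellType">
          <xsd:sequence>
            <xsd:element name="note" type="xsd:string" minOccurs="0"/>
          </xsd:sequence>
        </xsd:extension>
      </xsd:complexContent>
    </xsd:complexType>

  </xsd:redefine>

</xsd:schema>
```

A processor that handles this correctly should accept a note child in any element of type cellType declared in the module, even though the module itself says nothing about notes.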

9. Performance

It is slower to validate XML against a Schema than against a DTD. To get an idea of how much slower, I parsed two similar documents against a nearly equivalent DTD and XSD Schema for modeling journal articles. Each had around 300 element declarations and was split across multiple files. The parser used was Xerces-J 1.4.1,[Apache] and so there is some overhead for firing up the Java Virtual Machine (JVM). The time taken for parsing against the DTD was 3.4 seconds; against the XSD Schema it was 6.7 seconds. So, allowing for Java, we're probably looking at a parse taking 2 to 3 times longer when using Schemas.

Another performance issue arises when putting an XSD Schema on the Web. A DTD can be normalized into one physical file and placed on the Web for all to access; but any XSD Schema which uses multiple namespaces will have no option but to be split physically across files. Thus when the Schema is accessed over the Web, a number of Hypertext Transfer Protocol (HTTP) connections need to be established, further slowing down the validation process.
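The reason for the physical split is that a schema document declares at most one target namespace, so each further namespace has to be brought in with an import. A sketch of the top of such a Schema follows; the example.org URLs and the file names xlink.xsd and mathml.xsd are placeholders, not real published schemas.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://example.org/ns/article">

  <!-- Each import names a separate schema document; a processor that
       follows the location hints makes one retrieval per document -->
  <xsd:import namespace="http://www.w3.org/1999/xlink"
              schemaLocation="http://example.org/schemas/xlink.xsd"/>
  <xsd:import namespace="http://www.w3.org/1998/Math/MathML"
              schemaLocation="http://example.org/schemas/mathml.xsd"/>

  <!-- Placeholder declaration standing in for the article vocabulary -->
  <xsd:element name="article" type="xsd:string"/>

</xsd:schema>
```

If the imported documents themselves include or import further files, the number of connections grows accordingly.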

10. Interoperability

By interoperability I mean the ability of Schema-governed XML instances to coexist with other parts of the XML technology set, like XSLT.

When an XML instance is governed by an XSD Schema, it is impossible to be sure of that document instance's content without reference to its Schema for such items as fixed attribute values. This information is collectively termed the Post Schema Validation Infoset (PSVI), and is at this date poorly supported by software tools. Whereas existing XML APIs like the Simple API for XML (SAX)[SAX] do offer programmatic access to a DTD's content, there is as yet no accepted API for interrogating Schema content.

In practical terms, the consequences are serious. XSLT engines tend to be based on existing XML technologies, and as such at the time of writing I know of no way to transform Schema-governed XML using XSLT, other than with a bit of programming and with fingers crossed. No doubt XSLT 2.0 will resolve this problem, but that is tomorrow, not today.
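A small sketch shows where the gap opens. The element name xref and the attribute kind are hypothetical; the point is only that the attribute value lives in the Schema and not in the instance.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="xref">
    <xsd:complexType>
      <xsd:simpleContent>
        <xsd:extension base="xsd:string">
          <!-- The instance may omit this attribute; a validating
               processor contributes the fixed value to the PSVI -->
          <xsd:attribute name="kind" type="xsd:string"
                         fixed="bibliographic"/>
        </xsd:extension>
      </xsd:simpleContent>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
<!-- Instance as keyed:        <xref>Smith 2001</xref>
     After Schema validation:  kind="bibliographic" is present in the PSVI
     Without Schema awareness: the element has no kind attribute at all -->
```

A stylesheet rule that selects on the kind attribute would therefore behave differently depending on whether the parser feeding the XSLT engine has consulted the Schema.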

11. Trouble Spots

The unholy trinity in document publishing, especially for XML-based systems, are tables, mathematics and special characters, so any benefits Schemas can bring to handling of these constructs would be welcome.

The mathematics problem is being solved by MathML (currently only a DTD). It remains to be seen whether a Schema version of MathML can offer any benefits over the DTD version.

Tables are commonly modeled by publishers with the OASIS Exchange Table Model. This DTD has only seven elements, but allows more or less adequate modeling of all but the most complicated tables. Of course the problem with a DTD-based table grammar is that it can have absolutely no conception of the complex rules which control the layout of content into an arbitrary grid. Underlength rows, overlength rows, invalid spans, colliding spanned columns: all of these anomalies can exist in a table instance and sail by a parser without a murmur.

Unfortunately, Schemas are not capable of checking for any of these errors. So publishers will have the same problems with tables that they had when using DTDs. Custom software will still be required to check the content. I suggest as an acid test for any Schema language that it should be able to validate an XML-modeled table for its semantics. Anything less than this, I suggest, probably does not represent enough of an advance over DTDs to bother with.

For special characters, it should be noted that XSD Schemas have no support for declaring named character entities. In a world where Unicode editors and fonts are rare, and in which XML is still sometimes keyed, this can cause distress to production staff who find the string '&alpha;' more intuitive than '&#945;'.
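To illustrate the kind of instance in question, the table below is a sketch using Exchange Table Model element and attribute names with made-up content. It satisfies the grammar, since each row simply contains one or more entries, yet two of its three rows cannot be laid out in the declared grid.

```xml
<table>
  <tgroup cols="3">
    <colspec colname="c1"/>
    <colspec colname="c2"/>
    <colspec colname="c3"/>
    <tbody>
      <row>
        <entry>a</entry><entry>b</entry><entry>c</entry>
      </row>
      <!-- Overlength: four entries in a three-column group -->
      <row>
        <entry>d</entry><entry>e</entry><entry>f</entry><entry>g</entry>
      </row>
      <!-- Invalid span: c9 is not a declared column name -->
      <row>
        <entry namest="c1" nameend="c9">h</entry>
      </row>
    </tbody>
  </tgroup>
</table>
```

Catching either fault means comparing an entry count or a column name against values declared elsewhere in the same table, which is a relationship between parts of the instance rather than a property of any one element's content model or data type.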
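One workaround, sketched here under the assumption that the processing tools read the internal DTD subset, is to keep a DOCTYPE declaration in the instance purely for its entity declarations while leaving validation to the Schema. The element name para is hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE para [
  <!-- Declared in the DTD subset because XSD has no equivalent -->
  <!ENTITY alpha "&#945;">
]>
<para>The &alpha; particle</para>
```

This keeps the keyed text readable, at the cost of carrying a fragment of DTD machinery alongside the Schema it was meant to replace.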

12. The Knowledge Gap

Perhaps one of the biggest barriers to successful introduction of XSD Schemas into publishing is the prevailing level of XML knowledge in the industry. In publishing, margins are often tight and cost levels important, and this has led to the use of much offshore supply of services to the publishing industry, notably typesetting from the Asian subcontinent.

Understandably, the level of technical expertise among suppliers, and in publishing production generally, is not great. Those few staff with in-depth knowledge of XML tend to migrate to more profitable industries.

Since the quality of XML services supplied to publishers is already an issue (and one that, thanks to the relative simplicity of XML 1.0, seems as if it might eventually be solvable), the case for XSD Schemas would have to be a good one before the added cost of handling Schema-governed XML could be justified.

This may be the conclusive argument against using XSD Schemas in publishing: they don't give you much, and what they do give will be disproportionately expensive.

13. Conclusions

I have to confess I feel a little guilty about giving a presentation which is largely negative about an XML technology, so let me end by saying that it's not all bad news for XSD Schemas. For data handling they make a lot more sense, and indeed I have happily implemented database messaging systems using XML Schema as a sanity check for data contents. If you're handling data, especially in a more IT-centred environment than publishing production, then Schemas do bring benefits.

It's also worth mentioning that XSD Schemas are not the only Schemas. There are other languages (like James Clark's RELAX NG), and no doubt there will be yet more. I don't think the particular shortcomings of XSD Schemas should deter the industry from the laudable goal of developing a document and data definition language which offers more than DTDs.

However, the scope of this presentation is XSD Schemas in document publishing, and my conclusions here are that publishers should be very wary of this technology and weigh carefully the benefits they may be getting from it before committing. In my judgement, for such applications they are difficult to use, poorly supported, offer little more than DTDs, and cost too much to implement.

Acknowledgements

Thanks to Cambridge University Press for permission to use their research in preparation of this paper.

Bibliography

[Apache] The Apache XML Project, http://xml.apache.org.

[Atchley] Atchley, John, posting to xml-dev mailing list, 8 January 2002.

[Houser] Houser, Alan R., XML Schemas for Publishing Applications. http://www.idealliance.org/papers/xml2001/papers/html/04-05-05.html

[MathML] W3C's Math Home Page, http://www.w3.org/Math/.

[MSXML] Microsoft® XML Core Services (MSXML) 4.0. http://msdn.microsoft.com/downloads/default.asp?url=/downloads/sample.asp?url=/msdn-files/027/001/766/msdncompositedoc.xml.

[SGML] Standard Generalized Markup Language, ISO 8879.

[SOEXTBLX] XML Exchange Table Model Document Type Definition, http://www.oasis-open.org/specs/tm9901.html.

[Sun MSV] Sun Multi-Schema XML Validator. http://www.sun.com/software/xml/developers/multischema/

[TEI] Text Encoding Initiative, http://www.tei-c.org/

[W3C Schema] XML Schema http://www.w3.org/XML/Schema

[XLink] XML Linking Language (XLink) Version 1.0, http://www.w3.org/TR/xlink/.

[SAX] The Simple API for XML, http://www.saxproject.org/

Glossary

HTTP

Hypertext Transfer Protocol

JVM

Java Virtual Machine

PSVI

Post Schema Validation Infoset

QA

quality assurance

SAX

The Simple API for XML

Biography

Alex is Technical Director of Griffin Brown Digital Publishing Ltd. He read English Literature at the University of Bristol (UK), and gained a Ph.D. for his work on Shakespeare editions. Following a brief spell in academia he revived his teenage interest in software development and spent four interesting years working on C++ application frameworks for multimedia products. Tired of the bauble world of multimedia, he decided to focus on content and SGML. In 1997 he was one of the founding directors of Griffin Brown where he now focuses on leading technical work on XML DTDs and Schemas, Java, C++ development and more.