XML Europe 2003

Using XML Schema in a Document World

Abstract

SGML, the Standard Generalized Markup Language (ISO 8879), came on the scene in 1986 and was adopted by publishers who understood the value of working in an open, standardized environment. SGML was not an easy step for any publisher to take, but its benefits were clear enough to drive adoption by many major publishers. XML, a simplified dialect of SGML optimized for the Web, was not widely adopted by publishers already using SGML. In fact, most users of XML today are outside the publishing environment.

Now, 12 years after the adoption of SGML by many publishers, the SGML-based publishing environments are becoming archaic and need to be replaced. Publishers are now facing a transition from SGML to XML. In addition, they face a choice between using XML Schema and remaining with DTDs. This paper examines the benefits of using XML Schema in a document world.

Table of Contents

1. Background Information
2. From XML DTD to XML Schema
2.1. Why XML Schemas?
2.1.1. Enhanced Data Validation
2.1.2. Augmentation of Data
2.2. XML Schema Vs. XML DTD
2.2.1. Syntax
2.2.2. Declarations Vs. Definitions
2.2.3. The Concept of Global and Local
2.2.4. Tags Vs. Types
2.2.5. Data Types
2.2.6. Advanced Features
2.2.6.1. Defaults
2.2.6.2. Uniqueness Constraints
2.2.6.3. Key Constraints
2.2.6.4. Key References
2.2.6.5. Patterns
3. Findings from Best Practices Document
3.1. ID Attribute Data Format
3.2. Date Attribute on Article-Level Elements
3.3. ISSN Attribute Data Format
3.4. ISSN Attribute Values
4. Are There Publishing Tools for XML Schemas?
4.1. XML Authoring Software
4.2. XML Schema Validation and Processing
4.3. XML Schema Support in Composition Tools
5. Summary
Biography

1. Background Information

In 2002, I worked with a major publisher of medical and scientific journals that found itself re-evaluating its standards and tools. This publisher had been using SGML for some time and had a number of SGML DTDs, SGML tools, and SGML-based workflows firmly established. However, the need to bring content to the Web in an XML environment and to update the technology infrastructure placed pressure on the status quo. Hence a project called "Content Preparation" was launched to help the organization understand how structured markup is used today and how changes to the structured markup scenario could improve the timeliness of content preparation, lower costs, improve quality, and position the publisher to reuse and re-purpose content more easily and less expensively.

A critical component of gaining this understanding was to study the current SGML markup environment and draw some conclusions about ways that the structured markup environment could be improved to meet today's business goals. The scope of the study included SGML DTDs for both journals and books. Each DTD was compared against existing usage and best practices documentation to determine where changing the type of guiding schema (SGML DTD, XML DTD, XML Schema) could improve the content preparation environment.

In the current environment, an SGML DTD was coupled with a document called the SGML Usage Guidelines. These guidelines spelled out constraints that could not be enforced using the SGML DTD alone. First, it was important to understand where, how, and why the usage constraints found within the SGML Usage Guidelines differ from the constraints specified by the current SGML DTDs. Next, it was important to investigate whether XML Schema could be used to provide increased automatic content/data validation for elements addressed within the SGML Usage Guidelines.

2. From XML DTD to XML Schema

2.1. Why XML Schemas?

When XML was first developed, the idea was to simplify SGML and make it easy enough that everyone could use it on the Web. Interestingly, XML was adopted most rapidly by those wanting to conduct e-commerce, not by publishers. As you might guess, e-commerce requires very strict constraints on data. To get data-centric e-commerce applications to work, more robust ways to constrain data were needed: constraints even stricter than those that had been dropped from SGML. Because most e-commerce XML developers came from the programming and database worlds, they wanted a constraint mechanism that was familiar. And so the XML Schema Definition Language was born.

For this journal publisher there are several enhancements that XML with an XML Schema can provide over SGML. These are highlighted in the following sections.

2.1.1. Enhanced Data Validation

XML Schema provides mechanisms to validate elements and attributes in complex prescribed combinations as well as to validate the data within them. For the publisher featured in this case study, this means even more control over the tags and the data entered. This out-of-the-box validation is especially important to the automated loading of online systems.

2.1.2. Augmentation of Data

XML Schemas can be used to add to the data as well as to check the validity of data. Schemas contain a number of default mechanisms that enable the automated normalization of data. An example is the ability to add "NA" when a required element has no content.
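
As a minimal sketch of this default mechanism, the schema below declares an element that must be present but may be left empty. The names ARTICLE and SPONSOR are hypothetical and are not taken from the publisher's tag set; the namespace is that of the W3C XML Schema Recommendation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical element names, used only for illustration -->
  <xsd:element name="ARTICLE">
    <xsd:complexType>
      <xsd:sequence>
        <!-- SPONSOR is required. If it appears with no content,
             a schema-aware processor treats its value as "NA". -->
        <xsd:element name="SPONSOR" type="xsd:string" default="NA"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

Note that an element default only fills in an element that is present but empty; it does not supply an element that is missing altogether. The default is also reported by the schema processor rather than written back into the source file, so a downstream loader has to read the validated result to see it.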

2.2. XML Schema Vs. XML DTD

There are some very fundamental differences between an XML Schema and an XML DTD (or even an SGML DTD, for that matter). Understanding these differences will help you understand how XML Schemas can help improve validation and automate data augmentation during content preparation at the publisher.

2.2.1. Syntax

The syntax of an SGML DTD and an XML Schema are quite different. The SGML DTD is in the specific ISO 8879 syntax. In this example you see the declaration for the document and the elements <xyz> and <journal>.

<!DOCTYPE xyz PUBLIC "-//XYZ//DTD XYZ Journals DTD//EN">
<!ELEMENT xyz - -  (journal | supplement)+ >
<!ATTLIST xyz
        version         CDATA        #REQUIRED >
<!ELEMENT journal - O
                (journal.title,
                 sub.journal.title?,
                 (byline | (editor+, (address? & trailer?)))*,
                 cover?,
                 (abstracts | announcements | article | case-report
                   | contents | correction | editorials | cme |
                   filler | index | letters | research-report |
                   reviews | pdf-only)+)    >
<!ATTLIST journal
        id              ID           #IMPLIED
        ISSN            CDATA        #REQUIRED
        doctopic        CDATA        #IMPLIED
        docsubj         CDATA        #IMPLIED >

An XML Schema, by contrast, is itself written with XML tags, so it is not in the ISO 8879 declaration syntax at all. Here is the XML Schema definition for <xyz> and <journal>.

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="xyz">
    <xsd:complexType>
      <xsd:choice maxOccurs="unbounded">
        <xsd:element ref="journal"/>
        <xsd:element ref="supplement"/>
      </xsd:choice>
      <xsd:attribute name="version" use="required" type="xsd:string"/>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="journal">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="journal.title"/>
        <xsd:element ref="sub.journal.title" minOccurs="0"/>
        <xsd:choice minOccurs="0" maxOccurs="unbounded">
          <xsd:element ref="byline"/>
          <xsd:sequence>
            <xsd:element ref="editor" maxOccurs="unbounded"/>
            <xsd:choice minOccurs="0" maxOccurs="unbounded">
              <xsd:element ref="address"/>
              <xsd:element ref="trailer"/>
            </xsd:choice>
          </xsd:sequence>
        </xsd:choice>
        <xsd:element ref="cover" minOccurs="0"/>
        <xsd:element ref="rhr" minOccurs="0"/>
        <xsd:element ref="rhv" minOccurs="0"/>
        <xsd:choice maxOccurs="unbounded">
          <xsd:element ref="abstracts"/>
          <xsd:element ref="announcements"/>
          <xsd:element ref="article"/>
          <xsd:element ref="case-report"/>
          <xsd:element ref="contents"/>
          <xsd:element ref="correction"/>
          <xsd:element ref="editorials"/>
          <xsd:element ref="cme"/>
          <xsd:element ref="filler"/>
          <xsd:element ref="index"/>
          <xsd:element ref="letters"/>
          <xsd:element ref="research-report"/>
          <xsd:element ref="reviews"/>
          <xsd:element ref="pdf-only"/>
        </xsd:choice>
      </xsd:sequence>
      <xsd:attribute name="id" type="xsd:ID"/>
      <xsd:attribute name="ISSN" use="required" type="xsd:string"/>
      <xsd:attribute name="JCODE" type="xsd:string"/>
      <xsd:attribute name="DIR" type="xsd:string"/>
      <xsd:attribute name="ARTNO" type="xsd:string"/>
      <xsd:attribute name="OPER" type="xsd:string"/>
      <xsd:attribute name="doctopic" type="xsd:string"/>
      <xsd:attribute name="docsubj" type="xsd:string"/>
    </xsd:complexType>
  </xsd:element>
  <!-- declarations of the referenced elements are not shown -->
</xsd:schema>

2.2.2. Declarations Vs. Definitions

An SGML DTD is a series of declarations that specify components that occur in the document instance. Some of the declarations include:

  • <!DOCTYPE (document)

  • <!ELEMENT (element)

  • <!ATTLIST (attribute)

  • <!ENTITY (entity)

XML Schema has both declarations and definitions. Declarations specify components of the document instance, so declarations in an XML Schema specify elements and attributes (and notations); unlike a DTD, an XML Schema cannot declare entities. Definitions specify components internal to the schema. They include the specification of data types, model groups, attribute groups, and identity constraints.
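
The distinction can be sketched in a small schema in which each top-level component is labeled as a declaration or a definition. All names here (pageCountType, commonAttributes, articleType, ARTICLE, TITLE, PAGES) are illustrative assumptions, not part of the publisher's schema.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Definition: a simple type, internal to the schema -->
  <xsd:simpleType name="pageCountType">
    <xsd:restriction base="xsd:positiveInteger"/>
  </xsd:simpleType>

  <!-- Definition: an attribute group, internal to the schema -->
  <xsd:attributeGroup name="commonAttributes">
    <xsd:attribute name="id" type="xsd:ID"/>
  </xsd:attributeGroup>

  <!-- Definition: a complex type. The elements it contains
       are (local) declarations. -->
  <xsd:complexType name="articleType">
    <xsd:sequence>
      <xsd:element name="TITLE" type="xsd:string"/>
      <xsd:element name="PAGES" type="pageCountType"/>
    </xsd:sequence>
    <xsd:attributeGroup ref="commonAttributes"/>
  </xsd:complexType>

  <!-- Declaration: an element that may appear in the instance -->
  <xsd:element name="ARTICLE" type="articleType"/>
</xsd:schema>
```

Only ARTICLE, TITLE, PAGES and the id attribute ever show up in a document; the two types and the attribute group exist purely to organize the schema.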

2.2.3. The Concept of Global and Local

XML Schemas give us the ability to specify a scope for declarations. First we can declare component specifications that are global. Global declarations appear at the top level of a schema. These declarations are always named and are in force throughout the document instance. There cannot be two different global declarations of the same kind (two global elements, for example) with the same name.

The second kind of declaration that can be made within an XML Schema is the local declaration. Local declarations are scoped by the declaration or definition that contains them. So, for example, we could declare a <HEADER> in a <CHAPTER> that has a different structure from the <HEADER> declared within an <APPENDIX>.
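
The <HEADER> case could be sketched roughly as follows. The child elements NUMBER, TITLE and PARA are assumptions added only to give the two headers visibly different structures.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Global declarations: direct children of xsd:schema -->
  <xsd:element name="CHAPTER">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Local declaration: this HEADER applies only inside CHAPTER -->
        <xsd:element name="HEADER">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="NUMBER" type="xsd:positiveInteger"/>
              <xsd:element name="TITLE" type="xsd:string"/>
            </xsd:sequence>
          </xsd:complexType>
        </xsd:element>
        <xsd:element name="PARA" type="xsd:string" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="APPENDIX">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Local declaration: same name, different structure -->
        <xsd:element name="HEADER" type="xsd:string"/>
        <xsd:element name="PARA" type="xsd:string" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

A DTD cannot express this, because every element name in a DTD has exactly one content model wherever it occurs.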

2.2.4. Tags Vs. Types

XML Schemas have the concept of declaring the structure of elements and attributes and defining simple and complex types as well. This enables us to define a type once and use it in many element or attribute declarations. For example, we might be developing an XML Schema for a journal that has <AUTHOR> and <EDITOR> elements that have identical structures. XML Schema enables us to define one type such as nameType that can be used in the declarations for both author and editor.

<xsd:complexType name='nameType'>
  <xsd:sequence>
     <xsd:element name='firstName' type='xsd:string'/>
     <xsd:element name='middleName' type='xsd:string' minOccurs='0'/>
     <xsd:element name='lastName' type='xsd:string'/>
  </xsd:sequence>
</xsd:complexType>

<xsd:element name='AUTHOR' type='nameType'/>

<xsd:element name='EDITOR' type='nameType'/>

There are two kinds of types that can be defined. If the content is strictly data, with no child elements and no attributes, the definition is a simple type. Simple type definitions enable us to restrict or otherwise refine the data type of the element being declared. You would use a simple type definition to define the issue number as a positive integer with a maximum value of 25. This restriction of the positiveInteger data type cannot be done directly within an element declaration, so we must define it in a simple type definition.

<xsd:element name='ISSUE.NUMBER' type='issueNumberType'/>
<xsd:simpleType name='issueNumberType'>
  <xsd:restriction base='xsd:positiveInteger'>
     <xsd:maxInclusive value='25'/>
  </xsd:restriction>
</xsd:simpleType>

If, however, the content model being defined is made up of other elements and attributes, it must be defined as a complex type, as in the nameType example above.

2.2.5. Data Types

Data types in XML Schemas enable us to specify the content of either an attribute or element. In SGML DTDs, the only control we had over content was for attributes where we could enumerate values or specify very simple, document-based data types:

<!ELEMENT art - O  EMPTY >
<!ATTLIST art
        id              ID           #IMPLIED
        source          ENTITY       #REQUIRED
        process         (yes)        #IMPLIED
        process-by      ENTITY       #IMPLIED
        usage           (online-only | print-only | all-usage)
                                     all-usage >

XML Schemas provide us with far greater control over the content of both attributes and elements. The XML Schema specification comes with 44 built-in data type specifications such as integer, date, string, and time. In the example below we have assigned the XSD (XML Schema Definition) type string to firstName, middleName, and lastName.

<xsd:complexType name='nameType'>
  <xsd:sequence>
     <xsd:element name='firstName' type='xsd:string'/>
     <xsd:element name='middleName' type='xsd:string' minOccurs='0'/>
     <xsd:element name='lastName' type='xsd:string'/>
  </xsd:sequence>
</xsd:complexType>

We can add further restrictions to the built-in data types to narrow the general types. In this example, the issue number is defined using the built-in data type xsd:positiveInteger. It is further restricted to be an integer no greater than 25 for this journal.

<xsd:element name='ISSUE.NUMBER'>
    <xsd:simpleType>
       <xsd:restriction base='xsd:positiveInteger'>
         <xsd:maxInclusive value='25'/>
       </xsd:restriction>
    </xsd:simpleType>
</xsd:element>

Other ways of restricting data types are discussed in the following Advanced Features section.

2.2.6. Advanced Features

In addition to the simple features of XML Schemas, many advanced features have been added to provide even further constraints on the XML instance.

2.2.6.1. Defaults

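In SGML, we could provide defaults for attribute values but had no way to provide defaults for elements. XML Schema gives us the ability to provide default values for both attributes and elements. A default must itself be valid for the declared type, so in this example the type is a union that accepts either an issue number or the string NA.

<xsd:element name='ISSUE.NUMBER' default='NA'>
  <xsd:simpleType>
    <xsd:union>
      <xsd:simpleType>
        <xsd:restriction base='xsd:positiveInteger'>
          <xsd:maxInclusive value='25'/>
        </xsd:restriction>
      </xsd:simpleType>
      <xsd:simpleType>
        <xsd:restriction base='xsd:string'><xsd:enumeration value='NA'/></xsd:restriction>
      </xsd:simpleType>
    </xsd:union>
  </xsd:simpleType>
</xsd:element>

2.2.6.2. Uniqueness Constraints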

One of the most important constraint capabilities that XML Schema provides, and that SGML lacks, is the ability to enforce uniqueness within any scope. In SGML we can enforce uniqueness only through the built-in ID attribute type, whose scope is the entire document instance. That is useful in some cases, but often we need to limit uniqueness to a particular document context. With XML Schema we can define uniqueness within any scope using the xsd:unique constraint. Here we have defined the ID attribute on <ARTICLE> to be unique within the scope of <JOURNAL>. The constraint is declared on the element that forms the scope (<JOURNAL>); the selector picks out the <ARTICLE> elements and the field names the value that must be unique.

<xsd:element name='JOURNAL' type='journalType'>
  <xsd:unique name='articleIdKey'>
    <xsd:selector xpath='.//ARTICLE'/>
    <xsd:field xpath='@ID'/>
  </xsd:unique>
</xsd:element>
<xsd:complexType name='articleType'>
  <xsd:sequence>
    <xsd:element name='ARTICLE.TITLE' type='xsd:string'/>
    <xsd:element name='AUTHOR' type='authorType'/>
    <xsd:element name='ARTICLE.BODY' type='articleBodyType'/>
  </xsd:sequence>
  <xsd:attribute name='ID' type='xsd:integer' use='required'/>
  <xsd:attribute name='DATE' type='xsd:date' use='required'/>
</xsd:complexType>

Note that uniqueness can also be used to assure that two or more combined fields are unique. Simply combine them within the same uniqueness statement.
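
A combined-field constraint might look like the sketch below, where the pair of attribute values must be unique rather than either value alone. The SECTION and SEQ attributes and the constraint name are hypothetical, chosen only to illustrate the two-field case.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="JOURNAL">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="ARTICLE" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="ARTICLE.TITLE" type="xsd:string"/>
            </xsd:sequence>
            <!-- Hypothetical attributes for illustration -->
            <xsd:attribute name="SECTION" type="xsd:string" use="required"/>
            <xsd:attribute name="SEQ" type="xsd:positiveInteger" use="required"/>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
    <!-- The pair (SECTION, SEQ) must be unique among the ARTICLE children -->
    <xsd:unique name="articleSectionSeqUnique">
      <xsd:selector xpath="ARTICLE"/>
      <xsd:field xpath="@SECTION"/>
      <xsd:field xpath="@SEQ"/>
    </xsd:unique>
  </xsd:element>
</xsd:schema>
```

Under this sketch two articles may share a SECTION, or share a SEQ, but not both at once.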

2.2.6.3. Key Constraints

Key constraints are very much like uniqueness constraints except that this constraint is intended to be used as a "key" in the database sense and hence is required to be present and cannot have a null value. It must be given a name so that it can be referenced by a key reference.

<xsd:element name='JOURNAL' type='journalType'>
  <xsd:key name='articleIdKey'>
    <xsd:selector xpath='.//ARTICLE'/>
    <xsd:field xpath='@ID'/>
  </xsd:key>
</xsd:element>

2.2.6.4. Key References

Key references have nothing to do with uniqueness. Rather they have to do with assuring that certain values match each other and match the key. For example, it is a requirement that the DATE attribute on journal articles must match the <DATE> of the Journal. This can be validated by using Key and Key Reference features of an XML Schema.

In this example we are declaring the element <JOURNAL>. The content model is found in the type definition for journalType. In the element declaration for journal we have defined both a Key and a Key Reference. The key definition selects the scope of the journal and identifies the <DATE> element as the unique key element in the content model for journal. The element declaration also specifies a Key Reference. The Key Reference is selected to be the DATE attribute on <ARTICLE>. What this means is that <DATE> is a key for the journal. It must match every DATE attribute on articles throughout the journal.

<xsd:element name='JOURNAL' type='journalType'>
  <xsd:key name='journalDateKey'>
    <xsd:selector xpath='.'/>
    <xsd:field xpath='DATE'/>
  </xsd:key>
  <xsd:keyref name='journalDateKeyRef' refer='journalDateKey'>
    <xsd:selector xpath='.//ARTICLE'/>
    <xsd:field xpath='@DATE'/>
  </xsd:keyref>
</xsd:element>

2.2.6.5. Patterns

One of the advanced ways to specify a data type is to specify a pattern for the content. This enables us to control the format of data in the instance and can provide a great way to validate content that might otherwise not be easily validated.

<xsd:element name='form'>
  <xsd:complexType>
    <xsd:attribute name='id' use='required'>
      <xsd:simpleType>
        <xsd:restriction base='xsd:string'>
          <xsd:pattern value='[0-9]{5}(-[0-9]{4})?'/>
        </xsd:restriction>
      </xsd:simpleType>
    </xsd:attribute>
  </xsd:complexType>
</xsd:element>

3. Findings from Best Practices Document

There are many factors in evaluating how the use of XML Schemas would benefit content preparation at a publisher. Perhaps the best indicator of the impact of XML Schemas for this project was a document called the SGML Usage Guidelines. This highly detailed document describes the journal tag set and explains when it is appropriate to use a particular element. The usage it describes is not the same as the set of constraints dictated by the DTD. In fact, in some cases the guidelines say that although the DTD permits a certain use of elements and attributes, a more restricted use is often required.

Examples of how some of the constraints given in the SGML Usage Guidelines can be enforced by XML Schemas are shown below:

3.1. ID Attribute Data Format

Numerous usage notes in the Usage Guidelines refer to the data format of the id attribute on various elements. Here is an example of the usage guideline: If the id value for <form> is specified but does not conform to the correct naming convention, this will cause the SGML data to be rejected.

XML Schema can be used to define a data type for each attribute format. A different data type can be defined for id on different elements. Here is an example of how a user-defined data type for id might be constructed using XML Schema Definition Language:

<xsd:element name='FORM'>
  <xsd:complexType>
    <xsd:attribute name='id' use='required'>
      <xsd:simpleType>
        <xsd:restriction base='xsd:string'>
          <xsd:pattern value='[0-9]{5}(-[0-9]{4})?'/>
        </xsd:restriction>
      </xsd:simpleType>
    </xsd:attribute>
  </xsd:complexType>
</xsd:element>

In addition XML Schema can specify that the ID attribute on the same element <FORM> will be unique within the scope, in this example unique within the journal. We could also scope a particular id attribute to be unique within an <EDITORIAL> or within a set of <REFERENCES>. Here XML Schemas are much more powerful than the SGML ID/IDREF construct. Both of these features of XML Schema combine to provide a significantly improved validation for the ID attribute on any element out-of-the-box for XML Schema validating parsers.
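
Scoping the id to an <EDITORIAL> could be sketched as follows. The content model shown for EDITORIAL and the constraint name are assumptions made for the sake of a self-contained example; the real tag set is richer.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="EDITORIAL">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="FORM" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:attribute name="id" type="xsd:string" use="required"/>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
    <!-- id values must be unique among the FORM elements
         inside this one EDITORIAL -->
    <xsd:unique name="formIdWithinEditorial">
      <xsd:selector xpath=".//FORM"/>
      <xsd:field xpath="@id"/>
    </xsd:unique>
  </xsd:element>
</xsd:schema>
```

Because the constraint is attached to EDITORIAL, the same id value may appear again in a different editorial of the same journal, which an SGML ID attribute would not allow.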

3.2. Date Attribute on Article-Level Elements

Numerous usage notes in the SGML Usage Guidelines refer to the data format of the DATE attribute on various elements. Here is an example usage note:

Any DATE attributes, which do not comply with the specifications listed above, will cause the SGML to be rejected. The DATE attribute value must be identical in all article-level elements within an issue. Variation among article-level elements will cause the SGML data to be rejected. If the DATE attribute value does not match that listed on the cover, this will cause the SGML data to be rejected.

XML Schema can be used to specify the data type for the date just as it could for the format of ID or DOI.

<xsd:element name='letters' type='letterType'/>
<xsd:complexType name='letterType'>
  <xsd:sequence ...
........
  </xsd:sequence>
  <xsd:attribute name='date' type='xsd:date' use='required'/>
</xsd:complexType>

In addition, with an XML Schema we can do Key References to ensure that there is a match between sets of values within an instance. This means we can validate that all article dates match the journal date. In this example we have defined the <DATE> element of <JOURNAL> to be the key. It will match to all DATE attributes on articles within the journal.

<xsd:element name='JOURNAL' type='journalType'>
  <xsd:key name='journalDateKey'>
    <xsd:selector xpath='.'/>
    <xsd:field xpath='DATE'/>
  </xsd:key>
  <xsd:keyref name='journalDateKeyRef' refer='journalDateKey'>
    <xsd:selector xpath='.//ARTICLE'/>
    <xsd:field xpath='@DATE'/>
  </xsd:keyref>
</xsd:element>

3.3. ISSN Attribute Data Format

The ISSN number has a very precise format that is specified within the SGML Usage Guidelines:

If the ISSN attribute is not provided, does not match the value given in Appendix A, or is incorrectly specified, it would cause the SGML data to be rejected.

XML Schema provides a mechanism to enumerate both attribute and element values as well as to define custom data types with patterns. This mechanism could be used to enforce the ISSN value format.
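
One way this could look is a named simple type whose pattern matches the standard printed form of an ISSN: four digits, a hyphen, three digits, and a final check character that is a digit or X. The type name issnType is an assumption, and the journal content model is omitted to keep the sketch short.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical type name; the pattern checks format only -->
  <xsd:simpleType name="issnType">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="[0-9]{4}-[0-9]{3}[0-9X]"/>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:element name="journal">
    <xsd:complexType>
      <!-- content model omitted for brevity -->
      <xsd:attribute name="ISSN" type="issnType" use="required"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

A pattern can confirm the shape of the value but cannot verify the ISSN check digit. To confine values to the list in Appendix A, the restriction could instead carry one xsd:enumeration facet per permitted ISSN.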

3.4. ISSN Attribute Values

The SGML Usage Guidelines requires that the ISSN attribute in the SGML source for IPS match the ISSN on the print issue:

If the ISSN attribute value does not contain the same content (although formatting may differ) as the ISSN on the print issue, this will cause the SGML data to be rejected.

The use of XML Schema cannot assist with this validation, because the ISSN on the print issue is not part of the XML instance being validated.

4. Are There Publishing Tools for XML Schemas?

One big question asked during the 2002 project, when considering a move from SGML DTDs to DTD-based XML or to Schema-based XML, was the availability of tools for publishers. Even if XML Schemas were to provide very powerful advantages in theory, are there tools available to make the theory a reality?

For publishers there are several classes of XML-based tools that must be considered:

  • Authoring/Editorial Software

  • XML Schema Processors to do validation and augmentation

  • Composition engines and publishing tools

It is important to realize that the XML Schema specification was not developed by those with an interest in publishing. While we can see the power XML Schemas would give a publisher, tools that use schemas are more often found in the server and database world than in the publishing world.

4.1. XML Authoring Software

The two premier authoring/editing tools for SGML are Arbortext Epic and SoftQuad's XMetaL. While Arbortext made no mention of XML Schema support in 2002, today they tout XML Schema support. At XML 2001, marketing people from Arbortext indicated that a transform from XML Schema to XML DTD could be used if XML Schema support was required. While this works to a degree, it does not bring the power of XML Schema to editing sessions.

XMetaL 3.0 advertised XML Schema support for the features that are common to both XML DTDs and XML Schemas, plus some important features that are found only in XML Schema. Today XMetaL 4.0 highlights the features of XML Schema that are supported in the product.

Perhaps the most significant advance in XML Schema-based authoring will come this spring with the release of Microsoft Office 2003 (formerly named Office 11 by Jean Paoli during his keynote at XML 2002). This Office suite brings the power of XML, and XML Schema, to the masses. According to Microsoft: "Author, access, and analyze information that resides in disparate systems. Office 2003 supports the industry-standard XML technology that facilitates enterprise transactions and business-to-business data exchange."

4.2. XML Schema Validation and Processing

Here we are in luck. A tremendous amount of work has been done in support of overall XML Schema validation and processing. There are both open-source engines and commercial engines available. So if you decide to use XML Schemas for validation, good schema processing engines are available.

4.3. XML Schema Support in Composition Tools

A quick survey of composition vendors and their support for a host of XML-based standards including XML Schema revealed that the majority of these vendors do not have support for XML Schema. The trend today is for composition engines to assume they are receiving valid XML as an input data stream. In this case an understanding of either a DTD or an XML Schema is not necessary.

5. Summary

In 2002, the study in question concluded that "XML Schemas could provide a great deal more validation control in support of many of the best practices, particularly those that center around content format (such as the format of the ID or ISSN or DOI). Schemas can also help validate both uniqueness (such as the DOIs) and key-key reference relationships (such as a journal date against article dates)." But the adoption of XML Schema in a document environment was not recommended at that time because "XML Schema tools for publishers are, for the most part, nonexistent."

So in mid-2002, the project team decided to use XML DTDs, rather than schemas, for the time being. This decision was the result of analyses of the environment at that time. Moving from SGML DTDs, which the organization was then using, to XML DTDs was seen as a natural progression that would facilitate the movement toward even more delivery channels. Using a schema was believed to add a layer of complexity and constraint around data that could not easily be supported by authoring tools.

But how quickly technology changes. Today, just one year later, Microsoft has entered the market with XML for the mainstream. Because Microsoft technology is exclusively XML Schema-based, those who wish to compete in this marketplace will also have to make the shift to enable the power of XML Schema in the document world.

Biography