Abstract
SGML, the Standard Generalized Markup Language (ISO 8879), came on the scene in 1986 and was adopted by publishers who understood the value of working in an open, standardized environment. SGML was not an easy step for any publisher to take, but its benefits were clear enough to drive adoption by many major publishers. XML, a simplified dialect of SGML optimized for the Web, was not widely adopted by publishers already using SGML. In fact, most users of XML today are outside the publishing environment.
Now, 12 years after the adoption of SGML by many publishers, the SGML-based publishing environments are becoming archaic and need to be replaced. Publishers who remain happy with the ISO SGML standard face questions from management and IT departments about a transition from SGML to XML. This paper presents the methodology used by a major publishing house to make publishing technology decisions for their next-generation publishing systems. SGML or XML? Schemas or DTDs? Learn how you can take an organized approach to making these decisions.
A major publisher of medical and scientific journals found itself in the position of re-evaluating its standards and tools. This publisher had been using SGML for some time and had a number of SGML DTDs, SGML tools, and SGML-based workflows firmly established. However, the need to bring content to the Web in an XML environment and to update the technology infrastructure has placed pressure on the status quo. Hence a project called "Content Preparation" was launched to enable the organization to develop an understanding of how structured markup is used today and how changes to the structured markup scenario could improve the timeliness of content preparation, lower costs, improve quality, and position the publisher to reuse and re-purpose content more easily and less expensively.
A critical component of gaining this understanding is to study the current SGML markup environment and draw some conclusions about ways that the structured markup environment could be improved to meet today's business goals.
The scope of the study included SGML DTDs for both journals and books. Each DTD was compared against existing usage and best practices documentation to determine where changing the type of guiding schema (SGML DTD, XML DTD, XML Schema) could improve the content preparation environment.
The strategic objectives for the fact finding were:
Understanding where, how, and why the usage constraints found within the SGML Usage Guidelines differ from the constraints specified by the current SGML DTDs
Understanding the functional differences between content prepared under the constraints possible with an SGML DTD versus an XML DTD
Understanding where XML Schema could be used to provide increased automatic content/data validation for elements
Understanding where XML Schema could be used to provide increased automatic content/data validation for attributes
The project followed this methodology:
Gather all resources.
Ask questions to clarify the content and intent of the documents collected.
Transform the current journals SGML DTD into an XML DTD noting all issues. What was lost in going from SGML to XML? What was gained?
Review the element usage report to understand which elements in the current DTD are used in practice and which are largely ignored. This will help to determine how useful the constraints imposed by an XML Schema would be in practice.
Use the TurboXML toolset to help with the evaluation of the DTD against the SGML Usage Guidelines document.
Use reports and logs from TurboXML to identify and document: features of the SGML journal DTD that were eliminated in the transformation from an SGML DTD to an XML DTD; places where the DTD is inadequate to specify publishing and online delivery requirements; places where an XML Schema could specify these requirements better; current attributes, where they are used, and what kind of SGML-based data typing, defaults, and fixed values they use; places where duplicate element models could be generalized in an XML Schema definition; and requirements that cannot be addressed directly by either a DTD or an XML Schema.
Review any miscellaneous factors and report these.
Provide a summary of the findings.
The first standardized content preparation specification was the Standard Generalized Markup Language (SGML), ISO 8879. This ISO standard came on the scene in 1986 and was adopted by publishers who understood the value of working in an open, standardized environment. SGML was not an easy step for any publisher to take. First, employing SGML required a rigorous design and data-modeling phase. Second, the tools to author and publish in SGML were few and expensive. But the value of encoding information assets in a format that was interchangeable and not restricted to proprietary systems was clear enough to drive adoption by many major publishers.
So along comes XML. Two communities influenced XML: first the SGML community, and second the Internet (HTML) community. By 1996, when the XML Specification was launched, it had become clear that HTML (the Web page language) was great for presentation but not robust enough to handle structured data applications on the Web. Equally clear was that SGML was too difficult to use and too expensive, and that mainstream tools had never emerged. Hence, XML began as a simplified dialect of SGML optimized for the Web.
There are several strategies to move from SGML to XML. First, one could read the rules that spell out the differences between SGML and XML and create an XML DTD manually. But today there are great tools to transform an SGML DTD into an XML DTD at the push of a button. For this project, a tool was used to transform SGML into XML and help evaluate the effects of the transformation. The tool, TurboXML from Tibco, was used to import the SGML DTD and export it as an XML DTD as well as an XML Schema.
Note: Readers should understand that importing an SGML DTD to create an XML DTD is not really just the push of a button. A number of decisions must be made that change how data will be constrained when moving from the SGML environment to the XML environment.
When XML was developed, great care was taken to eliminate the "hard bits" of SGML. Hence something is lost when moving to XML from SGML in the DTD environment. In the case of the journal DTD, the import reported 474 errors. These errors included:
System id must be enclosed in quoted, found ">" instead
Content model contains unrecognized character ")"
Tag requirements ignored
Attributes must be enclosed in quotes
The & connector was changed to a |. As a result the DTD will be less restrictive
Attribute value #CONREF was changed to #IMPLIED
Entity reference did not end with an ";"
Attribute type NUMBER was changed to CDATA
Content model in/ex(clusion) was ignored
Attribute value NAME converted to NMTOKEN
Attribute value NUTOKEN converted to NMTOKEN
In order to create an XML DTD from an SGML DTD, 474 differences (errors to XML) had to be accounted for. Although the tool made some corrections automatically, most were made manually before import.
Many of the corrections were changes in syntax from the SGML to the XML version of the DTD and had no impact on content preparation. But as a result of the change from an SGML to an XML environment, the following functionality was "lost" in the content:
Tags cannot be minimized in XML. This publisher currently takes advantage of tag minimization for elements like <ABSTRACTS> and <ANNOUNCEMENTS>. In this case the publisher allows the end tag to be omitted. XML will not tolerate end tag minimization. In theory this adds somewhat to the overhead of adding tags to the content.
Tags are no longer case-insensitive. SGML allows tags to be in any case, so the tag <ABSTRACT> is equivalent to <abstract> or <Abstract>. In XML only the case specified in the XML DTD is recognized for any element: <ABSTRACT> is a different element than <abstract>. For this publisher this turns out not to be a bad feature, since all tags are to be entered in uppercase. Nonetheless, it is functionality that is lost when transitioning from SGML to XML.
Attribute values lost some of their ability to constrain content, making validation of the data a bit less effective. For example, the attribute value type NAME is not allowed in an XML DTD, so NAME can no longer be validated in the data as it once was. Another example is the elimination of the attribute value type NUMBER. In the transition from an SGML DTD to an XML DTD, NAME and NUTOKEN are replaced by the more general NMTOKEN (Name Token), and NUMBER is replaced by CDATA. So moving from SGML to XML has the effect of eliminating some of the data typing that SGML provided for in its attribute values.
In SGML the & connector lets us list sub-elements and indicate that they can occur in any order. The & connector was not brought forward into the XML DTD. Since the publisher uses the & connector, models created with this are simplified in the XML DTD. The effect of this simplification is that the XML content model is more flexible (or less restrictive) than it was in the original SGML DTD.
Perhaps the biggest difference between the content models specified in the SGML DTD and those specified in the XML DTD is the elimination of inclusions and exclusions. Inclusions are elements that are allowed to be included at any point in a content model, even inside parsable character data (#PCDATA) itself. Inclusions provide an easy mechanism in SGML to extend content models. Exclusions are just the opposite. They enable us to restrict a content model by indicating what is excluded from the model. Since neither of these constructs is allowed in XML DTDs, the flexibility (or further restriction) that inclusions and exclusions give is no longer available.
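As a rough illustration of the first two items in the list above (mandatory end tags and case-sensitive names), the following Python sketch feeds small instances to the standard-library XML parser. The element names follow the examples in this paper; the sample instances and the helper name `parses` are made up for this sketch and were not part of the study.

```python
import xml.etree.ElementTree as ET


def parses(text: str) -> bool:
    """Return True if text is well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# End tags are mandatory in XML: omitting </ABSTRACTS> is a
# well-formedness error, whereas the SGML DTD allowed the omission.
with_end_tag = "<JOURNAL><ABSTRACTS>text</ABSTRACTS></JOURNAL>"
without_end_tag = "<JOURNAL><ABSTRACTS>text</JOURNAL>"

# Names are case-sensitive: <ABSTRACT> and <abstract> are two different
# element types, and a start tag cannot be closed in a different case.
mixed = ET.fromstring("<JOURNAL><ABSTRACT/><abstract/></JOURNAL>")
tags = [child.tag for child in mixed]
mismatched_case = "<ABSTRACT>text</abstract>"

print(parses(with_end_tag), parses(without_end_tag))
print(tags, parses(mismatched_case))
```

Note that these are well-formedness checks only; the loss of NAME and NUMBER typing, the & connector, and inclusions and exclusions shows up only when validating against the DTD.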
The next question is why we would switch from SGML to XML. This publisher is working in an environment where they have written an SGML Usage Guidelines manual that is several hundred pages long, because SGML is not restrictive enough to constrain their data properly to ensure automated loading into online systems. Why would we make a move that would restrict the data even less?
The answer is that if you are just transitioning from SGML to XML using the DTD as the vehicle to constrain content, there is no gain in the move to XML. In fact, the opposite is true. Moving from SGML to XML using DTDs means that you have lost functionality and lost the ability to constrain your data. It makes no sense to transition to XML following this strategy.
Note: The alternate strategy for a transition from SGML to XML, using an XML Schema in place of an XML DTD, presents an entirely different set of considerations. Following a strategy that employs an XML Schema (and uses the schema functionality to its fullest) will provide a way to constrain data far more effectively than an SGML DTD ever could.
When XML was first developed, the idea was to simplify SGML. Make it easy so everyone could use it on the Web. Interestingly enough, XML was most-rapidly adopted by those wanting to use XML to conduct e-commerce, not by publishers. As you might guess, e-commerce requires very strict constraints on data. In order to get the data-centric e-commerce applications to work, it turns out that we really needed more robust ways to constrain data. We needed constraints that were even more strict than those eliminated from SGML. Because most e-commerce XML developers came from the programming and database worlds, they wanted a constraint mechanism that was familiar. And so the XML Schema Definition Language was born.
For this journal publisher there are several enhancements that XML with an XML Schema can provide over SGML. These are highlighted in the following sections.
XML Schema provides mechanisms to validate elements and attributes in complex prescribed combinations as well as to validate the data within them. For the publisher featured in this case study, this means even more control over the tags and the data entered. This out-of-the-box validation is especially important to the automated loading of online systems.
There are some very fundamental differences between an XML Schema and an XML DTD (or even an SGML DTD for that matter). Understanding these differences will help you understand how XML Schemas can help improve validation and automate data augmentation during content preparation at the publisher.
The syntax of an SGML DTD and that of an XML Schema are quite different. The SGML DTD is in the specific ISO 8879 syntax. In this example you see the document type declaration and the declarations for the elements <xyz> and <journal>.
<!DOCTYPE xyz PUBLIC "-//XYZ//DTD XYZ Journals DTD//EN">
<!ELEMENT xyz - - (journal | supplement)+ >
<!ATTLIST xyz
version CDATA #REQUIRED >
<!ELEMENT journal - O
(journal.title,
sub.journal.title?,
(byline | (editor+, (address? & trailer?)))*,
cover?, (abstracts | announcements | article | case-report
| contents | correction | editorials | cme |
filler | index | letters | research-report |
reviews | pdf-only)+) >
<!ATTLIST journal
id ID #IMPLIED
ISSN CDATA #REQUIRED
doctopic CDATA #IMPLIED
docsubj CDATA #IMPLIED >
The syntax of the XML Schema uses XML tags, so it is not in the ISO 8879 declaration syntax at all. Here is the XML Schema definition for <xyz> and <journal>.
<?xml version = "1.0" encoding = "UTF-8"?>
<xsd:schema xmlns:xsd = "http://www.w3.org/2000/10/XMLSchema">
  <xsd:element name="xyz">
    <xsd:complexType>
      <xsd:choice maxOccurs="unbounded">
        <xsd:element ref="journal"/>
        <xsd:element ref="supplement"/>
      </xsd:choice>
      <xsd:attribute name="version" use="required" type="xsd:string"/>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="journal">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="journal.title"/>
        <xsd:element ref="sub.journal.title" minOccurs="0"/>
        <xsd:choice minOccurs="0" maxOccurs="unbounded">
          <xsd:element ref="byline"/>
          <xsd:sequence>
            <xsd:element ref="editor" maxOccurs="unbounded"/>
            <xsd:choice minOccurs="0" maxOccurs="unbounded">
              <xsd:element ref="address"/>
              <xsd:element ref="trailer"/>
            </xsd:choice>
          </xsd:sequence>
        </xsd:choice>
        <xsd:element ref="cover" minOccurs="0"/>
        <xsd:element ref="rhr" minOccurs="0"/>
        <xsd:element ref="rhv" minOccurs="0"/>
        <xsd:choice maxOccurs="unbounded">
          <xsd:element ref="abstracts"/>
          <xsd:element ref="announcements"/>
          <xsd:element ref="article"/>
          <xsd:element ref="case-report"/>
          <xsd:element ref="contents"/>
          <xsd:element ref="correction"/>
          <xsd:element ref="editorials"/>
          <xsd:element ref="cme"/>
          <xsd:element ref="filler"/>
          <xsd:element ref="index"/>
          <xsd:element ref="letters"/>
          <xsd:element ref="research-report"/>
          <xsd:element ref="reviews"/>
          <xsd:element ref="pdf-only"/>
        </xsd:choice>
      </xsd:sequence>
      <xsd:attribute name="id" type="xsd:ID"/>
      <xsd:attribute name="ISSN" use="required" type="xsd:string"/>
      <xsd:attribute name="JCODE" type="xsd:string"/>
      <xsd:attribute name="DIR" type="xsd:string"/>
      <xsd:attribute name="ARTNO" type="xsd:string"/>
      <xsd:attribute name="OPER" type="xsd:string"/>
      <xsd:attribute name="doctopic" type="xsd:string"/>
      <xsd:attribute name="docsubj" type="xsd:string"/>
    </xsd:complexType>
  </xsd:element>
An SGML DTD is a series of declarations that specify components that occur in the document instance. Some of the declarations include:
<!DOCTYPE (document)
<!ELEMENT (element)
<!ATTLIST (attribute)
<!ENTITY (entity)
XML Schema has both declarations and definitions. Declarations are used to specify components of the document instance, so declarations in an XML Schema specify elements and attributes (XML Schema can also declare notations, but it has no equivalent of the DTD entity declaration). But XML Schemas are also made up of definitions. Definitions specify components internal to the schema. Definitions include the specification of data types, model groups, attribute groups, and identity constraints.
XML Schemas give us the ability to specify a scope for declarations. First we can declare component specifications that are global. Global declarations appear at the top level of a schema. These declarations are always named and are in force throughout the document instance. There cannot be two different declarations with the same name.
The second kind of declaration that can be made within an XML Schema is the local declaration. Local declarations are scoped by the declaration or definition that contains them. So, for example, we could declare a <HEADER> in a <CHAPTER> that has a different structure from the <HEADER> declared within an <APPENDIX>.
XML Schemas have the concept of declaring the structure of elements and attributes and defining simple and complex types as well. This enables us to define a type once and use it in many element or attribute declarations. For example, we might be developing an XML Schema for a journal that has <AUTHOR> and <EDITOR> elements that have identical structures. XML Schema enables us to define one type such as nameType that can be used in the declarations for both author and editor.
<xsd:complexType name="nameType">
  <xsd:sequence>
    <xsd:element name="firstName" type="xsd:string"/>
    <xsd:element name="middleName" type="xsd:string" minOccurs="0"/>
    <xsd:element name="lastName" type="xsd:string"/>
  </xsd:sequence>
</xsd:complexType>
<xsd:element name="AUTHOR" type="nameType"/>
<xsd:element name="EDITOR" type="nameType"/>
There are two kinds of types that can be defined. If the content model is strictly data and there are no attributes, the definition is a simple type. Simple type definitions enable us to either restrict or enhance the data type of the element being declared. You would use a simple type definition to define ISSN (used in this example to mean the issue number) as a positive integer with a maximum value of 25. This restriction of the positiveInteger data type cannot be expressed by naming a built-in type in the element declaration alone, so we must define it in a simple type definition.
<xsd:element name="ISSN" type="ISSNType"/>
<xsd:simpleType name="ISSNType">
  <xsd:restriction base="xsd:positiveInteger">
    <xsd:maxInclusive value="25"/>
  </xsd:restriction>
</xsd:simpleType>
If, however, the content model being defined is made up of other elements and attributes, it must be defined as a complex type as you see in the previous example.
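To make the effect of the ISSNType restriction concrete, here is a minimal Python sketch of the check a schema processor would apply to the element content. It is an approximation for illustration only: the function name `is_valid_issue_number` is hypothetical, and whitespace handling is simplified compared with a real XML Schema processor.

```python
import re

MAX_ISSUE_NUMBER = 25  # mirrors <xsd:maxInclusive value="25"/>


def is_valid_issue_number(text: str) -> bool:
    """Approximate ISSNType: an xsd:positiveInteger no greater than 25."""
    # Simplification: strip surrounding whitespace before checking.
    value = text.strip()
    # Lexical form of xsd:positiveInteger: optional "+" followed by digits.
    if not re.fullmatch(r"\+?[0-9]+", value):
        return False
    # Value-space checks: positive, and within the maxInclusive facet.
    return 1 <= int(value) <= MAX_ISSUE_NUMBER


for candidate in ["1", "25", "26", "0", "-3", "12a"]:
    print(candidate, is_valid_issue_number(candidate))
```

With an SGML or XML DTD, none of these values could be told apart by the parser; the check would have to live in the usage guidelines or in custom code such as this.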
Data types in XML Schemas enable us to specify the content of either an attribute or element. In SGML DTDs, the only control we had over content was for attributes where we could enumerate values or specify very simple, document-based data types:
<!ELEMENT art - O EMPTY >
<!ATTLIST art
id ID #IMPLIED
source ENTITY #REQUIRED
process (yes) #IMPLIED
process-by ENTITY #IMPLIED
usage (online-only | print-only | all-usage)
all-usage >
XML Schemas provide us with far greater control over the content of both attributes and elements. The XML Schema specification comes with 44 built-in data type specifications such as integer, date, string, and time. In the example below we have assigned the XSD (XML Schema Definition) type string to firstName, middleName, and lastName.
<xsd:complexType name="nameType">
  <xsd:sequence>
    <xsd:element name="firstName" type="xsd:string"/>
    <xsd:element name="middleName" type="xsd:string" minOccurs="0"/>
    <xsd:element name="lastName" type="xsd:string"/>
  </xsd:sequence>
</xsd:complexType>
We can add further restrictions to the built-in data types to narrow the general types. In this example, the issue number is defined using the built-in data type xsd:positiveInteger. It is further restricted to be an integer no greater than 25 for this journal.
<xsd:element name="ISSN">
  <xsd:simpleType>
    <xsd:restriction base="xsd:positiveInteger">
      <xsd:maxInclusive value="25"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:element>
Other ways of restricting data types are discussed in the following Advanced Features section.
In addition to the simple features of XML Schemas, many advanced features have been added to provide even further constraints on the XML instance.
In SGML, we have the ability to provide defaults for attribute values but no ability to provide defaults for elements. XML Schema gives us the ability to provide default values for both attributes and elements; an element default applies when the element appears in the instance with empty content.
<!-- The default must itself be valid against the element's type. -->
<xsd:element name="ISSUE.NUMBER" default="1">
  <xsd:simpleType>
    <xsd:restriction base="xsd:positiveInteger">
      <xsd:maxInclusive value="25"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:element>
One of the most important constraint capabilities that XML Schema provides and SGML lacks is the ability to enforce uniqueness within any scope. In SGML we can enforce uniqueness using the built-in ID attribute type. ID-based uniqueness has the scope of the document instance in SGML. That is useful in some cases, but many times we need to limit uniqueness based on the document context. With XML Schema we can define uniqueness within any scope using the special uniqueness constraint. Here we have defined the ID attribute on <ARTICLE> to be unique within the scope of <JOURNAL>.
<!-- The constraint is declared on JOURNAL, the element that scopes it.
     The selector picks the ARTICLE elements; the field names the value
     that must be unique among them. -->
<xsd:element name="JOURNAL" type="journalType">
  <xsd:unique name="articleIdKey">
    <xsd:selector xpath=".//ARTICLE"/>
    <xsd:field xpath="@ID"/>
  </xsd:unique>
</xsd:element>
<xsd:complexType name="articleType">
  <xsd:sequence>
    <xsd:element name="ARTICLE.TITLE" type="xsd:string"/>
    <xsd:element name="AUTHOR" type="authorType"/>
    <xsd:element name="ARTICLE.BODY" type="articleBodyType"/>
  </xsd:sequence>
  <xsd:attribute name="ID" type="xsd:integer" use="required"/>
  <xsd:attribute name="DATE" type="xsd:date" use="required"/>
</xsd:complexType>
Note that uniqueness can also be used to assure that two or more combined fields are unique. Simply combine them within the same uniqueness statement.
Key constraints are very much like uniqueness constraints except that this constraint is intended to be used as a "key" in the database sense and hence is required to be present and cannot have a null value. It must be given a name so that it can be referenced by a key reference.
<!-- As with uniqueness, the key is declared on the scoping element. -->
<xsd:element name="JOURNAL" type="journalType">
  <xsd:key name="articleIdKey">
    <xsd:selector xpath=".//ARTICLE"/>
    <xsd:field xpath="@ID"/>
  </xsd:key>
</xsd:element>
Key references have nothing to do with uniqueness. Rather, they assure that certain values match each other and match the key. For example, it is a requirement that the DATE attribute on journal articles must match the <DATE> of the journal. This can be validated using the key and key reference features of an XML Schema.
In this example we are declaring the element <JOURNAL>. The content model is found in the type definition for journalType. In the element declaration for journal we have defined both a Key and a Key Reference. The key definition selects the scope of the journal and identifies the <DATE> element as the unique key element in the content model for journal. The element declaration also specifies a Key Reference. The Key Reference is selected to be the DATE attribute on <ARTICLE>. What this means is that <DATE> is a key for the journal. It must match every DATE attribute on articles throughout the journal.
<xsd:element name="JOURNAL" type="journalType">
  <!-- Key: the DATE child of the JOURNAL element itself. -->
  <xsd:key name="journalDateKey">
    <xsd:selector xpath="."/>
    <xsd:field xpath="DATE"/>
  </xsd:key>
  <!-- Key reference: every ARTICLE DATE attribute must match the key. -->
  <xsd:keyref name="journalDateKeyRef" refer="journalDateKey">
    <xsd:selector xpath=".//ARTICLE"/>
    <xsd:field xpath="@DATE"/>
  </xsd:keyref>
</xsd:element>
One of the advanced ways to specify a data type is to specify a pattern for the content. This enables us to control the format of data in the instance and can provide a great way to validate content that might otherwise not be easily validated.
<xsd:element name="form">
  <xsd:complexType>
    <xsd:attribute name="id" use="required">
      <xsd:simpleType>
        <xsd:restriction base="xsd:string">
          <xsd:pattern value="[0-9]{5}(-[0-9]{4})?"/>
        </xsd:restriction>
      </xsd:simpleType>
    </xsd:attribute>
  </xsd:complexType>
</xsd:element>
There are many factors in evaluating how the use of XML Schemas would benefit content preparation at a publisher. Perhaps the best indicator of the impact of XML Schemas for this project was a document called SGML Usage Guidelines. This highly detailed document provides documentation for the journal tag set and describes when it is appropriate to use a particular element. The usage it describes is not the same as the set of constraints dictated by the DTD. In fact, in some cases the usage guidelines say that although the DTD permits a certain use of elements and attributes, a more restricted use is required.
Examples of how some of the constraints stated in the SGML Usage Guidelines can be enforced by XML Schemas are given below:
Numerous usage notes in the Usage Guidelines refer to the data format of the id attribute on various elements. Here is an example of the usage guideline: If the id value for <form> is specified but does not conform to the correct naming convention, this will cause the SGML data to be rejected.
XML Schema can be used to define a data type for each attribute format. A different data type can be defined for id on different elements. Here is an example of how a user-defined data type for id might be constructed using XML Schema Definition Language:
<xsd:element name="FORM">
  <xsd:complexType>
    <xsd:attribute name="id" use="required">
      <xsd:simpleType>
        <xsd:restriction base="xsd:string">
          <xsd:pattern value="[0-9]{5}(-[0-9]{4})?"/>
        </xsd:restriction>
      </xsd:simpleType>
    </xsd:attribute>
  </xsd:complexType>
</xsd:element>
In addition, XML Schema can specify that the id attribute on the same element <FORM> will be unique within a scope, in this example within the journal. We could also scope a particular id attribute to be unique within an <EDITORIAL> or within a set of <REFERENCES>. Here XML Schemas are much more powerful than the SGML ID/IDREF construct. Together, these two features of XML Schema provide significantly improved out-of-the-box validation of the id attribute on any element for XML Schema validating parsers.
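The two checks just described, an id format pattern and uniqueness scoped to the journal, can be approximated in a few lines of Python to show what a schema-validating parser would report. This is only a sketch under assumptions: the function name `check_form_ids`, the sample instance, and the nesting of <FORM> inside <ARTICLE> are invented for illustration. Note that XML Schema patterns match the whole value, which corresponds to `fullmatch` here.

```python
import re
import xml.etree.ElementTree as ET

# Same expression as the xsd:pattern facet in the schema fragment above.
ID_PATTERN = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def check_form_ids(journal_xml: str) -> list[str]:
    """Return format and uniqueness problems for FORM id attributes."""
    problems = []
    seen = set()
    root = ET.fromstring(journal_xml)
    # The scope of uniqueness is the whole JOURNAL element (the root here).
    for form in root.iter("FORM"):
        form_id = form.get("id")
        if form_id is None:
            problems.append("missing id")
            continue
        if not ID_PATTERN.fullmatch(form_id):
            problems.append(f"bad format: {form_id}")
        if form_id in seen:
            problems.append(f"duplicate: {form_id}")
        seen.add(form_id)
    return problems


# Hypothetical instance: one duplicate id and one malformed id.
sample = """<JOURNAL>
  <ARTICLE><FORM id="12345"/></ARTICLE>
  <ARTICLE><FORM id="12345-6789"/><FORM id="12345"/></ARTICLE>
  <ARTICLE><FORM id="1234"/></ARTICLE>
</JOURNAL>"""

print(check_form_ids(sample))
```

The point of moving these rules into the schema is that no such custom code needs to be written or maintained; the sketch only shows the behavior being delegated to the validator.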
Numerous usage notes in the SGML Usage Guidelines refer to the data format of the DATE attribute on various elements. Here is an example usage note:
Any DATE attributes, which do not comply with the specifications listed above, will cause the SGML to be rejected. The DATE attribute value must be identical in all article-level elements within an issue. Variation among article-level elements will cause the SGML data to be rejected. If the DATE attribute value does not match that listed on the cover, this will cause the SGML data to be rejected.
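The matching rule in this usage note can be sketched procedurally to show what a key and key reference would enforce. The sketch below is illustrative only: it assumes, as in the schema examples in this paper, that the journal-level date is carried in a <DATE> child of <JOURNAL>; the function name `mismatched_article_dates` and the sample instance are hypothetical. It compares the values as strings, whereas a schema processor would compare them as xsd:date values.

```python
import xml.etree.ElementTree as ET


def mismatched_article_dates(journal_xml: str) -> list[str]:
    """Return ARTICLE DATE values that differ from the journal <DATE>."""
    root = ET.fromstring(journal_xml)
    # The "key": the DATE child element of the JOURNAL root.
    journal_date = root.findtext("DATE")
    # The "key references": the DATE attribute of every ARTICLE.
    return [
        article.get("DATE", "")
        for article in root.iter("ARTICLE")
        if article.get("DATE", "") != journal_date
    ]


# Hypothetical instance with one article dated differently from the issue.
sample_issue = """<JOURNAL>
  <DATE>2002-05-01</DATE>
  <ARTICLE DATE="2002-05-01"/>
  <ARTICLE DATE="2002-06-01"/>
</JOURNAL>"""

print(mismatched_article_dates(sample_issue))
```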
XML Schema can be used to specify the data type for the date just as it could for the format of ID or DOI.
<xsd:element name="letters" type="letterType"/>
<xsd:complexType name="letterType">
  <xsd:sequence ...
  ........
  </xsd:sequence>
  <xsd:attribute name="date" type="xsd:date" use="required"/>
</xsd:complexType>
In addition, with an XML Schema we can use key references to ensure that there is a match between sets of values within an instance. This means we can validate that all article dates match the journal date. In this example we have defined the <DATE> element of <JOURNAL> to be the key. It will be matched against all DATE attributes on articles within the journal.
<xsd:element name="JOURNAL" type="journalType">
  <!-- Key: the DATE child of the JOURNAL element itself. -->
  <xsd:key name="journalDateKey">
    <xsd:selector xpath="."/>
    <xsd:field xpath="DATE"/>
  </xsd:key>
  <!-- Key reference: every ARTICLE DATE attribute must match the key. -->
  <xsd:keyref name="journalDateKeyRef" refer="journalDateKey">
    <xsd:selector xpath=".//ARTICLE"/>
    <xsd:field xpath="@DATE"/>
  </xsd:keyref>
</xsd:element>
The ISSN has a very precise format that is specified within the SGML Usage Guidelines:
If the ISSN attribute is not provided, does not match the value given in Appendix A, or is incorrectly specified, it would cause the SGML data to be rejected.
XML Schema provides a mechanism to enumerate both attribute and element values as well as to define custom data types with patterns. This mechanism could be used to enforce the ISSN value format.
The SGML Usage Guidelines requires that the ISSN attribute in the SGML source for IPS match the ISSN on the print issue:
If the ISSN attribute value does not contain the same content (although formatting may differ) as the ISSN on the print issue, this will cause the SGML data to be rejected.
The use of XML Schema cannot assist with this validation.
One big question to ask when considering a move from SGML DTDs to DTD-based XML or to Schema-based XML is the availability of tools for publishers. Even if XML Schemas were to provide very powerful advantages in theory, are there tools available to make the theory a reality?
For publishers there are several classes of XML-based tools that must be considered:
Authoring/Editorial Software
XML Schema Processors to do validation and augmentation
Composition engines and publishing tools
It is important to realize that those with an interest in publishing did not develop XML Schemas. While we can see the power XML Schemas would give a publisher, tools that use schema are more often found in the server and database world than they are in the publishing world.
The two premier authoring/editing tools for SGML are Arbortext Epic and SoftQuad's XMetaL. Arbortext makes no mention of XML Schema support. At XML 2001, marketing people from Arbortext indicated that a transform from XML Schema to XML DTD could be used if XML Schema support was required. While this will work to a degree, it will not bring the power of XML Schema to editing sessions.
XMetaL 3.0 advertises XML Schema support for the features that are common to both XML DTDs and XML Schemas plus some important features that are only found in XML Schema. Unfortunately, they do not advertise what these features are, so it is impossible to tell whether the schema support that has been added will help publisher content preparation.
Here we are in luck. A tremendous amount of work has been done in support of overall XML Schema validation and processing. There are both open-source engines and commercial engines available. So if you decide to use XML Schemas for validation, good schema processing engines are available.
While at the XML 2001 Conference, a quick survey of composition vendors and their support for a host of XML-based standards including XML Schema revealed that the majority of these vendors do not have support for XML Schema. The trend today is for composition engines to assume they are receiving valid XML as an input data stream. In this case an understanding of either a DTD or an XML Schema is not necessary.
A growing school of thought among publishers is that SGML DTDs and XML Schemas can be used successfully together. Publishers have invested heavily in SGML design for their systems. In the authoring arena, tools for authoring using SGML or XML DTDs as the base are available and well tested. However tools for a more-strict XML Schema based authoring environment are not yet on the market. It is questionable if such tools will emerge due to the niche appeal of these tools. So perhaps the answer is to use an XML or SGML DTD for authoring along with a best practices document to guide the author or data conversion house with the rules that DTDs cannot capture. Finally an XML Schema with declarations for data typing, uniqueness, and key references could be used for validation and data augmentation.
Based on this study we can conclude the following:
Moving from an SGML DTD to an XML DTD compromises some functionality that the SGML DTD provides.
Moving from an SGML DTD to an XML DTD provides only a few advantages in terms of validating data according to the best practices. These include the assurance of UPPERCASE tagging and the assurance that both start and end tags will be in place. Very little else is to be gained.
XML Schemas could provide a great deal more validation control in support of many of the best practices, particularly those that center around content format (such as the format of the ID, ISSN, or DOI). Schemas can also help validate both uniqueness (such as the DOIs) and key/key reference relationships (such as a journal date against article dates).
XML Schemas cannot assist with the validation of the IPS electronic version against the print source. This remains a manual check.
Many best practices exist because the DTD does not match the best practices for IPS SGML content. This is often due to some requirement for backward compatibility of tools and application software that would "break" if the DTD is not backwardly compatible.
XML Schema tools for publishers are, for the most part, nonexistent.
XML Schema processing tools that support validation and data augmentation are readily available.
SGML or XML DTDs may be used in a hybrid publishing environment right along with XML Schemas. The choice is not necessarily one or the other.
The team decided that the best approach for coding data is through the use of XML, as opposed to HTML or SGML. This decision is based on the following premises:
HTML, while great for presentation, is not robust enough to handle structured data applications on the Web
SGML is too difficult to use
SGML is too expensive
Mainstream SGML tools have not emerged
XML began as a simplified dialect of SGML that has been optimized for the Web. It removes much of the complexity found within SGML. There are mainstream tools for authoring, editing, and delivering data that has been coded using XML and new and better tools are emerging regularly. XML may not be "simple" to use, but a variety of improved tools greatly facilitate its use.
The team decided to utilize XML DTDs, rather than schemas, for the time being. This decision is a result of analyses of the current environment. Moving from SGML DTDs, which the organization currently uses, to XML DTDs is a natural progression that will facilitate the movement toward even more delivery channels. Using a schema, which adds a layer of complexity and constraint around data, is a next possible step, but is not an immediate goal.
Schemas enable data to be constrained more than a DTD will allow. This is valuable for normalizing data and achieving content consistency. While this project does not currently envision utilizing schema in the near term, it is important to note that transforming an XML DTD into an XML schema is not difficult to do and is a definite possibility for the future.