IDEAlliance

Feature

What's a Schema Anyway?

by Dave Peterson, Copyright 2001

Answer:  An XML Schema is a DTD on steroids wearing different clothes.

True, but it doesn’t really help. So we must take a closer look.

A DTD is a set of information about a document or class of documents.  That’s abstract.  A DTD is also a written description of that information.  Now we’re getting somewhere:  exactly the same words describe a Schema.  The difference, on the abstract side, is the “steroids”:  a lot more information can be in a Schema.  And on the written side, it’s expected to be written in a different language—that’s the “different clothes”.

The different language is simple to describe, so that comes first.  You’ve probably seen the traditional language for DTDs:  full of ‘<!ELEMENT’, ‘<!ATTLIST’, and ‘<!ENTITY’.  It’s a concise, special-purpose language for DTDs.  It was designed that way because back in the ’80s there were no authoring tools; DTDs had to be read and written entirely by hand.  The SGML designers considered writing DTDs using SGML “tag” markup, but found it too hard to read.

On the other hand, now there are tools for authoring XML (and SGML).  And there is in the XML culture a desire to do everything with tags when possible.  So the Schema designers chose to describe a Schema using an XML document—that is, using “tag” syntax.  This makes written Schemas (properly called Schema Documents) somewhat “wordier”—more difficult to read—than DTDs, in the raw, but also makes it easier to find tools to help deal with them.
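
As a rough illustration of that trade-off, here is a minimal Python sketch.  The memo vocabulary, the names DTD_TEXT and SCHEMA_TEXT, and the helper declared_elements are all invented for this example; the point is only that a Schema Document, being ordinary XML, can be read by any general-purpose XML parser, whereas the DTD text would need a parser written specially for DTD syntax.

```python
import xml.etree.ElementTree as ET

# A DTD fragment in the traditional declaration syntax (hypothetical vocabulary).
DTD_TEXT = """
<!ELEMENT memo (to, body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT body (#PCDATA)>
"""

# A roughly equivalent Schema Document, written with tags (also hypothetical).
SCHEMA_TEXT = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="memo">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="to" type="xs:string"/>
        <xs:element name="body" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

XS = "{http://www.w3.org/2001/XMLSchema}"


def declared_elements(schema_text):
    """Return the names of all element declarations, in document order."""
    root = ET.fromstring(schema_text)
    return [el.get("name") for el in root.iter(XS + "element")]


if __name__ == "__main__":
    print(declared_elements(SCHEMA_TEXT))
    # The Schema Document is noticeably longer than the DTD fragment.
    print(len(DTD_TEXT), len(SCHEMA_TEXT))
```

The Schema Document is several times the size of the DTD fragment, but no special-purpose parser was needed to pull the declarations out of it.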

The standard that describes schemas makes an explicit distinction between the abstract set of information (the Schema) and the written description thereof (one or more Schema Documents).  Not only does this make it easier to understand, it also makes it easier to consider alternate description languages.  So this series of articles on Schemas will start by discussing the abstractions.  Once you understand them, the language becomes relatively easy.

Let’s consider what you can say about a document with a DTD.  From the abstract point of view, an XML (or SGML) document is a typed hierarchical dataset with some extra links across the hierarchy (from ID/IDREF attributes).  “Typed hierarchical” just means that you can view the structure as a tree, or as a nested partition, and you can specify a “type” for each node of the tree (or each segment in the partition).  All data points are character strings, or identified clumps of character strings and smaller clumps.  The character strings are the leaves of the tree and the clumps are the branch points, or nodes.  The clumps are elements; the character strings are either PCDATA or attribute values.  Attributes have names; PCDATA strings (and subelements) are identified by where in their sequential order they occur.  For each type of element, you can tell what sequences of PCDATA and subelements (identified by type) can occur, but you can’t say anything about the lexical structure of the character strings.  On the other hand, attribute values can’t have any significant tree structure, but you can say a little bit about their lexical structure—what sequences of characters can occur (for example, the value of some particular attribute must be a “name”).  That’s it.  Sound familiar?
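
To make the “typed hierarchical” picture concrete, here is a small Python sketch that walks a document tree and reports each clump (element), its named attributes, and its PCDATA strings in sequential order.  The sample document and the function name outline are assumptions made up for this illustration, and the target attribute only imitates an ID/IDREF link, since no DTD is supplied to declare it as one.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; "target" mimics an IDREF pointing at id="m1".
DOC = ('<memo id="m1"><to>Staff</to>'
       '<body>See <ref target="m1"/> for details.</body></memo>')


def outline(elem, depth=0):
    """List the nodes of the tree: elements are branch points, PCDATA strings are leaves."""
    indent = "  " * depth
    lines = [f"{indent}element {elem.tag} attrs={dict(elem.attrib)}"]
    # PCDATA that appears before the first subelement.
    if elem.text and elem.text.strip():
        lines.append(f"{indent}  PCDATA {elem.text.strip()!r}")
    for child in elem:
        lines.extend(outline(child, depth + 1))
        # PCDATA that follows this subelement inside the same parent.
        if child.tail and child.tail.strip():
            lines.append(f"{indent}  PCDATA {child.tail.strip()!r}")
    return lines


if __name__ == "__main__":
    print("\n".join(outline(ET.fromstring(DOC))))
```

Note that the attributes are found by name, while the PCDATA strings and subelements are found only by their position within the parent.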

Schemas are similar, but you can say a lot more about permitted structures.  First of all, not all data in the abstract has to be character strings.  You can specify other types: various kinds of numbers, dates, times, etc., and require that certain occurrences of data be interpreted as these other types.  This is especially handy if you are dealing with wholly or partly non-character data.

Most Schema processors will convert the character strings in the “concrete” document into appropriate internal representations of the “abstract” typed data for you, so that each application does not need its own conversion subroutines.  This needs some explanation, since the bit-pattern representations are not specified by the Schema Recommendation; these representations generally vary from implementation to implementation.  Officially, a Schema processor will take the data structure generated by an ordinary XML parser and add to it information contained in the Schema, including information on how each of the character strings of data in the document is to be interpreted.  I expect that most Schema processors will in addition directly provide the non-character representation appropriate for the system on which they are running; why not?  That means less likelihood of error, as well as cheaper applications.
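
The kind of conversion described above might be sketched in Python as follows.  This is not how any particular Schema processor works; the CONVERTERS table and the interpret function are invented for illustration, they cover only a handful of the built-in types, and Python's own parsing rules are used as a stand-in for the exact lexical rules of the Recommendation.

```python
from datetime import date
from decimal import Decimal

# Hypothetical table mapping a few built-in type names to Python converters.
# Python's parsers are only an approximation of the official lexical rules.
CONVERTERS = {
    "xs:string": str,
    "xs:integer": int,
    "xs:decimal": Decimal,
    "xs:date": date.fromisoformat,  # handles plain YYYY-MM-DD only
    "xs:boolean": lambda s: {"true": True, "1": True,
                             "false": False, "0": False}[s],
}


def interpret(lexical, type_name):
    """Turn a character string from the document into a typed value."""
    # Strings keep their whitespace; other types ignore surrounding whitespace.
    text = lexical if type_name == "xs:string" else lexical.strip()
    return CONVERTERS[type_name](text)


if __name__ == "__main__":
    print(interpret("42", "xs:integer") + 1)
    print(interpret("2001-05-02", "xs:date").year)
    print(interpret(" 19.95 ", "xs:decimal") * 2)
```

The application receives an integer, a date, or a decimal number, in whatever representation suits the system, rather than a string it must parse itself.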

Schemas let you make these specifications rather uniformly for both attributes and PCDATA:  the equivalent in DTD terms of being able to make lexical structure restrictions on PCDATA as well as attributes.  (After all, in the DTD world all of your data consists of character strings; lexical structure is the only thing you can restrict.)  Schemas also give you the option of using other abstract data types, and provide standard translations from the “concrete document” character strings to the abstract data (and vice versa).

Schemas also let you describe the permitted structures of various types of elements, in somewhat more detail than you can with DTDs.  Some of the additions are pretty simple; others are quite complex.  Here’s an example of a simple one:  In SGML DTDs, you could say something like “elements of this type must have one subelement of each of the following types”, which was fine, but that structure was somewhat difficult to validate, and its extensions, such as “...one or more subelements...”, didn’t do what most people wanted:  they required all the subelements of each type to be grouped together, not intermixed with the other types.  Since that wasn’t useful and the construct was somewhat difficult to validate, XML DTDs just don’t allow that construct.  In Schemas, you can get it, and you get it in the “intermixed” version, which is much more useful.
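
The simplest form of that rule, “one subelement of each of the following types, in any order”, can be checked with a few lines of Python.  This sketch covers only that base case, not the “one or more” extensions; the function name has_one_of_each and the address vocabulary are assumptions made for the example, and a real Schema processor would of course do this from the Schema rather than from a hand-written list.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def has_one_of_each(elem, required_types):
    """True if elem's subelements are exactly one of each required type, in any order."""
    found = Counter(child.tag for child in elem)
    return found == Counter(required_types)


if __name__ == "__main__":
    required = ["street", "city", "zip"]  # hypothetical element types
    ok = ET.fromstring("<address><city/><street/><zip/></address>")
    missing = ET.fromstring("<address><city/><street/></address>")
    print(has_one_of_each(ok, required))
    print(has_one_of_each(missing, required))
```

The order of the subelements makes no difference to the check; only the count of each type matters.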

There is just one thing you can do with a DTD that you can’t do with a Schema:  declare entities (at least, entities that are to be parsed as part of the document)!  This is because of the way documents are processed in XML.  A pure or standard XML processing system reads the incoming “concrete” document and converts it to an abstract tree structure for internal consumption by an application.  This involves reading the DTD and, among other things, doing all of the entity replacements during the parsing process.  XML Schemas are a separate standard, just like style sheets and transformations.  So a Schema processor is expected to take as input the document tree produced by the XML parser and translate it into a tree that has additional Schema-derived data where appropriate (in addition to validating the data format and dataset structure against the rules in the Schema).  Since the entity processing is already complete when the Schema processor comes into play, it’s too late to declare and process entities then.  To do so would violate the general models of XML processing.  (Some people like this; others would like Schema processing to be more integrated so that Schemas could be used entirely in lieu of DTDs.)  If there is anything else that a DTD can do that a Schema can’t do, it’s likely to be pretty weird, and something no one is likely to want to do in real life, because no one has come up with any examples yet.
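
The ordering of those steps is easy to observe with an ordinary XML parser.  In the Python sketch below, the sample document, the entity name org, and the function parsed_text are invented for illustration; the parser replaces the entity while building the tree, so anything that later works from that tree (as a Schema processor would) sees only the replacement text.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with an entity declared in its internal DTD subset.
DOC = """<!DOCTYPE note [
  <!ENTITY org "IDEAlliance">
]>
<note>Published by &org;.</note>"""


def parsed_text(document):
    """Parse the document and return the root element's character data."""
    root = ET.fromstring(document)
    return root.text


if __name__ == "__main__":
    # By the time the tree exists, the entity reference is already gone.
    print(parsed_text(DOC))
```

Nothing in the resulting tree records that an entity was ever involved, which is why a later processing stage has no opportunity to declare one.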
