What's a Schema Anyway?
by Dave Peterson, Copyright 2001
Answer: An XML Schema is a DTD on steroids wearing different
clothes.
True,
but it doesn’t really help. So we must
take a closer look.
A
DTD is a set of information about a
document or class of documents.
That’s abstract. A DTD
is also a written description of that
information.
Now we’re getting somewhere:
exactly the same words describe
a Schema.
The difference, on the abstract
side, is the “steroids”:
a lot more information can be
in a Schema. And on the written side, it’s expected to be written in a different
language—that’s the “different clothes”.
The
different language is simple to describe,
so that comes first.
You’ve probably seen the traditional
language for DTDs:
full of ‘<!ELEMENT’, ‘<!ATTLIST’,
and ‘<!ENTITY’. It’s a concise, special-purpose language for DTDs.
It was designed that way because,
back in the ’80s, no authoring
tools existed yet; DTDs had to
be read and written entirely by hand.
The SGML designers considered
writing DTDs using SGML “tag” markup,
but found it too hard to read.
On
the other hand, now there are tools
for authoring XML (and SGML).
And there is in the XML culture
a desire to do everything with tags
when possible.
So the Schema designers chose
to describe a Schema using an XML document—that
is, using “tag” syntax.
This makes written Schemas (properly
called Schema
Documents) somewhat “wordier”—more
difficult to read—than DTDs, in the
raw, but also makes it easier to find
tools to help deal with them.
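To make the “different clothes” concrete, here is a minimal sketch in Python. The `note` vocabulary, the DTD text, and the Schema Document text are all invented for illustration; only the `xs:` element names come from the W3C XML Schema language. Because a Schema Document is itself XML, an ordinary XML parser (here the standard library's `xml.etree.ElementTree`) can read it, while the DTD remains a string in its own special-purpose syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD for a tiny "note" vocabulary, in the traditional syntax.
DTD_TEXT = """
<!ELEMENT note (to, body)>
<!ELEMENT to   (#PCDATA)>
<!ELEMENT body (#PCDATA)>
<!ATTLIST note id ID #REQUIRED>
"""

# Roughly the same declarations written as a Schema Document (tag syntax).
SCHEMA_DOCUMENT = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="note">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="to" type="xs:string"/>
        <xs:element name="body" type="xs:string"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:ID" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

XS = "{http://www.w3.org/2001/XMLSchema}"


def declared_element_names(schema_text):
    """List the name of every element declared in a Schema Document."""
    root = ET.fromstring(schema_text)
    return [decl.get("name") for decl in root.iter(XS + "element")]


names = declared_element_names(SCHEMA_DOCUMENT)
print(names)
```

The Schema Document is several times longer than the DTD, which is the “wordier” cost; the benefit is that a generic XML tool can already query it without any DTD-specific parsing code.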
The
standard that describes schemas makes
an explicit distinction between the
abstract set of information (the Schema)
and the written description thereof
(one or more Schema Documents).
Not only does this make Schemas easier
to understand, it also makes it easier
to consider alternate description languages.
So this series of articles on
Schemas will start by discussing the
abstractions. Once you understand them, the language becomes relatively easy.
Let’s
consider what you can say about a document
with a DTD.
From the abstract point of view,
an XML (or SGML) document is a typed
hierarchical dataset with some extra
links across the hierarchy (from ID/IDREF
attributes). “Typed hierarchical” just means that you can view the structure
as a tree, or as a nested partition,
and you can specify a “type” for each
node of the tree (or each segment in
the partition). All data points are character strings, or identified clumps
of character strings and smaller clumps.
The character strings are the
leaves of the tree and the clumps are
the branch points, or nodes.
The clumps are elements;
the character strings are either PCDATA
or attribute values. Attributes
have names; PCDATA strings (and subelements)
are identified by where in their sequential
order they occur.
For each type of element, you
can tell what sequences of PCDATA and
subelements (identified by type) can
occur, but you can’t say anything about
the lexical structure of the character
strings. On the other hand, attribute values can’t have any significant
tree structure, but you can say a little
bit about their lexical structure—what
sequences of characters can occur (for
example, the value of some particular
attribute must be a “name”).
That’s it.
Sound familiar?
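The tree view described above can be sketched with a short Python walk over a parsed document. The `memo` document and the `describe` helper are hypothetical, and the `id` and `target` attributes merely stand in for ID/IDREF: without a DTD, the parser treats them as plain strings.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: elements are the "clumps", strings are the leaves.
DOCUMENT = '<memo id="m1"><to>Pat</to><body>See <ref target="m0"/> first.</body></memo>'


def describe(element, depth=0):
    """Return one line per node: each element with its attributes, then its PCDATA."""
    lines = ["  " * depth + f"element {element.tag} attributes={element.attrib}"]
    if element.text and element.text.strip():
        lines.append("  " * (depth + 1) + f"PCDATA {element.text!r}")
    for child in element:
        lines.extend(describe(child, depth + 1))
        # Character data that follows a subelement is stored on that child as "tail".
        if child.tail and child.tail.strip():
            lines.append("  " * (depth + 1) + f"PCDATA {child.tail!r}")
    return lines


tree_lines = describe(ET.fromstring(DOCUMENT))
print("\n".join(tree_lines))
```

Attributes are looked up by name, while the PCDATA strings and subelements of `body` are identified only by their position in the sequence.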
Schemas
are similar, but you can say a lot more
about permitted structures.
First of all, not all data in
the abstract has to be character strings. You can specify other types (various kinds of numbers, dates,
times, etc.) and require that certain
occurrences of data be interpreted as
these other types.
This is especially handy if you
are dealing with data that is wholly or partly
non-character.
Most
Schema processors will convert the character
strings in the “concrete” document into
appropriate internal representations
of the “abstract” typed data for you,
so that each application does not need
its own subroutines to do the conversion.
This needs some explanation,
since the bit-pattern representations
are not specified by the Schema Recommendation—these
representations generally vary from
implementation to implementation.
Officially, a Schema processor
will take the data structure generated
by an ordinary XML parser and add to
it information contained in the Schema,
including information on how each of
the character strings of data in the
document is to be interpreted.
I expect that most Schema processors
will in addition directly provide the
non-character representation appropriate
for the system on which they are running
(why not?).
That means less likelihood of error,
as well as cheaper applications.
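As a rough illustration of that conversion step, the following Python sketch maps a few type names to functions that turn a document's character string into a native value. The `CONVERTERS` table and the `interpret` helper are hypothetical and much simplified (no time zones on dates, no facet checking); this is not how any particular Schema processor is implemented.

```python
from datetime import date
from decimal import Decimal

# Hypothetical table: type name -> function from character string to native value.
CONVERTERS = {
    "xs:string": str,
    "xs:integer": int,
    "xs:decimal": Decimal,
    "xs:date": date.fromisoformat,  # simplified: plain YYYY-MM-DD only
    "xs:boolean": lambda s: {"true": True, "1": True, "false": False, "0": False}[s],
}


def interpret(lexical, type_name):
    """Convert the character string found in the document to the declared type."""
    return CONVERTERS[type_name](lexical.strip())


quantity = interpret(" 42 ", "xs:integer")
shipped = interpret("2001-05-02", "xs:date")
price = interpret("3.50", "xs:decimal")
in_stock = interpret("true", "xs:boolean")
print(quantity, shipped, price, in_stock)
```

Putting this table in the processor rather than in each application is the source of the savings: every application receives an integer or a date, not a string it must parse itself.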
Schemas
let you make these specifications rather
uniformly for both attributes and PCDATA:
the equivalent, in DTD terms, of
being able to restrict the lexical structure
of PCDATA as well as of attributes.
(After all, in the DTD world
all of your data consists of character
strings; lexical structure is the only
thing you can restrict.)
Schemas also give you the option
of using other abstract data types,
and provide standard translations from
the “concrete document” character strings
to the abstract data (and vice versa).
Schemas
also let you describe the permitted
structures of various types of elements,
in somewhat more detail than you can
with DTDs.
Some of the additions are pretty
simple; others are quite complex.
Here’s an example of a simple
one:
In SGML DTDs, you could say
something like “elements of this type
must have one subelement of each of
the following types”, which was fine. But
that structure was somewhat difficult
to validate, and its extensions, such
as “...one or more subelements...”, didn’t
do what most people wanted:
they required all the subelements
of each type to be grouped together,
not intermixed with the other types.
Since that wasn’t useful and
the construct was somewhat difficult
to validate, XML DTDs just don’t allow
that construct.
In Schemas, you can get it, and
you get it in the “intermixed” version,
which is much more useful.
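The two rules contrasted above can be sketched as plain Python checks over a parsed element. Both helper functions and the `head` example are hypothetical; the sketch does not use a Schema processor and says nothing about how the Schema language itself spells the construct.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def one_of_each(parent, required):
    """True if the children are exactly one of each required type, in any order."""
    counts = Counter(child.tag for child in parent)
    return set(counts) == set(required) and all(counts[name] == 1 for name in required)


def one_or_more_of_each(parent, required):
    """True if every required type occurs at least once and nothing else occurs.

    Order is free, so occurrences of different types may be intermixed.
    """
    counts = Counter(child.tag for child in parent)
    return set(counts) == set(required)


REQUIRED = ["title", "author", "date"]
any_order = ET.fromstring("<head><date/><title/><author/></head>")
intermixed = ET.fromstring("<head><author/><title/><author/><date/></head>")

print(one_of_each(any_order, REQUIRED))
print(one_of_each(intermixed, REQUIRED))
print(one_or_more_of_each(intermixed, REQUIRED))
```

The second document has two `author` elements separated by a `title`; a rule that forced each type's occurrences to be grouped together would reject it, while the intermixed rule accepts it.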
There
is just one thing you can do with a
DTD that you can’t do with a Schema:
declare entities (at
least, entities that are to be parsed
as part of the document)!
This is because of the way documents
are processed in XML.
A pure or standard XML processing
system reads the incoming “concrete”
document and converts it to an abstract
tree structure for internal consumption
by an application.
This involves reading the DTD
and, among other things, doing all of
the entity replacements during the parsing
process.
XML Schemas are a separate standard,
just like style sheets and transformations.
So a Schema processor is expected
to take as input the document tree produced
by the XML parser and translate it into
a tree that has additional Schema-derived
data where appropriate (in addition
to validating the data format and dataset
structure against the rules in the Schema).
Since the entity processing is
already complete when the Schema processor
comes into play, it’s too late to declare and process entities then.
To do so would violate the general
models of XML processing.
(Some people like this; others
would like Schema processing to be more
integrated so that Schemas could be
used entirely in lieu of DTDs.)
If there is anything else that a DTD can do that a Schema
can’t do, it’s likely to be pretty weird,
and something no one is likely to want
to do in real life, because no one has
come up with any examples yet.
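The ordering argument about entities can be seen with any ordinary XML parser. In this Python sketch (the `memo` document and the `company` entity are made up for illustration), the standard library parser replaces the entity while it builds the tree, so whatever runs afterwards, a Schema processor included, sees only the replacement text.

```python
import xml.etree.ElementTree as ET

# Hypothetical document whose DTD internal subset declares one entity.
DOCUMENT = """<!DOCTYPE memo [
  <!ENTITY company "Acme Widgets">
]>
<memo>Welcome to &company;.</memo>"""

# Entity replacement happens here, during parsing.
tree = ET.fromstring(DOCUMENT)
text = tree.text
print(text)
```

By the time the tree exists there is no entity reference left in it, which is why a later processing stage is too late to declare one.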