Piez, Wendell, Consultant, Mulberry Technologies, Inc., U.S.A.
Wendell Piez was born in Frankfurt, Germany, and spent his early years in Central Asia and the Far East. Educated at Yale and Rutgers, he has a Ph.D. in English. He has been working with markup languages since 1994. At Mulberry Technologies since 1997, he has designed and implemented XML-based systems both for clients and for internal use. A frequent contributor to XSL-List, he also sits on the Executive Council of the Association for Computers and the Humanities.
What does it mean to migrate a web-based publishing system from HTML to XML? It turns out that this question has no simple answer. Although XML is often -- misleadingly -- described as the Next Big Thing after HTML, as HTML's successor, or even as its competitor, in fact the choice between HTML and XML is not really best understood as "either/or". Many XML-based systems, if they publish to the web, will still use some kind of HTML as a "front end", even while the data in the back is in XML.
We are commonly faced with the proposition that, in order to gain this or that capability for our site, we should move to XML. In order to understand what such migration will mean, it helps to have a grasp of some of the technical issues involved. Bringing a critical eye to the markup itself reveals immediately -- even to a neophyte or non-expert -- how the data may be put to use and what kinds of applications it is capable of supporting. Accordingly, this presentation proceeds by breaking the problem down into several separate cases, each of which we have illustrated with a simple example. After briefly noting what the actual relation is between XML and HTML as standards, we will look at each of these cases in turn. Gaining a sense of what the differences are between the cases, and what the challenges are of migrating data from one approach to another, will clarify the alternatives, along with their advantages and costs.
In untangling the complexities of what it means to migrate from HTML to XML, and what issues it raises, there are some important topics that we will not address. We will not consider strategic questions such as why to do this in the first place: the assumption is that we can take for granted XML's advantages for data longevity, reusability, and so forth. Likewise, we will not be considering architectural or implementation issues such as which server technology to use, whether to store our data in a database or a file system, how to organize and administer the data, or whether to do necessary processing in batch mode, dynamically on the server, or on the client. These issues are important, and a very useful and interesting paper could be written describing how to approach them and the various dependencies these problems have (since the answer to any of these questions seems to be, "it depends"); but that paper is not this one. Instead, we will be looking at the formats themselves, to assess their differing capabilities as underlying technologies.
| HTML | XML |
|---|---|
| Fixed set of tags | User-made tags (infinite variety) |
| Flat | Structured (nested) |
| Total freedom (at your own risk) | Draconian error handling |
| Case insensitive | Case sensitive |
| End tags (mostly) optional | All start and end tags required |
| Syntax loose | Syntax strict |
| One linking tag/One link type | Any element may be a link/Many types and roles for links |
| Tags built into browser | Tags and specification input to processor |
(Most word processors and desk-top publishing applications are like this, too)
(These are logical steps; actual steps may blend/merge)
There are many objectives, but two are most important:
These two objectives may go together, but don't have to: decide what your needs are.
See exhibit-1 (some HTML code) and exhibit-2 (a screenshot of how it appears in a browser).
</p> <b><font face=sans-serif size=-1 color=#008B00>Habitat:</b> </font>By river banks, in ditches and in wet spots.</p>
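The fragment above makes the difference in error handling concrete. The following is a minimal sketch in Python (not one of the original exhibits), assuming only the standard library: a lenient HTML tokenizer accepts the fragment and simply reports its tags in the order they occur, while an XML parser rejects it outright. The names `TagLogger`, `is_well_formed`, `FRAGMENT` and `CORRECTED` are illustrative; `CORRECTED` is the same line as it appears in the hand-corrected exhibit.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The line from the hand-coded page: stray </p>, unquoted attribute
# values, and <b>/<font> closed in the wrong order.
FRAGMENT = ("</p> <b><font face=sans-serif size=-1 color=#008B00>Habitat:</b> "
            "</font>By river banks, in ditches and in wet spots.</p>")

# The same line from the hand-corrected, well-formed version.
CORRECTED = ('<p> <b><font face="sans-serif" size="-1" color="#008B00"> '
             'Habitat:</font></b> By river banks, in ditches and in wet spots.</p>')


class TagLogger(HTMLParser):
    """Lenient HTML tokenizer that records tags without checking nesting."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


logger = TagLogger()
logger.feed(FRAGMENT)
logger.close()

# The HTML tokenizer raises no error; the XML parser refuses the same input.
html_accepted = len(logger.events) > 0
xml_accepted = is_well_formed(FRAGMENT)
```

Here the tokenizer plays the part of the forgiving browser, and `is_well_formed` plays the part of any XML processor further down the line.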
See exhibit-4 (some HTML code) and exhibit-5 (a screenshot of how it appears in a browser).
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Sample XHTML Document</title>
</head>
<body>
<p>This is a small XHTML document.</p>
</body>
</html>
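Because an XHTML document is also an XML document, a generic XML parser can read it with no knowledge of HTML. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` (which reads the DOCTYPE declaration but does not fetch the external DTD); the variable names are illustrative:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
XML_NS = "http://www.w3.org/XML/1998/namespace"

DOCUMENT = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head> <title>Sample XHTML Document</title> </head>
<body> <p>This is a small XHTML document.</p> </body>
</html>"""

root = ET.fromstring(DOCUMENT)

# Every element is in the XHTML namespace, so queries must name it.
ns = {"x": XHTML_NS}
title = root.find("x:head/x:title", ns).text
paragraphs = [p.text for p in root.iterfind(".//x:p", ns)]

# xml:lang and lang are two distinct attributes to an XML processor.
xml_lang = root.get("{%s}lang" % XML_NS)
plain_lang = root.get("lang")
```

The namespace declaration on `html` is what tells an XML processor that these tags are XHTML and not some other vocabulary that happens to use the same names.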
All the gains from Well-formed HTML-XML, plus
Can still use XHTML and reflect (some) structure (see exhibit-9)
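The kind of structure reflected in exhibit-9 (nested div elements classed by heading level) can in principle be derived mechanically from a flat, well-formed page. What follows is a hedged sketch, not the method used to produce the exhibits: it assumes the headings are direct children of `body` and wraps everything from one heading up to the next heading of the same or a higher level in a `div`. The function name `nest_sections` is hypothetical.

```python
import xml.etree.ElementTree as ET

HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


def nest_sections(body):
    """Return a new body whose flat children are wrapped in nested
    <div class="hN"> elements, one div per heading."""
    new_body = ET.Element(body.tag, body.attrib)
    # Stack of (heading level, open container); level 0 is the body itself.
    stack = [(0, new_body)]
    for child in list(body):
        level = HEADING_LEVELS.get(child.tag)
        if level is not None:
            # A new heading closes every open section at its level or deeper.
            while stack[-1][0] >= level:
                stack.pop()
            div = ET.SubElement(stack[-1][1], "div", {"class": child.tag})
            stack.append((level, div))
        # Headings and ordinary content alike go into the innermost open div.
        stack[-1][1].append(child)
    return new_body


flat = ET.fromstring(
    "<body><h1>A Garland of Herbs</h1><p>intro</p>"
    "<h2>Organization</h2><p>overview</p>"
    "<h3>Habitat</h3><p>where found</p>"
    "<h2>Sources</h2><p>adapted</p></body>"
)
nested = nest_sections(flat)
result = ET.tostring(nested, encoding="unicode")
```

One limitation of this sketch: trailing content such as the final `hr` ends up inside the last open section, whereas the hand-tagged exhibits place it after the outermost one. Only well-formed input can be processed this way at all, which is the point of the earlier cleanup.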
Specific case of more general problem: mixing format with content
See exhibit-11, exhibit-12
HTML authors are used to named entities for characters not in basic character set
These are built into browsers, but in XML must be declared
If not XHTML, we must declare them ourselves (see exhibit-13 for an example)
If we declare entities ourselves, we break our pages in generation 4 browsers!
See exhibit-15
Why? Because the HTML parser in Netscape Navigator 4.x does not parse the XML DOCTYPE declaration properly: it treats the closing ]> as data
Another reason to have an external DTD (no declarations in page code)
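The entity problem is easy to reproduce with any XML parser. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`, which does not read external DTDs: the familiar HTML entity `&eacute;` is an error unless it is declared, here in an internal DTD subset of exactly the kind that trips up generation 4 browsers. The sample strings and the helper `parse_text` are illustrative.

```python
import xml.etree.ElementTree as ET

# An HTML author's habit: a named entity with no declaration in sight.
UNDECLARED = "<p>caf&eacute;</p>"

# The same content with the entity declared in the internal DTD subset.
DECLARED = '<!DOCTYPE p [ <!ENTITY eacute "&#233;"> ]>\n<p>caf&eacute;</p>'


def parse_text(markup):
    """Return the text of the root element, or None if parsing fails."""
    try:
        return ET.fromstring(markup).text
    except ET.ParseError:
        return None


undeclared_result = parse_text(UNDECLARED)
declared_result = parse_text(DECLARED)
```

A numeric character reference such as `&#233;` needs no declaration at all, which is one way to sidestep the problem when an external DTD is not an option.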
An XML document is valid if it:
(The document is valid "according to the DTD/schema")
exhibit-16 shows a very high-level markup language used to create catalog listings (document and DTD are shown)
Some formatting information is specified as parameters, but not much
Document model is descriptive, but only of pages in a very abstract way
But enough information is here to create something like exhibit-3 (see code in exhibit-2) or even exhibit-13 (see code in exhibit-11)
This kind of markup is versatile and easy to create/maintain; but of limited usefulness.
See exhibit-17
If HTML source is very clean, this code could be autogenerated
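As an illustration of what such autogeneration might look like, here is a hedged sketch that pulls labelled fields out of a well-formed page and re-tags them descriptively. It does not reproduce exhibit-17: the element names (`herb`, `name`, and names derived from the display labels) are hypothetical, and the input is an abbreviated copy of the well-formed comfrey page.

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated from the well-formed comfrey page.
PAGE = """<body>
<h1><font color="#008B00">Comfrey</font></h1>
<h4><i>Symphytum officinale</i></h4>
<p><b><font face="sans-serif" size="-1" color="#008B00"> Habitat:</font></b> By river banks, in ditches and in wet spots.</p>
<p><b><font face="sans-serif" size="-1" color="#008B00"> Active ingredient:</font></b> Allantoin</p>
</body>"""


def label_to_name(label):
    """Turn a display label such as ' Active ingredient:' into an element name."""
    return "-".join(re.findall(r"[A-Za-z]+", label.lower()))


def to_descriptive(body):
    """Build a descriptive element tree from a consistently formatted page."""
    herb = ET.Element("herb")  # hypothetical element name
    ET.SubElement(herb, "name").text = "".join(body.find("h1").itertext()).strip()
    for p in body.iterfind("p"):
        bold = p.find("b")
        if bold is None:
            continue  # not a labelled field
        label = "".join(bold.itertext())
        # The text following the </b> end tag is the field's value.
        value = (bold.tail or "").strip()
        ET.SubElement(herb, label_to_name(label)).text = value
    return herb


herb = to_descriptive(ET.fromstring(PAGE))
descriptive = ET.tostring(herb, encoding="unicode")
```

The sketch works only because the formatting in the source is applied consistently (a bold label, then the value); any page that departs from that convention needs hand correction first, which is what "very clean" has to mean in practice.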
<html> <body link=#8B0000 vlink=#8B0000 bgcolor=#FFFFFF> <!-- coded by hand, with content from herbal.simple.html --> <H1><FONT color=#008B00>Comfrey</h1></font> <h4><i>Symphytum officinale</h4></i> </p> <b><font face=sans-serif size=-1 color=#008B00> Habitat:</b></font> By river banks, in ditches and in wet spots. </p> <b><font face=sans-serif size=-1 color=#008B00></font> <font face=sans-serif size=-1 color=#008B00>Also called:</font> </b> Knitbone; Knitback; Consound; Blackwort; Bruisewort; Slippery Root; Boneset; Consolida; Ass Ear <br> <br> <b><font face=sans-serif size=-1 color=#008B00>Treatment for: <i></font></b><b><font color=#008B00>wounds; broken bones; ulcer; hernia; haemorrhage; bronchitis</font></b> </H4></I> <DL> <DT> <b><font face=sans-serif size=-1 color=#008B00>Preparation: </b></font> <br>[Root, rhizome, leaf] <DD>Unearth the roots in spring or autumn. Split and dry in fairly cool place. Infuse one to three tsp of the dried herb in a cup of water, bring to a boil and let simmer for 10-15 minutes. </dl> </p> <b><font face=sans-serif size=-1 color=#008B00>Active ingredient:</b></font> Allantoin <h4>vulnerary; demulcent; anti-inflammatory; astringent; expectorant</H4> <font size=-1>A relative of the forget-me-not, comfrey is recognizable by its broad, hairy leaves. One of the best known of traditional herbal treatments; its use goes back at least to the Middle Ages and into the indefinite past. Has been used for gout and aching joints as well as for all kinds of breaks, wounds and ulcers.</FONT> </body>
See exhibit-1 for the code: this stuff displays just fine (in Netscape 4.x). This shows how forgiving the browser is. (Or how forgiving this browser is.)
<html> <body link="#8B0000" vlink="#8B0000" bgcolor="#FFFFFF"> <!-- hand-corrected well-formed version of comfrey.bad.html --> <h1><font color="#008B00">Comfrey</font></h1> <h4><i>Symphytum officinale</i></h4> <p> <b><font face="sans-serif" size="-1" color="#008B00"> Habitat:</font></b> By river banks, in ditches and in wet spots.</p> <p> <b><font face="sans-serif" size="-1" color="#008B00"> Also called:</font></b> Knitbone; Knitback; Consound; Blackwort; Bruisewort; Slippery Root; Boneset; Consolida; Ass Ear</p> <p> <b><font face="sans-serif" size="-1" color="#008B00">Treatment for: </font></b><b><font color="#008B00">wounds; broken bones; ulcer; hernia; haemorrhage; bronchitis</font></b></p> <dl> <dt> <b><font face="sans-serif" size="-1" color="#008B00">Preparation:</font></b> <br class="br"/>[Root, rhizome, leaf]</dt> <dd>Unearth the roots in spring or autumn. Split and dry in fairly cool place. Infuse one to three tsp of the dried herb in a cup of water, bring to a boil and let simmer for 10-15 minutes.</dd> </dl> <p> <b><font face="sans-serif" size="-1" color="#008B00"> Active ingredient:</font></b> Allantoin</p> <h4>vulnerary; demulcent; anti-inflammatory; astringent; expectorant</h4> <p><font size="-1">A relative of the forget-me-not, comfrey is recognizable by its broad, hairy leaves. One of the best known of traditional herbal treatments; its use goes back at least to the Middle Ages and into the indefinite past. Has been used for gout and aching joints as well as for all kinds of breaks, wounds and ulcers.</font> </p> </body> </html>
<html> <!-- well-formed version with content from herbal.simple.html --> <body link="#8B0000" vlink="#8B0000" bgcolor="#FFFFFF"> <h1><a name="intro">A Garland of Herbs</a></h1> <p>This miniature herbal is created as a demonstration of structured data published in non-proprietary formats (HTML and XML). It is <i>not</i> intended for use as a reference on herbal remedies. Compiled by a non-expert from publicly-available sources, its content is not deliberately falsified or distorted; nevertheless it should not be regarded as authoritative in any way.</p> <p>While the contents of the herbal may not be trustworthy, however, its <i>structure</i> should be perfectly serviceable for the need: to present organized information in a way that both improves access for readers, and renders the dataset suitable for such automated processes as indexing and filtering.</p> <h2><i><font face="sans-serif"> <a name="Organization">Organization</a></font></i> </h2> <p>There is a consistent organization to each entry. Note that not all entries have all sections.</p> <h3><font face="sans-serif"><a name="Primary.Names"> Primary Names</a></font></h3> <p>Each herb is listed with its common name and its formal (Latin) botanical name.</p> <h3><font face="sans-serif"><a name="Habitat"> Habitat</a></font></h3> <p>Where the herb is commonly found is listed as its habitat. This section is mainly for interest: amateurs are not encouraged to go to anyplace described, harvest a likely candidate, and boil it up.</p> <h3><font face="sans-serif"><a name="Also.called">Also called</a></font></h3> <p>Any names by which the herb or plant may also be commonly known are listed here.</p> <h3><font face="sans-serif"><a name="Treatment.for"> Treatment for</a></font></h3> <p>Common ailments for which the herb is a known palliative (or even a cure), are listed here. This list is not exhaustive, of course; nor is it necessarily correct. 
(Herbal medicine has largely been an inexact science.)</p> <h3><font face="sans-serif"><a name="Preparation"> Preparation</a></font></h3> <p>Any preparation(s) for the herb is (are) described in this section. If different parts of the plant are used, their preparations are described separately.</p> <h3><font face="sans-serif"><a name="Active.Ingredients"> Active Ingredients</a></font></h3> <p>In some cases, where the active chemical component or components of an herb are known, they are listed here.</p> <h3><font face="sans-serif"><a name="Effects">Effects </a></font></h3> <p>Medical terms describing the pharmacological effects of the herb (e.g. <b>sedative</b>) are listed here.</p> <h3><font face="sans-serif"><a name="Description"> Description</a></font></h3> <p>Each herb is briefly described in one or more paragraphs.</p> <h3><font face="sans-serif"><a name="Notes">Notes </a></font></h3> <p>Any supplemental notes on the herb, especially respecting possible warnings associated with it, appear here.</p> <h2><i><font face="sans-serif"><a name="Sources"> Sources</a></font></i></h2> <p>This guide is adapted from several sources on the Internet (see references below). </p> <h2><i><font face="sans-serif"><a name="references"> References</a></font></i></h2> <p><font size="-1">Botanical.com, a hypertext edition of A Modern Herbal (M. Grieve, 1931) <a href="http://www.botanical.com/botanical/mgmh/mgmh.html"> http://www.botanical.com/botanical/mgmh/mgmh.html</a></font> </p> <p><font size="-1">Herbal Medicine Center at Healthworld Online: see <a href="http://www.healthy.net/clinic/therapy/herbal/"> http://www.healthy.net/clinic/therapy/herbal/</a></font> </p> <p><font size="-1">The Warnings page of Dr. Yang's Herbs and Gems for Health: <a href="http://www.ocnsignal.com/yangwarn2.shtml"> http://www.ocnsignal.com/yangwarn2.shtml</a></font></p> <hr /> </body> </html>
<html>
<!-- well-formed version with content from herbal.simple.html
enhanced with non-HTML tagging -->
<body link="#8B0000" vlink="#8B0000" bgcolor="#FFFFFF">
<H1-DIV>
<h1><a name="intro">A Garland of Herbs</a></h1>
<p>This miniature herbal is created as a demonstration of structured
data published in non-proprietary formats (HTML and XML). It is <i>not</i> intended for use as a reference on herbal remedies. Compiled by a non-expert from publicly-available sources, its content is not deliberately falsified or distorted; nevertheless it should not be regarded as authoritative in any way.</p>
<p>While the contents of the herbal may not be trustworthy, however,
its <i>structure</i> should be perfectly serviceable for the
need: to present organized information in a way that both improves access
for readers, and renders the dataset suitable for such automated processes
as indexing and filtering.</p>
<H2-DIV>
<h2><i><font face="sans-serif">
<a name="Organization">Organization</a></font></i>
</h2>
<p>There is a consistent organization to each entry. Note that not
all entries have all sections.</p>
<H3-DIV>
<h3><font face="sans-serif"><a name="Primary.Names">
Primary Names</a></font></h3>
<p>Each herb is listed with its common name and its formal (Latin)
botanical name.</p>
</H3-DIV>
<H3-DIV>
<h3><font face="sans-serif"><a name="Habitat">
Habitat</a></font></h3>
<p>Where the herb is commonly found is listed as its habitat. This
section is mainly for interest: amateurs are not encouraged to go to
anyplace described, harvest a likely candidate, and boil it up.</p>
</H3-DIV>
<H3-DIV>
<h3><font face="sans-serif"><a name="Also.called">Also
called</a></font></h3>
<p>Any names by which the herb or plant may also be commonly known
are listed here.</p>
</H3-DIV>
<H3-DIV>
<h3><font face="sans-serif"><a name="Treatment.for">
Treatment for</a></font></h3>
<p>Common ailments for which the herb is a known palliative (or even
a cure), are listed here. This list is not exhaustive, of course; nor is it
necessarily correct. (Herbal medicine has largely been an inexact science.)
</p>
</H3-DIV>
<H3-DIV>
<h3><font face="sans-serif"><a name="Preparation">
Preparation</a></font></h3>
<p>Any preparation(s) for the herb is (are) described in this section.
If different parts of the plant are used, their preparations are described
separately.</p>
</H3-DIV>
<H3-DIV>
<h3><font face="sans-serif"><a name="Active.Ingredients">
Active Ingredients</a></font></h3>
<p>In some cases, where the active chemical component or components
of an herb are known, they are listed here.</p>
</H3-DIV>
<H3-DIV>
<h3><font face="sans-serif"><a name="Effects">Effects
</a></font></h3>
<p>Medical terms describing the pharmacological effects of the herb
(e.g. <b>sedative</b>) are listed here.</p>
</H3-DIV>
<H3-DIV>
<h3><font face="sans-serif"><a name="Description">
Description</a></font></h3>
<p>Each herb is briefly described in one or more paragraphs.</p>
</H3-DIV>
<H3-DIV>
<h3><font face="sans-serif"><a name="Notes">Notes</a>
</font></h3>
<p>Any supplemental notes on the herb, especially respecting possible
warnings associated with it, appear here.</p>
</H3-DIV>
</H2-DIV>
<H2-DIV>
<h2><i><font face="sans-serif"><a name="Sources">
Sources</a></font></i></h2>
<p>This guide is adapted from several sources on the Internet (see
references below). </p>
</H2-DIV>
<H2-DIV>
<h2><i><font face="sans-serif"><a name="references">
References</a></font></i></h2>
<p><font size="-1">Botanical.com, a hypertext edition of A
Modern Herbal (M. Grieve, 1931)
<a href="http://www.botanical.com/botanical/mgmh/mgmh.html">
http://www.botanical.com/botanical/mgmh/mgmh.html</a>
</font></p>
<p><font size="-1">Herbal Medicine Center at Healthworld
Online: see <a href="http://www.healthy.net/clinic/therapy/herbal/">
http://www.healthy.net/clinic/therapy/herbal/</a></font>
</p>
<p><font size="-1">The Warnings page of Dr. Yang's Herbs and
Gems for Health: <a href="http://www.ocnsignal.com/yangwarn2.shtml">
http://www.ocnsignal.com/yangwarn2.shtml</a></font></p>
</H2-DIV>
</H1-DIV><hr />
</body>
</html>
Compare to exhibit-3
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"lib/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<!-- hand-corrected XHTML-valid version of comfrey.wf.html -->
<head>
<title>Comfrey</title>
</head>
<body link="#8B0000" vlink="#8B0000" bgcolor="#FFFFFF">
<h1><font color="#008B00">Comfrey</font></h1>
<h4><i>Symphytum officinale</i></h4>
<p><b><font face="sans-serif" size="-1" color="#008B00">
Habitat:</font></b> By river banks, in ditches and in wet spots.
</p>
<p><b><font face="sans-serif" size="-1" color="#008B00">
Also called:</font></b> Knitbone; Knitback; Consound;
Blackwort; Bruisewort; Slippery Root; Boneset; Consolida; Ass Ear</p>
<p><b><font face="sans-serif" size="-1" color="#008B00">
Treatment for: </font></b><b><font color="#008B00">
wounds; broken bones; ulcer; hernia; haemorrhage; bronchitis</font>
</b></p>
<dl>
<dt><b><font face="sans-serif" size="-1" color="#008B00">
Preparation:</font></b>
<br class="br"/>[Root, rhizome, leaf]</dt>
<dd>Unearth the roots in spring or autumn. Split and dry in fairly
cool place. Infuse one to three tsp of the dried herb in a cup of water,
bring to a boil and let simmer for 10-15 minutes.</dd></dl>
<p><b><font face="sans-serif" size="-1" color="#008B00">
Active ingredient:</font></b> Allantoin</p>
<h4>vulnerary; demulcent; anti-inflammatory; astringent; expectorant
</h4>
<p><font size="-1">A relative of the forget-me-not, comfrey is
recognizable by its broad, hairy leaves. One of the best known of
traditional herbal treatments; its use goes back at least to the Middle
Ages and into the indefinite past. Has been used for gout and aching joints
as well as for all kinds of breaks, wounds and ulcers.</font></p>
</body>
</html>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"lib/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<!-- hand-corrected XHTML-valid version of comfrey.wf.html -->
<head>
<title>Comfrey</title>
<style type="text/css">
body { background-color: #FFFFFF }
a:link { color: #8B0000 }
a:visited { color: #8B0000 }
</style>
</head>
<body>
<h1 style="color: #008B00">Comfrey</h1>
<h4><i>Symphytum officinale</i></h4>
<p><b>
<span style="font-family: sans-serif; font-size: smaller; color: #008B00">
Habitat:</span></b> By river banks, in ditches and in wet spots.
</p>
<p><b>
<span style="font-family: sans-serif; font-size: smaller; color: #008B00">
Also called:</span></b> Knitbone; Knitback; Consound; Blackwort;
Bruisewort; Slippery Root; Boneset; Consolida; Ass Ear</p>
<p><b>
<span style="font-family: sans-serif; font-size: smaller; color: #008B00">
Treatment for: </span></b><b>
<span style="color: #008B00">wounds; broken bones; ulcer; hernia;
haemorrhage; bronchitis</span></b></p>
<dl>
<dt>
<b>
<span style="font-family: sans-serif; font-size: smaller; color: #008B00">
Preparation:</span></b>
<br class="br"/>[Root, rhizome, leaf]</dt>
<dd>Unearth the roots in spring or autumn. Split and dry in fairly
cool place. Infuse one to three tsp of the dried herb in a cup of water,
bring to a boil and let simmer for 10-15 minutes.</dd></dl>
<p><b>
<span style="font-family: sans-serif; font-size: smaller; color: #008B00">
Active ingredient:</span></b> Allantoin</p>
<h4>vulnerary; demulcent; anti-inflammatory; astringent; expectorant
</h4>
<p><span style="font-size: smaller">A relative of the forget-me-not,
comfrey is recognizable by its broad, hairy leaves. One of the best known
of traditional herbal treatments; its use goes back at least to the Middle
Ages and into the indefinite past. Has been used for gout and aching joints
as well as for all kinds of breaks, wounds and ulcers.</span></p>
</body>
</html>
Compare to exhibit-6. Here, we have div elements with class attributes (valid in XHTML).
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"lib/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<!-- XHTML-valid version of intro.struct.html:
structural tagging changed to be XHTML-compliant -->
<head>
<title>A Garland of Herbs</title>
</head>
<body link="#8B0000" vlink="#8B0000" bgcolor="#FFFFFF">
<div class="h1">
<h1><a name="intro">A Garland of Herbs</a></h1>
<p>This miniature herbal is created as a demonstration of structured
data published in non-proprietary formats (HTML and XML). It is
<i>not</i> intended for use as a reference on herbal remedies.
Compiled by a non-expert from publicly-available sources, its content is
not deliberately falsified or distorted; nevertheless it should not be
regarded as authoritative in any way.</p>
<p>While the contents of the herbal should not be regarded as
trustworthy, however, its <i>structure</i> should be perfectly
serviceable for the need: to present organized information in a way that
both improves access for readers, and renders the dataset suitable for such
automated processes as indexing and filtering.</p>
<div class="h2">
<h2><i><font face="sans-serif">
<a name="Organization">Organization</a></font></i>
</h2>
<p>There is a consistent organization to each entry. Note that not
all entries have all sections.</p>
<div class="h3">
<h3><font face="sans-serif"><a name="Primary.Names">
Primary Names</a></font></h3>
<p>Each herb is listed with its common name and its formal (Latin)
botanical name.</p>
</div>
<div class="h3">
<h3><font face="sans-serif"><a name="Habitat">
Habitat</a></font></h3>
<p>Where the herb is commonly found is listed as its habitat. This
section is mainly for interest: amateurs are not encouraged to go to
anyplace described, harvest a likely candidate, and boil it up.</p>
</div>
<div class="h3">
<h3><font face="sans-serif"><a name="Also.called">Also
called</a></font></h3>
<p>Any names by which the herb or plant may also be commonly known
are listed here.</p>
</div>
<div class="h3">
<h3><font face="sans-serif"><a name="Treatment.for">
Treatment for</a></font></h3>
<p>Common ailments for which the herb is a known palliative (or even
a cure), are listed here. This list is not exhaustive, of course; nor is it
necessarily correct. (Herbal medicine has largely been an inexact science.)
</p>
</div>
<div class="h3">
<h3><font face="sans-serif"><a name="Preparation">
Preparation</a></font></h3>
<p>Any preparation(s) for the herb is (are) described in this section.
If different parts of the plant are used, their preparations are described
separately.</p>
</div>
<div class="h3">
<h3><font face="sans-serif"><a name="Active.Ingredients">
Active Ingredients</a></font></h3>
<p>In some cases, where the active chemical component or components
of an herb are known, they are listed here.</p>
</div>
<div class="h3">
<h3><font face="sans-serif"><a name="Effects">
Effects</a></font></h3>
<p>Medical terms describing the pharmacological effects of the herb
(e.g. <b>sedative</b>) are listed here.</p>
</div>
<div class="h3">
<h3><font face="sans-serif"><a name="Description">
Description</a></font></h3>
<p>Each herb is briefly described in one or more paragraphs.</p>
</div>
<div class="h3">
<h3><font face="sans-serif"><a name="Notes">
Notes</a></font></h3>
<p>Any supplemental notes on the herb, especially respecting possible
warnings associated with it, appear here.</p>
</div>
</div>
<div class="h2">
<h2><i><font face="sans-serif"><a name="Sources">
Sources</a></font></i></h2>
<p>This guide is adapted from several sources on the Internet (see
references below). </p>
</div>
<div class="h2">
<h2><i><font face="sans-serif"><a name="references">
References</a></font></i></h2>
<p><font size="-1">Botanical.com, a hypertext edition of A
Modern Herbal (M. Grieve, 1931)