
Converting HTML to XML

Piez, Wendell, Consultant, Mulberry Technologies, Inc., U.S.A.

Biography

Wendell Piez was born in Frankfurt, Germany, and spent his early years in Central Asia and the Far East. Educated at Yale and Rutgers, he has a Ph.D. in English. He has been working with markup languages since 1994. At Mulberry Technologies since 1997, he has designed and implemented XML-based systems both for clients and for internal use. A frequent contributor to XSL-List, he also sits on the Executive Council of the Association for Computers and the Humanities.



What does it mean to migrate a web-based publishing system from HTML to XML? It turns out that this question has no simple answer. Although XML is often -- misleadingly -- described as the Next Big Thing after HTML, as HTML's successor, or even as its competitor, in fact the choice between HTML and XML is not really best understood as "either/or". Many XML-based systems, if they publish to the web, will still use some kind of HTML as a "front end", even while the data in the back is in XML.

We are commonly faced with the proposition that, in order to gain this or that capability for our site, we should move to XML. In order to understand what such migration will mean, it helps to have a grasp of some of the technical issues involved. Bringing a critical eye to the markup itself reveals immediately -- even to a neophyte or non-expert -- how the data may be put to use and what kinds of applications it is capable of supporting. Accordingly, this presentation proceeds by breaking the problem down into several separate cases, each of which we have illustrated with a simple example. After briefly noting what the actual relation is between XML and HTML as standards, we will look at each of these cases in turn. Gaining a sense of what the differences are between the cases, and what the challenges are of migrating data from one approach to another, will clarify the alternatives, along with their advantages and costs.

In untangling the complexities of what it means to migrate from HTML to XML, and what issues it raises, there are some important topics that we will not address. We will not consider strategic questions such as why to do this in the first place: we take for granted XML's advantages for data longevity, reusability, and so forth. Likewise, we will not consider architectural or implementation issues such as which server technology to use, whether to store our data in a database or a file system, how to organize and administer the data, or whether to do the necessary processing in batch mode, dynamically on the server, or on the client. These issues are important, and a very useful and interesting paper could be written describing how to approach them and the dependencies among them (since the answer to any of these questions seems to be "it depends"), but that paper is not this one. Instead, we will be looking at the formats themselves, to assess their differing capabilities as underlying technologies.

Organization

What is HTML?

What do HTML Tags Say They Do?

HTML Tags Really Do Nothing

What is XML?

XML Documents

XML Markup Identifies Document Components

Content Markup

What type of information is this?

Structure Markup

What part of the document is this?

"Location or Navigation" Information

Added to text to make it more functional, useful or manageable

Metadata (Data about the Data)

Rendering/Processing Markup

How text should print, display, or behave

HTML is (Implicitly) Formatting Markup

Many HTML tags sound like generic elements (<em> for emphasis, <p> for paragraph, <li> for list item) but they are abused to create the desired display. For example, words that aren't "definition terms" are tagged <dt>.

HTML has codes for:

XML Tags Really Do Nothing

Differences Between XML and HTML

HTML                                  XML
Fixed set of tags                     User-made tags (infinite variety)
Flat                                  Structured (nested)
Total freedom (at your own risk)      Draconian error handling
  Case insensitive                      Case sensitive
  End tags (mostly) optional            All start and end tags required
  Syntax loose                          Syntax strict
One linking tag/One link type         Any element may be a link/Many types and roles for links
Tags built into browser               Tags and specification input to processor

XML has no Pre-defined Tags

HTML Markup is "Loose"

XML Markup is Strict

Tags indicate the beginning and end of all elements

... closing tags are required.

HTML Structure is Flat

(Most word processors and desktop publishing applications are like this, too)

XML Makes Nested Structures

Structured Documents Contain Nested, Retrievable Objects

Advantages of Structured Documents

An XML Document is a Tree

Tree Structure Indicates Nesting

HTML Error Handling: Permissive

XML Error Handling: Draconian

Converting Documents from HTML to XML

(These are logical steps; actual steps may blend/merge)

Objectives of Conversion to XML

There are many objectives, but two are most important:

These two objectives may go together, but don't have to: decide what your needs are.

Relation between XML and HTML

Many Levels of Conversion

Case 1: HTML to Well-formed HTML

Rules of Well-formed XML

All Elements that Start Must End

Since Document is a Tree, Elements May Not Overlap

XML Naming Rules

Element names, attribute names, entity names, etc.

XML Attribute Rules

               <element attribute="value">

Markup Delimiters Inside Data

A Special "Gotcha": Empty Elements

Solution: To "Trick" the Browser

Consider Some Bad Code

See exhibit-1 (some HTML code) and exhibit-2 (a screenshot of how it appears in a browser).

An example:

</p>
<b><font face=sans-serif size=-1 color=#008B00>Habitat:</b>
</font>By river banks, in ditches and in wet spots.</p>
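
As a rough illustration, the difference between this fragment and a well-formed version can be shown with any XML parser. The sketch below assumes Python and its standard xml.etree.ElementTree module (neither is part of the original presentation); the helper name is_well_formed and the wrapping fragment element are ours, not the author's.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML once wrapped in a single root."""
    try:
        # The wrapper supplies the single root element that XML requires.
        ET.fromstring(f"<fragment>{fragment}</fragment>")
        return True
    except ET.ParseError:
        return False


# The fragment from the example above: a stray </p>, unquoted attribute
# values, and overlapping <b> and <font> elements.
BAD = """</p>
<b><font face=sans-serif size=-1 color=#008B00>Habitat:</b>
</font>By river banks, in ditches and in wet spots.</p>"""

# A hand-corrected version of the same content.
FIXED = """<p>
<b><font face="sans-serif" size="-1" color="#008B00">Habitat:</font></b>
By river banks, in ditches and in wet spots.</p>"""

print(is_well_formed(BAD))    # False
print(is_well_formed(FIXED))  # True
```

A browser accepts both versions; an XML parser stops at the first error in the bad one.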

What Makes This So "Bad"?

What We Do to Fix It

What Did We Achieve?

A Trivial Document Conversion (Usually)

What Conversion Will Mean

What are the Complicating Factors?

What Hasn't Changed?

Processing: The Good with the Bad

What Do You Gain

Case 2: HTML to Structured HTML-XML

What Conversion Will Mean

Sorts of Questions that Need to be Asked

Adding Structure To Our Example

See exhibit-4 (some HTML code) and exhibit-5 (a screenshot of how it appears in a browser).

Notice how:

A Moderate Conversion (Usually)

Automation and Tools

What are the Complicating Factors?

Processing: Structured Data is More Useful

Gains From Explicit Structure

All the gains from well-formed HTML-XML, plus hierarchy can be used:

Case 3: HTML to XHTML

XHTML is

XHTML provides

XHTML specification available at:

A Caveat

What Conversion Will Mean

Sorts of Questions that Need to be Asked

What's the Difference (Between Kinds of XHTML)?

XHTML-1.0-Transitional

Definition:

Permits all the tags in HTML 4.0.

XHTML-1.0-Strict

Definition:

Most tags to control formatting are not allowed.

The idea is to use CSS (stylesheets) to control formatting instead.

XHTML-1.0-Frameset

Definition:

Use for documents that use frames.

Conforming to XHTML

A Bit of Trivia

In XHTML, all tags are lower case

Example of XHTML

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html 
   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Sample XHTML Document</title>
</head>
<body>
<p>This is a small XHTML document.</p>
</body>
</html>
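
A few of the surface requirements visible in this example (well-formedness, the XHTML namespace on the root element, lower-case names) can be checked with a general XML parser. The following is a minimal sketch in Python, which is our choice and not part of the presentation; the function name xhtml_surface_problems is hypothetical, and the check is deliberately partial: it does not validate against the XHTML DTD.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def xhtml_surface_problems(document: bytes) -> list:
    """List surface-level problems; an empty list does NOT prove DTD validity."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError as error:
        return [f"not well-formed: {error}"]
    problems = []
    if root.tag != f"{{{XHTML_NS}}}html":
        problems.append("root element is not html in the XHTML namespace")
    for element in root.iter():
        # ElementTree writes namespaced names as {namespace}local-name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name != local_name.lower():
            problems.append(f"element name is not lower case: {local_name}")
    return problems


SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Sample XHTML Document</title>
</head>
<body>
<p>This is a small XHTML document.</p>
</body>
</html>"""

print(xhtml_surface_problems(SAMPLE))  # []
```

Full conformance still requires validation against one of the three XHTML DTDs, which this sketch does not attempt.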

An Example of Transitional XHTML

The Same Example, Only Strict

Problem: Structure in XHTML

HTML Tidy: An Off-the-shelf Tool

Results of Running HTML Tidy

(See exhibit-10)

A Moderate Conversion (Usually)

Automation and Tools

What are the Complicating Factors?

Gains from Valid XHTML

All the gains from Well-formed HTML-XML, plus

Can still use XHTML and reflect (some) structure (see exhibit-9)

Interlude: Some Real-World Problems

HTML Tables (and Their Discontents)

Dealing With Tables Used for Layout

Two choices:

  • Mix presentational code (e.g. tables, font elements) with descriptive code, or

  • Separate problem into layers:

    • Source document contains only clean descriptive code: "content"

    • Presentation document created from source

Choice One: Mixing Descriptive and Presentational Code

  • Advantages:

    • "Quick and easy" (at least until there's a lot of it)

  • Disadvantages:

    • Hard to maintain

    • Hard to validate; DTD (if any) is a mess!

    • Pages get large, unwieldy, obtuse

See exhibit-14

Choice Two: Separating Content from Format

  • Advantages:

    • Data no longer locked into one presentation/platform

    • Content and presentation can be designed/maintained separately, so system is more scalable and long-lived

  • Disadvantages:

    • Usually cannot use off-the-shelf tag set (since it must describe your content)

    • Requires validation outside browser (via custom DTD or schema)

    • Requires infrastructure (application) to convert from source (one format) to presentation (another format)

    • Typically a stylesheet application
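
In miniature, the layering can be sketched as follows. This is an illustration only: it uses Python rather than a stylesheet language, and the source element names (herb, name, habitat) are hypothetical stand-ins for whatever descriptive tag set a project defines. The point is that the presentational markup (the b element, the label text) lives in the conversion code, not in the source document.

```python
import xml.etree.ElementTree as ET

# Source layer: clean descriptive markup only (hypothetical tag set).
SOURCE = (
    "<herb>"
    "<name>Comfrey</name>"
    "<habitat>By river banks, in ditches and in wet spots.</habitat>"
    "</herb>"
)


def render(herb: ET.Element) -> str:
    """Build a presentation document (HTML) from a descriptive source element."""
    body = ET.Element("body")
    heading = ET.SubElement(body, "h1")
    heading.text = herb.findtext("name")
    paragraph = ET.SubElement(body, "p")
    # The label and its bold styling are presentation decisions made here.
    label = ET.SubElement(paragraph, "b")
    label.text = "Habitat:"
    label.tail = " " + herb.findtext("habitat")
    return ET.tostring(body, encoding="unicode")


print(render(ET.fromstring(SOURCE)))
```

Changing the look of every page then means changing the render function once, not editing each source document.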

How do we do this? Stay tuned....

Navigation Apparatus in HTML

Entity Declarations for Special Characters

Note: Particular Problem With Internal Declarations

  • If we declare entities ourselves, we break our pages in generation 4 browsers!

    • See exhibit-15

    • Why? Because the HTML parser in Netscape Navigator 4.x does not parse the XML DOCTYPE declaration properly and treats the closing ]> as data

    • Also tested in IE5.5 (same problem)

  • Another reason to have an external DTD (no declarations in page code)
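
The underlying problem can be seen with any XML parser: a named entity such as &nbsp; is undefined unless a DTD declares it. One common workaround, not necessarily the one this presentation has in mind, is to replace HTML's named entities with numeric character references, which need no declaration. The sketch below assumes Python and its standard html.entities table; the function name to_numeric_references is ours.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# These five are predefined by XML itself and need no declaration.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_references(text: str) -> str:
    """Replace HTML named entities with numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave predefined or unknown names alone
        return f"&#{name2codepoint[name]};"
    return ENTITY.sub(replace, text)


original = "<p>Ass Ear&nbsp;&amp; more</p>"
converted = to_numeric_references(original)
print(converted)  # <p>Ass Ear&#160;&amp; more</p>

try:
    ET.fromstring(original)
except ET.ParseError:
    print("original: undefined entity")

# The converted text parses without any entity declarations.
ET.fromstring(converted)
```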

Case 4: HTML to User-defined Structure with a DTD/Schema

"Valid" is a Step Beyond "Well-formed"

An XML document is valid if it:

(The document is valid "according to the DTD/schema")

A DTD (Document Type Definition)

  • Models one type (class) of document (email message, memo, help-topic, journal article, bank transfer)

  • Is a set of rules describing how documents of that type can be marked up

  • Is written in the formal declaration syntax defined by the XML specification

DTDs Express Rules

for example:

  • article = metadata followed by article body, followed by optional back matter

  • paragraph = data characters including emphasis and/or hyperlinks and/or index terms

  • error message = error code number followed by message text

  • link has a "name" attribute and a "target" attribute
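
To make concrete what such rules constrain, two of them can be checked by hand. This is only a stand-in for what a validating parser does with a DTD, written in Python (our choice); the element names error, code, and message are hypothetical, since the rules above are given in prose, and the check ignores any text between child elements.

```python
import xml.etree.ElementTree as ET


def children_are(element: ET.Element, expected_names: list) -> bool:
    """True if the element's child elements are exactly this sequence of names."""
    return [child.tag for child in element] == expected_names


def has_attributes(element: ET.Element, required_names: list) -> bool:
    """True if every required attribute is present on the element."""
    return all(name in element.attrib for name in required_names)


# Rule: error message = error code number followed by message text.
good = ET.fromstring("<error><code>404</code><message>Not found</message></error>")
bad = ET.fromstring("<error><message>Not found</message><code>404</code></error>")

# Rule: link has a "name" attribute and a "target" attribute.
link = ET.fromstring('<link name="home" target="index.html"/>')

print(children_are(good, ["code", "message"]))   # True
print(children_are(bad, ["code", "message"]))    # False
print(has_attributes(link, ["name", "target"]))  # True
```

With a DTD, these rules are declared once and enforced by any validating parser, so no checking code of this kind has to be written or shared.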

DTDs Model Document Types

  • DTDs:

    • Name all the tags

    • Model each element's content

    • Name and define all attributes

    • Name and define all entities (e.g. special characters)

  • DTDs do not:

    • Provide real data typing

    • Model multi-element or other complex dependencies

Why Are DTDs Useful?

  • To share information, share the DTD

    • DTD ensures documents conform to model

    • Parties don't have to share processors or applications

  • DTDs often used in sets. Different but related DTDs can:

    • Provide conformance-testing at milestones in document lifecycle

    • Be more or less permissive

  • DTDs can also automatically customize applications

Architecture of an XML Document

  • XML Declaration (names XML version and options)

  • Document Type Declaration (optional)

    • Names the class of document (the root element)

    • May contain declarations that are part of a DTD

    • May point to an external DTD

  • The Document (tags and text)

DTDs versus Schema

  • DTD is current XML modeling syntax

  • Future modeling syntax will include at least one schema language

  • There have been several schema languages proposed (XML-Data, SOX, DCD, X-Schema, etc.)

  • W3C Working Group is working on one as this is written (may be done as we speak)

What the Schema Proposals Have In Common

How a DTD/Schema Helps Conversion

What Conversion will Mean

Sorts of Questions that Need to be Asked

Descriptive versus Prescriptive

Example: A Very High-level Structural XML

A More Descriptive Structural XML

Medium to Difficult Document Conversion

What are the Complicating Factors?

Tool Issues

Repository Architecture Issues

Gains During the Conversion Process

What Can You Now Do with the Data

-- All depending on richness of data

XML-aware Tools Make it Possible

Case 5: HTML to User-defined Content with a DTD/Schema

A Fully-tagged Herbal

Difficult Conversion

What Subject Conversion Adds to Previous Cases

Subject Tagging is an Implied Promise

Content Tagging: An Opportunity

Data Can Now Be Used to Provide

Backfile Conversion

When the Document Does Not Match the DTD

There are (only!) three options

  • Change the DTD

    • Public DTD

    • Private DTD

  • Rearrange/rewrite the document

  • Leave this document out of the conversion effort

The W3C has Provided Tools

Many Other Tools Also Available

How Do I Translate from HTML to XML?

HTML to XML May Not Be Simple

Translation may be

Other Element Mapping Complications

Elements may

XML Must be Hierarchical

Subject Tagging Complications

Is Round Trip Possible? XML to HTML

Exhibit 1: Some Bad HTML

file: comfrey.bad.html

<html>
<body link=#8B0000 vlink=#8B0000 bgcolor=#FFFFFF>
<!-- coded by hand, with content from herbal.simple.html -->
<H1><FONT color=#008B00>Comfrey</h1></font>
<h4><i>Symphytum officinale</h4></i>
</p>
<b><font face=sans-serif size=-1 color=#008B00>
Habitat:</b></font> By river banks, in ditches and in wet spots. 
</p>
<b><font face=sans-serif size=-1 color=#008B00></font>
<font face=sans-serif size=-1 color=#008B00>Also called:</font>
</b> Knitbone; Knitback; Consound; Blackwort; Bruisewort; Slippery 
Root; Boneset; Consolida; Ass Ear <br>&nbsp;
<br>
<b><font face=sans-serif size=-1 color=#008B00>Treatment for: 
<i></font></b><b><font color=#008B00>wounds; 
broken bones; ulcer; hernia; haemorrhage; bronchitis</font></b>
</H4></I>
<DL>
<DT>
<b><font face=sans-serif size=-1 color=#008B00>Preparation:
</b></font>
<br>[Root, rhizome, leaf]
<DD>Unearth the roots in spring or autumn. Split and dry in fairly 
cool place. Infuse one to three tsp of the dried herb in a cup of water, 
bring to a boil and let simmer for 10-15 minutes.
</dl>
</p>
<b><font face=sans-serif size=-1 color=#008B00>Active 
ingredient:</b></font> Allantoin 
<h4>vulnerary;  demulcent; anti-inflammatory; astringent; 
expectorant</H4>
<font size=-1>A relative of the forget-me-not, comfrey is 
recognizable by its broad, hairy leaves. One of the best known of 
traditional herbal treatments; its use goes back at least to the Middle 
Ages and into the indefinite past. Has been used for gout and aching 
joints as well as for all kinds of breaks, wounds and ulcers.</FONT>
</body>

Exhibit 2: Screen Shot of Some Bad HTML

file: comfrey.bad.jpg

See exhibit-1 for the code: this stuff displays just fine (in Netscape 4.x). This shows how forgiving the browser is. (Or how forgiving this browser is.)

Exhibit 3: The Same Page Cleaned Up

file: comfrey.wf.html

<html>
<body link="#8B0000" vlink="#8B0000" bgcolor="#FFFFFF">
<!-- hand-corrected well-formed version of comfrey.bad.html -->
<h1><font color="#008B00">Comfrey</font></h1>
<h4><i>Symphytum officinale</i></h4>
<p>
<b><font face="sans-serif" size="-1" color="#008B00">
Habitat:</font></b> By river banks, in ditches and in wet spots.</p>
<p>
<b><font face="sans-serif" size="-1" color="#008B00">
Also called:</font></b> Knitbone; Knitback; Consound; 
Blackwort; Bruisewort; Slippery Root; Boneset; Consolida; Ass Ear</p>
<p>
<b><font face="sans-serif" size="-1" color="#008B00">Treatment for: </font></b><b><font color="#008B00">wounds; broken bones; ulcer; hernia; haemorrhage; bronchitis</font></b></p>
<dl>
<dt>
<b><font face="sans-serif" size="-1" color="#008B00">Preparation:</font></b>
<br class="br"/>[Root, rhizome, leaf]</dt>
<dd>Unearth the roots in spring or autumn. Split and dry in fairly cool place. Infuse one to three tsp of the dried herb in a cup of water, bring to a boil and let simmer for 10-15 minutes.</dd>
</dl>
<p>
<b><font face="sans-serif" size="-1" color="#008B00">
Active ingredient:</font></b> Allantoin</p>
<h4>vulnerary;  demulcent; anti-inflammatory; astringent; 
expectorant</h4>
<p><font size="-1">A relative of the forget-me-not, comfrey is 
recognizable by its broad, hairy leaves. One of the best known of 
traditional herbal treatments; its use goes back at least to the Middle 
Ages and into the indefinite past. Has been used for gout and aching joints 
as well as for all kinds of breaks, wounds and ulcers.</font>
</p>
</body>
</html>

Exhibit 4: HTML With Implicit Structure

file: intro.html

<html>
<!-- well-formed version with content from herbal.simple.html -->
<body link="#8B0000" vlink="#8B0000" bgcolor="#FFFFFF">
<h1><a name="intro">A Garland of Herbs</a></h1>
<p>This miniature herbal is created as a demonstration of structured 
data published in non-proprietary formats (HTML and XML). It is 
<i>not</i> intended for use as a reference on herbal remedies. 
Compiled by a non-expert from publicly-available sources, its content is 
not deliberately falsified or distorted; nevertheless it should not be 
regarded as authoritative in any way.</p>
<p>While the contents of the herbal may not be trustworthy, however, 
its <i>structure</i> should be perfectly serviceable for the 
need: to present organized information in a way that both improves access 
for readers, and renders the dataset suitable for such automated processes 
as  indexing and filtering.</p>
<h2><i><font face="sans-serif">
<a name="Organization">Organization</a></font></i>
</h2>
<p>There is a consistent organization to each entry. Note that not 
all entries have all sections.</p>
<h3><font face="sans-serif"><a name="Primary.Names">
Primary Names</a></font></h3>
<p>Each herb is listed with its common name and its formal (Latin) 
botanical name.</p>
<h3><font face="sans-serif"><a name="Habitat">
Habitat</a></font></h3>
<p>Where the herb is commonly found is listed as its habitat. This 
section is mainly for interest: amateurs are not encouraged to go to 
anyplace described, harvest a likely candidate, and boil it up.</p>
<h3><font face="sans-serif"><a name="Also.called">Also 
called</a></font></h3>
<p>Any names by which the herb or plant may also be commonly known 
are listed here.</p>
<h3><font face="sans-serif"><a name="Treatment.for">
Treatment for</a></font></h3>
<p>Common ailments for which the herb is a known palliative (or even 
a cure), are listed here. This list is not exhaustive, of course; nor is 
it necessarily correct. (Herbal medicine has largely been an inexact 
science.)</p>
<h3><font face="sans-serif"><a name="Preparation">
Preparation</a></font></h3>
<p>Any preparation(s) for the herb is (are) described in this 
section. If different parts of the plant are used, their preparations are 
described separately.</p>
<h3><font face="sans-serif"><a name="Active.Ingredients">
Active Ingredients</a></font></h3>
<p>In some cases, where the active chemical component or components 
of an herb are known, they are listed here.</p>
<h3><font face="sans-serif"><a name="Effects">Effects
</a></font></h3>
<p>Medical terms describing the pharmacological effects of the herb 
(e.g. <b>sedative</b>) are listed here.</p>
<h3><font face="sans-serif"><a name="Description">
Description</a></font></h3>
<p>Each herb is briefly described in one or more paragraphs.</p>
<h3><font face="sans-serif"><a name="Notes">Notes
</a></font></h3>
<p>Any supplemental notes on the herb, especially respecting possible 
warnings associated with it, appear here.</p>
<h2><i><font face="sans-serif"><a name="Sources">
Sources</a></font></i></h2>
<p>This guide is adapted from several sources on the Internet (see 
references below). </p>
<h2><i><font face="sans-serif"><a name="references">
References</a></font></i></h2>
<p><font size="-1">Botanical.com, a hypertext edition of A 
Modern Herbal (M. Grieve, 1931) 
<a href="http://www.botanical.com/botanical/mgmh/mgmh.html">
http://www.botanical.com/botanical/mgmh/mgmh.html</a></font>
</p>
<p><font size="-1">Herbal Medicine Center at Healthworld 
Online: see <a href="http://www.healthy.net/clinic/therapy/herbal/">
http://www.healthy.net/clinic/therapy/herbal/</a></font>
</p>
<p><font size="-1">The Warnings page of Dr. Yang's Herbs and 
Gems for Health: <a href="http://www.ocnsignal.com/yangwarn2.shtml">
http://www.ocnsignal.com/yangwarn2.shtml</a></font></p>
<hr />
</body>
</html>

Exhibit 5: Screen Shot of HTML With Implicit Structure

file: intro.jpg

Exhibit 6: New Tags Make Structure Explicit

file: intro.struct.html

<html>
<!-- well-formed version with content from herbal.simple.html
     enhanced with non-HTML tagging -->
<body link="#8B0000" vlink="#8B0000" bgcolor="#FFFFFF">
<H1-DIV>
<h1><a name="intro">A Garland of Herbs</a></h1>
<p>This miniature herbal is created as a demonstration of structured 
data published in non-proprietary formats (HTML and XML). It is <i>not</i> intended for use as a reference on herbal remedies. Compiled by a non-expert from publicly-available sources, its content is not deliberately falsified or distorted; nevertheless it should not be regarded as authoritative in any way.</p>
<p>While the contents of the herbal may not be trustworthy, however, 
its <i>structure</i> should be perfectly serviceable for the 
need: to present organized information in a way that both improves access 
for readers, and renders the dataset suitable for such automated processes 
as  indexing and filtering.</p>
<H2-DIV>
<h2><i><font face="sans-serif">
<a name="Organization">Organization</a></font></i>
</h2>
<p>There is a consistent organization to each entry. Note that not 
all entries have all sections.</p>
<H3-DIV>
<h3><font face="sans-serif"><a name="Primary.Names">
Primary Names</a></font></h3>
<p>Each herb is listed with its common name and its formal (Latin) 
botanical name.</p>
</H3-DIV>
<H3-DIV>
<h3><font face="sans-serif"><a name="Habitat">
Habitat</a></font></h3>
<p>Where the herb is commonly found is listed as its habitat. This 
section is mainly for interest: amateurs are not encouraged to go to 
anyplace described, harvest a likely candidate, and boil it up.</p>
</H3-DIV>
<H3-DIV>
<h3><font face="sans-serif"><a name="Also.called">Also 
called</a></font></h3>
<p>Any names by which the herb or plant may also be commonly known 
are listed here.</p>
</H3-DIV>
<H3-DIV>
<h3><font face="sans-serif"><a name="Treatment.for">
Treatment for</a></font></h3>
<p>Common ailments for which the herb is a known palliative (or even 
a cure), are listed here. This list is not exhaustive, of course; nor is it 
necessarily correct. (Herbal medicine has largely been an inexact science.)
</p>
</H3-DIV>
<H3-DIV>
<h3><font face="sans-serif"><a name="Preparation">
Preparation</a></font></h3>
<p>Any preparation(s) for the herb is (are) described in this section. 
If different parts of the plant are used, their preparations are described 
separately.</p>
</H3-DIV>
<H3-DIV>
<h3><font face="sans-serif"><a name="Active.Ingredients">
Active Ingredients</a></font></h3>
<p>In some cases, where the active chemical component or components 
of an herb are known, they are listed here.</p>
</H3-DIV>
<H3-DIV>
<h3><font face="sans-serif"><a name="Effects">Effects
</a></font></h3>
<p>Medical terms describing the pharmacological effects of the herb 
(e.g. <b>sedative</b>) are listed here.</p>
</H3-DIV>
<H3-DIV>
<h3><font face="sans-serif"><a name="Description">
Description</a></font></h3>
<p>Each herb is briefly described in one or more paragraphs.</p>
</H3-DIV>
<H3-DIV>
<h3><font face="sans-serif"><a name="Notes">Notes</a>
</font></h3>
<p>Any supplemental notes on the herb, especially respecting possible 
warnings associated with it, appear here.</p>
</H3-DIV>
</H2-DIV>
<H2-DIV>
<h2><i><font face="sans-serif"><a name="Sources">
Sources</a></font></i></h2>
<p>This guide is adapted from several sources on the Internet (see 
references below). </p>
</H2-DIV>
<H2-DIV>
<h2><i><font face="sans-serif"><a name="references">
References</a></font></i></h2>
<p><font size="-1">Botanical.com, a hypertext edition of A 
Modern Herbal (M. Grieve, 1931) 
<a href="http://www.botanical.com/botanical/mgmh/mgmh.html">
http://www.botanical.com/botanical/mgmh/mgmh.html</a>
</font></p>
<p><font size="-1">Herbal Medicine Center at Healthworld 
Online: see <a href="http://www.healthy.net/clinic/therapy/herbal/">
http://www.healthy.net/clinic/therapy/herbal/</a></font>
</p>
<p><font size="-1">The Warnings page of Dr. Yang's Herbs and 
Gems for Health: <a href="http://www.ocnsignal.com/yangwarn2.shtml">
http://www.ocnsignal.com/yangwarn2.shtml</a></font></p>
</H2-DIV>
</H1-DIV><hr />
</body>
</html>
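
The wrapping shown in this exhibit is regular enough to sketch mechanically: each heading opens a section that runs until the next heading of the same or a higher level. The following Python sketch is our illustration, not the tool used to prepare the exhibit. It assumes the input is already well-formed and that headings are direct children of body, and the function name add_structure is hypothetical. Unlike the exhibit, it would place a trailing hr inside the last open section.

```python
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"^h([1-6])$")


def add_structure(body: ET.Element) -> ET.Element:
    """Return a new body whose flat children are wrapped in Hn-DIV elements."""
    new_body = ET.Element(body.tag, body.attrib)
    # Stack of (heading level, open container); level 0 is the body itself.
    stack = [(0, new_body)]
    for child in body:
        match = HEADING.match(child.tag)
        if match:
            level = int(match.group(1))
            # A heading closes every open section at its own level or deeper.
            while stack[-1][0] >= level:
                stack.pop()
            section = ET.SubElement(stack[-1][1], f"H{level}-DIV")
            stack.append((level, section))
        # Headings and ordinary content both go into the innermost open section.
        stack[-1][1].append(child)
    return new_body


flat = ET.fromstring(
    "<body><h1>A Garland of Herbs</h1><p>Intro</p>"
    "<h2>Organization</h2><p>Overview</p>"
    "<h3>Habitat</h3><p>Where found</p>"
    "<h2>Sources</h2><p>Adapted</p></body>"
)
nested = add_structure(flat)
print([element.tag for element in nested[0]])
# ['h1', 'p', 'H2-DIV', 'H2-DIV']
```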

Exhibit 7: Transitional XHTML

file: comfrey.trans.xhtml

Compare to exhibit-3

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "lib/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<!-- hand-corrected XHTML-valid version of comfrey.wf.html -->
<head>
<title>Comfrey</title>
</head>
<body link="#8B0000" vlink="#8B0000" bgcolor="#FFFFFF">
<h1><font color="#008B00">Comfrey</font></h1>
<h4><i>Symphytum officinale</i></h4>
<p><b><font face="sans-serif" size="-1" color="#008B00">
Habitat:</font></b> By river banks, in ditches and in wet spots.
</p>
<p><b><font face="sans-serif" size="-1" color="#008B00">
Also called:</font></b> Knitbone; Knitback; Consound; 
Blackwort; Bruisewort; Slippery Root; Boneset; Consolida; Ass Ear</p>
<p><b><font face="sans-serif" size="-1" color="#008B00">
Treatment for: </font></b><b><font color="#008B00">
wounds; broken bones; ulcer; hernia; haemorrhage; bronchitis</font>
</b></p>
<dl>
<dt><b><font face="sans-serif" size="-1" color="#008B00">
Preparation:</font></b>
<br class="br"/>[Root, rhizome, leaf]</dt>
<dd>Unearth the roots in spring or autumn. Split and dry in fairly 
cool place. Infuse one to three tsp of the dried herb in a cup of water, 
bring to a boil and let simmer for 10-15 minutes.</dd></dl>
<p><b><font face="sans-serif" size="-1" color="#008B00">
Active ingredient:</font></b> Allantoin</p>
<h4>vulnerary;  demulcent; anti-inflammatory; astringent; expectorant
</h4>
<p><font size="-1">A relative of the forget-me-not, comfrey is 
recognizable by its broad, hairy leaves. One of the best known of 
traditional herbal treatments; its use goes back at least to the Middle 
Ages and into the indefinite past. Has been used for gout and aching joints 
as well as for all kinds of breaks, wounds and ulcers.</font></p>
</body>
</html>

Exhibit 8: Strict XHTML

file: comfrey.strict.xhtml

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "lib/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<!-- hand-corrected XHTML-valid version of comfrey.wf.html -->
<head>
<title>Comfrey</title>
<style type="text/css">
body    { background-color: #FFFFFF }
a:link  { color: #8B0000 }
a:visited { color: #8B0000 }
</style>
</head>
<body>
<h1 style="color: #008B00">Comfrey</h1>
<h4><i>Symphytum officinale</i></h4>
<p><b>
<span style="font-family: sans-serif; font-size: smaller; color: #008B00">
Habitat:</span></b> By river banks, in ditches and in wet spots.
</p>
<p><b>
<span style="font-family: sans-serif; font-size: smaller; color: #008B00">
Also called:</span></b> Knitbone; Knitback; Consound; Blackwort; 
Bruisewort; Slippery Root; Boneset; Consolida; Ass Ear</p>
<p><b>
<span style="font-family: sans-serif; font-size: smaller; color: #008B00">
Treatment for: </span></b><b>
<span style="color: #008B00">wounds; broken bones; ulcer; hernia; 
haemorrhage; bronchitis</span></b></p>
<dl>
<dt>
<b>
<span style="font-family: sans-serif; font-size: smaller; color: #008B00">
Preparation:</span></b>
<br class="br"/>[Root, rhizome, leaf]</dt>
<dd>Unearth the roots in spring or autumn. Split and dry in fairly 
cool place. Infuse one to three tsp of the dried herb in a cup of water, 
bring to a boil and let simmer for 10-15 minutes.</dd></dl>
<p><b>
<span style="font-family: sans-serif; font-size: smaller; color: #008B00">
Active ingredient:</span></b> Allantoin</p>
<h4>vulnerary; demulcent; anti-inflammatory; astringent; expectorant
</h4>
<p><span style="font-size: smaller">A relative of the forget-me-not, 
comfrey is recognizable by its broad, hairy leaves. One of the best known 
of traditional herbal treatments; its use goes back at least to the Middle 
Ages and into the indefinite past. Has been used for gout and aching joints 
as well as for all kinds of breaks, wounds and ulcers.</span></p>
</body>
</html>

Exhibit 9: Structured XHTML

file: intro.struct.xhtml

Compare to exhibit-6. Here, we have div elements with class attributes (valid in XHTML).

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "lib/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<!-- XHTML-valid version of intro.struct.html:
     structural tagging changed to be XHTML-compliant -->
<head>
<title>A Garland of Herbs</title>
</head>
<body link="#8B0000" vlink="#8B0000" bgcolor="#FFFFFF">
<div class="h1">
<h1><a name="intro">A Garland of Herbs</a></h1>
<p>This miniature herbal is created as a demonstration of structured 
data published in non-proprietary formats (HTML and XML). It is 
<i>not</i> intended for use as a reference on herbal remedies. 
Compiled by a non-expert from publicly-available sources, its content is 
not deliberately falsified or distorted; nevertheless it should not be 
regarded as authoritative in any way.</p>
<p>While the contents of the herbal should not be regarded as 
trustworthy, however, its <i>structure</i> should be perfectly 
serviceable for the need: to present organized information in a way that 
both improves access for readers, and renders the dataset suitable for such 
automated processes as  indexing and filtering.</p>
<div class="h2">
<h2><i><font face="sans-serif">
<a name="Organization">Organization</a></font></i>
</h2>
<p>There is a consistent organization to each entry. Note that not 
all entries have all sections.</p>
<div class="h3">
<h3><font face="sans-serif"><a name="Primary.Names">
Primary Names</a></font></h3>
<p>Each herb is listed with its common name and its formal (Latin) 
botanical name.</p>
</div>
<div class="h3">
<h3><font face="sans-serif"><a name="Habitat">
Habitat</a></font></h3>
<p>Where the herb is commonly found is listed as its habitat. This 
section is mainly for interest: amateurs are not encouraged to go to 
anyplace described, harvest a likely candidate, and boil it up.</p>
</div>
<div class="h3">
<h3><font face="sans-serif"><a name="Also.called">Also 
called</a></font></h3>
<p>Any names by which the herb or plant may also be commonly known 
are listed here.</p>
</div>
<div class="h3">
<h3><font face="sans-serif"><a name="Treatment.for">
Treatment for</a></font></h3>
<p>Common ailments for which the herb is a known palliative (or even 
a cure), are listed here. This list is not exhaustive, of course; nor is it 
necessarily correct. (Herbal medicine has largely been an inexact science.)
</p>
</div>
<div class="h3">
<h3><font face="sans-serif"><a name="Preparation">
Preparation</a></font></h3>
<p>Any preparation(s) for the herb is (are) described in this section. 
If different parts of the plant are used, their preparations are described 
separately.</p>
</div>
<div class="h3">
<h3><font face="sans-serif"><a name="Active.Ingredients">
Active Ingredients</a></font></h3>
<p>In some cases, where the active chemical component or components 
of an herb are known, they are listed here.</p>
</div>
<div class="h3">
<h3><font face="sans-serif"><a name="Effects">
Effects</a></font></h3>
<p>Medical terms describing the pharmacological effects of the herb 
(e.g. <b>sedative</b>) are listed here.</p>
</div>
<div class="h3">
<h3><font face="sans-serif"><a name="Description">
Description</a></font></h3>
<p>Each herb is briefly described in one or more paragraphs.</p>
</div>
<div class="h3">
<h3><font face="sans-serif"><a name="Notes">
Notes</a></font></h3>
<p>Any supplemental notes on the herb, especially respecting possible 
warnings associated with it, appear here.</p>
</div>
</div>
<div class="h2">
<h2><i><font face="sans-serif"><a name="Sources">
Sources</a></font></i></h2>
<p>This guide is adapted from several sources on the Internet (see 
references below). </p>
</div>
<div class="h2">
<h2><i><font face="sans-serif"><a name="references">
References</a></font></i></h2>
<p><font size="-1">Botanical.com, a hypertext edition of A 
Modern Herbal (M. Grieve, 1931)