XML 2001

Converting HTML to XML

Wendell Piez

What does it mean to migrate a web-based publishing system from HTML to XML? It turns out that this question has no simple answer. Although XML is often, and misleadingly, described as the Next Big Thing after HTML, as HTML's successor, or even as its competitor, the choice between HTML and XML is not best understood as “either/or”. Many XML-based systems that publish to the web still use some kind of HTML as a “front end”, while the data behind it is in XML.

We are commonly faced with the proposition that, in order to gain this or that capability for our site, we should move to XML. To understand what such a migration will mean, it helps to have a grasp of some of the technical issues involved. Bringing a critical eye to the markup itself reveals immediately, even to a non-expert, how the data may be put to use and what kinds of applications it can support. Accordingly, this presentation breaks the problem down into several separate cases, each illustrated with a simple example. After briefly noting the actual relation between XML and HTML as standards, we will look at each of these cases in turn. Gaining a sense of the differences between the cases, and of the challenges of migrating data from one approach to another, will clarify the alternatives, along with their advantages and costs.

In untangling what it means to migrate from HTML to XML, and what issues it raises, there are some important topics that we will not address. We will not consider strategic questions such as why to do this in the first place: the assumption is that we can take for granted XML's advantages for data longevity, reusability, and so forth. Likewise, we will not consider architectural or implementation issues such as which server technology to use, whether to store our data in a database or a file system, how to organize and administer the data, or whether to do the necessary processing in batch mode, dynamically on the server, or on the client. These issues are important, and a very useful and interesting paper could be written describing how to approach them and their various dependencies (since the answer to any of these questions seems to be “it depends”), but that paper is not this one. Instead, we will look at the formats themselves, to assess their differing capabilities as underlying technologies.

1. Organization

2. What is HTML?

3. What do HTML Tags Say They Do?

4. HTML Tags Really Do Nothing

5. What is XML?

6. XML Documents

7. XML Markup Identifies Document Components

7.1. Content Markup

What type of information is this?

7.2. Structure Markup

What part of the document is this?

7.3. “Location or Navigation” Information

Added to text to make it more functional, useful or manageable

7.4. Metadata (Data about the Data)

7.5. Rendering/Processing Markup

How text should print, display, or behave

7.6. HTML is (Implicitly) Formatting Markup

Many HTML tags sound like generic elements (<em> for emphasis, <p> for paragraph, <li> for list item), but they are abused to create the desired display. For example, words that aren't “definition terms” are tagged <dt>.

HTML has codes for:

8. XML Tags Really Do Nothing

9. Differences Between XML and HTML

HTML                                XML
Fixed set of tags                   User-made tags (infinite variety)
Flat                                Structured (nested)
Total freedom (at your own risk)    Draconian error handling
Case insensitive                    Case sensitive
End tags (mostly) optional          All start and end tags required
Syntax loose                        Syntax strict
One linking tag/One link type       Any element may be a link/Many types and roles for links
Tags built into browser             Tags and specification input to processor

10. XML has no Pre-defined Tags

11. HTML Markup is “Loose”

12. XML Markup is Strict

Tags indicate the beginning and end of all elements

. . . closing tags are required.
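
The rule about closing tags can be checked mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser stands in for "an XML processor"; the helper name is_well_formed and the two sample strings are illustrations, not part of the original slides.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the string parses as one well-formed XML document."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# HTML-style list: the <li> start tags are never closed.
loose = "<ul><li>one<li>two</ul>"
# Same list with every start tag matched by an end tag.
strict = "<ul><li>one</li><li>two</li></ul>"

print(is_well_formed(loose))   # False
print(is_well_formed(strict))  # True
```

An HTML browser would display both lists the same way; only the second is acceptable to an XML parser.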

13. HTML Structure is Flat

(Most word processors and desk-top publishing applications are like this, too)

14. XML Makes Nested Structures

15. Structured Documents Contain Nested, Retrievable Objects
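
As a rough illustration of what "retrievable" can mean in practice, the sketch below pulls one nested component out of a small document by its position in the hierarchy. The element names (plant, section, title, para), the type attribute, and the placeholder text are all hypothetical; only the habitat sentence is borrowed from the example used later in this presentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured document; the vocabulary is invented for illustration.
doc = ET.fromstring("""
<plant>
  <name>Example plant</name>
  <section type="habitat">
    <title>Habitat</title>
    <para>By river banks, in ditches and in wet spots.</para>
  </section>
  <section type="uses">
    <title>Uses</title>
    <para>Placeholder text.</para>
  </section>
</plant>
""")

# Retrieve only the paragraphs that sit inside the habitat section.
habitat = [p.text for p in doc.findall("./section[@type='habitat']/para")]

# Retrieve every section title, wherever it is nested.
titles = [t.text for t in doc.iter("title")]

print(habitat)
print(titles)
```

Neither query depends on how the text looks on screen; both rely only on the nesting of the elements.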

16. Advantages of Structured Documents

17. An XML Document is a Tree

18. Tree Structure Indicates Nesting
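
To make the tree concrete, here is a small sketch that prints an indented outline of a document, one line per element, with indentation showing depth of nesting. The document and the function name outline are invented for illustration.

```python
import xml.etree.ElementTree as ET

def outline(element, depth=0):
    """Return one indented line per element, depth-first."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

# Hypothetical document: a book with one chapter and an index.
doc = ET.fromstring(
    "<book><chapter><title/><para/><para/></chapter><index/></book>"
)

tree_lines = outline(doc)
print("\n".join(tree_lines))
```

Each element has exactly one parent, which is why the outline can be drawn without ambiguity.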

19. HTML Error Handling: Permissive

20. XML Error Handling: Draconian
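
The contrast between the two error-handling styles can be sketched with two parsers from the Python standard library. This is only an approximation: html.parser is a tolerant tokenizer, not a browser, and the sloppy sample string and the TagCollector class are made up for the demonstration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class TagCollector(HTMLParser):
    """Record every start tag the tolerant HTML parser reports."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

# Unclosed paragraphs, upper-case tags, and overlapping <B>/<I>.
sloppy = "<P>First<P>Second <B>bold <I>both</B> italic</I>"

# Permissive: the HTML parser reports what it can and never complains.
collector = TagCollector()
collector.feed(sloppy)
collector.close()
print(collector.tags)

# Draconian: the XML parser stops at the first well-formedness error.
try:
    ET.fromstring(sloppy)
    xml_error = None
except ET.ParseError as err:
    xml_error = str(err)
print(xml_error)
```

The HTML side also folds the tag names to lower case, which reflects HTML's case insensitivity.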

21. Converting Documents from HTML to XML

(These are logical steps; actual steps may blend/merge)

21.1. Objectives of Conversion to XML

There are many objectives, but two are most important:

These two objectives may go together, but don't have to: decide what your needs are.

21.2. Relation between XML and HTML

21.3. Many Levels of Conversion

22. Case 1: HTML to Well-formed HTML

22.1. Rules of Well-formed XML

22.2. All Elements that Start Must End

22.3. Since Document is a Tree, Elements May Not Overlap

22.4. XML Naming Rules

Element names, attribute names, entity names, etc.
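
A simplified check of the naming rules might look like the sketch below. It assumes ASCII-only names, which is narrower than what XML 1.0 actually allows (the specification also admits a wide range of non-ASCII letters); the function names are hypothetical.

```python
import re

# ASCII-only approximation of an XML 1.0 Name: a letter, underscore, or colon,
# followed by letters, digits, underscores, periods, colons, or hyphens.
_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_.:\-]*")

def is_xml_name(name: str) -> bool:
    """True if the whole string matches the simplified Name pattern."""
    return _NAME.fullmatch(name) is not None

def is_reserved(name: str) -> bool:
    """Names beginning with 'xml' (in any case) are reserved by the spec."""
    return name.lower().startswith("xml")

for candidate in ["para", "list-item", "_private", "2nd-item", "list item"]:
    print(candidate, is_xml_name(candidate))
```

Names may not start with a digit and may not contain spaces, so the last two candidates fail.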

22.5. XML Attribute Rules

               <element attribute="value">

22.6. Markup Delimiters Inside Data
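
When the characters that delimit markup (<, >, &) occur in the data itself, they have to be written as entity references. A minimal sketch, using the escape helper from Python's xml.sax.saxutils; the sample text and the code element are invented:

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Text that contains all three problem characters.
raw = "if (a < b && b > 0)"

# escape() replaces & first, then < and >.
safe = escape(raw)
print(safe)  # if (a &lt; b &amp;&amp; b &gt; 0)

# Once escaped, the text can sit inside an element and parse cleanly,
# and the parser hands back the original characters.
element = ET.fromstring(f"<code>{safe}</code>")
print(element.text == raw)  # True
```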

22.7. A Special “Gotcha”: Empty Elements

22.8. Solution: To “Trick” the Browser
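
The usual form of this trick, described in the HTML compatibility guidelines of the XHTML 1.0 specification, is to write empty elements with a space before the closing slash, as in <br />. The sketch below assumes that this is the trick meant here, and shows that to an XML parser the three spellings of an empty element are equivalent.

```python
import xml.etree.ElementTree as ET

# Three spellings of the same empty element.
a = ET.fromstring("<p>line one<br/>line two</p>")
b = ET.fromstring("<p>line one<br />line two</p>")
c = ET.fromstring("<p>line one<br></br>line two</p>")

# An XML parser sees no difference between them.
same = ET.tostring(a) == ET.tostring(b) == ET.tostring(c)
print(same)  # True

# ElementTree happens to serialize empty elements in the spaced form.
serialized = ET.tostring(ET.Element("br"), encoding="unicode")
print(serialized)  # <br />
```

The difference matters only to older HTML browsers, which tend to cope with the spaced form.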

22.9. Consider Some Bad Code

See (some HTML code) and (a screenshot of how it appears in a browser).

An example:

</p>
<b><font face=sans-serif size=-1 color=#008B00>Habitat:</b>
</font>By river banks, in ditches and in wet spots.</p>

22.10. What Makes This So “Bad”?

22.11. What We Do to Fix It
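
One plausible repair of the fragment shown earlier is sketched here; it is a reconstruction, not necessarily the fix the presentation had in mind. It assumes the stray </p> belonged to a preceding paragraph, supplies the missing <p> start tag, quotes the attribute values, and nests <font> inside <b> so that the elements no longer overlap.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the "Habitat" fragment (assumed repair).
fixed = (
    '<p><b><font face="sans-serif" size="-1" color="#008B00">'
    'Habitat:</font></b> '
    'By river banks, in ditches and in wet spots.</p>'
)

# The repaired fragment now parses as XML.
para = ET.fromstring(fixed)

# Attributes are reachable by name because they are properly quoted.
font = para.find("b/font")
print(font.attrib["color"])  # #008B00

# The text content is unchanged by the repair.
text = "".join(para.itertext())
print(text)
```

Note that the markup is still purely presentational: the repair makes it parseable, not more meaningful.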

22.12. What Did We Achieve?

22.13. A Trivial Document Conversion (Usually)

22.14. What Conversion Will Mean

22.15. What are the Complicating Factors?

22.16. What Hasn't Changed?

22.17. Processing: The Good with the Bad

22.18. What Do You Gain

23. Case 2: HTML to Structured HTML-XML

23.1. What Conversion Will Mean

23.2. Sorts of Questions that Need to be Asked

23.3. Adding Structure To Our Example

See (some HTML code) and (a screenshot of how it appears in a browser).

Notice how:

23.4. A Moderate Conversion (Usually)

23.5. Automation and Tools

23.6. What are the Complicating Factors?

23.7. Processing: Structured Data is More Useful

23.8. Gains From Explicit Structure

All the gains from well-formed HTML-XML, plus hierarchy can be used:

24. Case 3: HTML to XHTML

XHTML is

24.1. XHTML provides

XHTML specification available at: http://www.w3.org/TR/xhtml1

24.2. A Caveat

24.3. What Conversion Will Mean

24.4. Sorts of Questions that Need to be Asked

24.5. What's the Difference (Between Kinds of XHTML)?

XHTML-1.0-Transitional

Permits nearly all the tags in HTML 4, including the deprecated presentational ones (frameset documents have their own DTD).

XHTML-1.0-Strict

Most tags to control formatting are not allowed.

The idea is to use CSS (stylesheets) to control formatting instead.

XHTML-1.0-Frameset

Use for documents that use frames.

24.6. Conforming to XHTML

24.7. A Bit of Trivia

In XHTML, all element and attribute names must be lower case.

24.8. Example of XHTML

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html 
   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Sample XHTML Document</title>
</head>
<body>
<p>This is a small XHTML document.</p>
</body>
</html>
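
Because an XHTML document is XML, a generic XML parser can read it. The sketch below feeds the sample document to Python's xml.etree.ElementTree; this checks well-formedness only, since that parser does not validate against the DTD. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
XML_NS = "http://www.w3.org/XML/1998/namespace"

xhtml = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Sample XHTML Document</title>
</head>
<body>
<p>This is a small XHTML document.</p>
</body>
</html>"""

root = ET.fromstring(xhtml)

# Every element belongs to the XHTML namespace declared on <html>.
print(root.tag)

# Namespace-aware lookup of the title and the paragraph.
ns = {"x": XHTML_NS}
title = root.find("x:head/x:title", ns).text
para = root.find("x:body/x:p", ns).text
lang = root.attrib[f"{{{XML_NS}}}lang"]
print(title, "|", para, "|", lang)
```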

24.9. An Example of Transitional XHTML

24.10. The Same Example, Only Strict

24.11. Problem: Structure in XHTML

24.12. HTML Tidy: An Off-the-shelf Tool

24.13. Results of Running HTML Tidy

24.14. A Moderate Conversion (Usually)

24.15. Automation and Tools

24.16. What are the Complicating Factors?

24.17. Gains from Valid XHTML

All the gains from Well-formed HTML-XML, plus

Can still use XHTML and reflect (some) structure

25. Interlude: Some Real-World Problems

25.1. HTML Tables (and Their Discontents)

25.1.1. Dealing With Tables Used for Layout

Two choices:

25.1.1.1. Choice One: Mixing Descriptive and Presentational Code

25.1.1.2. Choice Two: Separating Content from Format

How do we do this? Stay tuned....