Converting HTML to XML
What does it mean to migrate a web-based publishing system from HTML to XML? It turns out that this question has no simple answer. Although XML is often - misleadingly - described as the Next Big Thing after HTML, as HTML's successor, or even as its competitor, in fact the choice between HTML and XML is not really best understood as “either/or”. Many XML-based systems, if they publish to the web, will still use some kind of HTML as a “front end”, even while the data in the back is in XML.
We are commonly faced with the proposition that, in order to gain this or that capability for our site, we should move to XML. In order to understand what such migration will mean, it helps to have a grasp of some of the technical issues involved. Bringing a critical eye to the markup itself reveals immediately - even to a neophyte or non-expert - how the data may be put to use and what kinds of applications it is capable of supporting. Accordingly, this presentation proceeds by breaking the problem down into several separate cases, each of which we have illustrated with a simple example. After briefly noting what the actual relation is between XML and HTML as standards, we will look at each of these cases in turn. Gaining a sense of what the differences are between the cases, and what the challenges are of migrating data from one approach to another, will clarify the alternatives, along with their advantages and costs.
In untangling the complexities of what it means to migrate from HTML to XML, and what issues it raises, there are some important topics that we will not address. We will not consider strategic questions such as why do this in the first place: the assumption is that we can take for granted XML's advantages for data longevity, reusability, and so forth. Likewise, we will not be considering architectural or implementation issues such as which server technology to use, whether to store our data in a database or a file system, how to organize and administer the data, whether to do necessary processing in batch mode, or dynamically on the server, or on the client, etc. etc. These issues are important; and a very useful and interesting paper could be written describing how to approach them and the various dependencies these problems have (since the answer to any of these questions seems to be, “it depends”): but that paper is not this one. Instead, we will be looking at the formats themselves, to assess their differing capabilities as underlying technologies.
1. Organization
-
Background
-
HTML
-
XML
-
Differences between HTML and XML
-
-
Conversion Cases
-
HTML to Well-formed XML
-
HTML to structured HTML-XML
-
HTML to XHTML
-
HTML to Valid XML
-
Structural Markup
-
Content-based Markup
-
-
2. What is HTML?
-
Document (web page) layout “language”
-
Link specification “language”
-
A set of tags
-
Mark start of document component
-
Mark end of component (maybe)
-
Name a hyperlink
-
3. What do HTML Tags Say They Do?
-
Link specific parts of your document (<a>)
-
To other places in your document
-
To sources outside your document
-
-
Format parts of your document (<b>, <h1>, <hr>, <font>)
-
Mark the structure of your document (<div>, <p>, <body>)
-
Describe your document's content (<abbr>, <address>)
4. HTML Tags Really Do Nothing
-
Tags have no inherent meaning
-
Browsers do things based on tags
-
Tags built into browser
-
Browser chooses tags to implement (<blink>)
-
Every browser may do differently
-
5. What is XML?
-
A W3C recommendation that provides:
-
Description of a data format
-
A data modeling language
-
-
The use of the XML data format in an application (e.g. a browser)
-
A “meta” language for:
-
Creating markup languages
-
Using markup languages
-
6. XML Documents
-
In XML jargon, your data (no matter what form) is called a document
-
Examples:
-
invoice
-
journal article
-
topic in a help system
-
database load file
-
email message
-
7. XML Markup Identifies Document Components
-
Content of the data
-
Structure of the document
-
Value-added information
-
Location and Navigation
-
Metadata
-
-
Rendering/Processing information (presentation and formatting)
7.1. Content Markup
What type of information is this?
-
Part number
-
Environmental impact
-
City, state, zip code
-
Question, answer
-
Methodology section
-
Hardware platform
-
Error message and error code
7.2. Structure Markup
What part of the document is this?
-
Email header
-
Paragraph
-
Title
-
Figure
-
Table
-
Signature block
-
List
7.3. “Location or Navigation” Information
Added to text to make it more functional, useful or manageable
-
Hypertext links
-
Cross-references
-
Indexing terms
7.4. Metadata (Data about the Data)
-
Bibliographic information
-
Revision or version
-
Status and workflow tracking information
-
Data source
-
Editor's or reviewer's comments
-
Abstracts, teasers, cataloging data
7.5. Rendering/Processing Markup
How text should print, display, or behave
-
Position of graphics on the page
-
Line breaks in titles
-
Iconization
-
Visual or auditory highlighting (sometimes a word is bold just because the author said so)
7.6. HTML is (Implicitly) Formatting Markup
-
H1-style, H2-style
-
skip a line <p>
Many HTML tags sound like generic elements (<em> for emphasis, <p> for paragraph, <li> for list item) but they are abused to create the desired display. For example, words that aren't “definition terms” are tagged <dt>.
HTML has codes for:
-
A little bit of metadata
-
One type of location tag
8. XML Tags Really Do Nothing
-
Tags still have no inherent meaning
-
Tags could be built into special-purpose software
-
General-purpose XML software won't build in tags
-
What general-purpose XML browsers and processors do is based on
-
Tagging in the document PLUS
-
Stylesheet, behavior sheet, output or processing specification
-
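As a rough illustration of "tagging PLUS specification", the sketch below processes one small document with two different rule sets. The element names (`note`, `title`, `para`) and both rule tables are invented for this example; they stand in for a real stylesheet, which would normally be written in a dedicated language rather than in Python.

```python
import xml.etree.ElementTree as ET

DOC = "<note><title>Habitat</title><para>By river banks.</para></note>"

# Two hypothetical "stylesheets": each maps an element name to a rendering rule.
# Nothing in the tags themselves says which of these outputs is the right one.
PLAIN_TEXT = {"title": lambda t: t.upper(), "para": lambda t: "  " + t}
HTML_OUT = {"title": lambda t: f"<h1>{t}</h1>", "para": lambda t: f"<p>{t}</p>"}


def render(xml_text, stylesheet):
    """Apply the rule for each child of the root; skip elements with no rule."""
    root = ET.fromstring(xml_text)
    lines = []
    for child in root:
        rule = stylesheet.get(child.tag)
        if rule is not None:
            lines.append(rule(child.text or ""))
    return "\n".join(lines)
```

Calling `render(DOC, PLAIN_TEXT)` and `render(DOC, HTML_OUT)` gives two different results from the same tagged data, which is the point: the behavior lives in the specification, not in the tags.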
9. Differences Between XML and HTML
| HTML | XML |
|---|---|
| Fixed set of tags | User-made tags (infinite variety) |
| Flat | Structured (nested) |
| Total freedom (at your own risk) | Draconian error handling |
| Case insensitive | Case sensitive |
| End tags (mostly) optional | All start and end tags required |
| Syntax loose | Syntax strict |
| One linking tag/One link type | Any element may be a link/Many types and roles for links |
| Tags built into browser | Tags and specification input to processor |
10. XML has no Pre-defined Tags
-
User communities make up tag sets
-
Tags match user needs/language
-
Structural tags are specific to a class of documents (<part>, <chapter>, <header>, <section>)
-
Content tags are domain-specific (<part_number>, <methodology>, <person_name>, <job_function>, <zip>)
11. HTML Markup is “Loose”
-
Most tags say where something starts (<h1>, <h2>, <h3>, <p>)
-
A few tags say where something ends
-
Some you need: </b>
-
Some you don't: </tr>, </p>
-
12. XML Markup is Strict
Tags indicate the beginning and end of all elements
. . . closing tags are required.
13. HTML Structure is Flat
-
Title, followed by
-
Paragraph, followed by
-
Heading Level 1, followed by
-
Paragraph, followed by
-
Paragraph
(Most word processors and desk-top publishing applications are like this, too)
14. XML Makes Nested Structures
-
Tags identify the start and end of each structure, not simply the start of a format
-
A document might contain:
-
Title followed by
-
Paragraph followed by
-
Section, containing
-
Title followed by
-
Paragraph followed by
-
Paragraph followed by ...
-
-
15. Structured Documents Contain Nested, Retrievable Objects
16. Advantages of Structured Documents
-
Store, retrieve, and reuse objects
-
Limit a search to within an object (or exclude an object from a search)
-
Handle at any level of granularity (detail)
-
Automatically derive tables of contents, lists of figures, indices, etc., from named structures
-
Cut and paste complete structures
-
Manage and manipulate at any level
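A minimal sketch of what "limit a search to within an object" can look like in practice, using Python's standard `xml.etree.ElementTree`. The document and its element names (`article`, `section`, `title`, `para`) are assumptions made up for this example, not a tag set the presentation prescribes.

```python
import xml.etree.ElementTree as ET

DOC = """<article>
  <title>Marsh plants</title>
  <section><title>Habitat</title><para>Found by river banks.</para></section>
  <section><title>Uses</title><para>Grown on garden banks.</para></section>
</article>"""


def sections_mentioning(xml_text, word):
    """Return the titles of sections whose paragraphs contain `word`."""
    root = ET.fromstring(xml_text)
    hits = []
    for section in root.iter("section"):
        if any(word in (p.text or "") for p in section.findall("para")):
            hits.append(section.findtext("title"))
    return hits


def paragraphs_in(xml_text, section_title):
    """Retrieve one named section as an object and return its paragraphs."""
    root = ET.fromstring(xml_text)
    for section in root.iter("section"):
        if section.findtext("title") == section_title:
            return [p.text for p in section.findall("para")]
    return []
```

Because each section is an explicit container, the search result can name the section a hit belongs to, and a whole section can be pulled out for reuse without guessing where it ends.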
17. An XML Document is a Tree
-
Nested document structure can be thought of as a tree
-
Top level element is the “root”
-
Only one root (<html></html>)
-
All other elements are nested inside that root, NOT placed side by side as in <html></html><body></body><img></img>
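One way to see the tree is to walk it. The sketch below, which assumes Python's standard `xml.etree.ElementTree` and a made-up sample document, lists each element indented by its depth below the root.

```python
import xml.etree.ElementTree as ET


def outline(element, depth=0):
    """Return one line per element, indented two spaces per nesting level."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(
    "<html><head><title>T</title></head><body><p>text</p></body></html>"
)
tree_lines = outline(root)
```

Here `tree_lines` starts with the single root `html`, and every other element appears somewhere beneath it.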
18. Tree Structure Indicates Nesting
19. HTML Error Handling: Permissive
-
Early design decision
-
If you don't know what a tag or structure is, ignore and keep going
-
Recover at all costs, even if you skip stuff
-
Result of this
-
Browser wars
-
Competition on tag sets
-
Unauthorized extensions
-
20. XML Error Handling: Draconian
-
Microsoft and Netscape (together) proposed to XML committee
-
An XML document is, by definition, “well-formed”
-
If there is a tagging or structure error, processor must quit
-
Result of this:
-
All XML parsers must produce the same parse tree (output to processor) from the same XML document
-
Consistent parsing means interoperability
-
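The "processor must quit" rule is easy to observe with any conforming parser. This sketch uses Python's standard `xml.etree.ElementTree`; the helper name `check_well_formed` is invented for the example.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if the text parses as XML, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The parser stops at the first error instead of guessing a recovery.
        return str(err)
    return None
```

A missing end tag (`<p>one<p>two</p>`), overlapping elements (`<b><i>x</b></i>`), or a second root all produce an error message rather than a "best effort" tree, which is exactly the contrast with permissive HTML handling.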
21. Converting Documents from HTML to XML
(These are logical steps; actual steps may blend/merge)
-
Change HTML tags to XML tags
-
Make document follow XML rules
-
Enhance with more content tagging
-
Enhance with new links
-
Validate to a document model (optional)
21.1. Objectives of Conversion to XML
There are many objectives, but two are most important:
-
(Well-formed) XML is easier to process
-
More consistent across vendors
-
More choice in tools and methods
-
Tools are more lightweight, APIs more standard....
-
-
XML tagging can be richer, more descriptive
-
“Separate format from content”
-
Support more data reuse
-
Publish in other formats?
-
These two objectives may go together, but don't have to: decide what your needs are.
21.2. Relation between XML and HTML
-
XML is defined as a syntax
-
HTML is defined as a vocabulary
-
A given markup language might be either or both (or neither)
21.3. Many Levels of Conversion
-
Case 1: HTML to Well-formed XML
-
Case 2: HTML to Structured HTML-XML
-
Case 3: HTML to XHTML
-
Case 4: HTML to User-defined Structure with a DTD/Schema
-
Case 5: HTML to User-defined Content with a DTD/Schema
22. Case 1: HTML to Well-formed XML
-
This is the simplest case
-
XML can be any tag set, so:
-
Keep the HTML tags
-
Create “well-formed” XML
-
22.1. Rules of Well-formed XML
-
Document is a tree with a single root
-
All elements must start and end
-
No elements overlap
-
There are rules for
-
Names
-
Attributes
-
Markup characters inside content
-
22.2. All Elements that Start Must End
-
Every element has a start tag and an end tag <html></html>, <font></font>
-
No unmatched end tags
-
Exception: empty elements, which may be either:
-
<img src="myfile.jpg"></img>
-
<img src="myfile.jpg"/>
-
22.3. Since Document is a Tree, Elements May Not Overlap
-
Bad: <element>text<another>aaa</element>bbb</another>
-
Okay: <element>text</element> <another><element>aaa</element>text</another>
22.4. XML Naming Rules
Element names, attribute names, entity names, etc.
-
Can be as long as needed
-
May not contain spaces
-
Are CaSe SenSiTive: <author> ≠ <Author> ≠ <AUTHOR>
-
Start and end tag names must match exactly: <author>...</author>
22.5. XML Attribute Rules
-
Element name required
-
Attribute name required
-
Equals sign “=” required
-
Paired (single or double) quotes are required
<element attribute="value">
22.6. Markup Delimiters Inside Data
-
Inside textual data and
-
Inside attribute values
-
A less-than sign “<” must be “&lt;”
-
An ampersand “&” must be “&amp;”
-
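In practice this escaping is usually left to a library rather than done by hand. A small sketch using `escape` and `quoteattr` from Python's standard `xml.sax.saxutils`; the `note` element and its `title` attribute are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

text = "Fish & chips cost < 5"
title = "R&D <draft>"

# escape() handles text content; quoteattr() escapes the value AND adds quotes.
escaped_text = escape(text)
element = f"<note title={quoteattr(title)}>{escaped_text}</note>"

# Parsing the result gives back the original, unescaped strings.
parsed = ET.fromstring(element)
```

After parsing, `parsed.text` and `parsed.get("title")` hold the original characters again: the escapes exist only in the serialized markup.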
22.7. A Special “Gotcha”: Empty Elements
-
Empty elements are elements with no content
-
Examples in HTML:
-
<br> for line breaks
-
<hr> for horizontal rules
-
<img src="myfile.jpg"> for images
-
-
XML tagging must have both open/close, or “sole” tag, e.g. <br/>, <hr/>, <img src="myfile.jpg"/>
-
But Netscape 4.x doesn't recognize <br/> or <hr/> (IE does)
22.8. Solution: To “Trick” the Browser
-
Netscape does recognize <br /> (note extra space!)
-
Force this by using a “dummy” attribute, e.g.
-
<br class="br"/>
-
<br class="x"/>
-
-
These will display correctly, should not be munged by tools, and are well-formed XML
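A rough sketch of how the empty-element fix might be scripted with a regular expression. It uses the "extra space" form rather than the dummy attribute, covers only three element names chosen for illustration, and assumes attribute values contain no ">" character; a real conversion would want a proper parser.

```python
import re

EMPTY_ELEMENTS = ("br", "hr", "img")  # illustrative subset, not a complete list

# Matches <br>, <BR/>, <br />, <img ...> and so on, with any attributes.
_EMPTY_TAG = re.compile(
    r"<(%s)\b([^<>]*?)\s*/?>" % "|".join(EMPTY_ELEMENTS), re.IGNORECASE
)


def close_empty_elements(html_text):
    """Rewrite empty elements as <name ... /> with a space before the slash."""

    def fix(match):
        name = match.group(1).lower()    # fold the element name to lower case
        attrs = match.group(2).strip()   # attributes are kept exactly as written
        return f"<{name} {attrs} />" if attrs else f"<{name} />"

    return _EMPTY_TAG.sub(fix, html_text)
```

For example, `close_empty_elements("a<br>b<HR>c")` yields `a<br />b<hr />c`. Attribute names and unquoted values are left untouched here; quoting them is a separate step.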
22.9. Consider Some Bad Code
See (some HTML code) and (a screenshot of how it appears in a browser).
An example:
</p> <b><font face=sans-serif size=-1 color=#008B00>Habitat:</b> </font>By river banks, in ditches and in wet spots.</p>
22.10. What Makes This So “Bad”?
-
Attribute values unquoted
-
Unbalanced tagging
-
Start tags, no end tags (e.g. </p>)
-
End tags, no start tags (e.g. <dl>)
-
Unmatched tags (e.g. <H4>...</h4>)
-
Overlapping elements
-
No </html> close tag!
-
-
Gratuitous tagging
-
Extra <font> tags, <i> tags...
-
-
Inconsistency
-
Different tagging, same effect
-
E.g. sometimes </p>, sometimes <br> <br>...
-
22.11. What We Do to Fix It
-
Supply missing quotes in tags (attribute values)
-
Fix empty elements
-
<hr> becomes <hr />, etc.
-
-
Supply missing tags (start or end)
-
Fix element overlapping
22.12. What Did We Achieve?
-
HTML file is now well-formed
-
Can be parsed, processed, in general-purpose XML software
-
Displayed (with stylesheet)
-
Processed (sorted, filtered etc.)
-
Analyzed
-
Enhanced
-
-
Can still use HTML software (since tags are still HTML)
22.13. A Trivial Document Conversion (Usually)
-
Few to no decisions to be made
-
No subject knowledge required
-
No XML or parsing knowledge required (just well-formedness rules)
-
Whole process can be automated
-
Any tool will do (XSL, Perl, Python, OmniMark)
22.14. What Conversion Will Mean
-
Add root <html> if not there
-
Solve name case matches by case-folding
-
Quote all attributes
-
Add end tags to keep structure flat
-
Most structures end when next one starts <H1>, <p>...
-
Only a few structures contain other structures (<html>, <body>, <table>...)
-
Inline tagging (e.g. <i>) should already be balanced ... just make nesting clean
-
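The steps above can be sketched with Python's standard `html.parser`, which already folds tag and attribute names to lower case. This is only one possible approach under simplifying assumptions: the `EMPTY` and `FLAT` sets are illustrative subsets, overlap is "fixed" by closing inner elements early, and comments, DOCTYPEs, and duplicate attributes are not handled.

```python
from html import escape
from html.parser import HTMLParser

EMPTY = {"br", "hr", "img", "meta", "link", "input"}  # no content, no end tag
FLAT = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}      # end when the next starts


class WellFormedWriter(HTMLParser):
    """Re-emit tag-soup HTML as well-formed XML, keeping the HTML tag names."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments
        self.stack = []  # names of currently open elements

    def _open(self, tag, attrs=()):
        # Quote every value; a minimized attribute gets its own name as value.
        attr_text = "".join(
            f' {name}="{escape(value if value is not None else name)}"'
            for name, value in attrs
        )
        if tag in EMPTY:
            self.out.append(f"<{tag}{attr_text} />")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def _close_through(self, tag):
        # Close every open element up to and including the nearest `tag`.
        while self.stack:
            name = self.stack.pop()
            self.out.append(f"</{name}>")
            if name == tag:
                break

    def handle_starttag(self, tag, attrs):
        if not self.stack and tag != "html":
            self._open("html")                  # supply a missing root
        if tag == "html" and self.stack:
            return                              # ignore a second root
        if tag in FLAT:
            for name in reversed(self.stack):
                if name in FLAT:
                    self._close_through(name)   # supply the missing end tag
                    break
        self._open(tag, attrs)

    def handle_endtag(self, tag):
        # Unmatched end tags are dropped; the root is closed only at the end.
        if tag != "html" and tag in self.stack:
            self._close_through(tag)

    def handle_data(self, data):
        if not self.stack:
            if not data.strip():
                return
            self._open("html")
        self.out.append(escape(data, quote=False))


def to_well_formed(html_text):
    writer = WellFormedWriter()
    writer.feed(html_text)
    writer.close()
    writer._close_through(None)  # close whatever is still open, root included
    return "".join(writer.out)
```

For instance, `to_well_formed("<P>one<P>two<BR>")` returns `<html><p>one</p><p>two<br /></p></html>`. The output is parseable, but as the following slides note, it is no better structured than the input was.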
22.15. What are the Complicating Factors?
-
End tags may not match anything that ever started
-
Element overlap may need human intervention
-
There is no requirement that existing structure make sense
-
Sometimes difficult to determine where elements really end
22.16. What Hasn't Changed?
-
Document is still HTML
-
Any HTML browser can display it
-
All the old errors (except spare end tags) are still there
-
There is no improvement in
-
Data reuse
-
Retrieval precision
-
Print (or other) formatting capabilities
-
22.17. Processing: The Good with the Bad
-
Good: no extra information is needed
-
Bad: no extra information is added
-
Good: data interchange improves
-
Easier to determine start/end/completion
-
Processing marginally easier
-
-
Bad: No enriched linking or content
22.18. What Do You Gain
-
Document may still be ugly, but data can be used with
-
Most standard HTML tools (served as media type text/html)
-
All standard XML tools (served as media type text/xml or media type application/xml with style sheet support)
-
DOM (Document Object Model) applications for XML or HTML will work: consistent processing can be ensured
-
-
Clean markup is easier to maintain!
-
XML editors can make maintenance, cleanup even easier
23. Case 2: HTML to Structured HTML-XML
-
All XML documents are nested structures
-
More useful conversion would
-
Keep HTML tags
-
Add tags for structure
-
Capture the nesting in your HTML document
-
Create proper hierarchical structure
-
23.1. What Conversion Will Mean
-
Same as well-formed HTML-XML case, plus
-
Add tags intelligently to indicate structure, e.g.,
-
<H1-div>
-
Starts with a <H1>
-
Ends just before the next <H1> in source
-
-
Most elements contain other elements
-
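A minimal sketch of the H1 rule, assuming the input is already well-formed and flat (the result of Case 1). It uses `<div class="h1-div">` as the wrapper, which is one answer to the "what tags for structure?" question raised below; an element literally named `H1-div` would work the same way.

```python
import xml.etree.ElementTree as ET


def add_h1_sections(body):
    """Wrap each <h1> and the siblings that follow it in a <div class="h1-div">."""
    children = list(body)
    for child in children:
        body.remove(child)
    current = body  # anything before the first <h1> stays directly in the body
    for child in children:
        if child.tag == "h1":
            # A new section starts here and runs until just before the next <h1>.
            current = ET.SubElement(body, "div", {"class": "h1-div"})
        current.append(child)
    return body


body = ET.fromstring(
    "<body><p>intro</p><h1>A</h1><p>a1</p><h1>B</h1><p>b1</p><p>b2</p></body>"
)
add_h1_sections(body)
structured = ET.tostring(body, encoding="unicode")
```

The same pass could be repeated inside each new section for `<h2>`, and so on, if the documents follow a strict hierarchy.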
23.2. Sorts of Questions that Need to be Asked
-
Rules need to be established, e.g.,
-
Ordinary rules of logical structure
-
Style or writing guidelines
-
Formal document modeling rules (DTD or schema)
-
-
Rule examples:
-
Can paragraphs nest, or does a new one end the previous?
-
Is an <H2> allowed inside a <p>?
-
Is this a strict hierarchy? (H1-sections contain H2-sections, which contain H3-sections, ad infinitum)
-
Is this a loose hierarchy? (any kind of section in any order, they don't nest)
-
-
What tags for structure? All <div>? Or enforce hierarchy with levels?
-
Must tags conform to HTML standard? (browsers will forgive even if not)
23.3. Adding Structure To Our Example
See (some HTML code) and (a screenshot of how it appears in a browser).
Notice how:
-
Section on Organization appears “inside” main text
-
Sections on Primary Names, Habitat etc. appear “inside” Organization
-
But this organization is implicit — levels of nesting are indicated (only) by heading level
23.4. A Moderate Conversion (Usually)
-
If documents are cleanly structured
-
Rules are easy to devise
-
Conversion is simple
-
-
Knowledge of structure now a serious requirement
-
Still no subject knowledge required
-
Still no XML or parsing knowledge required
-
Special care must be given for your result to be W3C-conformant HTML (our example isn't)
23.5. Automation and Tools
-
Process can be automated
-
Automation is more complex
-
Exception file may need to be built
-
Results must be checked by both person and parser
-
-
Easier with tools that understand structure (XSLT, OmniMark, Balise, et al.)
23.6. What are the Complicating Factors?
-
Random end tags break hierarchy
-
Element overlap may still need human intervention
-
There is no requirement that existing structure make sense
-
If documents are a mess
-
Automatic conversion breaks down
-
Human judgement will be needed
-
23.7. Processing: Structured Data is More Useful
-
Structural information has been added
-
Because data now consists of discrete, hierarchically related chunks
-
Data interchange improved
-
Data reuse easier
-
-
Only minor improvements in retrieval precision
-
No enriched linking or content
23.8. Gains From Explicit Structure
All the gains from well-formed HTML-XML, plus hierarchy can be used:
-
To produce navigation aids, like automatic ToC
-
To increase searching precision, a little
-
For intelligent reuse in electronic cut and paste
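Once sections are explicit containers, a table of contents falls out of a short recursive walk. The sketch below assumes sections are nested `<div>` elements whose heading is an `h1`, `h2`, or `h3` child; the sample document is invented, reusing section names from the example above.

```python
import xml.etree.ElementTree as ET

HEADINGS = ("h1", "h2", "h3")


def table_of_contents(element, depth=0):
    """Return heading texts of nested <div> sections, indented by nesting depth."""
    entries = []
    for div in element.findall("div"):
        heading = next((c for c in div if c.tag in HEADINGS), None)
        if heading is not None:
            entries.append("  " * depth + "".join(heading.itertext()).strip())
        entries.extend(table_of_contents(div, depth + 1))
    return entries


DOC = (
    "<body><div><h1>Organization</h1>"
    "<div><h2>Primary Names</h2><p>x</p></div>"
    "<div><h2>Habitat</h2><p>y</p></div>"
    "</div></body>"
)
toc = table_of_contents(ET.fromstring(DOC))
```

Note that the indentation comes from the nesting of the `<div>` elements, not from the heading level in the tag name.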
24. Case 3: HTML to XHTML
XHTML is
-
First major change to HTML since 1997
-
A reformulation of HTML 4.0 in XML
-
Designed for internationalization, accessibility, and user-agent and alternate platform access
-
Modularized for easy reuse, use in other DTDs
24.1. XHTML provides
-
XML DTDs for HTML 4.0
-
(Did you know that HTML 4.0 was defined with three SGML DTDs? Most people and applications used the tags from HTML 4.0 but ignored the DTDs. Most current HTML isn't SGML, but it could have been)
-
-
The ability to create HTML that is also Valid XML (using a public DTD)
-
The advantages of Case 2, plus expectation that others will know how to process your HTML.
XHTML specification available at: http://www.w3.org/TR/xhtml1
24.2. A Caveat
-
XHTML means code is valid to a model (DTD)
-
This assures not only that code is well-formed, but also that tags, attributes are as described in formal definition of XHTML
-
This does not assure that code is “good”
-
Structural information (cf. Case 2) may still be missing
-
“Tag abuse” still an option
-
24.3. What Conversion Will Mean
-
Same as HTML to Well-formed HTML-XML, plus adjustment as needed to match XHTML definition
-
May require adding wrapper tags or re-tagging if illegal tags used
-
Remove embedded stylesheet and scripts
-
preferably to an XSL stylesheet associated with the document
-
or (CSS or Javascript) to external file
-
Can be left in place if well-formed
-
-
Comments no longer a safe hiding place; XML processors may strip
-
24.4. Sorts of Questions that Need to be Asked
-
Which of the XHTML DTDs to use:
-
XHTML-1.0-Transitional
-
XHTML-1.0-Strict
-
XHTML-1.0-Frameset
-
-
Make the XHTML “Strictly Conforming”?
24.5. What's the Difference (Between Kinds of XHTML)?
| DTD | What it allows |
|---|---|
| XHTML-1.0-Transitional | Permits all the tags in HTML 4.0. |
| XHTML-1.0-Strict | Most tags to control formatting are not allowed. The idea is to use CSS (stylesheets) to control formatting instead. |
| XHTML-1.0-Frameset | Use for documents that use frames. |
24.6. Conforming to XHTML
-
Valid according to one of the XHTML DTDs
-
Root element
-
is <html>
-
designates XHTML namespace using xmlns attribute
-
-
DOCTYPE declaration
-
before root element
-
uses Formal Public Identifier provided by W3C
-
May use local System Identifier
-
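Some of these requirements can be checked mechanically even without a validating parser. The sketch below uses only Python's standard library, so it does not validate against the DTD; it only checks the root element, the namespace, and the presence of a DOCTYPE with one of the XHTML 1.0 public identifiers. The function name and messages are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
DOCTYPE = re.compile(
    rb'<!DOCTYPE\s+html\s+PUBLIC\s+'
    rb'"-//W3C//DTD XHTML 1\.0 (Strict|Transitional|Frameset)//EN"'
)


def xhtml_surface_checks(document: bytes):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    root = ET.fromstring(document)  # raises ParseError if not well-formed
    # ElementTree reports a namespaced element name as "{namespace}local".
    if root.tag != f"{{{XHTML_NS}}}html":
        problems.append("root element is not <html> in the XHTML namespace")
    prolog = document.split(b"<html", 1)[0]  # everything before the root element
    if not DOCTYPE.search(prolog):
        problems.append("no XHTML 1.0 DOCTYPE before the root element")
    return problems
```

Passing these checks is necessary but not sufficient: full conformance still requires validating the document against the chosen DTD with a validating parser.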
24.7. A Bit of Trivia
In XHTML, all tags are lower case
-
<h1> not <H1>
-
<img src="PrettyPic.jpg"/> not <IMG SRC="PrettyPic.jpg">
24.8. Example of XHTML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Sample XHTML Document</title>
  </head>
  <body>
    <p>This is a small XHTML document.</p>
  </body>
</html>
24.9. An Example of Transitional XHTML
-
What are the differences?
-
DOCTYPE declaration (line 1)
-
Namespace declarations (in html start tag)
-
All tags in lower case
-
Required elements supplied (head element)
-
-
This is enough to be valid!
24.10. The Same Example, Only Strict
-
What are the differences?
-
Some (formatting) tags not allowed
-
font elements
-
attributes on body
-
-
-
Use CSS instead, e.g.
-
<span style="...">...</span>
-
style element (for body element formatting)
-
-
Warning: current (generation 4) browsers don't support CSS completely or consistently
24.11. Problem: Structure in XHTML
-
XHTML doesn't allow us to make up new tags
24.12. HTML Tidy: An Off-the-shelf Tool
-
From W3C (Dave Raggett)
-
Free!
-
Download from http://www.w3.org/People/Raggett/tidy/
-
Will clean up most basic problems
-
By default, makes HTML 4.0 (not XML)
-
Can make XHTML
-
24.13. Results of Running HTML Tidy
-
Missing head, title elements were supplied, but title has no content
-
Bad </p> construction replaced not with wrapping p elements, but with multiple <br />
-
Pretty good for a dumb machine!
24.14. A Moderate Conversion (Usually)
-
If documents are cleanly structured
-
Rules are easy to devise
-
Conversion is simple
-
-
XML and parsing knowledge required
-
Tagged document must be valid according to an XHTML DTD
-
Validation errors must be fixed; error messages from most tools confusing to beginners
-
24.15. Automation and Tools
-
Process can be automated
-
Automation is still more complex
-
Results must be checked by both person and parser: tools that make valid XML may add a lot of undesirable tagging to create a valid document; often should be cleaned up by hand
-
-
Easier with tools that understand structure (XSLT, OmniMark, Balise, et al.)
24.16. What are the Complicating Factors?
-
Selection of DTD
-
Addition of XML infrastructure to document
-
Validation and correction require understanding parser messages, XHTML DTD.
24.17. Gains from Valid XHTML
All the gains from Well-formed HTML-XML, plus
-
General purpose software increasingly available to manipulate XHTML
-
Data interchange improved
-
Data reuse easier
Can still use XHTML and reflect (some) structure
25. Interlude: Some Real-World Problems
-
Tables for formatting
-
Navigation information in pages
-
Entities for special characters
25.1. HTML Tables (and Their Discontents)
-
The problem: HTML tables often used to control layout
-
Specific case of more general problem: mixing format with content
-
-
This example has a table nested three deep. In real life they can be much more complex.
-
25.1.1. Dealing With Tables Used for Layout
Two choices:
-
Mix presentational code (e.g. tables, font elements) with descriptive code, or
-
Separate problem into layers:
-
Source document contains only clean descriptive code: “content”
-
Presentation document created from source
-
25.1.1.1. Choice One: Mixing Descriptive and Presentational Code
-
Advantages:
-
“Quick and easy” (at least until there's a lot of it)
-
-
Disadvantages:
-
Hard to maintain
-
Hard to validate; DTD (if any) is a mess!
-
Pages get large, unwieldy, obtuse
-
25.1.1.2. Choice Two: Separating Content from Format
-
Advantages:
-
Data no longer locked into one presentation/platform
-
Content and presentation can be designed/maintained separately, so system is more scalable and long-lived
-
-
Disadvantages:
-
Usually cannot use off-the-shelf tag set (since it must describe your content)
-
Requires validation outside browser (via custom DTD or schema)
-
Requires infrastructure (application) to convert from source (one format) to presentation (another format)
-
Typically a stylesheet application
-
How do we do this? Stay tuned....

