XML 2002

TagSoup: A SAX Parser for Ugly, Nasty HTML

Abstract

For the last year I have been working on a new parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup is now ready for its first public Open Source release under the Academic Free License, a cleaned-up and patent-safe BSD-style license which allows proprietary re-use.

TagSoup is a parser, not a whole application; it isn't intended to permanently clean up bad HTML, as HTML Tidy does, only to parse it on the fly. Therefore, it does not convert presentation HTML to CSS or anything similar. It does guarantee well-structured results: tags will wind up properly nested, default attributes will appear appropriately, and so on.

The semantics of TagSoup are, as far as practical, those of actual HTML browsers. In particular, never, never will it throw any sort of syntax error: the TagSoup motto is "Just Keep On Truckin'". But there's much, much more. For example, if the first tag is LI, it will supply the application with enclosing HTML, BODY, and UL tags. Why UL? Because that's what browsers assume in this situation. For the same reason, overlapping tags are correctly restarted whenever possible: text like:

This is <B>bold, <I>bold italic, </b>italic, </i>normal text

gets correctly rewritten as:

This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text
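
The restart rule can be illustrated with a toy re-nesting routine. This is only a sketch of the idea, not TagSoup's actual code: the class RestartSketch and the method renest are invented for illustration, it handles only attribute-free tags, and it ignores the schema knowledge (such as supplying HTML, BODY, and UL) that the real parser applies.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy, not part of TagSoup: re-nests overlapping tags.
class RestartSketch {
    // Matches a bare start or end tag with no attributes, e.g. <B> or </i>.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>");

    static String renest(String input) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();    // open elements, innermost first
        Matcher m = TAG.matcher(input);
        int pos = 0;
        while (m.find()) {
            out.append(input, pos, m.start());      // copy the text before this tag
            pos = m.end();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {             // start tag: open the element
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {       // end tag for an open element
                List<String> restart = new ArrayList<>();
                // Close everything nested inside the element being ended,
                // remembering those elements so they can be restarted.
                while (!open.peek().equals(name)) {
                    String inner = open.pop();
                    out.append("</").append(inner).append('>');
                    restart.add(0, inner);          // keep outermost-first order
                }
                open.pop();
                out.append("</").append(name).append('>');
                for (String r : restart) {          // restart the interrupted elements
                    open.push(r);
                    out.append('<').append(r).append('>');
                }
            }
            // An end tag with no matching open element is silently dropped.
        }
        out.append(input.substring(pos));
        while (!open.isEmpty()) {                   // close whatever is still open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            renest("This is <B>bold, <I>bold italic, </b>italic, </i>normal text"));
    }
}
```

Run on the input above, the sketch prints the rewritten line shown above. It never reports an error: stray end tags are dropped and unclosed elements are closed at the end of input.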

By intention, TagSoup is small and fast. After release, I will spend some time making it faster if it turns out to be too slow. It does not depend on the existence of any framework other than SAX, and should be able to work with any framework that can accept SAX parsers.
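
Because the only dependency is SAX, application code can be written against the standard org.xml.sax interfaces and handed whichever parser is available. The following is a minimal sketch under that assumption. The names OutlineDemo, ElementCollector, and elementNames are invented for illustration, and since this abstract does not give TagSoup's parser class name, the JDK's own XML parser stands in as the XMLReader, fed well-formed input.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: the handler depends only on SAX, not on any parser.
class OutlineDemo {
    /** Records the name of every element the parser reports, in document order. */
    static class ElementCollector extends DefaultHandler {
        final List<String> names = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            names.add(qName);
        }
    }

    /** Parses the markup with the given reader and returns the element names seen. */
    static List<String> elementNames(XMLReader reader, String markup)
            throws IOException, SAXException {
        ElementCollector collector = new ElementCollector();
        reader.setContentHandler(collector);
        reader.parse(new InputSource(new StringReader(markup)));
        return collector.names;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in parser: any XMLReader, including a TagSoup one, could be passed instead.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String markup = "<html><body><ul><li>item</li></ul></body></html>";
        System.out.println(elementNames(reader, markup)); // prints [html, body, ul, li]
    }
}
```

The well-formed input here mirrors the structure the abstract says TagSoup supplies when a document starts with a bare LI. With an HTML-tolerant XMLReader in place of the stand-in, the handler itself would not need to change.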

If your tag soup is not HTML, TagSoup can use a custom schema (written in Tag Soup Schema Language, a subset of RELAX NG compact syntax) instead of using the default HTML schema. You can also replace the low-level HTML scanner with one based on Sean McGrath's PYX format (very close to James Clark's ESIS format). You can also supply an AutoDetector that peeks at the incoming byte stream and guesses a character encoding for it. (Otherwise, the platform default is used. If someone supplies a good AutoDetector I may package it with later releases.)
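
As an illustration of what "peeking at the byte stream" might involve, here is a small sketch that checks for a byte-order mark and otherwise falls back to a caller-supplied charset. It is an assumption-laden example, not TagSoup's AutoDetector: the abstract does not give that interface's signature, so the class BomSniffer and its methods open and decode are hypothetical, and only the UTF-8 and UTF-16 marks are recognized.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not TagSoup's AutoDetector interface.
class BomSniffer {
    /** Wraps the stream in a Reader, picking the charset from a leading BOM if present. */
    static Reader open(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {                  // peek at up to three bytes
            int r = pb.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }
        Charset charset = fallback;
        int bomLength = 0;
        int b0 = head[0] & 0xFF, b1 = head[1] & 0xFF, b2 = head[2] & 0xFF;
        if (n >= 3 && b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
            charset = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && b0 == 0xFE && b1 == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n >= 2 && b0 == 0xFF && b1 == 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            bomLength = 2;
        }
        // Push back everything after the BOM so the reader sees the full content.
        pb.unread(head, bomLength, n - bomLength);
        return new InputStreamReader(pb, charset);
    }

    /** Convenience for demos: decodes a whole byte array via open(). */
    static String decode(byte[] bytes, Charset fallback) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader r = open(new ByteArrayInputStream(bytes), fallback)) {
            for (int c = r.read(); c != -1; c = r.read()) {
                text.append((char) c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        // The fallback mirrors the "platform default" behaviour described above.
        System.out.println(decode(withBom, Charset.defaultCharset())); // prints hi
    }
}
```

A real detector would likely also look at META declarations or byte statistics; a byte-order mark is simply the cheapest signal to check first.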

The presentation will focus on practical results: you will learn how to use TagSoup in its simple HTML mode, and get an idea of which features can be customized and how.

Keywords

HTML, SAX.

1. Vendor Paper

Since this was a vendor presentation, no paper was prepared for the proceedings.

Biography

John Cowan is the senior Internet systems developer for Reuters Health, a very small subsidiary of Reuters, a wire service and financial news company. He was responsible for Reuters Health's current news publication system, which distributes about 100 articles per day to about 200 wholesale news customers, mostly in XML. (Yes, so most of them want HTML and get XHTML. Deal.)

John is a member or de-facto member of the W3C XML Core WG, the W3C XML Linking WG, the OASIS RELAX NG TC, the OASIS Geography and Language Published Subjects TC, and the closed Unicore mailing list of the Unicode Technical Committee. He also hangs out on far too many other technical mailing lists, masquerading as the expert on A for the B mailing list and the expert on B for the A mailing list. His friends say that he knows at least something about almost everything; his enemies, that he knows far too much about far too much.

John presented a tutorial on Unicode at XML 2001, and was co-chair of the Schema Comparisons town hall meeting at the same conference. At that time, many in the XML community had heard of him, but only about five people had seen him. This anomaly is now rectified.

In his copious spare time, John constructed and maintains the Itsy Bitsy Teeny Weeny Simple Hypertext DTD, a small subset of XHTML Basic suitable for adding rich text to otherwise bald and unconvincing document types (now available in RELAX NG, too). He is interested in languages -- natural, constructed, and computer -- and is the author of _The Complete Lojban Language_, ISBN 0-9660283-0-9. He is also the current maintainer of FIGlet, the world's only Unicode rendering engine that uses ASCII characters instead of pixels.