Abstract
While the markup community is fond of its embedded markup, the very practices at its foundation are considered dangerously limiting by some, notably hypertext pioneer Ted Nelson. This presentation explores the possibilities of using markup-based tools and techniques to achieve the separation that Nelson argues is necessary, building a bridge between embedded markup and the more complete separation of metadata from content.
"I want to discuss what I consider one of the worst mistakes of the current software world, embedded markup, which is, regrettably, the heart of such current standards as SGML and HTML... There is no one reason this approach is wrong; I believe it is wrong in almost every respect. But I must be honest and acknowledge my objection as a serious paradigm conflict, or (if you will) religious conflict.... SGML's advocates expect, or wish to enforce, a universal linear representation of hierarchical content. I believe that if this is a factual claim of appropriateness, it is a delusion; if it is an enforcement, it is an intolerable imposition which drastically curtails the representation of of non-hierarchical media structure." - Ted Nelson, "Embedded Markup Considered Harmful" [Nelson97]
While markup practitioners are usually so comfortable in their work that it's not clear why they might want to consider alternative approaches, there are still some outsiders who don't believe that markup is as good as its advocates often claim. Ted Nelson, a pioneer of hypertext development, is one of the unreconstructed advocates for the separation of content (text) from descriptive structure and metadata. His classic screed on why embedding markup causes problems appeared in the Winter 1997 World Wide Web Journal, and has been making me think ever since.
Just to explore whether markup was capable of supporting the kinds of things Nelson wanted to do, I wrote a set of processors which separate the content from the markup in one phase, and which recombine the content and the markup in another phase. Keeping markup and content connected is much more difficult when they are separated, because every reference between them has to stay correct as the content changes. There are some substantial benefits well worth considering, but they're not necessarily the benefits developers want today.
While markup is capable of supporting out-of-line markup, doing so requires a far more sophisticated framework. Markup is pretty much hackery compared to what Nelson wants - perhaps good hackery. Looking over this separation led me to think about how exactly we apply markup to information. While Nelson's vision may just be too difficult for most common use across loosely-connected networks, it has a lot to tell us about how we interact with information, even as we violate his most dearly-held principles.
Nelson has three primary concerns about markup, one of which only makes sense outside of a markup framework. The other two are much harder to deal with, and have driven much of the complexity in markup-based hypertext systems.
Nelson's first objection is that markup interferes with editing, complaining that "Tags throw off the counts." There are plenty of markup-aware editors available, though Nelson may feel it easier to write editors which keep separate track of metadata than to manage the mixture of metadata and content. Alternatively, he may just object to changeable markup being used in documents which enter his count-based systems.
Nelson also sees major problems with markup and transpublishing in sophisticated hypertext systems. Partly an intellectual property issue, partly a set of difficulties brought on by the complex mechanics of including content outside of the including author's control, this set of issues should be familiar to anyone who's thought hard about external parsed entities or XInclude. Nelson's solutions to these - parallel markup and tag override - both foreshadow possibilities for solving the third problem.
Nelson also claims that SGML (and XML) markup creates another enormous problem, enforcing a single structure onto information. Nelson hates the insistence on clean hierarchical structures - "When SGML fanciers say 'structure,' they mean structure where everything is contained and sequential, with no overlap, no sharing of material in two places, no relations uncontained." He would prefer that we ask "What is the real structure of a thing or a document?" and notes that "Enforcing sequence and hierarchy simply restricts the possibilities." This is the toughest of Nelson's criticisms to address in markup.
Nelson proposes that separating a content layer, a structure layer, and a "special-effects-and-primping layer" makes it possible to have the information cake and eat it too. XML, for the most part, mixes at least the first two of these layers.
The foundation of using markup this way is a parallel markup approach, separating markup and content. Nelson writes: "I believe that sequential formatted objects are best represented by a format in which the text and the markup are treated as separate parallel members, presumably (but not necessarily) in different files. The tags can be like those of SGML, but they are not embedded in the text itself."
Once the markup is separated from the content, the markup needs a means of referencing the content. Using character ranges, we can reference the text from a separate file. Ool uses (start, end) but other alternatives include (start, length) or just (offset from the previous) - Nelson recommends the last of these.
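The three addressing schemes carry the same information and can be converted mechanically. The following is a minimal sketch in Python, not part of Ool; the function names are invented for illustration, and "offset from the previous" is read here as the gap between the end of the previous range and the start of the next, which is an assumption about Nelson's intent. The sample ranges are the title, heading, and first paragraph from the example later in this paper.

```python
from typing import List, Tuple

Range = Tuple[int, int]

def to_start_length(ranges: List[Range]) -> List[Range]:
    """Convert (start, end) pairs to (start, length) pairs."""
    return [(start, end - start) for start, end in ranges]

def to_relative(ranges: List[Range]) -> List[Range]:
    """Convert (start, end) pairs to (gap since previous end, length) pairs.

    Assumption: 'offset from the previous' is measured from the end of the
    preceding range, with the first range measured from offset 0.
    """
    relative = []
    previous_end = 0
    for start, end in ranges:
        relative.append((start - previous_end, end - start))
        previous_end = end
    return relative

def from_relative(relative: List[Range]) -> List[Range]:
    """Rebuild absolute (start, end) pairs from (gap, length) pairs."""
    ranges = []
    position = 0
    for gap, length in relative:
        start = position + gap
        end = start + length
        ranges.append((start, end))
        position = end
    return ranges

# Ranges taken from the example markup file: TITLE, H1, first P.
ranges = [(2, 21), (25, 49), (50, 367)]

if __name__ == "__main__":
    print(to_start_length(ranges))
    print(to_relative(ranges))
    print(from_relative(to_relative(ranges)) == ranges)
```

One reason relative counts are attractive: an edit early in the text changes a single gap or length, while absolute (start, end) pairs all have to be renumbered after the edit point.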
In this system, the markup itself must remain hierarchical, but two forms of overlap are possible. Multiple markup documents may of course provide completely different markup for the text, or a single markup document may point repeatedly to the same or overlapping pieces of text. (The second approach is something XPointer attempts to do in a more traditional markup context.)
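To make the two forms of overlap concrete, here is a small Python sketch. It is purely illustrative: the layer names, labels, and helper functions are hypothetical, and plain tuples stand in for separate markup documents. Each layer is hierarchical on its own terms, yet ranges in one layer overlap each other and also cover text described by the other layer.

```python
from typing import List, Optional, Tuple

Span = Tuple[str, int, int]  # (label, start, end), end exclusive

content = "Markup is a process and a practice, an intervention."

# Two independent "markup documents" over the same text (hypothetical labels).
structure_layer: List[Span] = [("sentence", 0, 52)]
reading_layer: List[Span] = [("claim", 0, 35), ("gloss", 24, 52)]

def text_of(span: Span) -> str:
    """Return the text a span points at."""
    _, start, end = span
    return content[start:end]

def overlap(a: Span, b: Span) -> Optional[Tuple[int, int]]:
    """Return the shared (start, end) of two spans, or None if disjoint."""
    start = max(a[1], b[1])
    end = min(a[2], b[2])
    return (start, end) if start < end else None

def labels_at(offset: int, *layers: List[Span]) -> List[str]:
    """List every label, from any layer, whose range covers the offset."""
    return [label
            for layer in layers
            for (label, start, end) in layer
            if start <= offset < end]

if __name__ == "__main__":
    shared = overlap(reading_layer[0], reading_layer[1])
    print(shared, repr(content[shared[0]:shared[1]]))
    print(labels_at(30, structure_layer, reading_layer))
```

Neither "claim" nor "gloss" contains the other, which is exactly the situation embedded markup cannot express without workarounds.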
This approach also has the advantage that it's possible to go back and forth from XML. In Ool, one process separates text and markup (with an optional clean-up process), while another process permits the recombination of text and markup for use in more traditional XML contexts. Nelson might not approve of this, but it certainly simplifies interoperability with existing markup systems.
The Ool implementation is a set of Java SAX2 filters. SAX2 conveniently reports content using a separate method from markup, though it takes a bit of juggling to make everything come out smoothly. One set of filters separates content from markup, creating a content file and a markup file containing references to that content.
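The separation step can be sketched roughly as follows. This is not the Ool source: it uses Python's standard `xml.sax` module in place of the Java SAX2 filters, the names `Separator` and `separate` are invented for this example, and it ignores namespaces, comments, and processing instructions. The "juggling" shows up as buffering, since a SAX parser may report one run of text in several `characters` calls.

```python
import xml.sax
from xml.sax.saxutils import quoteattr

OOL_NS = "http://simonstl.com/ns/ool/"

class Separator(xml.sax.ContentHandler):
    """Collect text into one stream and markup, with references, into another."""

    def __init__(self):
        super().__init__()
        self.text = []      # pieces of the content file
        self.markup = []    # pieces of the markup file
        self.position = 0   # running character offset into the content
        self.pending = []   # text chunks not yet written out

    def _flush(self):
        # Emit one ool:text placeholder for all text seen since the last tag.
        if not self.pending:
            return
        chunk = "".join(self.pending)
        self.pending = []
        start, end = self.position, self.position + len(chunk)
        self.text.append(chunk)
        self.markup.append(
            '<ool:text ool:start="%d" ool:end="%d" xmlns:ool="%s"></ool:text>'
            % (start, end, OOL_NS))
        self.position = end

    def startElement(self, name, attrs):
        self._flush()
        attr_text = "".join(
            " %s=%s" % (key, quoteattr(value)) for key, value in attrs.items())
        self.markup.append("<%s%s>" % (name, attr_text))

    def endElement(self, name):
        self._flush()
        self.markup.append("</%s>" % name)

    def characters(self, content):
        self.pending.append(content)

def separate(xml_text):
    """Return (content, markup) for a well-formed XML string."""
    handler = Separator()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return "".join(handler.text), "".join(handler.markup)

if __name__ == "__main__":
    sample = "<HTML>\n<HEAD>\n<TITLE>simonstl.com - News</TITLE>\n</HEAD>\n</HTML>"
    content, markup = separate(sample)
    print(repr(content))
    print(markup)
```

Attributes stay in the markup stream in this sketch, and only element content moves to the text stream.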
For an example, we'll take a small chunk of well-formed HTML (stuff Ted Nelson likely wouldn't be fond of) and run it through the processor and back again. First, the HTML:
<HTML>
<HEAD>
<TITLE>simonstl.com - News</TITLE>
<LINK REL="stylesheet" HREF="simonstl.css" TYPE="text/css" />
</HEAD>
<BODY BGCOLOR="#FFFFFF" LINK="#42426fF" VLINK="#000000">
<H1>Welcome to simonstl.com!</H1>
<P>This site hosts information on the books and other projects I'm working on. Most of it focuses on XML, but my work in general networking and Web development is also featured here. This remains a personal site, reflecting my work as an author and XML developer, rather than my current work for O'Reilly &amp; Associates.</P>
<H2>News</H2>
<P>March 12, 2002 - The slides for <a href="/articles/lexical/">Re-valuing the Lexical in XML</a>, describing my work with Regular Fragmentations and MOE, are now available, as are <a href="/articles/lexical/regfragsamples.zip">sample rule and result files</a> for Regular Fragmentations.</P>
<P>February 5, 2002 - I've released the first alpha of <a href="projects/ents/">Ents</a>, a Java library for working with XML character references and entities.</P>
<P><A HREF="oldnews.html">Older News</A></P>
<P ALIGN="CENTER">Copyright 2000 <A HREF="mailto:simonstl@simonstl.com">Simon St.Laurent</A></P></BODY></HTML>
Running this through the Ool filter produces two files. The first contains the (element) content of the document:
simonstl.com - News
Welcome to simonstl.com!
This site hosts information on the books and other projects I'm working on. Most of it focuses on XML, but my work in general networking and Web development is also featured here. This remains a personal site, reflecting my work as an author and XML developer, rather than my current work for O'Reilly & Associates.
News
March 12, 2002 - The slides for Re-valuing the Lexical in XML, describing my work with Regular Fragmentations and MOE, are now available, as are sample rule and result files for Regular Fragmentations.
February 5, 2002 - I've released the first alpha of Ents, a Java library for working with XML character references and entities.
Older News
Copyright 2000 Simon St.Laurent
The second result file is just markup, with references to the text file:
<?xml version="1.0" standalone="yes"?>
<HTML>
  <ool:text ool:start="0" ool:end="1" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
  <HEAD>
    <ool:text ool:start="1" ool:end="2" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
    <TITLE>
      <ool:text ool:start="2" ool:end="21" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
    </TITLE>
    <ool:text ool:start="21" ool:end="22" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
    <LINK REL="stylesheet" HREF="simonstl.css" TYPE="text/css"></LINK>
    <ool:text ool:start="22" ool:end="23" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
  </HEAD>
  <ool:text ool:start="23" ool:end="24" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
  <BODY BGCOLOR="#FFFFFF" LINK="#42426fF" VLINK="#000000">
    <ool:text ool:start="24" ool:end="25" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
    <H1>
      <ool:text ool:start="25" ool:end="49" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
    </H1>
    <ool:text ool:start="49" ool:end="50" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
    <P>
      <ool:text ool:start="50" ool:end="367" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
    </P>
    <ool:text ool:start="367" ool:end="368" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
    <H2>
      <ool:text ool:start="368" ool:end="372" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
    </H2>
    <ool:text ool:start="372" ool:end="375" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
    <P>
      <ool:text ool:start="375" ool:end="407" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
      <a href="/articles/lexical/">
        <ool:text ool:start="407" ool:end="436" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
      </a>
      <ool:text ool:start="436" ool:end="520" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
      <a href="/articles/lexical/regfragsamples.zip">
        <ool:text ool:start="520" ool:end="548" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
      </a>
      <ool:text ool:start="548" ool:end="576" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
    </P>
    <ool:text ool:start="576" ool:end="577" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
    <P>
      <ool:text ool:start="577" ool:end="629" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
      <a href="projects/ents/">
        <ool:text ool:start="629" ool:end="633" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
      </a>
      <ool:text ool:start="633" ool:end="705" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
    </P>
    <ool:text ool:start="705" ool:end="706" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
    <ool:text ool:start="1087" ool:end="1089" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
    <P>
      <A HREF="oldnews.html">
        <ool:text ool:start="1089" ool:end="1099" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
      </A>
    </P>
    <ool:text ool:start="1099" ool:end="1100" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
    <P ALIGN="CENTER">
      <ool:text ool:start="1100" ool:end="1115" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
      <A HREF="mailto:simonstl@simonstl.com">
        <ool:text ool:start="1115" ool:end="1131" xmlns:ool="http://simonstl.com/ns/ool/"></ool:text>
      </A>
    </P>
  </BODY>
</HTML>
This can be simplified a lot by putting the ool attributes in their containing elements rather than in a separate ool:text element, but making that work cleanly with mixed content is tricky. (A future version of Ool will address this.)
The OolRemesh filter lets you recombine the two pieces to re-create the original document, or to create a unified version of whatever markup and text combination you've created. While the Oolified markup is kind of overwhelming (especially as the default approach preserves all whitespace), it's reasonable for use in smaller doses or in combination with other tools.
With this simple toolset, it takes minimal effort to create markup which refers to the same part of a document, or even to overlapping parts. The sameness of those references is generally lost on 'remeshing' to regular XML, but the result may be adequate for a lot of tasks. Ool also provides a handy text-only alternative to XInclude and entities. While Ool isn't designed to be a standard by any means, I've already found it useful for incorporating literal XML documents into existing XML documents without the mess of CDATA or entity references. It makes no claim to XInclude's universality, and its processing model is clear: it only ever works if you're doing Ool processing.
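Recombination is the simpler half. The sketch below shows the idea in Python with the standard `xml.sax` module; it is not the OolRemesh source, the names `Remesher` and `remesh` are invented here, and it matches the literal prefix `ool:` instead of doing real namespace processing, which is a shortcut a production filter should not take. Whitespace in the markup file is treated as layout and dropped.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr

class Remesher(xml.sax.ContentHandler):
    """Replace each ool:text placeholder with the text range it points at."""

    def __init__(self, content):
        super().__init__()
        self.content = content
        self.out = []

    def startElement(self, name, attrs):
        if name == "ool:text":
            start = int(attrs["ool:start"])
            end = int(attrs["ool:end"])
            # Re-escape, since the content file holds plain text.
            self.out.append(escape(self.content[start:end]))
        else:
            attr_text = "".join(
                " %s=%s" % (key, quoteattr(value))
                for key, value in attrs.items())
            self.out.append("<%s%s>" % (name, attr_text))

    def endElement(self, name):
        if name != "ool:text":
            self.out.append("</%s>" % name)

    # characters() is deliberately not overridden: any text in the markup
    # file itself (indentation, line breaks) is ignored.

def remesh(markup, content):
    """Return conventional XML built from a markup file and a content file."""
    handler = Remesher(content)
    xml.sax.parseString(markup.encode("utf-8"), handler)
    return "".join(handler.out)

if __name__ == "__main__":
    ns = 'xmlns:ool="http://simonstl.com/ns/ool/"'
    markup = (
        "<P>\n"
        '  <b><ool:text ool:start="0" ool:end="4" %s></ool:text></b>\n'
        '  <ool:text ool:start="4" ool:end="10" %s></ool:text>\n'
        "</P>" % (ns, ns))
    print(remesh(markup, "AT&T rocks"))
```

Two placeholders pointing at the same range simply produce the text twice, which is why the sameness of references does not survive the trip back to ordinary XML.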
On the other hand, parallel markup is still difficult. Separating markup from content is fine, if and only if the content itself is actually stable. Any changes to the content, of course, mess up the counts. There may be ways of avoiding this through additional software or more robust pointer techniques, but the largest single problem is that changes to the source document will produce unpredictable and likely unpleasant changes in the remeshed version. Keeping this straight requires a lot more infrastructure, and likely a shift to relative counts. Also, while this approach can reference multiple source documents, it can't keep track of things like multiple possibilities (in translation, for example) in a single source document. It may improve things a bit over standard embedded markup, but not that much.
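A few lines of Python show how quickly the counts go wrong, and what the "additional software" would have to do at a minimum. This is an illustrative sketch only: `shift_ranges` is a hypothetical helper, it handles insertions but not deletions, and it assumes the tool knows exactly where the edit happened, which is the hard part in practice.

```python
from typing import List, Tuple

Range = Tuple[int, int]

def shift_ranges(ranges: List[Range], at: int, delta: int) -> List[Range]:
    """Adjust (start, end) ranges after `delta` characters are inserted at `at`.

    Ranges entirely before the insertion are untouched, ranges at or after it
    move right, and a range that strictly contains the insertion point grows.
    """
    adjusted = []
    for start, end in ranges:
        if start >= at:
            start += delta
        if end > at:
            end += delta
        adjusted.append((start, end))
    return adjusted

original = "Welcome to simonstl.com!"
ranges = [(0, 7), (11, 23)]   # "Welcome" and "simonstl.com"

# Someone edits the content file without touching the markup file.
insertion, at = "back ", 8
edited = original[:at] + insertion + original[at:]

stale = [edited[start:end] for start, end in ranges]
repaired = [edited[start:end]
            for start, end in shift_ranges(ranges, at, len(insertion))]

if __name__ == "__main__":
    print(stale)      # the second range now points at the wrong characters
    print(repaired)
```

With relative counts, the same edit would change one gap value and leave every later reference alone, which is part of the appeal of Nelson's recommendation.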
Even given that stability, attributes and mixed content are tricky for Ool. Attributes are impossible to represent this way without a complex set of conventions, and mixed content produces some rather verbose results. (Ool requires element placeholders for text in mixed content, not just attributes.)
As long as you don't put too much stress on Ool, and meet some strict conditions, Ool may work for some projects.
Ool is not going to take the markup world by storm, and neither is out-of-line (parallel) markup. Embedded markup is clearly not perfect, but a little time with parallel markup makes clear how many (useful) shortcuts it permits. Working with parallel markup has, however, forced me to take a much closer look at how and why markup works.
Ool is an opportunity to explore beyond the conventional limits, but a little use highlights the strength of those limits in conventional work. For some applications, notably annotation, Ool is downright exhilarating (the punch of XPointer without the complications), but in most cases it's a lot of extra work on problems that other approaches to breaking hierarchies might solve better. Layered Markup and Annotation Language (LMNL)[LMNL02] and Just In Time Trees (JITT)[Durusau02] offer alternate approaches to the same problem set which use embedded markup in ways that differ from the SGML hierarchy-view Nelson detests.
This separation and recombination also emphasizes the importance of markup as an active process. Instead of a document as an agglomeration of labeled bits, I now see a sequence of text annotated with extra information. Markup is a process and a practice, an intervention. Some aspects of that intervention become clearer as well; in particular, the poverty of attributes has meaning. From this perspective, attributes are limited because they are only containers for information about another container. Suddenly, the rules files I'd written using attributes for data look poorly thought out, and the use of attribute values where element names would be more appropriate makes me cringe.
Even though the explicitly hypertext aspects of this work most likely appeal to "document-oriented" users, there are implications for the data side as well. In many ways, Ool's approach parallels the use of pointers in programming, and has similar costs and benefits. Beyond that, this approach connects markup to other data-representation approaches, like CSV. Many data-transfer approaches use a header to specify a pattern, which is then followed by an arbitrary number of repeats. In some ways, the header is markup, applied through pattern repetition rather than ranges. Building from this view, markup can be seen as an explanation of data, rather than a mere container for holding it. Effectively, markup provides a description of what the information is about without intervening directly on the information - and someone else could provide an alternate description.
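The CSV comparison can be made concrete in a few lines of Python. This is only an illustration of the idea: the field names and the helper `describe` are invented, and the two records reuse the announcements from the sample page. The header labels each record by position, and a second party can supply a different header without touching the data.

```python
import csv
import io

raw = (
    "item,date\n"
    "Ents,2002-02-05\n"
    "Re-valuing the Lexical in XML,2002-03-12\n"
)

rows = list(csv.reader(io.StringIO(raw)))
header, records = rows[0], rows[1:]

def describe(header, record):
    """Apply a header to one record, pairing labels with values by position."""
    return dict(zip(header, record))

# The same data, explained by a different (hypothetical) description.
alternate_header = ["title", "announced"]

if __name__ == "__main__":
    for record in records:
        print(describe(header, record))
        print(describe(alternate_header, record))
```

The records never change; only the description applied to them does.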