XML Europe 2004

Refactoring XML

Abstract

The world of structured markup is characterized by a number of frequently recurring questions, ranging from the conceptually large (such as, 'what is the difference between documents and data?') to the technically detailed (such as, 'when should we model information with attributes instead of elements?'). By probing these fissures, it is possible to open up our understanding of where structured markup has got to, of where it might be going, and of how to use what we've got to better effect.

SGML was designed for use in a computing environment where the text-based console was the primary means of working. Many of its design features can be traced to the need for labour-saving when keying, and intelligibility when reading, markup on cramped computer screens. For XML, ease of technical implementation was a prime design consideration, yet the W3C has retrospectively made statements about XML's design objectives which are contradictory. And several design decisions taken in the broader XML family seem to run counter to some of XML's stated design precepts. Again, by probing these discontinuities between design and result (for example within XSLT) it is possible to see how XML breaks down as a usable language creation tool when faced with certain classes of problem.

It is recognised that current modelling mechanisms (DTD and schema languages) are not sufficient for modelling the full range of complexities in many classes of 'real world' documents, such as those containing overlapping structures and context-sensitive grammars. But emerging technologies like DSDL will address these issues better. Another example of this schism is the overkill associated with using XML to create serialisation formats, configuration files, and other day-to-day formats for use in software and data-centric projects.

This paper uses the topics discussed to define a spectrum of markup activities, and to place XML within this spectrum. Given this context for characterizing markup activities, it becomes possible to suggest a framework for seeing which language features are required and which are redundant when considering the use of a particular markup technology for a particular application. Ultimately, neither SGML nor XML is well aligned to large domains of problem type, and by refactoring XML (in effect, by re-profiling SGML) it is possible to envisage a larger family of markup metalanguages in which XML and SGML have better-defined places. The paper will conclude by outlining such a re-profiling and calling for participation in its standardisation.


Table of Contents

1. Markup Constituencies
2. Language or Data Format?
3. Schisms
4. The Attribute Question
5. Information Models
6. Refactoring XML
7. Possible Objections
7.1. Application Support
7.2. Attributes vs Elements Redux
7.3. Human Factors
7.4. Controlled Subsets
8. Standardisation
Bibliography
Biography

1. Markup Constituencies

The widespread adoption of XML since its launch in 1998[XML] has taken place within many areas of application of information technology. Yet there has been a steady perception within some communities using XML that it is in some respects ill-suited to addressing their particular requirements.

One such constituency uses structured markup for marking up the structures inherent in structurally complex documents (for example those in literature or legislation). SGML, XML's predecessor, was avowedly developed for publishing[1], and early document modelling efforts using SGML made heavy use of features of its information model that were not adopted by XML[2]. More recently, questions have been asked whether the tree structure of XML is in itself suited to document representation, and attempts made to promote information models better suited for documents with structures that are not entirely tree-like [LMNL] [Durusau].

In a completely different field, among technologists working in resource-constrained environments, a quite different set of considerations applies to XML. Here the problem is that conformant XML processing is too expensive. Developers of embedded systems using XML have typically processed their own ad hoc subsets of XML[3], and the W3C's Simple Object Access Protocol (SOAP) embodies what is effectively an XML subset by forbidding, among other things, processing instructions and document type declarations from appearing in its message construct[SOAP].

In general, adding full XML support[4] to an application incurs a cost which is trivial in the context of the size of today's desktop applications[5], but still significant for embedded development and for applications targeted at mobile devices. The run-time requirements of XML processing may also be resource-hungry (particularly when building in-memory tree representations of XML documents).

Furthermore, the increasing interdependence of technologies within the XML family[6] increasingly constrains the number of technology subsets of XML that can be effectively used.

2. Language or Data Format?

The two XML user communities described above can be seen as what might be considered the two extremes of the markup spectrum. At one extreme we have what have come to be termed the 'doc heads', at the other the 'data heads'[7]. A number of differences are often supposed to separate these two communities, and while in practice such polarisation might often be little more than a useful myth when we consider any one project or individual, I suggest that a key differentiator between the two interests can be found in the degree of importance they attribute to the lexical form of XML.

It has already been noted that SGML was apparently 'for' publishing[1]. Strikingly, XML - especially in its early days - appeared to be professing itself 'for' other things entirely[8]. Statements from the W3C such as 'XML is text, but isn't meant to be read'[10Points] suggest something other than a 'language' is being aimed at, for what language is not meant to be read?

By contrast, in SGML a great deal of attention is paid to allowing flexibility in the lexical form of the markup, precisely where the readability (and 'writeability') of markup is determined. In particular, features such as tag omission, tag minimisation and short reference mapping allow a great deal of flexibility at the lexical layer. This is not surprising, since SGML emerged from an era when human-computer interaction took place using keyboards to issue command sequences, and character-based displays (or even teletypes) to read a computer system's interpretation of them. Thus while today 'user interface' conjures images predominantly of mouse manipulations, clicks and graphical displays, formerly it would have conjured images of typing in, and reading, strings of characters.

It is worth reviewing the elegance with which SGML could be used to add flexibility in the lexical layer. For example, the following markup

    <TABLE>
    Quantity|Item Description|Price
    1|TFT Panel|599.00
    1|WAN card|18.95
    1|Wireless mouse|27.95
    </TABLE>

validated against this SGML DTD

    <!ELEMENT TABLE - - (ROW+)>
    <!ELEMENT ROW O O (CELL+)>
    <!ELEMENT CELL O O (#PCDATA)>
    <!ENTITY cellstag STARTTAG "CELL">
    <!ENTITY rowstag STARTTAG "ROW">
    <!SHORTREF tablemap "|" cellstag "&#RS;&#RE;" rowstag>
    <!USEMAP tablemap TABLE>

represents this (normalised) markup:

    <TABLE>
    <ROW>
    <CELL>Quantity</CELL>
    <CELL>Item Description</CELL>
    <CELL>Price</CELL>
    </ROW>
    <ROW>
    <CELL>1</CELL>
    <CELL>TFT Panel</CELL>
    <CELL>599.00</CELL>
    </ROW>
    <ROW>
    <CELL>1</CELL>
    <CELL>WAN card</CELL>
    <CELL>18.95</CELL>
    </ROW>
    <ROW>
    <CELL>1</CELL>
    <CELL>Wireless mouse</CELL>
    <CELL>27.95</CELL>
    </ROW>
    </TABLE>

Of course such a process is impossible in XML since it is based around some of the very features which XML excluded in its profiling of SGML, features which made development of conformant SGML processors non-trivial.
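Outside a conformant SGML parser, the effect of the short reference map can only be approximated by application code run before (or instead of) the XML parser. The following Python sketch is one such approximation, not a reimplementation of SHORTREF: the function name expand_table_shortrefs is hypothetical, and the mapping ("|" starts a CELL, each record starts a ROW) is hard-coded rather than read from a DTD.

```python
from xml.sax.saxutils import escape


def expand_table_shortrefs(source: str) -> str:
    """Expand pipe-delimited records inside <TABLE>...</TABLE> into ROW/CELL markup.

    Hypothetical helper: it hard-codes what the SHORTREF declarations
    express declaratively; it is not an SGML parser.
    """
    lines = [line.strip() for line in source.strip().splitlines()]
    if len(lines) < 2 or lines[0] != "<TABLE>" or lines[-1] != "</TABLE>":
        raise ValueError("expected a single <TABLE> ... </TABLE> block")

    out = ["<TABLE>"]
    for record in lines[1:-1]:
        if not record:
            continue  # skip empty records
        out.append("<ROW>")
        for cell in record.split("|"):
            # Escape markup-significant characters in the cell content.
            out.append(f"<CELL>{escape(cell)}</CELL>")
        out.append("</ROW>")
    out.append("</TABLE>")
    return "\n".join(out)


if __name__ == "__main__":
    terse = "<TABLE>\nQuantity|Price\n1|599.00\n</TABLE>"
    print(expand_table_shortrefs(terse))
```

The point of the sketch is the contrast: in SGML the mapping is declared once in the DTD and handled by any conformant parser, whereas here it lives in bespoke code that every consumer of the terse form would need to share.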

3. Schisms

The oddest of the 'design goals' for XML was that 'terseness in XML markup is of minimal importance'[XML]. This is odd because it is not a design goal - it seems more like a statement of belief, or a command. If taken to apply to applications of XML it might be taken to encourage the creation of vocabularies and grammars which would result in verbose documents.

Whether or not this was the intention, verbosity in marked-up documents is what XML sometimes seemed to give the world. However, it is interesting to note a pronounced disjunction between the stated belief that terseness is of minimal importance, and the actual practice of XML use, where human beings appear to attach some importance to terseness.

Take, for example, this fragment of XSLT 1.0:

    <xsl:choose>
      <xsl:when test="contains($string, '&#x9;')">
        <xsl:value-of select="substring-before($string, '&#x9;')"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$string"/>
      </xsl:otherwise>
    </xsl:choose>

Superficially, the code is verbose[9], especially when compared with the syntactic patterns of more orthodox programming languages, which might be similar to this Java- or C-based pseudocode:

    if( contains($string, '&#x9;') ) {
      x = substring-before($string, '&#x9;')
    } else {
      x = $string
    }

or even (more tersely still, and too tersely for some),

    x = contains($string, '&#x9;') ? substring-before($string, '&#x9;') : $string

to use the ternary operator form available to both C and Java programmers.

However, while it might be thought that this demonstrates that XML has indeed given us a language that 'isn't meant to be read', perhaps more interesting still are those features within the XSLT language which are pulling in the opposite direction to those XML design precepts which encourage verbosity.

Three examples of these are the use of traditional argument list-based functions within XPath, the use of abbreviated non-XML syntax for XPath expressions[XPath10], and the allowing of literal result elements alongside the more verbose xsl:element element[XSLT10].

From this it might be concluded that verbosity in an instructional programming language, as for a natural human language, is undesirable.

4. The Attribute Question

The tension between economy and verbosity of markup is crystallised in one particular debate which has exercised the markup community continually: the question of when to use an element and when to use an attribute[10].

Leaving aside the interesting variety of opinion on this matter[11], an interesting observation might be made about the debate itself. The very fact that it can happen, that we have at the heart of XML a feature in search of a rationale, is worrying in itself. The fact that the debate lacks closure savours even more of a 'bad smell'[12], indicating that something is amiss which needs remedial action.

To take an example from the DTD used for this conference's papers:

    <gcapaper presdate='20040421' prestime='1145'>
      <!-- etc. etc. -->
    </gcapaper>

Why should this be the markup as opposed, say, to the following?

    <gcapaper>
      <attributes>
        <presdate>20040421</presdate>
        <prestime>1145</prestime>
      </attributes>
      <!-- etc. etc. -->
    </gcapaper>

From a processing point of view, it is absurd that we should be faced with a choice between these two forms. Why does XML present us with an extra axis for attributes (to use XPath terminology) when the same effect can be achieved perfectly well using elements? And why does it present us with the choice when the semantics of that extra 'attribute' axis are ill-defined? Clearly, attributes are 'syntactic sugar'.
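To make the processing argument concrete, the following Python sketch mechanically rewrites the attribute form into the all-element form using the standard library's ElementTree. It is an illustration under stated assumptions only: the function name attributes_to_elements is hypothetical, and the wrapper element name 'attributes' is simply taken from the example above.

```python
import xml.etree.ElementTree as ET


def attributes_to_elements(element: ET.Element) -> ET.Element:
    """Return a copy of `element` with every attribute rewritten as a child element.

    Assumption: attributes are gathered under a wrapper element named
    'attributes', as in the example in the text.
    """
    result = ET.Element(element.tag)
    result.text = element.text
    result.tail = element.tail
    if element.attrib:
        wrapper = ET.SubElement(result, "attributes")
        for name, value in element.attrib.items():
            ET.SubElement(wrapper, name).text = value
    for child in element:
        # Recurse so that attributes at any depth are converted.
        result.append(attributes_to_elements(child))
    return result


if __name__ == "__main__":
    paper = ET.fromstring("<gcapaper presdate='20040421' prestime='1145'/>")
    converted = attributes_to_elements(paper)
    # The same value, reached by the attribute axis and by a child path.
    print(paper.get("presdate"))
    print(converted.findtext("attributes/presdate"))
```

Both lookups in the usage example retrieve the same string; only the path differs, which is the sense in which the attribute axis adds nothing for a processor.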

From the point of view of a human interacting with the markup it is absurd that we should consider the verbose attribute-free form. Why should the expressive potential of attributes, which signal to us the metadata-like nature of their content, be sacrificed to make way for the all-element model? Clearly, it is syntactic sugar that makes markup sweet.

Whatever the merits of these points of view (which might, perhaps, be held by our imagined 'data head' and 'doc head' respectively), the ultimate arena for the 'attribute vs element' debate is one concerning the lexical form of markup, its language. This is to be distinguished from debates about the information model of markup, to which I now wish to turn.

5. Information Models

James Clark has claimed that the essential information model, or 'abstraction', for XML is clearly definable:

I would argue that the right abstraction is a very simple one. The abstraction is a labelled tree of elements. Each element has an ordered list of children where each child is a Unicode string or an element. An element is labelled with a two-part name consisting of a URI and local part. Each element also has an unordered collection of attributes where each attribute has a two-part name, distinct from the name of the other attributes in the collection, and a value, which is a Unicode string. [Clark]

If we reduce this model further to exclude attributes and XML Namespaces (the URI in Clark's abstraction), we are left with a very simple abstraction which might be represented by the following class diagram.

Figure 1. 

Despite the compactness of this model, it embodies some involution and subtlety, by managing to contain the notions of abstract types, polymorphism and same-type containment. Instantiations of Element form what has been termed the 'composite' pattern[13] which is common in many areas of application development, in which nodes of the tree may be terminal nodes, or sub-trees of the whole.
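As a rough illustration of the reduced abstraction, the model can be sketched in a few lines of Python. Only the name Element comes from the text; Node, Text and the serialise method are hypothetical names chosen for this sketch and are not taken from the class diagram.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


class Node(ABC):
    """Abstract type: anything that may appear as a child of an element."""

    @abstractmethod
    def serialise(self) -> str:
        """Return the node as tags and character data."""


@dataclass
class Text(Node):
    """Terminal node: a Unicode string."""

    value: str

    def serialise(self) -> str:
        # Escape the characters that are significant to the markup itself.
        return (self.value.replace("&", "&amp;")
                          .replace("<", "&lt;")
                          .replace(">", "&gt;"))


@dataclass
class Element(Node):
    """Composite node: a label plus an ordered list of child nodes."""

    name: str
    children: List[Node] = field(default_factory=list)

    def serialise(self) -> str:
        # Polymorphic call: each child is either a Text or another Element.
        body = "".join(child.serialise() for child in self.children)
        return f"<{self.name}>{body}</{self.name}>"


if __name__ == "__main__":
    row = Element("ROW", [Element("CELL", [Text("1")]),
                          Element("CELL", [Text("TFT Panel")])])
    print(row.serialise())  # <ROW><CELL>1</CELL><CELL>TFT Panel</CELL></ROW>
```

The three notions mentioned above are all visible: Node is the abstract type, serialise is dispatched polymorphically, and Element contains a list of its own supertype.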

Clark summarises his exposition of XML's abstraction,

[ ... ] That is the complete abstraction. The core ideas of XML are this abstraction, the syntax of XML and how the syntax and abstraction correspond. If you understand this, then you understand XML.

However, even if we re-admit attributes and namespaces into our information model, then it is worth probing further 'how the syntax and abstraction correspond', for in reality Clark's 'very simple' abstraction is over-simplified in comparison to the real XML abstraction, which introduces such concepts as comments, processing instructions, CDATA marked sections, and entity references into the model. The W3C's DOM[DOM] contains not 3 or 4 classes, but 17[14] which would seem to indicate that one answer to the question of how XML syntax and this simple abstraction correspond, is: not very closely.
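The gap between the simple abstraction and a real DOM can be observed directly. The sketch below uses Python's xml.dom.minidom as a stand-in for the Java org.w3c.dom package discussed in the text (an assumption made for convenience; the two implementations differ in detail), and reports which DOM node types a small document gives rise to. The function name node_kinds is hypothetical.

```python
from xml.dom import Node, minidom

# Map numeric nodeType codes back to their DOM constant names.
NODE_TYPE_NAMES = {
    getattr(Node, name): name for name in dir(Node) if name.endswith("_NODE")
}


def node_kinds(xml_text: str) -> set:
    """Return the names of the DOM node types present in a parsed document."""
    found = set()

    def walk(node):
        found.add(NODE_TYPE_NAMES[node.nodeType])
        # Attributes hang off elements rather than appearing among the children.
        if node.nodeType == Node.ELEMENT_NODE and node.hasAttributes():
            found.add("ATTRIBUTE_NODE")
        for child in node.childNodes:
            walk(child)

    walk(minidom.parseString(xml_text))
    return found


if __name__ == "__main__":
    sample = '<doc id="d1"><!-- note --><?app hint?><![CDATA[a < b]]>text</doc>'
    for kind in sorted(node_kinds(sample)):
        print(kind)
```

A document restricted to elements and text yields only document, element and text nodes; each additional syntactic feature brings a further node type into the model.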

Taking as a precept of a good design that it should be minimal yet complete, and using XML attributes again as a case in point, the question should be asked, why are attributes modelled in the base layers of our information model? They are not necessary in XML, nor to our mental model of markup, so they violate the design principle of minimality (and the same holds for all the other non-essential XML features). However, completeness dictates that they must be modelled in our XML abstraction in order to mirror XML syntax. So an answer to the question of how XML syntax and abstraction correspond, is: the syntax has driven the data model.

It seems to me a great irony that one of the most widely-accepted design principles of markup is that style and content should not be confused, yet at the beating heart of XML is a parallel condition whereby the base data model is bloated with representations of expressive nuances available at the lexical level.

This is not to argue against expressive nuances in markup - in fact markup languages targeted at particular problems should have many more than XML permits - but to suggest that a markup language should be based on a clean mental and data model, suitable for layering subsequent levels of complexity so that it can better address the full diversity of markup constituencies than XML does. We should refactor what we have to create this new markup language.

6. Refactoring XML

The call for a simplified XML is not a new one[15], and in practice there have been many proposals driven by a variety of considerations, most commonly, it seems, ease of implementation.

'Refactoring' implies changing something internally without affecting its outward form[16], so producing a subset of XML is only part of an overall refactoring process, which should go on to encompass and extend those features of XML that are found useful.

Practically speaking, as the first stage of refactoring I suggest a minimal metalanguage is required which

  1. corresponds to the abstraction of a labelled tree of elements

  2. is 100% backwards-compatible with XML and WebSGML

What this means (in effect) is XML without

  1. DTDs (and therefore, no user-declared entities or notations)

  2. comments

  3. processing instructions

  4. CDATA marked sections

  5. attributes (and therefore, no XML Namespaces)

What this gives us, syntactically, is tags and Unicode.
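Because the proposed profile is a strict subset of XML, conformance to it can be checked with an ordinary XML parser. The following Python sketch does so with the standard library's expat binding. It is only one possible reading of the list above: the names ProfileError and check_minimal_profile are hypothetical, and the treatment of the XML declaration and of character references (both left unchecked here) is an assumption, since the text does not rule on them.

```python
import xml.parsers.expat


class ProfileError(ValueError):
    """Raised when a document uses a construct outside the minimal profile."""


def check_minimal_profile(document: str) -> None:
    """Raise ProfileError unless `document` consists only of tags and text."""
    parser = xml.parsers.expat.ParserCreate()

    def reject(construct):
        # Build a handler that refuses the named construct wherever it occurs.
        def handler(*_args):
            raise ProfileError(
                f"{construct} not allowed (line {parser.CurrentLineNumber})")
        return handler

    def start_element(name, attributes):
        # Namespace declarations arrive here as ordinary attributes,
        # so rejecting attributes also rejects XML Namespaces.
        if attributes:
            raise ProfileError(f"attributes not allowed on <{name}>")

    parser.StartDoctypeDeclHandler = reject("document type declaration")
    parser.CommentHandler = reject("comment")
    parser.ProcessingInstructionHandler = reject("processing instruction")
    parser.StartCdataSectionHandler = reject("CDATA marked section")
    parser.StartElementHandler = start_element

    try:
        parser.Parse(document, True)
    except xml.parsers.expat.ExpatError as exc:
        raise ProfileError(f"not well-formed XML: {exc}") from exc


if __name__ == "__main__":
    check_minimal_profile("<TABLE><ROW><CELL>1</CELL></ROW></TABLE>")  # passes
    try:
        check_minimal_profile("<gcapaper presdate='20040421'/>")
    except ProfileError as error:
        print(error)
```

Any document accepted by this check is also well-formed XML, which is the backwards-compatibility property asked for above; the converse does not hold.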

7. Possible Objections

A comprehensive and thorough argument against having a 'reduced' XML (SML) has been put forward by Rick Jelliffe[Jelliffe]. His pertinent objections are summarised and discussed in the following sections.

7.1. Application Support

Jelliffe worries that a layered approach to XML may bring about correspondingly layered support from vendors.

The problem with treating low-level layers as optional rather than required is that it allows vendors to pick and choose which layers to provide. What if one big browser company decided not to use namespaces, while another chose to include them? [...] SML must be justified by a respect for plurality rather than by the pursuit of simplicity.

In reality, a refactored markup language needn't necessarily challenge or displace vendor support for documents that use XML syntax. Indeed, given the enormous support that XML has enjoyed, this seems unlikely. I would argue that, from the perspective of actual or potential users of markup, the respect for plurality is a driver for refactoring XML, rather than for leaving it be.

7.2. Attributes vs Elements Redux

Jelliffe addresses the 'attributes vs elements' question and proposes his own reasons why attributes represent a 'natural pattern' in markup languages (which, for certain classes of document, they certainly do). He further argues that attributes give us

[...] readability, simplification of paths, simplification of content models, simplification of at least some kinds of programs, and so on

He then invokes the spectre of a 'data head' who questions the element/attribute distinction because it will 'complicate mapping data from databases which do not have this distinction'.

It would seem to me that the image of this imaginary 'data head' is an argument for having markup which properly addresses the needs of different constituencies, again a 'respect for plurality'.

7.3. Human Factors

Jelliffe argues that markup languages are essentially driven by human rather than machine requirements:

[ ... ] the essential feature of XML is that it is a markup language. It is not merely a language for computer-to-computer exchange - it also provides minimal features aimed at making life easier for direct readers and writers of data.

I have argued that the 'minimal features' offered by XML are not enough, and that certain design decisions pull XML away from human beings (and I fully agree with the statements here). However, a practical reason for a refactoring exercise is to allow for a markup language which properly fulfils the design precepts of XML and is fully machine-facing, so that subsequent layers on top of the base model may provide markup syntaxes more amenable to human interaction than base XML. The last few years have seen too much contamination of the human factors of XML by data-centric interests.

7.4. Controlled Subsets

Jelliffe concludes that his putative simplified XML may have a future

in the area of closed data transport and interprocess communication, where it is generated by API, and where human reader/writers do not touch it.

This uncannily predicts the ad hoc subsetting of XML that has taken place in the SOAP specification[SOAP]. On an entirely practical level I would argue that one of the reasons for refactoring XML, and producing initially a minimal markup metalanguage, is to control the future direction of subsetting and so impose some degree of standards-enforced order on the proliferation of subsets.

8. Standardisation

As a practical next step I propose that a standardisation effort be put in train to carry out this refactoring of XML, the first result of which should be the minimal metalanguage outlined in this paper.

Bibliography

[AltXML] XML Alternatives. (Available http://www.pault.com/pault/pxml/xmlalternatives.html).

[Clark] James Clark, foreword to Eric Van Der Vlist, RELAX NG. O'Reilly (2004).

[Dodds] Leigh Dodds, Time to Refactor XML? (Available http://www.xml.com/pub/a/2001/02/21/deviant.html).

[DOM] W3C, Document Object Model (DOM). (Available http://www.w3.org/DOM/).

[DuCharme] Posting to xml-dev mailing list 1 December 2003. (Available http://lists.xml.org/archives/xml-dev/200312/msg00006.html).

[Durusau] Just-In-Time-Trees (JITTs): Next Step in the Evolution of Markup? (Available http://www.sbl-site2.org/Extreme2002/JITTs.html).

[ERH] Elliotte Rusty Harold, Effective XML. Addison-Wesley (2004).

[Fowler] Martin Fowler, Refactoring. Addison-Wesley (1999).

[Go4] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides, Design Patterns. Addison-Wesley (1995).

[Jelliffe] Rick Jelliffe, Goldilocks and SML. (Available http://www.xml.com/pub/a/1999/12/sml/goldilocks.html).

[LMNL] Wendell Piez, and Jeni Tennison, The Layered Markup and Annotation Language (LMNL). (Available http://xml.coverpages.org/LMNL-Abstract.html [extended abstract only]).

[McLaughlin] Brett McLaughlin, 'XML Best Practices' in Java Enterprise Best Practices. O'Reilly (2003).

[MinML] The Wilson Partnership, MinML. (Available http://www.wilson.co.uk/embedded/emb1.htm).

[Ogbuji] Uche Ogbuji, When to use elements versus attributes. (Available http://www-106.ibm.com/developerworks/xml/library/x-eleatt.html?ca=dnt-59).

[SGML] ISO 8879:1986. Standard Generalised Markup Language (SGML).

[SOAP] SOAP Version 1.2 Part 1: Messaging Framework. W3C Recommendation 24 June 2003. (Available http://www.w3.org/TR/soap12-part1/).

[10Points] W3C, XML in 10 points. (Available http://www.w3.org/XML/1999/XML-in-10-points).

[TEI] TEI Guidelines for Electronic Text Encoding and Interchange, Version ?. (Available http://etext.lib.virginia.edu/bin/tei-tocs?div=DIV1&id=NH).

[Xerces-C++] Xerces-C++. A validating XML parser written in a portable subset of C++. (Available http://xml.apache.org/xerces-c/index.html).

[XML] Extensible Markup Language (XML) 1.0. W3C Recommendation 10-February-1998. (Available: http://www.w3.org/TR/1998/REC-xml-19980210).

[XPath10] XML Path Language (XPath) Version 1.0. W3C Recommendation 16 November 1999. (Available http://www.w3.org/TR/xpath).

[XSD] W3C, XML Schema. (Available http://www.w3.org/XML/Schema/).

[XSLT10] XSL Transformations (XSLT) Version 1.0. W3C Recommendation 16 November 1999. (Available http://www.w3.org/TR/xslt).

Biography

Alex first became interested in structured markup when analysing literary texts for his doctorate (on early Shakespeare editions) in the late 1980s.

Following this he worked as a developer on a heavily object-oriented C++ application framework for cross-platform multimedia publishing, at the height of the CD-ROM boom.

In 1997 Alex was one of the founding directors of Griffin Brown Digital Publishing Ltd, a UK-based company providing XML-based services and products. He is responsible for leading the company's XML consulting and implementation, and his work includes advising clients on XML/IT strategy and practice, mentoring clients' staff, writing DTDs and Schemas, and designing and developing XML software systems in C++, Java and other languages. In 2002, Alex was invited to join the British Standards Institution (BSI) Technical Committee IST/41, where he contributes to ISO/IEC JTC1/SC34 in its formation of the DSDL ISO standard, among other things.

Alex writes and speaks regularly on structured markup technologies and their application to information management.



[1] SGML describes itself in Part 0 of the standard[SGML] as being for 'publishing in its broadest definition'.

[2] See, for example, the TEI DTD Guidelines, which anticipate heavy use being made of SGML's CONCUR feature (which itself was rarely implemented in SGML processors)[TEI].

[3] See for example John Wilson's MinML (a parser not a language) written in Java for running on embedded systems.[MinML]

[4] By which I mean support for XML 1.0, XML Namespaces, several Unicode encodings, W3C XML Schema, and all the common XML APIs.

[5] A binary of version 1.6 of the Xerces-C++[Xerces-C++] XML suite, for example, takes around 1.5MB.

[6] See [Dodds][XSD].

[7] Bob DuCharme has characterised doc heads as 'people doing XML work with irregularly structured documents that would end up being published in some medium or other'; data heads as 'XML developers doing systems involved in more transactional processes such as web services and database interaction' [DuCharme].

[8] According to the W3C, in item number 1 of their 'XML in 10 points' document, 'XML is for structuring data. ... Structured data includes things like spreadsheets, address books, configuration parameters, financial transactions, and technical drawings'[10Points]. The words 'publishing' and 'documents' are, interestingly, absent from this description.

[9] What XSLT developer has not groaned as they key these constructs?

[10] Most recently by Uche Ogbuji[Ogbuji].

[11] For example Elliotte Rusty Harold proposes reserving attributes for metadata ([ERH], p.69) and adds a pragmatic rider, 'if you have any doubt about whether information is metadata or data, I suggest you place it in element content'. In contrast Brett McLaughlin recommends 'use elements sparingly, attributes excessively'[McLaughlin] to enable easier coding of SAX data handlers. Again this may represent the 'doc head', 'data head' divide.

[12] To adapt Kent Beck and Martin Fowler's colourful turn of phrase ([Fowler], p. 75).

[13] by the so-called 'gang of four'[Go4].

[14] As found in the org.w3c.dom Java package included in the JAXP 1.1 distribution. Note that the DOM is interface-based rather than class-based, and that NodeList is itself one of the defined types.

[15] Paul T's page lists over 40 XML Alternatives[AltXML].

[16] Fowler's definition of 'refactoring' is 'the process of changing a software system in such a way that it does not alter the external behaviour of the code yet improves its internal structure'[Fowler].