Abstract
This paper reports on the outcome of the W3C Workshop on Binary Interchange of XML Information Item Sets. A prerequisite to attend the workshop was submitting a position paper. A total of 36 position papers were accepted. In the papers, organizations and individual contributors stated their position regarding potential W3C involvement in the standardization of a binary XML format and also shared their experiences using various binary XML solutions that are currently available.
We will not report on the position that each organization or individual took with respect to the need for a W3C-blessed binary XML format. Instead, we will concentrate on the technical aspects of the discussions by (i) elaborating on a list of requirements for an ideal binary XML format and (ii) outlining a taxonomy for the different solutions presented in the position papers.
The W3C Workshop on Binary Interchange of XML Information Item Sets was held at Sun Microsystems' campus in Santa Clara, CA, between the 24th and 26th of September, 2003 [W3C Workshop on Binary Interchange of XML Information Item Sets]. In what follows, we will refer to "binary interchange of XML information item sets" simply as binary XML [1].
A requirement to attend the workshop was to submit a position paper. A total of 36 papers from different organizations and individuals were accepted. The majority of the papers submitted included answers to the following questions:
1. Do we need a W3C-blessed binary XML standard?
2. What work has your organization done in this area?
3. What applications or use cases have you considered?
This paper elaborates on some of the answers to questions 2 and 3; the reader is referred to [Report on the W3C Workshop] for details on the position of each organization with respect to question 1.
The first day of the workshop consisted of presentations from a number of companies including Adobe, BEA, Expway, IBM, Microsoft, and Sun, among others. Each organization presented their position with respect to question 1 and, in most cases, also discussed some of their use cases and solutions. During the second day, break-out groups were formed in order to collect requirements as well as to give an opportunity to those companies that had not presented their paper to do so. The third day (which was really a half day) was dedicated to discussions on further work within the W3C in the area of binary XML.
The paper is organized as follows. In Section 1.1, “Introduction”, we discuss a list of requirements for an ideal binary XML format. In Section 1.3, “Data Compression”, we address the always controversial topic of data compression for XML. Section 1.4, “Degrees of "Infosetness"” defines a way to classify binary XML encodings and outlines a taxonomy for the different solutions presented in the position papers. Finally, Section 1.6, “Conclusions” lists conclusions and explains future work items.
We divided the requirements collected at the workshop into three categories: format, transmission and processing. The format requirements serve as the basis for the requirements in the other two categories. They are the requirements that most of us think of first when discussing binary XML. However, the categories are not completely independent: requirements in the transmission and processing categories will influence how the binary format is defined (e.g., support for fragments may rely on features added to the underlying format).
A condensed version of the requirements collected at the W3C workshop, listed in the three categories outlined above, is shown below [2]. The complete set of requirements is available online [Report on the W3C Workshop].
Efficient storage and efficient transmission is important
Support parsing on a low-powered device
Do not want a domain-specific solution
An order of magnitude (10x) better performance is desirable
Want arbitrary precision numerical data formats
Self-describing format
Can use schemas to help encoding
Must support schema version detection, multiple schemas at once
Must support open content, e.g., elements, values, subtrees not in the schema
Want support for existing APIs (SAX, DOM, Pull)
Must require minimal changes to application layer
Must be able to distinguish text XML from binary format on inspection
Performance comparable to (or better than) RMI
Must not rely on HTTP (e.g., file support)
Must be clear about MIME media type to be used
Must support fall-back to text format if receiver can't understand binary
May support schema evolution: download of new schemas and/or codecs
Support for fragments and the ability to send deltas (versioned fragments)
Random access based on infoset: using XPath or other boundaries (e.g image, page, etc.)
Must support progressive downloading (e.g., progressive rendering)
Must support (a form of) XML security (e.g., via canonical XML reconstruction)
Perhaps the most obvious requirement for a binary XML format is that it has to be efficient, both in document size and encoding/decoding time. The general consensus was that, unless the efficiency of a solution can be proven, there is little value in using it as an alternative to the widely successful textual format. How much efficiency? There was no consensus on this topic, but some suggested that in order for XML (and, more importantly, the benefit of using XML standards and tools) to be used in certain industries, an order-of-magnitude improvement is needed. It was agreed that the impact on applications and developers should be minimal, and that support for standard XML APIs such as SAX, DOM and Pull (StAX) was desirable. For many, preserving the self-descriptive aspect of XML was a must; for others, this was a commodity that could be traded in cases where processing power and memory are scarce.
When it comes to transmission of binary XML documents, especially in the context of Web services, there was a natural comparison against existing technologies such as CORBA, RMI and DCOM. Some recent work [Fast Web Services] shows that the performance of Web services is significantly worse than that of RMI and even RMI/IIOP. Regaining some of this performance was regarded as an important goal of a binary XML format. Having more than one wire format for Web services may create interoperability problems; consequently, the consensus was that binary XML should be used as an alternative to textual XML, and that support for the latter was necessary in order to have a fall-back mechanism. Finally, and still within the area of transmission, schema drift detection and schema evolution were deemed necessary for those solutions that optimize the wire format based on schema knowledge.
Other requirements included support for XML fragments and the ability to send deltas (i.e., versioned fragments). It was pointed out that existing technologies [MPEG-7] already provide support for these features in a way that is much easier to use when compared to XML. Random access, i.e., the ability to access a portion of a document in time that is sub-linear with respect to the document's size, is another requirement that was brought up several times [Adobe] [CCSDS Packaging Working Group]. It can be regarded as a requirement on the format, but it is listed in the processing category as the unit of random access was loosely defined.
Support for XML security was also mentioned. One of the challenges in supporting XML security is that it relies (perhaps too much?) on the lexical representation of datatype values. For example, a binary XML format that uses schema information may turn the character information items "001" into the binary number 1 whenever its type is xs:int. Even though the value space of the datatype is preserved, its lexical space is not (the use of xs:pattern in an XML schema poses similar challenges). Of course, always preserving the lexical space defeats the purpose of using binary XML in some cases, notably for large XML documents carrying scientific data [Cubewerx]. A possible solution to this problem is outlined in [Schema Centric XML Canonicalization], where a schema-centric canonicalization is defined.
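The "001" example can be made concrete with a minimal sketch. The element name `count`, the string slicing that stands in for a parser, and the use of SHA-256 as a stand-in for a signature digest are all assumptions made for illustration; no particular binary XML codec or XML security implementation is implied.

```python
import hashlib


def digest(text: str) -> str:
    """SHA-256 over the UTF-8 bytes, standing in for a signature digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


original = "<count>001</count>"

# A hypothetical schema-aware encoder knows the content is xs:int and
# stores the number rather than the characters.
lexical = original[len("<count>"):-len("</count>")]
value = int(lexical)  # "001" becomes 1: the value is kept, the spelling is not

# The decoder can only re-create *a* lexical form of that value.
round_tripped = f"<count>{value}</count>"

same_value = int(lexical) == value
same_digest = digest(original) == digest(round_tripped)
```

Here `same_value` is true while `same_digest` is false: the round trip preserves the value space but not the lexical space, so a digest computed over the textual form no longer verifies.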
Since one of the goals of a binary XML format is to reduce message size, data compression is often mentioned as a possible solution. The question people ask is: "Why don't you use gzip?" Indeed, for some applications, using data compression is a solution. However, in practice only a small percentage of applications can benefit from using these techniques. There are two major problems in using data compression:
It does not perform well on small messages: several people have reported getting larger sizes when compressing small messages (the kind that are not uncommon in Web services).
Data compression algorithms are CPU intensive: data compression adds to the latency of a message exchange.
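The first problem is easy to reproduce. The sketch below uses Python's standard `gzip` module; the sample messages are made up for illustration and are not taken from any position paper.

```python
import gzip

# A small message of the kind that is common in Web services (made-up sample).
small = b'<price currency="USD">12.50</price>'

# A large, highly repetitive document built from the same element.
large = b"<prices>" + small * 1000 + b"</prices>"

small_gz = gzip.compress(small)
large_gz = gzip.compress(large)

print(len(small), len(small_gz))  # the gzip output is larger than the input
print(len(large), len(large_gz))  # the gzip output is much smaller than the input
```

The gzip container alone adds 18 bytes of header and trailer, which a message this short cannot recover through redundancy; the large document, by contrast, compresses very well.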
Data compression is useful when two high-speed power-unconstrained CPUs exchange messages over a low-bandwidth line [Mitre] [Software AG]. Successful use of gzip has also been reported for large SVG files [Nokia]. For other cases involving the exchange of messages, data compression is not a viable solution. Even for storage of XML documents a binary XML representation may be preferred, especially if it supports a form of random access [Adobe].
Conceptually, there is nothing that prevents the use of data compression in conjunction with a binary XML format. Many binary formats store string data in textual form (e.g., ASN.1 supports UTF8), thus the compression of all or part of a binary message may still be desirable.
A convenient way to define a taxonomy for the different binary XML formats is to classify them based on how much of the original XML infoset they preserve. At one extreme, we have solutions that attempt to preserve the entire XML infoset as defined in [XML Information Set]. At the other, we have solutions that, aided by external information, attempt to minimize the amount of data by omitting certain infoset items.
In practice, the degree of infosetness can be controlled by how much schema information is employed. The more schema information used, the more compact the resulting format, but the more infoset items tend to be omitted. Perhaps not surprisingly, empirical data presented at the W3C workshop indicates that the more schema information used the more performance gained [Sun] [Expway] [Mitre] [KDDI] [OSS Nokalva].
A pure infoset-based solution will still perform better than textual XML as, in a binary format, it is possible to do clever sharing of items and item properties to avoid the redundancy that characterizes XML documents. Simple examples of sharable parts are element and attribute names, namespace prefixes and namespace URIs. Conversely, a pure schema-based solution will omit all element and attribute names (as these can be recovered from the schema) and will use datatype information to efficiently encode leaf nodes (i.e., character items) in order to get as close as possible to an Information Theory minimum.
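The sharing of names in an infoset-based format can be illustrated with a simple name table. This is a minimal sketch under our own assumptions (the functions `tokenize` and `detokenize` and the sample names are hypothetical); it does not reproduce the encoding of any solution presented at the workshop.

```python
from typing import Dict, List, Tuple


def tokenize(names: List[str]) -> Tuple[List[str], List[int]]:
    """Store each distinct name once and refer to it by index afterwards."""
    table: List[str] = []
    index: Dict[str, int] = {}
    stream: List[int] = []
    for name in names:
        if name not in index:
            index[name] = len(table)
            table.append(name)
        stream.append(index[name])
    return table, stream


def detokenize(table: List[str], stream: List[int]) -> List[str]:
    """Recover the original sequence of names; nothing is lost."""
    return [table[i] for i in stream]


# Element names in document order for an order with three items.
names = ["order", "item", "sku", "qty", "item", "sku", "qty", "item", "sku", "qty"]
table, stream = tokenize(names)
# table  -> ["order", "item", "sku", "qty"]
# stream -> [0, 1, 2, 3, 1, 2, 3, 1, 2, 3]
```

Because the table travels with the document, the format remains self-describing; a schema-based format would go one step further and drop the table altogether.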
There are also hybrid solutions that take advantage of some schema information whenever it is available. For example, it is possible to use datatype information from a schema to encode numeric data while at the same time preserving all the self-describing information (i.e., element names, attribute names, etc.). In some hybrid solutions, it is even possible to add typing information to the binary stream to aid the decoding process in the absence of the schema [Cubewerx].
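A hybrid encoding of this kind might look as follows. The byte layout, the type tag values and the function names are assumptions invented for this sketch; it is not the format of any solution listed in this paper. The element name is kept in the stream, the numeric content is stored in binary, and a one-byte type tag lets a receiver decode the value without the schema.

```python
import struct

# Hypothetical type tags, chosen for this sketch only.
TYPE_STRING = 0
TYPE_INT32 = 1


def encode_element(name: str, value) -> bytes:
    """Layout: name length (1 byte), name, type tag (1 byte),
    payload length (2 bytes), payload. All integers are big-endian."""
    name_bytes = name.encode("utf-8")
    if isinstance(value, int):
        tag, payload = TYPE_INT32, struct.pack(">i", value)
    else:
        tag, payload = TYPE_STRING, str(value).encode("utf-8")
    header = struct.pack(">B", len(name_bytes)) + name_bytes
    return header + struct.pack(">BH", tag, len(payload)) + payload


def decode_element(data: bytes):
    """Decode using only the stream: the type tag replaces the schema."""
    name_len = data[0]
    name = data[1:1 + name_len].decode("utf-8")
    tag, payload_len = struct.unpack_from(">BH", data, 1 + name_len)
    start = 1 + name_len + 3
    payload = data[start:start + payload_len]
    if tag == TYPE_INT32:
        return name, struct.unpack(">i", payload)[0]
    return name, payload.decode("utf-8")


encoded = encode_element("qty", 1234567)
decoded = decode_element(encoded)  # ("qty", 1234567)
```

The integer always occupies four bytes regardless of how many digits its textual form has, while the element name remains available to any processor that inspects the stream.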
The main advantage of infoset-based solutions is that they can be used as drop-in replacements for textual XML. There is no need to use a schema or any other information external to the infoset itself. Consequently, they are well suited to the kind of processing based on the XPath data model. This is not to say that schema-based solutions cannot be used for this purpose; simply that, in the case of schema-based solutions, an additional step is necessary to reconstruct infoset parts that are omitted in the format. In this sense, infoset-based solutions provide flexibility that is similar to textual XML, except for the need for special tools for human visualization.
Naturally, these advantages do not come without disadvantages, especially when compared to schema-based solutions: the format is not as compact, encoding/decoding is not as efficient, numerical data is not represented efficiently, and so on.
Since in schema-based solutions parts of the XML infoset are omitted in the format, they tend to integrate naturally with applications that take advantage of a programming language binding framework. In binding frameworks (of the kind available in most Web services toolkits as well as general-purpose tools like JAXB, Castor or XML Serializer), XML data (i.e., leaf nodes) is bound to programming language objects while the rest of the structure (i.e., non-leaf nodes) is used mostly to drive an unmarshaller. Thus, unless requested by the application, there is no need to use schema information to re-synthesize infoset parts which are not carried in the format.
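The following sketch illustrates how a schema-driven format can be unmarshalled directly into objects. The `Point` class, the `POINT_SCHEMA` list (an ordered list of field names and `struct` format codes standing in for a real XML schema) and the two functions are hypothetical; none of the binding frameworks named above is being modelled.

```python
import struct
from dataclasses import dataclass


@dataclass
class Point:
    x: int
    y: int


# Stand-in for a schema: field order and datatypes, shared out of band
# by sender and receiver. "i" is a 32-bit signed integer.
POINT_SCHEMA = [("x", "i"), ("y", "i")]


def marshal(obj, schema) -> bytes:
    """Write only the leaf values; names and structure are implied."""
    fmt = ">" + "".join(code for _, code in schema)
    return struct.pack(fmt, *(getattr(obj, name) for name, _ in schema))


def unmarshal(cls, schema, data: bytes):
    """Let the schema drive decoding and bind the values to an object."""
    fmt = ">" + "".join(code for _, code in schema)
    values = struct.unpack(fmt, data)
    return cls(**{name: value for (name, _), value in zip(schema, values)})


data = marshal(Point(3, -7), POINT_SCHEMA)    # 8 bytes, no element names
point = unmarshal(Point, POINT_SCHEMA, data)  # Point(x=3, y=-7)
```

No element names are ever materialized: the application receives a `Point` and the infoset is never re-synthesized. The cost is that a receiver without `POINT_SCHEMA` cannot interpret the eight bytes at all.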
Moving beyond application architecture into transmission of documents, schema-based solutions seem to match the requirements of peer-to-peer exchanges of documents, where there is a producer and a consumer but no (or only a few) intermediaries. The reason for this is that intermediaries often do not have complete information of the schema. To exemplify, an intermediary may only access "/packet/dest-ip/network" without knowing the actual type of the element 'packet'. Another reason is that a simple change in the schema may require updating the decoders in all intermediaries, a task that is difficult or even impossible in some cases (e.g., when an intermediary is maintained by a third party).
On the other hand, schema-based solutions are in some cases the only viable solution. They can be applied in industries where XML is either not considered or is used with very poor results. The telecom sector is a good example of this. Mobile phones are low-powered, have limited batteries and are connected to high-latency, low-bandwidth networks. Thus, it is imperative to (a) reduce message size and (b) optimize the encoding/decoding process in order to maximize battery life. A variety of proprietary formats have been devised to solve this problem [KDDI], but the undesirable consequences of these are (a) gatewayed networks and (b) lack of interoperability.
If reducing message size is the main goal, a schema-based solution combined with a data compression algorithm will result in the smallest message for most applications. Some have reported a message size reduction of more than 97% using this approach [Mitre] [Agile Delta].
A third kind of solution explores the use of partial schema information to optimize the wire format. A typical example is the use of datatype information from the schema in order to use a binary encoding for numerical data. Other examples are those that define the binary format as a serialization of a PSVI or an XPath 2.0 data model. They differ from a pure schema-based solution in that the structure of the XML infoset (i.e., parent-child relationships, element and attribute names, etc.) is part of the binary format.
A pure schema-based solution is unlikely to be adopted unless it supports the ability to encode open content (i.e., in XML Schema terms, support for xs:any, xs:anyType, etc.). In essence, open content is a 'hole' in the schema for which no type information is available. This is clearly a problem as no type information can be used to drive the encoding/decoding process for those holes. As a result, schema-based solutions are often combined with either an infoset-based or a hybrid solution [Sun] [Expway]. One interesting aspect of this combination is the possibility to control properties of the format by under-specifying or over-specifying a schema. For example, a schema may choose to define the type of an element as xs:anyType instead of as a sequence of two elements of type xs:int to ensure that the content of that element is encoded as an XML infoset [3].
Table 1 lists solutions presented in the position papers accepted to the workshop. Table 2 lists other technologies mentioned in the position papers as well as standards on which solutions from Table 1 are based. For each row in a table, a "Y" indicates that the feature is supported, an "N" that it is not and a "?" that not enough information was available to determine either way [4].
| Solution | Description | Infoset | Schema | Hybrid |
|---|---|---|---|---|
| Fast Web Services | Sun Microsystems (ITU-T/ISO X.69? specs) | Y | Y | N |
| BXML | Cubewerx | Y | N | Y |
| Xebu | University of Helsinki | Y | N | Y |
| XBIS | Dennis Sosnoski | Y | N | N |
| ESXML | High Performance Technologies | Y | N | N |
| Lionet | XBVM | Y | N | N |
| BinXML | Expway (MPEG-7 BiM with extensions) | Y | Y | N |
| CBXML | IBM | Y | N | N |
| XimpleWare | XimpleWare | Y | N | N |
| Serialized DOM | Media Fusion | Y | N | N |
| Systematic XML Compression | Systematic Software Engineering | N | Y | N |
| CMF-B | L3 Communication | N | Y | N |
| TokenStream | BEA Systems | ? | ? | ? |
| Tarari XML Tokenizer | Tarari | Y | N | N |
| XML-Xpress | Intelligent Compression Technologies | N | Y | N |
| XML Schema Tools | OSS Nokalva (ITU-T/ISO X.69? specs) | Y | Y | N |
| Xeus | KDDI | N | Y | N |
| Xfsp | Web3D | Y | N | Y |
Table 1.
| Technology/Solution | Description | Infoset | Schema | Hybrid |
|---|---|---|---|---|
| WBXML | W3C Note 24 June 1999 | Y | N | Y |
| CVG | Compressed Vector Graphics (SVG only) | N | Y | ? |
| ASN.1 | ITU-T/ISO X.69? specs | Y | Y | N |
| BiM v.1 | MPEG-7 Standards | N | Y | N |
| XMill | AT&T and University of Pennsylvania | Y | N | N |
| Millau | Institut Eurécom and IBM Research | Y | N | Y |
| XMLZip | XML Solutions | Y | N | N |
Table 2.
The number of organizations that attended the workshop indicates that there is great interest in binary XML. It was clear from the discussions that this is not a solution without problems. Some see the adoption of a binary XML format as a way to optimize their existing systems; others as the only possible way by which their systems can take advantage of existing XML technologies and tools.
A significant percentage of the attendees were skeptical that a single binary XML format would be suitable for all purposes. The more optimistic participants indicated that a mixed solution that combines infoset-based and schema-based encodings may successfully hit the 80:20 point, especially if redundancy-based compression were (optionally) supported as well. It was clear from the discussions that further work is needed to ascertain which of these positions will prevail.
One of the outcomes of the workshop was the decision to (i) create a forum for further discussion and (ii) prepare a draft for a possible WG at the W3C to further evaluate the need for a standardized binary format. This WG will not be formed for the purpose of producing a specification for a binary format, but to refine the list of requirements and use cases, and to evaluate functionality and performance (via a normalized method) of existing solutions.
The volunteers for drafting a WG charter are: Mark Nottingham (BEA Systems), Robin Berjon (Expway), Stephen Williams (High Performance Technologies), Santiago Pericas-Geertsen (Sun Microsystems), Selim Balcisoy (Nokia), Kimmo Raatikainen (University of Helsinki), John Schneider (Agile Delta), Alex Danilo (CISCO), Don Brutzman (Web3D Consortium) and Mike Cokus (Mitre).
[W3C Workshop on Binary Interchange of XML Information Item Sets] http://www.w3.org/2003/07/binary-xml-cfp.html
[Report on the W3C Workshop] http://www.w3.org/2003/08/binary-interchange-workshop/Report.html
[Fast Web Services] http://developer.java.sun.com/developer/technicalArticles/WebServices/fastWS/
[WBXML] http://www.w3.org/TR/wbxml/
[ASN.1] http://www.itu.int/ITU-T/asn1/
[Millau] http://www9.org/w9cdrom/154/154.html
[Schema Centric XML Canonicalization] http://www.uddi.org/pubs/SchemaCentricCanonicalization-20020710.htm
[XML Information Set] http://www.w3.org/TR/xml-infoset/
[Compressed Vector Graphics] http://www.3gpp.org/ftp/tsg_t/WG2_Capability/SWG3/SWG3_EMS_03_Paris/Docs
[Advanced Technologies Group NDS] http://www.w3.org/2003/08/binary-interchange-workshop/06-NDS-Position-Paper.pdf
[Telia Sonera] http://www.w3.org/2003/08/binary-interchange-workshop/07-TeliaSonera_Position_Paper_07082003.pdf
[University of Helsinki] http://www.w3.org/2003/08/binary-interchange-workshop/08-xebu.pdf
[Dennis Sosnoski] http://www.w3.org/2003/08/binary-interchange-workshop/09-Sosnoski-position-paper.pdf
[High Performance Technologies] http://www.w3.org/2003/08/binary-interchange-workshop/10-w3cbisposition_sdw.html
[Rick Marshall] http://www.w3.org/2003/08/binary-interchange-workshop/13-XML-Binary-Representation.ps
[CCSDS Packaging Working Group] http://www.w3.org/2003/08/binary-interchange-workshop/14-ccds-w3cposition-updated.pdf
[Software AG] http://www.w3.org/2003/08/binary-interchange-workshop/17-softwareAG-BinaryPosition.html
[XimpleWare] http://www.w3.org/2003/08/binary-interchange-workshop/20-ximpleware-positionpaper-updated.htm
[Media Fusion] http://www.w3.org/2003/08/binary-interchange-workshop/21-PositionPaper_MediaFusion.zip
[Systematic Software Engineering] http://www.w3.org/2003/08/binary-interchange-workshop/22-SSE-0001W3CPositionPaper.pdf
[Swiss federal institute of technology, Zurich] http://www.w3.org/2003/08/binary-interchange-workshop/23-wilde-w3c-bxml.pdf
[L3 communication] http://www.w3.org/2003/08/binary-interchange-workshop/23a-L3IS_BinaryXML_Position_11Aug03.pdf
[Ontonet] http://www.w3.org/2003/08/binary-interchange-workshop/24-ontonet-BinaryInfosetPositionPaper.html
[Agile Delta] http://www.w3.org/2003/08/binary-interchange-workshop/30-agiledelta-Efficient-updated.html
[OSS Nokalva] http://www.w3.org/2003/08/binary-interchange-workshop/32-OSS-Nokalva-Position-Paper-updated.pdf
[1] The use of the phrase "binary XML" is not without controversy. Some people will argue, perhaps rightly so, that the terms "binary" and "XML" should not be used together. As the title of the workshop suggests, it is more precise to talk about the exchange of binary information item sets. However, we prefer the use of "binary XML" simply because it is shorter and, as we will discuss shortly, because there are cases in which the complete XML infoset is not represented in the binary format either.
[2] With the exception of fixing some typos, the list of requirements has been copied verbatim from the workshop notes.
[3] Since decisions such as this have other implications, e.g. how the data is bound to objects in a programming language, it is unclear whether they will actually be of use in practice.