Abstract
Binary formatted message standards have been designed by the military for use in bandwidth restricted environments such as between airborne platforms and ground based terminals. These standards are typically designed at the binary encoding level rather than through use of abstract modeling languages. At the same time DoD is widely adopting the use of XML for information exchange. As military messaging standards migrate to XML, concern exists in using XML for binary formatted messages. Such messages are orders of magnitude larger, and consequently the increase in bandwidth required to send those messages could exceed the capacity of current communication resources.
Some who are knowledgeable in XML have countered that use of compression technologies, and/or the use of binary XML message encoders and protocols could mitigate these risks. We present the findings of a study initiated to better understand the increase in bandwidth in using XML for dissemination of this binary originated information, and to determine what techniques might alleviate this concern. The goal of the study is to provide concrete data showing equivalent messages represented in the native binary format versus compressed XML representations. The study applies compression techniques to sample messages encoded in the native binary and in XML to determine their resulting sizes. The study also addresses the tradeoffs between abstract message design and application of physical encoding versus design at the physical level.
Table of Contents
The military makes extensive use of binary formatted messages for efficient wireless communications and processing by automated systems. These message standards are typically crafted to express field content and structure using the fewest bits possible. Design approaches vary but normally the binary physical encoding of the structure and fields are captured by rules unique to each specification.
Currently, programs are being directed to implement information exchanges using XML. While exemptions for wireless communications will be made, questions have been raised about whether XML can serve as the lingua franca for all information exchanges. Use of wireless XML poses problems due to the explosion in the number of bytes required to represent the same data and the limits of radio spectrum availability. Additionally, processing of XML is relatively more expensive in terms of CPU and memory which may be a limiting factor for embedded processors.
The XML community generally regards loss of format efficiency as necessary to achieve the benefits they espouse. List discussions on the topic divide into two camps, the XML purists who wish to defend attacks on the technology and outsiders who are grappling with the disruptions that this technology engenders. Purist argue that XML can be compressed if space/bandwidth is an issue while outsiders argue that alternate encodings are needed to provide efficient data exchange and consumption by applications which are concerned only with the data content, not the presentation.
This study investigates the costs of converting one binary message format into XML and the application of several XML compression and encoding methods to reduce the resulting XML message sizes. Our goal is to run concrete tests in order to produce comparative metrics for analysis and draw conclusions as warranted.
The compression methods investigated in the study fall into three categories: redundancy-based, schema-based and hybrid. Redundancy-based compression methods take advantage of the repetition of character groups within the source document. Schema-based methods leverage design knowledge of the structure and content of the source document. Hybrid compression methods apply a combination of redundancy-based and schema-based methods to the same source document.
Redundancy-based compression methods are mature and have been in use for some time. They work by removing redundancy at the character level. They assume no knowledge of the structure or content of the document, but base replacement of redundant character groups on their statistical occurrence in the source document (e.g., Huffman encoding). The replacement of redundant character groups is supported by a code dictionary that matches the redundant characters with reference codes (e.g., Lempel-Ziv[LZ77]). In general, redundancy-based compression methods achieve good compression results on large text files but become less effective as the file size decreases.
The redundancy-based compression methods used for the study are briefly described below.
This is the popular, general text compression tool found on computer desktops around the world. WinZip utilizes mature and efficient redundancy compression algorithms (Lempel-Ziv [LZ77] and Huffman algorithms). More information on WinZip can found at http://www.winzip.com .
Schema-based compression methods for XML have been emerging over the past several years. These methods work by encoding the file based on design knowledge about the permitted content and structure of the source document. In schema based compression, the full XML tags are not represented in the compressed form and can be reconstructed using the design knowledge. The design knowledge is also leveraged to efficiently encode the data contents of the source file. In fact, we expect schema-based compression to be optimal, in terms of size, for encoding data contents. Additionally, Schema-based compression methods allow the XML tags and content to be recovered from the encoded format without decoding the entire file. Thus an advantage of schema-based methods is that they allow decoded values to be accessed by applications without producing the original XML document.
The schema-based compression methods used for the study are briefly described below.
The Moving Picture Experts Group binary encoding format is an ISO/IEC standard for associating meta-data with audio/visual content. The BiM codec is dynamically generated based on the content metadata. XML Schema is the primary option for describing the meta-data content. More information on MPEG/BiM can be found at http://www.expway.tv/bim/bim.html.
Abstract Syntax Notation One is a language for defining tree-structured information without regard to implementation, and has been used in the telecommunications industry for some time. XML support for ASN.1 was added in 2002 and support for XML Schema is at the proposal stage. ANS.1 experimentation for the study was conducted by OSS Nokalva (http://www.oss.comhttp://www.itu.int/ITU-T/asn1/ .
Hybrid compression methods apply both redundancy-based and schema-based compression on the same source document. Specifically, redundancy-based compression is applied after applying schema-based compression to further reduce the size of the compressed format.
The hybrid compression methods used for the study are briefly described below.
XMill, an open source research prototype developed by University of Pennsylvania and AT&T Labs, builds on the gzip library to provide an XML-specific compressor. The path encoding options support selective application of type-specific (i.e., based on design knowledge of the source document) encoders to data content. The data and content of the source document are compressed separately using redundancy compression. More information on XMill can be found at http://www.research.att.com/sw/tools/xmill .
Since XMill is the only native hybrid application selected for this study, the experiment included tests of each schema based method followed by redundancy compression to provide a basis for comparison.
This hybrid method involves applying two previously described methods in succession. First, MPEG-7/BiM is applied to the source document, and then WinZip is applied to the result.
Two sets of test data were used in the study. Since appropriate binary samples were initially unavailable, the study began by focusing on approximate XML representations of binary data. This first set of test data (Test Set A) consisted of two XML samples. The second set of test data (Test Set B) consisted of four binary samples with four corresponding XML representations. The binary samples ranged in size from 674,062 to 764 bytes. The XML samples ranged in size from 11,421,822 to 6,010 bytes. The sizes and study designations of the samples are given in tables 1 and 2, below. The number of messages in each sample are also identified. Because of the variability in size of a single message (approximately 200 to 40,000 bytes) and the difficulty in identifying a single representative message from the binary data provided, we chose to substitute an aggregate of 8 small messages of medium complexity as a surrogate.
| Designation | XML File Size (bytes) | Number of Messages |
| Full | 9,878,662 | 6853 |
| Short | 6,010 | 8 |
Table 1. Designations and File Sizes for Test Set A
| Designation | Binary File Size (bytes) | XML File Size (bytes) | XML to Binary Ratio (%) | Number of Messages |
| Full | 674,062 | 11,421,822 | 1694% | 6853 |
| Subset 1 | 50,703 | 855,297 | 1687% | 500 |
| Subset 2 | 19,990 | 339,629 | 1699% | 200 |
| Subset 3 | 764 | 12,209 | 1598% | 8 |
Table 2. Designations and File Sizes for Test Set B
Table 2, above, also gives the ratio of the binary file size and the size of the corresponding XML file. Representing the data in XML causes a sixteen to seventeen-fold increase in file size. Note that the “cost” of XML in terms of size increase depends on a number of factors. One of these is the verboseness of the tags used to mark up the data. Because of the sensitive nature of the study data, the element names used in the sample XML data cannot be discussed in this paper. It can be noted, however, that the tag names used were unabbreviated, descriptive terms.
As mentioned above, the precise structure and content of the samples cannot be presented here. However, the general structure and data types of the XML documents used for the study can be discussed. These are illustrated in Figure 1, below. Although the study data is not available to the reader, this depiction should indicate that the XML sample structure and content is sufficiently rich for the study purposes.
The compression methods were applied to the test data to compare their performance on the XML samples. Performance is measured as the ratio of the compressed XML file size to the size of the original XML or available binary file. This metric is expressed as a percentage. For test set B, the XML compression results were also compared to the performance of WinZip on the binary files.
In the figures below, the following designations are used to signify the compression methods:
“WinZip (MAX)” refers to WinZip using the maximum compression option
“MPEG-7” refers to MPEG-7/BiM
“MPEG-7 WinZip MAX” refers to WinZiP (with maximum compression) applied to the results of MPEG-7/BiM
“WBXML-like” refers to the application developed for the study to perform WBXML compliant tag tokenization
“XMill Path” refers to XMill using path encoding options
ASN.1 PER (al) refers to ASN.1 using aligned Packed Encoding Rules
ASN.1 PER(al) WinZip (MAX) refers to WinZiP (with maximum compression) applied to the results of ASN.1 (using aligned Packed Encoding Rules)
WinZip Binary refers to WinZip using the maximum compression option applied to the original binary file
Figure 2, below depicts the relative performance of the compression methods on the XML samples in test set A. Since this test set contains XML files only, the performance of the compression methods is measured as the ratio of the compressed file size to the original XML file size. This ratio is reported as a percentage.
WinZip’s pure redundancy based method achieved better results on the larger sample (Full) than on the smaller sample (Short). Conversely, pure schema based methods (MPEG-7 and ASN.1 PER) achieved better results on the smaller sample than on the larger sample.
The hybrid methods (MPEG-7 WinZip (MAX), XMill Path, and ASN.1 PER(al) WinZip (MAX)), without exception, performed better on the larger sample than on the smaller sample. Also, it is interesting to compare the results of the pure schema methods (MPEG-7 and ASN.1 PER) with the hybrid methods constructed from these. For the larger sample, the hybrids performed better than the pure schema based methods. For the smaller sample, adding a stage of redundancy compression (via WinZip) caused no significant improvement or actually degraded performance.
The best performer for the larger sample was XMill Path, which is a hybrid method. This is consistent with the overall results achieved in that it reinforces general speculations that hybrid methods perform better on larger files, and can also provide enhanced performance on large files over pure redundancy based methods. The best performer for the smaller sample was ASN.1 PER, which is a pure schema based method.
Test set B consists of binary samples and their corresponding XML representations. In addition to comparing the relative performance of XML compressors, we also included metrics for WinZip compression of the original binary. Although optimized for general text compression, WinZip achieved good results on the binary samples. This suggests some amount of redundancy in the binary data and format. The results of applying WinZip to the binary samples are given in Figure 3, below.
The performance of the XML compression methods is measured as the ratio of the size of the compressed XML file to both the size of the corresponding binary file and the original XML file size. These ratios are reported as percentages in figures 4 and 5, below. Figure 4 presents the performance with respect to the size of the corresponding binary samples. The performance of WinZip applied to the binary samples is also depicted for comparison purposes in figure 4. Figure 5 presents the performance ratios with respect to the size of the original XML files. It should be noted that due to limited resources we were unable to apply ASN.1 encoding in the results for Test Set B.
Figure 4 provides insight into the potential impact of transmitting compressed XML in place of the binary format. For all but the smallest sample (Subset 3), three methods compressed the XML to less than the size of the corresponding binary file. For two of the samples (Full and Subset 1), XML compression (XMill Path) actually out-performed WinZip compression of the corresponding binary file itself.
The performance of XML compression on the smallest sample (Subset 3) indicates that for smaller files, transmitting XML will involve larger bandwidth overhead. However, even for a binary file this small in size (764 bytes), one of the XML compression methods (MPEG-7 WinZip Max) constrained the “cost” of XML to an increase of only 40 percent over the original binary. It is also important to take into account that bandwidth overhead could be reduced by 50% of the original simply by applying a commonly available tool (WinZip) to the binary.
XMill Path was the overall best performer for compressing the XML samples in Test Set B.
The hybrid methods (MPEG-7 WinZip MAX and XMill Path), without exception, performed better on the larger samples (Full, Subset 1 and Subset 2) than on the smallest (Subset 3). The best performer for the larger samples was XMill Path, which is a hybrid method. The hybrid MPEG-7/redundancy method performed best on the smallest sample.
With respect to the size of the original XML, the compression methods performed well, overall. It is interesting to note that the performance of all the methods was consistent for the three largest samples. In addition, the performance of two methods (MPEG-7 and WBXML-like) was consistent for all of the samples. The generalization that WinZip performed better for larger files can also be made.
As mentioned earlier, tests using ASN.1 were unfortunately not conducted. We observe that if application of ASN.1 aligned Packed Encoding Rules to Test Set B were found to be consistent with the performance observed for Test Set A, then the results for Subset 3 would likely have equaled or exceeded all other methods and perhaps rivaled the size of the original binary encoding. However, no conclusion can be drawn from this educated guess and it remains subject to verification.
First, some caveats should be discussed. The tests and analysis were conducted on a single document type consisting of well formatted data types nested to a maximum of four levels. Use of unformatted strings was not included. The sample binary messages, while representative, adhere to a standard that is currently in final draft form. Additionally, the tests conducted were exclusively to compare size optimization, other optimization factors were not rigorously examined and have not been reported. ASN.1 tests were conducted in collaboration with OSS Nokalva and while the results are not in question they have not been independently verified. The WBXML-like approach is consistent with the WAP specification but was modified to support XML schemas rather than DTDs.
From the test results and analysis the following conclusions are drawn. First the XML representations are highly redundant, but even the original binary formatted messages exhibit significant degrees of redundancy. As expected, the degree of redundancy increases with the size of the message set.
In comparing redundancy and schema based schemes, we observe that there is a point at which redundancy based compression overtakes the efficiency of schema based encoding techniques. Further, this crossover point corresponds to relatively small file sizes. For the sample sets used in this study, it appears that the point is somewhat larger than 12Kb and almost certainly smaller than 100Kb.
Hybrid schemes generally produce better results in comparison to WinZip's redundancy based compression and exceed WinZip's efficiency for the largest sample size. XMill was the best hybrid approach for all but the smallest sample in which case the hybrid combination of a schema based scheme followed by WinZip compression yielded superior results.
Analysis indicates if optimal size compression is the primary goal, multiple techniques will be required. No single technique was optimal for all sample sizes, though XMill's hybrid technique proved optimal for all but the smallest sample.
The results show that XML compressed/encoded messages are competitive in size to the original binary formats. Indeed for all but the smallest sample size optimized XML proved to be smaller than the original binary formatted message set.
Finally, we end with some conjectures to be explored in a follow up study. First, the operating environment will constrain the choice of techniques and optimization factors that predominate. Constraints on memory, CPU, compression or decompression time, accessibility of the compressed data, and streamed decoding are all secondary optimization factors. Further, study will be needed to provide a basis for selection by our customers.
Second, we believe that separating message design from physical encoding requirements need not sacrifice efficiency. Binary message specifications which detail the physical encoding of messages to address limited bandwidth issues may be better served by a separation of concerns that include a combination of logical design and a secondary encoding/compression specification. Adoption of XML Schema as a logical design technique may have merit in standardizing the many approaches taken in the design of binary message specifications that currently lack a common methodology or formalism. While XML serves as a default physical encoding for the logical design, alternate encoding/compression algorithms can be designated to address efficiency requirements and constraints of the operating environment. A set of standard algorithms, which leverage the use a single logical design language, could find wide spread utility as a replacement for current practice. This separation of concerns can promote common standards and efficient message exchange for the military.
This study would not have been possible without the cooperation of Claude Seyrat (Expway) and Bancroft Scott (OSS Nokalva). Claude Seyrat provided access to the MPEG-7/BiM reference software and answered numerous questions concerning the standard and the approach taken to schema based encoding. Bancroft Scott graciously offered support in exercising ASN.1's encoding and emerging XML schema translation capabilities. The time spent in running the ASN.1 tests and interpreting the results contributed greatly to our understanding of ASN.1 and the potential of schema based encoding techniques.
![]() ![]() |
Design & Development by deepX Ltd. 2002 |