XML 2002 logo

Towards a Generalized XML-Based System for Flexible Formatting of Text and Graphics

Abstract

Fenestra Technologies Corporation is nearing the completion of a three-year project in conjunction with the United States Census Bureau to create software for semi-automated generation of both paper and electronic survey forms for use in the upcoming 2002 Economic Census. Previously, we presented an overview of the project (at XMLEurope 2002); this paper focuses on some of the technical issues regarding the layout and presentation of survey forms that arose during software development. It also discusses the design options we are considering as we embark on the development of the next generation of tools for survey creation and administration.


Table of Contents

1. The problem
2. Implementation issues
2.1. Features of survey form layouts
2.2. Separation of layout and content
2.3. Visual fidelity
2.4. Borders
2.5. XML as a data interchange mechanism
3. Summary
Bibliography
Glossary
Biography

1. The problem

Every five years, the Economic Directorate of the United States Census Bureau, a division of the U.S. Department of Commerce, conducts a wide-ranging census of economic activity that involves over 650 individual survey forms averaging 10-12 pages each (see http://www.census.gov/). These surveys are distributed to over 5 million U.S. businesses. The survey forms measure current economic activity, and so the specific questions vary from one census cycle to the next to reflect the changes in the U.S. economy. Consequently, the forms must be redesigned for each census.

For the 2002 Economic Census, the Census Bureau contracted with Fenestra to design and implement solutions to two separate problems. First, it wanted to streamline the process of survey design and production so that subject matter experts would have essentially real-time feedback as they design and edit their form layouts. Second, it wanted to drive the creation of both paper and electronic (online) surveys from a single, common data repository containing the content and layout information for the survey forms.

The end result of Fenestra's work with the Census Bureau is the Generalized Instrument Design System (GIDS), which has been described in more detail previously [GIDS].

2. Implementation issues

Examples of some GIDS-created survey forms are shown in the following figures. Figure 1 shows a portion of a paper survey form that was created "manually" using the GIDS Forms Designer application. Each text and graphical item was positioned by hand. Figure 2 and Figure 3 are portions of paper survey forms which were generated using a rule-based layout engine (the GIDS Autoformatter). Finally, Figure 4 shows an electronic survey form displayed in the GIDS Surveyor application.

Figure 1. A "manually-formatted" section of a paper survey form

Figure 2. A simple auto-formatted section of a paper survey form

Figure 3. A more complex, hierarchical auto-formatted section of a paper survey form

Figure 4. An electronic form displayed in the Surveyor application

When we first began work on GIDS approximately three years ago, we made the decision to use eXtensible Markup Language (XML) as the format for information storage and interchange with other systems. Our decision was based on several factors, such as the human-readable nature of XML and the likelihood that XML would be supported by the various other systems with which GIDS would need to interact.

At the same time, we had also planned to use eXtensible Stylesheet Language Formatting Objects (XSL-FO) as the basis for layout and rendering of survey forms. However, it became apparent early on that the XSL-FO specification would be neither complete enough nor stable enough by the time we needed to use it. For that reason, we developed our own formatting language, Survey Formatting Objects (SFO), which we based loosely on the existing XSL-FO specification. A key requirement of the Economic Census is precise control over all aspects of the visual appearance of the forms: typography, position of elements, colors, etc. SFO was designed with this requirement in mind, and many of the design decisions derive directly from it.

During the last several months, as GIDS has been placed into service, our experience with GIDS and SFO has given us some insight into what we did right, and what we still need to work on. In particular, we have identified five areas in which the design of SFO has had an impact on performance, usability, or some other aspect of the operation of GIDS. Each of these characteristics of SFO has a counterpart in XSL-FO, and so what we learned may also be useful to users of XSL-FO.

2.1. Features of survey form layouts

Looking at the examples in the preceding figures, it is clear that survey forms, paper forms in particular, have a "two-dimensional" layout. A more conventional document consists of a well-defined sequence of character glyphs flowing in one direction, grouped into lines flowing in a perpendicular direction. Survey forms, by contrast, tend to consist of a number of "cells," within which the standard flow rules apply, but whose placement and ordering relative to one another can be almost arbitrary. It is for this reason that we chose to deviate from the XSL-FO formatting model and substitute the SFO stacking model. At the top level, an SFO document consists of a number of rectangular regions, which correspond roughly with the regions in an XSL-FO document. Unlike XSL-FO, however, an SFO page may contain any number of absolutely positioned regions, which may be placed arbitrarily on the page (even overlapping). Within an SFO region, one or more rectangular areas, occupying all or part of the space within the region, are stacked against one of the four sides of the region. Furthermore, areas may be stacked within other areas, in a hierarchical relationship.

Figure 5 shows how stacking works. On the left side of the figure, four areas have been added to a region. The first area has been stacked to the left, so that it contacts the left boundary of the region. Its width has been set to 50%, indicating that it should occupy 50% of the remaining space. (Since this is the first area added to the region, the "remaining space" is of course all of the space in the region.) The second area has been stacked to the top, and thus it contacts the top of the remaining space within the region. Its height is set to an explicit value of 1.2 inches. The third and fourth areas are stacked similarly.

Figure 5. The SFO stacking model

The right side of the figure shows an additional four areas; these are stacked inside the first area that was added to the region. Note especially how area widths or heights expressed as percentages refer to a percentage of the remaining space, not a percentage of the parent area or region as a whole. Continuing in this way, a region may be subdivided into rectangular areas in an almost completely arbitrary manner. Each area has its own attributes (margins, borders, paddings, background color, etc.) and its own content (text or graphics).
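The stacking rule described above can be sketched in a few lines of Python. This is only an illustration of the geometry, not the GIDS implementation: the `Rect` class, the `stack` function, the page dimensions, and the sizes of the third and nested areas are all assumptions made for the example. The essential point it shows is that a percentage size is resolved against the remaining space at the moment the area is stacked.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Rect:
    """Axis-aligned rectangle in inches; y grows downward from the top."""
    x: float
    y: float
    width: float
    height: float


def stack(remaining: Rect, side: str, size: float,
          percent: bool = False) -> Tuple[Rect, Rect]:
    """Carve an area off one side of `remaining`.

    Returns (area, new_remaining). A percentage size refers to the
    remaining space, not to the enclosing region as a whole.
    """
    horizontal = side in ("left", "right")
    extent = remaining.width if horizontal else remaining.height
    amount = extent * size / 100.0 if percent else size
    amount = min(amount, extent)  # never exceed the space that is left
    r = remaining
    if side == "left":
        area = Rect(r.x, r.y, amount, r.height)
        rest = Rect(r.x + amount, r.y, r.width - amount, r.height)
    elif side == "right":
        area = Rect(r.x + r.width - amount, r.y, amount, r.height)
        rest = Rect(r.x, r.y, r.width - amount, r.height)
    elif side == "top":
        area = Rect(r.x, r.y, r.width, amount)
        rest = Rect(r.x, r.y + amount, r.width, r.height - amount)
    elif side == "bottom":
        area = Rect(r.x, r.y + r.height - amount, r.width, amount)
        rest = Rect(r.x, r.y, r.width, r.height - amount)
    else:
        raise ValueError(f"unknown side: {side!r}")
    return area, rest


# A hypothetical 8 x 10 inch region.
region = Rect(0.0, 0.0, 8.0, 10.0)
first, rest = stack(region, "left", 50, percent=True)   # 50% of the full width
second, rest = stack(rest, "top", 1.2)                  # explicit 1.2 inch height
third, rest = stack(rest, "top", 50, percent=True)      # 50% of what is left

# Areas stacked inside the first area use that area as their starting space.
child1, inner = stack(first, "top", 25, percent=True)
child2, inner = stack(inner, "top", 50, percent=True)   # 50% of the remaining 75%
```

In this sketch `third` receives half of the 8.8 inches left below `second`, and `child2` receives half of the space left after `child1`, which mirrors the "percentage of the remaining space" behavior described in the text.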

In our experience with SFO and GIDS, we have found the stacking model to be adequate, but frequently unwieldy. In complex layouts, stacked areas can become very deeply nested, which affects both the readability of the XML document and rendering performance. As SFO evolved, and its capabilities were enhanced in step with the demands placed on it, it became increasingly clear that we were moving towards a general-purpose text and graphics layout environment, much like that of a commercial off-the-shelf application such as Adobe® Illustrator® or CorelDRAW®.

Naturally, this led to the question: Why not use such an application for the generation of survey forms? In fact, previous generations of Census survey forms had been created using similar products, but Census's experience in that regard had been mixed, at best. In general, the applications were too complicated to use except by a relatively small number of trained experts—they were, in a sense, too flexible. (In contrast, GIDS is in daily use by subject matter experts, people whose expertise is in deciding what questions to ask, not in graphical layout.) Furthermore, these applications did not have the ability to communicate with a content database to retrieve text strings, response-area identifiers and other metadata, etc.

Thus, we find ourselves faced with a dilemma when considering how to improve SFO for the next generation of tools: On the one hand, we need an expressive layout language which allows us to specify virtually any conceivable set of marks to be placed on a page. On the other hand, however, we need to provide our users with a more limited palette of tools, or they may become overwhelmed.

The original incarnation of SFO erred perhaps too far on the side of caution with regard to flexibility. While most ordinary layouts are relatively straightforward to generate using SFO, those having even a modest amount of complexity can become quite intricate, as one faces the various limitations that SFO (purposefully) imposes. (XSL-FO is even more rigid in this respect.) The next generation of SFO will likely contain a fully general layout engine, probably using Scalable Vector Graphics (SVG) as the basis of the document language, while the next generation of GIDS tools will impose various layout restrictions at the user interface level in order to keep the user interaction manageable.

2.2. Separation of layout and content

Like XSL-FO, SFO separates layout from content. An SFO document consists of a layout section, containing page masters which define the locations of flow elements on the pages of the resulting rendered layout, followed by a flows section which defines those flow elements (text, graphics, etc.).

This principle of separation of layout and content works very well for a conventional document, such as a book. In such a document, a very small number of page masters suffice to specify the layouts of a much larger number of pages. In a survey form, however, each page is unique, and typically, each page of a survey has its own page master. Therefore, in a survey form, the separation of layout and content becomes essentially superfluous.

Another limitation imposed by the separation of layout and content arises out of the serial nature of XML parsing. In a 20-page survey, for example, the XML parser must parse 20 page master definitions before the rendering engine can render even the first page of the document. While not an issue for paper forms, which are rendered only once and then printed, this does become a serious performance bottleneck with electronic surveys, as the user must wait, often several seconds, for the XML parser to fully digest the SFO document before displaying the form.
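The effect of document order on a streaming parser can be illustrated with a small Python sketch. The element names used here (`sfo`, `layout`, `page-master`, `flows`, `flow`, `page`) are hypothetical stand-ins and are not taken from the actual SFO vocabulary; the sketch simply counts how many elements a serial parser has to finish before the first piece of content becomes available, for a layout-first document and for one that keeps each page's layout and content together.

```python
import io
import xml.etree.ElementTree as ET


def elements_before_first(xml_bytes: bytes, tag: str) -> int:
    """Count the elements a serial parser closes before the first <tag> closes."""
    count = 0
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == tag:
            return count
        count += 1
    raise ValueError(f"no <{tag}> element found")


PAGES = 20

# Layout first, then content (hypothetical element names).
separated = (
    "<sfo><layout>"
    + "".join(f"<page-master name='p{i}'/>" for i in range(PAGES))
    + "</layout><flows>"
    + "".join(f"<flow name='p{i}'>Question {i}</flow>" for i in range(PAGES))
    + "</flows></sfo>"
).encode("utf-8")

# Layout and content kept together, page by page (also hypothetical).
combined = (
    "<sfo>"
    + "".join(f"<page name='p{i}'><flow>Question {i}</flow></page>"
              for i in range(PAGES))
    + "</sfo>"
).encode("utf-8")

print(elements_before_first(separated, "flow"))  # all page masters come first
print(elements_before_first(combined, "flow"))   # content is available at once
```

With the layout-first ordering, every page master (plus the enclosing layout element) is processed before any flow is seen; with the combined ordering, the first flow is the first element completed.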

For these reasons, the next generation of SFO will not separate layout from content, at least not in the SFO document itself. Certainly, at earlier stages of form creation, layout data will most likely be maintained separately from content data (this simplifies the process of translating the forms into different languages, for example) but in the final SFO product, the performance advantages of keeping the layout and content information together will outweigh the disadvantages.

2.3. Visual fidelity

One very important principle of SFO is that of visual fidelity. That is, an SFO document rendered on one computer system should be rendered identically (within the resolution limits of the display or print device) to the same document rendered on another system. This principle is crucially important to the Census Bureau, whose experience in the art of surveys has taught them that even seemingly minor variations in the presentation of a question on a survey form can introduce significant biases in the responses.

Some of SFO's visual fidelity behaviors are explicit, such as its precise specification of element positioning information. Others, such as its line-breaking algorithm (based on the algorithm used in Donald Knuth's TeX typesetting system) are implicit. In order to ensure true visual fidelity, in multiple software applications written in a variety of languages and run under a variety of operating systems, it will be necessary to make all of the layout behavior of the next generation of SFO fully explicit.

During the XSL-FO standardization effort, it was made clear that visual fidelity was not a goal of XSL-FO. For this reason alone, XSL-FO, in its present form, cannot support the requirements that drove the design of SFO.

2.4. Borders

Survey forms often contain visible borders between cells; these are evident in Figure 1 and Figure 3 above. The problem of drawing borders is a thorny one, because borders are sometimes considered to exist between cells, sometimes within cells, and sometimes a combination of both. For example, in the simplified layout shown in Figure 6, it is not possible to consider the borders as being entirely between cells, whereas considering them to be entirely within cells raises ambiguities concerning which cell "owns" a common border.

Figure 6. Cell layout including borders

We believe that the best way to deal with this problem is to separate the concepts of inter-cell and intra-cell borders in tabular layouts, and to specify them separately. A borders model such as that offered by Cascading Style Sheets, version 2 (CSS2) (also used in XSL-FO and SVG) would be sufficient to specify the inter-cell borders, but the CSS2 model would need to be extended to support intra-cell borders; the layout shown in Figure 6 cannot be produced at all using the CSS2 borders model.

The current version of SFO is limited to intra-cell borders, and while every conceivable border layout is achievable with this model, the aforementioned ambiguities did lead to considerable confusion early on, with all possible permutations of undesirable situations (doubled-up borders, missing borders, misaligned borders) occurring. The addition of a second level of borders should alleviate some of the technical problems, but will not be intuitive to users, and therefore great care will need to be taken to ensure that the user interface for specifying borders is not confusing.
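One of the failure modes mentioned above, doubled-up borders, is easy to detect mechanically when every cell owns its own borders. The following Python sketch is an illustration only: the `Cell` class, the integer coordinates, and the `doubled_borders` function are assumptions made for the example and do not describe how GIDS represents or checks borders. It reports every shared edge on which both neighboring cells draw a border.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Cell:
    """A rectangular cell that owns its borders (intra-cell model)."""
    name: str
    x: int
    y: int
    width: int
    height: int
    borders: FrozenSet[str]  # subset of {"left", "right", "top", "bottom"}


def overlaps(a0: int, a1: int, b0: int, b1: int) -> bool:
    """True if the intervals [a0, a1] and [b0, b1] share more than a point."""
    return min(a1, b1) > max(a0, b0)


def doubled_borders(cells: List[Cell]) -> List[Tuple[str, str, str]]:
    """List the shared edges on which both neighboring cells draw a border."""
    found = []
    for a, b in permutations(cells, 2):
        # a lies directly to the left of b
        if (a.x + a.width == b.x
                and overlaps(a.y, a.y + a.height, b.y, b.y + b.height)
                and "right" in a.borders and "left" in b.borders):
            found.append((a.name, b.name, "vertical"))
        # a lies directly above b
        if (a.y + a.height == b.y
                and overlaps(a.x, a.x + a.width, b.x, b.x + b.width)
                and "bottom" in a.borders and "top" in b.borders):
            found.append((a.name, b.name, "horizontal"))
    return found


ALL = frozenset({"left", "right", "top", "bottom"})
cells = [
    Cell("A", 0, 0, 10, 5, ALL),
    Cell("B", 10, 0, 10, 5, ALL),
    # C deliberately omits its top border so that A and B "own" that edge.
    Cell("C", 0, 5, 20, 5, frozenset({"left", "right", "bottom"})),
]
print(doubled_borders(cells))
```

Here cells A and B both draw the edge between them, so that edge is reported, while the edge above C is drawn only once. A missing border cannot be detected this way, because whether an undrawn edge is intentional depends on the designer's intent.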

2.5. XML as a data interchange mechanism

As mentioned above, we chose XML for its readability as well as the increased probability that we would be able to easily share data with other systems within the Census Bureau. Our success in achieving readability turned out to be relatively low, unfortunately. The heavily nested structure of SFO documents combined with the separation of layout and content makes it difficult to follow the "flow" of a document.

We had more success in the area of data interchange, although we experienced some problems there as well. One persistent problem arose out of what might be thought of as an "impedance mismatch" between the nature of XML schemas and that of database schemas. If the requirements of a data processing system change, an XML document's Document Type Declaration (DTD) is easily modified to comply with the new requirements. Modifying a relational database schema, on the other hand, is far more involved, so much so that in many cases we found that database managers were wholly unwilling to modify their databases to accommodate changes in the structure of our XML documents. As a result, we frequently had to encode our XML data as Binary Large Objects (BLObs) and embed them as base64-encoded attributes or element content within the "approved" XML format. (This process further reduces readability, of course.) Our experience was perhaps atypical, but it does serve as a caution to others who may find themselves involved in similar XML-database interactions.
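The workaround described above amounts to the following, shown here as a minimal Python sketch using only the standard library. The container element `record`, the field element `blob`, and the function names are hypothetical; the actual "approved" format is not described in this paper.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_as_blob(payload_xml: str, container_tag: str = "record",
                 field_tag: str = "blob") -> str:
    """Embed an XML document as base64 text inside a fixed container format."""
    container = ET.Element(container_tag)
    field = ET.SubElement(container, field_tag, {"encoding": "base64"})
    encoded = base64.b64encode(payload_xml.encode("utf-8"))
    field.text = encoded.decode("ascii")
    return ET.tostring(container, encoding="unicode")


def unwrap_blob(container_xml: str, field_tag: str = "blob") -> ET.Element:
    """Recover and parse the embedded XML document."""
    container = ET.fromstring(container_xml)
    field = container.find(field_tag)
    if field is None or field.text is None:
        raise ValueError(f"no <{field_tag}> payload found")
    payload = base64.b64decode(field.text).decode("utf-8")
    return ET.fromstring(payload)


# A made-up payload whose structure the receiving database never sees.
payload = "<area stack='left' width='50%'><text>Name of business</text></area>"
wrapped = wrap_as_blob(payload)
restored = unwrap_blob(wrapped)
print(wrapped)
print(restored.get("width"), restored.find("text").text)
```

The container's schema never changes, whatever the structure of the payload, which is what makes the approach acceptable to the database side. The cost is visible in the output: the embedded document is opaque to both human readers and any tool that only understands the container.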

XML documents have a hierarchical structure, which is encoded as a linear sequence of text characters. This leads to the problem mentioned earlier, in which locating page N in a document requires reading through and parsing pages 1 through N – 1; there is simply no way to tell an XML parser to "jump to page 13." There are a couple of ways to address this limitation. One is to split the file into a number of subfiles (one file per page, perhaps), but this leads to a proliferation of files, which is often undesirable. Another technique is to accompany the XML file with an index file, which a "smart" parser could use to jump to a location within the file and extract an XML fragment. This index file could even be included as a preamble within the XML document (in the form of one or more processing instructions, perhaps). In any case, the need to quickly access different portions of an XML document would appear to be prevalent, and some standardized random-access model would be welcomed.
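The index-file idea can be sketched in Python with the standard library's expat bindings, which expose the byte offset of the current parse event through `CurrentByteIndex`. Everything else here is an assumption made for illustration: the `page` element name, the in-memory index format, and the function names. Note that building the index still requires one full serial parse; the benefit comes on later accesses, which can seek directly to a page.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, List, Tuple
from xml.parsers import expat


def build_page_index(data: bytes, tag: str = "page") -> List[Tuple[int, int]]:
    """Return (start, end) byte offsets for each <tag> element in `data`."""
    index: List[Tuple[int, int]] = []
    open_starts: List[int] = []  # start offsets of currently open <tag> elements
    parser = expat.ParserCreate()

    def on_start(name, attrs):
        if name == tag:
            open_starts.append(parser.CurrentByteIndex)

    def on_end(name):
        if name == tag:
            start = open_starts.pop()
            # The event position is the start of the end tag; extend to its '>'.
            end = data.index(b">", parser.CurrentByteIndex) + 1
            index.append((start, end))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(data, True)
    return index


def read_page(stream: BinaryIO, index: List[Tuple[int, int]],
              page_number: int) -> ET.Element:
    """Seek to page `page_number` (1-based) and parse only that fragment."""
    start, end = index[page_number - 1]
    stream.seek(start)
    return ET.fromstring(stream.read(end - start))


document = (
    b"<survey>"
    b"<page n='1'><q>Name of business</q></page>"
    b"<page n='2'><q>Number of employees</q></page>"
    b"<page n='3'><q>Annual payroll</q></page>"
    b"</survey>"
)
index = build_page_index(document)           # would be stored as the index file
page = read_page(io.BytesIO(document), index, 3)
print(page.get("n"), page.find("q").text)
```

This sketch has limits that a standardized random-access model would need to address: a fragment that relies on namespace prefixes or entities declared elsewhere in the document will not parse on its own, and the index must be rebuilt whenever the document changes.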

3. Summary

We have presented some of our observations regarding the limitations of both our own SFO and standard XSL-FO that we encountered during development of a system for the creation of paper and electronic survey forms. We plan to use this information while we develop the next generation of form-creation tools, and we also hope that our experiences can be of use to others who are embarking on similar paths.

Bibliography

[GIDS] Schafer, Steven A, 2002. Digitizing the U.S. Census Bureau, XMLEurope 2002 Proc. (http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/04-04-03/04-04-03.html)

Glossary

BLObs

Binary Large Objects

CSS2

Cascading Style Sheets, version 2

DTD

Document Type Declaration

GIDS

Generalized Instrument Design System

SFO

Survey Formatting Objects

SVG

Scalable Vector Graphics

XML

eXtensible Markup Language

XSL-FO

eXtensible Stylesheet Language Formatting Objects

Biography

Steve Schafer is Chief Technology Officer for Fenestra Technologies Corporation, a software research and development company located in the Washington DC metropolitan area.