XML Europe 2002

Digitizing the U.S. Census Bureau

Abstract

Fenestra Technologies Corporation is nearing the completion of a three-year project in conjunction with the United States Census Bureau to create software for semi-automated generation of both paper and electronic survey forms for use in the upcoming 2002 Economic Census. This paper presents an overview of the project, and provides some details regarding the problems that needed to be solved and the implementation of their solutions.


Table of Contents

1. The problem
2. The implementation
2.1. XML and SFO
2.2. The GIDS system
2.2.1. Forms Designer
2.2.2. Autoformatter
2.2.3. Previewer
2.2.4. Publisher
2.2.5. Behavior Editor
2.2.6. Surveyor
3. Further applications
4. Lessons learned
4.1. XML
4.2. XSL-FO and SFO
4.3. Electronic vs. paper layouts
5. Summary
Glossary
Biography

1. The problem

Every five years, the Economic Directorate of the United States Census Bureau, a division of the U.S. Department of Commerce, conducts a wide-ranging census of economic activity that involves over 650 individual survey forms averaging 10–12 pages each (see http://www.census.gov/epcd/www/econ2002.html). These surveys are distributed to over 5 million U.S. businesses. The survey forms measure current economic activity, and so the specific questions vary from one census cycle to the next to reflect the changes in the U.S. economy. Consequently, the forms must be redesigned for each census.

Creating and managing these survey forms is a significant undertaking, involving the combined efforts of hundreds of people over a two- to three-year timeframe. Historically, the survey forms have been created individually by hand. Obviously, this process is labor-intensive, error-prone, and difficult to translate to the electronic world of online surveys. The subject matter experts who create the survey questions and design the forms mark up paper drafts of the form layouts and send them to a separate department where graphic artists use conventional graphics drawing software to compose the forms. The turnaround time for a single iteration of the layout/edit cycle is anywhere from several days to several weeks.

For the 2002 Economic Census, the Census Bureau contracted with Fenestra to design and implement solutions to two separate problems. First, it wanted to streamline the process of survey design and production so that subject matter experts would have essentially real-time feedback as they design and edit their form layouts. Second, it wanted to drive the creation of both paper and electronic (online) surveys from a single, common data repository containing the content and layout information for the survey forms.

The end result of Fenestra’s work with the Census Bureau is the Generalized Instrument Design System (GIDS). GIDS consists of several modules:

  • Forms Designer – The Forms Designer is used by the subject matter experts to lay out “custom-formatted” sections of forms; these are sections that are not amenable to fully automated layout.

  • Autoformatter – The Autoformatter automatically lays out regular, repeating sections of forms, based on a set of layout rules and templates.

  • Previewer – The Previewer provides visual feedback to the designers so that they may inspect forms in their entirety, with custom-formatted and autoformatted sections combined along with boilerplate into the final forms.

  • Publisher – The Publisher takes form layouts destined for paper output and produces PostScript® or Adobe® Portable Document Format (PDF) files for printing, either on the Census Bureau’s in-house printing facilities (for limited-run forms) or by commercial printers (for large-quantity production).

  • Behavior Editor – Electronic forms have behavior (auto-calculated response fields, navigation, data validation) in addition to visual layout. The Behavior Editor is the means by which the subject matter experts attach behaviors to the various items on a form.

  • Surveyor – The Surveyor is the electronic equivalent of the Publisher—it presents an electronic form to a respondent, collects the respondent’s response data, and securely transmits those data back to the Census Bureau.

These six components and their interactions are described in more detail below.

2. The implementation

2.1. XML and SFO

When we first began work on GIDS approximately two years ago, we made the decision to use XML (http://www.w3.org/XML/) as the format for information storage and interchange with other systems. Our decision was based on several factors, such as the human-readable nature of XML and the likelihood that XML would be supported by the various other systems with which GIDS would need to interact.

At the same time, we had also planned to use eXtensible Stylesheet Language Formatting Objects (XSL-FO) (http://www.w3.org/TR/xsl/) as the basis for layout and rendering of survey forms. However, it became apparent early on that the XSL-FO specification would be neither complete enough nor stable enough by the time we needed to use it. For that reason, we developed our own formatting language, SFO, which we based loosely on the existing XSL-FO specification. A key requirement of the Economic Census is precise control over all aspects of the visual appearance of the forms: typography, position of elements, colors, etc. SFO was designed with this requirement in mind, and many of the design decisions derive directly from it.

Like XSL-FO, an SFO document consists of a “layout” section, which describes where the various entities are to be placed on the page, and a “flows” section, containing the actual content. A small portion of the beginning of an SFO document, displaying part of the layout section, is shown in Figure 1. Like XSL-FO, SFO divides a page into rectangular regions. Unlike XSL-FO, however, an SFO page may contain any number of absolutely positioned regions, which may be positioned arbitrarily on the page (even overlapping).

Figure 1.

A portion of an SFO document, showing the layout section.

Where SFO diverges significantly from XSL-FO is in its stacking model. Within an SFO region, one or more rectangular areas, occupying all or part of the space within the region, are stacked against one of the four sides of the region. Furthermore, areas may be stacked within other areas, in a hierarchical relationship. This area hierarchy is reflected in the hierarchy of the <sfo:region> and <sfo:area> elements, as seen in the figure. (Incidentally, the unit of measure used in the document displayed in Figure 1, abbreviated as fu, is the Fenestra unit. The Fenestra unit is the fundamental unit of measure used in SFO and GIDS; there are exactly 7,342,632 Fenestra units in one inch.)

Figure 2 shows how stacking works. On the left side of the figure, four areas have been added to a region. The first area has been stacked to the left, so that it contacts the left boundary of the region. Its width has been set to 50%, indicating that it should occupy 50% of the remaining space. (Since this is the first area added to the region, the “remaining space” is of course all of the space in the region.) The second area has been stacked to the top, and thus it contacts the top of the remaining space within the region. Its height is set to an explicit value of 1.2 inches. The third and fourth areas are stacked similarly.

Figure 2.

The SFO stacking model.

The right side of the figure shows an additional four areas; these are stacked inside the first area that was added to the region. Note especially how area widths or heights expressed as percentages refer to a percentage of the remaining space, not a percentage of the parent area or region as a whole. Continuing in this way, a region may be subdivided into rectangular areas in an almost completely arbitrary manner. Each area has its own attributes (margins, borders, paddings, background color, etc.) and its own content (text or graphics).
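The stacking rules described above can be illustrated with a short Python sketch. This is not the SFO rendering engine; it is a minimal model under stated assumptions. The names Rect, inches, and stack, and the 8 x 10 inch region, are hypothetical and chosen only for illustration. Margins, borders, and padding are ignored.

```python
from dataclasses import dataclass

FU_PER_INCH = 7_342_632  # Fenestra units per inch, as stated in the text


@dataclass
class Rect:
    """A rectangle in Fenestra units (hypothetical helper type)."""
    left: float
    top: float
    width: float
    height: float


def inches(value: float) -> float:
    """Convert inches to Fenestra units."""
    return value * FU_PER_INCH


def stack(remaining: Rect, side: str, size: float, percent: bool = False):
    """Carve an area off one side of the remaining space.

    Returns (area, new_remaining). A percentage size is taken relative to
    the space still remaining, not to the region as a whole.
    """
    if side not in ("left", "right", "top", "bottom"):
        raise ValueError(f"unknown side: {side!r}")
    horizontal = side in ("left", "right")
    extent = remaining.width if horizontal else remaining.height
    amount = extent * size / 100.0 if percent else size
    if not 0 <= amount <= extent:
        raise ValueError("area does not fit in the remaining space")

    r = remaining
    if side == "left":
        area = Rect(r.left, r.top, amount, r.height)
        rest = Rect(r.left + amount, r.top, r.width - amount, r.height)
    elif side == "right":
        area = Rect(r.left + r.width - amount, r.top, amount, r.height)
        rest = Rect(r.left, r.top, r.width - amount, r.height)
    elif side == "top":
        area = Rect(r.left, r.top, r.width, amount)
        rest = Rect(r.left, r.top + amount, r.width, r.height - amount)
    else:  # bottom
        area = Rect(r.left, r.top + r.height - amount, r.width, amount)
        rest = Rect(r.left, r.top, r.width, r.height - amount)
    return area, rest


# The first two areas of the example: 50% stacked left, then 1.2 in stacked top.
region = Rect(0, 0, inches(8), inches(10))  # hypothetical region size
first, rest = stack(region, "left", 50, percent=True)
second, rest = stack(rest, "top", inches(1.2))
```

Areas nested inside another area can be modeled the same way, by passing the parent area's rectangle as the initial remaining space.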

Figure 3 shows a portion of the “flows” section of the same SFO document as in Figure 1. SFO flows are much like XSL-FO flows, containing block and inline text and graphic elements. Each flow is associated with its corresponding area via the content-name attribute.

Figure 3.

A portion of the flows section of a typical SFO document.

The next three figures show examples of rendered output from the SFO processor; all three examples are of output destined for paper, rather than online, forms. Figure 4 shows a portion of a simple autoformatted (i.e., rule-based) layout. Figure 5 shows a more complex hierarchical autoformatted layout. This example also illustrates the auto-numbering and item cross-referencing features built into the GIDS software. The various “Continued” headers are also generated automatically, as required, at page breaks. Figure 6 shows a portion of a custom-formatted section, one that was laid out “by hand” using the Forms Designer. This example also shows some of the non-text capabilities of SFO, such as embedded graphics (the barcode at the left edge of the form is generated programmatically at form assembly time, and embedded as a bitmap image). On a page-count basis, approximately 80% of the Economic Census survey forms consist of autoformatted sections; the remainder are custom-formatted.

Figure 4.

A portion of an autoformatted section of a survey form.

Figure 5.

A portion of a hierarchical autoformatted section of a survey form.

Figure 6.

A portion of a custom-formatted section of a survey form.

SFO-format form files may also be post-processed to extract a variety of layout and metadata information. For example, each response area on a form (e.g., a checkbox) is associated with a named data element. A post-processor can read an SFO file and extract the names of the data elements on a survey form, along with the physical locations and dimensions of the corresponding response areas. This information may then be used with an Optical Character Recognition (OCR) or Key From Image (KFI) system to extract response data from completed forms. This ability to automatically recover the location information for all of the response areas on a form represents a significant savings in the traditionally time-consuming and labor-intensive job of determining response area coordinates for OCR or KFI systems.

For the Economic Census 2002 project, a KFI system using information obtained from an SFO post-processor will be employed to retrieve data from completed forms and enter them into Census databases: A scanned image of a completed form will be presented to a data entry clerk on a computer screen. Each response area will be highlighted in turn (using the coordinate information from the SFO file), and the clerk will key in the corresponding response data.
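To make the highlighting step concrete, the following Python sketch converts a response area's position from Fenestra units to pixel coordinates on a scanned image. It assumes that the post-processor has already resolved each response area to an absolute rectangle. The ResponseArea record, its field names, the data element name SALES_TOTAL, and the 300 dpi scan resolution are all hypothetical; only the Fenestra unit conversion factor comes from the text.

```python
from typing import NamedTuple, Tuple

FU_PER_INCH = 7_342_632  # Fenestra units per inch, as stated in the text


class ResponseArea(NamedTuple):
    """A resolved response area (hypothetical record layout)."""
    data_element: str  # name of the associated data element
    page: int
    left_fu: int
    top_fu: int
    width_fu: int
    height_fu: int


def to_pixels(area: ResponseArea, dpi: int = 300) -> Tuple[int, int, int, int]:
    """Return (left, top, width, height) in pixels for a scan at `dpi`."""
    def convert(fu: int) -> int:
        return round(fu * dpi / FU_PER_INCH)

    return (
        convert(area.left_fu),
        convert(area.top_fu),
        convert(area.width_fu),
        convert(area.height_fu),
    )


# A response box 1 in from the left, 2 in from the top, 0.5 in by 0.25 in.
box = ResponseArea("SALES_TOTAL", 1, 7_342_632, 14_685_264, 3_671_316, 1_835_658)
highlight = to_pixels(box, dpi=300)
```

A real KFI workflow would also have to correct for skew and offset in the scanned image, which this sketch does not attempt.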

2.2. The GIDS system

A block diagram of the overall GIDS system is shown in Figure 7. Outside of the boundary of GIDS is the data/metadata Repository. This Repository is implemented as an Oracle database, and is beyond the control of Fenestra. All communication between the Repository and GIDS is via XML.

Figure 7.

A block diagram of GIDS.

All of the GIDS components are implemented as 32-bit Microsoft® Windows™ applications, and were written in Object Pascal using Borland® Delphi™.

2.2.1. Forms Designer

The Forms Designer is used to manually lay out custom-formatted sections. The Forms Designer accepts content data (text strings, data element names, etc.) from the Repository, which the user then places as desired on a layout “canvas.” A screen shot of the Forms Designer is shown in Figure 8. The upper window displays the current layout, while the lower window displays a hierarchical list of content elements associated with the form section being designed. The user can drag and drop elements from the lower window to the upper window, and from one location to another in the upper window.

Note that the Forms Designer uses a tabular, grid-based placement metaphor, rather than the stacking model used internally by SFO. Early versions of the Forms Designer used the stacking model directly, but this proved to be non-intuitive to the users. The tabular layout is converted automatically to the stacking model when the layout is saved back to the Repository.
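The conversion algorithm itself is not described here, but one part of the problem follows directly from the stacking rules: because a percentage width refers to the remaining space, a row of grid columns stacked to the left cannot reuse the column widths as-is. The Python sketch below shows the arithmetic under that assumption; the function name is hypothetical, and it should not be read as the Forms Designer's actual implementation.

```python
from typing import List, Sequence


def remaining_space_percentages(widths: Sequence[float]) -> List[float]:
    """Express a row of column widths as percentages of the remaining space.

    Each column is assumed to be stacked to the left in order, so its
    percentage is measured against whatever space the earlier columns
    have left over.
    """
    remaining = float(sum(widths))
    percentages = []
    for width in widths:
        if width <= 0:
            raise ValueError("column widths must be positive")
        percentages.append(100.0 * width / remaining)
        remaining -= width
    return percentages


# Three equal columns become roughly 33.3%, 50%, and 100% of the remaining space.
equal_columns = remaining_space_percentages([1, 1, 1])
```

The last column always comes out at 100%, since by then it occupies everything that is left.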

To ensure that the layouts displayed in the Forms Designer will be accurately reproduced in the final forms, the same SFO rendering engine is used in the Forms Designer as in the Publisher and Surveyor.

Figure 8.

The Forms Designer.

2.2.2. Autoformatter

Like the Forms Designer, the Autoformatter also receives content data from the Repository and creates layout data that are subsequently returned to the Repository. The Autoformatter operates as a Windows NT™ service, running in the background in a “push” mode. Content files are transmitted from the Repository to the Autoformatter server (via a user interface application attached to the Repository); the Autoformatter then processes them and transmits the formatted layouts back to the Repository.

In addition to layout rules that are built into the Autoformatter code modules, the Autoformatter also relies on a number of XML templates, which it uses while constructing layouts. These templates are essentially SFO document fragments containing replaceable tokens representing placeholders for text strings, dimensions, etc. Most layout modifications that affect the visual appearance of autoformatted sections can be made simply by modifying the appropriate templates; it is only necessary to modify the actual Autoformatter code when rules having to do with pagination, etc. need to be changed.
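The idea of a template with replaceable tokens can be sketched in Python with the standard string.Template class. The token syntax ($name), the attribute names, and the sfo:block element shown here are hypothetical and are not taken from the actual GIDS templates; only the sfo:area element name appears in the SFO description above.

```python
from string import Template
from xml.sax.saxutils import escape

# A hypothetical SFO fragment with two replaceable tokens.
ROW_TEMPLATE = Template(
    '<sfo:area stack="top" height="$row_height">'
    "<sfo:block>$label</sfo:block>"
    "</sfo:area>"
)


def fill(template: Template, **tokens: object) -> str:
    """Substitute tokens, escaping XML special characters in each value.

    Raises KeyError if the template contains a token with no value.
    """
    safe = {
        name: escape(str(value), {'"': "&quot;"})
        for name, value in tokens.items()
    }
    return template.substitute(safe)


row = fill(ROW_TEMPLATE, row_height="1.2in", label="Sales & receipts")
```

Escaping the substituted values keeps the result well-formed even when a text string from the Repository contains characters such as an ampersand.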

2.2.3. Previewer

Once all of the autoformatted and custom-formatted sections of a survey form have been completed, the final result may be viewed in the Previewer. A screen shot of the Previewer is shown in Figure 9.

Figure 9.

The Previewer.

The Previewer is a passive application; it allows the viewing and printing, but not editing, of a completed form. It provides a final “sanity check” to verify that a form includes all of the proper elements, doesn’t contain any awkward page breaks, etc.

2.2.4. Publisher

The Publisher produces the final production quality PostScript® or PDF files used to print the paper forms. The Publisher has a simple user interface (not shown); it allows the user to select among various printing options, such as output device type, resolution, etc.

2.2.5. Behavior Editor

The Behavior Editor is used to attach behaviors to electronic forms. These behaviors fall into three broad categories:

  • Navigational behaviors – These consist of items such as navigational buttons that transport the respondent from one page of the form to another, hyperlinks, etc.

  • Dependencies – Some response data items in a form are derived fields. For example, a form may contain a data element that represents the sum of the preceding four data elements. These kinds of dependency behaviors attached to the corresponding source and dependent data elements ensure that the values remain synchronized. (It could be suggested that dependent data elements are redundant and need not be displayed. However, Census Bureau rules allow a respondent to supply, for instance, only the sum value and not the component values; in this case, the dependency calculation would be overridden by the response.)

  • Validation – Some data element relationships may be expressed as limits. For example, if a response specifies that a company has 16 employees, but the annual payroll is reported as only $50,000, it is very likely that at least one of the responses is erroneous. Responses that violate such limits may be flagged and presented to the respondent to double-check. As another example, firms that have annual gross sales exceeding $100 million might be required to answer a set of questions that other companies do not; a validation behavior can be used in this case to ensure that the appropriate questions have been answered.
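The dependency and validation behaviors can be illustrated with a small Python sketch. The function names and the minimum payroll-per-employee threshold are hypothetical, and the handling of a total that disagrees with fully reported components is an assumption; the text specifies only that a respondent may supply the sum without its components.

```python
from typing import Optional, Sequence


def derived_total(
    components: Sequence[Optional[float]],
    reported_total: Optional[float],
) -> Optional[float]:
    """Value for a derived 'sum' field.

    If every component was reported, the sum is calculated from them.
    Otherwise the respondent's own total (if any) overrides the calculation.
    """
    if components and all(c is not None for c in components):
        return sum(components)
    return reported_total


def payroll_warning(
    employees: int,
    annual_payroll: float,
    min_per_employee: float = 5_000.0,  # hypothetical limit
) -> Optional[str]:
    """Return a warning message when payroll looks too low for the headcount."""
    if employees <= 0:
        return None
    if annual_payroll / employees < min_per_employee:
        return "Annual payroll appears low for the number of employees."
    return None


# The example from the text: 16 employees with a $50,000 annual payroll.
message = payroll_warning(16, 50_000)
```

A warning of this kind would correspond to a questionable value that the respondent is asked to double-check, rather than an error that blocks submission.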

(At the time of this writing, the Behavior Editor was in development at Fenestra, and no screen shot was available.)

2.2.6. Surveyor

The Surveyor is the only software component of GIDS that is distributed to respondents. A screen shot of the Surveyor is shown in Figure 10. (This screen shot is actually of an earlier version of the Surveyor used with Quarterly Financial Report forms; the version to be used with the 2002 Economic Census will have a slightly different appearance.) The left pane of the Surveyor window contains a table of contents, listing all pages of the form; clicking on an item listed here will take the user directly to that page. The right pane displays the actual form pages. Because of the limited resolution of a typical computer display, the level of typographic detail used in electronic forms is much simpler than that of paper forms.

Figure 10.

The Surveyor.

Lines 6 and 8 in the figure display “information” icons next to the response boxes. Clicking on one of these launches a pop-up window that provides additional instructions for the corresponding response item. Validation behavior icons are also displayed in this area: Yellow “warning” icons indicate questionable values, while red “error” icons indicate erroneous or missing values which must be corrected before the form may be submitted. As with the information icons, clicking on a warning or error icon provides additional information about the problem.

3. Further applications

In January 2002, Fenestra began work on a second Census Bureau contract, this time with the Decennial Census division, which most U.S. residents will recognize as the branch that counts (or at least attempts to count) every person in the U.S. every ten years. This contract was for the development of software to format for publication the vast amount of demographic data collected during the 2000 Census. An example of a portion of one of the resulting tables is shown in Figure 11.

Figure 11.

A portion of a sample demographic Census table.

Adapting the GIDS software to produce these tables was a straightforward exercise. All that was required was a new “front end” that understood the schema of the demographic data XML files, and a new Autoformatter module that understood the desired table layouts. The amount of time invested by Fenestra in making the required changes was approximately 450 person-hours, and the end result will be, when published, over 25,000 pages of formatted tables.

This project has confirmed our belief that the GIDS framework, even though it was designed and constructed specifically to solve the problems of the 2002 Economic Census, is sufficiently general that making modifications for new applications is a relatively painless task. Obviously, not every formatting problem is predisposed to solution within the GIDS framework, but a longer-term goal for Fenestra is to further improve the flexibility and extensibility of GIDS so that the majority of formatting problems may be solved with a single general-purpose descendant of GIDS, using only schemas and scripts to drive the formatting process.

4. Lessons learned

4.1. XML

The use of XML as the standard interchange format proved to be a wise decision, as the problems encountered when transferring data between disparate systems were minimal. With the Decennial Census project, in particular, there was only one very minor glitch: some documents that contained a UTF-8 encoding declaration were in fact encoded as ISO-8859-1; this became apparent when Spanish language files containing accented characters were submitted.
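A mismatch of this kind can be detected and repaired before parsing. The Python sketch below is one possible approach, not the fix used in the project: it assumes that any document that fails to decode as UTF-8 is really ISO-8859-1, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def repair_declared_utf8(data: bytes) -> bytes:
    """Return bytes that are valid UTF-8, as the XML declaration claims.

    If the input does not decode as UTF-8, it is assumed to be ISO-8859-1
    and is transcoded. This is a heuristic, not a guarantee.
    """
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        return data.decode("iso-8859-1").encode("utf-8")


# A document that declares UTF-8 but was actually written as ISO-8859-1.
mislabeled = '<?xml version="1.0" encoding="UTF-8"?><t>Población</t>'.encode(
    "iso-8859-1"
)
root = ET.fromstring(repair_declared_utf8(mislabeled))
```

Files containing only ASCII characters are identical in both encodings, which is consistent with the problem surfacing only when accented characters appeared.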

On the other hand, the use of XML as a standardized format for data storage proved to be somewhat more problematic. One significant problem was in the area of performance: the parsing of XML files and the creation of XML-format data from in-memory representations often resulted in a performance bottleneck. Our use of XML was limited to a subset of the XML specification; our documents contained no CDATA sections or processing instructions, for example, and entity references were limited to character references for special characters such as em-dashes and zero-width spaces. This experience suggests that some kind of “XML Lite” parser, which understands a subset of well-formed XML, might be useful in applications such as GIDS.

The XML parsers themselves proved to be inconsistent in their treatment of documents, most notably in the area of whitespace handling. The portion of the XML specification dealing with whitespace is perhaps the most ambiguous section of the entire document. Consequently, the behavior of XML parsers varies from one parser to the next more in this regard than in any other area. We found ourselves jumping through quite a few hoops, inserting numerous zero-width space characters to “shield” ordinary whitespace from being stripped by the parser. This would not have been such a trying task had all of the parsers we used behaved the same way regarding which whitespace to strip and which to preserve.
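The shielding technique can be sketched as follows in Python. Exactly where the zero-width spaces were inserted in GIDS documents is not stated, so this sketch assumes the simplest case: text whose leading or trailing whitespace must survive a parser that trims it. The function names are hypothetical, and Python's str.strip() stands in for a whitespace-stripping parser.

```python
ZWSP = "\u200b"  # zero-width space; not treated as whitespace by str.strip()


def shield(text: str) -> str:
    """Wrap text in zero-width spaces if it starts or ends with whitespace."""
    if text and (text[0].isspace() or text[-1].isspace()):
        return f"{ZWSP}{text}{ZWSP}"
    return text


def unshield(text: str) -> str:
    """Remove the shielding added by shield(), if present."""
    if len(text) >= 2 and text[0] == ZWSP and text[-1] == ZWSP:
        return text[1:-1]
    return text


original = "  Total: "
stripped_plain = original.strip()              # whitespace is lost
stripped_shielded = shield(original).strip()   # whitespace survives
```

Because the zero-width space has no visible width, a renderer that honors it can leave the shielding characters in place without affecting the output.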

4.2. XSL-FO and SFO

Throughout the development of GIDS, our goal has been to merge SFO with XSL-FO once the latter specification was finalized. At this point, however, it is not clear whether this will be feasible. For example, the table border models provided by XSL-FO (http://www.w3.org/TR/xsl/slice6.html#section-N15442-Formatting-Objects-for-Tables) do not appear to offer sufficient flexibility to allow all of the border variations required by the Economic Census forms. In addition, one of the requirements common to both the Economic Census and Decennial Census projects has been the need for absolute precision and reproducibility of the visual appearance of the final output. This idea of “visual fidelity” (the notion that two different XSL-FO processors will, when presented with identical XSL-FO documents, produce identical outputs) has explicitly been stated to be a non-goal of the XSL-FO standardization effort. Therefore, even if it were possible to create an XSL-FO document that produces the desired result with one XSL-FO processor, the chances that the same document will work unchanged with another XSL-FO processor are very slim.

4.3. Electronic vs. paper layouts

One of the most important lessons learned was the realization that layout features that work well with paper-based forms very often do not work well with electronic forms, and vice versa. There are simply too many differences between the two media. In the context of GIDS, this has meant that while content data can be shared between electronic and paper versions of a form, virtually none of the layout can. Thus, while GIDS has substantially reduced the amount of effort required to create high quality paper and electronic forms, it has not yet achieved the elusive goal of automated translation between the two media.

In a similar vein, GIDS has taught us that providing easy-to-use tools is not enough—good form design still requires experience, skill and an aesthetic sense on the part of the tool user. This aspect parallels the desktop publishing experience of a decade ago: As soon as low-cost desktop publishing software, printers, etc. became available, publishing came within easy reach of nearly everyone. But, as it turned out, the availability of powerful publishing tools simply made good publishers more productive; it did not turn bad publishers into good ones.

5. Summary

In summary, the GIDS project stands as a successful example of the use of XML as both a data interchange format and a data storage format. It highlights the many strengths of XML, along with some of its weaknesses. Our experiences with XML suggest the following tasks for Fenestra during the coming months:

  • Continue using XML as the data interchange format of choice for all applications.

  • Explore the possibility of using simplified XML parsers in situations that do not require the full expressive power of XML (i.e., the great majority of applications), especially where performance is a consideration.

  • Consider the use of non-XML formats for data storage; to ensure the greatest level of interoperability, any non-XML format should be designed to allow straightforward translation to and from XML.

  • Merge SFO with XSL-FO as far as possible, and document any limitations that arise; these may form the basis for the design of a second-generation descendant of XSL-FO.

Glossary

GIDS

Generalized Instrument Design System

KFI

Key From Image

OCR

Optical Character Recognition

PDF

Portable Document Format

SFO

Survey Formatting Objects

XML

eXtensible Markup Language

XSL-FO

eXtensible Stylesheet Language Formatting Objects

Biography

Steve Schafer is Chief Technology Officer for Fenestra Technologies Corporation, a research-driven software development company located in the Washington DC metropolitan area.