XML Europe 2002

There are no unstructured documents

Abstract

Printed documents have sufficient visual markers to make their structure obvious. We discuss the theory behind, and demonstrate, a system that converts documents to XML based on their two-dimensional visual structure.


All of us recognize the usefulness of having our documents in a structured, parsable, reusable format such as XML, but we have assumed that we must add this structure explicitly, at the cost of substantial human effort.
We think this approach to human-computer interaction is backwards: just as we do not expect people to read XML-tagged documents directly, we should not expect people to write such documents either. What is good enough for humans should be good enough for our increasingly powerful machines.
We have created a system that reads PostScript and PDF files produced by word-processing and desktop-publishing products, and extracts two-dimensional visual representations of the documents. From these representations, document structures can be identified, and valid and rich XML files can be generated. This approach to document conversion is proving more fruitful than the traditional methods of analyzing one-dimensional streams of typographic codes.
The motivation for our approach is our contention that virtually all documents already have sufficient visual markers of structure to make them parsable. The clearest indication of this is the commonplace observation that we rarely encounter a printed document whose logical structure we cannot mentally parse at a glance.
A document that fails this test is often ambiguous to us humans, let alone machines, even after reflection, because its author (or typographer) did not sufficiently follow commonly understood rules of layout. As negative examples, some design-heavy publications deliberately flout as many of these rules as they can get away with; this is entertaining, but hardly helps us to discern their structure.
Dimensions of identification
What do we mean by one- and two-dimensional structure identification? One-dimensional identification parses a document as a linear sequence of content and typographical codes, and then takes advantage of prior knowledge of the meanings of those codes. For example, \b in RTF indicates bold text. The general paradigm is procedural, similar to a tape-driven numerical control machine in a high-tech factory: Do this. Now do this. Now do this. ...
By contrast, two-dimensional identification parses the page into objects, based on their formatting, position, and context. It then considers page layout holistically, building on the observation that pages tend to have a structure or geometry that includes one or more of header, footer, body, and footnotes. Then, a software system can identify objects and higher-level structures using general knowledge of typographic principles. This paradigm is declarative: At this location on the page there is an object with such and such a set of properties. We don't care how (by what set of steps) an object came to be, only that it exists.
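As a rough illustration of the declarative view, a page can be modeled as a collection of objects that carry only position and formatting properties, and each object can be assigned to a page region from its position alone. The `PageObject` fields, the coordinate convention, and the 10% margin threshold below are assumptions made for this sketch; they are not details of the system described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageObject:
    """A declarative description of something on the page (hypothetical fields)."""
    text: str
    x: float      # left edge, in points from the left of the page
    y: float      # top edge, in points from the top of the page
    font: str
    size: float


def region_of(obj: PageObject, page_height: float, margin: float = 0.1) -> str:
    """Assign an object to header, footer, or body from its vertical position."""
    if obj.y < page_height * margin:
        return "header"
    if obj.y > page_height * (1 - margin):
        return "footer"
    return "body"


page_height = 792.0  # US Letter height in points (assumed page size)
objects = [
    PageObject("Chapter 3", x=72, y=40, font="Helvetica", size=9),
    PageObject("Printed documents have structure.", x=72, y=300, font="Times", size=11),
    PageObject("17", x=300, y=760, font="Helvetica", size=9),
]
regions = [region_of(o, page_height) for o in objects]
print(regions)
```

Nothing in this representation records how an object was produced; only where it is and what it looks like.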
A major problem with the one-dimensional approach is that exactly the same construct can be created in many different ways. Both the author (the user of the application) and the programmer who created the application have many choices, and therefore many different sets of procedural codes can produce the same result. For example, a simple indented bulleted list item can be constructed by the user in several different ways.
For each of these methods, a different set of typographical codes will be generated, even though the constructs they represent are identical. In addition, different word-processing and typesetting applications will emit different codes for the same user actions.
By contrast, two-dimensional, declarative structure identification decreases the number of parameters that need to be considered when deciding on the identity of a particular construct. This makes the system more robust. In the examples above, the only parameters that would be considered when determining the identity of the object in question would be the location and appearance of the bullet character (and any content that followed it) in relation to their surroundings.
Two-dimensional parsing also more closely emulates the human eye-brain system for understanding printed documents. (This, as we never tire of pointing out, is a mechanism whose effectiveness has been verified by centuries of QA!)
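The contrast between the two paradigms can be sketched in a few lines. The formatting codes below (`indent`, `tab`, `text`) are an invented toy vocabulary, not RTF or any real application's codes, and the two streams are hypothetical ways of producing a bulleted item chosen purely for illustration. The point is that the streams differ while the rendered result, which is all the two-dimensional test looks at, does not.

```python
TAB_WIDTH = 36.0  # points per tab stop (assumed value)
BULLET = "\u2022"


def render(codes):
    """Replay a hypothetical stream of formatting codes.

    Returns the resulting line as (x position of the text, text).
    """
    x = 0.0
    text = ""
    for op, arg in codes:
        if op == "indent":    # paragraph-level left indent, in points
            x += arg
        elif op == "tab":     # advance by `arg` tab stops before the text
            x += arg * TAB_WIDTH
        elif op == "text":
            text += arg
    return (x, text)


def is_bulleted_item(line, body_left=0.0):
    """Two-dimensional test: a bullet mark set in from the body margin."""
    x, text = line
    return x > body_left and text.startswith(BULLET)


# Hypothetical method A: a list style sets the paragraph indent.
stream_a = [("indent", 36.0), ("text", BULLET + " First item")]
# Hypothetical method B: the user tabs once, then types the bullet by hand.
stream_b = [("tab", 1), ("text", BULLET), ("text", " First item")]

line_a = render(stream_a)
line_b = render(stream_b)
print(stream_a == stream_b, line_a == line_b)
print(is_bulleted_item(line_a), is_bulleted_item(line_b))
```

A one-dimensional parser would need a rule for each stream; the positional test needs one rule for both.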
Examples
As we said above, our approach to structure identification is based on determining which sets of typographical properties are distinctive of specific objects. Some examples will demonstrate how human beings use typographical cues to distinguish structure:
Example 1 - numbered lists

Figure 1.

The first example is a simple list, with another nested inside it. This is easy to tell because there are two significant visual clues, indentation and a different numbering style, that the second through fifth items are members of a sub-list.
In the second example the sub-list uses the same numbering style, but it's still indented, so there's really no doubt that it's a nested list.
The third list is a bit more unusual, and many people would say it is typeset badly, since all the items are indented the same. Nonetheless, we can still guess with a high probability of correctness that the second through fifth items are logically nested, since they use a different numbering scheme. Since we can identify the sub-list without it being indented, we believe that the numbering scheme carries more weight than indentation in this context.
The fourth list looks very strange indeed. Not only are all of the items indented the same, they all use the same numbering scheme. You would be quite justified in concluding that the structure of this list can't be deciphered with any confidence. By making certain assumptions, it is possible to guess at a structure, but it might not be the one that the author intended.
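The four numbered lists can be approximated by a small scoring sketch in which a change of numbering scheme counts for more than extra indentation. The weights, the label forms, and the sample lists are assumptions chosen to mirror the discussion above; they are stand-ins for the figure, not measurements from it.

```python
import re

NUMBERING_WEIGHT = 3  # assumed: a change of numbering scheme is the stronger cue
INDENT_WEIGHT = 2     # assumed: extra indentation is the weaker cue


def scheme(label):
    """Classify a label such as '1.', 'a.', or 'A.' (assumed label forms)."""
    core = label.rstrip(".)")
    if core.isdigit():
        return "decimal"
    if re.fullmatch(r"[a-z]+", core):
        return "lower-alpha"
    if re.fullmatch(r"[A-Z]+", core):
        return "upper-alpha"
    return "other"


def nesting_evidence(items):
    """items is a list of (indent, label); the first item is taken as top level.

    Returns one score per item; 0 means no visual evidence of nesting.
    """
    top_indent, top_label = items[0]
    top_scheme = scheme(top_label)
    scores = []
    for indent, label in items:
        score = 0
        if scheme(label) != top_scheme:
            score += NUMBERING_WEIGHT
        if indent > top_indent:
            score += INDENT_WEIGHT
        scores.append(score)
    return scores


list_1 = [(0, "1."), (18, "a."), (18, "b."), (18, "c."), (18, "d."), (0, "2.")]
list_2 = [(0, "1."), (18, "1."), (18, "2."), (18, "3."), (18, "4."), (0, "2.")]
list_3 = [(0, "1."), (0, "a."), (0, "b."), (0, "c."), (0, "d."), (0, "2.")]
list_4 = [(0, "1."), (0, "1."), (0, "2."), (0, "3."), (0, "4."), (0, "2.")]

for sample in (list_1, list_2, list_3, list_4):
    print(nesting_evidence(sample))
```

For the fourth list every score is zero: neither cue is present, so the sketch, like a human reader, has nothing on which to base a confident answer.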
Example 2 - bulleted lists
Now let's see how the observations that we made above apply to bulleted (non-numbered) lists.
The first example clearly contains a nested list: the second through fifth items are indented, and have a different bullet character than the first and sixth items.
The second example is similar. Even though all items have the same bullet character, the indent tells us that some items are nested.
In the third example, even without the indentation we can easily tell that the middle items are in a sub-list because the unfilled bullet characters are clearly distinct.
The fourth example presents something of a dilemma. None of the items are indented, and while we may be able to tell that the second through fifth bullets are different, is that enough to conclude that they indicate items in a sub-list? Perhaps the typographer simply made a mistake and the list has only one level. Here is an example where humans and software programs may rightly conclude that the situation is ambiguous, and just make their best guess. We are led to conclude that, when recognizing a nested bulleted list, indentation has a higher weighting than the choice of list mark.
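For bulleted lists the weighting is reversed, and the outcome depends on how distinct the list marks are. A minimal sketch, assuming a three-step scale for glyph difference (0 for the same mark, 1 for a subtly different mark, 2 for a clearly different mark) and weights picked only so that indentation alone outweighs any difference in list mark:

```python
INDENT_WEIGHT = 3      # assumed: indentation alone outweighs any glyph difference
NESTED_THRESHOLD = 2   # assumed: minimum score to call an item nested


def classify(indented, glyph_difference):
    """Classify a bulleted item relative to the surrounding list.

    glyph_difference: 0 = same mark, 1 = subtly different, 2 = clearly different.
    """
    score = INDENT_WEIGHT * int(indented) + glyph_difference
    if score >= NESTED_THRESHOLD:
        return "nested"
    if score > 0:
        return "ambiguous"
    return "same level"


# Cue values for the middle items of the four examples, as read from the text.
examples = {
    "example 1": (True, 2),   # indented, clearly different bullet
    "example 2": (True, 0),   # indented, same bullet
    "example 3": (False, 2),  # not indented, clearly distinct unfilled bullet
    "example 4": (False, 1),  # not indented, bullets only subtly different
}
results = {name: classify(*cues) for name, cues in examples.items()}
print(results)
```

The "ambiguous" outcome is deliberate: in the fourth example the best a program can do is flag the item and make a guess.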
Example 3 - block quotes
We use several different cues when recognizing a block quote: font and font style, indentation, inter-line spacing, quote marks, and dividers. Let's see how they interact. Note that this is a somewhat simplified example since other constructs such as notes and warnings may have formatting attributes similar to those of block quotes.
The first example is the classic block quote: the quoted material is indented and italicized.
In example two, the quote is again emphasized in two ways: by indentation and point size.
Now look what happens in the next two examples:
[Figures: two block quotes set without indentation, with the font changes retained]
Now the indentation has been removed, but the font changes have been retained. We might recognize these blocks as quotes, but it's not as obvious as before. Indentation seems to be a crucial characteristic of block quotes. If typographers don't want to use this formatting property, they will have to surround the quoted block with explicit quote marks:
[Figure: unindented block quote surrounded by explicit quote marks]
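The interaction of these cues can be summarized as a small decision rule. This is a sketch under simplifying assumptions: it uses only three of the cues listed above (indentation, font change, explicit quote marks), the `BlockCues` fields and the three outcome labels are invented for illustration, and it ignores look-alike constructs such as notes and warnings.

```python
from dataclasses import dataclass


@dataclass
class BlockCues:
    """Visual cues observed on a block of text (hypothetical fields)."""
    indented: bool
    font_change: bool   # italic, different face, or different point size
    quote_marks: bool   # block is wrapped in explicit quotation marks


def block_quote_likelihood(cues):
    """Rate how strongly the cues suggest a block quote."""
    if cues.quote_marks:
        return "likely"      # explicit marks compensate for missing indentation
    if cues.indented and cues.font_change:
        return "likely"      # the classic block quote
    if cues.indented or cues.font_change:
        return "possible"    # one cue alone is not conclusive
    return "unlikely"


classic = BlockCues(indented=True, font_change=True, quote_marks=False)
font_only = BlockCues(indented=False, font_change=True, quote_marks=False)
marked = BlockCues(indented=False, font_change=True, quote_marks=True)
plain = BlockCues(indented=False, font_change=False, quote_marks=False)

for cues in (classic, font_only, marked, plain):
    print(block_quote_likelihood(cues))
```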
Object taxonomy
Our empirical research, based on an examination of thousands of pages of documents, has produced a taxonomy of objects that are in general use in documents, along with visual (typographic) cues that communicate the objects on the page or screen, and an analysis of which combinations of cues are sufficient to identify specific objects. These results provide a repository of typographic knowledge that is employed in the structure identification process.
This taxonomy classifies objects into the typical categories that you might expect (block/inline, text/non-text), with some finer categorizations, to distinguish different kinds of titles and lists, for example.
What is interesting is that after a few weeks of research the number of new, distinct object types that we were encountering dropped significantly, and the ones that we were discovering were generally in use in a minority of documents. Furthermore, most documents tend to use a relatively small subset of these objects. Naturally, we don't believe that we have captured all possible object types. But this experience leads us to conclude that the set of object types in general use in typography is manageably finite.
The set of visual cues or characteristics that concretely realize graphical objects is not as easy to capture, since for each object, there are many ways that individual authors will format it. Just think of the number of different ways of formatting as simple an object as a title. Even in this case, however, we have found that the majority of documents use a fairly well-defined set of typographical conventions.
Although the list of objects and visual cues is large, it is sufficiently stable over time to provide the common, reliable protocol that authors use to communicate with their readers, just as programs use XML to communicate with each other.
A working system for structure identification and conversion
What follows is an outline of the system.
[Figure: outline of the system]
Conclusions
Two-dimensional structure identification, which attempts to model the ways in which human beings actually discern the structure of documents, is proving to be a useful approach to the problem of making electronic documents available to computers for purposes that go beyond simple word-processing. A set of typographical conventions that are used in the vast majority of documents has been compiled, and work is progressing on expressing these conventions in a software system.
We have had a great deal of success in identifying both simple and complex structures via two-dimensional parsing techniques, and have been able to parse a wide class of documents: for example, legal rulings, textbooks, novels and other trade publications, and software manuals.
While the initial motivation for our work was, and continues to be, largely XML-centric, structure recognition is not just about generating XML files; we envision other applications as well.

Biography

Chief Technology Officer

David Slocombe's career in computing began in 1969 while he was a Canadian newspaper reporter. During the next 30 years he developed many applications of computing to journalism and publishing. He was a founder and vice-president of SoftQuad Inc. and was the architect of SoftQuad's first product-line, SoftQuad Publishing Software. He contributed to the early development of DSSSL. Until recently he was a consultant at Tata Infotech Ltd. in Bombay, India. He is currently Chief Technology Officer at Exegenix Research Inc. in Toronto, a company he co-founded in 2001.

Document Conversion Analyst

Rodney Boyd worked at SoftQuad for 10 years, writing about and using SGML, HTML, and XML. He wrote manuals for Author/Editor, XMetaL, HoTMetaL, and other products. He was dragged back bodily from Mexico to work as a Document Conversion Analyst at Exegenix Research Inc.