There are no unstructured documents
Abstract
Printed documents have sufficient visual markers to make their
structure obvious. We discuss the theory behind, and demonstrate, a system that
converts documents to XML based on their two-dimensional visual
structure.
There are no unstructured documents
All of us recognize the usefulness of having our documents in a
structured, parsable, reusable format such as XML, but we have assumed that we
must add this structure explicitly, at the cost of substantial human effort.
We think this approach to human-computer interaction is backwards:
just as we do not expect people to read XML-tagged documents directly,
we should not expect them to write them either. What is good enough for
humans should be good enough for our increasingly powerful machines.
We have created a system that reads PostScript and PDF files
produced by word-processing and desktop-publishing products, and extracts
two-dimensional visual representations of the documents. From these
representations, document structures can be identified, and valid and rich XML
files can be generated. This approach to document conversion is proving more
fruitful than the traditional methods of analyzing one-dimensional streams of
typographic codes.
The motivation for our approach is our contention that virtually all
documents already have sufficient visual markers of structure to make them
parsable. The clearest indication of this is the commonplace observation that
we rarely encounter a printed document whose logical structure we cannot
mentally parse at a glance.
A document that fails this test is often ambiguous to us
humans, let alone machines, even after reflection, because its author
(or typographer) did not sufficiently follow commonly understood rules of
layout. As negative examples, some design-heavy publications deliberately flout
as many of these rules as they can get away with; this is entertaining, but
hardly helps us to discern their structure.
Dimensions of identification
What do we mean by one- and two-dimensional structure identification?
One-dimensional identification parses a document as a linear sequence of
content and typographical codes, and then takes advantage of prior knowledge of
the meanings of those codes. For example, \b in RTF indicates bold
text. The general paradigm is procedural, similar to a tape-driven
numerical control machine in a high-tech factory: "Do this. Now do this.
Now do this. ..."
By contrast, two-dimensional identification parses the page into
objects, based on their formatting, position, and context. It then considers
page layout holistically, building on the observation that pages tend to have a
structure or geometry that includes one or more of header, footer, body, and
footnotes. Then, a software system can identify objects and higher-level
structures using general knowledge of typographic principles. This paradigm is
declarative: "At this location on the page there is an object with
such and such a set of properties." We don't care how (by what set of
steps) an object came to be, only that it exists.
A major problem with the one-dimensional approach is that exactly the
same construct can be created in many different ways. Both the author (the
user of the application) and the programmer who created the application
have many choices, and therefore many sets of procedural codes can be
encountered that produce the same results. For example, a simple indented
bulleted list item can be constructed by the user thus:
- Inserting a number of spaces using the space bar, and then
  inserting a bullet character
- Inserting a tab, and then a bullet character
- Doing either of the above, but instead of entering a bullet
  character, entering a period, increasing its font size, and superscripting
  it
- Clicking on the word-processor's "Bulleted list" tool
For each of these methods, a different set of typographical codes
will be generated. And that's to represent identical constructs. In
addition, different word-processing and typesetting applications will emit
different codes for the same user actions.
By contrast, two-dimensional, declarative structure identification
decreases the number of parameters that need to be considered when deciding on
the identity of a particular construct. This makes the system more robust. In
the examples above, the only parameters that would be considered when
determining the identity of the object in question would be the location and
appearance of the bullet character (and any content that followed it) in
relation to their surroundings.
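
As a rough illustration of the declarative view, the sketch below describes a line purely by what is painted where. The `Mark` record, the `BULLET_GLYPHS` set, the `body_left` margin, and the coordinates are all assumptions made for this example, not part of the system described here. The enlarged-period variant would need an appearance-based test that this sketch leaves out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """One painted mark on the page (hypothetical record for illustration)."""
    text: str
    x: float  # left edge, in points from the left of the page
    y: float  # baseline, in points from the top of the page


# Assumed set of glyphs that a reader would accept as a bullet.
BULLET_GLYPHS = {"\u2022", "\u25e6"}


def looks_like_bulleted_item(line_marks, body_left, tolerance=1.0):
    """True if the leftmost mark is a bullet glyph indented past the body
    margin and followed by at least one more mark (the item content)."""
    ordered = sorted(line_marks, key=lambda m: m.x)
    if len(ordered) < 2:
        return False
    first = ordered[0]
    return first.text in BULLET_GLYPHS and first.x > body_left + tolerance


# Whether the author used spaces, a tab, or the list tool, the painted
# result is the same: a bullet at x=90 followed by text at x=108.
bulleted_line = [Mark("\u2022", 90.0, 200.0), Mark("First item", 108.0, 200.0)]
plain_line = [Mark("Plain", 72.0, 220.0), Mark("text", 100.0, 220.0)]

is_bullet = looks_like_bulleted_item(bulleted_line, body_left=72.0)
is_plain_bullet = looks_like_bulleted_item(plain_line, body_left=72.0)
```

Only position and appearance enter the decision; nothing about the sequence of codes that produced the line is consulted.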
Two-dimensional parsing also more closely emulates the human
eye-brain system for understanding printed documents. (This, as we never tire
of pointing out, is a mechanism whose effectiveness has been verified by
centuries of QA!)
Examples
As we said above, our approach to structure identification is based
on determining which sets of typographical properties are distinctive of
specific objects. Some examples will demonstrate how human beings use
typographical cues to distinguish structure:
Example 1 - numbered lists
The first example is a simple list, with another nested inside it.
It's easy to tell this because there are two significant visual clues that the
second through fifth items are members of a sub-list.
In the second example the sub-list uses the same numbering style, but
it's still indented, so there's really no doubt that it's a nested list.
The third list is a bit more unusual, and many people would say it is
typeset badly, since all the items are indented the same. Nonetheless, we can
still guess with a high probability of correctness that the second through
fifth items are logically nested, since they use a different numbering scheme.
Since we can identify the sub-list without it being indented, we believe that
the numbering scheme carries more weight than indentation in this context.
The fourth list looks very strange indeed. Not only are all of the
items indented the same, they all use the same numbering scheme. You would be
quite justified in concluding that the structure of this list can't be
deciphered with any confidence. By making certain assumptions, it is possible
to guess at a structure, but it might not be the one that the author
intended.
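
The four numbered lists can be mimicked with a small scoring sketch. The weights, the threshold, and the helper names below are hypothetical values chosen only to reproduce the observations in this example. The sketch handles two levels and compares every item with the first one. With these values either cue alone is enough, so the relative size of the weights would matter only if cues conflicted, which this sketch does not model.

```python
# Assumed weights: a change of numbering scheme is treated as the stronger
# cue, but indentation alone still reaches the threshold.
SCHEME_WEIGHT = 3
INDENT_WEIGHT = 2
NESTED_THRESHOLD = 2


def numbering_scheme(label):
    """Classify a list label such as '1.' or 'a)' (simplified on purpose)."""
    core = label.strip().rstrip(".)")
    if core.isdigit():
        return "decimal"
    if len(core) == 1 and core.isalpha():
        return "alpha"
    return "other"


def nesting_levels(items):
    """items is a list of (label, indent) pairs; returns 0 or 1 per item."""
    first_label, first_indent = items[0]
    first_scheme = numbering_scheme(first_label)
    levels = []
    for label, indent in items:
        score = 0
        if numbering_scheme(label) != first_scheme:
            score += SCHEME_WEIGHT
        if indent > first_indent:
            score += INDENT_WEIGHT
        levels.append(1 if score >= NESTED_THRESHOLD else 0)
    return levels


list_one = [("1.", 0), ("a.", 20), ("b.", 20), ("c.", 20), ("d.", 20), ("2.", 0)]
list_two = [("1.", 0), ("1.", 20), ("2.", 20), ("3.", 20), ("4.", 20), ("2.", 0)]
list_three = [("1.", 0), ("a.", 0), ("b.", 0), ("c.", 0), ("d.", 0), ("2.", 0)]
list_four = [("1.", 0), ("1.", 0), ("2.", 0), ("3.", 0), ("4.", 0), ("2.", 0)]
```

For the fourth list the sketch reports a flat list, which is only a guess: with neither cue present there is nothing to score.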
Example 2 - bulleted lists
Now let's see how the observations that we made above apply to
bulleted (non-numbered) lists.
The first example clearly contains a nested list: the second through
fifth items are indented, and have a different bullet character than the first
and sixth items.
The second example is similar. Even though all items have the same
bullet character, the indent tells us that some items are nested.
In the third example, even without the indentation we can easily tell
that the middle items are in a sub-list because the unfilled bullet characters
are clearly distinct.
The fourth example presents something of a dilemma. None of the items
are indented, and while we may be able to tell that the second through fifth
bullets are different, is that enough to conclude that they indicate items in a
sublist? Perhaps the typographer simply made a mistake and the lists have only
one level. Here is an example where humans and software programs may rightly
conclude that the situation is ambiguous, and just make their best guess. We
are led to conclude that, when recognizing a nested bulleted list, indentation
has a higher weighting than the choice of list mark.
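
One way to express this weighting is a score with an explicit "ambiguous" outcome. The numeric weights, the threshold, and the function name are assumptions made for illustration; the only property taken from the examples is that indentation alone settles the question while a different bullet character alone does not.

```python
# Assumed weights: indentation is decisive, the bullet glyph is a weak hint.
INDENT_WEIGHT = 3
GLYPH_WEIGHT = 1
NESTED_THRESHOLD = 2


def classify_against_first(item, first):
    """Compare an item with the first item of the list.

    Both arguments are (bullet_glyph, indent) pairs. Returns 'nested',
    'ambiguous', or 'same-level'.
    """
    glyph, indent = item
    first_glyph, first_indent = first
    score = 0
    if indent > first_indent:
        score += INDENT_WEIGHT
    if glyph != first_glyph:
        score += GLYPH_WEIGHT
    if score >= NESTED_THRESHOLD:
        return "nested"
    return "ambiguous" if score > 0 else "same-level"


first_item = ("\u2022", 0)  # filled bullet at the outer margin

both_cues = classify_against_first(("\u25e6", 20), first_item)
indent_only = classify_against_first(("\u2022", 20), first_item)
glyph_only = classify_against_first(("\u25e6", 0), first_item)
no_cues = classify_against_first(("\u2022", 0), first_item)
```

Returning "ambiguous" rather than forcing a decision leaves room for a later stage, or a human, to make the best guess.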
Example 3 - block quotes
We use several different cues when recognizing a block quote: font
and font style, indentation, inter-line spacing, quote marks, and dividers.
Let's see how they interact. Note that this is a somewhat simplified example
since other constructs such as notes and warnings may have formatting
attributes similar to those of block quotes.
The first example is the classic block quote: the quoted material is
indented and italicized.
In example two, the quote is again emphasized in two ways: by
indentation and point size.
Now look what happens in the next two examples:
Now the indentation has been removed, but the font changes have been
retained. We might recognize these blocks as quotes, but it's not as obvious as
before. Indentation seems to be a crucial characteristic of block quotes. If
typographers don't want to use this formatting property, they will have to
surround the quoted block with explicit quote marks:
Object taxonomy
Our empirical research, based on an examination of thousands of pages
of documents, has produced a taxonomy of objects that are in general use in
documents, along with visual (typographic) cues that communicate the objects on
the page or screen, and an analysis of which combinations of cues are
sufficient to identify specific objects. These results provide a repository of
typographic knowledge that is employed in the structure identification
process.
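
A repository of this kind could be represented, in a very reduced form, as a table that maps each object type to the cue combinations that are sufficient to recognize it. The table layout, the cue names, and the function below are hypothetical; the block-quote and nested-list entries simply restate the observations from the examples above.

```python
# Hypothetical miniature taxonomy: each object type lists the combinations
# of visual cues that are treated as sufficient to identify it.
SUFFICIENT_CUES = {
    "block-quote": [
        frozenset({"indented", "italic"}),
        frozenset({"indented", "smaller-type"}),
        frozenset({"quote-marks"}),
    ],
    "nested-numbered-list-item": [
        frozenset({"numbering-scheme-differs"}),
        frozenset({"indented", "numbered"}),
    ],
}


def candidate_types(observed_cues):
    """Return the object types for which some sufficient combination of
    cues is fully contained in the observed cues."""
    observed = frozenset(observed_cues)
    return sorted(
        object_type
        for object_type, combinations in SUFFICIENT_CUES.items()
        if any(combo <= observed for combo in combinations)
    )


classic_quote = candidate_types({"indented", "italic"})
font_change_only = candidate_types({"italic"})
```

A block that is only italicized matches nothing here, which mirrors the observation that a font change without indentation is not an obvious block quote.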
This taxonomy classifies objects into the typical categories that you
might expect (block/inline, text/non-text), with some finer
categorizations to distinguish different kinds of titles and lists, for
example.
What is interesting is that after a few weeks of research the number
of new, distinct object types that we were encountering dropped significantly,
and the ones that we were discovering were generally in use in a minority of
documents. Furthermore, most documents tend to use a relatively small subset of
these objects. Naturally, we don't believe that we have captured all possible
object types. But this experience leads us to conclude that the set of object
types in general use in typography is manageably finite.
The set of visual cues or characteristics that concretely realize
graphical objects is not as easy to capture, since for each object, there are
many ways that individual authors will format it. Just think of the number of
different ways of formatting as simple an object as a title. Even in this case,
however, we have found that the majority of documents use a fairly well-defined
set of typographical conventions.
Although the list of objects and visual cues is large, it is
sufficiently stable over time to provide the common, reliable protocol that
authors use to communicate with their readers, just as programs use XML to
communicate with each other.
A working system for structure identification and conversion
What follows is an outline of the system.
- The structure identification process starts with visual data
  acquisition. We begin with a PostScript or PDF file, which is actually a
  program for instructing a PostScript printer or its surrogate how to draw the
  document on a page or the screen. We execute the PostScript or PDF and
  extract information about the basic marks that it paints on the page. In
  particular, for each mark we obtain the type of mark (text, image, drawing
  command, rule), the mark itself (for example, the word 'wildebeest'), the
  mark's location, and its font style (if appropriate). At this stage no
  decisions are made as to the identity of the graphical objects that these marks
  constitute.
- The information gleaned from the visual data acquisition phase is
  then used in the visual tokenization phase, which groups together marks
  that seem to have a relationship to each other, thereby synthesizing the basic
  tokens of visual human communication found in the document. This stage still
  does not attempt to identify objects.
- In the structure identification phase, the system
  considers the properties of each token (appearance, position, and context) and
  arrives at a conclusion as to its identity (title, paragraph, table,
  etc.) and perhaps its relationship to a larger structure, based on
  knowledge of typographic conventions (that is, the information contained in the
  object taxonomy).
- If the system is being used to perform document conversion, there
  will then be an export phase that serializes the structure created in
  the previous phase into an XML file. The format of this file was chosen to be
  easy to transform into customer-specific formats using tools such as
  XSLT.
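
The phases above can be sketched end to end on toy data. This is a minimal illustration under stated assumptions, not the system itself: the `Mark` record, the function names, the "larger type means title" rule, and the `document`, `title`, and `para` element names are all invented for the example. The acquisition phase is skipped by starting from marks that are assumed to have been extracted already.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """A mark as it might come out of acquisition (hypothetical fields)."""
    kind: str         # "text", "image", "rule", ...
    content: str
    x: float          # left edge in points
    y: float          # baseline in points, measured from the top of the page
    font_size: float


def tokenize(marks):
    """Visual tokenization: group text marks that share a baseline into
    line tokens, ordered top to bottom. No identification happens here."""
    by_baseline = defaultdict(list)
    for mark in marks:
        if mark.kind == "text":
            by_baseline[round(mark.y)].append(mark)
    lines = []
    for y in sorted(by_baseline):
        words = sorted(by_baseline[y], key=lambda m: m.x)
        lines.append({
            "text": " ".join(w.content for w in words),
            "font_size": max(w.font_size for w in words),
        })
    return lines


def identify(lines, body_size):
    """Structure identification with one toy rule: a line set larger than
    the body size is a title, anything else is a paragraph."""
    return [
        ("title" if line["font_size"] > body_size else "para", line["text"])
        for line in lines
    ]


def export(objects):
    """Export: serialize the identified objects as a flat XML document."""
    root = ET.Element("document")
    for tag, text in objects:
        ET.SubElement(root, tag).text = text
    return ET.tostring(root, encoding="unicode")


marks = [
    Mark("text", "Wildebeest", 72, 100, 18),
    Mark("text", "migration", 170, 100, 18),
    Mark("text", "Herds", 72, 130, 10),
    Mark("text", "move", 105, 130, 10),
    Mark("text", "north.", 135, 130, 10),
]
xml_text = export(identify(tokenize(marks), body_size=10))
```

Keeping the phases as separate functions mirrors the outline: tokenization never names an object, and export never looks at geometry.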
Conclusions
Two-dimensional structure identification, which attempts to model the
ways in which human beings actually discern the structure of documents, is
proving to be a useful approach to the problem of making electronic documents
available to computers for purposes that go beyond simple word-processing. A
set of typographical conventions that are used in the vast majority of
documents has been compiled, and work is progressing on expressing these
conventions in a software system.
We have had a great deal of success in identifying both simple and
complex structures via two-dimensional parsing techniques, and have been able
to parse a wide class of documents: for example, legal rulings, textbooks,
novels and other trade publications, and software manuals.
While the initial motivation for our work is and continues to be
largely XML-centric, structure recognition is not just about generating XML
files. Some other applications that we envision are:
- Natural language parsing will benefit from the ability to
  recognize the hierarchical structure of documents.
- Search engines could use structure identification to identify the
  most relevant parts of documents and thereby provide better indexing
  capabilities.
David Slocombe's career in computing began in 1969 while he was a
Canadian newspaper reporter. During the next 30 years he developed many
applications of computing to journalism and publishing. He was a founder and
vice-president of SoftQuad Inc. and was the architect of SoftQuad's first
product-line, SoftQuad Publishing Software. He contributed to the early
development of DSSSL. Until recently he was a consultant at Tata Infotech Ltd.
in Bombay, India. He is currently Chief Technology Officer at Exegenix Research
Inc. in Toronto, a company he co-founded in 2001.
Document Conversion Analyst
Rodney Boyd worked at SoftQuad for 10 years, writing about and
using SGML, HTML, and XML. He wrote manuals for Author/Editor, XMetaL,
HoTMetaL, and other products. He was dragged back bodily from Mexico to work as
a Document Conversion Analyst at Exegenix Research Inc.