XML 2003

Converting PDF to XML with Publication-Specific Profiles

Abstract

The PDF format has evolved from a pure page-layout format into a sophisticated one that can store structural information about the underlying document. Many PDF files, however, especially publications, are still generated without any semantics. In this paper, we present software that reconstructs the missing semantics of PDF publications. Our “PDF Semantic Extractor”, which is the first-stage implementation of the “Smart Translator” proposed in “Extracting Semantic Knowledge From PDF Publications” [1], consists of two main parts: an Acrobat plug-in that retrieves semantics from PDF publications and converts them to XML, and server-side scripts combined with XSLT [5] transformations that convert this raw information to interactive SVG documents with special styling attributes assigned to semantic objects.

The PDF Semantic Extractor starts by analyzing lines and drawings on a page. A set of customizable filters leaves only the vertical and horizontal lines that can be considered separators. A custom closed-path construction algorithm then draws boundaries around zones. Each zone is defined as an area that can contain one or more articles. At the same time, text blocks are formed out of text runs. In PDF, a text run can contain a full word, a partial word, or even a single letter. These atomic objects are combined to form bigger blocks, to which roles can be assigned. Possible text block roles are “title”, “author”, “article text”, etc. A fully customizable, heuristics-based rule engine assigns roles to the text blocks, and a profile engine stores different styling attributes for different publications. The last step combines the two groups of objects and builds the final semantic tree. Zones and blocks are further analyzed to construct separate articles, ads, etc. In addition, articles are connected to any continuation they might have on other pages.

The extracted semantic information is used for two main purposes. First, it is indexed and stored in a database, allowing field-based queries on archives of PDF publications. Second, it is attached to the same PDF document as XMP (a subset of RDF) metadata. This promotes modularity, and the PDF is enriched with the initially missing semantics. Furthermore, the embedded XML can easily be retrieved from the original document using server-side scripting and converted to SVG [4] with XSLT transformations. The transformation applied customizes the document for different media, allows localization, and puts emphasis on required or tailored subsets of the data, for example by highlighting an article or underlining searched keywords. The end result is an interactive SVG, which contains both the objects of the original document and the semantic structures related to these objects.

Table of Contents

1. Introduction
2. Reconstruction Process Overview
3. Zone Construction
3.1. Filtering and Preparing
3.2. Merging and Connecting Lines
3.3. Zone Construction Algorithm
4. Text Block Construction
5. Profile Engine
6. Assigning Roles To Blocks
7. Article Construction
8. Storage and Visualizations
Bibliography
Biography

1. Introduction

The PDF format has evolved from a pure page-layout format into a sophisticated one that can store structural information about the underlying document. On the other hand, virtual printing is still the source of a great many PDF files, especially in the publishing industry and in office document exchange. Virtual printers do not preserve any semantic information, which leaves the semantic structure of such documents to be reconstructed afterwards.

Our team is developing several tools to address this problem. All of them are based on the knowledge-acquisition technique developed by Y. Khramov. Semantic reconstruction based on publication-specific profiles is used for documents with extremely complex and diverse layouts, such as newspapers and magazines, and is part of the Smart Translator project, under development by NewspaperDirect with SchemaSoft’s participation. The main principles of this approach were discussed in our presentation at XML 2001 http://www.idealliance.org/papers/xml2001/papers/html/04-04-06.html. In parallel with this work, SchemaSoft is developing a PDF to MS Word converter based on the same approach, but without publication-specific profiles.

Implementing the ideas described in the previous paper required the development of several original algorithms, such as A. Gurcan’s zone reconstruction algorithms and an intricate text block construction mechanism. This paper concentrates on the additions and details that we have developed since 2001.

2. Reconstruction Process Overview

In its present form, the PDF Semantic Extractor is implemented as a plug-in to Adobe Acrobat [2] [3]. The process starts when the document (a PDF representation of a newspaper or a magazine) is loaded into Acrobat. At the same time, the plug-in retrieves the profile for the publication from the profile database. This profile (stored as an XML document) customizes the reconstruction process based on knowledge about the selected type of publication. The plug-in also contains a UI, developed by the NewspaperDirect team, that helps to set up the profile for a new type of publication very quickly.

The initial semantic state of a PDF page is shown in Figure 1. It is a completely flat layout:

Figure 1. Initial Semantic State of a Page

In accordance with the idea of combining “bottom-up” and “top-down” reconstruction, the PDF Semantic Extractor starts by analyzing lines and drawings on the page. A set of customizable filters leaves only the vertical and horizontal lines that can be considered separators. A custom closed-path construction algorithm then draws boundaries around zones. Each zone is defined as an area that can contain one or more articles.

At the same time, text blocks are formed out of text runs. In PDF, a text run can contain a full word, a partial word, or even a single letter. These atomic objects are combined to form bigger blocks, to which roles can be assigned. Possible text block roles are “title”, “author”, “article text”, etc. A fully customizable, heuristics-based rule engine assigns roles to the text blocks, and a profile engine stores different styling attributes for different publications. The last step combines the two groups of objects and builds the final semantic tree. Zones and blocks are further analyzed to construct separate articles, ads, etc. In addition, articles are connected to any continuation they might have on other pages.

The intended final semantic state of a page is shown in Figure 2:

Figure 2. Intended Final Semantic State of a Page

The extracted semantic information can be used for two main purposes. First, it can be indexed and stored in a database, allowing field-based queries on archives of PDF publications. Second, it can be attached to the same PDF document as XMP (a subset of RDF [5]) metadata. This promotes modularity, and the PDF is enriched with the initially missing semantics. Furthermore, the embedded XML can easily be retrieved from the original document using server-side scripting and converted to SVG with XSLT transformations. The transformation applied can customize the document for different media, allow localization, and put emphasis on required or tailored subsets of the data, for example by highlighting an article or underlining searched keywords. The end result is an interactive SVG, which contains both the objects of the original document and the semantic structures related to these objects.

The rest of the paper describes the most important steps in the reconstruction process.

3. Zone Construction

In some publications, separators are extensively used to draw boundaries around articles. Our zone construction algorithm analyzes each page and divides it into zones, which enclose one or more articles, ads etc. Only separators are taken into consideration when trying to construct these zones. Steps of the zone construction algorithm are summarized below:

3.1. Filtering and Preparing

The geometric objects on the page (such as lines, rectangles etc.) are filtered so that only objects believed to be separators remain. Several custom filters are used at this point. For example, our algorithm looks at the stroke and fill characteristics of a closed path to decide whether it can be classified as a separator.

In addition, all open and closed paths are converted to lines. This allows us to have a simpler zone construction algorithm. As a result, the only input to the zone construction algorithm is a set of horizontal or vertical lines.
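The filtering step can be sketched roughly as follows. This is a minimal illustration, not the plug-in's code: the `Line` type, the `rect_to_lines` and `filter_separators` helpers, and the way the thresholds are applied are all assumptions. The parameter names simply echo the `maxSeparatorWidth` and `minSeparatorLength` attributes that appear in the profile excerpt in Section 5, whose exact semantics the paper does not spell out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Line:
    """Axis-aligned segment: y1 == y2 for horizontal, x1 == x2 for vertical."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def length(self) -> float:
        # One of the two differences is always zero for an axis-aligned line.
        return abs(self.x2 - self.x1) + abs(self.y2 - self.y1)


def rect_to_lines(x: float, y: float, w: float, h: float,
                  max_separator_width: float = 30.0) -> List[Line]:
    """Convert a filled rectangle into candidate separator lines (hypothetical helper)."""
    if h <= max_separator_width and w > h:      # thin and wide: one horizontal rule
        cy = y + h / 2
        return [Line(x, cy, x + w, cy)]
    if w <= max_separator_width and h > w:      # thin and tall: one vertical rule
        cx = x + w / 2
        return [Line(cx, y, cx, y + h)]
    # Anything thicker contributes its four edges (an assumption of this sketch).
    return [Line(x, y, x + w, y), Line(x, y + h, x + w, y + h),
            Line(x, y, x, y + h), Line(x + w, y, x + w, y + h)]


def filter_separators(lines: List[Line],
                      min_separator_length: float = 50.0) -> List[Line]:
    """Keep only lines long enough to plausibly act as separators."""
    return [ln for ln in lines if ln.length >= min_separator_length]
```

After this step every surviving object is a plain horizontal or vertical line, which is the only input the later stages need.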

3.2. Merging and Connecting Lines

Although some lines are very close, they do not intersect. The zone construction algorithm makes a second pass on all lines, and tries to merge non-intersecting close ones. There are two types of merging:

  1. Same axis merging (e.g. combining two horizontal lines into one)

  2. Different axis merging (e.g. extending a horizontal line in order to intersect it with a vertical line)

Both merging and intersecting are considered possible when the lines are close enough to each other both horizontally and vertically. The closeness constant can be configured using the profile.

When we are merging two close horizontal or vertical lines, we simply remove the shorter one. The exception to this rule is when the lengths of the two lines are equal or very close to each other. In this case, we remove the one that is at the bottom, or on the right.
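A minimal sketch of same-axis merging under these rules is given below. The `Seg` representation, the `same_length_tol` threshold for "very close" lengths, and the convention that a larger `pos` means lower on the page (for horizontal lines) or further right (for vertical lines) are assumptions made for illustration; the default tolerance echoes the `closeEnough="30"` value from the profile excerpt.

```python
from typing import List, NamedTuple


class Seg(NamedTuple):
    pos: float    # fixed coordinate: y for a horizontal line, x for a vertical one
    start: float  # smaller end of the varying coordinate
    end: float    # larger end of the varying coordinate

    @property
    def length(self) -> float:
        return self.end - self.start


def close_enough(a: Seg, b: Seg, tol: float) -> bool:
    """True when the lines are near each other both across and along their axis."""
    across = abs(a.pos - b.pos)
    along = max(a.start, b.start) - min(a.end, b.end)  # <= 0 when the spans overlap
    return across <= tol and along <= tol


def loser(a: Seg, b: Seg, same_length_tol: float) -> Seg:
    """Pick the line to drop: the shorter one, or the lower/right one on a near tie."""
    if abs(a.length - b.length) <= same_length_tol:
        return a if a.pos > b.pos else b
    return a if a.length < b.length else b


def merge_same_axis(segs: List[Seg], tol: float = 30.0,
                    same_length_tol: float = 1.0) -> List[Seg]:
    """Merge lines of one orientation by dropping one line of every close pair."""
    kept: List[Seg] = []
    for seg in segs:
        survives = True
        for other in list(kept):
            if close_enough(seg, other, tol):
                if loser(seg, other, same_length_tol) is other:
                    kept.remove(other)   # the new line wins over an earlier one
                else:
                    survives = False     # the new line is dropped
                    break
        if survives:
            kept.append(seg)
    return kept
```

The function is called once for the horizontal lines and once for the vertical ones, since the logic is the same for both orientations.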

When we are connecting a horizontal and a vertical line, the decision is which line to extend to intersect with the other. If we think of the two lines as a “T” shaped intersection, we always extend the portion that would be the vertical part of “T”, when placed properly. Figure 3 demonstrates this:

Figure 3. Connecting Lines

In all the cases shown in Figure 3, we extend line “2” to intersect line “1”. The same rule applies if one line already intersects and extends beyond the other. In this case, the vertical part of the “T” joint is shrunk to exactly intersect with the horizontal part.
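The connection rule might be sketched as follows. The `H` and `V` types and the `connect` function are hypothetical, and the test used to decide which line is the stem of the “T” (the line with an end point near the body of the other line) is an assumption of this sketch; corner cases where both lines fall short of each other are not handled.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class H:
    """Horizontal line at height y spanning x1..x2 (x1 <= x2)."""
    y: float
    x1: float
    x2: float


@dataclass(frozen=True)
class V:
    """Vertical line at position x spanning y1..y2 (y1 <= y2)."""
    x: float
    y1: float
    y2: float


def connect(h: H, v: V, tol: float = 30.0) -> Tuple[H, V]:
    """Return (h, v) with the stem of the "T" ending exactly on the bar.

    Using abs() covers both situations from the text: a stem that stops
    short is extended, and a stem that overshoots slightly is shrunk.
    """
    # Case 1: v is the stem; it sits within h's span and one end is near h.
    if h.x1 <= v.x <= h.x2:
        if abs(v.y1 - h.y) <= tol:
            return h, replace(v, y1=h.y)
        if abs(v.y2 - h.y) <= tol:
            return h, replace(v, y2=h.y)
    # Case 2: h is the stem; it sits within v's span and one end is near v.
    if v.y1 <= h.y <= v.y2:
        if abs(h.x1 - v.x) <= tol:
            return replace(h, x1=v.x), v
        if abs(h.x2 - v.x) <= tol:
            return replace(h, x2=v.x), v
    return h, v  # not close enough, or a clean crossing: leave both untouched
```

For example, `connect(H(100, 0, 95), V(100, 0, 300))` extends the horizontal line so that it ends at `x = 100`, on the vertical line.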

Once merging and connecting are complete, what we intend to end up with is a figure like Figure 4:

Figure 4. Filtered and Merged Connectors on a Page

3.3. Zone Construction Algorithm

Our zone construction algorithm can be summarized as follows: We start with a region, initially as big as the page. At the end of each iteration, we find a set of lines that form a closed path, which we consider to be a zone. We then exclude this zone (area enclosed by the path) from the region. We then find a corner of the remaining region, and repeat the procedure until the region is empty. At this time we have a list of paths that are our zones.

Our default search order is upwards, leftwards, downwards and then rightwards, but the algorithm changes this sequence on the fly if needed.
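The path-tracing procedure itself is not reproduced here. As a rough illustration of what the stage computes, the sketch below uses a different and simpler technique: it cuts the page into grid cells at every separator coordinate and flood-fills across cell edges that are not covered by a separator. Each connected group of cells then plays the role of a zone. The function name, the tuple layouts and the cell-based output are all assumptions of this sketch.

```python
from typing import List, Tuple

HLine = Tuple[float, float, float]   # (y, x1, x2)
VLine = Tuple[float, float, float]   # (x, y1, y2)
Cell = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def build_zones(page_w: float, page_h: float,
                h_lines: List[HLine], v_lines: List[VLine]) -> List[List[Cell]]:
    """Group grid cells into zones; separators act as walls between cells."""
    xs = sorted({0.0, page_w, *(x for x, _, _ in v_lines),
                 *(c for _, x1, x2 in h_lines for c in (x1, x2))})
    ys = sorted({0.0, page_h, *(y for y, _, _ in h_lines),
                 *(c for _, y1, y2 in v_lines for c in (y1, y2))})
    ncols, nrows = len(xs) - 1, len(ys) - 1

    def wall_left_of(col: int, row: int) -> bool:
        """Is the edge between cells (col-1, row) and (col, row) a separator?"""
        x, ya, yb = xs[col], ys[row], ys[row + 1]
        return any(vx == x and y1 <= ya and yb <= y2 for vx, y1, y2 in v_lines)

    def wall_above(col: int, row: int) -> bool:
        """Is the edge between cells (col, row-1) and (col, row) a separator?"""
        y, xa, xb = ys[row], xs[col], xs[col + 1]
        return any(hy == y and x1 <= xa and xb <= x2 for hy, x1, x2 in h_lines)

    seen = set()
    zones: List[List[Cell]] = []
    for start in ((c, r) for r in range(nrows) for c in range(ncols)):
        if start in seen:
            continue
        seen.add(start)
        zone: List[Cell] = []
        stack = [start]
        while stack:
            c, r = stack.pop()
            zone.append((xs[c], ys[r], xs[c + 1], ys[r + 1]))
            neighbours = []
            if c > 0 and not wall_left_of(c, r):
                neighbours.append((c - 1, r))
            if c + 1 < ncols and not wall_left_of(c + 1, r):
                neighbours.append((c + 1, r))
            if r > 0 and not wall_above(c, r):
                neighbours.append((c, r - 1))
            if r + 1 < nrows and not wall_above(c, r + 1):
                neighbours.append((c, r + 1))
            for n in neighbours:
                if n not in seen:
                    seen.add(n)
                    stack.append(n)
        zones.append(zone)
    return zones
```

One property this stand-in shares with the description above is that a separator which does not help close a region has no effect: the fill simply flows around its free end.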

Sample Newspaper:

Figure 5. Vancouver Sun

Constructed Zones:

Figure 6. Zones of Vancouver Sun Sample

4. Text Block Construction

The objective here is to construct blocks of text to which we can later assign roles. In PDF documents, text runs do not have to appear in consecutive order, nor do they need to be complete. Therefore, words cannot be determined by simply iterating over the text runs. Text block construction involves two steps:

  1. Construction of words based on text-runs

  2. Construction of text blocks based on words

The first step is fairly simple using the Acrobat SDK [2], a tool we use to access the PDF document. For English publications, the word construction API performed quite well, while for non-English ones we needed some further processing to resolve hyphenation.

The second step involves iterating the words and combining their bounding rectangles to form bigger and semantically meaningful pieces, to which roles can be assigned. The block construction algorithm uses some heuristics to combine words and form these bigger blocks. These heuristic constants are specific to each newspaper and are read from the profile. The block construction algorithm can be summarized as follows:

For each word in the document:

  1. Find out if there is a matching block for this word, if not create one

  2. Add the word to either its matching block, or the newly created block

  3. Upsize the bounding rectangle of the block to enclose the new word’s bounding rectangle

The condition for finding a matching block works as follows (all must be true):

  • The font height of the block and the word should be close. We allow a 5% difference by default, although this can be customized.

  • The bounding rectangle of the word and the block should either be intersecting, or, should be intersecting when inflated horizontally and vertically by a heuristic constant. The vertical and horizontal inflation constants are described individually as coefficients that will be multiplied by the block height, and the average character width, respectively.
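The loop and the matching condition above can be sketched as follows. The `Rect`, `Word` and `Block` types are hypothetical, coordinates are assumed to grow rightwards and downwards, and two details are guesses: the “block height” used for vertical inflation is taken to be the block's font height, and the average character width is computed from the word being added.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Rect:
    x1: float
    y1: float
    x2: float
    y2: float

    def inflated(self, dx: float, dy: float) -> "Rect":
        return Rect(self.x1 - dx, self.y1 - dy, self.x2 + dx, self.y2 + dy)

    def intersects(self, o: "Rect") -> bool:
        return (self.x1 <= o.x2 and o.x1 <= self.x2 and
                self.y1 <= o.y2 and o.y1 <= self.y2)

    def union(self, o: "Rect") -> "Rect":
        return Rect(min(self.x1, o.x1), min(self.y1, o.y1),
                    max(self.x2, o.x2), max(self.y2, o.y2))


@dataclass
class Word:
    text: str
    box: Rect
    font_height: float


@dataclass
class Block:
    box: Rect
    font_height: float
    words: List[Word] = field(default_factory=list)


def matches(block: Block, word: Word, inflate_h: float = 1.0,
            inflate_v: float = 1.0, font_tol: float = 0.05) -> bool:
    """Both conditions from the list above must hold."""
    if abs(block.font_height - word.font_height) > font_tol * block.font_height:
        return False
    avg_char_width = (word.box.x2 - word.box.x1) / max(len(word.text), 1)
    grown = block.box.inflated(inflate_h * avg_char_width,
                               inflate_v * block.font_height)
    return grown.intersects(word.box)


def build_blocks(words: List[Word], **heuristics: float) -> List[Block]:
    blocks: List[Block] = []
    for word in words:
        # 1. Find a matching block, or create one.
        block = next((b for b in blocks if matches(b, word, **heuristics)), None)
        if block is None:
            block = Block(box=word.box, font_height=word.font_height)
            blocks.append(block)
        # 2. Add the word and 3. grow the bounding rectangle around it.
        block.words.append(word)
        block.box = block.box.union(word.box)
    return blocks
```

Passing `inflate_h` and `inflate_v` as keyword arguments mirrors the `inflateH` and `inflateV` attributes that the profile stores per publication.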

In addition, all words inside a text block are stored sorted by reading order (left to right, and top to bottom). This allows us to store the body text of an article as a whole, and reflow it if necessary. Text blocks constructed from the same example are shown in Figure 7. Red lines denote zone boundaries and blue rectangles are the text blocks.

Figure 7. Text Blocks in Vancouver Sun sample

5. Profile Engine

The profile engine is used for two main purposes: to customize the algorithm according to the type of publication being processed, and to define the set of roles and their associated rules. The next section discusses how roles are assigned to objects with the help of the profile engine. The profile is simply stored as an XML file for each type of publication.

For example, the profile engine can be used to customize the neighboring relationships between objects (how close objects should be laid out in order to be considered neighbors, etc.), or to define the heuristic constants used to combine words into text blocks. The following is an excerpt from the part of the profile XML used to customize the algorithm:

  <zoneConstruction minSeparatorLength="50" maxSeparatorCount="1000">
    <filters>
      <filterFilledRectangles apply="true" maxSeparatorWidth="30"/>
      <filterImage apply="true" filterBorder="true" borderMargin="12"/>
      <lineMerge closeEnough="30"/>
    </filters>
  </zoneConstruction>
  <textBlockConstruction inflateH="1" inflateV="1">
    <wordConstruction wordOverlap="0.02"/>
  </textBlockConstruction>
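A profile of this shape could be read with a few lines of standard-library code. The sketch below is only an illustration: the enclosing `<profile>` element is hypothetical (the excerpt shows two sibling elements without their parent), and the keys of the returned dictionary are names chosen for this example.

```python
import xml.etree.ElementTree as ET

# The excerpt from the text, wrapped in a hypothetical <profile> root element.
PROFILE_XML = """
<profile>
  <zoneConstruction minSeparatorLength="50" maxSeparatorCount="1000">
    <filters>
      <filterFilledRectangles apply="true" maxSeparatorWidth="30"/>
      <filterImage apply="true" filterBorder="true" borderMargin="12"/>
      <lineMerge closeEnough="30"/>
    </filters>
  </zoneConstruction>
  <textBlockConstruction inflateH="1" inflateV="1">
    <wordConstruction wordOverlap="0.02"/>
  </textBlockConstruction>
</profile>
"""


def load_profile(xml_text: str) -> dict:
    """Pull the heuristic constants out of the profile into a flat dictionary."""
    root = ET.fromstring(xml_text.strip())
    zone = root.find("zoneConstruction")
    text = root.find("textBlockConstruction")
    filters = zone.find("filters")
    return {
        "min_separator_length": float(zone.get("minSeparatorLength")),
        "max_separator_count": int(zone.get("maxSeparatorCount")),
        "filter_filled_rectangles":
            filters.find("filterFilledRectangles").get("apply") == "true",
        "max_separator_width":
            float(filters.find("filterFilledRectangles").get("maxSeparatorWidth")),
        "close_enough": float(filters.find("lineMerge").get("closeEnough")),
        "inflate_h": float(text.get("inflateH")),
        "inflate_v": float(text.get("inflateV")),
        "word_overlap": float(text.find("wordConstruction").get("wordOverlap")),
    }
```

Keeping every constant in the profile rather than in code is what lets the same algorithms be retuned for a new publication without rebuilding the plug-in.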
    

6. Assigning Roles To Blocks

A fully customizable rule engine is used to assign roles to blocks. Definitions:

Role

The semantics assigned to a specific object. For example, “title”, “author”, “body text” are several possible roles for text blocks.

Rule

A single Boolean statement that would evaluate to true or false and help decide whether an object can be assigned a specific role. There are two types of rules: "confirm" and "reject".

Clause

A clause is a single comparison; one or more clauses make up a rule. Each clause has up to three parts: an operator and one or two operands.

The following XML is an example of a role definition:

  <role name="title">
    <reject>
      <clause attrName="wordCount" op="gt" value="20"/>
    </reject>
    <reject>
      <clause attrName="isBottommost" op="eq" value="true"/>
    </reject>
    <reject>
      <clause attrName="fontHeight" op="lt" value="10"/>
    </reject>
  </role>
    

An object can be assigned a role if at least one “confirm” rule is true and none of the “reject” rules is true. The Boolean statement to confirm a role can be written as:

$$\left(\bigvee_{i} C_i\right) \wedge \neg\left(\bigvee_{i} R_i\right)$$

where $C_i$ are the confirm rules, $R_i$ are the reject rules, and $c_j$ and $r_j$ are the individual clauses that make up each rule.

Initially, all roles are assigned to all objects. The rule engine then narrows down the role set of each object, either by eliminating roles through offending rules (a single true “reject” rule is sufficient to remove a role from the role set of the object) or, in strong cases, by simply confirming a role (at least one “confirm” rule must be true).
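A small evaluator for role definitions of this form might look like the sketch below. Several points are assumptions rather than facts from the profile format: the clauses of one rule are combined with a logical AND, only the operators `gt`, `lt` and `eq` seen in the example are supported, the `<confirm>` rule in the sample is invented for illustration, and block attributes are passed in as a plain dictionary.

```python
import operator
import xml.etree.ElementTree as ET

ROLE_XML = """
<role name="title">
  <confirm>
    <clause attrName="fontHeight" op="gt" value="24"/>
  </confirm>
  <reject>
    <clause attrName="wordCount" op="gt" value="20"/>
  </reject>
  <reject>
    <clause attrName="isBottommost" op="eq" value="true"/>
  </reject>
  <reject>
    <clause attrName="fontHeight" op="lt" value="10"/>
  </reject>
</role>
"""

OPS = {"gt": operator.gt, "lt": operator.lt, "eq": operator.eq}


def clause_holds(clause: ET.Element, attrs: dict) -> bool:
    """Compare one attribute of the block against the clause's value."""
    actual = attrs[clause.get("attrName")]
    raw = clause.get("value")
    # Interpret the literal according to the type of the attribute it is compared to.
    expected = (raw == "true") if isinstance(actual, bool) else float(raw)
    return OPS[clause.get("op")](actual, expected)


def rule_holds(rule: ET.Element, attrs: dict) -> bool:
    """A rule is true when all of its clauses are true (assumed conjunction)."""
    return all(clause_holds(c, attrs) for c in rule.findall("clause"))


def evaluate(role: ET.Element, attrs: dict) -> str:
    """Return 'rejected', 'confirmed', or 'candidate' for one role and one block."""
    if any(rule_holds(r, attrs) for r in role.findall("reject")):
        return "rejected"      # one offending rule removes the role
    if any(rule_holds(r, attrs) for r in role.findall("confirm")):
        return "confirmed"     # strong case: the role is confirmed outright
    return "candidate"         # role stays in the block's role set
```

With `role = ET.fromstring(ROLE_XML.strip())`, a block described by `{"wordCount": 6, "isBottommost": False, "fontHeight": 30}` is confirmed as a title, while a block with 40 words is rejected.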

7. Article Construction

Each constructed zone can contain one or more articles. The worst-case scenario is that the whole page is a single zone. This can happen if there are not enough separators to construct the zones. In any case, the article construction algorithm works at the zone level and tries to distinguish between several articles within the same zone.

The article construction algorithm takes into account the following:

  • Layout of blocks, and their relationship and closeness to each other

  • The fact that every article should have a title

  • Ordering of text blocks, which should be top-down and left-right

An important step here is finding the continuation of the article if there is one, and storing the article in its entirety. For this purpose, the whole document is post-processed taking into account blocks that have been assigned the following roles:

Reference

This role is assigned to a block that is believed to designate the continuation of an article (e.g. “Continued on A3”)

Back-reference

This role ties the continuation of the article back to the beginning of it (e.g. “Continued from A1”)
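Pairing the two roles can be sketched as a lookup keyed on the page an article jumps from and the page it jumps to. The regular expressions below are modelled only on the two example phrases given above, and the `(page, article_id, text)` tuple layout and the `link_continuations` function are hypothetical. In this simplified form, two articles that jump between the same pair of pages would collide.

```python
import re
from typing import Dict, Iterable, List, Tuple

REFERENCE = re.compile(r"continued on (?:page )?(\w+)", re.IGNORECASE)
BACK_REFERENCE = re.compile(r"continued from (?:page )?(\w+)", re.IGNORECASE)


def link_continuations(blocks: Iterable[Tuple[str, str, str]]) -> List[Tuple[str, str]]:
    """Pair the start of each article with its continuation.

    `blocks` holds (page, article_id, text) for blocks that carry the
    Reference or Back-reference role. Returns (start_article, continuation).
    """
    refs: Dict[Tuple[str, str], str] = {}    # (from_page, to_page) -> start article
    backs: Dict[Tuple[str, str], str] = {}   # (from_page, to_page) -> continuation
    for page, article_id, text in blocks:
        match = REFERENCE.search(text)
        if match:
            refs[(page, match.group(1))] = article_id
            continue
        match = BACK_REFERENCE.search(text)
        if match:
            backs[(match.group(1), page)] = article_id
    # A jump is resolved only when both ends agree on the two pages.
    return [(refs[key], backs[key]) for key in refs if key in backs]
```

A reference whose target page carries no matching back-reference is simply left unlinked.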

8. Storage and Visualizations

After all processing is complete, the constructed semantics of the publication is saved as XMP Metadata (a subset of RDF), and becomes part of the original PDF document. It can also be organized and stored in a database to allow field-based queries on the processed publications.

The constructed semantics can be further enhanced via XSLT transformations, which can convert the raw XML to SVG [4] with special styling assigned to different semantic objects. The transformation applied can customize the document for different media such as a hand-held device, allow localization, and put emphasis on required or tailored subsets of the data, for example by highlighting an article or underlining searched keywords. The end result is an interactive SVG, which contains both the objects of the original document and the semantic structures related to these objects.
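The kind of mapping such a transformation performs can be illustrated without XSLT. The sketch below does the equivalent step procedurally with the Python standard library, which has no XSLT processor. The input format (a `<page>` of `<block>` elements with `role`, `x`, `y`, `w` and `h` attributes) and the styles are invented for this example and are not the extractor's actual output schema.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical styling per semantic role.
ROLE_STYLE = {
    "title": "fill:none;stroke:red;stroke-width:2",
    "body": "fill:none;stroke:blue",
}

SAMPLE = """
<page>
  <block role="title" x="10" y="10" w="300" h="40"/>
  <block role="body" x="10" y="60" w="300" h="200"/>
</page>
"""


def blocks_to_svg(semantic_xml: str, width: int, height: int) -> str:
    """Emit one styled <rect> per semantic block."""
    ET.register_namespace("", SVG_NS)
    source = ET.fromstring(semantic_xml.strip())
    svg = ET.Element(f"{{{SVG_NS}}}svg",
                     {"width": str(width), "height": str(height)})
    for block in source.iter("block"):
        role = block.get("role")
        ET.SubElement(svg, f"{{{SVG_NS}}}rect", {
            "x": block.get("x"), "y": block.get("y"),
            "width": block.get("w"), "height": block.get("h"),
            "class": role,   # lets scripts or CSS address blocks by role
            "style": ROLE_STYLE.get(role, "fill:none;stroke:gray"),
        })
    return ET.tostring(svg, encoding="unicode")
```

Swapping the `ROLE_STYLE` table is the procedural counterpart of applying a different stylesheet for a different medium.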

Bibliography

[1] Extracting Semantic Knowledge From PDF Publications, Y. Khramov, A. Kroogman http://www.idealliance.org/papers/xml2001/papers/html/04-04-06.html

[2] Acrobat Core API Reference Version 5.0, Adobe Systems Incorporated

[3] PDF Reference Version 1.4, Adobe Systems Incorporated

[4] Scalable Vector Graphics (SVG) 1.1 Specification, J. Ferraiolo, editor http://www.w3.org/TR/SVG11/

[5] XSL Transformations (XSLT) Version 1.0, J. Clark, editor http://www.w3.org/TR/xslt

[5] Resource Description Framework (RDF), Ora Lassila, Ralph R. Swick, editors http://www.w3.org/RDF

Biography

Ahmet Gurcan completed his BSc. in Electrical and Computer Engineering at Istanbul Technical University in 1997, and his M.A.Sc. in the same field at the University of British Columbia in 2000. While he was pursuing his Master’s Degree, he also worked as a research and teaching assistant. His topics of interest at that time were real-time operating systems, and he built a dual-processor real-time system mainly used to control machines and robots. Since his graduation, he has been working for a Vancouver-based software company, SchemaSoft, a leader in file converters, XML technologies and SVG. Ahmet is also an expert on the PDF file format.

Yuri Khramov has more than 20 years of experience in the software industry; he has been involved in XML and other Web technologies for more than 5 years. He is one of the founding partners of SchemaSoft. Prior to that, he worked at Paradigm Development Corp. in Vancouver, Canada, Graphica Corp. in Tokyo, and several industrial and academic institutions in Moscow. He holds a Ph.D. in Computer Science from Moscow Management Institute. Yuri is a co-director of the Vancouver XML Developers Association.

Mr. Kroogman is an IT professional with more than 16 years of project management, system design, applications programming and service delivery experience. At Rydex Industries Corporation, which specializes in data communications and network management for the marine industry, he designed and developed the core software for the global satellite messaging system. (The Rydex e-mail system is deployed in more than 400 organizations in 19 countries.) Most recently at Orion Technologies Inc. (Orion provides electronic commerce services to financial institutions, governments and corporations), where Mr. Kroogman was Vice President of Product Delivery, he was responsible for technical project management of all Orion projects, including management and coordination of several development teams in different countries. He was also responsible for management of the Network Control Center, finalizing contractual arrangements and managing third-party service providers (BC-TEL, Teleglobe, Telecom Malaysia) as well as for functional management of subsidiaries and joint ventures (Philippines and Malaysia). Mr. Kroogman has a Masters Degree in Chemical Engineering from the University of Chemical Engineering in Moscow.

After receiving his Ph.D. in Mathematical Physics from Yale University in 1989, Philip spent a year as Assistant Professor of Physics at Knox College, followed by four years as Assistant Professor of Mathematics at the University of Toronto. His background in Differential Geometry and in computer modelling of physical phenomena served as unorthodox preparation for his subsequent move into industry as a Software Engineer with an emphasis on Computer Graphics. By 1997 Philip was in charge of a software research team creating early Web technologies based on HTML, XML, CSS and Java. Philip now lives and works in Vancouver, Canada, where he is President of SchemaSoft (http://www.schemasoft.com/), a software development consulting company he co-founded in 1999. He is an Advisory Committee Representative of the World Wide Web Consortium (http://www.w3.org/), and has been a member of the W3C Scalable Vector Graphics Working Group (http://www.w3.org/Graphics/SVG/) since its inception in 1998. Philip is Chair of the BC Advanced Systems Institute International Scientific Advisory Board (http://www.asi.bc.ca/). He is also a Director of the Vancouver XML Developers Association (http://www.vanx.org/), an organization that he co-founded in 2000. He regularly writes and lectures on topics related to software engineering, XML and SVG.