Abstract
The PDF format has evolved from a pure page-layout format to a sophisticated one that can store structural information about the underlying document. On the other hand, many PDF files, especially publications, are generated without any semantics. In this paper, we present software that fills this gap for PDF publications. Our “PDF Semantic Extractor”, which is the first-stage implementation of the “Smart Translator” proposed in “Extracting Semantic Knowledge From PDF Publications” [1], consists of two main parts: an Acrobat plug-in that retrieves semantics from PDF publications and converts them to XML, and server-side scripts combined with XSLT [5] transformations that convert this raw information to interactive SVG documents with special styling attributes assigned to semantic objects. The PDF Semantic Extractor starts by analyzing lines and drawings on a page. Using a set of customizable filters, it keeps only the vertical and horizontal lines that can be considered separators. A custom closed-path construction algorithm is used at this stage, which draws boundaries around zones. Each zone is defined as an area that can contain one or more articles. At the same time, text blocks are formed out of text runs. In PDF, text runs can contain a full word, a partial word, or even a single letter. These atomic objects are combined to form bigger blocks, to which roles can be assigned. Possible text block roles are “title”, “author”, “article text”, etc. A fully customizable, heuristics-based rule engine is used to assign roles to the text blocks. A profile engine is used to store different styling attributes for different publications. The last step involves combining the two groups of objects and building the final semantic tree. Zones and blocks are further analyzed to construct separate articles, ads, etc. In addition, articles are connected to any continuation they might have on other pages. The extracted semantic information is used for two main purposes.
First, it is indexed and stored in a database, allowing field-based queries on archives of PDF publications. Second, it is attached to the same PDF document as XMP (a subset of RDF) metadata. This promotes modularity, and the PDF is enriched with the initially missing semantics. Furthermore, the embedded XML can easily be retrieved from the original document using server-side scripting, and converted to SVG [4] with XSLT transformations. The transformation applied customizes the document for different media, allows localization, and emphasizes the required subsets of data, for example by highlighting an article or underlining searched keywords. The end result is an interactive SVG, which contains both the objects of the original document and the semantic structures related to these objects.
The PDF format has evolved from a pure page-layout format to a sophisticated one that can store structural information about the underlying document. On the other hand, virtual printing is still the source of a great many PDF files, especially in the publishing industry and in office document exchange. Virtual printers do not preserve any semantic information, so the semantic structure of such documents has to be reconstructed afterwards.
Our team is developing several tools to address this problem. All of them are based on the knowledge-acquisition technique developed by Y. Khramov. Semantic reconstruction based on publication-specific profiles is used for documents with extremely complex and diverse layouts, such as newspapers and magazines, and is part of the Smart Translator project, under development by NewspaperDirect with SchemaSoft’s participation. The main principles of this approach were discussed in our presentation at XML 2001 http://www.idealliance.org/papers/xml2001/papers/html/04-04-06.html. In parallel with this work, SchemaSoft is developing a PDF to MS Word converter based on the same approach, but without publication-specific profiles.
Implementing the ideas described in the previous paper required the development of several original algorithms, such as A. Gurcan’s zone reconstruction algorithms and an intricate text block construction mechanism. This paper concentrates on the additions and details that we have developed since 2001.
In its present form, the PDF Semantic Extractor is implemented as a plug-in to Adobe Acrobat [2] [3]. The process starts when the document (a PDF representation of a newspaper or a magazine) is loaded into Acrobat. Simultaneously, the plug-in retrieves the profile for the publication from the profile database. This profile (stored as an XML document) customizes the reconstruction process based on knowledge about the selected type of publication. The plug-in also contains a UI, developed by the NewspaperDirect team, that helps to set up the profile for a new type of publication very quickly.
The initial semantic state of a PDF page is shown in Figure 1. It is a completely flat layout:
In accordance with the idea of combining “bottom-up” and “top-down” reconstruction, the PDF Semantic Extractor starts by analyzing lines and drawings on the page. By the use of a set of customizable filters, only vertical and horizontal lines that can be considered as separators are left. A custom closed-path construction algorithm is used at this stage, which draws boundaries around zones. Each zone is defined as an area that can contain one or more articles.
At the same time, text blocks are formed out of text runs. In PDF, text runs can contain a full word, a partial word, or even a single letter. These atomic objects are combined to form bigger blocks, to which roles can be assigned. Possible text block roles are “title”, “author”, “article text”, etc. A fully customizable, heuristics-based rule engine is used to assign roles to the text blocks. A profile engine is used to store different styling attributes for different publications. The last step involves combining the two groups of objects (zones and text blocks) and building the final semantic tree. Zones and blocks are further analyzed to construct separate articles, ads, etc. In addition, articles are connected to any continuation they might have on other pages.
The intended final semantic state of a page is shown in Figure 2:
The extracted semantic information can be used for two main purposes. First, it can be indexed and stored in a database, allowing field-based queries on archives of PDF publications. Second, it can be attached to the same PDF document as XMP (a subset of RDF [5]) metadata. This promotes modularity, and the PDF is enriched with the initially missing semantics. Furthermore, the embedded XML can easily be retrieved from the original document using server-side scripting, and converted to SVG with XSLT transformations. The transformation applied can customize the document for different media, allow localization, and emphasize the required subsets of data, for example by highlighting an article or underlining searched keywords. The end result is an interactive SVG, which contains both the objects of the original document and the semantic structures related to these objects.
The rest of the paper describes the most important steps in the reconstruction process.
In some publications, separators are used extensively to draw boundaries around articles. Our zone construction algorithm analyzes each page and divides it into zones, each of which encloses one or more articles, ads, etc. Only separators are taken into consideration when constructing these zones. The steps of the zone construction algorithm are summarized below:
The geometric objects on the page (such as lines, rectangles, etc.) are filtered so that only objects that are believed to be separators are left. Several custom filters are used at this point. For example, by looking at the stroke and fill characteristics of a closed path, our algorithm decides whether the path can be classified as a separator or not.
In addition, all open and closed paths are converted to lines. This allows us to have a simpler zone construction algorithm. As a result, the only input to the zone construction algorithm is a set of horizontal or vertical lines.
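The filtering and conversion step can be illustrated with a minimal sketch. The `Rect` and `Line` types and the function `rect_to_lines` are hypothetical names introduced here, and the way the thresholds are applied (a thin filled rectangle collapses to its centre line, a stroked rectangle contributes only its long edges) is an assumption about what such a filter might do, not the plug-in's actual logic. The default values echo the `maxSeparatorWidth` and `minSeparatorLength` attributes of the profile.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x0: float
    y0: float
    x1: float
    y1: float
    filled: bool = False


@dataclass(frozen=True)
class Line:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def length(self) -> float:
        # Lines are axis-aligned, so one of the two terms is always zero.
        return abs(self.x1 - self.x0) + abs(self.y1 - self.y0)


def rect_to_lines(r, max_separator_width=30.0, min_separator_length=50.0):
    """Turn one rectangle into zero or more axis-aligned separator lines."""
    w, h = r.x1 - r.x0, r.y1 - r.y0
    if r.filled:
        # Assumption: a thin filled rectangle is a thick rule; keep its centre line.
        if h <= max_separator_width and w >= min_separator_length:
            y = (r.y0 + r.y1) / 2
            return [Line(r.x0, y, r.x1, y)]
        if w <= max_separator_width and h >= min_separator_length:
            x = (r.x0 + r.x1) / 2
            return [Line(x, r.y0, x, r.y1)]
        return []  # a large filled area (e.g. a shaded box) is not a separator
    # Stroked rectangle: keep only the edges long enough to act as separators.
    edges = [
        Line(r.x0, r.y0, r.x1, r.y0),
        Line(r.x0, r.y1, r.x1, r.y1),
        Line(r.x0, r.y0, r.x0, r.y1),
        Line(r.x1, r.y0, r.x1, r.y1),
    ]
    return [e for e in edges if e.length >= min_separator_length]
```

After this step every surviving object is a horizontal or vertical `Line`, which matches the input expected by the later merging and zone construction stages.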
Although some lines are very close, they do not intersect. The zone construction algorithm makes a second pass on all lines, and tries to merge non-intersecting close ones. There are two types of merging:
Same axis merging (e.g. combining two horizontal lines into one)
Different axis merging (e.g. extending a horizontal line in order to intersect it with a vertical line)
Merging and connecting are considered possible when the lines are close enough to each other both horizontally and vertically. The closeness constant can be configured in the profile.
When merging two close horizontal or two close vertical lines, we simply remove the shorter one. The exception to this rule is when the lengths of the lines are very close (or equal) to each other. In this case, we remove the one that is at the bottom, or on the right.
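A minimal sketch of same-axis merging follows. The `AxisLine` representation, the single greedy pass, and the `length_tolerance` parameter are assumptions made for illustration; the text only states the rule for choosing which line to drop. The sketch also assumes that `pos` grows downwards for horizontal lines and rightwards for vertical ones, so that a larger `pos` means "at the bottom, or on the right".

```python
from typing import NamedTuple


class AxisLine(NamedTuple):
    pos: float    # y of a horizontal line, x of a vertical line
    start: float  # extent along the line's own axis
    end: float

    @property
    def length(self) -> float:
        return self.end - self.start


def are_close(a, b, close_enough):
    # gap is <= 0 when the two spans overlap along their common axis
    gap = max(a.start, b.start) - min(a.end, b.end)
    return abs(a.pos - b.pos) <= close_enough and gap <= close_enough


def merge_same_axis(lines, close_enough=30.0, length_tolerance=1.0):
    """Greedy pass over lines of one orientation; drops one line of each close pair."""
    kept = []
    for line in lines:
        for i, other in enumerate(kept):
            if not are_close(line, other, close_enough):
                continue
            if abs(line.length - other.length) <= length_tolerance:
                drop_new = line.pos >= other.pos  # similar lengths: drop bottom/right one
            else:
                drop_new = line.length < other.length  # otherwise drop the shorter one
            if not drop_new:
                kept[i] = line
            break
        else:
            kept.append(line)
    return kept
```

A production version would probably repeat the pass until no more pairs can be merged; the sketch only shows the decision rule.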
When we are connecting a horizontal and a vertical line, the decision is which line to extend to intersect with the other. If we think of the two lines as a “T” shaped intersection, we always extend the portion that would be the vertical part of “T”, when placed properly. Figure 3 demonstrates this:
In all the figures shown above, we extend line “2” to intersect line “1”. The same is applied if one line already intersects and extends beyond the other. In this case, the vertical part of the “T” joint is shrunk to exactly intersect with the horizontal part.
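The “T” rule can be sketched as a single snapping operation: whether the stem stops short of the bar or overshoots it, its nearest end is moved exactly onto the bar. The `AxisLine` type and `snap_stem_to_bar` are hypothetical names, and deciding which of two perpendicular lines plays the role of the stem is left to the caller in this sketch.

```python
from typing import NamedTuple


class AxisLine(NamedTuple):
    pos: float    # y of a horizontal line, x of a vertical line
    start: float  # extent along the line's own axis
    end: float


def snap_stem_to_bar(stem, bar, close_enough=30.0):
    """Extend or shrink `stem` so that it ends exactly on the perpendicular `bar`.

    Returns the stem unchanged when the two lines are not close enough.
    """
    # The stem must point at the bar: its fixed coordinate lies within the bar's span.
    if not (bar.start - close_enough <= stem.pos <= bar.end + close_enough):
        return stem
    if abs(stem.start - bar.pos) <= close_enough:
        return stem._replace(start=bar.pos)
    if abs(stem.end - bar.pos) <= close_enough:
        return stem._replace(end=bar.pos)
    return stem
```

For example, with a horizontal bar at y = 100 spanning x = 0..200, a vertical stem at x = 50 running from y = 105 to y = 300 is extended to start at y = 100, and a stem running from y = 95 is shrunk to the same point.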
Once merging and connecting are complete, the result should resemble Figure 4.
Our zone construction algorithm can be summarized as follows: We start with a region, initially as big as the page. At the end of each iteration, we find a set of lines that form a closed path, which we consider to be a zone. We then exclude this zone (area enclosed by the path) from the region. We then find a corner of the remaining region, and repeat the procedure until the region is empty. At this time we have a list of paths that are our zones.
Our default search order is upwards, leftwards, downwards, and then rightwards, but the algorithm changes this sequence on the fly if needed.
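The closed-path tracing itself is not reproduced here. As a rough illustration of what the zone step produces, the sketch below uses a different and simpler technique: the page is cut into grid cells at every line coordinate, and cells that are not separated by a line are joined by flood fill. The function `build_zones` and its tuple formats are assumptions for this example, and it presumes that the lines have already been merged and connected so that their coordinates match exactly.

```python
def build_zones(page_w, page_h, h_lines, v_lines):
    """h_lines: (y, x0, x1) tuples; v_lines: (x, y0, y1) tuples.

    Returns a list of zones, each a sorted list of cell rectangles (x0, y0, x1, y1).
    """
    xs = sorted({0, page_w} | {x for x, _, _ in v_lines}
                | {x for _, a, b in h_lines for x in (a, b)})
    ys = sorted({0, page_h} | {y for y, _, _ in h_lines}
                | {y for _, a, b in v_lines for y in (a, b)})
    ncols, nrows = len(xs) - 1, len(ys) - 1

    def blocked_above(row, col):
        # Is there a horizontal separator on the top edge of cell (row, col)?
        return any(y == ys[row] and a <= xs[col] and xs[col + 1] <= b
                   for y, a, b in h_lines)

    def blocked_left(row, col):
        # Is there a vertical separator on the left edge of cell (row, col)?
        return any(x == xs[col] and a <= ys[row] and ys[row + 1] <= b
                   for x, a, b in v_lines)

    zone_of, zones = {}, []
    for r in range(nrows):
        for c in range(ncols):
            if (r, c) in zone_of:
                continue
            zone_id = len(zones)
            zones.append([])
            zone_of[(r, c)] = zone_id
            stack = [(r, c)]
            while stack:
                cr, cc = stack.pop()
                zones[zone_id].append((xs[cc], ys[cr], xs[cc + 1], ys[cr + 1]))
                neighbours = []
                if cr > 0 and not blocked_above(cr, cc):
                    neighbours.append((cr - 1, cc))
                if cr + 1 < nrows and not blocked_above(cr + 1, cc):
                    neighbours.append((cr + 1, cc))
                if cc > 0 and not blocked_left(cr, cc):
                    neighbours.append((cr, cc - 1))
                if cc + 1 < ncols and not blocked_left(cr, cc + 1):
                    neighbours.append((cr, cc + 1))
                for n in neighbours:
                    if n not in zone_of:
                        zone_of[n] = zone_id
                        stack.append(n)
    return [sorted(z) for z in zones]
```

On a 100 by 100 page with one horizontal separator across the full width at y = 40 and one vertical separator at x = 50 running from y = 40 to y = 100, this yields three zones: the strip above the horizontal line, and the two columns below it.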
Sample Newspaper:
Constructed Zones:
The objective here is to construct blocks of text to which we can later assign roles. In PDF documents, text runs do not necessarily appear in consecutive order, nor do they need to be complete. Therefore, words cannot be determined by simply iterating over text runs. Text block construction involves two steps:
Construction of words based on text-runs
Construction of text blocks based on words
The first step is fairly simple using the Acrobat SDK [2], the toolkit we use to access the PDF document. For English publications, the word construction API performed quite well, while for non-English ones we needed to do some further processing to figure out the hyphenation.
The second step involves iterating the words and combining their bounding rectangles to form bigger and semantically meaningful pieces, to which roles can be assigned. The block construction algorithm uses some heuristics to combine words and form these bigger blocks. These heuristic constants are specific to each newspaper and are read from the profile. The block construction algorithm can be summarized as follows:
For each word in the document:
Find out whether there is a matching block for this word; if not, create one
Add the word to either its matching block, or the newly created block
Enlarge the bounding rectangle of the block to enclose the new word’s bounding rectangle
The condition for finding a matching block works as follows (all must be true):
The font height of the block and the word should be close. We allow a 5% difference by default, although this can be customized.
The bounding rectangle of the word and the block should either intersect, or intersect when inflated horizontally and vertically by a heuristic constant. The vertical and horizontal inflation constants are specified individually as coefficients that are multiplied by the block height and the average character width, respectively.
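The loop and the matching condition above can be sketched as follows. `Word`, `Block`, `matches`, and `build_blocks` are hypothetical names. Two interpretations are assumptions of this sketch: the "block height" used for vertical inflation is taken to be the block's font height, and the average character width is computed from the words already in the block.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    text: str
    x0: float
    y0: float
    x1: float
    y1: float
    font_height: float


@dataclass
class Block:
    x0: float
    y0: float
    x1: float
    y1: float
    font_height: float
    words: list = field(default_factory=list)

    def avg_char_width(self) -> float:
        chars = sum(len(w.text) for w in self.words)
        width = sum(w.x1 - w.x0 for w in self.words)
        return width / chars if chars else 0.0


def matches(block, word, inflate_h=1.0, inflate_v=1.0, font_tolerance=0.05):
    # Condition 1: font heights must be close (5% by default).
    if abs(block.font_height - word.font_height) > font_tolerance * block.font_height:
        return False
    # Condition 2: rectangles intersect once the block is inflated.
    dx = inflate_h * block.avg_char_width()
    dy = inflate_v * block.font_height  # assumption: "block height" = font height
    return (word.x0 <= block.x1 + dx and word.x1 >= block.x0 - dx
            and word.y0 <= block.y1 + dy and word.y1 >= block.y0 - dy)


def build_blocks(words, **heuristics):
    blocks = []
    for w in words:
        block = next((b for b in blocks if matches(b, w, **heuristics)), None)
        if block is None:
            block = Block(w.x0, w.y0, w.x1, w.y1, w.font_height)
            blocks.append(block)
        block.words.append(w)
        # Grow the block's bounding rectangle to enclose the new word.
        block.x0, block.y0 = min(block.x0, w.x0), min(block.y0, w.y0)
        block.x1, block.y1 = max(block.x1, w.x1), max(block.y1, w.y1)
    return blocks
```

The `inflate_h` and `inflate_v` keyword arguments correspond to the `inflateH` and `inflateV` attributes that the profile stores per publication.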
In addition, all words inside a text block are stored sorted by reading order (left to right, and top to bottom). This allows us to store the body text of an article as a whole, and reflow it if necessary. Text blocks constructed from the same example are shown in Figure 7. Red lines denote zone boundaries and blue rectangles are the text blocks.
The profile engine is used for two main purposes: Customize the algorithm according to the type of publication we are working on, and define the set of roles and the associated rules. The next section discusses how roles are assigned to objects with the help of the profile engine. The profile is simply stored as an XML file for each type of publication.
For example, the profile engine can be used to customize the neighboring relationships between objects (how close objects should be laid out in order to be considered neighbors, etc.), or to define the heuristic constants used to combine words into text blocks. The following is an excerpt from the part of the profile XML used to customize the algorithm:
<zoneConstruction minSeparatorLength="50" maxSeparatorCount="1000">
  <filters>
    <filterFilledRectangles apply="true" maxSeparatorWidth="30"/>
    <filterImage apply="true" filterBorder="true" borderMargin="12"/>
    <lineMerge closeEnough="30"/>
  </filters>
</zoneConstruction>
<textBlockConstruction inflateH="1" inflateV="1">
  <wordConstruction wordOverlap="0.02"/>
</textBlockConstruction>
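As an illustration of how such an excerpt might be consumed, the sketch below reads the heuristic constants with Python's standard `xml.etree.ElementTree`. The enclosing `<profile>` root element and the function `load_algorithm_settings` are assumptions added for the example; the excerpt does not show the real root element of a profile file.

```python
import xml.etree.ElementTree as ET

# The <profile> wrapper is hypothetical; the inner elements are from the excerpt.
PROFILE_XML = """
<profile>
  <zoneConstruction minSeparatorLength="50" maxSeparatorCount="1000">
    <filters>
      <filterFilledRectangles apply="true" maxSeparatorWidth="30"/>
      <filterImage apply="true" filterBorder="true" borderMargin="12"/>
      <lineMerge closeEnough="30"/>
    </filters>
  </zoneConstruction>
  <textBlockConstruction inflateH="1" inflateV="1">
    <wordConstruction wordOverlap="0.02"/>
  </textBlockConstruction>
</profile>
"""


def load_algorithm_settings(xml_text):
    """Collect the numeric constants of the excerpt into a flat dictionary."""
    root = ET.fromstring(xml_text)
    zone = root.find("zoneConstruction")
    text = root.find("textBlockConstruction")
    return {
        "min_separator_length": float(zone.get("minSeparatorLength")),
        "max_separator_width": float(
            zone.find("filters/filterFilledRectangles").get("maxSeparatorWidth")),
        "close_enough": float(zone.find("filters/lineMerge").get("closeEnough")),
        "inflate_h": float(text.get("inflateH")),
        "inflate_v": float(text.get("inflateV")),
        "word_overlap": float(text.find("wordConstruction").get("wordOverlap")),
    }


settings = load_algorithm_settings(PROFILE_XML)
```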
A fully customizable rule engine is used to assign roles to blocks. Definitions:
| Role | The semantics assigned to a specific object. For example, “title”, “author”, and “body text” are possible roles for text blocks. |
| Rule | A single Boolean statement that evaluates to true or false and helps decide whether an object can be assigned a specific role. There are two types of rules: "confirm" and "reject". |
| Clause | Several clauses make up a rule. Each clause consists of an operator and one or two operands. |
The following XML is an example of a role definition:
<role name="title">
  <reject>
    <clause attrName="wordCount" op="gt" value="20"/>
  </reject>
  <reject>
    <clause attrName="isBottommost" op="eq" value="true"/>
  </reject>
  <reject>
    <clause attrName="fontHeight" op="lt" value="10"/>
  </reject>
</role>
An object can be assigned a role if at least one “confirm” rule is true, and none of the “reject” rules is true. The Boolean statement to confirm a role can be written as:
where Ci are the confirm rules, and Ri are the reject rules, and cj and rj are individual clauses.
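One way to write this statement out, consistent with the condition given above (at least one “confirm” rule true, no “reject” rule true), is sketched below. The conjunction of clauses inside a rule is an assumption; the text only says that several clauses make up a rule.

```latex
\[
  \mathrm{assign}(\mathit{role})
  \;=\;
  \Bigl(\,\bigvee_{i} C_i\Bigr) \,\wedge\, \neg\Bigl(\,\bigvee_{i} R_i\Bigr),
  \qquad
  C_i = \bigwedge_{j} c_j,
  \quad
  R_i = \bigwedge_{j} r_j
\]
```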
Initially, all roles are assigned to all objects. The rule engine then narrows down the role set of an object either by eliminating roles for which an offending rule is found (a single true “reject” rule is sufficient to remove a role from the role set of the object) or, in strong cases, by simply confirming a role (at least one “confirm” rule must be true).
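The narrowing process can be sketched with a small evaluator for the “title” role definition shown earlier. The function names, the three-valued result, and the treatment of a rule as the conjunction of its clauses are assumptions of this sketch; only the operators that appear in the example (`gt`, `lt`, `eq`) are handled.

```python
import operator
import xml.etree.ElementTree as ET

ROLE_XML = """
<role name="title">
  <reject><clause attrName="wordCount" op="gt" value="20"/></reject>
  <reject><clause attrName="isBottommost" op="eq" value="true"/></reject>
  <reject><clause attrName="fontHeight" op="lt" value="10"/></reject>
</role>
"""

OPS = {"gt": operator.gt, "lt": operator.lt, "eq": operator.eq}


def parse_value(text):
    # Values in the example are either booleans or numbers.
    if text in ("true", "false"):
        return text == "true"
    return float(text)


def clause_holds(clause, attrs):
    left = attrs[clause.get("attrName")]
    return OPS[clause.get("op")](left, parse_value(clause.get("value")))


def rule_holds(rule, attrs):
    # Assumption: a rule is true when all of its clauses are true.
    return all(clause_holds(c, attrs) for c in rule.findall("clause"))


def role_status(role, attrs):
    """Return 'rejected', 'confirmed', or 'possible' for one object and one role."""
    if any(rule_holds(r, attrs) for r in role.findall("reject")):
        return "rejected"
    if any(rule_holds(r, attrs) for r in role.findall("confirm")):
        return "confirmed"
    return "possible"


title_role = ET.fromstring(ROLE_XML)
headline = {"wordCount": 6, "isBottommost": False, "fontHeight": 24}
paragraph = {"wordCount": 140, "isBottommost": False, "fontHeight": 9}
```

With this definition, `role_status(title_role, headline)` leaves "title" in the role set as "possible" (the example role has no “confirm” rules), while `paragraph` is rejected by the word-count rule.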
Each constructed zone can contain one or more articles. In the worst case, the whole page is a single zone. This can happen if there are not enough separators to construct the zones. In any case, the article construction algorithm works at the zone level and tries to distinguish between several articles within the same zone.
The article construction algorithm takes the following into account:
Layout of blocks, and their relationship and closeness to each other
The fact that every article should have a title
Ordering of text blocks, which should be top-down and left-right
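A deliberately simplified sketch of two of these criteria follows: blocks of one zone are put into top-down, left-right order, and every block with the “title” role opens a new article. The dictionary layout and the function `group_articles` are assumptions for illustration, the y coordinate is assumed to grow downwards, and the layout and closeness analysis of the real algorithm is omitted entirely.

```python
def group_articles(blocks):
    """Split the blocks of one zone into articles.

    blocks: list of dicts with keys 'role', 'x0', 'y0', and 'text'.
    Returns a list of articles, each a list of blocks in reading order.
    """
    articles = []
    for block in sorted(blocks, key=lambda b: (b["y0"], b["x0"])):
        # Every article should have a title, so a title starts a new article.
        if block["role"] == "title" or not articles:
            articles.append([])
        articles[-1].append(block)
    return articles
```

A zone that starts with a block other than a title still yields an article in this sketch; how the actual algorithm treats such untitled leading blocks is not described in the text.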
An important step here is finding the continuation of the article if there is one, and storing the article in its entirety. For this purpose, the whole document is post-processed taking into account blocks that have been assigned the following roles:
| Reference | This role is assigned to a block that is believed to designate the continuation of an article (e.g. “Continued on A3”) |
| Back-reference | This role ties the continuation of the article back to its beginning (e.g. “Continued from A1”) |
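The post-processing pass can be sketched as a matching of reference blocks against back-reference blocks. The regular expressions, the tuple layout, and `link_continuations` are assumptions built around the two example phrases; real publications use many other wordings, and several articles continuing between the same pair of pages would need an extra tie-breaker that this sketch does not have.

```python
import re

REFERENCE = re.compile(r"continued\s+on\s+(?:page\s+)?(\w+)", re.IGNORECASE)
BACK_REFERENCE = re.compile(r"continued\s+from\s+(?:page\s+)?(\w+)", re.IGNORECASE)


def link_continuations(blocks):
    """blocks: iterable of (page_label, article_id, text) tuples.

    Returns a dict mapping the id of an article's first part to the id of
    its continuation.
    """
    refs = {}   # (from_page, to_page) -> article id holding "Continued on ..."
    backs = {}  # (from_page, to_page) -> article id holding "Continued from ..."
    for page, article_id, text in blocks:
        match = REFERENCE.search(text)
        if match:
            refs[(page.upper(), match.group(1).upper())] = article_id
        match = BACK_REFERENCE.search(text)
        if match:
            backs[(match.group(1).upper(), page.upper())] = article_id
    return {refs[key]: backs[key] for key in refs if key in backs}
```

For instance, a block reading "Continued on A3" on page A1 and a block reading "Continued from A1" on page A3 produce a single link from the first article to the second.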
After all processing is complete, the constructed semantics of the publication is saved as XMP Metadata (a subset of RDF), and becomes part of the original PDF document. It can also be organized and stored in a database to allow field-based queries on the processed publications.
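To make "field-based queries" concrete, the sketch below stores a few article records in an in-memory SQLite database using Python's standard `sqlite3` module. The table layout, the function `index_articles`, and the sample records are invented for illustration; the database actually used by the system is not described.

```python
import sqlite3


def index_articles(conn, articles):
    """Store one row per article so that individual fields can be queried."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS article "
        "(publication TEXT, page TEXT, title TEXT, author TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO article VALUES (:publication, :page, :title, :author, :body)",
        articles,
    )


conn = sqlite3.connect(":memory:")
index_articles(conn, [  # made-up sample records
    {"publication": "Daily Example", "page": "A1", "title": "Budget passes",
     "author": "J. Smith", "body": "The council approved ..."},
    {"publication": "Daily Example", "page": "B2", "title": "Local team wins",
     "author": "R. Jones", "body": "In a close game ..."},
])
rows = conn.execute(
    "SELECT title FROM article WHERE author = ?", ("J. Smith",)
).fetchall()
```

Because each semantic role becomes a column, a query can target the author or title field directly instead of searching the flat page text.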
The constructed semantics can be further enhanced via XSLT transformations, which can convert the raw XML to SVG [4] with special styling assigned to different semantic objects. The transformation applied can customize the document for different media such as a hand-held device, allow localization, and emphasize the required subsets of data, for example by highlighting an article or underlining searched keywords. The end result is an interactive SVG, which contains both the objects of the original document and the semantic structures related to these objects.
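The shape of the resulting SVG can be suggested with a short sketch. It uses Python's `xml.etree.ElementTree` as a stand-in for the XSLT stylesheet, so it illustrates the output rather than the transformation technique; the block dictionaries, the `class` attribute carrying the role, and the highlight styling are all assumptions.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def blocks_to_svg(blocks, width, height, highlight_article=None):
    """Emit one <rect> per semantic block; the role goes into the class attribute."""
    svg = ET.Element("svg", xmlns=SVG_NS, width=str(width), height=str(height))
    for b in blocks:
        rect = ET.SubElement(
            svg, "rect",
            x=str(b["x"]), y=str(b["y"]),
            width=str(b["w"]), height=str(b["h"]),
        )
        rect.set("class", b["role"])
        rect.set("stroke", "blue")
        # Emphasize the blocks of one article, e.g. the one a reader selected.
        if highlight_article is not None and b["article"] == highlight_article:
            rect.set("fill", "yellow")
            rect.set("fill-opacity", "0.3")
        else:
            rect.set("fill", "none")
    return ET.tostring(svg, encoding="unicode")


svg_text = blocks_to_svg(
    [
        {"x": 10, "y": 10, "w": 200, "h": 30, "role": "title", "article": 1},
        {"x": 10, "y": 50, "w": 200, "h": 300, "role": "body", "article": 2},
    ],
    612, 792, highlight_article=1,
)
```

Keeping the role in a `class` attribute lets a stylesheet or script restyle all titles, or all blocks of one article, without touching the geometry.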
[1] Extracting Semantic Knowledge From PDF Publications, Y. Khramov, A. Kroogman http://www.idealliance.org/papers/xml2001/papers/html/04-04-06.html
[4] Scalable Vector Graphics (SVG) 1.1 Specification, J. Ferraiolo, editor http://www.w3.org/TR/SVG11/
[5] XSL Transformations (XSLT) Version 1.0, J. Clark, editor http://www.w3.org/TR/xslt
[5] Resource Description Framework (RDF), Ora Lassila, Ralph R. Swick, editors http://www.w3.org/RDF