Abstract
The PDF format has evolved from a pure page-layout format into a sophisticated one that can store structural information about the underlying document. Many PDF files, however, especially publications, are still generated without any semantics. In this paper, we present software that recovers this missing semantic information for PDF publications. Our “PDF Semantic Extractor”, the first-stage implementation of the “Smart Translator” proposed in “Extracting Semantic Knowledge From PDF Publications” [1], consists of two main parts: an Acrobat plug-in that retrieves semantics from PDF publications and converts them to XML, and server-side scripts combined with XSLT [5] transformations that convert this raw information into interactive SVG documents in which special styling attributes are assigned to semantic objects. The PDF Semantic Extractor starts by analyzing the lines and drawings on a page. A set of customizable filters discards everything except the vertical and horizontal lines that can be considered separators. At this stage, a custom closed-path construction algorithm draws boundaries around zones, where each zone is defined as an area that can contain one or more articles. At the same time, text blocks are formed out of text runs. In PDF, a text run can contain a full word, a partial word, or even a single letter. These atomic objects are combined into bigger blocks, to which roles such as “title”, “author”, or “article text” can be assigned. A fully customizable, heuristics-based rule engine assigns the roles to the text blocks, and a profile engine stores the different styling attributes of different publications. The last step combines the two groups of objects, zones and text blocks, and builds the final semantic tree. Zones and blocks are analyzed further to construct separate articles, ads, etc. In addition, each article is connected to any continuation it might have on other pages. The extracted semantic information is used for two main purposes.
First, it is indexed and stored in a database, allowing field-based queries on archives of PDF publications. Second, it is attached to the same PDF document as XMP metadata (XMP uses a subset of RDF). This promotes modularity, and the PDF is enriched with the semantics it initially lacked. Furthermore, the embedded XML can easily be retrieved from the original document using server-side scripting and converted to SVG [4] with XSLT transformations. The transformation applied customizes the document for different media, allows localization, and emphasizes the required subsets of data, for example by highlighting an article or underlining searched keywords. The end result is an interactive SVG document that contains both the objects of the original document and the semantic structures related to these objects.
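The separator filtering, text-run merging, and role assignment steps described above can be sketched roughly as follows. This is a minimal illustration in Python, not the plug-in's actual implementation: the `Line` and `TextRun` types, the function names, and all thresholds (minimum separator length, gap sizes, the font-size cutoff for titles, the "by " author prefix) are assumptions made for the example.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Line:  # hypothetical representation of a drawn line segment
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class TextRun:  # hypothetical representation of a PDF text run or merged block
    text: str
    x: float          # left edge
    y: float          # baseline position
    width: float
    font_size: float


def is_separator(line: Line, min_length: float = 50.0, tolerance: float = 1.0) -> bool:
    """Keep only near-horizontal or near-vertical lines that are long enough."""
    dx = abs(line.x1 - line.x0)
    dy = abs(line.y1 - line.y0)
    horizontal = dy <= tolerance and dx >= min_length
    vertical = dx <= tolerance and dy >= min_length
    return horizontal or vertical


def merge_runs(runs: list[TextRun], max_gap: float = 3.0) -> list[TextRun]:
    """Merge adjacent runs that share a baseline and font size into one block."""
    blocks: list[TextRun] = []
    for run in sorted(runs, key=lambda r: (r.y, r.x)):
        last = blocks[-1] if blocks else None
        gap = run.x - (last.x + last.width) if last else 0.0
        if (last is not None and last.y == run.y
                and last.font_size == run.font_size and gap <= max_gap):
            sep = " " if gap > 1.0 else ""  # assumed threshold for a word gap
            blocks[-1] = TextRun(last.text + sep + run.text, last.x, last.y,
                                 run.x + run.width - last.x, last.font_size)
        else:
            blocks.append(run)
    return blocks


def assign_role(block: TextRun) -> str:
    """Toy heuristic rules; a real profile would make these configurable."""
    if block.font_size >= 18:
        return "title"
    if block.text.lower().startswith("by "):
        return "author"
    return "article text"


runs = [
    TextRun("Sem", 0, 10, 30, 24), TextRun("antics", 30, 10, 50, 24),
    TextRun("By", 0, 40, 12, 10), TextRun("Jane", 15, 40, 20, 10),
]
blocks = merge_runs(runs)
roles = [(b.text, assign_role(b)) for b in blocks]
```

In this sketch the two partial-word runs "Sem" and "antics" are joined into a single title block, while "By" and "Jane" are joined with a space and labeled as an author line. A per-publication profile, as described above, would supply the thresholds instead of hard-coding them.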
Keywords