|
Table of contents | Author | City | Company | Country | State/Province | Term | Interchange | ![]() |
Khramov, Yuri
, Director of Development ,
Schema Software Inc.
,
Vancouver
British Columbia
Canada
Email: yurik@schemasoft.com
Web site:http://www.schemasoft.com
Yuri Khramov has more than 20 years of experience in the software industry; he is involved in XML and other WEB technologies for more than 5 years. He is one of the founding partners of SchemaSoft. Prior to that, he worked at Paradigm Development Corp. in Vancouver, Canada Graphica Corp. in Tokyo, and several industrial and Academic institutions in Moscow. He holds a Ph.D. in Computer Science from Moscow Management Institute. Yuri is a co-director of Vancouver XML Developers Association.
Kroogman, Alexander
, CTO , NewspaperDirect Inc.,
Vancouver
British Columbia
Canada
Email: alexk@newspaperdirectt.com
Web site:http://www.newspaperdirect.com
Mr. Kroogman is an IT professional with more than 16 years of project management, system design, applications programming and service delivery experience. He is now CTO of NewspaperDirect Inc., a company that comes with many innovative in newspaper content delivery area. Before that, he was working Orion Technologies Inc. (Orion provides electronic commerce services to financial institutions, governments and corporations), where Mr. Kroogman was Vice President of Product Delivery, and was responsible for technical project management of all Orion projects, including management and coordination of several development teams in different countries. Mr. Kroogman earlier worked with Rydex Industries Corporation, Argilla Corporation, and Lamb Cargate Corp. Mr. Kroogman has a Masters Degree in Chemical Engineering from the University of Chemical Engineering in Moscow.
PDF is now the digital format of choice for the publishing industry. Unfortunately, it does not preserve the semantics of the publication content. The conversion of PDF documents to an XML format that preserves presentation information and exposes semantics (for example, a XMLNews-Story or NITF ) is a solution much awaited by the industry. The article describes a semantics reconstruction process (document structure recognition) aimed at solving this problem. First, basic text and picture objects are extracted from PDF. We have to go beyond most of the PDF libraries to reconstruct as much of words and phrases as possible. Then, the recognition procedure begins. The main idea of the recognition algorithm is to split the procedure into several steps or layers from the recognition of basic “primitives” to the recognition of high-level semantic structures. On the basic level, it is done by “algebraic” recognition procedures using structural linguistic description of the fragments. They are encoded with specific sequences of symbols, representing fragment’s properties (words in a special alphabet). The procedure then uses a metric defined on the space of those sequences. On higher semantic levels rule-based approach is used. The recognition is iterative: findings from higher layers could be fed back to lower layers and so on. The whole process is regarded as a number of consecutive steps that narrow the range of possible final answers. The mathematical proof of consistency of the process has been developed. Interactive machine-learning procedures fit well into this recognition process. This semantics reconstruction process is a central part of the “Smart Translator” project that converts a PDF-based newspaper archive into a “smart” and efficient XML database. The other important part of the project is the creation of RDF-based metadata for more efficient search in the XML repository. In order to be practical, this procedure has to be automated. “Smart Translator” will be using specially designed XSLT style sheets to extract metadata from XML files with a specific schema. The details of this approach will also be discussed in the presentation.
PDF is now the digital format of choice for the publishing industry. Most newspapers and magazines go to printers in this format. PDF is also widely used for on-line publications. PDF can be created by any application with printing capability through a virtual printer driver. Viewers for PDF are also widely distributed with popular browsers. Thus, PDF serves as a universal file format, and the basis for a solution to file format incompatibility problems within publishing organizations.
Unfortunately, the universal mechanism used to create PDF is also its downfall. When PDF emerges from the virtual printer pipeline, the geometrical properties of its elements are computed and saved but most semantic structure or higher-level modeling of the data is discarded. For example, software routines will compute glyph coordinates, curve parameters, colours, etc. but discard information about the grouping of characters to make up words, sentences and stories; whether words represent author, title, caption, date or content; whether they are structured as TOC, index, list or table; what diagrams are associated with what stories; etc. Even though latest versions of PDF (especially 1.4) have powerful means to describe the semantic structure, they could not be stored during “virtual printing” to PDF file. There are also a number of software products that could assist user with making a semantic markup of a PDF file; this article, however, concentrates on autoimmunization of the process of extraction of the semantics from PDF files without any additional markup.
One of many real-life examples of the situation when the restoration of the semantics is crucial is a PDF-based newspaper archive. The documents there are produced usually without any structured information. Our project, called Smart Translator, is aimed to reconstruct missing semantic structure for PDF documents, store them in an XML format or formats, and provide sophisticated search, classification, delivery and presentation mechanisms based on the on the semantics of the information. To improve the efficiency of the search, we also plan to create RDF-like metadata information from XML files.
The work of the restoring semantic structure of the PDF documents starts with extracting graphical and text primitives from PDF file. Graphical primitives sometimes could be grouped or merged into one; the “first-level” groups could as well be grouped, etc. We will call these group physical layout objects. The samples of those objects of different levels are lines, images, text runs, as well as framed text blocks or rectangular blocks including both text and graphics. The layout objects form a tree that describes a page. At the initial stage of the reconstruction only leaf nodes (primitives) and a root node (a page object) are known.
All objects can have properties. The meaning of properties varies from font attributes, color, etc., to geometrical parameters of the primitive or a group. We also define a set of semantic roles that could be assigned to a layout object. Examples of the roles are “article”, “column of messages”, “article’s author”, “outlined quotation”, etc. These roles (and more exactly, sets of potentially possible roles) are treated as properties as well. The process of the recognition could be regarded as the reconstruction of the structural tree with assignment of exact semantic roles to all its nodes.
To describe formally the properties of layout objects and the relationship between them a set of so called page grammars has to be developed. These grammars will vary somehow between different types of publications (say, “classical” newspapers versus magazines versus tabloids) and provide the formalism for rule-based reasoning systems. The rule sets have to be specific for every publication, and could be developed initially by formalizing the rules like the following:
The first centered, capitalized, bold line of text following the subtitle of a Vancouver Sun article is the author, except in the case of regularly occurring columns, which have author name near a small picture and with a horizontal line underneath.
The recognition algorithm uses hierarchical or layered approach, and incorporates of machine-learning procedures. The algorithm should thread out two routs in the counter directions: bottom-up (from primitives to page) and top-down (from page to primitives). At some level of a hierarchy these routes overlap (most likely it would be “text block with framing” level, which could correspond to an article or its part) and should adjust each other to coincide and produce together a reasonable result. The key feature of the recognition algorithm that allows combining several approaches is the use of the specific knowledge-acquisition technique developed by author several years ago for Lota project. The process of recognition is treated as a process of narrowing the range of possible response. The formal description of the “behind the scene” logic of such a process and the proof of the consistency for the correspondent calculus could be found in
.
The procedure keeps track of all processed objects on all levels, and at receiving any new information tries to make a step in reconstructing a hierarchy, using the following possible actions:
- merge two “adjacent” nodes into one, merging their properties as well;
- determine parent-child relationship between two nodes, propagating their properties both ways (up and down) as appropriate;
- create a new object (node) as a parent for a group of objects using the “generalization” rules As soon as a new object is created, or properties of an existing object have been changed, the procedure tries to recognize it using rule-based approach. The recognition step then narrows the range of possible roles for the object and can, in turn, trigger a tree-reconstruction action.
The bottom-up reconstruction is based mainly on analysis of text objects’ properties such as glyphs’ foreground/background color, glyphs size, horizontal/vertical orientation, font type, style, metrics, etc. This analysis should result in recognition of several distinguishable groups of text objects on a page with similar attributes in each group. Text objects from different groups are supposed to be semantically different at the lowest level, though it obviously doesn’t mean that all text objects in one group are semantically identical – for example, different articles may use exactly the same style for their body text. However, a set of text attributes is the most important criterion for primary clustering of content-bearing elements in PDF document.
The primary goal of text objects grouping is to form as large meaningful pieces of text data as possible. Since we don’t recognize the real sense of words or phrases at this stage, the meaningful piece in this context means a piece of text that looks meaningful: it comprises text objects with similar attributes and is arranged on a page in such a way that it could be supposed to be a (part of) really meaningful text. It might be a separate headline, a footnote, a text column (may be incomplete). Ideally meaningful piece of text should consist of several uniformly arranged lines of text. The main assumptions of merging algorithm are that a meaningful piece of text:
The first one is the most obvious. The second assumptions serves to distinguish for example article’s title from article’s body underneath. However, it doesn’t mean that an article body uses only a single font to represent all of its text objects. Some words in a body may be highlighted by color, character size, font style (bold, italic) or font effects (underline, stroked, etc.). The associating rule must be tolerant towards the reasonable small variations of text attributes. The last assumption facilitates analysis and further merging. Actually article text or even words in a column may spot a more complicated area than a rectangle. It happens usually when text wraps around a picture. In this case a shaped text column could be represented as a sequence of adjacent rectangular text pieces.
When algorithm merges lower-level objects into an upper-level one, it merges their properties as well. It helps to find similar objects when associate them on this new level. Since objects on neighboring levels (e.g. simple PDF text objects and meaningful pieces of text built of them) are semantically different, the algorithm summarizes and assigns averaged properties, whenever it creates a new upper-level object. Thus, bounding box of meaningful piece of text simply units all comprised text boxes, while its font type information should be calculated as an “aggregation” of its components’ fonts using some simple rules. On the next level the algorithm units adjacent rectangular meaningful pieces of text into text columns or articles, which may have more complicated boundary than a simple box.
The main goal of top-down analysis of newspaper’s content is to propose to bottom-up branch of the algorithm most probable spatial cells where meaningful pieces of text should locate. This part of the algorithm takes into consideration PDF graphic objects mostly: outlines such as various lines or rectangles, and spatial 2D objects (images, forms) that spot the page. All these objects serve as markers that divide a page’s space into separate areas that contain text. At the top of the hierarchy is the page itself. At next level it could be divided into header, footer and main area, which is divided next into sections, articles, etc. Analysis of auxiliary page’s graphics helps to determine these structural components.
In general case it is very complicated task. However, some assumptions about newspaper’s graphics reduce the complexity. First, the set of expected graphic objects on a page is not too rich: it includes straight lines of various styles and rectangles mainly. Then, almost every line graphic element on a page is vertically or horizontally aligned. These assumptions enable simple representation and effective processing of auxiliary graphic elements. Such tasks as storing and looking-up coordinates, finding of intersections, determining of nesting or positional relationship of objects become much easier.
To facilitate top-down reconstruction, we build so called X-Y tree representation for the page geometrical configuration. This is spatial data structure in which each node corresponds to a rectangular block. The children of each node represent the locations of the subdivisions of the parent block in a particular (horizontal or vertical) direction. The building of this tree is not completely straightforward, as many dividers in the newspaper do not touch each other, some blocks do not have rectangular form. We use a number of heuristics to get the results that may bring in an additional (non-existing) partition, but minimize the risk of putting parts of semantically different objects together into one X-Y block.
At the point when the recognition process is not able to add any new information to the tree (automatic recognition is finished), we can assign unique roles to the nodes that still have more than one possible role using some algorithm. To do so, we should assume that we have very powerful recognition system just from the very beginning. Luckily, machine learning allows us to develop the procedure that builds the system gradually and almost automatically, using so called “learning with a teacher” approach.
After the automatic recognition step, the results of the recognition are presented to an operator, which acts as a teacher for the recognition system. For every object in the semantic tree, there are three possible situations.
A single semantic role has been recognized and the operator agrees with this response. No action is necessary.
A set of possible semantic roles has been selected for an object and it includes the correct response, as perceived by the operator. The operator then selects this single role. The system automatically forms a new rule based on this precedent and adds it to the rule base
A correct response is outside the range of roles determined by the system. In this case, the operator enters the correct response and the system forms the rule based on the precedent and adds it to the database. At this point, however, it contradicts some rules already stored in the system. The system will be able to find conflicting rule, narrows the conditions under which it is true, and then adds new “generalized” rule that combines these two.
The learning procedure described above allows building elaborated rule systems relatively quickly and without programming efforts. The mathematical model for such a procedure with the proof of correctness of the procedure has been developed by one of the authors during his work on ARVECS recognition system
. It will be awailable as a "white page" on our site
. The set of representative samples is crucial, however for such a procedure. In our case, when we are building rule base for a particular newspaper, the learning could be done on few newspaper issues.
Eventually, the recognition procedure will reconstruct semantic tree for the newspaper. The conversion of a semantically loaded tree into an XML document could be regarded as a more or less routine procedure. For now, we have selected XMLNews-Story, which is subset of NITF, as a destination XML schema. In order to preserve the presentation information from PDF file, we are extending XMLNews-Story schema with tags that ties some elements to the specific addressable areas (for now, pages) in the PDF documents. In addition, we are developing a specific CSS style sheets to partially preserve newspaper’s look and feel in an XML file.
The next important task is to extract meta-data from the XML files to make the search in the database more effective. We started our work on the XSLT style sheets that would extract RDF (more exactly, RSS) information from other XML documents last spring. On April 10, 2001 it has been presented at Vancouver XML Developers association meeting. This idea gains more and more popularity, and the similar approach with more details is described by Eric van der Vlist's paper "Building a Semantic Web site"( .
In addition to the described components, the Smart Translator project will include the software to store both PDF and XML in an efficient way, and means to deliver the content to the user over the net. The project is still in the investigation phase by now, but we hope to present first solid results soon for the recognition part of the project.
We thank Mr. Boris Prokofiev, of Schema Software Inc., who had contributed a lot to the development of the ideas discussed in this paper.
[YK90] Yuri Khramov. Knowledge Acquisition Process in Lota Expert System shell. Proceedings of “Expert System” conference. 1990, Souzdal, Russia (in Russian)
[NE92] G.Negy et al. A Prototype Document Image Analysis System for Technical Journals. Computer, July 1992
[YKU] Yuri Khramov. Interactive learning procedure in strucutral recognition. Unpublished.
[YX98] Yi Xu. An Incremental Approach to Document Structure Recognition. http://www.gmd.de/publications/research/1998/016/
|
Table of contents | Author | City | Company | Country | State/Province | Term | Interchange | ![]() |