Text extraction from graphical objects during XML conversion

Keywords: Conversion, Graphic, Legacy Data Conversion

Ryan Germann
Product Manager
Exegenix
Toronto
Ontario
Canada
ryan@exegenix.com

Biography

Ryan is Product Manager and a founding employee of Exegenix Inc., a company providing content access and conversion technologies. He is involved in market research and the strategic aspects of developing practical applications of Exegenix technology. His professional focus is on facilitating broad adoption of XML content and XML-enabled applications across organizations of all sizes. Since 1995, Ryan has been involved in SGML and XML projects, lending his expertise to both client-side and server-side components. Ryan was previously employed at SoftQuad Software where he was involved in Product Management, Marketing, Web site development and user interface design.


Abstract


Materials that include ornamentation and complex design features have long been challenging to convert to XML, even by hand. The problem is twofold. First, complex documents usually contain a variety of graphics, some of which are simple ornamentation while others are fundamental to the subject matter. Second, these graphics can consist of images overlaid either with text that is integral to the image content or with actual body text. Analyzing and extracting such content into a meaningful order in the converted XML file is not currently possible with script-based conversion tools, and tagging it manually is time-consuming and arduous.

Exegenix now simplifies the conversion of entire classes of documents with new functionality that separates and extracts text layers from graphical objects. This functionality adds to the automated image processing capabilities already available in Exegenix Conversion Solutions (ECS). Conversion operators with no XML expertise can now quickly and easily:

- differentiate between images that are "decoration" and those that are part of the main document content

- classify text that overlays an image as either "part of the image" or "part of the main body text". An exported image includes any text identified as integral to the image itself; text identified as body content is excluded from the image, which yields a clean, republishable image.

This functionality is delivered via Exegenix's user-friendly conversion Inspector, which offers a unique WYSIWYG preview into the conversion process before any XML is generated. No XML expertise is required: the ECS Inspector is intuitive enough for even a novice user.

We believe we are the only company to offer such graphical processing capability, and this is its first public presentation.
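As an illustration only, the two operator decisions described in this abstract (decoration versus content for each image, and image text versus body text for each overlay) can be pictured as a small data model. The following Python sketch is not Exegenix's implementation or API. Every name in it (`ImageRole`, `TextRole`, `OverlayText`, `Graphic`, `split_graphic`) is hypothetical, and it assumes that decorative images are simply not exported, which the abstract does not state.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class ImageRole(Enum):
    """Operator's decision about a graphic (hypothetical labels)."""
    DECORATION = "decoration"
    CONTENT = "content"


class TextRole(Enum):
    """Operator's decision about a piece of text overlaying a graphic."""
    PART_OF_IMAGE = "part_of_image"
    BODY_TEXT = "body_text"


@dataclass
class OverlayText:
    text: str
    role: TextRole


@dataclass
class Graphic:
    name: str
    role: ImageRole
    overlays: List[OverlayText] = field(default_factory=list)


def split_graphic(graphic: Graphic) -> Tuple[Optional[List[str]], List[str]]:
    """Separate one graphic's overlay text according to the operator's choices.

    Returns (image_text, body_text):
      image_text: overlay strings kept inside the exported image, or None
                  when the graphic is decoration (assumed here: not exported).
      body_text:  overlay strings routed to the main document flow.
    """
    body_text = [o.text for o in graphic.overlays
                 if o.role is TextRole.BODY_TEXT]
    if graphic.role is ImageRole.DECORATION:
        return None, body_text
    image_text = [o.text for o in graphic.overlays
                  if o.role is TextRole.PART_OF_IMAGE]
    return image_text, body_text
```

In this sketch, a content graphic with one label marked as part of the image and one paragraph marked as body text would be exported with only the label, while the paragraph would pass to the surrounding XML content.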


1. Product Presentation Paper

Since this was a product presentation, no paper was prepared for the proceedings.

XHTML rendition made possible by SchemaSoft's Document Interpreter™ technology.