Text extraction from graphical objects during XML conversion

Track: Product Presentations

Audience Level: High Level/Technical View

Time: Tuesday, November 16 at 14:45

Author: Ryan Germann, Product Manager, Exegenix

Keywords: Conversion, Graphic, Legacy Data Conversion

Abstract:

Materials that include ornamentation and complex design features have long been challenging to convert to XML, even by hand. The problem is two-fold. First, complex documents usually contain a variety of graphics, some of which are simple ornamentation while others are fundamental to the subject matter. Second, these graphics can consist of images overlaid either with text that is integral to the image content or with actual body text. Script-based conversion tools cannot currently analyze such content and extract it into a meaningful order in the converted XML file, and tagging it manually is time-consuming and arduous.

Now, Exegenix simplifies the conversion of entire classes of documents via groundbreaking functionality that separates and extracts text layers from graphical objects. Building on the automated image processing capabilities already available in Exegenix Conversion Solutions (ECS), this functionality lets conversion operators with no XML expertise quickly and easily:

- differentiate between images that are "decoration" and those that are part of the main document content

- identify text that overlays an image as "part of the image" or "part of the main body text." The exported images include any text identified as integral to the image itself; any text identified as body content is excluded from the image, delivering a clean, republishable image.

This functionality is delivered via Exegenix's user-friendly conversion Inspector, which offers a unique WYSIWYG preview of the conversion process before any XML is generated. No XML expertise is required -- the ECS Inspector is intuitive enough for even a novice user.

We believe we are the only company to offer such graphical processing capability, and this is its first public presentation.