XML 2002

Adding XML Capability to a Legacy Large-Scale Full-Text Indexing System

Abstract

LexisNexis has been an Electronic Publisher of Legal Research data (Lexis) since 1973, and of News and Business data (Nexis) since 1979. The LexisNexis data warehouse contains over 3 billion documents, all of which have been full-text searchable since the inception of the company. In 1990, we began to add value to the on-line data by introducing Term-based Topical Indexing (TTI) and Term Mapping (TMS). TTI and TMS automatically classify documents into categories that are similar to topic maps. This classification goes beyond sorting sources into categories: it assigns categories to individual documents.

In 1992, LexisNexis introduced Company Indexing into its data workflow. Company Indexing scans data for relevant company names, their variants and disambiguation clues, and inserts Controlled Vocabulary Terms (CVTs) and relevance scores into the document. This lets the customer retrieve documents that are about a given company while excluding documents that mention the company only once. Company Indexing has since been expanded to organizations, places, people, subjects and industries.
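The idea behind Company Indexing can be illustrated with a minimal sketch. It is not the production algorithm: the variant table, the company names, the scoring rule (30 points per mention, capped at 100) and the threshold are all assumptions made for illustration, and disambiguation clues are left out entirely.

```python
import re
from dataclasses import dataclass

# Hypothetical authority list: each Controlled Vocabulary Term (CVT) maps to
# the name variants that should count as a mention of that company.
COMPANY_VARIANTS = {
    "ACME CORP": ["Acme Corporation", "Acme Corp", "Acme"],
    "GLOBEX INC": ["Globex Incorporated", "Globex"],
}


@dataclass
class IndexTerm:
    cvt: str        # controlled vocabulary term to insert into the document
    mentions: int   # number of variant matches found in the text
    score: int      # toy relevance score, 0-100


def count_mentions(text: str, variants: list[str]) -> int:
    # Try longer variants first so "Acme Corporation" is one match,
    # not a match for "Acme" followed by leftover text.
    ordered = sorted(variants, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(v) for v in ordered) + r")\b"
    return len(re.findall(pattern, text))


def index_companies(text: str, min_score: int = 50) -> list[IndexTerm]:
    terms = []
    for cvt, variants in COMPANY_VARIANTS.items():
        mentions = count_mentions(text, variants)
        score = min(100, 30 * mentions)  # assumed scoring rule, illustration only
        if mentions and score >= min_score:
            terms.append(IndexTerm(cvt, mentions, score))
    return terms


text = (
    "Acme Corporation reported record earnings. Acme said demand was strong, "
    "and analysts expect Acme Corp to expand. Globex also attended the briefing."
)
for term in index_companies(text):
    print(term)
```

With the default threshold, the sample text yields a term for ACME CORP (three mentions) but none for GLOBEX INC (one mention), which mirrors the distinction between documents about a company and documents that only mention it.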

When LexisNexis decided to move to storing its data in XML, the move presented both challenges and advantages to the indexing team. The primary challenge of migrating to XML was that LexisNexis had always used a proprietary markup called VISF. All of the existing systems knew VISF and were able to parse and process it. The indexing and classification code was written to process that markup on MVS systems using PL/1 and lexical scanners specific to the markup. The algorithms that had been developed and refined over many years were a valuable part of the indexing systems. Because of the number of documents processed each day and the knowledge embodied in the existing systems, those systems had to be modified to read, process and write well-formed XML data while still being able to process our legacy content markup.

The use of XML has offered two main advantages to the indexing effort. First, the use of XML lets us use standard XML tools and APIs for development. More importantly, XML tags make the identification of semantic roles of content components easier, which gives any indexing effort a big head start.
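As a rough illustration of why tagged content helps, the sketch below uses a standard XML API (Python's `xml.etree.ElementTree`) to weight a company mention by the element it appears in. The element names (`headline`, `byline`, `body`) and the weights are hypothetical and are not taken from the LexisNexis document types.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element names are assumptions for illustration.
SAMPLE = """\
<article>
  <headline>Acme Corp to acquire Globex</headline>
  <byline>Acme Corp press office</byline>
  <body>
    <p>Acme Corp announced the deal on Monday.</p>
    <p>Shareholders of Acme Corp will vote in May.</p>
  </body>
</article>"""

# Assumed weights: a headline mention counts more than a body mention,
# and elements not listed here (such as the byline) are ignored.
ZONE_WEIGHTS = {"headline": 3, "body": 1}


def zone_text(root: ET.Element, tag: str) -> str:
    """Return the whitespace-normalized text of the first child named tag."""
    element = root.find(tag)
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def weighted_mentions(xml_text: str, name: str) -> int:
    """Count occurrences of name, weighted by the element they occur in."""
    root = ET.fromstring(xml_text)
    return sum(
        weight * zone_text(root, tag).count(name)
        for tag, weight in ZONE_WEIGHTS.items()
    )


print(weighted_mentions(SAMPLE, "Acme Corp"))
print(weighted_mentions(SAMPLE, "Globex"))
```

Because the parser exposes the headline and body as separate elements, the weighting needs no markup-specific lexical scanner; the semantic role of each piece of text comes directly from its tag.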

The incredible extensibility of XML data presented a challenge to the developers from the very start of the migration effort. No longer were they dealing with a simple, known markup format. Data consistency became an important issue. This presentation will further describe these challenges and how namespaces and data consistency simplified the processing of XML data.
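One way namespaces can simplify processing is by keeping elements inserted by indexing apart from source content that happens to use the same element name. The sketch below assumes two made-up namespace URIs and a `term` element in each; none of these names come from the LexisNexis schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, chosen only for this example.
NS = {
    "doc": "urn:example:document",
    "idx": "urn:example:indexing",
}

SAMPLE = """\
<doc:article xmlns:doc="urn:example:document" xmlns:idx="urn:example:indexing">
  <doc:body>The contract defines <doc:term>force majeure</doc:term> broadly.</doc:body>
  <idx:term score="90">ACME CORP</idx:term>
</doc:article>"""


def indexing_terms(xml_text: str) -> list[tuple[str, int]]:
    """Return (term, score) pairs for elements in the indexing namespace only."""
    root = ET.fromstring(xml_text)
    return [
        (element.text, int(element.get("score")))
        for element in root.findall(".//idx:term", NS)
    ]


def content_terms(xml_text: str) -> list[str]:
    """Return the text of term elements that belong to the source content."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall(".//doc:term", NS)]


print(indexing_terms(SAMPLE))
print(content_terms(SAMPLE))
```

Both elements have the local name `term`, yet a namespace-qualified lookup returns only the one the caller asked for, so indexing code does not need to know every element a content provider might define.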

Keywords