XML 2002 logo

Adding XML Capability to a Legacy Large-Scale Full-Text Indexing System

Abstract

LexisNexis has been an electronic publisher of legal research data (Lexis) since 1973, and of news and business data (Nexis) since 1979. The LexisNexis data warehouse contains over 3 billion documents, all of which have been full-text searchable since the inception of the company. In 1992, we began to add value to the on-line data with the introduction of Term-based Topical Indexing (TTI) and Term Mapping (TMS). TTI and TMS are used to automatically classify documents into categories that are similar to topic maps. This classification process goes beyond separating sources into categories to the more detailed process of classifying individual documents into categories.

In 1994, LexisNexis introduced Company Indexing into our data workflow. Company Indexing scans data for relevant/important company names, variants and disambiguation clues, and inserts Controlled Vocabulary Terms (CVTs) and relevance scores into the document. The advantage is that customers can retrieve documents that are about a given company, while excluding documents that only mention the company once. Company Indexing has since been expanded to organizations, places, people, subjects and industries.

When LexisNexis decided to make the move to storing their data in XML, it presented both challenges and advantages to the indexing team. The primary challenge of migrating to XML was that LexisNexis had always used proprietary markup called VISF. All of the existing systems knew VISF and were able to parse and process it. The indexing and classification code was written to process that markup on MVS systems using PL/1 and lexical scanners specific to the markup. The algorithms that had been developed and refined over many years were a valuable part of the indexing systems. The number of documents that were processed each day and the knowledge of the existing systems necessitated modifying those systems to be able to read, process and write well-formed XML data while still being able to process our legacy content markup.

The use of XML has offered two main advantages to the indexing effort. First, the use of XML lets us use standard XML tools and APIs for development. More importantly, XML tags make the identification of semantic roles of content components easier, which gives any indexing effort a big head start.

The incredible extensibility of XML data presented a challenge to the developers from the very start of the migration effort. No longer were they dealing with a simple, known markup format. Data consistency became an important issue. This presentation will further describe these challenges and how namespaces and data consistency simplified the processing of XML data.

Table of Contents

1. About LexisNexis
2. The LexisNexis Search Engine: A User's View
3. LexisNexis SmartIndexing Technology™ and Other Classification Processes
4. LexisNexis' Legacy Content Markup
5. The Challenges and Advantages of Migrating to XML
6. Lessons Learned
Acknowledgements
Glossary
Biography

1. About LexisNexis

LexisNexis has been an on-line information provider since the 1970's. The company was formed in the 1960's as Data Corporation. In 1968, Mead Corporation acquired Data Corporation and renamed it Mead Data Central. In 1973, we introduced the Lexis service, the first commercial, full-text legal information service to help legal practitioners research the law more efficiently. In 1979, we introduced the Nexis service, adding newspapers, magazines and other business information. Reed Elsevier, an Anglo-Dutch partnership, acquired Mead Data Central in 1994 and renamed the company LexisNexis. Our customers come from legal, corporate, government and academic markets in North America, the British Commonwealth, continental Europe and Latin America. Over time, LexisNexis has expanded its product offerings beyond on-line systems. Some of our better-known print brands include Butterworths, Martindale-Hubbell, Matthew Bender, Mealeys, Michie and Shephards.

LexisNexis' customers primarily access our services via the World Wide Web; two-thirds of the company's searches and search revenue come from the Web. Only a few years ago, however, most of our business came through proprietary dial-up software. We currently have about 2.6 million subscribers who either purchase a site license or pay on a per-document or per-search basis.

LexisNexis' Data Center is located in Miamisburg, Ohio. It is about the size of two football fields and is on-line 24 hours a day, 7 days a week and 365 days a year. Our on-line search and retrieval system is an IBM mainframe based system composed of nine large mainframe servers running 22 multiple virtual server operating systems. We also operate 300 mid-range UNIX servers, 400 multi-processor NT servers and 600 web servers to deliver millions of Web pages each month. The LexisNexis Data Warehouse contains around 32 terabytes of searchable content. It contains over 12,000 on-line databases consisting of over 32,000 legal, business, news and public record sources. There are over 3.1 billion documents on-line and approximately 18 million new documents are added every week. This system handles, on average, around 700,000 searches per day with an average retrieval time of five seconds. At peak times, the systems have handled as many as 1.7 million searches per day.

How do you convert a legacy indexing system that has been running 24/7 for nearly ten years to take advantage of eXtensible Markup Language (XML)?

2. The LexisNexis Search Engine: A User's View

The LexisNexis search engine was the first commercial, full-text search engine. The base technology of the system is a boolean search engine that allows search operators such as:

  • Boolean AND

  • Boolean OR

  • Term A W/n Term B - Find Term A within n words of Term B

  • Term A W/s Term B - Find Term A within the same sentence as Term B

  • Term A W/p Term B - Find Term A within the same paragraph as Term B

  • Term A PRE/n Term B - Find Term A preceding Term B by no more than n words
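The proximity operators above can be illustrated with a minimal Python sketch that works on word positions. This is only an illustration, not the LexisNexis engine: the helper names (`tokenize`, `within`, `precedes`) and the way distance between words is counted are assumptions made for the example.

```python
import re

def tokenize(text):
    """Split text into lower-case word tokens, in document order."""
    return re.findall(r"[a-z0-9']+", text.lower())

def positions(tokens, term):
    """Return the index of every occurrence of term."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def within(tokens, a, b, n):
    """A W/n B: some a lies within n words of some b, in either order."""
    pos_b = positions(tokens, b)
    return any(i != j and abs(i - j) <= n
               for i in positions(tokens, a) for j in pos_b)

def precedes(tokens, a, b, n):
    """A PRE/n B: some a comes before some b by no more than n words."""
    pos_b = positions(tokens, b)
    return any(0 < j - i <= n
               for i in positions(tokens, a) for j in pos_b)

tokens = tokenize("Superman nabs Lex Luthor, foils latest plot")
print(within(tokens, "superman", "luthor", 3))   # order does not matter
print(precedes(tokens, "luthor", "lex", 5))      # order matters
```

W/s and W/p would follow the same pattern, with sentence or paragraph numbers recorded alongside each word position.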

There are also a number of wildcard characters to enhance searching. The search system adds more sophistication by providing precision search commands such as:

  • ATLEASTn(term) - Return only documents that contain at least n occurrences of the term

  • PLURAL(term) - Find only the plural form of the term

  • SINGULAR(term) - Find only the singular form of the term

  • CAPS(term) - Term or terms must have one or more capital letters

  • ALLCAPS(term) - Term or terms must be all in capital letters

  • NOCAPS(term) - Term or terms must not contain any capital letters
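As a rough illustration of the precision commands listed above, the following Python sketch treats each command as a predicate over the occurrences of a term in a document. The function names mirror the commands but are hypothetical, and the matching rules (case-insensitive lookup, simple word splitting) are assumptions for the example.

```python
import re

def occurrences(text, term):
    """Every occurrence of term in text, matched without regard to case."""
    return [w for w in re.findall(r"[A-Za-z0-9']+", text)
            if w.lower() == term.lower()]

def atleast(text, term, n):
    """ATLEASTn(term): the term occurs at least n times."""
    return len(occurrences(text, term)) >= n

def caps(text, term):
    """CAPS(term): some occurrence has one or more capital letters."""
    return any(any(c.isupper() for c in w) for w in occurrences(text, term))

def allcaps(text, term):
    """ALLCAPS(term): some occurrence is entirely in capital letters."""
    return any(w.isupper() for w in occurrences(text, term))

def nocaps(text, term):
    """NOCAPS(term): some occurrence contains no capital letters."""
    return any(w.islower() for w in occurrences(text, term))

doc = "AIDS research funds aids to hospitals; Aids groups agree."
print(atleast(doc, "aids", 3), allcaps(doc, "aids"), nocaps(doc, "aids"))
```

The example document shows why these commands matter: the same letters can be an acronym, a verb, or the first word of a sentence, and only the capitalization tells them apart.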

Finally, the system allows the user to contextually constrain a search to specific components (called segments) in the documents. This system allows the construction of sophisticated search strings such as:

superman w/5 justice league of america and byline(clark w/2 kent or lois w/2 lane) and headline(capture! or nab! or foil! or jail! or defeat!)

The search system works in conjunction with a sophisticated keywording system for words, numbers, dates and so on. The keywording system also inserts alternate keywords into the data to simplify the handling of plurals, months associated with a season and so on.

Another component of the LexisNexis search engine is a Freestyle search system. Freestyle Searching uses a technology called associative retrieval that lets the user enter a search description in plain English, just the way they might ask someone a question. Natural language can be used in virtually all LexisNexis services. It lets you create your own search description while removing the need for connectors and search logic. Users can enter searches such as the following: "Under what circumstances can biological parents regain custody of their adopted children after an adoption?" The LexisNexis Freestyle search system was introduced in 1993 and predates the "Ask Jeeves" Internet search engine by approximately 4 years.

The most recent addition to the LexisNexis search system is a "Sounds Like" search engine that is particularly helpful for customers that are doing research in public records databases. This allows customers to simplify their searches - instead of searching for Paul Brown or Paul Braun or Paul Browne, they can do a "sounds like" search for Paul Brown.

3. LexisNexis SmartIndexing Technology™ and Other Classification Processes

In the workflow of a document that is received as part of a feed from one of our information providers and prepared for incorporation into our databases, XML is playing an increasing role in many different stages. XML offers particularly interesting benefits and challenges as it is incorporated into the Indexing and Classification systems used by LexisNexis to add value to the data. These systems add index terms to News, Company, Legislative and Nexis web search sources.

Indexing began at LexisNexis in 1992 with Term-Based Topical Indexing (TTI) and Term Mapping System (TMS). TTI and TMS work together to automatically classify documents into categories that are similar to topic maps. They go beyond separating sources into categories to the more detailed process of separating individual documents into categories. These categories allow customers to focus their searches on documents that they know for certain relate to what they are looking for. The more than 2,100 categories include:

  • Business and Finance

  • Computers and Communications

  • Energy

  • Entertainment

  • Insurance

  • Marketing

Every day, the LexisNexis SmartIndexing Technology™ system processes tens of thousands of news, company and legislative documents on an IBM mainframe-based platform using programs written primarily in PL/1. These programs reference Concept Definition rule bases which are created and updated by a team of information professionals and subject matter experts. The Concept Definitions are used by the indexing programs to look for names, nicknames, acronyms, spelling variants and related words and phrases, and to assign a controlled vocabulary term to such items within a document. For example, a program finds "Big Blue" and assigns a Company CVT of "International Business Machines Corp" or sees "CEO steps down" and assigns a Subject CVT of "Executive Moves". Concept Definitions contain lists of "look-up" terms and phrases as well as rules that assign various positive weights (and negative weights where needed) to the look-up terms. They also contain "block terms" that prevent the look-up terms from being erroneously matched within longer ambiguous phrases, and "frequency limits" that ensure that documents containing denser discussions of a given topic receive higher relevance scores.
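To make the interplay of look-up terms, weights and block terms more concrete, here is a minimal Python sketch. It is not the production PL/1 code or the real rule-base format: the `ConceptDefinition` class, the `raw_score` function and the sample weights are all hypothetical, and frequency limits are deliberately left out.

```python
import re
from dataclasses import dataclass, field

@dataclass
class ConceptDefinition:
    """Hypothetical, simplified stand-in for a Concept Definition rule base."""
    cvt: str                  # controlled vocabulary term to assign
    lookup_weights: dict      # look-up phrase -> weight (may be negative)
    block_terms: list = field(default_factory=list)

def _phrase_pattern(phrase):
    """Whole-word, lower-case pattern for a phrase."""
    return re.compile(r"\b" + re.escape(phrase.lower()) + r"\b")

def raw_score(text, concept):
    """Sum the weights of all look-up hits that are not inside a block term."""
    lowered = text.lower()
    # Blank out block terms first so look-up terms inside them cannot match.
    for block in concept.block_terms:
        lowered = _phrase_pattern(block).sub(" ", lowered)
    return sum(weight * len(_phrase_pattern(phrase).findall(lowered))
               for phrase, weight in concept.lookup_weights.items())

ibm = ConceptDefinition(
    cvt="International Business Machines Corp",
    lookup_weights={"ibm": 10, "big blue": 5},   # illustrative weights only
    block_terms=["big blue sky"],
)
print(raw_score("Big Blue reported earnings. IBM shares rose.", ibm))
print(raw_score("A big blue sky over the city.", ibm))
```

In this sketch the second document scores nothing for the company because its only look-up hit sits inside a block term, which is the kind of erroneous match that block terms exist to prevent.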

SmartIndexing Technology™ grew out of several components that have been used for many years to classify documents at LexisNexis. In 1994, LexisNexis introduced a Company Indexing system that later expanded into an Entity Indexing system. This system scans data for organizations, companies, geographical locations and people, and assigns controlled forms of the entity names as Controlled Vocabulary Terms (CVTs), along with associated attributes such as SIC codes and ticker symbols. The lists of Controlled Vocabulary Terms for Entities include over 330,000 companies, more than 10,000 organizations, in excess of 20,000 people and 800 or more places. A CVT in combination with its relevance score can be thought of as content metadata.

The Entity Indexing look-up process was the first to calculate relevance scores on a range of 50% to 99%. This innovation allows users to define their search criteria to retrieve all of, most of, or only the most highly relevant documents indexed with the target entity. The indexing system's selection criteria thus allow a search to return the highest number of possibly relevant documents (recall) while also maximizing the percentage of those documents that the user judges as relevant (precision). For example, a search like "COMPANY(Microsoft Corporation pre/2 9*%)" will find all documents containing references to Microsoft Corporation with a relevance score of 90% or higher. This search will return the most highly relevant documents about Microsoft.
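The effect of a score pattern such as 9*% can be sketched in a few lines of Python. This is an illustration only: the data layout (a list of CVT and score pairs) and the function names are hypothetical, and the sketch assumes that "*" stands for exactly one character and "!" for any ending, which the search examples in this paper suggest but do not spell out.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a search wildcard pattern into a compiled regular expression.

    Assumption: "*" matches exactly one character, "!" matches any ending.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".")
        elif ch == "!":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts) + r"\Z")

def select(cvts, name, score_pattern):
    """Keep the (name, score) pairs whose score text matches the pattern."""
    rx = wildcard_to_regex(score_pattern)
    return [(n, s) for n, s in cvts if n == name and rx.match(f"{s}%")]

# Hypothetical index terms attached to three different documents.
cvts = [
    ("Microsoft Corporation", 93),
    ("Microsoft Corporation", 72),
    ("Apple Computer Inc", 95),
]
print(select(cvts, "Microsoft Corporation", "9*%"))
```

Loosening the pattern (for example to accept scores in the 80s as well) trades precision for recall, which is the choice the relevance scores are meant to give the user.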

Another component of LexisNexis SmartIndexing Technology™ is the General Subject Indexing System (Nexis Subject Indexing (NSI)). NSI was introduced in 1997 to index for general subjects in any publication, regardless of vendor-supplied terms. This system is also known as Topical Indexing. The initial focus for NSI was to index Market Research Report databases for any reference to Industry and Marketing topics. It also indexes for State and Federal Legislative topics. Its coverage has since been expanded to news, company and legislative documents. NSI shares thesauri with Term Mapping and Entity Indexing.

LexisNexis' most recent addition to the SmartIndexing Technology™ system is NetOwl Indexing. NetOwl Indexing is based on a commercial-off-the-shelf package that LexisNexis ported to their mainframe indexing systems in 1998. NetOwl uses a combination of patterns, lexicons and aliasing capabilities to find things in a document that look like names of entities. A pattern may look for capitalized words around "Corp." or "Corporation" or "Inc.". A lexicon may be used to find names of people or places. Aliasing uses a process of linking variant forms of the same name together using linguistic co-reference and other techniques. NetOwl Indexing has added over 2 million extracted entities to LexisNexis' on-line database.

The addition of index terms to LexisNexis' on-line documents has made it possible to build topic-map-like "Dossier" products for areas such as Companies and Industries. The Company Dossier product uses index terms in a wide variety of data to build a single document for a company containing that company's news, executive information, stock information and SEC filings, related court cases, intellectual property and patents as well as information on competitors, subsidiaries, brands and so on.

4. LexisNexis' Legacy Content Markup

LexisNexis data has had a proprietary markup format called Variable Input String Format (VISF) since the inception of our on-line service in 1973. VISF is based on a system of "segments" and "dollar commands" that divide a document up into specific components, provide information about how the document is to be displayed, and allow documents to cross-reference other information such as legal citations and other documents in the LexisNexis data warehouse. VISF "segments" break a document down into specific contextual components such as headline, byline, copyright, body and so on. A segment marker indicates the start of a component; the end of a component is indicated by the start of the next segment. In later years, VISF even included the concept of "Standard Generalized Markup Language (SGML) Islands" within the data to help provide additional structure to the documents. Here is an example of a simple news story as marked up in VISF:

$00:0000000001:$01:Copyright 2002, Planet News Service$10:The Daily Planet$60:
Superman nabs Lex Luthor, foils latest plot$90:Clark Kent$120:With his latest 
arrest of Lex Luthor, Superman continues to be the salvation of Metropolis. Su
perman caught Luthor in the act of breaking into the vault of the Mid-City Ban
k. When asked why he was there, Luthor told this reporter that he was on an ar
cheological dig and ended up in the vault by mistake.$250:October 3, 2003
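The "a segment ends where the next one begins" model can be sketched in Python as a single split on the segment markers. This is a simplified illustration rather than the production lexical scanner: it assumes a marker always has the form of a dollar sign, digits and a colon, it ignores dollar commands, and the `parse_visf` name is hypothetical.

```python
import re

# A segment marker: "$", a segment number, then ":".
SEGMENT_START = re.compile(r"\$(\d+):")

def parse_visf(record):
    """Return (segment number, content) pairs in document order."""
    pieces = SEGMENT_START.split(record)
    # pieces[0] is any text before the first marker; after that the list
    # alternates between captured segment numbers and segment contents.
    numbers = pieces[1::2]
    contents = pieces[2::2]
    return [(int(num), text) for num, text in zip(numbers, contents)]

record = ("$01:Copyright 2002, Planet News Service$10:The Daily Planet"
          "$60:Superman nabs Lex Luthor, foils latest plot$90:Clark Kent")
for number, content in parse_visf(record):
    print(number, content)
```

Note that nothing in the record closes a segment explicitly, so the parser never needs to track nesting. That flat model is what the existing indexing programs were built around.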

VISF served the company very well for many years. All of our systems know how to process, store and deliver data in VISF format. Despite our comfort with VISF, several developments made us realize the limitations that it placed on us:

  • More and more of our data was being delivered to the Web, and transformation of VISF to HyperText Markup Language (HTML) was not intuitive

  • More customers were sending us data in XML or SGML and we were translating it to VISF for processing and storage

  • The company was acquiring print publishers, but we had no easy way of converting VISF to print formats

  • We were starting to form partnerships with other companies to create derivative products based upon our data, but we couldn't offer those partners any standard tools for working with LexisNexis data.

Starting around 2000, a business case was made to begin storing documents in XML. While this architectural direction presents many advantages to LexisNexis going into the 21st century, it poses a great number of challenges as well.

5. The Challenges and Advantages of Migrating to XML

While many departments within LexisNexis faced challenges migrating their systems to process XML, this paper will focus on the advantages and challenges experienced by the SmartIndexing team. First, I would like to talk about the advantages gained by migrating the SmartIndexing systems to XML. The primary advantage that we found was that XML gave us highly structured documents whose markup carries some contextual metadata about the elements. In VISF, our control files would indicate that we had to, for example, process segments $10:, $100:, $119: and $120:. For similar XML data, we would be processing <lnv:BYLINE>, <lnv:HEADLINE>, <lnv:REAL-LEAD> and <lnv:BODY-1>.

Context          VISF     XML
Byline           $10:     lnv:BYLINE
Headline         $100:    lnv:HEADLINE
Document Lead    $119:    lnv:REAL-LEAD
Document Body    $120:    lnv:BODY-1

Table 1. Comparison of Contextual Markup in VISF vs. XML

Part of the migration to XML was a data design consistency effort that greatly reduced the number of possible document formats that the indexing software had to process. It is far easier to build control information for one news document Document Type Definition (DTD) than for thousands of individual news sources. This is not to say that segmentation was different on every VISF source, but rather that we had to be aware of the fact that variability was possible.

Another advantage was that the indexing team was able to define and "own" a set of elements to hold the enhancements that we add to the data. We created a DTD fragment that was to be included in every new DTD being developed in the company. We always know the elements that we are adding data to and are able to write tighter code to process those elements. In addition, the information regarding those elements is the same in every control file, simplifying the development and maintenance of those files.

Because of the data consistency efforts and because the indexing elements were fixed, educating the people who create the control files (and ultimately, the work that they do) became simpler. First of all, there were fewer control files to maintain. Since News documents have one DTD format, one master control file can be defined and all XML news source control files can be aliased to that master. Secondly, for VISF sources, while the people maintaining the control files generally know what types of segments are processed for a news document, they have to cross-reference another control file to get the specific segment number to be processed. Mistakes can and do happen. The following example is part of an entity indexing control file for a VISF source:

	#BASES:
	#BASE=P801
	
	#SEGMENTS:
	#DATE=20
	#HEADLINE=60
	#POUND=89
	#LEAD=119
	#BODY=120
	#CO=160
	#TS=166
	#COUNTRY=188	
	#ST=185
	#CITY=182
	#ORG=163
	#PEO=179
	#SUBJ=155
	#IND=171
	#AUDIT=215
	#PUBSUB=156
	#COPYRI=01
	#PUB=02
	#DATELINE=100
	#HIGHLT=105
	#GRAPH=142
	#SECTION=30
	#BYLN=90
	#VENDID2=161
	#VENDID3=164
	#VENDID4=167
	#VENDID5=172
	#VENDID6=177
	#VENDID7=180
	#VENDID8=183
	#VENDID9=186
	#VENDID10=189
	#VENDID11=191
	#VENDID12=55
	#VENDID13=151
	

This is fairly cryptic to the uninitiated user. There is not a whole lot of information in this file that gives you confidence that you are referencing the correct segment number for a given classification (e.g. #HEADLINE=60). Compare that with the corresponding section of the control file for an XML database:

	#BASES:
	#BASE=V216

	#ELEMENTS:
	#DATE=DATE0
	#HEADLINE=HEADLINE
	#LEAD=REAL-LEAD
	#BODY=BODY-1
	#CO=LN-CO
	#TS=LN-TS
	#COUNTRY=LN-COUNTRY
	#ST=LN-ST
	#CITY=LN-CITY
	#ORG=LN-ORG
	#PEO=LN-PERSON
	#SUBJ=LN-SUBJ
	#IND=LN-IND
	#AUDIT=AUDIT
	#POUND=SPEC-LIB
	#PUBSUB=PUB-SUBJECT
	#COPYRI=COPYRIGHT
	#PUB=PUB
	#DATELINE=DATELINE
	#HIGHLT=HIGHLIGHT
	#GRAPH=GRAPHIC
	#SECTION=SECTION
	#BYLN=BYLINE
	#VENDID2=PUB-COMPANY
	#VENDID3=PUB-ORGANIZATION
	#VENDID4=PUB-TICKER
	#VENDID5=PUB-INDUSTRY
	#VENDID6=PUB-PRODUCT
	#VENDID7=PUB-PERSON
	#VENDID8=PUB-CITY
	#VENDID9=PUB-STATE
	#VENDID10=PUB-COUNTRY
	#VENDID11=PUB-REGION
	#VENDID12=NAME
	

In this example, this section of the control file contains information about the XML elements to be processed by the indexing programs. Because we are now using XML element names, the cross-reference between an element type and its corresponding element in the document is a little clearer. One would be less likely to classify #HEADLINE=COPYRIGHT, for example. In the future, we plan to migrate these control files to an XML format that would capture the mapping between elements and their respective functions as they pertain to indexing.
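Both control file excerpts share the same simple shape: a section header ending in a colon, followed by KEY=VALUE entries. A minimal Python sketch of reading that shape is given here. It covers only what is visible in the excerpts; the `parse_control` name and any rules beyond the excerpts are assumptions.

```python
def parse_control(text):
    """Read '#SECTION:' headers and '#KEY=VALUE' entries into nested dicts."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("#"):
            continue                      # skip blank or unrelated lines
        body = line[1:]
        if body.endswith(":"):
            # A header such as "#ELEMENTS:" opens a new section.
            current = sections.setdefault(body[:-1], {})
        elif "=" in body and current is not None:
            key, value = body.split("=", 1)
            current[key] = value
    return sections

sample = """
	#BASES:
	#BASE=V216

	#ELEMENTS:
	#HEADLINE=HEADLINE
	#BYLN=BYLINE
"""
control = parse_control(sample)
print(control["ELEMENTS"]["BYLN"])
```

Because the same reader handles #SEGMENTS: and #ELEMENTS: sections alike, the indexing programs can look up a role such as HEADLINE without caring whether the answer is a segment number or an element name.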

While there are and will continue to be advantages to migrating the indexing systems to process XML, the SmartIndexing team faced several significant challenges in making their systems XML capable. The primary challenge was the sophistication and scale of their own systems. As previously mentioned, the SmartIndexing system uses sophisticated analytical software tools and algorithms with human intellectual input to index documents. These systems must be able to add index terms to tens of thousands of documents each day and pass them on to other systems to be made available on-line. Early on in the migration process, the decision was made to keep the existing systems and to "teach" them to read XML. These systems are primarily interested in the content of a document. For them, the markup primarily exists to direct them to the content that they need to process. At the time that the migration effort began, there was nothing like the standard XML Application Programming Interfaces (APIs) available for mainframe PL/1 programs to read and parse XML data (since then, IBM has released PL/1 Enterprise Edition, which includes a Simple API for XML (SAX)-like XML parser). The SmartIndexing systems used Lexical Scanner Grammars (much like Flex) to read VISF data and to return tokens and text to the calling program. The challenge was to write new grammars to break XML down into components that the indexing programs already understood. This reduced the impact of XML upon the existing code base. When this effort began, there was little or no test XML data on which to base the grammar, but we were lucky enough to come across a basic XML grammar that had been written by another department. As the XML data began to evolve, the grammars had to change with it. As the data became more stable, we were able to simplify the grammars somewhat, and by extension, the code.

Another significant challenge to migrating the SmartIndexing systems to XML came from the "depth" of markup in the XML. VISF, our legacy content markup, has a very flat structure; as I previously mentioned, we only recently introduced the concept of "SGML/XML Islands" in the VISF data and they are still not widely used. XML, on the other hand, has a fair degree of structural depth to it. Even with the current limitations placed upon our XML markup from various systems in our workflow, there is generally more markup that must be processed in order to get to the content of a given element. The model for processing VISF was that one segment ended when another segment began. That model did not work for XML. We had to modify our systems to track the major (i.e., child of root) elements and ignore or pass all other elements until we found the close of the major element. In some cases this could be done with the Lexical Scanner Grammar. In other processes it had to be done programmatically.
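The idea of tracking only the major elements can be sketched with a depth counter. The sketch below uses Python's standard xml.sax module instead of a PL/1 lexical scanner, so it illustrates the approach, not our implementation. The element names NEWSITEM, EMPH and P in the sample are invented for the example.

```python
import xml.sax

class MajorElementHandler(xml.sax.ContentHandler):
    """Collect the text of selected child-of-root elements, ignoring nested markup."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.depth = 0          # 1 = root element, 2 = major element
        self.current = None     # major element being collected, if any
        self.buffer = []
        self.content = {}

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth == 2 and name in self.wanted:
            self.current = name
            self.buffer = []

    def characters(self, data):
        # Text at any depth inside the major element belongs to it.
        if self.current is not None:
            self.buffer.append(data)

    def endElement(self, name):
        # Only the close of the major element itself ends collection.
        if self.depth == 2 and self.current is not None:
            self.content[self.current] = "".join(self.buffer)
            self.current = None
        self.depth -= 1

def extract(xml_text, wanted):
    handler = MajorElementHandler(wanted)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.content

sample = ("<NEWSITEM><HEADLINE>Superman nabs <EMPH>Lex Luthor</EMPH></HEADLINE>"
          "<BYLINE>Clark Kent</BYLINE>"
          "<BODY-1><P>First paragraph.</P><P>Second.</P></BODY-1></NEWSITEM>")
print(extract(sample, ["HEADLINE", "BODY-1"]))
```

The indexing code downstream still receives one block of text per major element, much as it did with VISF segments, while the nested markup is passed over.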

The next major challenge faced by the SmartIndexing department came when we had to insert new elements into the data. The standard model for adding segments to VISF data was to add any new or modified segments at the end of the data. The program that adds the elements is fairly sophisticated. It must determine which CVTs and codes are to be added to a document (the output of Term Mapping, Entity Indexing, NetOwl Indexing and Subject Indexing), check if those terms and codes already exist in the document, make sure that scores match if they do exist, and add new terms to existing elements if necessary before writing out the resultant document. As previously mentioned, when the migration effort first began, there was very little sample data to work with. When we had to add new elements, we would just put them in front of the closing root element. The first DTDs that were produced were very simple and imposed no restrictions on element ordering. The documents validated correctly and were able to be passed on to other systems to be made available in the LexisNexis on-line data warehouse. Later DTDs added rules for element ordering. A post-indexing process had to be written to essentially "sort" the XML documents into the element order required by their DTD.
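The post-indexing "sort" step can be sketched in Python with the standard ElementTree module. This is a simplified illustration, not the process we wrote: it assumes the DTD's content model for the root is a flat sequence that can be expressed as an ordered list of element names, and the `reorder_children` name and the DOC root element are hypothetical.

```python
import xml.etree.ElementTree as ET

def reorder_children(root, dtd_order):
    """Sort the children of root into the order given by dtd_order.

    Elements not named in dtd_order are moved to the end. The sort is
    stable, so repeated elements keep their relative order.
    """
    rank = {name: i for i, name in enumerate(dtd_order)}
    children = list(root)
    children.sort(key=lambda el: rank.get(el.tag, len(rank)))
    root[:] = children
    return root

# Index terms were appended in front of the closing root tag, so LN-CO
# ended up before HEADLINE's proper neighbours in this toy document.
doc = ET.fromstring(
    "<DOC><BODY-1>text</BODY-1><LN-CO>IBM</LN-CO><HEADLINE>h</HEADLINE></DOC>"
)
reorder_children(doc, ["HEADLINE", "BODY-1", "LN-CO"])
print([child.tag for child in doc])
```

A real DTD can of course contain choices, optional groups and repetition, so a complete solution needs the content model itself and not just a list of names.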

Another challenge of migrating to XML was coordinating efforts between various departments in the company that were also migrating their systems to process XML. Some departments were on the mainframe and had similar issues to those faced by the SmartIndexing group while other departments that were already on distributed or NT platforms had the XML APIs at their disposal. All of the departments involved in the migration had some pre-conceived notions about the data and about how it would be structured and processed. For some time, the DTDs were changing quite rapidly and it was difficult to coordinate code changes for systems that processed the data. It was especially important to keep everyone talking to each other and getting issues out in the open as soon as possible.

The final challenge in migrating our legacy systems to XML was dealing with Latin-1 encoding and looking ahead to being able to process data in Unicode. For many years, LexisNexis had been using its own internal code page for Extended Binary Coded Decimal Interchange Code (EBCDIC) data. All of the VISF data processed by the LexisNexis mainframe fabrication systems was oriented to this code page. Moving to the Latin-1 EBCDIC code page necessitated modifying the Lexical Scanner grammars to accept the full range of characters contained in CP1047 while still supporting U.S. English data (without any special characters) in CP037. We also have to deal with data encoding issues when transferring data from systems that use American Standard Code for Information Interchange (ASCII) to mainframe systems that use EBCDIC. Efforts are now underway to incorporate full Unicode support into these systems.
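The code page issue can be demonstrated with Python's built-in codecs. This sketch is only an illustration of the general problem, not of our conversion routines. To my knowledge the Python standard library ships cp037 and cp500 but not cp1047, so cp500 stands in here as a second Latin-1 EBCDIC code page; the function names are hypothetical.

```python
def latin1_to_ebcdic(data: bytes, codepage: str = "cp037") -> bytes:
    """Re-encode Latin-1 (or plain ASCII) bytes into an EBCDIC code page."""
    return data.decode("latin-1").encode(codepage)

def ebcdic_to_text(data: bytes, codepage: str = "cp037") -> str:
    """Decode EBCDIC bytes into a Unicode string."""
    return data.decode(codepage)

# Letters and digits land on the same bytes in both EBCDIC code pages ...
print(latin1_to_ebcdic(b"XML", "cp037") == latin1_to_ebcdic(b"XML", "cp500"))

# ... but some punctuation does not, so a scanner that hard-codes byte
# values for such characters must know which code page it is reading.
print(latin1_to_ebcdic(b"[", "cp037") == latin1_to_ebcdic(b"[", "cp500"))

# A round trip through the right code page preserves accented characters.
original = "Café".encode("latin-1")
print(ebcdic_to_text(latin1_to_ebcdic(original)))
```

Decoding with the wrong code page raises no error; it silently yields different characters, which is why these problems tend to surface late, in the content itself.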

6. Lessons Learned

If I had to bring away one lesson from migrating our systems to process XML, it would be the importance of communication and project management. When you are dealing with integrating XML across an entire enterprise, it is extremely important to keep everyone talking to each other as well as dealing with the interactions and dependencies across groups. I would recommend that one of the first tasks in starting up a migration project would be to implement an issues tracking and change control system that includes all interested parties.

The next lesson I would mention would be to make sure that the efforts to create the data are not completely separated from the efforts to process the data. This can be difficult because of funding, budgets, departmental boundaries and the like, but a good deal of animosity and "gotchas" can be prevented if the data owners are talking to the processing owners. Another lesson would be to have test data available as early as possible in the process. Even though XML is highly structured and there are standard tools to work with it, it is not enough to just think that if you have a DTD and a parser and someone who can program in Java that you are set. XML can be as simple or as complex as it needs to be and one model for processing the data is not enough.

Final lesson - Don't panic! XML is ultimately just data. If you are faced with retrofitting XML into an existing system and there are limited tools for dealing with XML on your hardware or software platform or if you have limited technical resources to recreate a system on an entirely new platform, there are solutions available to you. You know how to read, process and write data. Now you are faced with reading, processing and writing data that is just a little different. Since XML is highly structured, it is fairly easy to machine read. A good solution may be to modify existing processes and wait for the technology and resources that you need to become available.

Acknowledgements

I would like to thank Don Bergeron, Jill Sellers, Mark Shewhart and Brian Wisvari for reviewing this paper and helping to make it better. Special thanks go to Bob DuCharme for not only reviewing the paper, but also helping me with the gcapaper DTD. Finally, I would like to thank my wife Joan for her help with improving the grammar and general flow of the paper.

Glossary

API

Application Programming Interface

ASCII

American Standard Code for Information Interchange

CVT

Controlled Vocabulary Term

DTD

Document Type Definition

EBCDIC

Extended Binary Coded Decimal Interchange Code

HTML

HyperText Markup Language

NSI

Nexis Subject Indexing

SAX

Simple API for XML

SGML

Standard Generalized Markup Language

TMS

Term Mapping System

TTI

Term-Based Topical Indexing

VISF

Variable Input String Format

XML

eXtensible Markup Language

Biography

Ron Callahan was a technical lead for the Nexis Data Enhancements group for the first phase of XML Migration at LexisNexis. He has since assumed the role of project manager/system architect for the second phase of that project. He has been with LexisNexis since 1996. Ron lives in Cincinnati, Ohio with his wife Joan. He enjoys home brewing, playing golf and singing in his church choir.