Challenges and Rewards of Migrating an Electronic Publishing System to XML

Keywords: Break-Even Analysis, Business Process, Case Studies, Change Management, Content Management, Conversion, Database, Electronic Publishing, Enterprise Applications, Legacy Data Conversion, Legal Publishing, Metadata, ROI, Return on Investment

Ronald Callahan
Project Manager, XML Infrastructure Program
LexisNexis
Miamisburg
Ohio
United States of America
ronald.callahan@lexisnexis.com

Biography

Ron Callahan has been with LexisNexis since 1996. He was the technical lead for the Nexis Data Enhancements group for the first phase of the LexisNexis XML Infrastructure Program. He then assumed the role of technical lead and assistant project manager for the second and third phases of that program and was recently named project manager for the remainder of the project. Ron lives in Cincinnati, Ohio with his wife Joan and new baby Danny. He enjoys home brewing, playing golf and singing in his church choir.


Abstract


LexisNexis has been an online publisher of legal research data (Lexis) since 1973, and news and business data (Nexis) since 1980. LexisNexis' data warehouse contains over 3 billion documents, all of which have been full-text searchable since the inception of the company.

In 1999, we began an infrastructure project to migrate the fabrication, storage and delivery of our online data to XML. This project was begun with a purely strategic goal and no expectation of increased revenue. It should be noted that the project did not include the actual conversion of internal data to XML.

The effort to migrate to XML was complicated by several factors. First and foremost was the sheer overhead of our legacy systems. At the point that the migration project began, we had been delivering data marked up in our proprietary format from mainframe computer systems for over 25 years. Given the volume of data that we were potentially dealing with, and existing customer expectations for search and delivery performance, moving the data off the mainframe platform was not an option. This was further complicated by the fact that IBM mainframe systems offered little or no native XML support at the time that the migration began.

Another significant complicating factor to the migration effort was that the XML data needed to inter-operate seamlessly with our existing proprietary data. The customer would not and should not know the base format of the data that they were searching and viewing.

As the migration effort progressed, we were fortunate that our very experienced engineers were able to transfer their extensive knowledge to new platforms, operating systems and programming languages.

We are now in the third major phase of this infrastructure program. Through the achievements of this program and other projects, we are now equipped to receive any type of source data, convert it to XML, index and keyword it, perform offline and online editing, store it in our data warehouse and deliver it to customers.

In this paper, I will present a high-level view of the costs and benefits of migrating to XML and the challenges, rewards and lessons learned from this major effort.


Table of Contents


1. Introduction
2. About LexisNexis
     2.1 History of the Company
     2.2 Customers
     2.3 Products
          2.3.1 Legal, Tax and Regulatory Information
          2.3.2 Public Records
          2.3.3 News Information
          2.3.4 Business Information
     2.4 Delivery Methods and Brands
3. The Business Case
     3.1 Executive Summary
     3.2 Initial Project Scope
     3.3 Associated Risks
          3.3.1 External Risks
          3.3.2 Internal Risks
     3.4 Alternatives Considered
     3.5 Benefits and Value Drivers
4. The Project
     4.1 Phase 1
          4.1.1 Editorial Tools
          4.1.2 Pre-Update Systems
          4.1.3 Search and Delivery Systems
     4.2 Phases 2 and 3
          4.2.1 Data Conversion Tools
          4.2.2 Editorial Tools
          4.2.3 Pre-Update Systems
          4.2.4 Search and Delivery Systems
5. Benefits of XML and the XML Infrastructure Program
     5.1 Driving Product Preference
     5.2 Re-Engineering How We Work
     5.3 Making It Easier to do Business
6. Conclusions, Challenges and Lessons Learned
Acknowledgements

1. Introduction

LexisNexis is a global company headquartered in Dayton, Ohio and is a major publisher of legal, tax and regulatory information as well as public records, news information and business information.

In 1999, LexisNexis began an effort to migrate its legacy data storage and delivery systems to XML. XML offered many benefits and value drivers that made it very attractive to the company.

2. About LexisNexis

2.1 History of the Company

LexisNexis was founded in 1967 as Data Corporation and was acquired by the Mead Paper company in 1968, at which time the name of the company changed to Mead Data Central. In 1973, the company achieved a major breakthrough by introducing the world's first commercial full text search and retrieval engine. They also introduced the Lexis and NAARS (National Automated Accounting Research System) online services. These online services were among the first to employ a private telecommunications network, ensuring that customers in large cities had uninterrupted access to their services.

In 1980, we expanded our offerings by introducing NEXIS, a news and business information service. Some of the first sources were The Washington Post, Newsweek, The Economist, U.S. News and World Report, Dun's Review, and the Reuters and Associated Press news wires.

In 1994, Mead Data Central was acquired by Reed Elsevier and renamed LexisNexis after its core services.

2.2 Customers

LexisNexis has legal, corporate, government and academic customers all over the world. Our primary focus is on North America, the British Commonwealth, continental Europe and Latin America, although we have recently been making inroads into Asia.

2.3 Products

2.3.1 Legal, Tax and Regulatory Information

Our legal, tax and regulatory products comprise an extensive collection of federal, state, and chronological case law group files, as well as state and federal statutory materials and an extensive statute archive – in both codified and slip law form – from all 50 states, the District of Columbia, Puerto Rico, the Virgin Islands, and the United States Code Service. We also offer a wide variety of analytical legal material for every major legal practice area as well as the Shepard's Citations Service. Finally, we offer more than 600 law reviews and journals that can be used for secondary research.

2.3.2 Public Records

We offer a wide range of public records information and products that are commonly used to research parties, witnesses, judges, arbitrators and attorneys for the purposes of factual discovery, pre-suit analysis or trial research. These products include business and person locators, civil and criminal court filings, secretary of state records, liens, judgements and UCC (Uniform Commercial Code) filings, bankruptcy filings and professional licenses.

2.3.3 News Information

We offer a comprehensive collection of newspapers, magazines, industry newsletters, scientific and medical journals and transcripts in multiple languages that are commonly used by our customers for due diligence, fact finding and expanding their businesses. The breadth and depth of these offerings are extended by integration tools that allow customers to create made-to-order news and information feeds, user interfaces and dossier products.

As an example of the breadth of the sources that we provide, lists of the U.S. newspapers and the international newspapers that we offer are available online.

2.3.4 Business Information

LexisNexis offers a wide range of filings, business directories, market research and intellectual property products. The business filings and directories, in combination with our news and other information, are used to create our Company Dossier and Industry Dossier products. We also offer the EDGAR (Electronic Data Gathering, Analysis, and Retrieval) system for SEC (Securities and Exchange Commission) filings.

Our online services now include over 4.1 billion documents (nearly 25 terabytes) of source information stored in our Dayton, OH facility.

2.4 Delivery Methods and Brands

LexisNexis and its associated services are primarily accessed via the World Wide Web. Around two-thirds of our searches and our revenue come from our web products. That said, many customers still access our system via our private telecommunications network using our proprietary software. In addition to online products, many of our offerings are made available on CD-ROM and in hard copy print products.

Just a few of the many LexisNexis brands include Butterworths, Martindale-Hubbell, Mealey's, Michie, Shepard's, the National Fraud Center and Les Editions du Juris-Classeur.

3. The Business Case

3.1 Executive Summary

When LexisNexis first started looking into a migration to using XML to store and deliver our data, a business case was presented to Reed Elsevier senior management outlining the benefits and risks associated with moving to XML. The executive summary proposed that XML could replace and significantly improve upon LexisNexis' proprietary markup format. XML would allow deeper sharing of information across Reed Elsevier divisions and with external data providers and customers. This aligned well with the company's vision to be the indispensable partner to legal and professional customers for information-driven services and solutions.

XML would provide the following benefits and value drivers:

3.2 Initial Project Scope

The initial specification of the XML program was to address Data Fabrication, Data Preparation and Update and our Online Delivery Infrastructure.

Data Fabrication at LexisNexis primarily refers to data enhancement activities that cannot be done efficiently while the data is being converted. For the initial scope of the project, this mainly referred to the indexing and classification systems that operated largely on news data. This was the subject of my presentation at the 2002 XML Conference, which describes those changes in detail. Data Fabrication should not be confused with the actual conversion of data to XML. Adding a data conversion effort to this project would have made it unnecessarily unwieldy.

Data Preparation and Update refers to the batch manipulation, validation and preparation of the data before it is placed in a database to be delivered to our customers. LexisNexis is somewhat unique in our handling of XML in that we deal with batches of XML documents, not individual documents. These batches can consist of ten thousand documents or more. Since we receive, process and store thousands of documents a day from a diverse range of providers, this is a very efficient way to process the data. Some of the actions that can take place in data fabrication are moving, renaming or adding an element, stamping document metadata and validation. Update primarily refers to processes that keyword the data and prepare it for storage in a LexisNexis database for delivery to our customers.
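The batch editing actions listed above (renaming or adding an element, stamping metadata and validating) can be sketched roughly as follows. This is an illustrative Python sketch only, not the LexisNexis implementation: the `<batch>`/`<doc>` layout, the element names and the function names are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical batch layout: one <batch> root holding many <doc> elements.
BATCH = (
    '<batch>'
    '<doc id="1"><hdline>First</hdline><body>Text one.</body></doc>'
    '<doc id="2"><hdline>Second</hdline><body>Text two.</body></doc>'
    '</batch>'
)

def rename_element(doc, old_tag, new_tag):
    """Rename every element called old_tag inside one document."""
    for elem in list(doc.iter(old_tag)):
        elem.tag = new_tag

def add_element(doc, tag, text):
    """Append a new child element to the document."""
    child = ET.SubElement(doc, tag)
    child.text = text

def stamp_metadata(doc, name, value):
    """Stamp a metadata value onto the document as an attribute."""
    doc.set(name, value)

def validate(doc, required_tags):
    """Return the required child elements that are missing."""
    return [tag for tag in required_tags if doc.find(tag) is None]

def edit_batch(batch_xml):
    """Apply the same edit commands to every document in the batch."""
    batch = ET.fromstring(batch_xml)
    for doc in batch.findall("doc"):
        rename_element(doc, "hdline", "headline")
        add_element(doc, "language", "en")
        stamp_metadata(doc, "source", "example-wire")
        missing = validate(doc, ["headline", "body"])
        if missing:
            raise ValueError(f"doc {doc.get('id')} is missing {missing}")
    return batch
```

Applying one list of commands to every document in a batch is what makes batch-level processing attractive at high volumes: the commands are defined once and the per-document cost is small.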

The Online Delivery Infrastructure consists of the systems used to store XML data, search the data, retrieve it and format it for delivery to our customers, regardless of how they are accessing the system. LexisNexis stores our data in proprietary databases that are well suited to our full text search and retrieval system. The retrieval and delivery systems will be addressed in more detail later in this paper.

3.3 Associated Risks

3.3.1 External Risks

Some of the external risks identified were:

3.3.2 Internal Risks

Internal risks were also defined. Some of the internal risks were:

3.4 Alternatives Considered

One of the alternatives considered was to do nothing and stay with our existing markup. This alternative was eliminated fairly quickly since it required continued reliance on old technology and did nothing to address underlying system constraints that were holding us back. It also did not fit with the company's mission and goals.

Another alternative was to fund changes to the infrastructure as part of various customer-facing projects. There were a couple of advantages to this alternative. First, funding infrastructure changes in this manner hid the cost of evolving our system. In addition, individual customer-facing projects drive more tangible direct benefits. The downsides were that the customer-facing projects were burdened with indirectly related core system changes that could potentially affect the delivery date of a system. Also, this model would cripple the overall completion of the infrastructure work. Because individual projects were subject to their own justification, management and approval cycles, there was an increased possibility that those projects would only partly complete the needed work.

The final alternative considered, and the one we chose, was to fund the infrastructure changes as one program. It should be noted that there were a couple of downsides to this choice. First of all, this reduced the ability of executive management to decide which program elements had the highest benefits. It also created a huge program that would be difficult to manage. The primary advantage of this choice was that it allowed for one justification, management process and approval cycle. It also allowed for a well-balanced evolution of our critical online and fabrication infrastructure.

3.5 Benefits and Value Drivers

XML Benefits and Value Drivers Mapped to the Company Agenda

Company agenda columns: Create Product Preference; Make It Easier To Do Business; Re-Engineer The Way We Work; Leverage LN/RE Strengths

Benefit or Value Driver | Agenda columns marked (X)
Rich Formatting and Display of Content | X X
Customization of Information for Specific Customer Niches | X X
Better Intra/Inter-Document Navigation | X X
Easier Data Exchange Between Applications | X X X X
Emerging Standards for Specific Data Types | X X
Emerging Intra-Company and Industry Specific Publishing Standards | X X X X
Improved Consistency Through Dynamic Resolution of Different Name Tags | X X X
Improved Precision and Recall With Finer-Grained Queries | X X
More Powerful, Value-Added Editorial and Validation Tools from Third Party Vendors | X X

Table 1

4. The Project

4.1 Phase 1

As previously mentioned, the scope of the program was to make changes to our Editorial, Pre-Update, Search and Delivery Systems in order to build the capability to fabricate, store, search and retrieve native XML documents.

Phase 1 of the XML program generally achieved all of its planned objectives, but did end up taking longer than anticipated.

4.1.1 Editorial Tools

One of the first editorial systems that we created was a set of tools to build and manage the style sheets used in the early stages of document retrieval and formatting. Another system was a set of tools to allow XML data to be retrieved, edited and reloaded to our online databases. Finally, we revamped components that automatically create editorial annotations for documents to work with XML.

4.1.2 Pre-Update Systems

The Pre-Update Systems were some of the most troublesome systems to modify to handle XML. These were, by and large, legacy mainframe systems coded in PL/1 and were ill-equipped to deal with XML.

The first of the Pre-Update Systems migrated was our fabrication pipeline. At a high level, the fabrication pipeline system mainly controls document workflow, but there were specific systems tied to it that were written only for our legacy data format. One of the major components of the pipeline was our offline editing system. The offline editing system, as previously mentioned, is used to perform automated editing functions on large batches of data. We decided to use Java implementations of SAX (Simple API for XML) and DOM (Document Object Model) on the mainframe to achieve the same type of functionality that we had for our legacy data. Because we were one of the early adopters of Java on IBM mainframes, we went through many trials and tribulations to get this system to provide an acceptable level of performance. Further changes to this system will be addressed later in this paper.
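To give a feel for SAX-style batch editing, here is a minimal sketch of a streaming filter that renames an element while the data passes through, so that a large batch never has to be held in memory as a tree. The system described above was written in Java; this sketch uses Python's standard `xml.sax` module instead, and the class name, function name and element names are assumptions made for the example.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

class RenameFilter(XMLFilterBase):
    """SAX filter that renames one element and passes everything else through."""

    def __init__(self, parent, old_name, new_name):
        super().__init__(parent)
        self._old = old_name
        self._new = new_name

    def _map(self, name):
        return self._new if name == self._old else name

    def startElement(self, name, attrs):
        super().startElement(self._map(name), attrs)

    def endElement(self, name):
        super().endElement(self._map(name))

def rename_streaming(xml_text, old_name, new_name):
    """Stream xml_text through the filter and return the edited XML."""
    out = io.StringIO()
    reader = RenameFilter(xml.sax.make_parser(), old_name, new_name)
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    reader.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()

# Hypothetical two-document batch used to exercise the filter.
SAMPLE = (
    "<batch>"
    "<doc><hdline>First</hdline></doc>"
    "<doc><hdline>Second</hdline></doc>"
    "</batch>"
)
```

A DOM-based command would instead load a document into a tree and edit it in place, which is more convenient for edits that need context but costs more memory per document.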

Our indexing and classification processes were also modified to work with XML. These, too, were mainframe-based PL/1 systems. The indexing and classification systems took a different approach to dealing with the XML. Rather than go with a Java implementation like the Offline Editor, we decided to use existing lexical scanning systems to process the data. Since the systems process document content, and not markup, it was fairly easy to write new lexical scanners for XML and to pass the corresponding data and tokens to the existing PL/1 programs. This implementation took longer than expected, but the net result was that nearly identical indexing results were achieved for the same document regardless of whether it was marked up in our legacy format or in XML.
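The idea of putting a format-specific lexical scanner in front of a shared, content-only indexer can be illustrated with a small sketch. Everything here is hypothetical: the `$NN:` segment markers are an invented stand-in for the proprietary legacy markup, the XML scanner ignores entities and CDATA for brevity, and the word-count "indexer" merely stands in for the existing PL/1 programs.

```python
import re
from collections import Counter

XML_TAG = re.compile(r"<[^>]+>")      # simplistic: no entities, comments or CDATA
LEGACY_MARK = re.compile(r"\$\d+:")   # invented stand-in for legacy markup
WORD = re.compile(r"[A-Za-z0-9']+")

def scan_xml(text):
    """Yield content tokens from XML, skipping the markup."""
    for chunk in XML_TAG.split(text):
        yield from WORD.findall(chunk)

def scan_legacy(text):
    """Yield content tokens from the invented legacy format."""
    for chunk in LEGACY_MARK.split(text):
        yield from WORD.findall(chunk)

def index_terms(tokens):
    """Shared indexer: sees only content tokens, never markup."""
    return Counter(token.lower() for token in tokens)

XML_DOC = (
    "<doc><headline>Court rules on merger</headline>"
    "<body>The court approved the merger.</body></doc>"
)
LEGACY_DOC = "$10:Court rules on merger $20:The court approved the merger."
```

Because the indexer only ever receives tokens, the two scanners can be swapped freely, and the same document yields the same index terms in either markup.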

4.1.3 Search and Delivery Systems

One of the most significant modifications to our online system was the creation of a common retrieval engine that allowed both legacy and XML data to be retrieved in a single search answer set and to be passed to delivery systems as XML. The resulting system is probably the world's only XSL 1.0 compatible style sheet engine written in mainframe Assembler language.

Another feature of the retrieval system is its ability to do an "on-the-fly" translation from our legacy data format to a basic XML format. This allows us to use a series of style sheet transformations to prepare the data for delivery to our customers. Other changes to the Search and Delivery systems addressed performance improvements for document retrieval and processing.
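The retrieval flow just described (translate legacy data to basic XML on the fly, then run every document through the same series of transformations) might be sketched as below. This is a loose illustration under several assumptions: the `$NAME:text` legacy syntax is invented, plain Python functions stand in for the XSL style sheets because the Python standard library has no XSLT processor, and all names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

SEGMENT = re.compile(r"\$(\w+):([^$]*)")  # invented legacy "$NAME:text" syntax

def legacy_to_basic_xml(legacy_text):
    """On-the-fly translation: each legacy segment becomes a generic <seg>."""
    doc = ET.Element("doc")
    for name, text in SEGMENT.findall(legacy_text):
        seg = ET.SubElement(doc, "seg", name=name)
        seg.text = text.strip()
    return doc

def promote_segments(doc):
    """Stage 1: turn <seg name="X"> into <x>; a no-op for native XML."""
    for seg in list(doc.iter("seg")):
        seg.tag = seg.attrib.pop("name").lower()
    return doc

def add_display_title(doc):
    """Stage 2: copy the headline into a title attribute for display."""
    headline = doc.find("headline")
    if headline is not None:
        doc.set("title", headline.text)
    return doc

STAGES = [promote_segments, add_display_title]

def retrieve(document, is_legacy):
    """Return the document as XML text after the shared transformation chain."""
    doc = legacy_to_basic_xml(document) if is_legacy else ET.fromstring(document)
    for stage in STAGES:
        doc = stage(doc)
    return ET.tostring(doc, encoding="unicode")

LEGACY = "$HEADLINE:Court rules on merger $BODY:The court approved the merger."
NATIVE = (
    "<doc><headline>Court rules on merger</headline>"
    "<body>The court approved the merger.</body></doc>"
)
```

Once the legacy document has been lifted into basic XML, every later stage is format-agnostic, which is what lets a single answer set mix legacy and XML documents.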

4.2 Phases 2 and 3

The original plans for Phases 2 and 3 of the program were to address collection and conversion issues for caselaw data, enhancements to store, search and deliver CALS tables, improvements to intra- and inter-document linking and retirement of old retrieval engines. Our focus for Phase 2 of the project and beyond was to build the capability to handle any data acquired prospectively in XML. We also realized that rather than starting all new initiatives, we needed to build upon some of the changes that had been started in Phase 1 of the program to allow customer-facing data products to start to migrate to XML.

4.2.1 Data Conversion Tools

As previously mentioned, the project was not to do any actual conversion of data to XML. We were, however, directed to integrate XML functionality into many of the utilities used by our conversion teams. This included, but was not limited to, tools to retrieve data from our online system, split batches of data, extract documents from a batch of data, sort documents in a batch, count documents and characters in a batch, compare documents and to stamp metadata into documents.
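A few of the batch utilities named above (counting documents and characters, splitting a batch and sorting its documents) can be sketched in a few lines. This is only an illustration: the `<batch>`/`<doc>` layout and the function names are assumptions, and the sketch assumes that a batch contains nothing but `<doc>` children.

```python
import xml.etree.ElementTree as ET

def count_batch(batch):
    """Return (number of documents, number of text characters) in a batch."""
    docs = batch.findall("doc")
    chars = sum(len("".join(doc.itertext())) for doc in docs)
    return len(docs), chars

def split_batch(batch, size):
    """Split a batch into smaller batches of at most `size` documents."""
    docs = batch.findall("doc")
    pieces = []
    for start in range(0, len(docs), size):
        piece = ET.Element(batch.tag, batch.attrib)
        piece.extend(docs[start:start + size])
        pieces.append(piece)
    return pieces

def sort_batch(batch, key_tag):
    """Sort the documents of a batch in place by the text of one child element."""
    batch[:] = sorted(batch.findall("doc"), key=lambda d: d.findtext(key_tag, ""))

# Hypothetical three-document batch.
BATCH = ET.fromstring(
    "<batch>"
    "<doc><headline>Beta</headline><body>two</body></doc>"
    "<doc><headline>Alpha</headline><body>one</body></doc>"
    "<doc><headline>Gamma</headline><body>three</body></doc>"
    "</batch>"
)
```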

Another system that we built during Phases 2 and 3 was a generic XML gateway. This gateway provided an automated portal to accept LN-compliant XML data from vendors and other LexisNexis global business units for processing to our online system. This system was especially valuable as it enabled LexisNexis companies from all around the world to start building data and getting it into our databases.

4.2.2 Editorial Tools

As more of our data began to migrate to XML, we started to build up a number of DTDs (Document Type Definitions) and DTD fragments. Because we were trying to maintain as much data consistency as possible, we saw the need to build a system to manage DTDs. This provided a one-stop resource to research existing DTDs and to look at common DTD fragments that could be used in building new DTDs. The DTD management system also provided the beginnings of a DTD catalog system that could be used by our internal systems to address DTDs for document manipulation and validation.

Phase 2 of the project also included upgrades to the style sheet management tool that we built in the first phase of the project. These upgrades primarily addressed the ability of the tool to test style sheets against live online data before the style sheets were put into production. Server performance was also greatly improved. Phase 3 of the project has seen this tool move into a sophisticated system that allows building of style sheets independent of any metadata required for our online system. This is valuable to our stylesheet writers since it allows them to use any off-the-shelf tool to edit style sheets while still ensuring that the style sheets will work in our online system. Any metadata necessary for the style sheet to work with our online system is bound to it when it is built. Finally, the style sheet is validated before it is promoted to production.

4.2.3 Pre-Update Systems

Phases 2 and 3 of the project saw a great deal of evolution to our Pre-Update Systems. The first major change was to migrate the offline editing systems to a more "Java-friendly" platform. Testing the performance of the system on various platforms was greatly facilitated by the portability of Java. Migrating the system from the mainframe to a Solaris test bed took several weeks, but the majority of that time was spent on integrity testing. Moving to a Linux/Intel platform from Solaris was achieved in about a week. The offline editing system now runs on a cluster of multi-processor Intel machines running the Linux operating system. This configuration provides an acceptable level of performance for our current data flow and the cost of computing, compared to the mainframe system, is vastly lower.

The migration of the offline editing system to Linux has driven other Pre-Update Systems to migrate there. A separate project migrated the indexing and classification systems to Java and they now run on a Linux platform. Currently, we are exploring how to call the Java-based indexing systems from the mainframe so that processing for our legacy data can use the same code base.

Another change to the Pre-Update Systems is the integration of our Citation tools into the offline editor. The Citation tools now can be executed much like any other batch editing command.

4.2.4 Search and Delivery Systems

One of the major initiatives for our Search and Delivery Systems was to bring our retrieval engine up to full XSL (eXtensible Stylesheet Language) 1.0 compliance to support modular style sheet creation. These changes, in conjunction with the enhancements to our style sheet management system, greatly enhanced our abilities to do flexible document delivery.

Another major initiative in Phase 2 of the project was to upgrade the keywording and storage systems to be compliant with the Latin-1 character set. This gave us the ability to handle a broader range of data from global business units. In fact, these changes affected the entire document fabrication, storage and delivery path. Since data gets passed between several different hardware platforms and operating systems as it is built, stored and delivered, we had to take great pains to ensure that we did not "mangle" the data because of differences in default encoding schemes.
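The kind of mangling described above is easy to reproduce. The sketch below assumes, purely for illustration, that one platform defaults to Latin-1 and another to an EBCDIC code page (Python's `cp037` codec); the paper does not say which encodings were actually involved. Bytes read with the wrong default decode without any error but yield different text, while an explicit transcoding step survives the round trip.

```python
TEXT = "café déjà vu"  # sample text containing Latin-1 accented characters

def transcode(data, source_encoding, target_encoding):
    """Convert bytes between encodings by naming both sides explicitly."""
    return data.decode(source_encoding).encode(target_encoding)

# Bytes as written by a (hypothetical) Latin-1 platform.
latin1_bytes = TEXT.encode("latin-1")

# Wrong: a platform that assumes its own default silently changes the text.
mangled = latin1_bytes.decode("cp037")

# Right: transcode explicitly at each platform boundary, in both directions.
ebcdic_bytes = transcode(latin1_bytes, "latin-1", "cp037")
round_trip = transcode(ebcdic_bytes, "cp037", "latin-1")
```

The practical rule this illustrates is that no hand-off between platforms should rely on a default encoding; each boundary names the source and target encodings explicitly.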

Our Update, Search and Delivery systems also underwent changes to build, search and deliver CALS tables. The ability to build and store tables consistent with the CALS format greatly enhanced the appearance and functionality of financial data as well as other data sources that are heavily dependent on tables.

Finally, the later phases of the project saw the beginnings of a complete re-engineering of our proprietary database structure to better handle richly-tagged XML data. This effort addressed, among other things, the size of the document that could be stored and keyworded, enhanced the keywording of numeric data, added the ability to search a specific instance of a repeating element, changed how we deal with "noise words" and improved the searchability of elements below the child of root level by using XPath-based queries.
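Two of the search capabilities mentioned above, targeting a specific instance of a repeating element and reaching elements nested below the children of the root, can be illustrated with path-based queries. This sketch uses the limited XPath subset supported by Python's `xml.etree.ElementTree`; it says nothing about how the proprietary database implements such queries, and the element names are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical case law document with a repeating <opinion> element.
DOC = ET.fromstring(
    "<case>"
    "<opinion><para>First paragraph of the majority.</para>"
    "<para>Second paragraph of the majority.</para></opinion>"
    "<opinion><para>First paragraph of the dissent.</para></opinion>"
    "</case>"
)

def search(doc, path, term):
    """Return the text of elements matching `path` that contain `term`."""
    needle = term.lower()
    return [
        elem.text
        for elem in doc.findall(path)
        if elem.text and needle in elem.text.lower()
    ]

# Every paragraph of every opinion (elements two levels below the root).
all_hits = search(DOC, "./opinion/para", "paragraph")

# Only paragraphs inside the second instance of the repeating element.
second_opinion_hits = search(DOC, "./opinion[2]/para", "paragraph")
```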

5. Benefits of XML and the XML Infrastructure Program

Once we began to migrate many of our editorial systems to XML and began to deliver XML from our databases, it seemed that we started to hear everyone in the company talking about it. XML is now the basis for many of our metadata management systems and is also used to provide document workflow management.

5.1 Driving Product Preference

XML has helped LexisNexis to drive product preference. The data consistency efforts that took place under the XML program have allowed for "crisper" searching in Caselaw data. The structured nature of XML has also made it easier to add new markup that enhances Caselaw data for our customers.

The links for our Caselaw Headnotes are dynamically created from the underlying XML and are always up-to-date. The customer never has to deal with broken links. Also, a name change for this product was easily accomplished with a simple stylesheet change. Changing the product name in our legacy data would have required months of data reloading.

5.2 Re-Engineering How We Work

Fabricating and storing our data in XML, the expertise gained from working with third party tools, and the ability to build reusable systems have brought us some of the greatest benefits. Some of the most dramatic examples of this are the creation of a new user interface for our LexisOne product and another for a high-profile client. Tools and expertise developed in the XML Program are credited with getting these to market ten times faster than previously possible.

Similarly, the time for bringing a new source online has been cut in half. The engineering work for a new source can be done five to six times faster and the design work takes half of the time it used to. The creation of an XML editorial master for our citations metadata has been credited with saving 75 FTEs a year at our Shepard's division.

In 2003, LexisNexis won the contract to be the official reporter for the State of California. The California Reporter product would not have been delivered without the markup and editing capabilities provided by XML. We later won a contract to be the official reporter for the State of Georgia. Due to reuse, the Georgia Reporter was basically a "freebie" as far as engineering work was concerned.

Finally, a re-engineered Table of Contents format using XML and links was done for less money and in less time than would have been possible with our legacy data format.

5.3 Making It Easier to do Business

One of the most significant ways that XML has made it easier for us to do business is the acquisition of data from our vendors. If they can provide us with data in a known document type, such as the NewsML format, we already have a strong knowledge base for that data type and it is quite easy to convert it for publishing to our customers. If the vendor's data is in an unknown XML data type, it is quite simple for us to write a stylesheet to convert it to the format that we need, or even to provide them with our DTD and have them convert it for us.

XML makes it easier for us to automatically proof the quality and completeness of our data. Editors are able to write business rules that can be applied against the data in our offline editing system. The offline editing system also allows for bulk editing on a specific set of elements in the data. Overall, XML increases our quality and lowers the cost of operations.
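As a rough illustration of editor-written business rules being applied to a batch, the sketch below expresses each rule as a small function and reports the documents that fail. The rule names, element names and batch layout are assumptions made for the example; the paper does not describe how rules are actually written in the offline editing system.

```python
import xml.etree.ElementTree as ET

def require_text(tag):
    """Rule: the document must contain a non-empty <tag> child."""
    def rule(doc):
        text = doc.findtext(tag)
        if text is None or not text.strip():
            return f"missing or empty <{tag}>"
        return None
    return rule

def max_length(tag, limit):
    """Rule: the text of <tag> must not exceed `limit` characters."""
    def rule(doc):
        text = doc.findtext(tag, "")
        if len(text) > limit:
            return f"<{tag}> longer than {limit} characters"
        return None
    return rule

RULES = [require_text("headline"), require_text("body"), max_length("headline", 80)]

def proof_batch(batch, rules):
    """Return {document id: [problems]} for every document that fails a rule."""
    problems = {}
    for doc in batch.findall("doc"):
        found = [msg for msg in (rule(doc) for rule in rules) if msg]
        if found:
            problems[doc.get("id")] = found
    return problems

# Hypothetical batch: document 2 has an empty body.
BATCH = ET.fromstring(
    "<batch>"
    '<doc id="1"><headline>Court rules on merger</headline><body>Text.</body></doc>'
    '<doc id="2"><headline>Markets close higher</headline><body></body></doc>'
    "</batch>"
)
```

Keeping each rule as an independent function means that new checks can be added to the list without touching the code that walks the batch.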

6. Conclusions, Challenges and Lessons Learned

While a migration to XML has provided great benefits to LexisNexis, the process has not been easy. I don't think that, at the inception of the program, anyone would have been able to say where we would end up. It's safe to say that we probably did not achieve our initial goals, but also to say that we did an amazing amount of work. It could also be said that the migration to XML has helped us to weather a difficult economy, since it has helped to make our operations more efficient.

If I had to pick a reason why we ended up doing more breadth than depth, it would have to be the complexity and inter-connectedness of our systems. Because of the size and sophistication of our system and products, it is very difficult for one person, or even a few people, to account for all of the dependencies. This "waterfall effect" of data must definitely be accounted for.

Frequent and clear communication, along with a dedicated program management team, are definitely keys to success. It is extremely important to keep everyone talking to each other. There is even value in talking to those who are not necessarily on the project. More "brown-bag" sessions to educate people across the company about the project, and about XML in general, would have alleviated some of the problems with missed dependencies and misconceptions about XML that we had to battle.

As mentioned earlier in this document, a conscious decision was made to separate the creation of data in XML from the tools to convert, fabricate, store, search and deliver the data. While the business decision for this made sense, the reality was that the decision made it very difficult to test the infrastructure components either individually or as a system. Individual system owners were, at first, forced to create their own test data. There were several instances throughout the project where systems were coded to XML that was based on incorrect assumptions. Later on, a relatively large set of test data was created, but even that was not compliant with any document types that we would later create. Once the separate conversion effort started, many systems had to be remediated to work with the "real" data. If we had the opportunity to do it all over again, I think that we would still keep conversion and infrastructure separate, but that we would likely start the conversion project about a year earlier.

As I just mentioned, some systems had to be remediated once we started dealing with "real" data. While that is true, I feel that the creation of a new "pipe" for processing the data helped to ensure system and architectural adherence. Remediation and rework would have been far greater had this not existed.

I should also mention the contribution of our data architecture group to the success of the project. The data architecture group was probably formed later than it was really needed, but once it ramped up, it provided great benefits. The data architecture group was focused on setting standards and defining general DTDs and DTD fragments for use by conversion and development groups. They were also chartered with developing company policies that best served the company's migration to XML.

When I asked some of our key engineers about the migration to XML and what worked well and what could have been done better, I received some good feedback. As far as what worked well, I think that one of the key comments was that having a program manager to champion technical requirements was valuable to the success of the program. It was also mentioned that a high level understanding of our architectural objectives across our functional areas was helpful. Our issue tracking and change control management systems were cited as being helpful for avoiding redundancy and determining ownership. Some of the things mentioned by the engineers that they thought were troublesome were "open-ended" objectives that made it difficult to convey progress to functional managers and project leads, management of system capacity due to the increased storage and processing needs of XML and the seemingly constant flux of third-party software and managing how various packages interact with each other. Our "bad, dirty and ugly" data was mentioned as a headache to try and deliver as XML. It was also mentioned that, for an operation of our size, performance tuning is a significant and ongoing effort.

To conclude, changing a legacy system consisting of components that were created five, ten and even thirty years ago is a major undertaking. Return on investment on this effort will likely take several more years to come to fruition, but we are already seeing benefits because XML allows us to offer significant new and improved functionality to our customers.

Acknowledgements

I would like to acknowledge the assistance of my wife Joan with helping to improve the grammar and flow of this paper. I would also like to thank Ron Kelly for researching all of the benefits that LexisNexis has seen from XML. Finally, I want to thank Chet Ensign, Chris Weiler, Steve Iddings and all of the project technical leads for providing valuable feedback and lessons learned.
