Large-Scale Publishing: Public Records and News on Westlaw

Keywords: application architecture, business process, case studies, conversion, legacy data conversion, legal publishing, metadata, publishing, repository, Unicode, XML, XSLT

Daniel Dodge
Data Architect
Thomson West
Eagan
Minnesota
United States of America
daniel.dodge@thomson.com

Biography

With a background of fourteen years working with structured documents, Daniel Dodge is a lead data architect on a large-scale XML project at Thomson West. He started with SGML at Unisys Corporation as a technical writer leading a team that developed a production system for publishing technical documentation in print and online media. This project also resulted in two patents being issued for new technology. Now he leads projects implementing XML technology for repurposing information for legal publishing. He is president of the Midwest SGML/XML Forum, a user group that meets regularly in the Minneapolis/St. Paul area to discuss and learn about XML.


Abstract


Public records and news information on Westlaw is now being published using XML and is fully integrated with the statutes, caselaw materials, and other legal resources. This case study shows how the publishing capabilities have been developed.


Table of Contents


1. Introduction
2. Product Features
     2.1 Features of Public Records Databases
     2.2 Features of News Databases
3. Building the System to Publish the Products
     3.1 XML Architecture
     3.2 Publishing Architecture
     3.3 Display Architecture
     3.4 Team Organization
4. Publishing the Products
     4.1 Publishing Public Records
     4.2 Publishing the News
          4.2.1 Migrating the News Databases
          4.2.2 Publishing New or Updated News Content
          4.2.3 Switching Customers to the New System
5. Lessons Learned
6. Summary
Appendix 1. About Thomson West

1. Introduction

The online information available on Westlaw is being expanded with additional types of information. A large set of public records, business information, and news supplements the legal information and is extensively cross-referenced via hyperlinks to make it easy to conduct research quickly. Two recent migration projects used XML for the data format for the online search repository. The Public Records project resulted in a set of publishing system and product development processes for dealing with large quantities of acquired content from many sources. The Business Information and News project ("News") resulted in a publishing process for migrating a very large-scale legacy mainframe dataset to an XML online search repository. This case study shows how the projects started with basic publishing capabilities to convert acquired fixed-field content into XML for additional content on Westlaw, and have since evolved into a system that can easily add features while migrating content to XML form.

2. Product Features

2.1 Features of Public Records Databases

There are a large number of databases in the Public Records set, from "Adverse Filings" to "Watercraft", totaling over a billion documents.

The Public Records databases support the standard features on Westlaw, including searching, printing, and online display. "Clipping" is also available to customers: by subscribing to the clipping service for a given database, the customer receives an email when new documents have been published in that database.

"Aircraft Records" is a fairly typical Public Records database. It contains records of aircraft ownership, including the type of aircraft, registration number, engine type, seating, and so on. This is an example of a typical document:

pr-aircraft.png

Figure 1: Aircraft Record

A more sophisticated set of Public Records databases contains documents for court dockets. A court docket logs the activity and events of a case in court. This is an example of a court docket for a recent case:

ms-docket1.png

Figure 2: Sample Docket

There are several features to note in this sample document. One is the addition of links to other databases that are not part of the Public Records set. The link (shown as item A in Figure 2) containing the name of the judge for the case (Miriam Goldman Cedarbaum) connects to the Profiler database. This is what the profile for that judge looks like:

ms-docket1-judge.png

Figure 3: Profiler Record for Link from Judge Name

Links are added to the dockets for the judges, attorneys, and expert witnesses contained in the Profiler database. This makes it easy to find biographical information about the people involved in the proceedings of the case.

There are also links to other related documents, such as briefs filed with the court. These are listed in the display for the docket under the heading "Briefs and Other Related Documents" shown as item A in Figure 4:

ms-docket1-briefs.png

Figure 4: List of Briefs

Selecting a link (such as "2004 WL 813892" shown as item B) displays the related document. This makes it easy to read documents related to the case. All the links are added during the publishing process to make the documents more useful for the legal researcher.
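The linking step can be pictured with a small sketch. It is only an illustration under stated assumptions: the element names (docket, judge, link), the target attribute, and the in-memory PROFILER_IDS lookup are hypothetical and are not taken from the Westlaw data models, and the production system relies on proprietary tools for linking.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup from a person's name to a Profiler record ID.
PROFILER_IDS = {"Miriam Goldman Cedarbaum": "profiler-0001"}


def add_profile_links(docket_xml: str) -> str:
    """Wrap known judge names in a <link> element pointing at a profile."""
    root = ET.fromstring(docket_xml)
    # Copy to a list so the tree can be modified while looping.
    for judge in list(root.iter("judge")):
        name = (judge.text or "").strip()
        target = PROFILER_IDS.get(name)
        if target is None:
            continue  # no profile known: leave the name as plain text
        judge.text = None
        link = ET.SubElement(judge, "link", {"target": target})
        link.text = name
    return ET.tostring(root, encoding="unicode")


sample = (
    "<docket>"
    "<judge>Miriam Goldman Cedarbaum</judge>"
    "<judge>Unknown Person</judge>"
    "</docket>"
)
linked = add_profile_links(sample)
```

In this sketch a name without a matching profile is left untouched, so the document stays valid even when the lookup is incomplete.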

2.2 Features of News Databases

The Business Information and News content ("News" in this paper) on Westlaw covers more than 8,000 sources, including the Wall Street Journal, newswires (such as Dow Jones, Associated Press, and Reuters), business magazines, company profiles, and many more.

The News databases support the standard features on Westlaw. The "clipping" feature is also supported in the News databases.

The most recent features of News required a reload of the data. One of these is a "Deduplication" feature. It identifies duplicate documents in the search results and provides the option to group the duplicate information or remove it altogether. Another new feature is the "More Like This/More Like Selected Text" feature. This enables the customer to find documents that are like a selected document, or like the selected text within that document. Both features improve searching capabilities and reduce the amount of time spent eliminating duplicate documents while viewing. Another significant new feature is "Case Sensitive" searching. The user can specify all caps (ALLCAPS), some caps (CAPS), and no caps (NOCAPS) in news for improved search precision.
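To make the three case options concrete, here is a minimal sketch of how a word might be classified. The function name and the exact rules are assumptions for illustration: in particular, "some caps" is read here as "at least one capital letter but not all", which the paper does not spell out.

```python
def caps_class(word: str) -> str:
    """Classify a word by its capitalization (assumed rules, see lead-in)."""
    letters = [ch for ch in word if ch.isalpha()]
    if letters and all(ch.isupper() for ch in letters):
        return "ALLCAPS"
    if any(ch.isupper() for ch in letters):
        return "CAPS"
    return "NOCAPS"


# The same letters fall into three different classes.
classes = [caps_class(w) for w in ("AIDS", "Aids", "aids")]
```

A distinction of this kind is what lets a search for an acronym avoid matching the ordinary lowercase word.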

An example of a query screen in Westlaw below shows the "Identify duplicate documents" checkbox selected (item A in Figure 5) for this new feature:

dedupe1.png

Figure 5: Query to Identify Duplicate Documents

This search also shows use of the "paragraph" proximity connector ("/p", shown as item B in Figure 5), specifying the search to locate the word "settlement" in the same paragraph as "million".
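The effect of the paragraph connector can be sketched as follows. This is an illustration only, not the Westlaw search engine: it assumes that paragraphs are tagged as p elements (a hypothetical name) and uses simple case-insensitive whole-word matching.

```python
import re
import xml.etree.ElementTree as ET


def same_paragraph(doc_xml: str, first: str, second: str) -> bool:
    """Return True if both words occur within a single <p> element."""
    root = ET.fromstring(doc_xml)
    for para in root.iter("p"):
        text = "".join(para.itertext()).lower()
        words = set(re.findall(r"\w+", text))
        if first.lower() in words and second.lower() in words:
            return True
    return False


doc = (
    "<article>"
    "<p>The parties reached a settlement of $5 million.</p>"
    "<p>A hearing is scheduled for next month.</p>"
    "</article>"
)
```

With this sample, "settlement" and "million" match because they share a paragraph, while "settlement" and "hearing" do not.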

This search result for duplicate documents looks like this:

dedupe2.png

Figure 6: Results for Duplicate Documents

The news article listed as #7 shows the headline (A), the name of the publication it appeared in (B), the publication date (C), and portions of the document where the search words were found (D).

The number of duplicate documents is shown below the document information (E). In this example, 5 duplicate documents were found.

Selecting the document (listed as number 7) displays this information:

dedupe3.png

Figure 7: Selecting "More Like This"

After the document is displayed, the customer can select the "More Like This" or "More Like Selected Text" links (A) to expand or refine the search.

The search results are processed at run-time on the XML data retrieved from the new display repository. This makes the newest content accessible with these features immediately after it is published and loaded to the display repository. Additional indexes are not needed. The XML data model has elements that tag the pieces of data required for these features.
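A minimal sketch of run-time duplicate grouping follows. The paper does not describe the actual matching criteria, so this example assumes that two documents are duplicates when their text is identical after lowercasing and collapsing whitespace; the function names and sample data are hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(results):
    """Group (doc_id, text) pairs; the first document in each group leads."""
    groups = defaultdict(list)
    for doc_id, text in results:
        groups[fingerprint(text)].append(doc_id)
    # Dictionaries keep insertion order, so groups follow result order.
    return [(ids[0], ids[1:]) for ids in groups.values()]


results = [
    ("doc-1", "Settlement reached for $5 million."),
    ("doc-2", "settlement reached   for $5 million."),
    ("doc-3", "Court schedules new hearing."),
]
grouped = group_duplicates(results)
```

Because the grouping works only on the retrieved result set, it needs no separate index, which is consistent with the run-time approach described above.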

3. Building the System to Publish the Products

3.1 XML Architecture

A department of data architects developed the XML architecture. Members of this team had real-world experience both within the company and from projects they had accomplished before joining Thomson West. This experience was key to establishing a consistent framework for the XML data models.

Common models are used across all of the product data models. A common model is an XML element that has a specific meaning and is used in several or all product data models. At the time the first Public Records projects were started, there was not enough information available to know what elements would be common across different data models. However, as each project was completed, common features were identified and revisions made to the data models. The data architects paid special attention to common requirements so the collective set of data models could evolve, in an iterative method, to their final form. Whenever the data for a product could be republished, the updated data model was used to check that the revisions to the publishing process were done. Using common data models allows the publication staff to build up familiarity with the XML element and attribute names, making it easier for them to develop new products.

Two different types of models were produced for Public Records: Interchange and Product. The Interchange model represents the data in XML form in basically the same order as acquired from the data supplier. The data is transformed into this form as early as possible in the publishing process because subsequent stages can then take advantage of common XML transformation tools, such as XSLT processors. The final form of the data is represented in the Product data model. This provides a target form for the publishing process and ensures, through well-formedness and validation checks, that the product contains all the required data components.
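The kind of check that a Product model makes possible can be sketched briefly. This is a simplified stand-in rather than the actual validation: it tests well-formedness with the Python standard library and then looks for a list of required elements, whereas a real pipeline would validate against the full data model. The element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical required components for an aircraft record.
REQUIRED = ("registration-number", "aircraft-type")


def check_product(doc_xml: str) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    try:
        root = ET.fromstring(doc_xml)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]
    return [
        f"missing <{tag}>"
        for tag in REQUIRED
        if next(root.iter(tag), None) is None
    ]


good = (
    "<aircraft>"
    "<registration-number>N12345</registration-number>"
    "<aircraft-type>Fixed wing</aircraft-type>"
    "</aircraft>"
)
incomplete = "<aircraft><aircraft-type>Fixed wing</aircraft-type></aircraft>"
```

Returning a list of problems instead of raising on the first one lets a publisher see everything that is missing from a document in a single pass.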

The News project had a significant advantage at the beginning of the project because an XML data model had already been developed at Thomson Legal & Regulatory (a division of Thomson Corporation). Early efforts to interchange XML data among different companies in the legal and regulatory group led to the development of a news interchange model. This provided a base model to use for developing a product model. Using this will also enable any future efforts to interchange the data to be much more efficient and effective because data transformation will be minimal.

3.2 Publishing Architecture

The function of the publishing system is to convert the acquired data into XML, add value with linking and other transformations, and load it to the new display environment.

The data for Public Records databases is acquired from the data supplier. A variety of technologies are used to transform the data into product form. Some are proprietary for registering metadata and interacting with other publishing repositories for linking. XSLT is used extensively for straightforward XML transformations.

The new features of the News databases required revising the data and therefore reloading it to the display repository. The prospect of reloading 140 million documents raised concern about how long it would take to run them through the legacy publishing system. Estimates predicted that republishing the data would take many months, which was longer than desired. One part of the problem was the reliance on a mainframe with basically a single-thread capability. Drawing on the experience gained with publishing the Public Records data, a team recommended switching the entire system over to XML. This would enable the new publishing features to be built in the environment that was desired in the long run anyway. The advantages of making the new features available to customers early, and of investing only once in development time, were strong reasons to take the risk with a new publishing system. This meant extracting the data from the legacy repository, transforming it to XML, and loading it to the new display repository, a path that was projected to take less time than the legacy system.

The task of publishing the News databases uses two approaches:

The retrospective process takes a "snapshot" of content from the legacy repository, converts it to XML, and loads it to the new display repository. After all the tests were done, it only needed to be run once in full production mode to migrate the data.

The prospective process must keep the News databases current. It uses most of the same steps as the retrospective process, but runs with smaller batches of data acquired regularly from the suppliers.
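The difference between a single-thread run and a multi-stream run can be illustrated with a small sketch. It assumes a hypothetical record layout and XML form, and it uses a thread pool on one machine purely to show the batching pattern; the production system distributes the work in ways the paper does not detail.

```python
from concurrent.futures import ThreadPoolExecutor
from xml.sax.saxutils import escape, quoteattr


def to_xml(record: dict) -> str:
    """Convert one legacy record to a hypothetical XML form."""
    return "<doc id={}><headline>{}</headline></doc>".format(
        quoteattr(str(record["id"])), escape(record["headline"])
    )


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def convert_batch(batch):
    return [to_xml(record) for record in batch]


def migrate(records, batch_size=2, streams=4):
    """Convert batches on several worker streams, keeping input order."""
    with ThreadPoolExecutor(max_workers=streams) as pool:
        converted = pool.map(convert_batch, batches(records, batch_size))
        return [doc for batch in converted for doc in batch]


records = [{"id": i, "headline": f"Story {i} & more"} for i in range(5)]
migrated = migrate(records)
```

The same functions serve both approaches in this sketch: a retrospective run would pass the whole snapshot once, while a prospective run would call migrate repeatedly with the small batches that arrive from suppliers.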

3.3 Display Architecture

The new XML display repository maintains content and search indexes. It has been developed to provide the same customer functionality that is already on Westlaw but can also be extended for additional functions. The system has the additional advantage of being highly scalable. This is necessary because the amount of data being integrated within the Westlaw system is being increased to provide customers with more supporting research information, including hypertext links between different products.

Using XML stored in the display repository, software renders the product data in the form for customers to view. The business rules for the necessary operations of billing, configuring access, tracking royalties, and other tasks are separate from the XML product data. This has been a large advantage for reusing the product data in different types of products, especially for "multi-base" products where a set of databases is integrated virtually with an encompassing product design.

3.4 Team Organization

When the Public Records project started, there were many new systems being developed at the same time. Significant work had already been done on the features of the new display repository, but not much experience had been gained with the rigors of daily high-volume production publishing. A product team was formed for each different product being developed. This greatly improved interaction between system developers and content publishers. Each product team met on a regular basis, sometimes daily, but at least weekly, to work together to figure out how to solve the issues that were being found.

The production staff did not initially have much experience with XML. However, a data architect was involved with every product team to provide that experience. Over time, the staff developed new processes and took XML training.

Now there are established processes for each aspect of publishing the data, so the process resembles a factory production line, where computing and staff resources are carefully assigned based on the predictable development cycles for new products and the loading schedule for updating current products with new data and additional features.

When the News project was started, team members were carefully chosen for their advanced skills and experience as leaders, developers, process designers, and other roles necessary to build a more sophisticated publishing system. It was an ambitious project: it involved reengineering a publishing path and migrating a very large-scale dataset, and it was driven by many business reasons to make the new features available.

A strong business and technology partnership also contributed to the success of the News project. As issues arose, the options for dealing with them were weighed as combined business and technology risks, opportunities, and challenges. With strong program management, the path from beginning to end was a predictable one. Critical decisions were made as quickly as possible early in the project. Changes to requirements were also managed well. These factors provided development staff with the maximum amount of time to build the system, and allowed for an additional testing cycle to assure the business that the product would have the quality demanded by the customers.

4. Publishing the Products

4.1 Publishing Public Records

The dataset is obtained from a supplier or retrieved periodically from the original source. Many different formats are used for the data used to create Public Records databases. One very common format is a fixed-fielded format. This is processed to create XML in two stages, "interchange" and "product". After testing the publishing process in a quality-assurance environment, the full dataset is published and loaded to the display environment.
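The first of the two stages, turning a fixed-field record into interchange XML, can be sketched as follows. The column layout, the element names, and the sample record are all hypothetical; actual supplier layouts differ from database to database.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: (element name, start column, end column).
LAYOUT = [
    ("registration-number", 0, 8),
    ("aircraft-type", 8, 28),
    ("seats", 28, 31),
]


def record_to_interchange(line: str) -> ET.Element:
    """Build an interchange element, keeping fields in supplier order."""
    record = ET.Element("record")
    for name, start, end in LAYOUT:
        ET.SubElement(record, name).text = line[start:end].strip()
    return record


# One fixed-width input line assembled to match the layout above.
line = "N12345".ljust(8) + "Fixed wing single".ljust(20) + "004"
xml_text = ET.tostring(record_to_interchange(line), encoding="unicode")
```

The sketch deliberately leaves values such as "004" untouched, in keeping with the idea that the interchange form mirrors the supplier data and that normalization belongs to the later product stage, for example in an XSLT transformation.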

4.2 Publishing the News

The operations of publishing the News data fell into two main phases: migrating existing content, and keeping it current with new or updated content. The process for keeping it current resembles the processes that keep some of the Public Records databases current. Both run on a schedule that starts a workflow to obtain the data from the source and then automatically publish and load it to the new display repository.

4.2.1 Migrating the News Databases

The publishing staff set up a workflow to publish and load all the legacy news content. They had already been using the system during testing to help find problems and perform "stress-testing" of the process in end-to-end testing. Starting the full-scale migration was simply the task of starting that workflow.

By the time the retrospective publishing process was completed, the advantages of the distributed and multi-stream publishing architecture were apparent. Instead of taking many months to load, the entire process took less than a month. Using the new system took less than 10% of the time that had been estimated for using the legacy system, resulting in significant cost savings. This also meant the new features were available to customers months ahead of the initial plan.

4.2.2 Publishing New or Updated News Content

The prospective process runs automatically in cycles to obtain news from the suppliers and publish and load it to the new display repository. Depending on the priority of the news information, it can be available on Westlaw within 15 minutes of being available from the source.

4.2.3 Switching Customers to the New System

On May 26, 2004, the Westlaw system was configured to allow customers to sign on to the new system. The legacy system was still maintained in case there was a need to switch back to it. That capability has not been used because the new system has worked very well.

The new display architecture also provides an additional benefit for customers of the News databases. Response time has been cut almost in half, meaning they can run more complex searches and locate documents more quickly than before the migration.

5. Lessons Learned

The Public Records projects provided valuable results to the company. One immediate result was the addition of useful databases to the Westlaw system. This information supports the research being done by practitioners in the legal market. The process of developing those databases also gave both the developers of the publishing systems and the publication staff the incentive to develop new skills.

The longer-term result of the Public Records projects paid off later. The successful migration of News from the legacy data format and delivery platform to the XML-based delivery platform was only possible because of all the work that had been done to establish a significant volume and selection of Public Records databases.

6. Summary

Although there were many challenges to publishing using XML, many new opportunities for streamlining the publishing process have been found. New products can be developed in less time and at lower cost. It has also been easier to reengineer existing products to add new features. Perhaps the biggest result has been a growing awareness of the possibilities of structured data, along with the expertise gained in using it in the normal product delivery cycle.

Appendix 1. About Thomson West

Headquartered in Eagan, Minn., West is the foremost provider of integrated information solutions to the U.S. legal market. West is a business within The Thomson Corporation (NYSE: TOC; TSX: TOC) and was formed when West Publishing and Thomson Legal Publishing merged in June 1996. For more information, please visit the West Web site at west.thomson.com.
