Creating an XML-Based Scholarly Research Tool: Challenges and Methodology
ABSTRACT
In my presentation, I will describe the process of creating an electronic research tool, based on a print scholarly index, for use in a scholarly, academic environment and discuss some of the issues and challenges that came up during the development of this project as well as our solutions to these challenges. The project is being developed by Indiana University's Digital Library Program, with the support of a grant from the U.S. Department of Education. Our project involves digitizing a twenty-year run (1956-1975) of the Letopis' Zhurnal'nykh Statei, a Soviet-era scholarly index, which indexes articles from over 1700 scholarly and popular journals, covering virtually all fields of human knowledge in both the humanities and sciences.
The steps in the project include:
-
Creating about 250,000 digital page images from the original printed edition of the index.
-
Performing Optical Character Recognition on the page images to produce editable electronic text.
-
DTD development.
-
Programmatically and manually marking up, proofreading, and validating the text.
-
Developing a Web-based interface to search, browse, and display the electronic Letopis' Zhurnal'nykh Statei.
The Letopis' Zhurnal'nykh Statei project presents a combination of technical requirements that are likely to arise in research and academic electronic publishing and that may also affect many others working with XML. These issues include:
-
Very Large and Complex XML Documents and DTDs. Academic and research XML documents, such as full-text collections of literary works or large dictionaries and indices like the Letopis' Zhurnal'nykh Statei, have the potential to grow very large, from hundreds of megabytes to over a gigabyte. Similarly, many of the common digital library DTDs, such as the Text Encoding Initiative (TEI) and Encoded Archival Description (EAD) DTDs, are relatively large and complex. We have found that, although these are valid DTDs, not all available tools (editors, XML processors, repositories, etc.) are capable of handling documents and DTDs of such size and complexity.
-
Unicode Support. The language of the Letopis' Zhurnal'nykh Statei is primarily Russian and uses the Cyrillic alphabet. However, there are other languages, such as English and Greek, represented in the citations, and so additional alphabets are also required. Such a combination of languages and alphabets is common in academic discourse. The Unicode standard solves many of the problems associated with mixing the various character sets in a single electronic document, and so reliable support for the Unicode standard is required in all tools used in the development of this and other digital library projects.
-
Fast, Flexible, and Sophisticated Search and Indexing Tools. Excellent search tools are required to make large collections of research data such as the Letopis' Zhurnal'nykh Statei useful. Fast keyword searching of the full text is required, as well as more intelligent searching that takes advantage of the XML structure and markup.
In my presentation, I will discuss the stages of project development, how we addressed the issues described above, and the tools we evaluated and implemented. I will also demonstrate a prototype of the Web-based interface to the electronic Letopis' Zhurnal'nykh Statei.
1. Introduction
This paper will discuss the challenges encountered and the methodology and tools used in the Indiana University Digital Library Program's Letopis' Zhurnal'nykh Statei project.
1.1. About the project
The following introductory paragraph from the project's Web site (http://www.dlib.indiana.edu/collections/letopis/letopismain.html) provides a brief overview of the Letopis' Zhurnal'nykh Statei (LZhS) and the project:
The Indiana University Digital Library Program has received a United States Department of Education Title VI Technology Program grant to digitize and offer on the World Wide Web a twenty-year portion of the Letopis' Zhurnal'nykh Statei (1956-1975), a serial publication that indexes Soviet periodicals from 1926 to the present. It covers more than 1,700 journals, series, and continuing publications of academies, universities, and research institutes in the fields of humanities, natural sciences, and the social sciences, and it also covers the popular periodical literature. Once digitized and made available on the World Wide Web, it will provide access to the periodical literature for an essential time in modern Russian history, beginning with the period of the Khrushchev "Thaw" following the twentieth Communist Party of the Soviet Union (CPSU) Congress and continuing through the first half of the so-called Brezhnev "Period of Stagnation." Virtually any student or scholar studying Russian political science, literature, or history between 1956-1975 will find Letopis' Zhurnal'nykh Statei to be an invaluable resource.
1.2. Motivation behind the project
There are two main motivations for undertaking this digitization project. The first motivation is to increase the usability of the index. The current print LZhS is very difficult to use. It was published as weekly issues, and there is no cumulative index, so it is very cumbersome and time-consuming to wade through the many individual issues of the index. A digital version will allow keyword searching throughout the entire twenty-year run that we are digitizing and greatly increase the ease of use and utility of the index.
The second major motivation behind this project is the need to preserve the index. The print version was printed on very high-acid paper that over time has become extremely brittle and is deteriorating rapidly. The digitized version will help preserve this valuable scholarly resource. In addition to the XML version of the index, we will be archiving the digitized page images.
1.3. Structure of the LZhS
The structure of the LZhS DTD is based upon the structure of the print index. In the print index, the citations are organized under more than 300 subject headings. The subjects are arranged in a three-level hierarchy with thirty-three top-level subjects, subdivided into second- and third-level subjects. The top-level subject headings are identified with uppercase Roman numerals, the second-level subjects with Arabic numerals, and the third-level subjects with lowercase Cyrillic letters. Below are some sample subject headings from the hierarchy:
-
I. Марксизм-ленинизм [Marxism-Leninism.]
-
1. Произведения основоположников марксизма-ленинизма. [Works of the Founders of Marxism-Leninism.]
-
2. Литература об основоположниках марксизма-ленинизма. Работы по марксизму-ленинизму. [Literature about the Founders of Marxism-Leninism. Works on Marxism-Leninism.]
-
-
XIII. Государство и право [The State and Law.]
-
1. Общие вопросы. Теория и история государства и права. [General Questions. Theory and History of the State and Law.]
-
2. Государство и право СССР. [The State and Law in the USSR.]
-
а. Общие вопросы. Советское строительство. Государственное и административное право. [General Questions. Soviet State-Building. State and Administrative Law.]
-
б. Другие отрасли советского права. [Other Branches of Soviet Law.]
-
в. Суд и прокуратура. Работа органов юстиции. [The Court and The Office of Public Prosecutor. Work of the Organs of Justice.]
-
-
1.4. The LZhS DTD
Below is the DTD we are using to encode the LZhS. It includes a letopis element, which can contain multiple volumes. There are also container elements for individual volumes and issues. Three division elements (div1, div2, and div3) are used to contain the subject divisions within the three-level subject hierarchy. And the cit element contains an individual citation. Further information about these and other elements and attributes may be found in the comments in the DTD listed below:
<!-- ...................................................................... -->
<!-- Letopis Russian Index XML DTD ........................................ -->
<!-- version: 2.4.xml2001 ................................................. -->
<!-- author: John A. Walsh ................................................ -->
<!-- Copyright 2000 The Trustees of Indiana University .................... -->
<!-- ...................................................................... -->
<!-- LETOPIS -->
<!-- The letopis element can be a root element -->
<!-- to contain metadata (in letopisHeader) -->
<!-- and multiple volumes -->
<!ELEMENT letopis (letopisHeader, volume+) >
<!-- LETOPISHEADER -->
<!-- Content model will be fleshed out later. -->
<!-- Will be modeled on TEI header and -->
<!-- contain metadata about the project and -->
<!-- the text. -->
<!ELEMENT letopisHeader (#PCDATA) >
<!-- VOLUME -->
<!-- Used to contain a single volume of the -->
<!-- letopis, which includes fifty-two weekly -->
<!-- issues. -->
<!ELEMENT volume (issue+) >
<!-- VOLUME/@ID -->
<!-- Use the format "vXX", where XX is the year -->
<!-- of the volume. -->
<!ATTLIST volume id ID #REQUIRED >
<!-- ISSUE -->
<!-- Used to contain a single weekly issue of -->
<!-- the index. -->
<!ELEMENT issue (div1+, back) >
<!-- ISSUE/@ID -->
<!-- Use the format "vXXiYY", where XX is the -->
<!-- year of the volume and YY is the -->
<!-- week (01-52) of the issue. -->
<!ATTLIST issue id ID #REQUIRED >
<!-- DIV1 -->
<!-- Contains a top level subject in the -->
<!-- subject hierarchy. -->
<!ELEMENT div1 (head, ref*, (div2+ | (cit | citXref)+)) >
<!-- DIV1|DIV2|DIV3/@N -->
<!-- Short name or number for DIV1 subject -->
<!-- division -->
<!-- DIV1|DIV2|DIV3/@ID -->
<!-- Unique ID for each subject division. Use -->
<!-- the format "vXXiYYsAABBCCDD", where XX is -->
<!-- the year of the volume, YY is the -->
<!-- week (01-52) of the issue, AA is the -->
<!-- the subject number for the first level of -->
<!-- hierarchy (DIV1), BB is the subject number -->
<!-- for second level of the hierarchy (DIV2), -->
<!-- CC is subject number for the third level -->
<!-- of the subject hierarchy (DIV3), and DD is -->
<!-- the subject number for the fourth level -->
<!-- (DIV, a named but non-standard subdivision -->
<!-- that is not part of the standard hierarchy).-->
<!ATTLIST div1 n CDATA #IMPLIED
id ID #REQUIRED >
<!-- DIV2 -->
<!-- Contains a second level subject in the -->
<!-- subject hierarchy. -->
<!ELEMENT div2 (head, ref*, ((div3| cit | citXref | div)+)) >
<!ATTLIST div2 n CDATA #IMPLIED
id ID #REQUIRED >
<!-- DIV3 -->
<!-- Contains a third level subject in the -->
<!-- subject hierarchy. -->
<!ELEMENT div3 (head, ref*, (cit | citXref | div)+) >
<!ATTLIST div3 n CDATA #IMPLIED
id ID #REQUIRED >
<!-- DIV -->
<!-- Contains a named, but non-standard subject -->
<!-- subdivision that is not part of the -->
<!-- standard subject hierarchy. -->
<!ELEMENT div (head, ref*, (cit | citXref)+) >
<!ATTLIST div n CDATA #IMPLIED >
<!-- HEAD -->
<!-- Contains a subject heading. Used in DIV1, -->
<!-- DIV2, DIV3, and DIV. -->
<!ELEMENT head (#PCDATA) >
<!-- CIT -->
<!-- Contains a single citation from the index. -->
<!ELEMENT cit (#PCDATA | year)* >
<!-- CIT/@ID -->
<!-- Contains a unique ID for each citation. -->
<!-- Use the format "vXXiYYcNNNNN", where XX is -->
<!-- the volume year, YY is the week (1-52) of -->
<!-- the issue, and NNNNN is the citation -->
<!-- number (citations are numbered -->
<!-- sequentially, starting at 1 at the -->
<!-- beginning of each year.) -->
<!ATTLIST cit id ID #REQUIRED >
<!-- XREF -->
<!-- Cross-reference. The attribute "target" -->
<!-- is used to point to the ID of the item -->
<!-- being referenced. Used after subject -->
<!-- headings as a "see also" type reference -->
<!-- to similar subjects and within citXref -->
<!-- elements to point to the full citation -->
<!-- referenced by the abbreviated citXref -->
<!-- citation. -->
<!ELEMENT xref (#PCDATA) >
<!ATTLIST xref id ID #IMPLIED
target IDREF #REQUIRED >
<!-- YEAR -->
<!-- Contains year of publication within an -->
<!-- individual citation. -->
<!ELEMENT year (#PCDATA) >
<!-- CITXREF -->
<!-- Contains a special class of unnumbered -->
<!-- abbreviated citation that points -->
<!-- to a full citation located elsewhere -->
<!-- in the index. -->
<!ELEMENT citXref (#PCDATA | xref)* >
<!-- BACK -->
<!-- Contains back matter located at the end of -->
<!-- each issue. The content model for "back" -->
<!-- will be fleshed out later with more -->
<!-- structure. -->
<!ELEMENT back (#PCDATA) >
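The ID formats documented in the DTD comments can be composed and parsed mechanically. The following Python sketch is illustrative only: the function names are hypothetical, the use of 00 for unused hierarchy levels is an assumption, and mapping the two-digit year to 19XX relies on the 1956-1975 run described in this paper.

```python
import re

# Citation IDs follow "vXXiYYcNNNNN"; the digit count of the citation
# number is left open here because it is only described in a comment.
CIT_ID = re.compile(r"^v(\d{2})i(\d{2})c(\d+)$")


def division_id(year: int, week: int, div1: int,
                div2: int = 0, div3: int = 0, div: int = 0) -> str:
    """Compose a division ID in the "vXXiYYsAABBCCDD" format.

    ASSUMPTION: levels that do not apply are written as 00.
    """
    if not 1 <= week <= 52:
        raise ValueError(f"week out of range: {week}")
    return (f"v{year % 100:02d}i{week:02d}"
            f"s{div1:02d}{div2:02d}{div3:02d}{div:02d}")


def parse_citation_id(cit_id: str) -> dict:
    """Split a citation ID into volume year, week, and citation number."""
    match = CIT_ID.match(cit_id)
    if match is None:
        raise ValueError(f"not a citation ID: {cit_id!r}")
    year, week, number = match.groups()
    # ASSUMPTION: all volumes fall in the twentieth century (1956-1975 run).
    return {"year": 1900 + int(year), "week": int(week), "number": int(number)}


print(division_id(1956, 1, 13, 2, 1))
print(parse_citation_id("v56i01c000003"))
```

Because the volume, issue, and subject position are all recoverable from the ID alone, a search interface could narrow results by string prefix without walking the document tree.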
2. Digitization Process
The following list outlines the steps in the digitization process:
-
Unbound issues are shipped to a third-party vendor for digitization. The individual pages are scanned and saved as 600 dots per inch (dpi) bi-tonal TIFF images.
-
Once digital page images are received back from the vendor, they are processed using FineReader Optical Character Recognition (OCR) software from ABBYY (http://www.abbyy.com/). ABBYY is a Russian company, and their FineReader product is the only OCR product we are aware of that is able to recognize Russian/Cyrillic text. The recognized texts produced by the OCR process are saved as UTF-8 Unicode files.
-
The files containing individual issues go through a first phase of manual markup in which the subject divisions are enclosed in their respective div1, div2, and div3 element tags. The required division id attributes are ignored during this phase.
-
The files containing individual issues are then processed by an in-house developed Java program that is able to do the vast bulk of the markup. The Java program, dubbed LMU for "Letopis MarkUp," does the following:
-
Performs pre-tagging processing to correct common OCR errors and formats the text to increase the efficiency and reliability of the markup processing.
-
Inserts the correct id attribute values (based upon the volume year, issue number and the given subject's location in the subject hierarchy) into all div1, div2, and div3 elements.
-
Tags the thousands of individual citations with the appropriate cit element tag, including the correct cit/@id attribute value.
-
Within each cit element, tags the year of publication with the appropriate year element tag.
-
-
The files containing individual issues then go through a final phase of manual editing and proofreading in which validation and other errors are corrected.
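The citation-tagging step performed by LMU can be pictured with a small sketch. The actual program is written in Java and its logic is not shown in this paper, so the Python below is only an illustration under stated assumptions: the function name is hypothetical, the input is assumed to be one already-isolated citation, and the year is assumed to be a four-digit 19XX number immediately followed by a comma and the "№" sign, as in the sample citation in section 3.1.1.

```python
import re
from xml.sax.saxutils import escape

# ASSUMPTION: the publication year precedes ", №" in a citation.
YEAR = re.compile(r"\b(19\d{2})(?=,\s*№)")


def tag_citation(text: str, volume_year: int, week: int, number: int) -> str:
    """Wrap one plain-text citation in <cit> and tag its first year."""
    # Five-digit citation number, following the "vXXiYYcNNNNN" DTD comment.
    cit_id = f"v{volume_year % 100:02d}i{week:02d}c{number:05d}"
    # Escape &, <, > first so that OCR text cannot break the markup.
    body = escape(text.strip())
    body = YEAR.sub(r"<year>\1</year>", body, count=1)
    return f'<cit id="{cit_id}">{body}</cit>'


raw = "Жаров, А. Величие и простота. Новый мир, 1955, № 12, с. 232-233."
print(tag_citation(raw, 1956, 1, 3))
```

Citations that do not match the assumed pattern are left without a year element, which is the kind of case the final manual proofreading phase would need to catch.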
3. Challenges
3.1. Too Much Data, Too Little Time
Certainly one of the biggest challenges we face in completing this project is simply dealing with the vast amount of data we are attempting to digitize. Again, our aim is to digitize and encode a twenty-year run of the LZhS. The twenty-year run includes 1040 weekly issues, totaling over 250,000 pages and including more than three million individual citations. The creation of the digital page images from the original printed page source and the subsequent OCR processing went relatively smoothly and painlessly and were completed during the first year of the grant (October 1999 - September 2000). The labor-intensive task of tagging and proofreading the text is a much more painful and time-consuming process.
Much of the markup and error correction can be done programmatically, but there remains a great deal of markup and proofreading that can only be done manually by trained human beings with a strong Russian-language background. Since our project is based at a major Big Ten research university with a strong Slavic Studies department and many international students, we have a reasonable supply of Russian-speaking individuals whom we can hire and train to do much of the manual tagging and proofreading, but it remains to be seen if we will meet our goal and complete the twenty-year run during the three-year grant period. Already, we have been forced to make compromises in order to increase productivity.
3.1.1. DTD Compromises
One of the compromises we were forced to make to increase productivity was a fairly radical simplification of the DTD, especially in the content model of the cit (citation) element. We originally planned to use a number of other child elements within the cit element. These included a contributor element with a type attribute for authors, editors, illustrators, etc.; a title element with a type attribute for the article and journal titles; an enumeration element to enclose the volume, issue, and page numbers of the cited article; and a notes element for miscellaneous information in the citation. Below are "before" and "after" examples of the same citation tagged using the original, more complex DTD and the newer, simplified DTD.
"Before" example using original DTD:
<cit id="v56i01c000003">
<contributor type="author">Жаров, А.</contributor>
<title>Величие и простота.</title>
<notes>[К выпуску изд-вом «Молодая гвардия» сборника «Воспоминания о В. И. Ленине»].</notes>
<srcTitle>Новый мир,</srcTitle>
<enumeration><year>1955</year>, № 12, с. 232-233.</enumeration>
</cit>
"After" example using current, simplified DTD:
<cit id="v56i01c000003">Жаров, А. Величие и простота.
[К выпуску изд-вом «Молодая гвардия» сборника
«Воспоминания о В. И. Ленине»].
Новый мир, <year>1955</year>, № 12, с. 232-233.
</cit>
Unfortunately, due to the irregular and sometimes inconsistent formatting of the citations, it was not possible to tag programmatically the many child elements from the original DTD, and it simply took far too much time to tag them all manually. Thus we were forced to eliminate them from the DTD and settled on the current cit content model of "(#PCDATA | year)*". The year element was not something we could eliminate because it will be necessary to qualify searches by date range. Since a single yearly volume of the LZhS may index articles from up to four years prior to the publication of the LZhS volume, we cannot rely on the year of the LZhS volume to accurately restrict searches by date range. In addition, we can fairly reliably tag most of the "year" elements with our Java processing and tagging application.
These compromises, of course, cost us a fair amount in terms of the functionality we will be able to offer our users. For instance, we lose the ability to do "author" or "title" searches. The author and title data is still accessible through regular keyword queries, but this will not be as precise or intuitive a way of getting at the author and title data. We retain keyword searching throughout the entire index and within individual citations, and we will further be able to restrict searches by date range and/or subject. Thus, we will still be able to provide our users with a very powerful research tool that will be vastly more usable and accessible than the print index, but it will not, at least initially, offer all the functionality we originally envisioned. Of course, the possibility always exists that at some future date we could mark up the individual citations in much greater detail.
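The role of the year element in date-range searching can be shown with a short sketch. This is a Python illustration using the standard library's ElementTree rather than the project's search engine; the function name is hypothetical, and the sample fragment is made up for the example (the first citation is adapted from the one above, the second is a placeholder).

```python
import xml.etree.ElementTree as ET

# Made-up fragment using the simplified cit content model "(#PCDATA | year)*".
SAMPLE = """<div1 n="I" id="v56i01s01000000">
<head>Марксизм-ленинизм</head>
<cit id="v56i01c00003">Жаров, А. Величие и простота. Новый мир, <year>1955</year>, № 12, с. 232-233.</cit>
<cit id="v56i01c00004">Placeholder citation, <year>1956</year>, № 1.</cit>
</div1>"""


def citations_in_range(root: ET.Element, first_year: int, last_year: int) -> list:
    """Return the ids of citations whose tagged year falls within the range."""
    hits = []
    for cit in root.iter("cit"):
        year = cit.findtext("year")
        # Citations without a usable year element are skipped, not guessed.
        if year is None or not year.strip().isdigit():
            continue
        if first_year <= int(year) <= last_year:
            hits.append(cit.get("id"))
    return hits


root = ET.fromstring(SAMPLE)
print(citations_in_range(root, 1955, 1955))
```

Note that the first citation appears in the 1956 volume but cites a 1955 article, which is why the filter reads the year element and not the volume year encoded in the id.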
3.2. The Challenge of Unicode
The vast majority of the LZhS is in the Russian language and uses the Cyrillic alphabet, but the text also contains other languages that use the Latin alphabet, and Greek characters appear frequently, for instance, in the mathematical subject categories. This mix of languages and alphabets led us to the obvious choice of Unicode as our character encoding standard for the LZhS.
This led to another major challenge, which was finding appropriate XML tools that support Unicode. Although the XML 1.0 specification clearly states, "All XML processors must accept the UTF-8 and UTF-16 encodings of [ISO/IEC] 10646" [XML 1.0], we found that not all available XML tools have happily embraced Unicode. This situation is improving, but was particularly bad when we started the project in October 1999.
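A brief sketch may help show why tools built around single-byte encodings struggle with this text. The Python below is illustrative only, with hypothetical function names; it compares character counts with UTF-8 byte counts for Latin, Cyrillic, and Greek samples and shows what a tool sees if it wrongly treats each byte as one character.

```python
def utf8_profile(text: str) -> tuple:
    """Return (number of characters, number of UTF-8 bytes) for a string."""
    encoded = text.encode("utf-8")
    # Round-trip check: decoding as UTF-8 restores the original text.
    assert encoded.decode("utf-8") == text
    return len(text), len(encoded)


def misread_as_single_byte(text: str) -> str:
    """Simulate a tool that interprets UTF-8 bytes as Latin-1 characters."""
    return text.encode("utf-8").decode("latin-1")


for label, sample in [("Latin", "Letopis"), ("Cyrillic", "Летопись"), ("Greek", "αβγ")]:
    chars, size = utf8_profile(sample)
    print(f"{label}: {chars} characters, {size} bytes")

# The misread Cyrillic word has twice as many "characters" as the original.
print(len(misread_as_single_byte("Летопись")))
```

Each Cyrillic or Greek letter in these samples occupies two bytes in UTF-8, so byte-oriented indexing, wildcard matching, or display code that assumes one byte per character will split or garble them.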
3.2.1. Search Engines
For years, we had been reliably and happily using the Pat (version 5.x) search engine from Open Text Corporation (http://www.opentext.com/) to index and search our large SGML text collections, but Pat does not support Unicode or other multi-byte character encodings. We have since migrated from Pat to the University of Michigan's XPAT (http://www.dlxs.org/), which is based on Open Text's Pat, but is being developed by Michigan specifically for digital library applications. As its name implies, XPAT supports XML; however, although the XPAT developers are working on providing Unicode support, XPAT is currently unable to support multi-byte encodings. Since neither of the products we had experience using to search SGML and XML documents supports Unicode, we were forced to spend a significant amount of time exploring other options.
Some of the things we were looking for in an XML search engine include:
-
Unix platform support (preferably AIX, Solaris, or Linux).
-
Full-text searching.
-
Wildcard and/or regular expression support.
-
Support for very large XML files (hundreds of megabytes).
-
Unicode support.
-
Java and/or XML API.
Everyone will have their own opinions about the various XML search engine and database products on the market, and all the products have their strengths and shortcomings. I think it is fair to say, though, that the primary focus of most commercial products is not scholarly applications such as the LZhS project. Rather, most commercial XML products are understandably focused on commercial and administrative applications and markets in which searching an XML document that contains, for instance, a combination of modern English, ancient Greek, Hebrew, and Latin is a very unlikely requirement.
Let it suffice to say that we have yet to find the ideal XML search engine for scholarly and digital library applications. But for the LZhS project and another Indiana University Digital Library Program project, which like the example above combines modern English, ancient Greek, Hebrew, and Latin, we are currently developing with XYZFind (http://www.xyzfind.com/) as our XML search engine/database. While XYZFind continues to lack some features we require, most notably wildcard support, we have overall been very happy with it, and the developers and support staff have been absolutely superb about responding to our concerns and requests. We are hoping that all the features we require will be implemented by the time we are ready to go public with the Web interface to the LZhS, which should happen during the coming year. The following passage from the Introduction of the XYZFind Server User's Guide [XYZFind User's Guide] provides a useful overview of the basic functionality and capabilities of XYZFind:
XYZFind is a server that consists of an XML repository and an XML query engine.
As an XML repository, XYZFind accepts any number of well-formed XML documents and maintains a single data representation of all of the documents it receives. The original documents may be retrieved, updated, or removed from the repository. Once a set of documents has been indexed by XYZFind, search and query services are available as outlined below.
As an XML query engine, XYZFind accepts queries written in an XML language called XYZQL. XYZQL is a powerful query language that includes support for path-level queries, Boolean queries, keyword search, and numeric range queries. An XYZQL query is a filter specification that constrains which XML documents are returned as well as which parts of documents are returned. XYZFind's query processor uses its repository to optimize this filtering process, exceeding the performance of less sophisticated approaches.
3.2.2. XML Editors
We also encountered difficulty finding a suitable, relatively user-friendly XML editor that supports Unicode, can edit large files, and makes it easy to tag existing untagged text. The people we have been able to hire to do the markup and proofreading have excellent Russian language skills, but they are not XML experts or information technology professionals. While we can and do provide some training to our editing staff, we nonetheless require an editor that can be used by individuals with minimal XML and information technology experience.
For past and current English-language projects that do not require Unicode support, we have used Emacs, with the PSGML Major Mode (http://www.lysator.liu.se/~lenst/about_psgml/), and SoftQuad's XMetal (http://www.softquad.com/). With version 2.1, XMetal now provides fairly good Unicode support (although it still does not fully support right-to-left languages such as Hebrew), but when we started the project in 1999, its Unicode support was very limited, and XMetal could not display Cyrillic Unicode text. By default we ended up choosing WordPerfect (http://www.corel.com/) as our XML editor for the project. It allows the editing of large documents, provides a relatively familiar word-processing interface for XML novices, and provides suitable Unicode support. The other products we tried were dismissed because they had an unsuitable interface for the type of work we are doing, because they lacked Unicode support, or because they required a knowledge of XML beyond that of the people we have been able to hire for the project.
4. Building the Interface
The ultimate goal of the LZhS project is to provide a freely accessible Web interface to the twenty years of LZhS data that we will eventually have digitized and marked up using XML. As the laborious process of digitization and markup continues, we have commenced work on building the Web interface to the data.
The current prototype of the Web interface was built using a combination of HTML, Java, and JavaScript, communicating with an XYZFind server. The Letopis interface is a single Java Server Page (JSP) that uses JavaScript and Java intertwined with HTML and calls what is effectively a local stateless JavaBean for the more complicated logic (this logic will likely be pulled out into a separate JavaBean eventually). In addition to the XYZFind classes, the JSP also uses Xerces for XML Document Object Model (DOM) access and Xalan for Extensible Stylesheet Language Transformations (XSLT).
The dynamic content on the page is managed by an externally loaded JavaScript file. Originally the JavaScript code was also contained within the JSP file, but problems were encountered with JSP/Servlet engines not correctly recognizing JSP files when they contained non-Latin Unicode characters (yet another Unicode-inspired challenge!).
The search interface allows the user to enter search terms and combine them with a Boolean AND ("All words") or Boolean OR ("Any words") or search for the search terms as an exact phrase. In addition, the user can select a date range and subject area to search. Once the user selects a top-level (div1) subject heading, the page will dynamically generate a list of the second-level (div2) subject headings contained within the selected top-level subject heading. Likewise, once a second-level (div2) subject heading is selected, the page will dynamically generate a list of the associated third-level (div3) subject headings. If the user selects only a top-level (div1) or second-level (div2) subject heading, the query will also search through the div2 and/or div3 subject categories contained within the chosen top- or second-level subject heading. This allows the user to execute a broad search through one of the more general subject headings and all the subcategories it contains or restrict a search to one of the more specific second- or third-level subject categories.
Figure 1 below shows the search interface as it is initially presented to the user.
Figure 2 below shows the search interface with a third level (div3) subject heading selected and other options selected in the "Search for" and date pull-down menus.
Figure 3 below displays a search results page. Each citation is listed under its most immediate subject heading with the appropriate Roman numeral, Arabic numeral, and Cyrillic letter combination indicating the subject heading's position within the subject hierarchy. Scholars used to the print version of the LZhS will be familiar with the subject hierarchy and expect to see these outline indicators.
Figure 4 below displays another search results page illustrating citations with a combination of Russian/Cyrillic and English/Latin text.
5. Conclusion
The LZhS project poses a number of significant challenges. Chief among these is simply completing the task of digitizing and marking up the vast amount of data we have chosen to make available on the World Wide Web. The second major challenge is dealing with the technical issues inherent in working with a multi-language, multi-alphabet text and the still relatively immature Unicode support found in many current XML tools. We have been forced to make compromises to deal with these challenges, but we are confident that in the end we will be able to offer to the scholarly community an extremely valuable research tool and ensure the continued preservation of an endangered intellectual resource. And in the process of completing this important scholarly mission we will gain invaluable experience from which we and others may learn as we embark upon other still more ambitious scholarly projects that exploit the vast potential of XML and other modern information technologies.
Acknowledgements
I would like to thank my colleagues Deb Horn, Kenrick Rawlings, and Andy Spencer, all from Indiana University's Digital Library Program, who have worked closely with me on this project and provided valuable advice in the preparation of this paper.

