XML 2003 logo

BookBuilder: Content Repurposing with Topic Maps

Abstract

We present a solution developed at Aspen Publishers which uses Topic Maps technology [XTM], [ISOTM] for multidimensional indexing and classification of content across various products and publications. A topic map composed of numerous merged indexes and classifications is used for navigation and, most importantly, for content repurposing and for building new and customized publications.

Aspen Publishers, an information provider for attorneys, business professionals and law students, has an unusual perspective—unlike most publishers its audience is equally receptive to both print and electronic products. Because Aspen publishes products across many subject lines, customers may want information contained in multiple products (e.g., tax issues are covered in such disparate areas as Pensions, Corporate Law and Insurance). As more of its titles move to XML, Aspen is looking to repurpose material in ways that make it more useful, more easily located or more readily navigated.

Because of the cross-disciplinary nature of many subjects, Aspen’s first look at “chapter-chunking” custom publishing was unsatisfactory. The chapters and sections listed in a Table of Contents offered topics that were too broad. Aspen needed a way for customers to combine smaller components of multiple publications to fit their needs more precisely. Working with Cogitech, a topic maps consultant, Aspen devised an approach that let the XML markup used for indexes become the basis for an index-based book builder.

In the browser-based utility that will be demonstrated, as with any custom publishing application, a library of books can be accessed and those books’ chapters, sections and subsections dragged and dropped into a column representing the book to be constructed. But, uniquely, all of the library’s book indexes can also be called up and individual entries added instead or as well. And, of course, the custom book that is created has a single, unified index.

The topic map approach combines all the indexes, tables of contents, glossaries and referenced materials from all books in the series into a single multidimensional index. Using topic map associations, the entries in one book’s index can be related to each other, to entries in the other books’ indexes, glossaries, tables of contents and so on, and to all external referenced materials, such as IRS publications or websites. This approach allows each index entry to point directly to the content in the book, which is displayed in the BookBuilder application so the user can explore all the associations any topic brings up. The demonstration will conclude with a discussion of the underlying topic-map technology and the use of XSLT to provide real-time creation of the HTML display.

Table of Contents

1. How did we get here?
2. Aspen Custom Publishing
3. System Requirements
4. Why Use Topic Maps?
5. Aspen BookBuilder
6. Conclusions
Bibliography
Glossary
Biography

1. How did we get here?

Aspen Publishers is a New York-based legal publisher and part of the WoltersKluwer Legal-Tax-Business cluster in North America. As it happens, legal texts are prime candidates for XML publishing for several reasons. In the first place, lawyers have fully embraced electronic publications. Like everyone else, they go to the internet first for research, but unlike readers in other areas of publishing they have also eagerly embraced CD-ROM publications. Many legal texts are consequently good candidates for release in three different formats. On the measure of multipurposing, Aspen handily meets Jon Bosak’s threshold of 1.6 uses for applying structured markup.

And of course legal texts are relatively stable. Many issues of law are pretty much settled, so when new laws are passed, new court cases decided, new regulations promulgated, you’re not usually throwing away the previous text, but instead adding to it or editing it somewhat. This type of authoring also lends itself to inhouse or staff authoring, which facilitates using XML, but for historical reasons and unlike its sister companies Aspen works almost entirely with outside authors and is unable to dictate their working environment.

The first series of books that Aspen put into structured markup shared not only the same design but the same structure as well: a question-and-answer or frequently asked question (FAQ) format. (See Figure 1.) As the 41 titles in this series of AnswerBooks, spanning the full breadth of the law, were migrated into XML, the question arose as to how else to take advantage of this fact. Aspen includes a legal-education division, which publishes books for law school classes, and educators’ natural desire for material customized to specific classes introduced the notion of creating a custom-publishing application. Mixing and matching chapters, sections, subsections or even questions into a new book seemed a natural way to take advantage of the different books’ identical formatting and structure. And of course this application was envisioned to generate a single, unified index for any custom publication. [1]

The AnswerBook chapters begin with an introduction and include a “mini-TOC” for the chapter, with references to specific page numbers, for the section, subsection and subsubsection headings. The questions are numbered sequentially, with the chapter number as the first component. Supplements to a full edition include only edited and new questions (which receive a “dot” or decimal number using the question number of the question after which they are inserted). The question-and-answer groups, as you would expect, cross page boundaries.

Figure 1. Opening Page of an AnswerBook

At the time this project was first discussed, each book’s Table of Contents and List of Questions [2] were being generated programmatically from the XML. Each book’s index, however, consisted of an XML file that was manually edited as entries and references were added or deleted. (See Figures 2 and 3.) Similarly, the “end-tables” (such as the Table of Cases) existed as XML files instead of being generated programmatically. In the body of the book, each question-and-answer pair is contained within a <qagroup> element. The index and end-tables point to the questions by number and not to the page where the question begins, while the Table of Contents (TOC) and List of Questions (LOQ) point to the page containing the question and beginning of the answer.

The indexes for AnswerBooks have a three-deep hierarchy. References to specific questions point to the question number and not the page where the question appears. In a supplement, references to questions in the full edition retain their initial numbering, but edited and inserted questions are preceded by the letter “S,” indicating the latest information is in the version that appears in the supplement. Ranges, consequently, have to be interrupted when they include a supplement question, such as that under the item “Age 50 catch-up limit on contributions.”

Figure 2. An Index page
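
The interrupted ranges described above can be illustrated with a short sketch. It is not Aspen's production logic: the tuple layout, the function names and the printed format are assumptions made for illustration, and the "dot" numbers of inserted questions are left out to keep the example small.

```python
from typing import List, Tuple

# Hypothetical, simplified reference: (chapter, question, in_supplement).
Ref = Tuple[int, int, bool]

def label(ref: Ref) -> str:
    """Render one reference, prefixing supplement questions with 'S'."""
    chapter, question, in_supplement = ref
    return f"{'S' if in_supplement else ''}{chapter}:{question}"

def collapse(refs: List[Ref]) -> str:
    """Collapse consecutive questions into ranges; a supplement
    question never joins a range, so it interrupts the run."""
    runs: List[List[Ref]] = []
    for ref in sorted(refs, key=lambda r: (r[0], r[1])):
        last = runs[-1][-1] if runs else None
        extends = (
            last is not None
            and not ref[2] and not last[2]   # neither is a supplement question
            and ref[0] == last[0]            # same chapter
            and ref[1] == last[1] + 1        # next question number
        )
        if extends:
            runs[-1].append(ref)
        else:
            runs.append([ref])
    parts = []
    for run in runs:
        if len(run) == 1:
            parts.append(label(run[0]))
        else:
            parts.append(f"{label(run[0])}-{label(run[-1])}")
    return ", ".join(parts)

refs = [(3, 1, False), (3, 2, False), (3, 3, True), (3, 4, False), (3, 5, False)]
print(collapse(refs))
```

In this sample the supplement question in the middle splits what would otherwise be a single range into two.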

In the XML file Aspen started with, the question numbers were stored as attributes, while all punctuation for the print edition was stored as content.

Figure 3. Markup for Indexes
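
Since the figure itself is not reproduced here, the following sketch shows what markup of this kind might look like and how the pointers can be read from it. The element and attribute names (`index`, `entry`, `term`, `ref`, `q`) and the question numbers are invented for illustration; they are not Aspen's actual DTD or data. The point is only that the question numbers live in attributes, so the print punctuation in the content can be ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical index markup: question numbers in attributes,
# print punctuation (the commas) as ordinary content.
SAMPLE = """
<index>
  <entry><term>Compensation</term>
    <entry><term>definition of</term>, <ref q="9:35"/>, <ref q="9:36"/></entry>
    <entry><term>limits on</term>, <ref q="9:40"/></entry>
  </entry>
</index>
"""

def index_pointers(xml_text):
    """Return (entry path, question number) pairs, skipping punctuation."""
    root = ET.fromstring(xml_text)
    pointers = []

    def walk(entry, path):
        path = path + [entry.findtext("term").strip()]
        for ref in entry.findall("ref"):
            pointers.append((" > ".join(path), ref.get("q")))
        for child in entry.findall("entry"):
            walk(child, path)

    for top in root.findall("entry"):
        walk(top, [])
    return pointers

for path, question in index_pointers(SAMPLE):
    print(path, "->", question)
```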

In a traditional custom-publishing scenario, Aspen could list the books’ contents divided into chapters or segments smaller than chapters, such as sections, subsections and subsubsections, and provide a mechanism for checking off which segments a professor or professional user wanted in the custom publication. This list would then be fed to the pagination application which would assemble the one-off title and an index and format it. Of course, this scenario would depend on Aspen’s converting the index’s XML file into inline references. [3] Some publishers actually construct such titles by combining already existing PDFs, automagically generating a unified index. They believe that the TOC hierarchy provides sufficient indication of a segment’s content, at least so far as a professor is concerned.

2. Aspen Custom Publishing

This crude level of chapter-chunking does not serve the professional user well. Just as a story about a sports star signing a new contract appears in a newspaper’s sports section but is also a business topic, many AnswerBook questions have relevance to more than a single topic. In fact, of course, the mere concept of an index acknowledges that the primary hierarchy is not suitable for all types of access into a book’s contents. As Aspen studied the effort to transform the index’s XML file into inline indexing, [3] these advantages for custom publishing were identified:

  • all titles use same DTD/schema

  • inline indexing would facilitate question moving and renumbering, as well as automatic generation of a unified index

  • material already atomized into qagroups

  • index entries point to questions not pages

  • qagrouping provides start- and stop-points in the text

  • author for each question already identified

  • unique ID, such as a Digital Object Identifier (DOI), considered for each question, potentially allowing for access from outside the system

At this juncture, Nikita Ogievetsky of Cogitech provided the necessary guidance to steer Aspen towards using topic maps for its custom-publishing application. Topic maps, as he pointed out, derive from indexes, and indexed material translates easily into topic maps. As to their suitability for this project, you can be the judge here.

Cogitech’s Adaptive Classificator framework [xml2000], developed earlier, formed the basis for the proposed approach. The first step was to separate the question-and-answer content from the back-of-the-book indexes and the so-called end-tables [4]. Question-and-answer pairs from all books were collected into a uniform corpus and kept in an XML repository. All indexes, including the Table of Contents (TOC), Back of the Book (BOB) index, glossary, references into IRS publications, etc., would be extracted in the form of topic map associations that point into information resources (e.g., question-and-answer pairs) stored in the content repository. Some additional associations can be inferred based on the topics’ similarity and semantic proximity. Topic maps extracted from all books are to be merged into a single topic map and maintained in a topic map repository. The XML repository can be used for this purpose with some proper normalizations.

Individual books can be reconstructed following TOC indexes. This is possible because associations of each TOC index are scoped with the theme of the original book. As the TOC and qagroups are reconstructed, BOB and other indexes follow. This allows content repurposing and facilitates maintenance of BOB and other indexes.
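
The role of scope in reconstruction can be sketched as follows. This is a simplified model rather than the XTM structures the framework uses: the `TocAssociation` fields, the repository layout and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TocAssociation:
    scope: str       # theme: the book this TOC entry belongs to
    position: int    # order within that book's TOC
    heading: str
    qagroup_id: str  # pointer into the content repository

def reconstruct(book, associations, repository):
    """Rebuild one book by following only the TOC associations
    scoped to it, in TOC order."""
    own = sorted(
        (a for a in associations if a.scope == book),
        key=lambda a: a.position,
    )
    return [(a.heading, repository[a.qagroup_id]) for a in own]

# Invented sample data: one shared repository, two books' TOCs merged.
repository = {"qa1": "Q and A one", "qa2": "Q and A two", "qa3": "Q and A three"}
associations = [
    TocAssociation("Book A", 2, "Limits", "qa2"),
    TocAssociation("Book B", 1, "Fiduciaries", "qa3"),
    TocAssociation("Book A", 1, "Eligibility", "qa1"),
]

print(reconstruct("Book A", associations, repository))
```

Because every TOC association carries its book as scope, the merged topic map can hold all books at once and still yield each one separately.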

The merits of this approach become even clearer as one considers its use for custom publishing. Indexed content of the original books can be used as the basis for the new publication. A book in our framework is a mere collection of TOC associations, so constructing a customized book becomes almost as easy as reconstructing any of the original ones. The steps needed are as follows:

  1. Create a topic for the new book, provide it with a title and some other metadata.

  2. Create TOC associations scoped to the theme of this new book. Add references to existing or newly created question-and-answer pairs to these TOC associations.

  3. Authors can select existing content by navigating topic map relationships (TOC, BOB and other indexes and inferred associations) extracted from the previously “mapped” books.

  4. Adding new question-and-answer pairs requires authoring the text and mapping it into topic map of existing indexes.
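
The first two steps, together with the unified index they make possible, might be sketched like this. The dictionary layout, the function name and the sample identifiers are hypothetical; the sketch only shows that once index associations point at qagroups, a unified index for any selection falls out of a simple filter.

```python
from collections import defaultdict

def build_custom_book(title, selected_ids, index_associations):
    """Create a book 'topic', TOC associations scoped to it, and a
    unified index limited to the selected question-and-answer pairs.

    index_associations: (index term, qagroup id) pairs drawn from
    the indexes of all source books.
    """
    book = {"title": title, "toc": [], "index": {}}
    renumber = {}
    for n, qid in enumerate(selected_ids, start=1):
        renumber[qid] = n  # questions are renumbered in the new book
        book["toc"].append({"scope": title, "position": n, "qagroup": qid})
    unified = defaultdict(set)
    for term, qid in index_associations:
        if qid in renumber:  # keep only entries pointing at selected content
            unified[term].add(renumber[qid])
    book["index"] = {term: sorted(nums) for term, nums in sorted(unified.items())}
    return book

# Invented sample data.
index_associations = [
    ("Loans", "q1"), ("Loans", "q2"), ("Vesting", "q3"), ("Hardship", "q2"),
]
book = build_custom_book("Custom Title", ["q2", "q1"], index_associations)
print(book["index"])
```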

Once the conceptual framework was settled, a few requirements were delineated. Namely, any application built would have to accommodate the following editorial needs:

  • Questions (or more properly, qagroups) in the AnswerBooks move about and have to be renumbered when a full edition is issued.

  • New questions have to be inserted when supplements are prepared.

  • New material and deleted material require that the index entries reflect the text changes.

It was recognized that relying on inline indexing would allow qa-numbering to move forward in the book production timeline, closer to print time. This was not a requirement but any easing in the schedule makes life simpler for the editors.

3. System Requirements

As we built the prototype of Aspen’s BookBuilder custom-publishing application, it was clear that the needs of several different kinds of users would have to be considered:

  • Law school professors constructing texts

  • Specialist practitioners collecting information on small as well as large topics

  • Editors creating specialty one-off titles out of the general texts

  • Website visitors, using our interface to navigate the AnswerBooks’ content online

This last item was not part of the custom-publishing notion, but along the way the content navigation model was admired by some of the people responsible for website design.

Specifically, the custom-publishing application included these requirements:

  • It would have to construct a new contents list that could be handed off to another application that would compose the custom publication in PDF or Microsoft Reader .lit format. Both local generation, using Apache FOP, and online generation, using the facilities of Aspen’s 200-page-per-minute PDF generator, Datalogics Pager, were considered.

  • It would have to use a drag-and-drop interface to construct the new contents list

  • Two parallel development paths would have to be considered, online and standalone. A standalone application would be distributed on CD-ROM and might require online generation of PDF.

  • The design of the custom-publishing application would need to facilitate both pre-planned research and spur-of-the-moment wandering, as typified by web browsing. Thus, the application would need to be able to display a variety of levels of content of any question indicated.

  • Lastly, and most critically for custom publishing, related material in otherwise totally unconnected books would need to be identified.

This last requirement is truly what brought Aspen to topic maps.

4. Why Use Topic Maps?

In the first place, topic maps require close and accurate subject identification for all the occurrences that the topics are going to point to. This is no small requirement, especially as the material you are dealing with moves into the tens of thousands of pages. But, odd as it may seem to anyone who has worked in the low-margin world of publishing, book publishers, unlike most businesses, have already invested vast amounts of money to perform this identification. [5] That is, they have hired subject-matter experts (SMEs) to index their titles, not for building topic maps but for navigating a book by topic. Because this expense has already been incurred, it is not a barrier to using topic maps but an incentive: doing so capitalizes on an existing investment. In the case of Aspen’s AnswerBook indexes, serendipitously the start- and stop-points of any index entry were already identified, because the index pointed not at a page number but at a qagroup. That meant the extent of the material covered by any entry could be identified with one hundred percent certainty.

Indexes also contain several useful relationships easily modeled in topic maps—the see and see also relations most prominently. Other relationships will be discussed momentarily.
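
As a rough illustration of how see and see also relations can be treated as typed associations, consider the sketch below. The tuple representation and the sample terms are assumptions (only “Age 50 catch-up limit on contributions” comes from the index page shown earlier), and a real topic map would express these as associations with roles rather than plain tuples.

```python
SEE, SEE_ALSO = "see", "see-also"

# (association type, from term, to term): invented sample data.
associations = [
    (SEE, "Catch-up contributions", "Age 50 catch-up limit on contributions"),
    (SEE_ALSO, "Age 50 catch-up limit on contributions", "Elective deferrals"),
    (SEE_ALSO, "Age 50 catch-up limit on contributions", "Contribution limits"),
]

def resolve(term, associations):
    """Follow 'see' redirects to the preferred term, guarding against cycles."""
    redirects = {src: dst for kind, src, dst in associations if kind == SEE}
    seen = set()
    while term in redirects and term not in seen:
        seen.add(term)
        term = redirects[term]
    return term

def see_also(term, associations):
    """List the terms that a 'see also' association links to this term."""
    return sorted(dst for kind, src, dst in associations
                  if kind == SEE_ALSO and src == term)

preferred = resolve("Catch-up contributions", associations)
print(preferred, "->", see_also(preferred, associations))
```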

5. Aspen BookBuilder

With this in mind, let us take a look at the prototype for Aspen’s custom book constructor, the Aspen BookBuilder. It utilizes the Cogitative Topic Maps Websites (CTW) framework, an XSLT library for Topic Maps as described in [CTW], [XTM4W], and [XSLTCB]. The prototype was designed for proof of concept and speed of development, and consequently does not represent the final application in many respects. For this prototype, Aspen selected just 5 of its 41 AnswerBook titles and then supplied approximately 20 percent of the questions from these titles. In some cases, questions or index entries were selected because it was clear they had relations that would span across the titles—in other words, the sample material was probably richer than the material as a whole. The five titles—Pension Answer Book, 401(k) AnswerBook, 403(b) AnswerBook, ERISA Fiduciary AnswerBook and Nonqualified Deferred Compensation AnswerBook—were assumed to have related material on benefits and pensions and are in fact marketed as the Panel Pension Library, Panel being an Aspen imprint.

The prototype being demonstrated in Philadelphia is shown in Figures 4, 5 and 6. These are web-based screenshots, reflecting not only the simpler path of developing a web application, but also a reliance on standard technologies such as HTML, CSS and XSLT, as well as XTM. (We hope to demonstrate access to the same underlying topic maps using hyperbolic or star-tree representations but this work was not completed at this paper’s submission time.)

The top row of BookBuilder contains the names of the five included books, selectable by the user. In this case, the second title, 401(k) AnswerBook, was chosen. Either the Table of Contents or the Index for each title can be displayed along the screen’s left side. Until an index or TOC entry is selected, the middle column simply displays the book title and other metadata. The righthand column will hold the items selected for the book being constructed. Any topic in the lefthand column can be dragged into the list on the righthand side.

Although our figures do not display the Table of Contents, any chapter or subpart of a chapter can be dragged into the book construction list, which can be re-ordered once it is complete. The mechanisms for dealing with questions that are pulled in more than once also have to be put into place, but were not required for the proof of concept.

Figure 4. Aspen BookBuilder Opening Screen

When the Index is selected, entries or topics from the book’s index are displayed. Here, the top-level entry “Compensation” has been selected. This entry has two sub-entries, which are displayed, as are the questions referred to in the second-level index entry. Note that below the Compensation entries are listed similar topics from other titles. In this case, the similarity relationship is based on the use of the same term in the different indexes. Wherever an index entry refers to specific qagroups, the questions themselves are displayed. From a navigational standpoint, any question or index entry shown in the middle column is clickable: clicking takes the display to that entry or to that question and answer.

Figure 5. BookBuilder Displaying a Top-Level Index Entry
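
The name-based similarity relationship can be approximated in a few lines. The normalization rule (case and whitespace folding) and the sample index terms are assumptions for illustration; the paper says only that similarity rests on the same term appearing in different indexes.

```python
from collections import defaultdict

def normalize(term):
    """Fold case and collapse whitespace so trivially different spellings match."""
    return " ".join(term.lower().split())

def similar_topics(indexes):
    """indexes: {book title: [index terms]}.
    Return {normalized term: [books]} for terms found in two or more books."""
    by_term = defaultdict(set)
    for book, terms in indexes.items():
        for term in terms:
            by_term[normalize(term)].add(book)
    return {term: sorted(books) for term, books in by_term.items() if len(books) > 1}

# Book titles are from the prototype; the index terms are invented.
indexes = {
    "401(k) AnswerBook": ["Compensation", "Hardship withdrawals"],
    "403(b) AnswerBook": ["compensation", "Salary reduction agreements"],
    "Pension Answer Book": ["Compensation ", "Vesting"],
}

print(similar_topics(indexes))
```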

If a question is selected rather than an index entry, the question and its answer are displayed. Roughly 12 lines of text are displayed before scrollbars in the shaded answer part are required. In the prototype, users can view the complete answer, but any restriction can be placed here. Related topics within the same book refer to the other index entries that point to this question. While the user has navigated to this spot by selecting 401(k) AnswerBook > Contributions > limits, he or she can also reach it using one of the other three entries in the index that point to this question. Any question that shares an index entry with this question has a “related topics” relationship to it.

Although the link here to Question 8:66 is not highlighted, links were built both within the text and to external sources, such as the reference to the Economic Growth and Tax Relief Reconciliation Act (EGTRRA) § 616(a)(2)(A). Because Aspen includes an online primary-law research site, Loislaw, the availability of primary sources alongside the explanatory material heightens the navigational strengths of this interface.

Figure 6. Question and Answer Display
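
Both relationships shown on this screen follow from the index pointers alone, as the sketch below suggests. The function names and the sample pointers are hypothetical; only the entry “Contributions > limits” and question number 8:66 appear in the paper, and their pairing here is invented.

```python
def other_entries(question, pointers, current=None):
    """Index entries that also point to this question (besides the current one)."""
    return sorted({entry for entry, q in pointers if q == question and entry != current})

def related_questions(question, pointers):
    """Questions that share at least one index entry with this question."""
    entries = {entry for entry, q in pointers if q == question}
    return sorted({q for entry, q in pointers if entry in entries and q != question})

# (index entry, question number): invented sample data.
pointers = [
    ("Contributions > limits", "8:66"),
    ("Contributions > limits", "8:67"),
    ("Catch-up contributions", "8:66"),
    ("EGTRRA > changes", "8:66"),
    ("EGTRRA > changes", "9:12"),
    ("Vesting", "2:10"),
]

print(other_entries("8:66", pointers, current="Contributions > limits"))
print(related_questions("8:66", pointers))
```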

6. Conclusions

The BookBuilder prototype easily related index entries from five different titles, using XML topic maps to identify the index entries and the basic relationships used to connect associated concepts. Merging of the different topic maps was based on names, but in the actual production arena it was envisioned that published subject indicators (PSIs) would be used for merging. After the application was constructed and the notion extended to other series, the natural fit of this material to this technology became more apparent. First, the atomized nature of the FAQ approach makes it simple to identify the starting and ending point of any topic for inclusion in the custom publication. For many other books, a second pass through the index by an SME would need to take place, to map the exact start- and stop-points. Second, the detailed index of the material that is necessary for this approach to work poses no obstacle to technical publishers, who have made huge investments in indexing already. As was noted before, choosing open standards facilitated the development of this project. Although, to the authors’ knowledge, no index-based custom-publishing application had ever been built before, the Aspen BookBuilder took less than two weeks to construct from requirements list to working prototype. Most of the subsequent work was simply in cleaning up the user interface.
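
The difference between name-based and PSI-based merging can be made concrete with a small sketch. The topic dictionaries, the `example.org` PSI addresses and the occurrence identifiers are all invented for illustration; the prototype's actual merging was done on XTM topic maps, not on structures like these.

```python
def merge_topics(topics, key):
    """Merge topics that share the same key, pooling names and occurrences."""
    merged = {}
    for topic in topics:
        k = key(topic)
        if k not in merged:
            merged[k] = {"names": set(), "occurrences": set()}
        merged[k]["names"].add(topic["name"])
        merged[k]["occurrences"] |= set(topic["occurrences"])
    return merged

CATCH_UP = "http://example.org/psi/catch-up"          # hypothetical PSI
COMPENSATION = "http://example.org/psi/compensation"  # hypothetical PSI

topics = [
    {"name": "Catch-up contributions", "psi": CATCH_UP, "occurrences": ["401k:q1"]},
    {"name": "Age 50 catch-up limit on contributions", "psi": CATCH_UP, "occurrences": ["pension:q7"]},
    {"name": "Compensation", "psi": COMPENSATION, "occurrences": ["401k:q2"]},
    {"name": "compensation", "psi": COMPENSATION, "occurrences": ["403b:q4"]},
]

by_name = merge_topics(topics, key=lambda t: t["name"].lower())
by_psi = merge_topics(topics, key=lambda t: t["psi"])
print(len(by_name), len(by_psi))
```

In this sample, name-based merging leaves the two differently worded catch-up topics apart, while merging on a shared subject indicator brings them together.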

Content repurposing is not just a buzzword in publishing, but in fact one of the essential means for justifying the expenditures for better, richer content. XML, of course, presupposes repurposing as the natural course of events. Whether the topic map methodology described here is used for custom-publishing or small-run narrow-interest republishing or web navigation, it clearly makes separate books into a web of interconnectedness that puts exploration of the material onto a higher level.

Bibliography

[XTM] http://www.topicmaps.org

[ISOTM] http://www.isotopicmaps.org/

[xml2000] http://www.cogx.com/xml2000

[XTM4W] XML Topic Maps: Creating and Using Topic Maps for the Web, Addison-Wesley, 2002. (Chapter 9).

[XSLTCB] XSLT CookBook, O’Reilly 2002. (Recipe 11.4)

[CTW] http://www.cogx.com/ctw

Glossary

BOB

Back of the Book

CTW

Cogitative Topic Maps Websites

DOI

Digital Object Identifier

FAQ

frequently asked question

LOQ

List of Questions

PSI

published subject indicator

SME

subject-matter expert

TOC

Table of Contents

Biography

Nikita Ogievetsky is a consultant in knowledge technologies and information management. He leads the community in finding enterprise solutions for real-life problems using Topic Maps, XSLT and other XML technologies. He has developed the Cogitative Topic Maps Websites (CTW) framework and is actively involved in enabling the interchange between the RDF and Topic Map standards. He has authored over twenty papers on knowledge technologies, applied mathematics and physics.

Roger Sperberg is an electronic publishing consultant working with WoltersKluwer’s LTBNA group. He was formerly manager of electronic publishing systems at Aspen Publishers, a legal publisher which has begun publishing its books direct from XML. Prior to that he was director of content services for The Ballantine Publishing Group at Random House. The author of many web articles and co-author of a multimedia text, he is the author of the for.eWords column at eBookWeb.



[1] Tommie Usdin of Mulberry Technologies provided Aspen with guidance at this juncture, critically pointing out the frustrations engendered by a custom publication whose chapters are referenced only by a collection of indexes from the books supplying chapters and successfully pointing the effort in the direction of a unified index.

[2] There are as many as 1500 questions in some of the series titles, so this is an important navigational feature.

[3] Each entry in the index would therefore require an element to be inserted into the text. If a question was referenced three times in the index, three references would be attached to the qagroup in order to regenerate the index at publication time. Some question-and-answer pairs are referenced as many as fifty times in the index.

[4] Although not discussed in detail here, these separate BOB sections are structurally no different from the main index, but carry specialized references, such as sections of the Internal Revenue Code or Treasury Regulations.

[5] At Aspen, for instance, the amount budgeted for indexing typically equalled that allocated for composition.