Keywords: Content Management, Topic Maps, Database
Biography
Mr. Carton is Vice President of Content Technologies for Retrieval Systems Corporation. He has over 25 years of experience in information systems development, with special expertise in the technologies of the publishing and library sectors. He couples strong requirements knowledge with best-of-breed technology development skills to create content-centric solutions. At RSC, he is responsible for the design and development of complex repository management systems for publishing clients and for research and development into the company's CMS product "Tractare".
Retrieval Systems Corporation (RSC) is an established software services company with a well-earned reputation for delivering successful content management and publishing deployments on time and under budget. Our customers include professional publishers and other public and private institutions with significant content management and publishing requirements. As a thought leader in content management and delivery systems, RSC was recently recognized by The Gilbane Report for its leadership in blending commercial off-the-shelf (COTS) and open source components to provide high-impact, low-cost content management solutions.
Prior to joining RSC, Mr. Carton worked for Preston-Whitelaw, Ltd., an early adopter of for-profit web publishing technologies, and for Online Computer Systems (A Reed-Elsevier company), where he was Vice President of Electronic Products and responsible for their premier line of CD-ROM products.
Mr. Carton, a native of New England, lives in the Annapolis, MD area with his family. He enjoys sailing, reading, gardening, and travel.
Topic maps (XTM) offer a standards-based way to represent classification schemes. Practical uses of topic maps include classifying many kinds of content and leveraging metadata to extend classification schemes. The associations in topic maps provide unparalleled opportunities to improve the users’ navigation experience and increase serendipity. But is it possible to integrate several classification schemes while retaining accuracy, and can the online navigation experience be improved for the user? Can the many classification schemes work with each other to improve the accuracy of the user experience?
This paper discusses a framework that can be used to integrate two (or more) XTM classification schemes, using information from one scheme to improve the navigation of others while increasing the serendipity of the user experience. Although presented with examples from a specific commercial topic map and CMS engine, the same framework could be used in many integration settings.
1. Introduction, Goal and Purpose
2. Approach
3. Sample Classification
4. XTM Development
5. Application Sample
6. Conclusion
Bibliography
Perhaps the most commonly used form of content classification is the multi-level hierarchical topical index. In either printed or electronic form, these finding aids are developed and provided for all manner of content. Often, the tree that classifies some piece of content is manually created, thus intellectually augmenting the content. The classification tree becomes an important part of the content itself. Presenting the index with the content is an important part of the publication of the content.
In electronic form, however, despite advances in navigation technology, large multi-level hierarchical indexes are difficult to use: they usually require more display space than computer monitors and windows permit, they often present disparate portions of the index in isolation, and they are frequently not interconnected – each hierarchy is presented in isolation from others. The larger the indexes, the more compelling these problems become. How often have we heard, or even said, that the index to a product was useless? Serendipity is “luck, or good fortune, in finding something good accidentally.”[Serendipity] We have lost the serendipity experience of browsing a printed index.
An interesting problem arises when there are multiple classification schemes for content. For example, a help-center document collection might be indexed topically, with each content document also indexed by a particular part, product, or even product line.
Topic Maps[XTM Spec] offer a standards-based way to represent a classification scheme or index. Using Topic Maps, we can develop new ways to represent the traditional multilevel index to increase navigability while preserving the intellectual augmentation of the content. By integrating these different classification schemes over content, we can achieve a more robust content finding aid and improve the user experience, both in terms of accuracy and serendipity.
This paper discusses a Topic Map (XTM) ontology that can be used to implement integrated topical and classification schemas and the technical issues that were addressed by this ontology. Although presented with examples developed in the Ontopia OKS Topic Map and RSC’s Tractare CMS engines, the same techniques could be used in many integration settings.
Content Classification is the process of indexing content in accordance with some pre-ordained scheme. This could be as simple as listing key words or phrases describing the content or identifying elements of a controlled vocabulary (or thesaurus) that describe the content within a framework.
But why do it? Simply put, so that others can find the content. Finding aids are just that – aids for others to find the content. For finding aids to be useful, they must adhere to "standard" nomenclature. These "standards" have their own problems, as described by Garshol[TopicMaps!]:
“…In addition to metadata not necessarily saying very much about the aboutness of an object a related problem is that making metadata describe the subject precisely may also be difficult… if the authors had not listed "topic maps" as a keyword those searching for "topic maps" would have been unable to find their papers at all…”
Describing the “aboutness” of an object is an important part of making the object find-able. But, classifying content is not enough to make the content “find-able.” The classification scheme must also be supported by an access method – navigation, searching or whatever. Over time, access methods have evolved from printed tables of content and subject keyword indexes to sophisticated online browsing and searching of mammoth classification schemes. But what online methods add in completeness (and even retrieval speed), they tend to lose in serendipity – it is harder to “stumble” into the content you want, especially as the size of the content and classification scheme grows.
The use of TopicMaps as a paradigm for developing a classification model increases the power both of the creation and the retrieval processes. We can classify more content, with more complexity and still provide an accurate user experience. We can also integrate multiple classifications, introducing a whole new way of accessing content. TopicMaps, as a design methodology, have the power to revolutionize the way we think about classification and subsequent navigation of content.
In this paper, we will develop an example TopicMap, based on content extracted and heavily modified from the EPA website, that represents the integration of two simplistic classification schemes. We will show how the topic map results in improved user navigation accuracy and serendipity experience.
In our application, there are two topical classification schemes over a collection of content. The first is a multi-level hierarchical topical index. The second is a flat document type index.
Let’s start by looking at some of the content (the “TYPE” attribute is fictional). Some of these are classified under multiple topics:
1. IRIS: Integrated Risk Information System
   TYPE: Database
   A database of human health effects that may result from exposure to various substances found in the environment.
   URL: http://www.epa.gov/iris/index.html
   TOPICS:
   Radiation And Radioactivity > Exposure
   Human Health > Exposure
2. Health Effects Notebook for Hazardous Air Pollutants
   TYPE: Fact Sheet
   The fact sheets available on this Web page describe the effects on human health of substances that are defined as hazardous.
   URL: http://www.epa.gov/ttn/atw/hapindex.html
   TOPICS:
   Radiation And Radioactivity > Exposure
   Human Health > Exposure
3. Understanding Radiation: Exposure Pathways
   TYPE: Guide
   This page provides information about different exposure pathways to radiation.
   URL: http://www.epa.gov/radiation/understand/pathways.htm
   TOPICS:
   Radiation And Radioactivity > Exposure
   Radiation And Radioactivity > Exposure > Exposure Pathways
4. Draft Report on the Environment: Human Health: Environmental Pollution and Disease
   TYPE: Report
   There is an association between environmental exposure and certain diseases.
   URL: http://www.epa.gov/indicators/roe/html/roeHealthEn.htm
   TOPICS:
   Radiation And Radioactivity > Exposure
5. Effects of Radiation Type and Exposure Pathway
   TYPE: Guide
   The type of radiation to which a person is exposed and its exposure pathway influences health effects.
   URL: http://www.epa.gov/radiation/understand/health_effects.htm#typeandexposure
   TOPICS:
   Radiation And Radioactivity > Exposure > Exposure Pathways
6. Draft Report on the Environment: Human Health: Environmental Pollution and Disease
   TYPE: Report
   There is an association between environmental exposure and certain diseases.
   URL: http://www.epa.gov/indicators/roe/html/roeHealthEn.htm
   TOPICS:
   Human Health > Exposure
7. Chemicals in the Environment: OPPT Chemical Fact Sheets
   TYPE: Fact Sheet
   These fact sheets provide a brief summary of information on selected chemicals.
   URL: http://www.epa.gov/opptintr/chemfact/index.html
   TOPICS:
   Human Health > Exposure
8. Human Exposure Database System (HEDS)
   TYPE: Database
   An integrated database system that contains chemical measurements, questionnaire responses, documents, and other information related to EPA research studies of the exposure of people to environmental contaminants.
   URL: http://www.epa.gov/heds/
   TOPICS:
   Human Health > Exposure
9. Emergency Response Program: Exposure Pathways
   TYPE: Guide
   An exposure pathway refers to the way in which a person may come into contact with a hazardous substance.
   URL: http://www.epa.gov/superfund/programs/er/hazsubs/pathways.htm
   TOPICS:
   Human Health > Exposure > Exposure Route
The content type index is straightforward, consisting of these “types”:
Database
Fact Sheet
Guide
Report
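The hierarchical index phrases look like this:
Radiation And Radioactivity
Radiation And Radioactivity > Exposure
Radiation And Radioactivity > Exposure > Exposure Pathways
Human Health
Human Health > Exposure
Human Health > Exposure > Exposure Route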
The “>” character is used to indicate separation of levels. It is important to note that at each level, the higher levels are required to accurately describe the current index phrase. “Exposure Pathways” is somewhat meaningful alone, but is much more meaningful as a child of “Radiation And Radioactivity > Exposure.” Also, “Exposure” is meaningful in two contexts: as a child of “Radiation And Radioactivity” and of “Human Health.”
There are several interesting concepts that can be derived from this content and classification:
This last point is particularly interesting in that it makes it possible to create a navigation tool that limits the portions of the topical index that are exposed to only those describing content of a certain type. We can list the inferred types by topic:
The ancestors of each index phrase carry the children’s content type relationship implicitly – “Human Health” is related to “Database”, “Fact Sheet” and “Guide.”
In this simplistic relationship, it becomes possible, for example, to limit the exposure of the topical index by the content type. So if a user wishes to navigate only through content that is a “Report,” not only can we restrict the content, but we can restrict the topical index, in this example to “Radiation And Radioactivity” and “Radiation And Radioactivity > Exposure.” The user won’t see anything under “Human Health” because there is no content of type “Report” located therein.
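The inference described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a four-item subset of the sample content, and the names `CONTENT`, `types_by_phrase` and `restrict_index` are hypothetical helpers, not part of any topic map engine.

```python
from collections import defaultdict

# Illustrative subset of the sample content (an assumption, not the full set):
# (title, content type, topical index paths under which the item is classified).
CONTENT = [
    ("IRIS: Integrated Risk Information System", "Database",
     ["Radiation And Radioactivity > Exposure", "Human Health > Exposure"]),
    ("Understanding Radiation: Exposure Pathways", "Guide",
     ["Radiation And Radioactivity > Exposure",
      "Radiation And Radioactivity > Exposure > Exposure Pathways"]),
    ("Draft Report on the Environment: Human Health: "
     "Environmental Pollution and Disease", "Report",
     ["Radiation And Radioactivity > Exposure"]),
    ("Emergency Response Program: Exposure Pathways", "Guide",
     ["Human Health > Exposure > Exposure Route"]),
]


def ancestors_and_self(path):
    """Yield every ancestor of an index phrase, then the phrase itself."""
    levels = [level.strip() for level in path.split(">")]
    for depth in range(1, len(levels) + 1):
        yield " > ".join(levels[:depth])


def types_by_phrase(content):
    """Map each index phrase to the content types found at or below it."""
    inferred = defaultdict(set)
    for _title, content_type, paths in content:
        for path in paths:
            for phrase in ancestors_and_self(path):
                inferred[phrase].add(content_type)
    return inferred


def restrict_index(content, content_type):
    """Return only the index phrases that lead to content of the given type."""
    inferred = types_by_phrase(content)
    return sorted(p for p, types in inferred.items() if content_type in types)


inferred = types_by_phrase(CONTENT)
print(sorted(inferred["Human Health"]))
print(restrict_index(CONTENT, "Report"))
```

With this subset, restricting the index to “Report” leaves only “Radiation And Radioactivity” and “Radiation And Radioactivity > Exposure”, which is the behavior described above.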
Next, we’ll examine the creation of a topic map for this sample.
We’ll start by creating an ontology. We will need topic classes for the contents, the topical index phrases, and the content types. We are keeping the content types in separate topics for this example because in a real-world implementation, it is likely that this class would be more complex.
We will also need classes for the associations (and the roles therein) between the levels of the topical index, between the index phrases (leaf nodes) and the content, and between the content types and the content. The associations between the topical index phrases will need to preserve the original ancestry while allowing free navigation.
As an aside and for reasons of brevity, we are not showing Subject Indicators (or PSIs) in this example. However, a well-formed, real world topic map would use them.
Here is the ontology:
<!-- Topic Classes -->
<topic id="TopicalIndexPhrase">
<baseName>
<baseNameString>
Topical Index Phrases
</baseNameString>
</baseName>
</topic>
<topic id="ContentType">
<baseName>
<baseNameString>
Content Types
</baseNameString>
</baseName>
</topic>
<topic id="Content">
<baseName>
<baseNameString>
Contents
</baseNameString>
</baseName>
</topic>
<!-- Association Classes -->
<topic id="TopicalIndex">
<baseName>
<baseNameString>
Topical Index
</baseNameString>
</baseName>
</topic>
<topic id="TopicalIndexContent">
<baseName>
<baseNameString>
Topical Index to Content
</baseNameString>
</baseName>
</topic>
<topic id="ContentTypeIndex">
<baseName>
<baseNameString>
Content Type to Content
</baseNameString>
</baseName>
</topic>
<!-- Association Member Roles -->
<topic id="TopicalIndexLeaf">
<baseName>
<baseNameString>
Topical Index Leaf
</baseNameString>
</baseName>
</topic>
<topic id="Superior">
<baseName>
<baseNameString>
Superior Phrase
</baseNameString>
</baseName>
</topic>
<topic id="Inferior">
<baseName>
<baseNameString>
Inferior Phrase
</baseNameString>
</baseName>
</topic>
<topic id="level1">
<baseName>
<baseNameString>
Top Level Topical Index Phrase
</baseNameString>
</baseName>
</topic>
<topic id="level2">
<baseName>
<baseNameString>
Second Level Topical Index Phrase
</baseNameString>
</baseName>
</topic>
<topic id="level3">
<baseName>
<baseNameString>
Third Level Topical Index Phrase
</baseNameString>
</baseName>
</topic>
<!-- Scopes -->
<topic id="Description">
<baseName>
<baseNameString>
Content Description
</baseNameString>
</baseName>
</topic>
From this ontology, we can create the topics and associations. Rather than listing them in their entirety, here are samples to demonstrate the points. First, the topical index topics are of the form:
<topic id="I1">
<instanceOf>
<topicRef xlink:href="#TopicalIndexPhrase"/>
</instanceOf>
<baseName><baseNameString>
Human Health
</baseNameString></baseName>
</topic>
The content type topics are of the form:
<topic id="guide">
<instanceOf>
<topicRef xlink:href="#ContentType"/>
</instanceOf>
<baseName><baseNameString>
Guide
</baseNameString></baseName>
</topic>
Finally we have the topics representing the content, only one of which is replicated here. Each of these topics has a full text description of the resource and an occurrence pointing to the instance of the content on the Internet.
<topic id="C1">
<instanceOf>
<topicRef xlink:href="#Content"/>
</instanceOf>
<baseName>
<baseNameString>
IRIS: Integrated Risk Information System
</baseNameString>
</baseName>
<baseName>
<scope>
<topicRef xlink:href="#Description"/>
</scope>
<baseNameString>
A database of human health effects that may result from
exposure to various substances found in the environment.
</baseNameString>
</baseName>
<occurrence>
<resourceRef xlink:href="http://www.epa.gov/iris/index.html"/>
</occurrence>
</topic>
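As a rough illustration of how an application might read such a content topic, the following Python sketch parses the fragment with the standard library. The XTM 1.0 and XLink namespace declarations are added here so that the fragment parses on its own, and the `summarize` function is a hypothetical helper rather than part of the OKS or Tractare APIs.

```python
import xml.etree.ElementTree as ET

NS = {"x": "http://www.topicmaps.org/xtm/1.0/"}
HREF = "{http://www.w3.org/1999/xlink}href"

# The content topic shown above, with namespace declarations added
# (an assumption) so that it is parseable as a standalone fragment.
SAMPLE = """
<topic id="C1" xmlns="http://www.topicmaps.org/xtm/1.0/"
       xmlns:xlink="http://www.w3.org/1999/xlink">
  <instanceOf><topicRef xlink:href="#Content"/></instanceOf>
  <baseName>
    <baseNameString>IRIS: Integrated Risk Information System</baseNameString>
  </baseName>
  <baseName>
    <scope><topicRef xlink:href="#Description"/></scope>
    <baseNameString>
      A database of human health effects that may result from
      exposure to various substances found in the environment.
    </baseNameString>
  </baseName>
  <occurrence>
    <resourceRef xlink:href="http://www.epa.gov/iris/index.html"/>
  </occurrence>
</topic>
"""


def summarize(topic):
    """Collect the unscoped name, the scoped description and occurrence URLs."""
    summary = {"id": topic.get("id"), "title": None, "description": None}
    for base_name in topic.findall("x:baseName", NS):
        raw = base_name.findtext("x:baseNameString", default="", namespaces=NS)
        text = " ".join(raw.split())  # collapse line breaks and indentation
        scopes = [r.get(HREF) for r in base_name.findall("x:scope/x:topicRef", NS)]
        if "#Description" in scopes:
            summary["description"] = text
        elif not scopes:
            summary["title"] = text
    summary["urls"] = [
        r.get(HREF) for r in topic.findall("x:occurrence/x:resourceRef", NS)
    ]
    return summary


summary = summarize(ET.fromstring(SAMPLE.strip()))
print(summary["title"])
print(summary["urls"])
```

Using a scoped base name for the description keeps the short title and the longer description on the same topic, so a display layer can choose between them by scope.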
The associations between these topics permit the navigation capabilities to be implemented. First, we have associations between the topical index phrases. Here are some examples:
<association id="HA1">
<instanceOf>
<topicRef xlink:href="#TopicalIndex"/>
</instanceOf>
<member>
<roleSpec>
<topicRef xlink:href="#Superior"/>
</roleSpec>
<topicRef xlink:href="#I1"/>
</member>
<member>
<roleSpec>
<topicRef xlink:href="#Inferior"/>
</roleSpec>
<topicRef xlink:href="#I2"/>
</member>
</association>
<association id="HA2">
<instanceOf>
<topicRef xlink:href="#TopicalIndex"/>
</instanceOf>
<member>
<roleSpec>
<topicRef xlink:href="#Superior"/>
</roleSpec>
<topicRef xlink:href="#I2"/>
</member>
<member>
<roleSpec>
<topicRef xlink:href="#Inferior"/>
</roleSpec>
<topicRef xlink:href="#I3"/>
<topicRef xlink:href="#I5"/>
</member>
</association>
<association id="HA3">
<instanceOf>
<topicRef xlink:href="#TopicalIndex"/>
</instanceOf>
<member>
<roleSpec>
<topicRef xlink:href="#Superior"/>
</roleSpec>
<topicRef xlink:href="#I4"/>
</member>
<member>
<roleSpec>
<topicRef xlink:href="#Inferior"/>
</roleSpec>
<topicRef xlink:href="#I2"/>
</member>
</association>
It is important to note that the second association (HA2) connects one phrase with two children, even though those children were not in the same tree in the original topical index. By connecting them here, we allow serendipitous navigation through the tree. We are saying, in essence, that each node stands by itself (at least for this purpose). We will make the connection to the original ancestry from the content itself in later associations.
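To make the navigation concrete, here is a minimal Python sketch that reads the three associations above and lists the superior and inferior phrases of one node. It rests on several assumptions: the associations are rebuilt as XTM 1.0 text by the hypothetical helpers `member` and `association`, the name given for I3 (“Exposure Route”) is inferred from the sample rather than stated in the source, and `neighbours` is an illustrative function, not an engine API.

```python
import xml.etree.ElementTree as ET

NS = {"x": "http://www.topicmaps.org/xtm/1.0/"}
HREF = "{http://www.w3.org/1999/xlink}href"

# Display names for the index phrase topics. I3 is an assumption.
NAMES = {"I1": "Human Health", "I2": "Exposure", "I3": "Exposure Route",
         "I4": "Radiation And Radioactivity", "I5": "Exposure Pathways"}


def member(role, *players):
    """Build one <member> element as text (hypothetical helper)."""
    refs = "".join(f'<topicRef xlink:href="#{p}"/>' for p in players)
    return (f'<member><roleSpec><topicRef xlink:href="#{role}"/></roleSpec>'
            f'{refs}</member>')


def association(assoc_id, superior, *inferiors):
    """Build one TopicalIndex association as text (hypothetical helper)."""
    return (f'<association id="{assoc_id}">'
            '<instanceOf><topicRef xlink:href="#TopicalIndex"/></instanceOf>'
            f'{member("Superior", superior)}{member("Inferior", *inferiors)}'
            '</association>')


SAMPLE = ('<topicMap xmlns="http://www.topicmaps.org/xtm/1.0/" '
          'xmlns:xlink="http://www.w3.org/1999/xlink">'
          + association("HA1", "I1", "I2")
          + association("HA2", "I2", "I3", "I5")
          + association("HA3", "I4", "I2")
          + '</topicMap>')


def players(assoc, role):
    """Return the ids of the topics playing the given role in an association."""
    found = []
    for m in assoc.findall("x:member", NS):
        role_ref = m.find("x:roleSpec/x:topicRef", NS)
        if role_ref is not None and role_ref.get(HREF) == "#" + role:
            # Direct topicRef children of <member> are the role players.
            found += [r.get(HREF).lstrip("#") for r in m.findall("x:topicRef", NS)]
    return found


def neighbours(root, topic_id):
    """Collect the superior and inferior phrases of one index phrase."""
    superiors, inferiors = [], []
    for assoc in root.findall("x:association", NS):
        sups, infs = players(assoc, "Superior"), players(assoc, "Inferior")
        if topic_id in infs:
            superiors += sups
        if topic_id in sups:
            inferiors += infs
    return superiors, inferiors


sup, inf = neighbours(ET.fromstring(SAMPLE), "I2")
print("Superior:", [NAMES[t] for t in sup])
print("Inferior:", [NAMES[t] for t in inf])
```

Because each node is treated as standing by itself, “Exposure” reports both of its parents and both of its children, regardless of which original tree they came from.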
We also have associations connecting the content types to the content. Here is a self-explanatory example:
<association id="TC2">
<instanceOf>
<topicRef xlink:href="#ContentTypeIndex"/>
</instanceOf>
<member>
<roleSpec>
<topicRef xlink:href="#ContentType"/>
</roleSpec>
<topicRef xlink:href="#database"/>
</member>
<member>
<roleSpec>
<topicRef xlink:href="#Content"/>
</roleSpec>
<topicRef xlink:href="#C1"/>
<topicRef xlink:href="#C9"/>
</member>
</association>
Finally, we have associations between the “leaf” phrases of the topical index and the content. In these associations, we not only map the phrase to the content, but we also map the original ancestry of the topical index under which this content is classified. Here are two examples:
<association id="IC3">
<instanceOf>
<topicRef xlink:href="#TopicalIndexContent"/>
</instanceOf>
<member>
<roleSpec>
<topicRef xlink:href="#TopicalIndexLeaf"/>
</roleSpec>
<topicRef xlink:href="#I2"/>
</member>
<member>
<roleSpec>
<topicRef xlink:href="#Content"/>
</roleSpec>
<topicRef xlink:href="#C3"/>
</member>
<member>
<roleSpec>
<topicRef xlink:href="#level1"/>
</roleSpec>
<topicRef xlink:href="#I4"/>
</member>
</association>
<association id="IC4">
<instanceOf>
<topicRef xlink:href="#TopicalIndexContent"/>
</instanceOf>
<member>
<roleSpec>
<topicRef xlink:href="#TopicalIndexLeaf"/>
</roleSpec>
<topicRef xlink:href="#I5"/>
</member>
<member>
<roleSpec>
<topicRef xlink:href="#Content"/>
</roleSpec>
<topicRef xlink:href="#C3"/>
</member>
<member>
<roleSpec>
<topicRef xlink:href="#level2"/>
</roleSpec>
<topicRef xlink:href="#I2"/>
</member>
<member>
<roleSpec>
<topicRef xlink:href="#level1"/>
</roleSpec>
<topicRef xlink:href="#I4"/>
</member>
</association>
In the latter example, you can see that content C3 (“Understanding Radiation: Exposure Pathways”) is indexed under topical index phrase I5 (“Exposure Pathways”), with I2 and I4 forming the ancestry above it (“Exposure” and “Radiation And Radioactivity”). This replicates the original classification in its entirety over the content unit.
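The way the level roles recover the original index phrase can be sketched as follows. This is a simplified illustration: each association is reduced to a list of (role, player) pairs, and `index_path` is a hypothetical helper that assumes the ancestry roles are always named `level1`, `level2`, and so on.

```python
# Display names for the index phrase topics used by IC3 and IC4.
NAMES = {"I2": "Exposure",
         "I4": "Radiation And Radioactivity",
         "I5": "Exposure Pathways"}

# The two associations above, reduced to (role, player) pairs.
IC3 = [("TopicalIndexLeaf", "I2"), ("Content", "C3"), ("level1", "I4")]
IC4 = [("TopicalIndexLeaf", "I5"), ("Content", "C3"),
       ("level2", "I2"), ("level1", "I4")]


def index_path(members, names):
    """Rebuild the full index phrase from the level roles and the leaf."""
    # Sort ancestors by the number in the role name: level1, level2, ...
    levels = sorted(
        (int(role[len("level"):]), player)
        for role, player in members
        if role.startswith("level")
    )
    leaf = next(player for role, player in members if role == "TopicalIndexLeaf")
    path = [names[player] for _, player in levels] + [names[leaf]]
    return " > ".join(path)


print(index_path(IC3, NAMES))
print(index_path(IC4, NAMES))
```

Carrying the ancestry on the content association is what lets the phrase-to-phrase associations stay free of tree context while the original classification remains recoverable for each content unit.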
Having a complete topic map as we have described, we can now use it to create an application that allows our navigation serendipity while retaining our classification accuracy and permitting the navigation to be restricted by the type of content.
Our sample application offers a topical index browse function. If we click on a topical index entry, such as “Exposure”, we see the following topical index display:
Superior:
Radiation And Radioactivity
Human Health
Inferior:
Exposure Pathways
Exposure Route
So at any point, the user sees the grid or network in which the current topical index phrase is located.
Our sample application also has two search boxes in a browser frame. The first is a search for content type. The second is for topical index phrase (searched as a keyword). The results of a search are the topical index trees that satisfy the search criteria.
If we enter a content type search such as “Guide”, we will get this result:
Radiation And Radioactivity > Exposure
Understanding Radiation: Exposure Pathways
Radiation And Radioactivity > Exposure > Exposure Pathways
Effects of Radiation Type and Exposure Pathway
Understanding Radiation: Exposure Pathways
Human Health > Exposure > Exposure Route
Emergency Response Program: Exposure Pathways
If we enter a content type search such as “Report”, we will get this result:
Radiation And Radioactivity > Exposure
Draft Report on the Environment: Human Health: Environmental Pollution and Disease
If we enter a content type search such as “Fact Sheet”, we will get this result:
Radiation And Radioactivity > Exposure
Health Effects Notebook for Hazardous Air Pollutants
Human Health > Exposure
Chemicals in the Environment: OPPT Chemical Fact Sheets
Health Effects Notebook for Hazardous Air Pollutants
If we enter a content type search such as “Database”, we will get this result:
Radiation And Radioactivity > Exposure
IRIS: Integrated Risk Information System
Human Health > Exposure
Human Exposure Database System (HEDS)
IRIS: Integrated Risk Information System
If we further limit the search by a topical index phrase keyword, such as “Pathways”, with a content type of “Guide”, we will only see this result:
Radiation And Radioactivity > Exposure > Exposure Pathways
Effects of Radiation Type and Exposure Pathway
Understanding Radiation: Exposure Pathways
The same search with a content type of “Report” will yield no results.
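The search behavior shown in these results can be approximated with a short Python sketch. It is not the application's implementation: the `RECORDS` list is an illustrative subset flattened from the sample, `search` is a hypothetical function, and matching the keyword as a case-insensitive substring of the full index phrase is an assumption about how the keyword search works.

```python
# Illustrative subset: (topical index phrase, content title, content type).
RECORDS = [
    ("Radiation And Radioactivity > Exposure",
     "Understanding Radiation: Exposure Pathways", "Guide"),
    ("Radiation And Radioactivity > Exposure",
     "Draft Report on the Environment: Human Health: "
     "Environmental Pollution and Disease", "Report"),
    ("Radiation And Radioactivity > Exposure > Exposure Pathways",
     "Understanding Radiation: Exposure Pathways", "Guide"),
    ("Radiation And Radioactivity > Exposure > Exposure Pathways",
     "Effects of Radiation Type and Exposure Pathway", "Guide"),
    ("Human Health > Exposure > Exposure Route",
     "Emergency Response Program: Exposure Pathways", "Guide"),
]


def search(records, content_type, keyword=None):
    """Group matching titles under the index phrase that classifies them."""
    results = {}
    for phrase, title, record_type in records:
        if record_type != content_type:
            continue
        # Assumption: the keyword is a case-insensitive substring match
        # against the full index phrase, not against the content title.
        if keyword and keyword.lower() not in phrase.lower():
            continue
        results.setdefault(phrase, []).append(title)
    return {phrase: sorted(titles) for phrase, titles in results.items()}


for phrase, titles in search(RECORDS, "Guide", "Pathways").items():
    print(phrase)
    for title in titles:
        print("   ", title)

print(search(RECORDS, "Report", "Pathways"))
```

Note that the keyword is tested against the index phrase only, which is why the “Guide” titled “Emergency Response Program: Exposure Pathways” under “Exposure Route” does not appear in the restricted result.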
The use of Topic Maps offers unparalleled opportunities to develop new, better finding aids for content. Through the techniques demonstrated here, we can intermingle classification schemes and provide a richer, more accurate and faster experience for the user while retaining accurate informational knowledge of the classification and adding a serendipitous element that is missing from many online finding aids today.