XTech 2005: XML, the Web and beyond.
The ISO Topic Maps standard defines an abstract model for representing many different types of information structure, and an XML-based interchange syntax for that model. With the standard coming up on its fifth birthday, this article surveys what the standard defines and what open source tools are available for working with topic maps, and discusses potential application areas that could be of interest to the open-source development community.
The Topic Maps standard was first published by ISO at the end of 1999, with the current, second edition published in 2002 [ISO13250]. The standard comes from the same ISO committee that brought both SGML (precursor to XML) and HyTime (precursor to XLink) into the world. The original goal of the creators of the standard was fairly limited in scope. They simply wanted to create a structure that enabled document indexes to be automatically merged. Just using SGML was not enough - there was a need to be able to identify indexed subjects, to describe see-also relationships and to be able to link index entries to the subject occurrences in the text. It was also necessary for the standard to define some way to combine multiple indexes automatically.
It quickly became apparent that with a little generalisation, the standard could be used for more than just indexes. From this, the first version of ISO 13250 was born. The first version of the standard showed its SGML/HyTime roots in that the standard interchange syntax was described as a HyTime architecture (an SGML meta-DTD) and made use of the HyTime linking mechanisms.
By the time the standard reached publication, however, it was clear that the success of the standard would depend on an XML/XLink version of the interchange syntax, and so a working group was formed outside ISO to get the job done. The result was the XML Topic Maps standard (http://www.topicmaps.org/xtm/1.0/) produced by the TopicMaps.Org group. The second edition of ISO 13250 incorporates the XML DTD for XTM 1.0 as part of the standard, raising the status of XTM to an international standard interchange syntax for topic maps. Over the years, several other authoring and interchange syntaxes have been proposed, each with its own merits and weaknesses, although it is likely that only the HyTime and XML-based interchange syntaxes will remain as standard.
At the time of writing, the ISO committee is in the process of formalising the underlying topic maps data model. Any developer knows that you can get a long way without a formal data model (!), but the committee felt a formal data model was required as a basis for the work in defining a query language and constraint language for topic maps. The data model is quite well developed now and proposals for the query and constraint languages already exist both in principle and in implementation.
For a software developer, probably the most interesting way to consider topic maps is to look at them as a data model. We are used to working with a variety of data models from UML to relational database schemas to class and source module hierarchies to the DOM - much of what a developer does is manipulation of some form of data model.
Alexander Johannesen recently published a presentation he gave to the National Library of Australia that fleshes out this view of topic maps as a data model with more detail and elegance than this author can manage [Johannesen04]. What follows is a brief overview of the key features of the topic maps data model for those not already familiar with it.
Subjects are the things that a topic map describes. A subject can be any thing at all in any possible world. The topic maps standard does not define any core set of "things" or attempt to restrict what are allowable subjects in any way - it is completely up to the application.
A subject can be identified in one of three ways: by a subject locator, by a subject indicator or by a name.
In general, a subject locator is a more robust identifier of a subject than a subject indicator, and a subject indicator is more robust than a name. Within OASIS, a technical committee has produced a specification for a process of defining Published Subject Indicators, which aim to make subject indicators as robust as subject locators by specifying minimum content, meta-data and management policies for subject indicator resources [PUBSUBJ].
The topic is the topic map representation of a subject. The topic itself says nothing about the subject; it simply acts as a point to which information about a subject can be attached.
One property that a topic may have is one or more types. The type(s) of a topic define the class(es) to which the subject belongs. These classes are themselves subjects and so types are represented in a topic map using topics. This feature makes the topic map standard independent of any ontology and allows a high degree of self-description.
Topics can also have any number of names attached to them. Multiple names allow for different languages and for names specific to a particular audience to be used (e.g. the Latin name vs. the common name for a plant or animal).
Occurrences are resources that provide information about a subject. This can include simple properties of the subject (e.g. a count of the number of methods in a class) or can be pointers to external resources (e.g. the address of the file that implements the class in a source repository).
Associations define the relationships between topics. The topic map standard allows associations to have any number of participating members. Associations can be typed (using topics) and the role played by each participating member can be defined by a topic.
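As a rough illustration of how topics, types, names, occurrences and associations fit together, the following is a minimal sketch in Python. It is not the API of any topic map engine: the class names, field names and the example subjects (an `XmlParser` class implementing a `Parser` interface) are all hypothetical and chosen only to mirror the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """A topic stands in for one subject; everything else attaches to it."""
    topic_id: str
    types: set = field(default_factory=set)          # ids of the topics acting as types
    names: list = field(default_factory=list)        # plain name strings
    occurrences: list = field(default_factory=list)  # (occurrence type id, value or address)


@dataclass
class Association:
    """A typed relationship; each member plays a role that is itself a topic."""
    type_id: str
    members: list = field(default_factory=list)      # (role topic id, player topic id)


# Types, roles and occurrence types are ordinary topics (all names hypothetical).
topics = {t.topic_id: t for t in [
    Topic("class", names=["Class"]),
    Topic("interface", names=["Interface"]),
    Topic("method-count", names=["Method count"]),
    Topic("source-file", names=["Source file"]),
    Topic("implements", names=["Implements"]),
    Topic("implementer", names=["Implementer"]),
    Topic("implemented", names=["Implemented interface"]),
    Topic("XmlParser", types={"class"}, names=["XmlParser"],
          occurrences=[("method-count", "12"),
                       ("source-file", "src/XmlParser.java")]),
    Topic("Parser", types={"interface"}, names=["Parser"]),
]}

associations = [
    Association("implements", [("implementer", "XmlParser"),
                               ("implemented", "Parser")]),
]


def players(assoc_type, role):
    """Return the ids of topics playing `role` in associations of `assoc_type`."""
    return [player
            for assoc in associations if assoc.type_id == assoc_type
            for member_role, player in assoc.members if member_role == role]
```

Because types, roles and occurrence types are looked up in the same `topics` dictionary as everything else, the sketch describes its own vocabulary, which is the self-description property mentioned above.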
The topic map standard provides a simple mechanism for dealing with the issue of context. Sometimes a piece of information can be considered to apply to a subject only in a particular context. Names are a good example - the name of a subject could be contextualised by the language of the name. Scope allows names, occurrences and associations to be contextualised by using a set of topics to describe the context in which the name, occurrence or association should be considered to be valid. Although this fairly loose definition of context has some shortcomings, it is undoubtedly a useful tool.
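One way to picture scope is as a set of topic identifiers attached to each name. The sketch below is illustrative only: the `ScopedName` class, the `names_in_context` function and the matching rule (a name applies when all of its scoping topics are in the requested context) are assumptions made for the example, not behaviour prescribed by the standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedName:
    value: str
    # Ids of the topics describing the context; an empty scope is unconstrained.
    scope: frozenset = frozenset()


def names_in_context(names, context):
    """Return the names valid in `context` (a set of topic ids).

    Assumption for this sketch: a name applies when every topic in its scope
    is part of the requested context, so unscoped names always apply.
    """
    context = frozenset(context)
    return [name.value for name in names if name.scope <= context]


# Hypothetical names for one plant topic, scoped by language topics.
daisy_names = [
    ScopedName("Bellis perennis", frozenset({"latin"})),
    ScopedName("daisy", frozenset({"english"})),
    ScopedName("Gänseblümchen", frozenset({"german"})),
]

english_only = names_in_context(daisy_names, {"english"})
```

The same filtering idea would apply to occurrences and associations, since all three can carry a scope.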
Two or more topic maps can be merged by a simple process described by the standard. The merging process makes use of the three types of subject identity. When two topics share the same subject identity, they are merged. The merging process replaces the merged topics with a single topic which has all of the names and occurrences of the merged topics and which replaces those topics wherever they are referenced as a type, as part of a scope or as part of an association. The merging principles and the use of well-defined subject identification mechanisms enable a modular approach to the creation of large knowledge bases using topic maps, and also allow the possibility of annotating one topic map with another.
As has already been mentioned, the topic maps data model does not define any set of valid topic types, association types or occurrence types. All of this information is in the realm of the application. New types can be introduced into a system simply by defining them. The merging features of topic maps also allow those new types to be specified in a modular fashion.
In some applications it will be necessary to code or script some behaviour for those new types. Even where such additional code is not provided, however, the rich abstract model of topic maps allows a topic map-aware application to differentiate between the subjects in the data (topics), the relationships between subjects (associations) and relationships to other data (occurrences). This makes it possible to create a usable presentation of information that cannot be processed in any other way.
In comparison, a relational database would require new tables to support a new class of entity, and queries would have to be rewritten to incorporate the new entity type and any new relationship types that arise. An XML file format would require changes to the schema. A binary file format change would require changes to the serialisation and deserialisation code. Using a data-driven architecture like topic maps does not eliminate all of the work, but does at least reduce it.
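The merging rule can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure as worded in the standard: identities are treated as one undifferentiated set of strings, the surviving topic arbitrarily keeps the id of the most recently processed topic, and the `Topic` class, `merge_topics` function and `example.org` identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    topic_id: str
    identities: set = field(default_factory=set)   # identifiers for the subject
    names: set = field(default_factory=set)
    occurrences: set = field(default_factory=set)


def merge_topics(topics):
    """Merge topics that share at least one subject identity.

    Returns (merged, replaced_by), where replaced_by maps every original
    topic id to the id of the topic that now represents it.
    """
    merged = []
    replaced_by = {}
    for topic in topics:
        combined = Topic(topic.topic_id, set(topic.identities),
                         set(topic.names), set(topic.occurrences))
        replaced_by[topic.topic_id] = topic.topic_id
        absorbed = True
        while absorbed:              # repeat: absorbing may add new identities
            absorbed = False
            for other in merged:
                if other.identities & combined.identities:
                    combined.identities |= other.identities
                    combined.names |= other.names
                    combined.occurrences |= other.occurrences
                    merged.remove(other)
                    # Re-point everything that referred to the absorbed topic.
                    for old_id in list(replaced_by):
                        if replaced_by[old_id] == other.topic_id:
                            replaced_by[old_id] = combined.topic_id
                    absorbed = True
                    break
        merged.append(combined)
    return merged, replaced_by


# Two hypothetical topic maps describing overlapping subjects.
map_a = [Topic("a-puccini", {"http://example.org/psi/puccini"}, {"Puccini"}),
         Topic("a-tosca", {"http://example.org/psi/tosca"}, {"Tosca"})]
map_b = [Topic("b-composer",
               {"http://example.org/psi/puccini", "http://example.org/people/gp"},
               {"Giacomo Puccini"},
               {"http://example.org/bio/puccini.html"})]

merged, replaced_by = merge_topics(map_a + map_b)

# References inside associations are rewritten to point at the surviving topics.
associations = [("composed-by", [("work", "a-tosca"), ("composer", "a-puccini")])]
rewritten = [(assoc_type, [(role, replaced_by[player]) for role, player in members])
             for assoc_type, members in associations]
```

The same `replaced_by` mapping would be used to rewrite type and scope references, which is what lets one topic map annotate another without either being edited.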
The development of topic map applications has made much use of existing open source tools. Both commercial and non-commercial topic map applications make much use of the commodity XML parsers, databases, data integration tools and application servers produced by the open-source community. The question now is what can topic maps provide back to that community and how can the community best make use of the topic map tools that exist.
Software development projects tend to generate large quantities of documentation - user documentation, developer documentation, API documentation, bug-fixes and work-arounds, HOWTOs and FAQs. Given the history of topic maps in the definition of mergeable indexes, it should not be a surprise to find that topic maps have something to offer the project that is drowning in documentation.
Multiple sources of documentation can be difficult to navigate. In larger projects documents are often contributed by different authors and cover different aspects of the software and may be located in many different repositories and web sites. For a user this can make it difficult to locate where in the documentation their particular problem might be covered. Of course, searchable documentation helps but often the newbie needs more guidance than a full-text search results list. Indexing the documents (in the traditional sense of identifying index terms) helps because it gathers together the subjects considered to be most important. Topic maps aid the indexing effort by allowing the indexes from separate documents to be combined into a single unified index of the documentation set. You can think of the indexing terms as simply identifying what the subjects of interest in the project are. Adding occurrences to point back to the sources of those index terms provides simple index functionality. However, by creating associations between the topics that document the relationship between subjects, and with the addition of explanatory text in topic occurrences, the topic map can be raised from a unified index to a highly navigable overview of the concepts used in the project.
In addition to the "static" world of documentation, a development project will involve all kinds of dynamic data, such as queries to a bug reporting or requirements tracking system; views of a source code repository; check-in messages and so on - as we will see later, the integration of multiple data sources can also be handled within the topic map data model.
As an example, the TM4J project includes a Developer's Guide created in Docbook XML and Javadoc API documentation. This project has created tools for the generation of a topic map from each of these resources. The Javadoc topic map is created by a Doclet and includes topics that represent the classes, interfaces and methods of the API. The Docbook topic map is generated using XSLT from indexentry elements in the Docbook source. The resulting topic map contains all of the information of the Javadoc with the addition of links to mentions of a class or method in the developer documentation. Both the doclet source code and the Docbook to XTM stylesheet can be found in the CVS repository for TM4J. In future we plan to add further topic maps to document frequently asked questions and sample source code.
Many projects are not self-contained, but instead combine several components and, of course, when we write applications that use those projects we may be using not one library but combining several into a single application. One failing of Javadoc as an API documentation format is that it results in HTML pages which are inherently non-extensible, so if library A has a method that returns an instance of an object defined in library B, I cannot simply click on the return value in the Javadoc for library A and get taken to the Javadoc for library B. An intermediate, mergeable format like a topic map offers the ability to do just this and to be completely open to extension simply by loading new data.
Java project dependencies can get very complex. Even a tool such as Maven in its current form does not have the smarts to deal with nested dependencies, and as projects mature and develop at different rates, managing dependencies becomes an onerous task. While simple documentation of project dependencies is not the answer to all open source software configuration woes, it is often a useful thing to have when trying to work out why an updated system has stopped building or working. The Maven POM is one attempt to do this, but it is a simple flat list with no facilities to handle complex dependencies and no way to specify dependencies other than simple library dependencies. For example, I may want to know which databases or application servers are supported by a particular version of an application, and maybe even which versions of those databases or application servers. The POM doesn't have the structure to represent this kind of information. Although a POM might be extensible through the use of XML namespaces, it is not mergeable, and when we have multiple dependencies on different projects developing at their own rate it would be extremely useful to be able to get a merged view of the entire dependency graph.
The topic map data model provides a useful high-level abstraction for the representation of many other forms of application data. It has been shown, for example, that data sources such as relational databases and content management systems can be viewed through the lens of the topic map data model. There are two common approaches to this: on-the-fly mapping of requests for topic map objects into queries against the underlying system, or batch mapping, where a topic map is generated or updated on a regular schedule or when the underlying database is modified.
Regardless of the method used to perform the mapping, the high-level abstraction that such a mapping provides gives developers an interesting way to think of application integration, where data providers simply map their information as a set of topics, occurrences and associations with some well-defined identifier schemes to enable the data entities they provide to be merged with other external data entities. A topic map-aware client could then make use of these data providers as if they were topic maps while being shielded from the details of accessing the underlying data source, applying the standard topic map merging rules to generate a unified view of the data sources.
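To make the idea of a merged dependency view concrete, here is a small Python sketch. It deliberately reduces each project's topic map to a dictionary of "depends on" associations keyed by a shared identifier; the identifier scheme (`name:version`), the project data and the function names are all hypothetical, and dependencies on databases or servers are treated the same way as library dependencies.

```python
def merge_dependency_maps(*maps):
    """Union several per-project dependency maps keyed by shared identifiers."""
    merged = {}
    for dependency_map in maps:
        for subject, dependencies in dependency_map.items():
            merged.setdefault(subject, set()).update(dependencies)
    return merged


def transitive_dependencies(graph, root):
    """Return everything `root` depends on, directly or through other projects."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for dependency in graph.get(node, ()):
            if dependency not in seen:
                seen.add(dependency)
                stack.append(dependency)
    return seen


# Each project publishes only what it knows about itself (hypothetical data).
from_app = {"app:1.0": {"lib-a:2.1", "postgresql:8.0"}}
from_lib_a = {"lib-a:2.1": {"lib-b:0.9"}}
from_lib_b = {"lib-b:0.9": {"xml-parser:1.3"}}
from_deployers = {"app:1.0": {"servlet-container:2.4"}}

graph = merge_dependency_maps(from_app, from_lib_a, from_lib_b, from_deployers)
everything = transitive_dependencies(graph, "app:1.0")
```

The point of the sketch is that no single source holds the whole graph; the nested view only appears once the separately maintained maps are merged on their shared identifiers.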
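The on-the-fly approach might look something like the following Python sketch, which answers a request for a topic by querying a relational table at that moment. The `bug` table, its columns, the `example.org` identifier scheme and the dictionary layout of the returned topic are all assumptions made for illustration; a real engine would expose its own topic objects.

```python
import sqlite3

# A stand-in for an existing bug tracker database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bug (id INTEGER PRIMARY KEY, title TEXT, component TEXT);
    INSERT INTO bug VALUES (1, 'Parser rejects empty scope', 'parser');
    INSERT INTO bug VALUES (2, 'Merge loses names', 'engine');
""")


def topic_for_bug(connection, bug_id):
    """Build a topic-shaped view of one bug row at request time."""
    row = connection.execute(
        "SELECT id, title, component FROM bug WHERE id = ?", (bug_id,)
    ).fetchone()
    if row is None:
        return None
    ident, title, component = row
    return {
        # A stable identifier lets this topic merge with topics from other sources.
        "subject_indicator": f"http://example.org/bugs/{ident}",
        "types": ["bug"],
        "names": [title],
        "associations": [
            ("affects", {"bug": f"http://example.org/bugs/{ident}",
                         "component": f"http://example.org/components/{component}"}),
        ],
    }


first_bug = topic_for_bug(conn, 1)
```

A batch mapping would run the same kind of query over every row on a schedule and write the resulting topics out, trading freshness for independence from the live database.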
Rather than mapping data in and out of the topic map data model, there is the option of using the topic map data model as the data model for an application itself. The principal advantages of this approach are:
A prime candidate for such a development would be a content management system / web portal. One open-source effort, ZTM, actually layers the topic map data model on top of Zope and then uses this model to construct highly navigable web sites with a simple topic-oriented interface for users and content producers. However, this author believes that there are strong arguments for building a content management system from the ground up based on an underlying topic map data model.
Other application areas also provide scope for the use of a topic map data model. The data managed by PIM applications could be better integrated using a topic map. Mind-mapping tools, bookmark management and genealogy tools are application areas where the handling of subjects and resources as well as the graph-like nature of topic maps provide a solid platform for application-specific ontologies to be developed. Some of these application areas already have de-jure or de-facto interchange standards, but the use of the topic map data model does not have to mean the replacement of those interchange standards - it should be possible to map such an interchange format into and out of a topic map data model.
The really innovative topic map applications are still out there waiting to be discovered, but what this innovation needs as a bootstrap is more open data. Open data is data that is web-accessible and machine-processable. Although XML provides a basic level of openness, building data interchange on a higher-level data model can improve interoperability. The challenge for all open source developers must be to consider how the data that their applications manage can be made more open.
XML-based syndication formats such as RSS and Atom and proposals for REST-style topic map exploration interfaces are providing new channels for distribution of XML and topic map data. With those channels in place, what excuse do we have to stop at HTML renditions of application data? More XML, XTM and RDF data will help to build a critical mass that enables new applications to be built that process and combine that data in ways that the data creator and publisher never thought possible.
This section presents a few of the open source libraries and tools available for creating and processing topic maps.
Software which processes and (optionally) persists instances of the topic maps data model is commonly referred to as an "engine". There are a number of open-source topic map engines available.
TM4J provides a selection of different persistence options and an implementation of a Prolog-like query language. tinyTIM is a light-weight, in-memory processor with a small JAR footprint. XTM4XMLDB provides a topic map engine that uses native XML databases such as eXist or Xindice as the persistent store. Shark is a topic map engine written for the J2ME environment. All of these engines support the newly developed TMAPI, a common API for topic map applications.
AsTMa provides a complete suite of topic map processing tools including implementations of query and constraint languages.
This is a small selection of some of the open-source applications that make use of topic maps.
xSiteable is a website development tool which combines the topic map data model with a simple XML syntax and XSLT processing to generate small to medium-sized websites. Using the topic map data model, xSiteable focuses the site author on creating pages for subjects and documenting the relationships between them.
TMNav is a desktop application for browsing topic map data. The application is capable of reading XTM syntax topic map files and displays them in a dynamic graph GUI.
TMTab is a plug-in for the Protégé ontology editor which allows the editor to export ontologies as XTM syntax topic map data. With a little configuration work, it is possible to create a very usable and fairly intuitive editor for a particular topic map ontology.
TM4Web and TMBrowse build upon TM4J to provide a web application for topic map-driven websites that uses the Apache Velocity templating language. TMBrowse is a reference implementation using the TM4Web framework. TM4Web can also be used in an off-line mode to generate a static website from topic map sources.
ZTM extends the Zope Content Management Framework with the topic maps data model. Under ZTM, topics are created as content items managed by Zope and occurrences and associations are created as properties of those content items. ZTM enables the use of Zope's extensive page editing facilities (including multi-user support and multiple levels of undo) to allow developers to create a customised, easy-to-use topic map editing environment.
The recent development of open-source topic map tools has greatly benefited from the infrastructure of parsers, libraries and application frameworks created by the open source community. After several years of development, these topic map tools are now ready to contribute something back to the community. Topic map tools can be used to better integrate documentation and data generated by open source projects - not only within a single project but also where projects make use of or depend on other projects.
The topic maps data model also provides a rich abstraction with several implementations available in a range of programming languages. Starting with one of these libraries could enable developers to focus on the high-level functionality of their application while still getting the benefits of integration with a standard interchange format and data-driven extensibility.
Finally, topic maps, like all semantic web technologies, are still waiting for a critical mass of accessible and processable data to be made available. Topic maps are just one route to this goal, and even if you don't decide to follow that path, I hope that for your next project (or your current one!) you might stop and think "How do I make this so that someone can create a topic map from the data?".
Kal Ahmed
Networked Planet Limited