In this work a TM-aware protocol following the REST paradigm is introduced. It enables clients to access TM backends over a network regardless of their platform and storage technology. To locate information precisely it adopts features of the upcoming TM query language, TMQL, but it also supports the bulk transfer of whole maps. In contrast to similar approaches it also supports update operations.
Keywords: Topic Maps; Web Services
During the past months it has become viable to build portal sites serving content from a Topic Map backend [Pepper04][BrainBank]. The obvious benefit of such setups is that much more versatile content can potentially be hosted. The portal software itself can remain rather generic [ArtUNIX03], interpreting TM content only as necessary (semantic upscaling). Naturally, much of the semantics is put into the topic map data (this is what TMs were invented for), leaving it up to the user (and/or the context) which aspects of the data are presented and how they are postprocessed: users may be exposed directly to a network of topic nodes that they can navigate freely; they may also be guided through a given sequence of nodes in which only specific information is shown to them (e.g. slide shows [BaZa03]).
More conservative architectures will host topic map content (we prefer the semantically richer term content over data) using mature technologies, such as relational databases, or - as long as the file sizes remain rather small - file systems. There, content can be stored in one of the available TM formats [XTM 1.1][LTM][AsTMa=]. Another alternative is to use XML databases and to retrieve pertinent content as XTM fragments.
The upside of most of these approaches is that they are inherently networked, whereby network protocols such as ODBC or NFS take care of the actual transport. Content stores and the portal frontend can then run on different nodes, allowing existing infrastructure to be reused. Naturally, these protocols have been designed to cope with SQL over tabular data or with file segments in a hierarchical file system. They may not cope well with retrieving topic map data.
From the above we derive some of the motivation for a TM-specific exchange protocol, one that provides clients with direct access to TM-structured information. The design was driven by the following objectives:
Apart from the architectural considerations, the protocol should also exhibit non-functional qualities. As always, speed is a major concern. It must be achieved by reducing the number of messages a client has to exchange with the server to accomplish typical tasks. This - naturally - has to be traded off against the complexity of individual messages. Realistically, the protocol should also allow servers to refuse cooperation if the access patterns violate their policy or endanger QoS levels (such as in DoS attacks).
The paper is organized as follows. First we provide a short survey of existing knowledge transfer protocols which can be used in a TM context. Then we introduce map spheres to illustrate how TMIP is experienced from a developer's point of view. Using that as motivation, we single out interesting message exchanges within TMIP and discuss the message formats involved. The paper closes with first performance measurements and a discussion of open issues.
One of the long-standing technologies is OKBC [OKBC], developed by KSL and SRI International in the late 1990s as a programming-language-neutral protocol to exchange knowledge between knowledge bases. It was intended to unify access to existing knowledge bases.
The protocol is based on the commonalities between all these systems, such as the concepts of classes, slots and individuals. With version 2.0 it moved to a rather object-oriented, abstract API. It is architected on top of TCP and is connection-oriented, whereby a client attaches itself to the knowledge base object and invokes operations on it. Consequently, it has a very rich interface with hundreds of methods and quite a few error conditions. The transport of these invocations is left open.
Early work on TM-specific interchange was undertaken by [XTMFrag]. It leaves protocol details open and only describes how arbitrary parts of a topic map can be represented using XML notation.
This notation is used as transfer syntax in [TMRAP], which specifies a protocol to actually access a TM store. While it reuses XTM as serialization for TM content, it introduces a number of additional XML formats, even for retrieval only. By itself it has only very limited ways to identify specific TM fragments, and it does not support updating. As it is biased towards portal applications, a lookup mechanism has been added experimentally to detect a portal carrying relevant information. One might argue that such meta information could be consistently hosted inside a special topic map itself. Although it claims to be RESTful, the specification lists a number of special functions. Their function names are then used as part of the request URL, clearly violating the REST paradigm.
A Java system tailored for Topic Maps is Shark [Schwo04]. It allows peers to share knowledge in TM form, whereby the system controls which information is extracted from a knowledge base. This topic map fragment is serialized into XTM, and KQML [KQML] is used for the actual interchange. The system is tailored for propagating knowledge in ad hoc or P2P networks.
[XPointerREST] provides interesting architectural insights. This work outlines how RESTful protocols can be developed to access semantic stores. It is not focused on Topic Map data alone and shows the limitations of using XPointer as an addressing scheme for server-side resources. The analysis also includes the use of RDF and Topic Map query languages.
One integral concept of TMIP is that of map spheres. They are abstractions of a TM store, that is, of a collection of maps (or other objects such as ontologies or queries). Every map sphere can hold a number of objects, each of which is addressed using a name from a hierarchical namespace. The root of this namespace is denoted as / and all maps in the namespace have a relative URL below that root, for example /web/ or /web/browsers/.
In the following code example (for simplicity in Perl) a map sphere is created:
my $ms = new TM::MapSphere (BaseURL => 'file:/var/maps/');
Applications can access topic maps through the API of the top-level map sphere, for example:
my $tm = $ms->tao ('/markup/xml/xpath/');
To store complete maps, applications may then use something along the following lines:
$ms->tao ('/internet/web/', $tm);
Apart from the bulk retrieval and store, the interface also offers a way to access components of maps. To have fine-grained control over this process, the client can use TMQL path expressions:
my $firefox = $ms->path ('/internet/web/firefox');
my @browsers = $ms->path ('/internet/web//browser');
my @dlbrowsers = $ms->path ('/internet/web//browser [ ./oc[ *download ] ]');
Using path expressions, the interface can also forward updates to a map:
$ms->path ('/internet/web/firefox/rd[* popularity]', '10%');
The protocol's task is to host functionality like the above by translating application invocations of these methods into HTTP requests. It uses the predefined methods GET, PUT, POST, DELETE and OPTIONS as appropriate, extending them where necessary with additional HTTP headers. The HTTP request is resolved by the server and, if that succeeds, the result is returned to the calling application. Errors have to be reported back as well.
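As a rough illustration of this translation, the following Python sketch builds (but does not send) the HTTP requests that calls such as tao and path might be mapped to. The class name MapSphereClient, its signatures, the percent-encoding of the path expression and the rule that a call carrying a value becomes a PUT are assumptions made for this sketch, not part of the protocol as described here.

```python
from urllib.parse import quote
from urllib.request import Request
from xml.sax.saxutils import escape


class MapSphereClient:
    """Hypothetical client-side proxy for a remote map sphere."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def _url(self, path):
        # Keep "/" literal; percent-encode everything else that is unsafe.
        return self.base_url + quote(path, safe="/")

    def tao(self, path, map_text=None, media_type="application/xtm+xml"):
        """Bulk access: GET a whole map, or PUT one if map_text is given."""
        if map_text is None:
            return Request(self._url(path), method="GET",
                           headers={"Accept": media_type})
        return Request(self._url(path), data=map_text.encode("utf-8"),
                       method="PUT", headers={"Content-Type": media_type})

    def path(self, expression, value=None):
        """Fragment access: GET a path expression, or PUT a new string value."""
        if value is None:
            return Request(self._url(expression), method="GET",
                           headers={"Accept": "text/xml"})
        # Simplified tuple sequence holding one tuple with one string.
        body = "<seq><t><s>{}</s></t></seq>".format(escape(value))
        return Request(self._url(expression), data=body.encode("utf-8"),
                       method="PUT",
                       headers={"Content-Type":
                                "application/x-tmql-sequence+xml"})


client = MapSphereClient("http://server1.farm.example.org")
for request in (client.tao("/markup/xml/xpath/"),
                client.path("/internet/web/firefox/rd[* popularity]", "10%")):
    print(request.get_method(), request.full_url)
```

A real client would additionally send the request and turn error status codes into exceptions or error values for the calling application.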
To download a whole map addressed via /markup/xml/xpath/ from the server an HTTP request may look like this:
GET /markup/xml/xpath/ HTTP/1.1
Host: server1.farm.example.org
Accept: application/xtm+xml
Should the map not exist at the given location, the server will return a message with the status code 404 (Not Found). Otherwise it will respond with
HTTP/1.1 200 OK
Server: TMIP server v0.3
Content-Type: application/xtm+xml

<?xml version="1.0"?>
<topicMap.....
Exchanging maps in XML form allows for platform-independent exchange but comes at a rather high performance cost. Using HTTP content negotiation, clients can also request alternative formats. A Perl client, for instance, may request
GET /markup/xml/xpath/ HTTP/1.1
Host: server1.farm.example.org
Accept: application/x-storable, application/xtm+xml
To figure out which formats the server supports for a given map, we can use the HTTP method OPTIONS:
OPTIONS /markup/xml/xpath/ HTTP/1.1
Host: server1.farm.example.org
HTTP/1.1 200 OK
Server: TMIP server v0.3
Allow: GET, PUT, OPTIONS
Accepts: application/x-storable, text/x-ltm, application/xtm+xml
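A client might use such a response to pick the cheapest format that both sides understand. The following Python sketch parses the advertised format list and selects the first entry of a client-side preference list that the server offers. The function names and the preference order are assumptions made for illustration only.

```python
def parse_media_types(header_value):
    """Split a comma-separated media type list, dropping any parameters."""
    types = []
    for item in header_value.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type:
            types.append(media_type)
    return types


def choose_format(advertised_header, client_preferences):
    """Return the first client preference the server advertises, else None."""
    advertised = set(parse_media_types(advertised_header))
    for candidate in client_preferences:
        if candidate.lower() in advertised:
            return candidate
    return None


advertised = "application/x-storable, text/x-ltm, application/xtm+xml"
# A client that cannot read the Perl-specific Storable format lists only
# the textual formats, most compact first.
print(choose_format(advertised, ["text/x-ltm", "application/xtm+xml"]))
```

If no common format exists, the function returns None and the client would have to give up or fall back to a format it can convert locally.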
To store a complete map under a particular location, it is natural to use the HTTP method PUT (here we use LTM as map format):
PUT /internet/web/ HTTP/1.1
Host: server1.farm.example.org
Content-Type: text/x-ltm

[ larsbot ]....
HTTP/1.1 201 Created
Server: TMIP server v0.3
Clearly, there could be many more reasons why such an import would fail. One of them is that the map information sent cannot be properly interpreted. In such a case the status code would be 415 (Unsupported Media Type). The same error would be returned if the client sent a map in a text format not understood by the server:
PUT /internet/web/
Content-Type: text/x-my-not-implemented-format
More complex is the handling of map fragments. To retrieve a particular topic, the URL contains a simple path expression:
GET /internet/web/firefox HTTP/1.1
Host: server1.farm.example.org
Accept: application/x-storable
If the platforms differ, though, the query result will have to be serialized into XML again. Here it is useful to understand that every TMQL path expression - when applied to a map - will return a sequence of tuples of items. To illustrate this, let us consider a query for all browser names:
GET /internet/web//browser/bn`s HTTP/1.1
Host: server1.farm.example.org
Accept: text/xml
HTTP/1.1 200 OK
Server: TMIP server v0.3
Content-Type: application/x-tmql-sequence+xml

<seq xmlns="http://astma.it.bond.edu.au/ns/tmip/1.0/ts/">
  <t><s>Firefox</s></t>
  <t><s>Mozilla Firefox</s></t>
  <t><s>w3m</s></t>
  ....
</seq>
The response contains a tuple sequence (<seq>) in the message body. Each of the tuples (<t>) in turn contains only one component, a string (<s>), carrying the textual value of the topic name. Note that, since a topic can have any number of names attached to it (possibly in different scopes), one would expect to see an entry for each of these names in a separate tuple within the sequence, as is in fact the case with Firefox above.
The XML notation is itself defined as part of the protocol. With it, simple content like strings, identifiers and topic map fragments (serialized as XTM) can be organized into individual tuples, and these in turn into sequences. Since we did not ask for any ordering, the sequence itself is unordered.
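On the client side, such a message body has to be turned back into an in-memory tuple sequence. A minimal Python sketch of that step is given below; it relies only on the element names and the namespace visible in the example response, and the function name parse_tuple_sequence is hypothetical.

```python
import xml.etree.ElementTree as ET

TS_NS = "http://astma.it.bond.edu.au/ns/tmip/1.0/ts/"


def parse_tuple_sequence(xml_text):
    """Turn a <seq> document into a list of tuples of component values."""
    root = ET.fromstring(xml_text)
    if root.tag != "{%s}seq" % TS_NS:
        raise ValueError("not a tuple sequence: %r" % root.tag)
    result = []
    for t in root.findall("{%s}t" % TS_NS):
        components = []
        for child in t:
            local_name = child.tag.rsplit("}", 1)[-1]
            if local_name in ("s", "id"):
                # Strings and identifiers are kept as plain text.
                components.append(child.text or "")
            else:
                # Anything else (e.g. an embedded XTM item) stays an element.
                components.append(child)
        result.append(tuple(components))
    return result


body = """<seq xmlns="http://astma.it.bond.edu.au/ns/tmip/1.0/ts/">
  <t><s>Firefox</s></t>
  <t><s>Mozilla Firefox</s></t>
  <t><s>w3m</s></t>
</seq>"""
print(parse_tuple_sequence(body))
# prints [('Firefox',), ('Mozilla Firefox',), ('w3m',)]
```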
As queries can also be issued within a particular scope context, that context can be forwarded to the server via an additional HTTP header:
GET /internet/web//browser/bn`s HTTP/1.1
Host: server1.farm.example.org
X-TMIP-Accept-Scope: uc, en, de, *
Accept: text/xml
At the end of this list, the wildcard * signals that any other scope can be used if no topic characteristic exists in the scopes before. Ignoring how that scoping context is used on the server-side, an important consequence for the result is that only one name, that in the appropriate scope, is now selected from the list of name items.
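How a server applies the scoping context is left open here. One conceivable reading, given purely as an assumption, is an ordered preference list in which the first scope that has a matching name wins and the wildcard accepts any remaining name. The Python sketch below implements that reading; the function name and the sample scopes attached to the names are invented for illustration.

```python
def select_name(names, accepted_scopes):
    """Pick one name according to an ordered scope preference list.

    names: list of (scope, value) pairs for one topic.
    accepted_scopes: ordered list such as ["uc", "en", "de", "*"].
    """
    for scope in accepted_scopes:
        if scope == "*":
            # Wildcard: fall back to any name at all, if one exists.
            return names[0][1] if names else None
        for name_scope, value in names:
            if name_scope == scope:
                return value
    return None


# Sample data; the scopes attached to these names are made up.
firefox_names = [("long", "Mozilla Firefox"), ("en", "Firefox")]
print(select_name(firefox_names, ["uc", "en", "de", "*"]))
# prints Firefox
```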
In the above examples we have always requested string representations of topic characteristics; for this we used the stringification postfix `s. If a client is not interested in the string values but needs the whole items (including scope and type information), it omits the postfix:
GET /internet/web//browser/bn HTTP/1.1 ....
<seq ordered="no" xmlns="http://astma.it.bond.edu.au/ns/tmip/1.0/ts/"
     xmlns:xtm="http://www.topicmaps.org/xtm/1.1/"
     xmlns:xlink="http://www.w3.org/1999/xlink">
  <t>
    <i>
      <xtm:association>
        <xtm:instanceOf><xtm:topicRef xlink:href="#has-basename"/></xtm:instanceOf>
        <xtm:member>
          <xtm:roleSpec><xtm:topicRef xlink:href="#basename"/></xtm:roleSpec>
          <xtm:topicRef xlink:href="#x-string-436326"/>
        </xtm:member>
        <xtm:member>
          <xtm:roleSpec><xtm:topicRef xlink:href="#topic"/></xtm:roleSpec>
          <xtm:topicRef xlink:href="#firefox"/>
        </xtm:member>
      </xtm:association>
    </i>
  </t>
  ....
</seq>
The whole item is now embedded in the tuple as an association. Only structural information is sent: no unsolicited content is added, and not even the string within the characteristic is sent. Otherwise the message has exactly the same structure as before.
This is also true if we ask for all involvements of a certain topic in particular associations:
GET /internet/web/firefox->software[*runs-on-platform] HTTP/1.1 ...
In general, path expressions can return tuples with more than one component. As an example, we consider the request:
GET /internet/web//browser < ./bn`s, ./oc[ *download ]`s > HTTP/1.1
Host: server1.farm.example.org
X-TMIP-Accept-Scope: uc, en, de, *
Accept: text/xml
The response follows the general pattern of a tuple sequence:
<seq>
  <t><s>Firefox</s><s>http://www.mozilla.org/products/firefox/download.html</s></t>
  <t><s>lynx</s><s>http://lynx.isc.org/release/</s></t>
  ....
</seq>
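Path expressions like the one in this request contain spaces, angle brackets and backticks, none of which may appear literally in an HTTP request line. How TMIP encodes them is not stated here; a client would presumably percent-encode the expression before sending it and the server would decode it again. A minimal Python sketch of that round trip, with hypothetical function names:

```python
from urllib.parse import quote, unquote


def encode_path_expression(expression):
    """Percent-encode a TMQL path expression for use as a request target."""
    # "/" stays literal so that the hierarchical map name remains readable.
    return quote(expression, safe="/")


def decode_path_expression(target):
    """Server-side inverse of encode_path_expression."""
    return unquote(target)


expr = "/internet/web//browser < ./bn`s, ./oc[ *download ]`s >"
encoded = encode_path_expression(expr)
print(encoded)
# The round trip restores the original expression unchanged.
print(decode_path_expression(encoded) == expr)
```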
So far, a client can retrieve mostly structural information. At some stage it may need topic names to present information to a human user. The protocol allows topic names to be bulk-loaded for a list of topic identifiers in a specific scope context:
GET /internet/web/ HTTP/1.1
Host: server1.farm.example.org
X-TMIP-Accept-Scope: uc, en, de, *
Accept: text/xml
Content-Type: application/x-tmql-sequence+xml

<seq>
  <t><id>firefox</id></t>
  <t><id>w3m</id></t>
</seq>

<seq>
  <t><id>firefox</id><s>Firefox</s></t>
  <t><id>w3m</id><s>w3m</s></t>
</seq>
As names are usually quite static, applications may choose to manage them separately from the structural information. This also minimizes traffic if users switch the scope frequently. Experience has shown that responsiveness improves significantly if this information is cached on the client side.
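Such client-side caching can be sketched in Python as follows. The class NameCache and the injected fetch_names callback are hypothetical; in a real client the callback would issue the bulk request for names, whereas here it is replaced by a stub so that the caching logic can be seen in isolation.

```python
class NameCache:
    """Client-side cache of topic names keyed by (scope context, topic id)."""

    def __init__(self, fetch_names):
        # fetch_names(ids, scopes) -> dict mapping topic id to name.
        self._fetch = fetch_names
        self._names = {}

    def lookup(self, topic_ids, scopes):
        scope_key = tuple(scopes)
        missing = [t for t in topic_ids if (scope_key, t) not in self._names]
        if missing:
            # One bulk request for everything that is not cached yet.
            for topic_id, name in self._fetch(missing, scopes).items():
                self._names[(scope_key, topic_id)] = name
        return {t: self._names.get((scope_key, t)) for t in topic_ids}


requests_made = []


def stub_fetch(ids, scopes):
    # Stand-in for the network round trip; records what was asked for.
    requests_made.append(list(ids))
    return {topic_id: topic_id.capitalize() for topic_id in ids}


cache = NameCache(stub_fetch)
cache.lookup(["firefox", "w3m"], ["uc", "en", "de", "*"])
cache.lookup(["firefox", "lynx"], ["uc", "en", "de", "*"])
print(requests_made)
# prints [['firefox', 'w3m'], ['lynx']]
```

Keying the cache by scope context as well as topic identifier means that switching back to a previously used scope costs no additional traffic.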
Updating server-side TM content could be done by first pinpointing involved topics (or associations) and then by applying the update on them. Instead we consistently choose to leverage our use of TMQL path expressions to precisely define particular components of a target map.
Consider again the download occurrence of the topic firefox we had mentioned before. Should this be updated to a new value, that can be sent within a tuple sequence to the server:
PUT /internet/web/firefox/oc[*download] HTTP/1.1
Host: server1.farm.example.org
Content-Type: application/x-tmql-sequence+xml

<seq>
  <t><s>http://www.spreadfirefox.com/</s></t>
</seq>
On the server side the tuple sequence is used to update the fragment addressed by the URL as follows: if the new value is a string (as in the example above), the occurrence value is set to this string. If the new value is a complete characteristic item, it replaces any existing one. In a similar way, we can use the method POST to add topic characteristic values to a map, or DELETE to remove some. If the data sent does not make sense to the server, it responds with an appropriate error code.
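A client has to serialize the new values into a tuple sequence before sending such an update. The Python sketch below does this for string components only. The function name is hypothetical, and attaching the namespace used in the earlier response example to the root element is an assumption, since the update example omits it.

```python
import xml.etree.ElementTree as ET

TS_NS = "http://astma.it.bond.edu.au/ns/tmip/1.0/ts/"


def build_tuple_sequence(tuples):
    """Serialize tuples of strings into a <seq> document (strings only)."""
    # Emit the tuple sequence namespace as the default namespace.
    ET.register_namespace("", TS_NS)
    seq = ET.Element("{%s}seq" % TS_NS)
    for components in tuples:
        t = ET.SubElement(seq, "{%s}t" % TS_NS)
        for value in components:
            s = ET.SubElement(t, "{%s}s" % TS_NS)
            s.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(seq, encoding="unicode")


print(build_tuple_sequence([("http://www.spreadfirefox.com/",)]))
```

The resulting string would form the body of the PUT (or POST) request, sent with the content type application/x-tmql-sequence+xml.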
The same mechanism can be used to extend a particular map with additional fragments, for example with a POST request:
POST /internet/web/ HTTP/1.1
Host: server1.farm.example.org
Content-Type: application/xtm+xml

<xtm:topic id="safari" xmlns:xtm="...">
  <!-- topic information here -->
</xtm:topic>
The TMQL queries so far were all path expressions, so that they can be easily embedded into a request URL. General TMQL queries can also generate XML content to simplify the integration into XML application servers.
The simplest way to do that is by using predefined queries. As an example we show a query HALO which - when applied to a specific topic - computes from the map the surroundings of that topic (all its occurrences, names, involvement in associations, etc.). It was especially designed for generic TM user interfaces.
Queries like these are then used as filters. First, a relevant part of a topic map is identified and then the filter is applied to this result:
GET /internet/web/firefox * HALO (occ=>[1,3]) HTTP/1.1
Host: server1.farm.example.org
Accept: text/xml
The predefined queries themselves are not relevant to the protocol per se. They merely show how flexible the REST approach can be. What is relevant, though, is that all these functions return XML content to the client. The structure of that content depends on the function.
TMQL can be used to organize query results not only into tuple sequences, but also directly into XML. This is not possible using TMQL path expressions as we have done so far; for this we have to switch into TMQL FLWR (flower) mode:
GET /internet/web/ HTTP/1.1
Host: server1.farm.example.org
Accept: text/xml
Content-Type: application/x-tmql

return
  <browsers>{
    for $browser in %_ // browser
    return
      <browser id="{ $browser }">{
        for $occ in $browser / oc
        return
          <url href="{ $occ }">{ $occ * }</url>
      }</browser>
  }</browsers>
Without going into any detail regarding the query language, the query expression above iterates over all browsers (%_ identifies the queried map) while wrapping all subresults into a <browsers> root element. For each browser we find all occurrences and iterate over those. For each occurrence we emit a bit of XML, inserting the occurrence string (the URL) and the occurrence's type (computed by $occ *).
While the client must be more skillful to formulate a full TMQL FLWR query, it can precisely control the XML structure it gets back from the server.
Of course, servers may flag their unwillingness to perform certain queries if these violate a local policy.
One plausible extension to the above is to view the server itself as a (virtual) topic map [BaVirt04]. This allows operational data (such as statistics, current resource limit settings or access control) to be accessed as topic map data and - given the necessary permissions - also to be modified.
For this purpose the protocol reserves a special URI subspace, /.meta/. The following request uses it to access some statistical information:
GET /.meta/inbound-messages-per-sec HTTP/1.0
The ontology for the server is currently in flux, but the protocol specification will define a minimum every implementation has to provide.
The analysis of the protocol is split into an architectural part, preliminary performance results and finally into some thoughts about security.
To benchmark the protocol (and not the server implementation) we have to identify particular use cases and have to reflect on the costs involved. To store, for instance, a topic map instance into a TM server, the client has to prepare an HTTP request. Here it will embed the map into the message body whereby several format options exist. Then the message will have to cross the wire.
On the server side, the infrastructure will have to do a service dispatch after analyzing the incoming request. Depending on the format, the server will have to deserialize (parse) the map from the message body. Any acknowledgement message will be sent after that.
For Fig. 1 we generated increasingly large maps in different formats (XTM, LTM, AsTMa= and a binary format). The amount of content (number of assertions involved) is used as a common factor for comparability. The overhead processing time is then computed from the timing, factoring out message preparation (on the client and on the server side) and server method dispatch times, as they are constant and effectively negligible. The overhead then only includes transfer time (on a local network) and the deserialization costs (on typical contemporary hardware). While the absolute numbers may vary with the implementation, it is obvious that deserialization is the predominant cost factor.
As expected, XTM performs worst and the binary format is fastest, while the other two formats occupy the middle range. To make the transfer costs more visible, Fig. 2 shows the absolute size (in KB) of the maps.
While LTM and AsTMa= seem to be quite economical, the binary format for maps quickly reaches unacceptable sizes, at least in current WAN environments. What the diagrams suggest is that more compact map encodings or notations pay off as long as their deserialization effort is bounded.
A second series of experiments analyzes the cost structure of GET requests using query expressions. While the advantage of TMIP is a very fine-grained control over the requested content, this comes at a significant price.
The costs of preparing the request, transferring it to the server and dispatching it there are negligible. The time-consuming steps are the parsing, analysis and pre-optimization of the query expression, the execution of the query on the named map and the serialization of the results. Finally, on the client side the XML-encoded tuple sequence has to be built up in memory for further processing. Fig. 3 shows these times over an increasing result size.
As expected, querying the backend is by far the dominant factor. One might justifiably argue that the TMQL processor used does not perform any optimization and merely implements the current TMQL draft in an orthodox way. Still, even with the improvements to be expected, querying times will remain the bottleneck for the time being and will pose the throughput-limiting factor of a TM server.
The first aspect of security is authentication. Since we are using HTTP as transport, we can use all the options associated with it, including adding TLS/SSL for certificate-based authentication. TLS can also be used for encryption if end-to-end privacy is an issue.
Access control can also be delegated completely to the HTTP infrastructure. Request URIs can be used for fine-grained control so that users can have access to complete maps or only to certain topics. It is difficult, though, to extend URI-based access control to include TMQL path expressions. This limits the granularity of access control.
We regard both aspects, authentication and access control, as outside the TMIP protocol specification. This does not apply to the protocol's resilience to denial-of-service attacks. For this, servers are allowed to deploy a local resource limitation policy. Such a policy would include limits on the use of CPU resources, the number of requests per second per client, etc.
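One element of such a policy, a limit on the number of requests per client within a time window, could look like the following Python sketch. It is a generic sliding-window limiter and not taken from any TMIP server; the class name, the parameters and the way a refusal would be signalled to the client are assumptions.

```python
import time
from collections import defaultdict, deque


class RequestRateLimiter:
    """Allow at most max_requests per client within a sliding time window."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock  # injectable so the policy can be tested
        self._history = defaultdict(deque)

    def allow(self, client_id):
        now = self.clock()
        timestamps = self._history[client_id]
        # Forget requests that have left the sliding window.
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False  # the server would refuse this request
        timestamps.append(now)
        return True


limiter = RequestRateLimiter(max_requests=2, window_seconds=1.0)
print([limiter.allow("client-a") for _ in range(3)])
# prints [True, True, False]
```

A limit on CPU use per query would need cooperation from the query processor and is not covered by this sketch.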
Arguably, the REST approach chosen for TMIP captures a considerable range of application scenarios. HTTP has proven to be sufficiently rich to cover most interactions; where necessary we added HTTP headers and reused existing MIME types. There was no need to introduce new status codes.
This minimality partly justifies why we regard a RESTful solution as superior to a SOAP-based protocol stack. While debatable, SOAP seems better suited to closely coupled, DCOM-style interaction between two parties. The fact that the interface has to be declared separately allows a very fine-grained information flow into remote methods and back. This is certainly not possible using REST. Interfaces using the latter architectural style also struggle somewhat with composition, especially with transactions.
What seems quite reassuring is that the chosen setup largely conforms to the one outlined in the WWWA document [WWWA]. Maps are treated as resources using a reproducible URI space, but individual topics can also be interpreted as individual resources. More generally, even adding a path expression to a request URI can be seen as addressing fragments of a map in a declarative way.
Different representations (in the WWWA sense) can be requested by the client as it sees fit. Servers can respond to these choices following the rules of HTTP negotiation [RFC2616]. A convenient side effect of using the HTTP method GET together with the TMQL path expression language is that all responses to these requests can be cached consistently in downstream cache chains. There are no special characters or additional cookie information (as would be necessary for state management) that would undermine caching efforts. All the chosen URIs are also context-free.
In this work a RESTful protocol between a TM client and a TM server has been introduced. Accordingly, it uses built-in HTTP mechanisms, such as its methods and additional HTTP headers, to control the modalities of the data exchange between the parties. A URL regime was defined to address maps (and other objects) on the server side.
To address fragments of topic maps efficiently, we used TMQL path expressions, as they will be available in the upcoming TM query language. This put us in a position not only to retrieve content, but also to update whole maps or fragments thereof. Content is transfer-encoded in XML if openness is an objective. For this purpose we have defined a single XML structure to convey tuple sequences between the parties. Apart from that we adopted XTM fragments to carry topic map information.
There are a number of open issues, though:
[ArtUNIX03] The Art of Unix Programming, Eric Steven Raymond; http://www.faqs.org/docs/artu/
[AsTMa=] AsTMa= Language Definition, Robert Barta, TechReport, Bond University http://astma.it.bond.edu.au/astma=-spec-xtm.dbk
[BaVirt04] Barta R., Virtual and Federated Topic Maps, XML 2004 Amsterdam, Conference Proceedings
[BaZa03] A Use-Case for Topic Maps, R. Barta, A. Zangerl; Proceedings of the International Conference on Information and Knowledge Engineering, IKE'03, Las Vegas, USA, Volume 1. CSREA Press 2003, ISBN 1-932415-07-6, 379-383 http://zope.it.bond.edu.au/research/borgtom/publications/ike03
[BrainBank] BrainBank Learning - A Strategy for Learning and Construction of Personal Topic Maps, Stian Lavick, XML 2004, Conference Proceedings
[Fielding02] Fielding R. Architectural Styles and the Design of Network-based Software Architectures, PhD., University of California, Irvine, 2002 http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm
[KQML] Specification of the KQML Agent Communication Language, Finin T., Weber J., et.al., 1993 http://www.cs.umbc.edu/kqml/papers/kqmlspec.pdf
[LTM] LTM, The Linear Topic Map Notation, Lars Marius Garshol, Ontopia A/S, http://www.ontopia.net/download/ltm.html
[OKBC] Open Knowledge Base Connectivity 2.0, Chaudhri V., Farquhar A., et.al., Technical Report, http://www-ksl-svc.stanford.edu:5915/doc/release/okbc/okbc-spec/okbc-2-0-3.pdf
[Pepper04] Towards Seamless Knowledge - Integrating Public Sector Portals, Steve Pepper, XML 2004, Conference Proceedings
[RFC2616] RFC 2616, R. Fielding, J. Gettys, H. Frystyk, et.al, http://www.faqs.org/rfcs/rfc2616.html
[Schwo04] Shark - a System for Management, Synchronization and Exchange of Knowledge in Mobile User Groups, Schwotzer T., Geihs K., TU Berlin, Technical Report, http://ivs.tu-berlin.de/~thsc/Shark_IKnow.pdf
[TMQL] Topic Maps Query Language, Working Draft, Lars Marius Garshol, Robert Barta, ISO/IEC JTC1/SC34, http://www.isotopicmaps.org/tmql/spec.html
[TMRAP] Topic Maps Remote Access Protocol v0.2, Graham Moore, 2004, http://www.jtc1sc34.org/repository/0507.htm
[WWWA] Architecture of the World Wide Web, Volume One, Ian Jacobs, Norman Walsh, http://www.w3.org/TR/webarch/
[XPointerREST] Scalable, document-centric addressing of semantic stores using the XPointer Framework and the REST architectural style; B. Thompson, G. Moore et.al. Extreme Markup Conference 2004 http://www.cognitiveweb.org/publications/server-side-xpointer-extreme-markup-2004.html
[XTM 1.1] Topic Maps - XML Syntax, Lars Marius Garshol, Graham Moore, JTC1 / SC34 http://www.isotopicmaps.org/sam/sam-xtm/
[XTMFrag] XTM Fragment Interchange, Lars Marius Garshol http://www.ontopia.net/topicmaps/materials/xtm-fragments.html