Abstract
The Dutch Tax and Customs Administration website is aimed at supporting citizens in fulfilling their tax and customs obligations. In 2002, it was decided to implement an enhanced search service. A commercial topic map engine was used to centralize management and control of keywords. A closed-loop process was implemented, connecting site visitors entering search terms with the synonyms used by the search engine and the metadata statements added by authors. While none of the individual components is groundbreaking, the combination is useful and has already been used in other problem domains. The fact that the topic map paradigm is well suited to an evolutionary style of development proved to be an important asset. In this respect, the standardized merging facility was crucial.
The Dutch Tax and Customs Administration (DTCA) is responsible for collecting a wide array of taxes that affect more than 6 million Dutch citizens and 1.1 million companies. With more than 30,000 employees, it is one of the largest government organizations in the Netherlands.
In 2002 about 1 million visits to offices were made and 4.4 million phone calls were answered by call centers. The website www.belastingdienst.nl served about 9 million pages to 700,000 unique visitors (totals for March 2003). The website is used not only for serving documents but also for distributing tax return software - about 4 million tax returns were filed with the help of this software in 2002. In the near future, the website will be positioned as a transaction center for tax matters that are currently handled with paper forms. Further growth of the website is also seen as an instrument to minimize relatively expensive contacts by phone.
Although surveys showed that visitors were positive about the quality of the site, some areas for improvement were found. The most important problem was the lack of a search service to complement the browsing paths offered. The absence of search was not without reason. At any moment during the year, the site contains information about the current and past year. In addition, October marks the beginning of the preliminary refund campaign ("VT") which is about taxation in the next year. Differences between years can be tiny or major, but the cost of using information that is meant for another year is often high.
In the past, site visitors were shielded from making such mistakes by carefully crafted browsing paths. A search engine cuts right through these paths. To drive home this point, we used Google with the site:www.belastingdienst.nl parameter to demonstrate the effect of adding full-text search without further optimization: it would actually make things worse. This was unacceptable. Therefore, the decision was made to do a full-scale rollout of a search service. This paper is about the approach we took, and the key role XML and Topic Maps played.
Deploying software and hardware is obviously necessary to get a search service going. However, the crucial question is how to use these tools. We took some time to get comfortable developing a conceptual model. Next, we wondered how to avoid earlier implementation problems that appear tied to the problem space. This quest led us to choose a topic map system as a central repository in the architectural plans. Therefore, creating a suitable topic map schema became part of the project.
Broadly speaking, steps for enhancing the operation of a search service fall into two phases. The first phase relates to the search term which is entered by the user and the possible optimizations that the search engine performs. This optimization is about creating the conditions for a larger set of hits than would otherwise have been obtained. For instance, a search engine might add a plural form to a search term that was entered by a user.
The second phase is aimed at ordering the resultset by relevancy. As users are generally not looking beyond 20 hits, it is important to get the relevant documents onto the first two pages. In some cases, additional information must be added to documents (metadata statements) to augment the internal relevancy algorithms of the search engine.
Ideally, metadata statements should be fine-grained and diverse to accommodate all kinds of questions. In practice, monetary constraints limit the level of detail that can be reached. The aim is to balance the effort that is made by the site owner and the benefits for site visitors. A layered model is an instrument to determine what priorities need to be set and what investments yield the largest effect.
The first level consists of basic full-text search provided by a search engine. At this level, most search engines allow for grammatical tricks like stemming. After relatively easy setup and installation steps, the search engine functions more or less automatically and only needs systems management attention.
The second level deals with enriching the querystring entered by a user. Candidates are synonyms and possibly hierarchical relations. Synonyms play a key role in bridging the gap between the vocabulary of site visitors and the vocabulary of the documents that they are looking for. An example is the difference between the popular Dutch term "leaseauto" (company car) and the legal term "auto van de zaak". Synonyms are harvested by specialists and loaded into the search engine. Once they are loaded, the process runs fully automatically.
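As an illustration only, query enrichment with synonyms can be sketched in a few lines of Python. The table contents, the function name `expand_query` and the direction of the mapping (popular term to legal term) are assumptions made for this sketch; they do not describe how the search engine stores or applies its synonym list.

```python
# Hypothetical synonym table: a popular visitor term maps to the term used in the documents.
SYNONYMS = {
    "leaseauto": "auto van de zaak",
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the normalized query plus the mapped term, if one is known."""
    normalized = query.strip().lower()
    terms = [normalized]
    mapped = synonyms.get(normalized)
    if mapped is not None and mapped not in terms:
        terms.append(mapped)
    return terms
```

A search for "leaseauto" would then also be run against "auto van de zaak", while unknown terms pass through unchanged.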
The third level is about enriching (groups of) documents with metadata statements. This is the most labor-intensive strategy. To ensure consistency, authors should pick values from a controlled list. Whenever possible, metadata statements should be assigned to groups of documents, as this can be done with little effort. Tagging individual documents is a strategy of the last resort.
There is an art to describing documents with metadata in such a way that visitors are able to find what they want. Librarians have been performing this trick for ages, and it is therefore only fitting that website architects turn to approaches from the library sciences. Facetted classification is an example of a library science technique that was rediscovered in the context of web sites. Peter Morville [Morville, P. 2001] uses the example of an e-commerce wine seller that describes each bottle by several "facets" (year, price, region, color). Each facet contains values from an enumerated list (color: red, white, rose) or a hierarchy (region: Europe, France, Bordeaux).
The main advantage of facetted classification is the avoidance of deep hierarchies that sometimes characterize classical thesauri. This reduction of complexity does not lead to fewer possibilities for the end-user. By combining facets, powerful queries can be constructed: "give me all red wine from France that costs less than 5 Euro". This example highlights the way hierarchical information is used. The value "France" is not assigned to any bottle, but resolved by a query engine because of the hierarchical relationship to all French regions that are connected to bottles.
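The way a hierarchical facet value is resolved can be illustrated with a small Python sketch. The bottles, prices and region hierarchy below are invented sample data, and the function names are hypothetical; the point is only that "France" never appears on a bottle, yet the query still finds French wine by walking up the hierarchy.

```python
# Hypothetical region hierarchy, stored as child -> parent.
PARENT = {
    "Bordeaux": "France",
    "Bourgogne": "France",
    "Rioja": "Spain",
    "France": "Europe",
    "Spain": "Europe",
}

# Invented sample data: each bottle only carries the most specific region.
BOTTLES = [
    {"name": "A", "color": "red", "region": "Bordeaux", "price": 4.50},
    {"name": "B", "color": "white", "region": "Bordeaux", "price": 4.00},
    {"name": "C", "color": "red", "region": "Rioja", "price": 3.75},
    {"name": "D", "color": "red", "region": "Bourgogne", "price": 12.00},
]


def is_within(region, ancestor, parent=PARENT):
    """True if region equals ancestor or lies below it in the hierarchy."""
    while region is not None:
        if region == ancestor:
            return True
        region = parent.get(region)
    return False


def find_bottles(bottles, color, region, max_price):
    """Combine three facets: color, (hierarchical) region and price."""
    return [
        b["name"]
        for b in bottles
        if b["color"] == color
        and is_within(b["region"], region)
        and b["price"] < max_price
    ]
```

With this data, `find_bottles(BOTTLES, "red", "France", 5)` selects only bottle A.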
Morville's wine example works well for explaining what facetted classification looks like, but turns out to be a bit deceptive. Key to its simplicity is the fact that the units (the bottles) are homogeneous, all facets are relevant to all units, and the facets are clearly separated. In our case, the documents that are part of the website are less homogeneous, which implies that not all facets will be applicable to all documents. Also, a clear separation of facets was hard to find, which ultimately meant we had to trade in analytical clarity for practical applicability.
As it is expensive to change facets after the documents are processed, it is necessary to thoroughly analyze the information domain and the expected behavior of the site visitors to determine what facets are required. By turning this process into a collaborative effort in which all stakeholders participate, conflicts down the road can be avoided.
In our case the following facets were determined:
| Facet | Description | Example |
|---|---|---|
| Year | The year to which the information in the document applies | 2004 |
| Process | The steps in the fiscal process that site visitors are doing | filing returns, objecting to a decision |
| Law | The kind of tax law | Income tax, road tax |
| Event | Events in the life of Dutch citizens that have fiscal implications | Getting married, starting a company |
| Type | The kind of document | Navigation page, information page |
Table 1. Website facets
The law facet contains a hierarchy of law-related terms. The other facets are enumerated lists. Year, event and type were the easiest to discover. Process and law took longer, because we wrestled with occasional cross-pollution between the two (laws do sometimes describe processes). In the end, we felt that minimizing the process facet to a small number of well-known steps would benefit the search user, and we included process.
The aim is to use these metadata statements in conjunction with the basic search engine facilities. Each XHTML document is assigned zero or more metadata statements that look like <meta keyword="facet" value="value" />. The search engine is instructed to look for these descriptions and apply weighting rules. Some rules are applied to groups of documents (for instance, by type facet). Other rules match values. For instance, if the user's search term is found among the law facet values, then documents whose descriptions contain the law facet with a value equal to the search term gain weight in the relevancy ranking.
Allocating facet metadata statements is aimed at enhancing search. Without an interface that is intuitive and easy to use, these "behind the screen" investments would go to waste. It falls outside the scope of this paper to investigate all design decisions, but some steps that refer to the use of facets are relevant here.
The initial presentation of search results is flat. Each hit contains at least a title and a URL. When the (complete) set of results contains hits that refer to documents with facet descriptions, opportunities for enhancement arise. The obvious one is to combine the title and URL with the facet descriptions. The addition gives users a better insight into the context of the document that is referred to, and possibly avoids unnecessary detours to documents that do not matter.
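The value-matching rule can be sketched as follows. This is a minimal Python illustration, not the configuration of the search engine: the controlled list, the data layout (a dictionary from facet name to a set of lowercase values) and the boost factor of 2.0 are all assumptions.

```python
# Hypothetical controlled list for the law facet.
LAW_FACET_VALUES = {"income tax", "road tax"}


def boosted_score(base_score, doc_facets, search_term, boost=2.0):
    """Raise the score when the search term is a law facet value and the
    document carries that same value in its law facet."""
    term = search_term.strip().lower()
    if term in LAW_FACET_VALUES and term in doc_facets.get("law", set()):
        return base_score * boost
    return base_score
```

A document tagged with law "road tax" would gain weight for the query "road tax", while documents without that statement keep their full-text score.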
The more advanced features that can be built upon facet information are advanced search and refinement. Advanced search refers to the process of specifying a query - before the actual resultset is created. By specifying facet - value criteria, the size of the resultset might be reduced. Refinement takes place after the resultset is constructed. It allows the user to drill down within a resultset.
Ultimately, one can only make educated guesses about the kind of queries that the site visitors will make. Therefore, it is crucial to try to learn as much about site visitors' needs as possible. One way is to invite the site visitor to answer a questionnaire. Another way is to monitor actual behavior.
In the context of a search service, one way to monitor behavior is to log the keywords as they are entered in the search box. By replaying popular terms, keywords that do not lead to acceptable results can be found, and steps can be taken to improve the results.
Obviously, this manual process of testing keywords is expensive and only useful for the most popular keywords. Therefore, one should start with a small set - say a top-50 - and a weekly or monthly period. After a few iterations, the amount of work decreases and experiments with a larger set or a shorter period might be possible.
Closing the loop is important because it enhances the - necessarily indirect - communication between site producer and consumer. A continuous process needs to be designed: determine whether a popular keyword yields acceptable results, add new keywords to a centrally managed collection as synonym or facet values, republish the modified synonym list to the search engine and the author controlled vocabulary to the author tools, and test whether the situation improves. As this process is labor-intensive (specialist time), manual steps like calculating a top-50, checking what keywords are new, exporting the new synonym list and vocabulary should be automated as much as possible.
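Two of the manual steps named above, calculating a top-50 and checking which keywords are new, lend themselves to automation. The Python sketch below assumes that the search log is available as a plain list of entered terms and that the managed collection can be exported as a list of known terms; both function names are hypothetical.

```python
from collections import Counter


def top_terms(logged_queries, n=50):
    """Return the n most frequently entered search terms, most popular first."""
    counts = Counter(q.strip().lower() for q in logged_queries if q.strip())
    return [term for term, _ in counts.most_common(n)]


def new_terms(popular, known_terms):
    """Return the popular terms that are not yet in the managed collection."""
    known = {t.strip().lower() for t in known_terms}
    return [t for t in popular if t not in known]
```

Only the terms returned by `new_terms` need specialist attention in a given iteration, which is what makes the amount of work shrink over time.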
Initiatives to enhance search (and navigation) seem particularly susceptible to derailing. Quite often, a "big-bang" approach is taken with a project group spending months or years in stealth mode. In the meantime, the outside world changes. New, sometimes competing, initiatives start. Interim solutions work better than planned. New requirements are incompatible with the chosen approach. In the end, time and money are wasted.
Instead of waiting for a perfect solution, an evolutionary approach tries to deploy as soon as possible and build from there. This way, experience is gained and the overall complexity can be reduced. Of course, site visitors should not be turned into guinea pigs; the first version should bring some real benefits. An evolutionary approach is especially important for metadata assignment. A medium-sized website contains thousands of documents. In a big-bang approach all documents require full metadata descriptions. As specialist capacity is often scarce, this whole process could easily take months. In the meantime, new categories and values are invented (or required by outside changes). By the time all documents are described, the process starts all over again.
Earlier initiatives to upgrade our website were harmed by a big-bang approach. To prevent this from happening again, we developed a number of strategies:
design for extensibility
refinement versus advanced search
group assignment of metadata
An evolutionary approach places a premium on extensible systems. The topic map paradigm is well suited to driving extensible systems. By implementing the topic map standard, systems offer a number of mechanisms that help in handling diversity and growth in an evolutionary way:
merging
concept versus label
type definition
By using different topic maps, different communities are able to model their own world. The standardized procedure for merging topic maps makes it a bit easier to live with this diversity. Splitting up information domains lowers complexity and buys time, and therefore makes a project easier to manage, with a higher probability of success.
The separation of concept and label is central to the topic map standard. It turns out that many differences between terms are in fact only different labels with an identical underlying concept. This feature comes in handy in a setting where a struggle exists between writers ("does the reader understand this") and legal specialists ("is this a correct interpretation of the law").
In the topic map world, types are topics. This does not look like much, but is key in a corporate world where a single modification in a production system could take weeks (and cost lots of money). It means that generic applications can be built which are driven by the content itself. You need a new entity? Just define a new topic. Of course, this capability is hindered by the absence of an official schema language, but very useful even without one.
The fact that our site contains older information presented a challenge for migration. Retrofitting all content with new metadata statements would take quite a while. In the meantime, advanced search would not work as expected. Unless the majority of the documents on the site are retrofitted, site visitors are fooled into thinking that constructing an advanced query would matter. They would be mistaken and probably disappointed.
One way out of this dilemma is to offer options only when they have effect. In other words, offer opportunities for filtering the resulting set based on the characteristics of the resulting set itself. When no opportunities are present in the resulting set, no filtering opportunities are offered. This refinement strategy obviously takes place in the second phase. When a sufficient number of documents is processed, advanced search can be offered.
Realizing refinement is not easy, as it complicates life for the search engine quite a bit. Instead of just serving 10 or 20 hits, unique facet values for the whole resultset need to be determined to populate the filtering options. Once a user refines the resultset, the next and previous buttons need to take the chosen option into account and skip entries in the resultset.
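The extra work for the search engine can be made concrete with a short Python sketch. It assumes a resultset that is held as a list of hits, each with an optional dictionary of facet values; this data layout and the function names are illustrative and not taken from any search engine API.

```python
from collections import Counter


def refinement_options(resultset, facet):
    """Count the values of one facet over the whole resultset.
    An empty result means that no filter should be offered for this facet."""
    return Counter(
        hit["facets"][facet]
        for hit in resultset
        if facet in hit.get("facets", {})
    )


def refine(resultset, facet, value):
    """Keep only the hits that carry the chosen facet value."""
    return [hit for hit in resultset if hit.get("facets", {}).get(facet) == value]


def page(resultset, number, size=10):
    """Return one page of hits; paging over a refined set skips filtered entries."""
    start = (number - 1) * size
    return resultset[start:start + size]
```

Paging is applied to the refined list rather than the original one, which is how the next and previous buttons take the chosen option into account.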
The alternative is to distribute processing a bit and use JavaScript on the client. In this case the search engine delivers a single XHTML page with all hits included. Attached JavaScript code sets CSS (display) properties, determines filtering options and controls previous / next buttons. The initial loading of the page takes longer (19 KB with 100 hits), but once the page is loaded, filtering is done in memory. The disadvantage of using JavaScript on the client can be mitigated with a trick that was popularized by Jon Udell. Each hit contains semantic descriptions:
<li class="hit"><div class="title">This is the title</div> <div class="url"><a href="document.html">Go to document</a></div><span class="year">2003</span></li>
A small amount of JavaScript is required to process this information. Browsers with JavaScript turned off will show all hits without pagination and filtering options. Apart from the longer loading times and scrolling, the page degrades relatively gracefully.
This kind of solution obviously does not work for a global search engine - too many hits would create huge pages - but can be made to work in a medium-sized website where the number of hits does not generally exceed 200.
As our website contains a number of relatively distinct sections, writing content is often done in projects. A project follows a more or less distinct process that results in a set of XHTML documents (a subsite). With regard to metadata assignment, there is project-level metadata that is applicable to all resulting XHTML documents, group-level metadata that is applicable to a smaller set of XHTML documents, and document-level metadata that only applies to a single XHTML document within a project.
To minimize manual metadata assignment, a kind of 'trickling-down' mechanism would allow authors to assign metadata statements at the project or group level. These statements should apply to all documents within the project or group. Authors should also be able to prevent trickling-down for those statements that only apply to the containing level. For new projects, one would imagine that this kind of group-level metadata assignment is built into web content management systems. Unfortunately, this proved not to be the case. Because we also wanted to use this technique to enhance existing documents, we had to write a custom solution to append these descriptions to groups of documents. This application uses a site-wide XML document that relates groups of documents to metadata values that are applicable to all documents within the group. With little effort, a large number of documents can be prepared automatically.
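The trickling-down idea can be sketched in Python as follows. The format of the site-wide XML document is not given in this paper, so the element and attribute names below are invented, as is the choice to identify groups by path prefix and to let deeper groups override higher levels. Each facet holds a single value here for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical site-wide mapping; element and attribute names are illustrative only.
SITE_XML = """
<site>
  <group path="/particulier/auto/">
    <meta facet="law" value="road tax"/>
    <meta facet="year" value="2004"/>
  </group>
  <group path="/particulier/auto/archief/">
    <meta facet="year" value="2003"/>
  </group>
</site>
"""


def inherited_metadata(doc_path, site_xml=SITE_XML):
    """Collect facet values from every group whose path is a prefix of doc_path.
    Shorter (higher-level) paths are applied first, so deeper groups override them."""
    result = {}
    groups = ET.fromstring(site_xml.strip()).findall("group")
    for group in sorted(groups, key=lambda g: len(g.get("path"))):
        if doc_path.startswith(group.get("path")):
            for meta in group.findall("meta"):
                result[meta.get("facet")] = meta.get("value")
    return result
```

The resulting dictionary could then be written into each XHTML document as metadata statements, which is the step that prepares a large number of documents at once.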
Figure 1 shows the overall system architecture. The three main areas are:
The Verity K2 search engine and Apache webserver
The Ontopia Knowledge Suite 2.0 (OKS) that acts as a repository for terms and concepts. The editor is a collection of screens for adding, modifying and deleting topics and associations.
Authoring tools Word and Tridion that are used to add facet descriptions to documents
A number of connections are required to connect the three areas:
OKS to Verity: synonyms and hierarchical relations for use by Verity
Verity to OKS: top 50 search terms are transformed into an XTM file by creating a topic for each term and merged with the main OKS file.
OKS to authoring tools: generation of controlled vocabulary for use by authors in content creation process
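The Verity to OKS connection, creating one topic per search term, can be illustrated with a short Python sketch that emits XTM 1.0 markup. The topic ids (`term-1`, `term-2`, ...) and the type reference `#begrip` are assumptions made for this example; the merge with the main topic map is left to the topic map engine and is not shown.

```python
import xml.etree.ElementTree as ET

XTM_NS = "http://www.topicmaps.org/xtm/1.0/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def q(tag):
    """Qualify an element name with the XTM 1.0 namespace."""
    return "{%s}%s" % (XTM_NS, tag)


def terms_to_xtm(terms, type_id="begrip"):
    """Build an XTM 1.0 document with one typed topic per search term."""
    ET.register_namespace("", XTM_NS)
    ET.register_namespace("xlink", XLINK_NS)
    topic_map = ET.Element(q("topicMap"))
    for index, term in enumerate(terms, start=1):
        topic = ET.SubElement(topic_map, q("topic"), {"id": "term-%d" % index})
        instance_of = ET.SubElement(topic, q("instanceOf"))
        # The type topic is assumed to exist in the main topic map under this id.
        ET.SubElement(instance_of, q("topicRef"), {"{%s}href" % XLINK_NS: "#" + type_id})
        base_name = ET.SubElement(topic, q("baseName"))
        ET.SubElement(base_name, q("baseNameString")).text = term
    return ET.tostring(topic_map, encoding="unicode")
```

Feeding the weekly top-50 into `terms_to_xtm` yields a small file that can be handed to the merge step.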
The overall system architecture could be described as loosely coupled. Connections between systems are XML documents over HTTP. We were unable to realize a 100% XML solution with XSLT (Xalan) handling the transformation jobs. The main problem was code legibility. Once the "XSLT sweet spot" was left behind, it proved easier to use a combination of XSLT and procedural Java programming. We do expect that the new features in XSLT 2.0 are going to get us a step closer to a 100% XML solution.
The editor was built with the Ontopia Framework. In our case, an experienced Java programmer with no prior knowledge of the topic map standard built the screens. Getting to know the supplied tag libraries and their expected behavior took most time.
As noted, the techniques for enhancing search are closely related to library science solutions. NISO Z39.19 [NISO 1986] describes guidelines for building a thesaurus. Although we did not look for the rigidity and quality that characterizes professionally developed thesauri, its core entities, relations and properties were usable. However, instead of buying tailor-made thesaurus software we chose a topic map approach. This left us with the step of creating a topic map schema. In general, this is easy. The main problem that we ran into is that the topic map standard is not very specific in prescribing usage patterns. In some cases, many roads seem to lead to Rome.
The following types were defined:
Begrip (term)
Facet (used as "root" of term hierarchy)
Deel-geheel (Part-whole, used for BT/NT relationships)
Soortgelijkheid (Similarity, used for associative relationships)
Equivalentie (Equivalence, used for (near-)synonymy)
Bredere term (broader term)
Smallere term (narrower term)
Voorkeursterm (preferred term)
Omschrijving (description)
Afkorting (short name)
For simplicity, we refrained from subclassing association types [Rath, H. and S. Pepper 2000] or subclassing occurrence types (for instance scope note as a kind of description). The use of the scope construct for a short name is ugly. The XTM 1.1 option of assigning a class to a baseName looks promising, but was not supported by our system at the time.
Synonymy is an example of a thesaurus construct that can be realized in two ways. As two synonyms represent the same concept, one could argue that a topic should represent the concept, and that its baseNames should be used to cover the synonyms. To distinguish between preferred and alternate terms, baseName scoping is available. The other option is to create topics for each term and work with associations to document the synonymy relationship between term topics. The second approach probably scales better, but the first cannot be disabled. From a topic map perspective, both approaches can coexist. In our system, synonyms need to be extracted for use in Verity. Thus, the programmer of the transformation script needs to be aware of both ways or information is lost.
Metadata about topic map constructs is another example of a problem that can be solved in multiple ways. Specialized thesaurus tools offer extensive metadata facilities, for instance to document why a term description is modified, or when the modification was done and by whom. In a topic map, the use of reification is probably the most elegant solution. Topics are created that describe the "history" of the primary topics and associations in a topic map. The actual metadata is written into typed occurrences. Although elegant, this solution runs the risk of massive growth of the topic map. Also, it is not easy to implement. One could also forget reification and add the occurrences directly to the topics themselves. This raises the question of how to document metadata on associations. The third way is to forget the topic map altogether and rely on a separate logfile. Obviously, this is the least attractive solution as information that should be part of the topic map is separated.
The topic map system was part of an overall search service architecture. We focused on delivering benefits for the problem at hand - managing keywords. The characteristics of this problem allowed us to keep things simple at first. However, we expect to make two kinds of improvements: add features to the current system (scale up), and use the system for new information domains (scale out).
As noted, the more advanced features of the topic map standard or the Ontopia Knowledge Suite were left unused. We are currently thinking about:
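A transformation that honors both representations could look like the Python sketch below. It assumes that the topic map has already been read into two simple structures, a dictionary from topic id to base names and a list of equivalence associations between topic ids; these structures, the sample terms and the function name are hypothetical and do not reflect the actual transformation script.

```python
def synonym_rings(topics, equivalence_pairs):
    """Group names into synonym rings.

    topics: {topic_id: [base names]} covers synonyms stored as baseNames.
    equivalence_pairs: [(topic_id, topic_id)] covers synonyms stored as associations.
    Returns a sorted list of sorted name lists, one per concept with 2+ names.
    """
    # Union-find over topic ids, so that chains of associations end up in one ring.
    parent = {topic_id: topic_id for topic_id in topics}

    def find(topic_id):
        while parent[topic_id] != topic_id:
            parent[topic_id] = parent[parent[topic_id]]
            topic_id = parent[topic_id]
        return topic_id

    for left, right in equivalence_pairs:
        parent[find(left)] = find(right)

    rings = {}
    for topic_id, names in topics.items():
        rings.setdefault(find(topic_id), set()).update(names)
    return sorted(sorted(ring) for ring in rings.values() if len(ring) > 1)
```

Ignoring either input would silently drop synonyms, which is the loss of information the paragraph above warns about.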
identity / published subjects
optimize closed loop
redo metadata
The concepts and terms currently stored in the topic map do not carry subjectIdentity descriptions. It makes sense to develop a naming scheme for concepts, because the website content is also used for intranet and paper purposes. In fact, the content is written in another system first and transferred later to the website. The system of metadata assignment discussed in this paper is web-specific. This is a waste, and we would like to generalize the process of adding metadata statements. An important step in reaching this goal is developing a naming scheme for concepts.
Realizing a closed loop is an important part of the overall search service architecture. Our initial implementation does the job quite nicely, but some areas for improvement remain (mainly scheduling). We are also thinking about extracting more information from Verity and adding that to the topic map.
As discussed, metadata for the topic map - who added this association? - is possible in a number of ways. As metadata for topic map constructs is something that will remain important, we want to revisit the reification approach.
Once an enterprise-wide topic map infrastructure is available, it makes sense to look for new applications. Prime candidates are the small database-like applications in Microsoft Excel or Access. These applications were usually built by end users because it was quicker and/or cheaper than the central IT department. However, a number of problems exist. In many cases, documentation and coding standards are poor. No backups are made, and central IT policies - for instance an Office upgrade - can wreak havoc. Despite these disadvantages, it is often difficult to migrate these programs to corporate IT solutions. Cost is an issue, but also the lack of flexibility that might hinder future modifications.
The built-in flexibility of the topic map systems (Section 2.2.1) enables a solution that offers end-users flexibility, and at the same time enables central IT to implement management policies. Of course, this solution is not for all situations. Multi-user databases might have concurrency issues. Larger databases might run into scalability issues. Also, some kind of directory would need to be installed once the number of applications grows beyond a certain limit.
Our site contains information about offices of the Dutch Tax and Customs Administration (phone, visit, etc.) in relation to Dutch municipalities. Every once in a while, updates are delivered by another organization. The updates are processed, some post-processing (exceptions) takes place, and the result is published to an application on the website. Currently, the information is loaded into an Access database. By converting the delivered information into a topic map, and managing the exceptions in a second topic map, Access is no longer required. In fact, the same tools that are used for managing keywords are reused.
Offering an enhanced search system to site visitors is a thorny issue. We looked towards a more evolutionary approach to mitigate some of the risks involved. Using a topic map system as the key component for managing terms and concepts proved to be a good fit. Key issues in our approach were a focus on the system-wide flow of information and extensibility.
With regard to the topic map standard, the built-in merging capabilities were important. They helped us realize the Verity to OKS connection, and appear to be useful in other contexts as well (address information example). Broadly speaking, merging helps us to glue together content processes in a large, distributed environment. This might be a much more important contribution than the relatively small keyword management system that we implemented at first.
[NISO 1986] Guidelines for the Construction, Format, and Management of Monolingual Thesauri (http://www.niso.org/standards/resources/z39-19a.pdf)
[Morville, P. 2001] The Speed of Information Architecture (http://semanticstudios.com/publications/semantics/000003.php)
[Rath, H. and S. Pepper 2000] Topic Maps: Introduction and Allegro (http://www.ontopia.net/topicmaps/material/allegro.pdf)