XML Europe 2002 logo

Five Ways for Managing Hierarchical Keyword Strings

Abstract

The abstract was not available at the time the proceedings were created. Please check an updated version of the paper abstracts at the conference proceedings web site.


Table of Contents

1. Introduction
2. Legal publishing
3. Classical Indexing
4. Central thesauri
5. Keyword strings
6. The process described, using XML
7. The resulting KWOC index
8. Requirements for the system
9. Five approaches
10. Storing everything in one RDBMS
11. Storing everything in one XML document
12. Storing the thesaurus in SQL, the strings in XML
13. Storing everything in an ISO Topic Map
14. Storing everything in XTM
15. Conclusions
Acknowledgements
Bibliography
Biography

1. Introduction

This paper describes two things. First it describes abstractly the way how we think we should index our text. That's nice enough, but the real stuff is in getting it to work. Five approaches are suggested, and as far as there are already results at the time of writing, these are shared with you. By doing that, we describe the second thing: namely, the lessons learned by mixing two kinds of datatypes (relational (SQL) and hierarchical (XML)), and how difficult it is to get out of the box software to support this mix in a user friendly way. That is the reason why this paper is scheduled in the Content Management track, in stead of the Knowledge Technology (Topic Map) track.

Kluwer is the leading Dutch legal publisher, a position acquired by buying several smaller publishing houses and selling the "illegal" activities. The first book was published in 1739, and now every working day three new books are added to our catalogue, many of which are part of one of our 300 serials. Loose leaf publishing is done since 1920, and we have almost 300 different titles, most of which are multi volume. The paper media mix is completed with journals, also about 300 titles. In 1972 we entered the digital world, starting in house with digital page composition, and we know how to loose money by publishing on line since 1979. Our first CD-ROM appeared in 1987, and we now have over 100 titles. We started converting to SGML[ISO8879] in 1989, which isn't finished yet, and we still make money. Yes, we are definitely old economy.

The reason why I emphasize the fact that we are old and big is not necessarily to give you the impression we are a mammoth, but on one hand to show you that we have enough indexable things to justify a kind of systematic approach in indexing them, and on the other to make you understand that since we have been growing for such a long time, there is some history before this systematic approach. As a matter of fact, the systematic approach wasn't designed until a few years ago, and even worse, it hasn't been implemented yet. In this paper I describe the system in abstracto, and I sketch five different approaches for implementing it, as we saw it when designing that abstract. Unfortunately enough, at the time of writing this paper (March 2002) only one approach has been built yet, and the results I must admit were discouraging. I hope that when I present this paper in May, I can share some more experiences with you, but so far it is already clear that it is not a natural thing to mix clearly relational stuff, like a multi hierarchical thesaurus, and ordered recursive constructions like keyword strings into one user friendly system. Even the obvious technique for manipulating thesauri and indexes, Topic Maps, is not adequate yet, lacking a Topic Map Query Language[ISO18048] and a Topic Map Constraint Language[ISO19756].

2. Legal publishing

The continental law system is based on statute law. The intention of the law is usually clarified by commentary, both official (written by the government during the law making process) and editorial (by legal experts expressing their opinions in text books and articles). The official interpretation of the law can be read in court verdicts; the most interesting cases are published as jurisprudence. Almost all cases and commentaries eventually refer to a section in a law, and many books and most loose leaf publications have the form of the official text of a law section, followed by more or less detailed comments, usually referring to the parliamentary history of the law, other sections and of course jurisprudence. If you are more familiar with SGML than with the real world (which is not unlikely if you spend your time reading this paper), you can think of the law as being the formal part of the DTD, the official documentation as presented to the parliament as the informal part of the DTD, the editorial comments as supplementary user instructions to the DTD and the jurisprudence as sample document instances, proving how the DTD should be applied. Just as many DTD documentation systems are constructed around the elements of the DTD, legal information is organized around law sections.

Laws are published once integrally by the government. Changes are published as modification laws, with instructions like "insert between section 11 and section 12 a new section 11a containing the text [....]" or "in section 11, first paragraph, change the word 'SGML' to 'SGML or XML'". Legal publishers follow these instructions, add metadata like the date and number of the modifying law, ask their authors to adjust their comments, and send a supplement with the changed pages to their loose leaf subscribers.

3. Classical Indexing

The classic approach for making a back of book index is that an indexer (a person, sometimes the same as the author, some times a specialist) reads the book from page 1 to the end, characterizing every section or even paragraph with one or more keywords, that are written down on a card, with a page or section number reference. The keywords usually come from his head, and if you are lucky some rules on consistency (like using nouns or verbs, singular or plural) are more or less applied. When he reaches the end, he goes through the (alphabetically sorted) pile of cards, adds some see also references, may be even checks his dictionary of synonyms and adds some see references that guide the reader from a deprecated keyword to the used one, and handles his pile of cards to the publisher to let it be keyed in. The process is rather time consuming, but we are used to that; there is also lots of manual work that could be automated. In the (mock up) example that follows, you see that the keyword "permission for medical treatment" is found as "permission" under "medical treatment" and "treatment", but as "medical treatment" under "permission", which suggests manual interference after copying.

The printed result will look like (CC 1:261 meaning section 261 of book 1 of the Civil Code):

  • medical treatment CC 1:261-264

    • court's decision: CC 1:264

    • definition: CC 1:261 (note 5)

    • performed by nurse without permission: CC 1:262

    • permission: CC 1:264

  • permission

    • medical treatment: CC 1:264

  • treatment

    • medical treatment: CC 1:264

    • permission: CC 1:264

Some element type declarations from the DTD for this index would be:

<!ELEMENT index (keyword1+)>
<!ELEMENT keyword1 (name, (see | (xref*, seealso*, keyword2*)))>
<!ELEMENT keyword2 (name, (see | (xref+, seealso*)))

The result of this process is not necesserally bad, that is to say, as long as you publish just on paper, nobody will really care that each book had his own indexer with his own preferences on rejectable synonyms, on which see and see also relations to show, on the level of detail in adding keywords, on almost anything that two specialists can dissent about. But what if you want to put two or more text books on one CD ROM or on the Internet? Our first approach was to put each book as it was, with its own index on the back of it, but as a reader I would like to have one index for the whole CD ROM, or even better, for the whole Internet (or at least for the portal I use), and that index should be consistent, and applied to all documents or web pages. But how easy is it to merge an index that says "economic offense see false invoice; false invoice: law a, section 11" with an index saying "economic offense: law b section 22; false invoice see economic offense"? Should these keywords be considered as synonyms, and one of the two indexes be declared as leading, merging them into "economic offense: law a, section 11, law b section 22; false invoice see economic offense"? Or are they related terms, and is the see reference just added because the first book had no sections on economic offenses in general, only on false invoices, and the second book vice versa, and should you merge them into "economic offense: law b section 22, see also false invoice; false invoice: law a, section 11, see also economic offense"?

4. Central thesauri

Although we are fully aware of the fact that there is an international standard that was designed just to solve the problem of merging indices[ISO13250], we did not rely on that technique. Topic Maps are good (or at least meant to be good) to merge existing indexes that you can't change, but in this case we thought it would be better to create a big central list of all the allowed keywords, make editorial preferences in case of synonyms, and add see and see also information. Soon enough we realized that we wanted to be more specific in the kind of relations, so that at least a hierarchical relation between two keywords could be established, in which case you end up with another ISO standard, called a thesaurus[ISO2788]. In a thesaurus you have terms, and any term can have one or (as in our case) more broader terms (BT). A term with no broader term is a top term or a mother term (MT). If A is a BT of B, then B is a narrower term (NT) of A. Any term can be related to one or more other terms (RT), and finally if the word C is a deprecated synonym of term A, we say there is a "used for" (UF) relation between A and C, and a USE relation between C and A. Homonyms can be disambiguized by a short qualifier: Minister (government); Minister (clerical).

Once you have a central thesaurus for a certain domain, like civil law, the indexer is instructed only to use terms from this thesaurus; if these don't meet his requirements, he can do a proposal for a new term, that will have to be approved by the central thesaurus polit buro. The pile of cards of used terms and associated occurrences is more or less the same, only the keywords are now harmonized over all books in that domain, and the see and see also reference within the index can now be generated automatically from the relations in the thesaurus. For every term actually used you can create a see reference from all deprecated synonyms. For every sub-term used you can add the BT, and if that BT has no occurrence itself, you can generate one or more see references to its actually used NTs. For every used term, you can generate a see also reference to all its actually used RTs. You might consider creating see references from unused terms to RTs that are used as well. But that is assembling and publishing, we first want to consider data creation and storage.

Since our thesaurus has multiple inheritance (international tax law has both tax law and international law as BTs), and because of the redundancy in RTs, a hierarchical XML DTD with declarations like

<!ELEMENT mt (name, qualifier*, rt*, nt*)>
<!ELEMENT nt (name, qualifier*, rt*, nt*)>

is not the way to model the thesaurus, unless you don't mind having the term international tax law as sub-element of both broader terms. A relational database system would have a table for terms with columns like name, status, qualifier; a table with deprecated terms and a pointer to the term to be used; a table for BT/NT relations; and a table for related terms. But since you (like me) are probably more familiar with XML, I'll show the database schema as an incomplete DTD (in which I skip the tables, just name the rows, which are empty elements):

<!ELEMENT thesaurus (term+, deprecated+, btnt+, rt+)>
<!ATTLIST term
          id ID #REQUIRED
          name CDATA #REQUIRED
          qualifier CDATA #IMPLIED
          status (proposed, approved) proposed>
<!ATTLIST deprecated  
          name CDATA #REQUIRED
          qualifier CDATA #IMPLIED
          use IDREF #REQUIRED>
<!ATTLIST btnt
          nt IDREF #REQUIRED
          bt IDREF #REQUIRED>
<!ATTLIST rt
          rt1 IDREF #REQUIRED
          rt2 IDREF #REQUIRED>

Using an indexing system based on a centralized thesaurus results in consistent indices, both consistent in terminology between indices (which makes them mergeable), and internally consistent in terms of generated internal see and see also references. The principle as sketched in this section is pretty obvious, and easily generated using a Relational Database System, I guess. I only guess, since we didn't create such a simple system. We decided we wanted more.

5. Keyword strings

When retrieving information, there are two things to worry about: recall and precision. You don't want to miss anything, and you don't want to be bothered with irrelevant stuff. Simple, single terms turned out to be either too specific or too generic. Adding more keywords to a retrievable object can improve the recall, but it gives the user no clue whether there is a relation between these keywords or not, not to speak of the kind of relation. Our metadata department therefore decided to introduce the concept of ordered keyword strings, like "Permission for medical treatment by judge".

These strings are then linked to information components (the retrievable objects, like sections of a law or paragraphs). When some components are assembled to a concrete product (the result of a query like "Make me a book with all the laws as they were on date x of law family y, but without law z, and add to every section metadata a and b from our database, a list of all linked jurisprudence selected for publishing in magazine c and the comments from textbook d, and if not available those from e"; individual components can be as small as textnotes or headers; no one said that migrating to SGML would simplify things), an index can be created by gathering all associated strings. Extracting the combining terms from the strings results in a KWOC index (Keyword out of Context): the index is sorted on terms, and every string will be shown under each of the terms it is constructed from (in our example "permission", "medical treatment" and "judge"). Extra information like BT, NT, RT and USE can then be added from the thesaurus, like described above, though these elements could be transformed into the see and see also references that are so familiar to the reader.

A keyword string consists of a main term ("Permission"), followed by zero or more keyword strings, each related to the main term by a combining word, usually a preposition like "by", "of" or "as". As you might have noted, the definition of keyword string is recursive: after a connector you can have a simple string of one term (the usual case) but also a real compound string. Since we are SGML people, our first approach was a recursive, hierarchical DTD fragment like:

<!ELEMENT keywrdstr (keywrd, connectedkeywrdstr*)>
<!ELEMENT keywrd (#PCDATA)>
<!ELEMENT connectedkeywrdstr (connector, keywrdstr)>
<!ELEMENT connector (#PCDATA)>

but this is redundant: the content of keywords, connectors and strings are stored multiple times, which makes a change (like when de term nurse has to be replaced in the whole thesaurus by paramedic) very hard. We would therefore prefer to model everything as references:

<!ELEMENT keywrdstr (connectedkeywrdstr*)>
<!ATTLIST keywrdstr
          id ID #REQUIRED
          termref IDREF #REQUIRED>
<!ELEMENT connectedkeywrdstr EMPTY>
<!ATTLIST connectedkeywrdstr
          connectorref IDREF #REQUIRED
          keywrdstrref IDREF #REQUIRED>
<!ELEMENT connector (#PCDATA)>
<!ATTLIST connector
          id ID #REQUIRED>

To avoid most ambiguity when rendering an index, the connectors of the highest order are usually marked visually: "Permission [for] medical treatment [by] judge" then means that both "medical treatment" and "judge" are directly connected to "Permission", which means that the judge gives permission, while "Permission [for] medical treatment by judge" would describe the unlikely situation that the judge has permission to give medical treatment. When increasing the nesting level, like in "Appeal [against] permission for medical treatment by judge" the difference disappears: the only thing we now know is that judge didn't necessarily appeal himself. No need to say (to this audience) that it only disappears in the presentation of the string; in the system, the exact nesting needs to be preserved. This paper is on how to represent and maintain these strings internally.

6. The process described, using XML

The structure of the thesaurus and the keyword strings is already described. What an indexer does, is essentially still the same as he always did: he goes through the document he has to index, and for each information object that has to be made retrieveable decides what keywords to add, only in stead of a bunch of keywords without explicit relation, he looks in the list of registered keyword strings wether a string already exists. If not (what is a probable case; keyword strings are intented to be so specific and precize, that they will describe only a few information objects), he starts constructing a new string, by picking the head keyword from the list of approved terms (and, if not found, making a suggestion for a new term), which will be connected by picking a connector (also from a expandable list) and another term, or another keyword string (the chance that this string already exists is bigger: you'll probably already have had a section on permissions for medical treatments before you get the section on appeal against it).

The result is an n-to-n relation that connects strings and objects, or in Topic Mapese topics and occurences. The natural way to model the creation of that list in XML would be

<!ELEMENT object-with-strings (xref, keywrdstrref+)>

while the natural way to model the output in XML would be

<!ELEMENT string-with-objects (keywrdstrref, xref+)>

which suggests that the storage of the relations actually should be a table that connects the two elements. In general, relational database theory teaches us that every n-to-n relation should be stored that way. Using an empty element again, the attribute declaration would be something like:

<!ATTLIST keywrdstr-obj-rel 
          keywrdstrref IDREF #REQUIRED
          objref IDREF #REQUIRED>

So far for the storage. We now may consider what user interface an indexer would like. His main task is to link a pointer to an information component with a pointer to a keyword string, and adding new strings, terms, relations between terms and links to information components. When giving him a set of components to index, that set will usually be ordered, like a law with numbered annotations to numbered sections, so you can offer the indexer a list of all sections and per section of all annotations (or more generally and practically: a list of all elements of certain types that have an ID attribute). That list might appear on one window, and usually he'll walk down the list, from item to item. If he just has to make an index of the annotations, and the sections of the law are already given keyword strings, it would be nice to see what strings are already used, so that list could look like

  • Civil Code

    • Keyword: Civil law

    • Book 1

      • Section 261

        • Keyword: (medical treatment)

        • Annotation 1

        • ...

        • Annotation 5

          • Keyword: (medical treatment {definition})

      • Section 262

        • Keyword: (medical treatment)

        • Keyword: ((permission) for ((medical treatment) by (nurse)))

        • Annotation 1

      • Section 263

        • Keyword: (medical treatment)

        • Annotation 1

      • Section 264

        • Keyword: (medical treatment)

        • Keyword: ((permission) for (medical treatment) by (judge))

        • Annotation 1

At least two other windows should be presented. Preferably one presenting a list (with fold/unfold buttons), where he can view all terms and per term all related terms, all strings that are build tusing that term, and per term and string all associated references to the information components:

  • civil law CC

    • NT employment law

    • NT medical law

    • UF private law

  • employment law CC 2

    • BT civil law

    • RT personnel

  • judge Law on magistrature 123-145; 189

    • BT magistrate

    • ((permission) [for] (medical treatment) [by] (judge)): CC 1:264

  • magistrate Law on magistrature

    • NT judge

    • NT prosecutor

    • RT personnel

  • medical law

    • RT medical treatment

  • medical treatment CC 1:261-264

    • BT treatment

    • RT medical law

    • (medical treatment {definition}): CC 1:261 (note 5)

    • ((permission) [for] (medical treatment) [by] (judge)): CC 1:264

    • ((permission) [for] ((medical treatment) [by] (nurse))): CC 1:262

  • nurse

    • BT personnel

    • ((permission) [for] ((medical treatment) [by] (nurse))): CC 1:262

  • permission

    • ((permission) [for] (medical treatment) [by] (judge)): CC 1:264

    • ((permission) [for] ((medical treatment) [by] (nurse))): CC 1:262

  • personnel

    • NT nurse

    • RT employment law

    • RT magistrate

  • private law USE civil law

  • prosecutor Law on magistrature 189; 244

    • BT magistrate

  • treatment

    • NT medical treatment

If however that is too complicated to build, it could be one list with all terms, deprecated terms and strings, one list with all terms related to the term selected in the first list, one list with all strings constructed using the term selected and finally one list with all associated occurences of the selected string or term. Here again the complex, hierarchical point of view and the elegant relational view compete. Probably the easiest to implement will be a screen like this:

Finally, the indexer needs a pop up window where he can construct a new string from existing terms and strings, and add new terms. By walking through the big navigation window described above, he can select the main term for his new string. The upper window moves to that term, shows all related terms, all strings already created with that term and all information objects it is related to. He can then either pick an existing string and bring it to the lower editing window, or create a new string. To show the recursivity in a user intuitive way, that editing window could present the compound string as a nested list. If for instance you would like to create a new string "appeal against permission for medical treatment by judge", you would first have to create a new proposed term "appeal", then go one line down to show that you wanted to connect another term or string to it, select the connector "aginst" and then could select from the upper window the string "permission for medical treatment by judge". The result could be shown as:

  • appeal

    • [against]

      permission ([for] medical treatment) ([by] judge)

There is no need to show the internal structure of the connected strings as a nested list, since that string is now treated as a fixed, given string that we only refer to, we don't want to modify that string now.

7. The resulting KWOC index

Once the thesaurus is provided wiith references to the information to be indexed, the index for a certain selection of information objects can be generetad by walking through the thesaurus, assembling all keywrdstr-obj-rel elements with an xref attribute to an information object that is in the selection, and then assembling all keyword strings that are referenced, then gathering all terms used in the assembled keyword strings, then recursively adding all BTs of gathered used terms that are not used itself but have at least two NTs that are used, selecting all RTs of all used terms, and finally selecting all the deprecated synonyms for all used terms.

The KWOC index that is the result of this process will be looking like:

  • judge

    • permission [for] medical treatment [by] judge: CC 1:264

  • medical treatment CC 1:261-264

    • medical treatment (definition): CC 1:261 (note 5)

    • permission [for] medical treatment [by] judge: CC 1:264

    • permission [for] medical treatment by nurse: CC 1:262

  • nurse

    • permission [for] medical treatment by nurse: CC 1:262

  • permission

    • permission [for] medical treatment [by] judge: CC 1:264

    • permission [for] medical treatment by nurse: CC 1:262

  • treatment see medical treatment

the DTD for which could be something like:

<!ELEMENT kwocindex (kwoc | unusedterm | deprterm)+>
<!ELEMENT kwoc (name, qualifier?, xref*, seealso*, keywrdstr*)>
<!ELEMENT unusedterm (name, qualifier?, see+)>
<!ELEMENT deprterm (name, qualifier?, see)>
<!ELEMENT keywrdstr (name, (connectedkeywrd | connectedkeywrdstr)+)>
<!ELEMENT connectedkeywrd (connector, name)>
<!ELEMENT connectedkeywrdstr (connector, keywrdstr)>

The output can be a stand alone document, where the names (with data content) of terms are used in stead of the references that are used while storing it without redundancy.

Please do not note the line with "medical treatment (definition)". The added word (definition) is a form description, but is not covered in the DTD or in the rest of the paper, and should have been deleted from the example, but I only noticed that when I had no time left to correct the paper.

For the keyword string, I had two options. For storage, I decided that a string would be at least one term, followed by zero or more conected strings: every single term would be a string of one. I did that, in order to have only one element type where I could link an information component to. For display, I decided to differ between single terms and compound terms, because it would facilitate creating a stylesheet where real (compound) strings would be shown with for instance brackets around them, like the unlikely "appeal [against] (permission [for] (medical treatment [by] nurse) [by] judge) [after] (medical treatment [by] nurse)". If we didn't make the difference between simple and compound strings, a simple style sheet would produce something like "((appeal) against ((permission) for ((medical treatment) by (nurse)) by (judge)) after ((medical treatment) by (nurse)))", which would appeal more to Lisp programmers than to lawyers.

8. Requirements for the system

To build and maintain indices in this way, you definitely need an automated system. The first and most important demand is that it is reliable. Terms should be write protected, there should be a work flow that ensures that no unapproved terms get into published indices, there should be consistency checks, et cetera et cetera. Almost as important is that it should be extremely user friendly, because eventually lots of people (also one time authors) should be allowed to create their indices this way, without lots of training. That means fast scrolling, quick building of the windows and pick lists, every command and navigation must preferably be doable by just typing keys, so there is no need to switch the hand from keyboard to the mouse. Simultanuous editing must be possible, so in the ideal case where you work with one central thesaurus, there must be locking on term level. Eventually, the system should be accessible over the internet, so that external authors can use it and work with the one and only thesaurus, that at the same time might be edited by a thesaurus author...

Well, some of these demands are contradictory, unrealistic, and fortunatley not really necessary. The editing of the thesaurus itself is a thing that should not be done too frequent (once it is stable), and even if you created a new string using the term "civil law" at 11:00, there is no guarantee that at 11:15 not a new version of the thesaurus was introduced where the preferred term would be changed to "private law". It is well enough to have a system where you can ask as a user or send as a publisher a new version of the thesaurus, the list of strings, the list of information objects and the list of associations, so that the user can work stand alone, but when checking in the new material from the user, you should have provisions for the case that the tables he used while editing and the tables that are in effect now are incompatible.

9. Five approaches

Five approaches were considered, and will shortly be described. Since reliability was our first demand, and the scale on which we should select, manipulate and edit was very small, and loaded with complicated relations, we didn't feel comfortable with our SGML systems, that were designed to manipulate large chunks of hierarchically organized text.

We thought that the database world would be mature enough to be able to quickly develop a system for maintaining some simple pick lists, as we underestimated our real demands. We especially underestimated the difficulties in editing and maintaining the complex hierarchy of the recursive strings. That aproach turned out to be a long way, with many misunderstandigs. As an escape, we are now considering four other approaches, but on the same time we still feel that the principles of the relational model are very sound, especially in terms of avoiding redundancy.

The first alternative is, to use an XML or SGML editor to edit and maintain everything in one big file. To link an information component to a string that contains a new term then means a lot of navigation within your document, or a very loosely organized document where it is hard to find anything. And how to make sure that the IDREF of an element of the type termref refers to an ID of an element of the type term?

So the second escape would be a hybrid system, where the relational stuff is done in a relational database and the hierarchical stuff in an XML editor.

And finally, Topic Maps were considered as well, in two flavours. In ISO Topic Maps you can write your own DTD, including the nested strings, so that is one approach, but we do also consider XTM, even though it was primarily designed as an exchange format.

10. Storing everything in one RDBMS

We have shown a DTD that describes a non redundant storage scheme in XML. The mapping to a relational database system is pretty trivial for the thesaurus and the table that connects the strings and the information components. The strings are less trivial, since they are both ordered and recursive, two things that are not natural to relational systems. However, by changing in the definition of connectedkeywrdstr the recursive call to keywrdstr* to one in keywrdstrref, we already solved that problem on the modelling level (the processing still might be less trivial), while the ordering can be solved by adding a sequence number. Expressed again in XML, to model a 1-to-n relation the declarations would have to be changed from:

<!ELEMENT keywrdstr (connectedkeywrdstr*)>
<!ATTLIST keywrdstr
          id ID #REQUIRED
          termref IDREF #REQUIRED>
<!ATTLIST connectedkeywrdstr
          connectorref IDREF #REQUIRED
          keywrdstrref IDREF #REQUIRED>

to two empty elements with atributes:

<!ATTLIST keywrdstr
          id ID #REQUIRED
          termref IDREF #REQUIRED>
<!ATTLIST connectedkeywrd
          headkeywrdstrref IDREF #REQUIRED
          sequencenr NMTOKEN #REQUIRED
          connectorref IDREF #REQUIRED
          connectedkeywrdstrref IDREF #REQUIRED>

where both headkeywrdstrref and connectedkeywrdstrref refer to the ID attribute of a keywrdstr element. Let me show you the tables needed to construct the new string "appeal [against] (permission [for] medical treatment [by] judge)":

Table 1.

Table Terms
Term-ID Name Qualifier Status
1 permission   approved
2 medical treatment   approved
3 judge   approved
4 appeal   proposed

Table 2.

Table Connectors
Connector-ID Connector Status
1 for approved
2 by approved
3 against proposed

Table 3.

Table Keyword-strings
Keyword-string-ID Term-IDREF
1 (permission [for] medical treatment [by] judge) 1 (permission)
2 (medical treatment) 2 (medical treatment)
3 (judge) 3 (judge)
4 (appeal [against] (permission [for] medical treatment [by] judge)) 4 (appeal)

Table 4.

Table Connected-keyword-strings
Head-Keyword-string-IDREF Sequence-number Connector-IDREF Connected-Keyword-string-IDREF
1 (permission [for] medical treatment [by] judge) 1 1 (for) 2 (medical treatment)
1 (permission [for] medical treatment [by] judge) 2 2 (by) 3 (judge)
4 (appeal [against] (permission [for] medical treatment [by] judge)) 1 3 (against) 1 (permission [for] medical treatment [by] judge)

To generate a readable keyword string n, you select from the table Keyword-strings the row where the Keyword-string-ID is n and remember the value of Term-IDREF, then retrieve from the table Terms the name of the term with that Term-ID. Next, you query the table Connected-keyword-strings for all strings with Head-Keyword-string-IDREF equals to n ordered by the column Sequence-number, and for each of these rows you retrieve the Connector-IDREF from the table Connectors and print the associated connector, and then you generate the readable keyword string m where m is the value of Connected-Keyword-striung-IDREF. As you see, the recursive data definition has been replaced by a recursive algorithm, and SQL-aware people have told me that such is a nasty thing to do. It doesn't fit in SQL, they tell me, so it means a stored procedure, and that means that a solution will not be standardized and tool independent.

I can't judge the validity of these comments, but certainly I noticed that it was very hard to explain a system where order and hierarchy was important to our first contractor. I guess we choose the wrong people, since it appeared they only could make a model using some visual user interface, in combination with natural language sentences like "a term has a name and an optional qualifier; the combination name and qualifier must be unique". A ready made set of tables as presented here could not be used as input to that system, and appearantly we were not smart enough to provide the right input to the user friendly system they used to create data models with. The developer we then hired had no problem in understanding the system (he had been trained as librarian and obviously understood hierarchies). Unfortunately, it prooved that the intuitive user friendly we had in our minds was very hard to realize. Adding a new string to an information component involved actions like "click on button [new relation]", browse for the string to add, if not found go to another screen [add new string], browse in terms window for existing term "appeal", if not found go to other window [new term], type in the new term (not using the characters aleready typed for searching it in the previous window), open again the new string window, go to term window, select the new term, go back to string window, click on [connect string], choose connector "against", if not found open "new connector window", add newly proposed connector, go again to connector window, select "against", go to string window, select "permission for medical treatment by judge", click [OK], answer lots of questiona "are you sure you want to add ....?" with yes...

We had the feeling that the user would have preferred to just key in "a", then seeing that there was no term yet starting with that letter, automatically getting the window add new term (where the "a" would already be there), key the rest of the proposed term, pressing Enter to accept, then automatically moving to the field [connector], typing "a" to find there was no connector with that name and getting in a editing/adding mode. In fact, only a few more keystroke should be needed than necessary for keying "appeal [against] p" (because at looking for p, the system would find the string "permission [for] (medical treatment [by] nurse)"). Let the programmer and the computer work, in stead of the indexer, was our opinion, but we haven't reached that fase yet.

11. Storing everything in one XML document

Knowing how friendly and easily customizable an editor as XMetaL (or Epic, though tougher to customize) is, our next thoughts went into that direction. Could we use the element type declarations we presented so far, augmented with some elements? For the keyword strings we could now keep using the nested element connectedkeywrdstr with the REP occurence indicator, but more changes should be made. Attributes are usually harder to edit than plain text, and the tree view also just shows PCDATA, so we change all CDATA attributes to data content again. And since the indexer's approach is to walk through the document to be indexed, adding references to the strings from the information component, we choose again for the hierarchical relation that reflects the workflow (element infocompdescr).

<!ELEMENT stringbase (thesaurus, stringset, infocompset)>

<!ELEMENT thesaurus (term+, deprecated*, btnt*, rt*)>
<!ELEMENT term (name, qualifier?)>
<!ATTLIST term
          id ID #REQUIRED
          status (proposed, approved) proposed>
<!ELEMENT deprecated (name, qualifier?)>
<!ATTLIST deprecated
          use IDREF #REQUIRED>
<!ELEMENT btnt EMPTY>
<!ATTLIST btnt
          nt IDREF #REQUIRED
          bt IDREF #REQUIRED>
<!ELEMENT rt EMPTY>
<!ATTLIST rt
          rt1 IDREF #REQUIRED
          rt2 IDREF #REQUIRED>

<!ELEMENT stringset (connector+, keywrdstr+)>
<!ELEMENT connector (#PCDATA)>
<!ATTLIST connector
          id ID #REQUIRED
          status (proposed, approved) proposed>

<!ELEMENT keywrdstr (connectedkeywrdstr*)>
<!ATTLIST keywrdstr
          id ID #REQUIRED
          termref IDREF #REQUIRED>
<!ELEMENT connectedkeywrdstr EMPTY>
<!ATTLIST connectedkeywrdstr
          connectorref IDREF #REQUIRED
          keywrdstrref IDREF #REQUIRED>

<!ELEMENT infocompset (infocompdescr+)>

<!ELEMENT infocompdescr (name, kwdstrref*, infocompdescr*)>
<!ATTLIST infocompdescr
          id ID #REQUIRED>
<!ELEMENT keywrdstrref EMPTY>
<!ATTLIST keywrdstrref
          keywrdstrref IDREF #REQUIRED>

<!ELEMENT name (#PCDATA)>
<!ELEMENT qualifier (#PCDATA)>

A part of the document to model the famous "appeal [against] (permission [for] medical treatment [by] judge)" would be

<thesaurus>
  <term id="t1" status="approved"><name>permission</name></term>
  <term id="t2" status="approved"><name>medical treatment</name></term>
  <term id="t3" status="approved"><name>judge</name></term>
  <term id="t4" status="proposed"><name>appeal</name></term>
</thesaurus>
<stringset>
  <connector id="c1" status="approved">for</connector>
  <connector id="c2" status="approved">by</connector>
  <connector id="c3" status="proposed">against</connector>
  <keywrdstr id="k1" termref="t1">
    <connectedkeywrdstr connectorref="c1" keywrdstrref="k2"/>
    <connectedkeywrdstr connectorref="c2" keywrdstrref="k3"/>
  </keywrdstr>
  <keywrdstr id="k2" termref="t2"/>
  <keywrdstr id="k3" termref="t3"/>
  <keywrdstr id="k4" termref="t4">
    <connectedkeywrdstr connectorref="c3" keywrdstrref="k1"/>
  </keywrdstr>
</stringset>

How easy is it to use this DTD and to build a large system with thousends of terms, strings and information components to link to each other? How much scrolling do you think to have to do to add a new keyword? How easily would you add a new term, not noticing the deprecated element where it already was described as a rejected synonm? OK, we change the declaration of the element type thesaurus to

<!ELEMENT thesaurus ((term | deprecated)+, btnt+, rt+)>

in order to sort preferred and rejected terms to one alpfabetical list, but does that really solve our problems? How do we make sure that the attribute connectorref points to the id value of a connector element, and not to that of a term element? XML doesn't allow us to impose that kind of restrictions; SGML (or better, HyTime) has the option for lexical constraints, so you could specify that the id attribute of connector and the idref attribute of connectorref both have the form "c[0-9]+", but which software supports lexical constraints? We need a lot of configuration, and as far as we understand XML schema these restrictions can not be specified in that language either, so we will have to program in the scripting language of the editor. We don't think that is the solution either.

12. Storing the thesaurus in SQL, the strings in XML

All right, next try. Let us use the best of both worlds, and store the typical relational things like the thesaurus and the infocompset in databases, and use XML for the strings. Existing strings can be stored in the database as a generated list, so we would do almost all the stuff in the RDBMS environment, only for adding a new string we would call an XML editor. One option is to hand the complete thesaurus and list of existing strings to the document as read only part of the document, but I guess it will be easier to just give the keywrdstr element as the root element, changing the AttType from IDREF to CDATA, and instructing the editor to use ODBC to find the values in the database. It is the simplest to describe, and therefore looks like the most promising solution, but at the time of writing this paper, it wasn't implemented yet.

13. Storing everything in an ISO Topic Map

The obvious soultion for any indexing problem is to use Topic Maps. And still, we can't show how we did it, because we didn't, yet. The main reason is the lack of tools that implement TMQL and TMCL, the languages to be designed for specifying queries in a Topic Map and defining the constraints in the Topic Map. In a Topic Map everything is a topic, but that is too general for practical usage. You need to specify several layers in a Topic Map, like what the thesaurus is and what kinds of relations between terms you allow. Then you have a layer of strings, where you explain how they are constructed, using a head term and connectors. Then you have a layer where you describe the potential occurences, and finally you link everything together. These kinds of restrictions should be defined using a constraint laguage, but that standard is still under construction.

Can topics contain topics? The ISO model is an architecture, so you can build a recursive DTD and map them to the topic element, but will the software understand and respect recursion? And how will it be shown to the indexer? Therefore you should need some formatting or transformation language, but a TMFL or TMTL is not even considered. And you should need a query language to fill a window with "all topics that are a broader term of this topic".

14. Storing everything in XTM

The last alternative we want to consider is using XTM. It is meant for the exchange of Topic Maps, and that is not our purpose here, but so was XML designed for exchange of documents, but we will nevertheless convert our large SGML set of documents to the XML syntax, just to be able to profit from the huge amount of XML software. These days, TM engines from empolis, Ontopia or Mondeca support both ISO Topic Maps and XTM, but will that continue?

The question on modelling ordered hierarchical strings however will not be eased when having to use one single topic element, allbeit with an instanceOf sub-element. Once again, the need is for TMCL and TMQL!

15. Conclusions

Rather disappointed we have to conclude that it is hard to mix the relational and the hierarchical world. It would help if you could specify in a schema language that the termref attribute should refer to a term element, and that a user is not allowed to delete an element if its attribute status has the value approved, and that he is not allowed to change that value. It would help if we could specify that all elements term and deprecated should be ordered alphabetically by their sub-elements name followed by the optional qualifier.

Or we will have to wait for tools and languages to specify queries and constraints in Topic Maps. That certainly is the most promising and suitable approach. We hope that by the time this paper is presented, we can go into more detail on this point.

The most important lesson learned however, was that when n-to-n relations are relevant, the relational model is superior to the hierarchical model when storing and mainitaining information in a non redundant way, while when presenting the information to a human user, we want to convert it to a (redundant) hierarchy. To allow editing in such a system, you face two conflicting principles, because on one hand you need to avoid reduncancy (thus: relational) and on the other the other the user likes an intuitive interface, which suggests hierarchy and thereby redundancy. An obviuos standardized solution has not been found yet.

Acknowledgements

This paper is based on a presentation that Bas de Wilde, the lawyer who designed the keyword string principle, and I gave in November 2001 on the Dutch Users Group's annual conference in Rotterdam. I more or less transformed the examples he had given into English equivalents using in some way the EuroVoc thesaurus[EuroVoc].

Bibliography

[EuroVoc] European Communities, EuroVoc Thesaurus, http://europa.eu.int/celex/eurovoc/

[ISO2788] ISO 2788, Guidelines for the establishment and development of monolingual thesauri, 1986

[ISO8879] Ch. Goldfarb (ed.), ISO/IEC 8879, Standard Generalized Markup Language, 1989

[ISO13250] M. Biezunski, M. Bryan, S.R. Newcomb (eds), ISO/IEC 13250, Topic Maps, 1999

[ISO18048] H.H. Rath, L.M. Garshol (eds.), ISO/IEC 18048, Topic Map Query Language, 2001, Working Draft

[ISO19756] S. Pepper (ed.), ISO/IEC 19756, Topic Map Constraint Language, 2001, Working Draft

[XTM] S. Pepper, G. Moore (eds.), XML Topic Maps (XTM) 1.0, 2001, http://www.topicmaps.org/xtm/1.0/xtm1-20010302-2.html

Biography

DTD manager

Trained to make mathematical models of the economy, Diederik Gerth van Wijk has devoted his professional life to modelling text. Since 1992 he is responsible for developing the DTDs used at Kluwer in Deventer, the Netherlands, the leading Dutch legal publisher. He is editor and frequent contributor of <!Element, the quarterly newspaper of the SGML XML Users Group Holland, and chairs the Dutch National Standardization Committee 381 034 that is the counterpart of JTC1 SC 34 (SGML, Topic Maps etc.).