XML Europe 2002

Re-using technical documents beyond their original context

A use case for Published Subjects

Abstract

Technical documents are generally classified, organized and indexed through specific and accurate structures, which make them very efficient in the context where they are created and used. But this same specificity makes them difficult to export and re-use outside that original context. Reorganizing them for different users generally means a costly overhead of new indexing, classification, and the corresponding reorganization of links and references.

We'll show how to significantly reduce this cost by managing three things separately: document units, represented as topics; content structure, represented as associations between those topics; and a Published Subjects repository, used to index the topics.

The resulting integrated network can be made available to any actor in the enterprise, industry or community likely to be interested in reusing content. Standard XML authoring tools and local vocabularies can be used to tag and index documents as a background task during creation.

This use case will be presented in the general framework of the OASIS Topic Maps Published Subjects Technical Committee, which gathers industry actors to deliver recommendations and best practices for the use of Published Subjects.

Table of Contents

1. Documents structure, links, and indexing
1.1. XML improves documents structure ...
1.2. ... but XML does not solve the link issues ...
1.3. ... and Indexing is yet another story
2. A topic map view of documentation
2.1. Document units as topics
2.2. Document structures as associations
2.3. Links and references as associations
3. Published Subjects, Index and Metadata
3.1. The why, what and how of Published Subjects
3.2. Metadata using Published Subjects
3.3. Index-topics representing Published Subjects
3.4. From metadata to "index-associations"
4. Reusing the documentation
4.1. Customized extracts and summaries
4.2. Customized navigation paths
4.3. A full-size ongoing use case
Biography

1. Documents structure, links, and indexing

1.1. XML improves documents structure ...

Document content and internal structure (summary, subdivision into chapters, sections, paragraphs) used to be managed together by authors through the same tool and interface, a text or HTML editor. With those tools, the notion of a document as a set of units and subunits, organized in a structure that can be managed, modified and reorganized independently of the content itself, was implicit but far from obvious, both in the user interface and in authors' minds and methodology.

This notion of a document as made of well-identified units is spreading as XML becomes widely adopted. XML elements, well defined and identified through their markup, are more likely than paragraphs in a text document to be considered as independent units and managed as such. The inherently hierarchical structure of XML makes it easy to represent and manage a summary structure, linking units to subunits as elements to subelements. But this structure is embedded in the XML document itself, and can't really be managed from outside the document.

1.2. ... but XML does not solve the link issues ...

While subdivision is a natural feature of XML documents, cross-linking and references, which are orthogonal to the subdivision structure, raise different issues. Linking structures are already present in classical paper documentation in the form of indexes, cross-references, bibliographies, footnotes ... Various forms of hypertext linking have provided a general tool for handling and extending such links. However, hyperlinks embedded in documents face a major issue: the structures and addresses of those very documents change. The all too famous "404 error" is not exclusive to Web browsing, and links embedded in XML elements bring no progress in that respect.

Whatever the format, reorganizing documentation with embedded links requires a manual, painful and error-prone process of searching, checking and modifying obsolete hyperlinks inside the documents themselves. Managing the link structure in a distinct layer would remove this costly process and bring a substantial productivity gain.

1.3. ... and Indexing is yet another story

Documentation classification and indexing seem to belong to a different universe from authoring, one that involves the expert knowledge of archivists and librarians, who are exclusively entrusted with those tasks. Those experts generally have no practical means, and sometimes no wish, to take authors' input into account. Using different tools, authors and librarians live in different worlds, with a flagrant lack of communication. This situation is fortunately evolving with the spread of keywords and metadata, which allow the indexing and classification process to begin at document creation. However, to be efficient, this process requires authors to have access to a common index that makes sense in their context. Unfortunately, the use of local vocabularies (technical, commercial, financial...) hinders full interoperability of keyword indexing.

All those issues of managing document units, structures, links and indexing can find an integrated solution based on a topic map representation and a Published Subjects repository. The following sections provide an overview of the principles of such an organization. We'll then show an example of how content from a website is reorganized by reference to a published thesaurus.

2. A topic map view of documentation

The examples all refer to the same topic map. The complete XTM file is available on-line at http://www.universimmedia.com/sitemap.xml

2.1. Document units as topics

The first stage in the topic map representation of content structure is to represent each document unit by a topic. By document unit, we mean here any document, part of a document or set of documents whose content makes sense by itself, even if extracted from its original context.

An important point to highlight here is that those "doc-unit-topics" do not represent the subjects documented, but the document units - addressable resources - themselves. XTM syntax has a specific way to deal with that situation, which is to declare the identity through a resourceRef element. The following example is at http://www.universimmedia.com/sitemap.xml#p-chromo

<topic id="p-chromo">
<instanceOf>
<topicRef xlink:href="#doc" xlink:type="simple" /> 
</instanceOf>
<subjectIdentity>
<resourceRef
 xlink:href="http://www.universimmedia.com/soleil/lexique/chromosphere.htm"
 xlink:type="simple" /> 
</subjectIdentity>
<baseName>
<baseNameString>La Chromosphère</baseNameString> 
</baseName>
</topic>

The external address of the document unit represented by the topic is completely independent of the topic id "p-chromo", which is used to reference the topic in associations. Therefore, if this external address changes, the links defined by associations are not broken, and the address needs to be changed only once, in the resourceRef element inside subjectIdentity, to update the whole structure.
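
As a hedged illustration of this point, suppose the page moves to a new location. The address under www.example.org below is invented for the sketch, and the enclosing topicMap element is added only so that the fragment is well-formed on its own. Only the resourceRef changes; the topic id, and therefore every association that references it, stays as it was.

```xml
<topicMap xmlns="http://www.topicmaps.org/xtm/1.0/"
          xmlns:xlink="http://www.w3.org/1999/xlink">
  <topic id="p-chromo">
    <instanceOf>
      <topicRef xlink:href="#doc"/>
    </instanceOf>
    <subjectIdentity>
      <!-- Hypothetical new address: the only line that changes. -->
      <resourceRef
        xlink:href="http://www.example.org/soleil/lexique/chromosphere.htm"/>
    </subjectIdentity>
    <baseName>
      <baseNameString>La Chromosphère</baseNameString>
    </baseName>
  </topic>
</topicMap>
```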

2.2. Document structures as associations

Once every document unit is represented by a topic, the unit-subunit relationships are represented by binary associations between the matching topics. To show that those associations are linked to the original document structure, and to distinguish them from any further recompositions, the identification of the original documentation is used as a scope. Such a process can be made fully automatic, either based on parsing existing documentation, or integrated in the authoring tools, with the topics and associations created as a background task when the document structure is validated.

The following example shows two web pages represented as subunits of the same directory. Note again that this association structure is completely independent of the actual addresses of the document units represented, but that it is validated in the scope of a domain namespace.

<association>
<instanceOf>
<topicRef xlink:href = "#subdiv"/>
</instanceOf>
<scope>
<resourceRef xlink:href = "http://www.universimmedia.com"/>
</scope>
<member>
<roleSpec>
<topicRef xlink:href = "#unit"/>
</roleSpec>
<topicRef xlink:href = "#d-hypergraph"/>
</member>
<member>
<roleSpec>
<topicRef xlink:href = "#subunit"/>
</roleSpec>
<topicRef xlink:href = "#p-hgtm1"/>
<topicRef xlink:href = "#p-hgtm2"/>
</member>
</association>

2.3. Links and references as associations

A similar mechanism is used to represent links and references, through specific associations. A very interesting consequence of representing links as associations with "origin" and "target" roles is that both topics are "aware" of the links, unlike ordinary hyperlinks, where the target is not informed of the origins of the links that point to it. Navigation between topics through "reflink" associations can therefore go in both directions. And here again, those links are completely independent of the actual document addresses.

<association>
<instanceOf>
<topicRef xlink:href = "#reflink"/>
</instanceOf>
<scope>
<resourceRef xlink:href = "http://www.universimmedia.com"/>
</scope>
<member>
<roleSpec>
<topicRef xlink:href = "#origin"/>
</roleSpec>
<topicRef xlink:href = "#p-photo"/>
</member>
<member>
<roleSpec>
<topicRef xlink:href = "#target"/>
</roleSpec>
<topicRef xlink:href = "#p-soleil"/>
<topicRef xlink:href = "#p-chromo"/>
</member>
</association>

The complete similarity of structure between subdivision associations on the one hand, and link and reference associations on the other, clearly illustrates the power of the topic map representation. Moreover, with just a slight modification of association type and role specification, the "reflink" association can be transformed into a more specific "navpath" association, linking a "previous" to a "next" document unit. Not only are links stored and managed independently of document units and browsable both ways, but they can also be typed. Last but not least, the scope mechanism makes it possible to assert a context of validity for any of them, allowing several different structures to coexist over the same documentation set.
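
A minimal sketch of such a "navpath" association is given here. It assumes that topics with the ids "navpath", "previous" and "next" have been declared in the topic map; they are named in the text but not shown in the examples, and the choice of the two pages is purely illustrative. The topicMap wrapper only makes the fragment well-formed on its own.

```xml
<topicMap xmlns="http://www.topicmaps.org/xtm/1.0/"
          xmlns:xlink="http://www.w3.org/1999/xlink">
  <!-- Same shape as the "reflink" association: only the
       association type and the two role topics differ. -->
  <association>
    <instanceOf>
      <topicRef xlink:href="#navpath"/>
    </instanceOf>
    <scope>
      <resourceRef xlink:href="http://www.universimmedia.com"/>
    </scope>
    <member>
      <roleSpec>
        <topicRef xlink:href="#previous"/>
      </roleSpec>
      <topicRef xlink:href="#p-photo"/>
    </member>
    <member>
      <roleSpec>
        <topicRef xlink:href="#next"/>
      </roleSpec>
      <topicRef xlink:href="#p-chromo"/>
    </member>
  </association>
</topicMap>
```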

3. Published Subjects, Index and Metadata

After this quick overview of the benefits of topic map representation in organizing the documentation layers, we'll now focus on indexing issues. Indexing uses another topic map layer on top of the previous one, and further improves the interoperability and reusability of the whole documentation structure.

3.1. The why, what and how of Published Subjects

The notion of Published Subject appeared in the context of the topic maps paradigm and its applications, but it has turned out to be a crucial element in systems looking for semantic interoperability. In short, a Published Subject is a subject - in the sense used by topic maps: whatever somebody wants to speak about, in other words a subject of conversation - for which a PSI has been made available.

A PSI is a two-sided object, corresponding to two expansions of the acronym. The PS Identifier is a stable URI, used in topic map documents and by topic map applications as an identifier of the subject. The PS Indicator is a resource describing the subject in an unambiguous, human-understandable way, retrievable from that URI.

To use efficiently the subjects available in controlled vocabularies, classifications, and domain or enterprise thesauri, topic map applications need to have this legacy available in the form of PSIs. The OASIS Topic Maps Published Subjects Technical Committee is delivering Requirements and Recommendations for the content and structure of standard PSIs: http://www.oasis-open.org/committees/tm-pubsubj/docs/recommendations/psdoc.htm

We'll see in the following sections how PSIs can be used both in document units metadata, and to create a layer of index topics over the document units topic map. The metadata will then be interpreted in terms of associations linking the document-unit-topics and the index-topics.

3.2. Metadata using Published Subjects

Adding metadata to documents is a pre-indexing task that can be managed at authoring time. Published Subject Identifiers (URIs) can be used to identify the metadata more accurately than simple words, even words chosen from controlled vocabularies. For example, the web page http://www.universimmedia.com/soleil/lexique/photosphere.htm could simply be given the following Dublin Core metadata:

<dc:subject>
photosphère
</dc:subject>

But such metadata is meaningful only in a well-defined context (solar astronomy and physics), and its interpretation outside this context may attach the document to other, irrelevant subjects. Using a PSI to define the subject makes the metadata more accurate and avoids any later misinterpretation. For example, a stable on-line astronomical thesaurus can be used as a PSI reference set, and the above metadata replaced by the more explicit definition:

<dc:subject>
<a href="http://www.mso.anu.edu.au/library/thesaurus/french/PHOTOSPHERE.html"/>
</dc:subject>

Any other kind of metadata can use the same mechanism, provided the set of matching PSIs is established. For example an enterprise can set stable PSIs for its employees. In such a framework

<dc:creator>
Bernard Vatant
</dc:creator>

could be replaced by

<dc:creator>
<a href="http://www.mondeca.com/people/bvt.html"/>
</dc:creator>

3.3. Index-topics representing Published Subjects

The above mechanism creates relationships internal to documents, and therefore has the same drawback as embedded links. To avoid that drawback, the same kind of solution is adopted: topics are created to represent the Published Subjects. Different names can be attached to those index topics, corresponding to different languages or scopes of use. In an integrated environment, the user interface presents the published subjects to authors under the name corresponding to their scope.

<topic id = "photosphere">
<instanceOf>
<topicRef xlink:href = "#subject"/>
</instanceOf>
<subjectIdentity>
<subjectIndicatorRef
  xlink:href = "http://www.mso.anu.edu.au/library/thesaurus/french/PHOTOSPHERE.html"/>
</subjectIdentity>
<baseName>
<scope>
<topicRef xlink:href = "#soleil-structure"/>
</scope>
<baseNameString>
photosphère solaire
</baseNameString>
</baseName>
<baseName>
<scope>
<topicRef xlink:href = "#soleil-observation"/>
</scope>
<baseNameString>
surface du Soleil
</baseNameString>
</baseName>
<baseName>
<scope>
<topicRef xlink:href = "#soleil-thermodynamique"/>
</scope>
<baseNameString>couche d'inversion
</baseNameString>
</baseName>
</topic>

The set of those "index-topics" is the topic map's internal representation of the PSI set. They are created, managed and associated independently of the document units on the one hand, and of the PSI set itself on the other.

3.4. From metadata to "index-associations"

The relationships asserted in document metadata can then be represented in the topic map structure. Documents could be directly attached as occurrences of the index-topics, but since the document units have been "reified" - represented by topics - in the topic map, the metadata relationship is better represented as a specific "index-association" between "doc-topics" and "index-topics", as in the following example.

<association>
<instanceOf>
<topicRef xlink:href = "#indoc"/>
</instanceOf>
<member>
<roleSpec>
<topicRef xlink:href = "#unit"/>
</roleSpec>
<topicRef xlink:href = "#p-photo"/>
</member>
<member>
<roleSpec>
<topicRef xlink:href = "#subject"/>
</roleSpec>
<topicRef xlink:href = "#photosphere"/>
</member>
</association>

The roles of both document and index topic in the association can be typed in a more specific way than "unit" and "subject". For example index-topic role can be declared as "main subject", "secondary subject", "keyword", "technical subject", "administrative subject". Moreover, the index-associations may be asserted in specific scopes.
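
A minimal sketch of such a refinement follows. It assumes that a role topic with the invented id "main-subject" has been defined in the topic map; it does not appear in the examples above. The topicMap wrapper only makes the fragment well-formed on its own.

```xml
<topicMap xmlns="http://www.topicmaps.org/xtm/1.0/"
          xmlns:xlink="http://www.w3.org/1999/xlink">
  <association>
    <instanceOf>
      <topicRef xlink:href="#indoc"/>
    </instanceOf>
    <member>
      <roleSpec>
        <topicRef xlink:href="#unit"/>
      </roleSpec>
      <topicRef xlink:href="#p-photo"/>
    </member>
    <member>
      <!-- Hypothetical role topic, more specific than "subject". -->
      <roleSpec>
        <topicRef xlink:href="#main-subject"/>
      </roleSpec>
      <topicRef xlink:href="#photosphere"/>
    </member>
  </association>
</topicMap>
```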

Clearly enough, index-associations can either be managed directly at the topic map administration level, independently of the metadata declared in the documents, or be extracted automatically by parsing that metadata. In an integrated environment, a mixed solution can be applied, with pre-indexing done at authoring time and further validation by the topic map administrator.

4. Reusing the documentation

In the above example, a topic map representation integrates the topics representing document units, the index-topics with their PSI references, the associations representing original document structures and links, and the index-associations. This structure can now be extended to create customized summaries and new paths of navigation in the documentation.

The website http://www.universimmedia.com is used here as the original documentation set. Topics are created to represent directories, pages, and images in the website. The original structures are subdivision of the site in directories and pages, and link structures. They are represented as explained above, the associations being asserted in the scope of the original documentation.

Document-unit topics are linked to subject-topics, themselves defined using the on-line Astronomical Thesaurus http://www.mso.anu.edu.au/library/thesaurus

4.1. Customized extracts and summaries

A new document unit, focused for example on "Atmosphère Solaire", can be created, starting from a main topic that uses as its PSI the Thesaurus URL http://www.mso.anu.edu.au/library/thesaurus/french/ATMOSPHERESOLAIRE.html

A summary for this document unit is built using index-topics linked, for example, by the RT (related term) relations defined on the above page of the Thesaurus, or by any other relevant organization. Each document-unit topic that has been indexed by one of those index-topics becomes part of the new document, attached to it by "unit-subunit" associations. The sequence of subunits in the new summary is managed by "previous-next" associations.
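
The construction can be sketched as follows, under stated assumptions: the ids "atmosphere-solaire" and "d-atmosphere" are invented for the illustration, the two pages are assumed to have been indexed by index-topics related to the main subject, and using the new document unit itself as the scope is one possible choice, by analogy with the scope used for the original structure.

```xml
<topicMap xmlns="http://www.topicmaps.org/xtm/1.0/"
          xmlns:xlink="http://www.w3.org/1999/xlink">
  <!-- Main index-topic, identified by the Thesaurus page. -->
  <topic id="atmosphere-solaire">
    <instanceOf>
      <topicRef xlink:href="#subject"/>
    </instanceOf>
    <subjectIdentity>
      <subjectIndicatorRef
        xlink:href="http://www.mso.anu.edu.au/library/thesaurus/french/ATMOSPHERESOLAIRE.html"/>
    </subjectIdentity>
  </topic>
  <!-- The new document unit (hypothetical id). -->
  <topic id="d-atmosphere">
    <instanceOf>
      <topicRef xlink:href="#doc"/>
    </instanceOf>
    <baseName>
      <baseNameString>Atmosphère Solaire</baseNameString>
    </baseName>
  </topic>
  <!-- Existing pages attached as subunits, in the scope of the
       new document so the original structure is left untouched. -->
  <association>
    <instanceOf>
      <topicRef xlink:href="#subdiv"/>
    </instanceOf>
    <scope>
      <topicRef xlink:href="#d-atmosphere"/>
    </scope>
    <member>
      <roleSpec>
        <topicRef xlink:href="#unit"/>
      </roleSpec>
      <topicRef xlink:href="#d-atmosphere"/>
    </member>
    <member>
      <roleSpec>
        <topicRef xlink:href="#subunit"/>
      </roleSpec>
      <topicRef xlink:href="#p-photo"/>
      <topicRef xlink:href="#p-chromo"/>
    </member>
  </association>
</topicMap>
```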

4.2. Customized navigation paths

Specific navigation paths can be created, for example to present the solar radial structure from inside out. The relevant index-topics are created or selected: core, radiative layer, convective layer, photosphere, chromosphere, corona, solar wind ... Those index-topics are then linked by "inside-outside" associations, with a specific scope "solar-radial-structure".
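
One step of such a path might look like the sketch below. The association type "inside-outside" and the scope "solar-radial-structure" are named in the text; the role ids "inside" and "outside" and the index-topic id "chromosphere" are assumptions made for the illustration.

```xml
<topicMap xmlns="http://www.topicmaps.org/xtm/1.0/"
          xmlns:xlink="http://www.w3.org/1999/xlink">
  <association>
    <instanceOf>
      <topicRef xlink:href="#inside-outside"/>
    </instanceOf>
    <scope>
      <topicRef xlink:href="#solar-radial-structure"/>
    </scope>
    <member>
      <!-- Hypothetical role: the inner of two adjacent layers. -->
      <roleSpec>
        <topicRef xlink:href="#inside"/>
      </roleSpec>
      <topicRef xlink:href="#photosphere"/>
    </member>
    <member>
      <!-- Hypothetical role: the outer of two adjacent layers. -->
      <roleSpec>
        <topicRef xlink:href="#outside"/>
      </roleSpec>
      <topicRef xlink:href="#chromosphere"/>
    </member>
  </association>
</topicMap>
```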

The same topics, along with additional ones such as granulation, eruptions, prominences and sunspots, could be used in a different navigation structure, presenting the Sun from its coolest to its hottest areas, in the scope "thermodynamics".

The same scopes can be attached to the relevant index-associations, to select along those navigation paths only the relevant document units. More refined scoping can include technical levels, the same navigation paths being documented for "beginner", "advanced" or "specialist" reading.
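
As a sketch, the index-association of section 3.4 could carry the scope of the navigation path in this way. It assumes that a topic with the id "solar-radial-structure" exists; a reading-level theme could presumably be handled in the same manner.

```xml
<topicMap xmlns="http://www.topicmaps.org/xtm/1.0/"
          xmlns:xlink="http://www.w3.org/1999/xlink">
  <association>
    <instanceOf>
      <topicRef xlink:href="#indoc"/>
    </instanceOf>
    <!-- The page is selected only when following this path. -->
    <scope>
      <topicRef xlink:href="#solar-radial-structure"/>
    </scope>
    <member>
      <roleSpec>
        <topicRef xlink:href="#unit"/>
      </roleSpec>
      <topicRef xlink:href="#p-photo"/>
    </member>
    <member>
      <roleSpec>
        <topicRef xlink:href="#subject"/>
      </roleSpec>
      <topicRef xlink:href="#photosphere"/>
    </member>
  </association>
</topicMap>
```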

4.3. A full-size ongoing use case

A full-size use case, including an on-line collaborative environment, is currently being developed. It uses Mondeca technology, with the General Multilingual Environmental Thesaurus (GEMET), developed by the European Environment Agency, as its Published Subjects database, and involves a consortium of organizations and companies. The aim is to create an interactive web site and a derived CD-ROM, to be released during the Johannesburg Earth Summit in September 2002.

Various expert editors will add topics and index on-line resources, and an editorial team will reorganize this content in various ways and for different levels of reading, making it a tool both for popularization and for finding specialized and advanced on-line technical information about environmental issues, solutions and regulation.

At the time of writing - March 1 - the project is at the stage of importing the Thesaurus and configuring the environment, and on-line references can't be given yet. They will be included in the conference presentation.

Biography

Bernard Vatant is a former high school mathematics teacher, who graduated in 1975 from ENSET (Cachan, France). His research interests have long been in knowledge representation and organization, particularly as applied to science popularization (astronomy). He has been working since the end of 2000 as a consultant for Mondeca, where he participates in the development of topic maps and vocabularies, and coordinates the Semantopic Map project. He has been a participating member of the XTM Authoring Group, and is a founding member and current chair of the OASIS Topic Maps Published Subjects Technical Committee.