Abstract
The growth of the Internet and the World Wide Web has brought about new models of interaction within newsgroups, discussion lists, web pages, etc., which are defining new collective social practices. The e-communities emerging from these practices have generated new needs in terms of knowledge management and processing. This paper explores the problem of knowledge acquisition and modeling of e-community interactions using the topic map standard. The distributed and highly dynamic nature of the resources involved in these discussion forums poses a new range of problems in terms of Topic Map construction. One important issue is that such volatile and heterogeneous resources are made up of multiple diverging (and sometimes contradictory) points of view and thus cannot be described by a single semantic model. Our work here focuses on how to construct (and update) a navigable semantic structure from multiple, fragmentary viewpoints. We use inductive Natural Language Processing techniques to identify semantic classes on the basis of word association patterns and to construct semantic dimensions which define the conceptual space of a viewpoint. These classes constitute potential topics in the resulting topic map. The associations among the topics are constrained to a certain conceptual space or scope. We show the advantages of modeling scopes as conceptual spaces based on a geometric structure. On the one hand, it provides a way to embed topics in an abstract n-dimensional description space where spatial relations between topics have a semantic interpretation. On the other hand, the topological structure of the space provides geometric constraints for comparing conceptual spaces and thus assessing the semantic convergences and divergences across different viewpoints.
Finally, we show how our approach for identifying semantic patterns in disparate resources and organizing them in a topic map framework can be useful for such tasks as domain monitoring, technology watch, opinion tracking, etc.
Topic Maps (ISO 13250) is a powerful standard for the semantic annotation of document collections. Several key features make it well suited to managing the dynamic, distributed corpora of today. Firstly, the separation between the semantic model and the information instances makes it possible to overlay different topic maps on the same collection of resources. Secondly, robust linking based on semantic construction makes the Topic Map standard suitable for annotating the dynamic and ephemeral resources typical of online information. Thirdly, flexible location mechanisms enable the addressing of heterogeneous resources of varying granularity. Finally, the standard's support for defining scopes makes it possible to model context-dependent knowledge and to define different points of view on a document collection.
However, the construction of a topic map can be very costly and can quickly become a bottleneck in any large-scale application unless automatic methods are used. Apart from the initial cost of defining the topic map, problems of maintenance and coherence may arise when the topic map is applied to renewable information sources, as manual construction and maintenance cannot keep pace with any significant volume of incoming documents.
A manual approach to topic map construction is appropriate if the conceptual model is stable and is linked to a circumscribed collection of resources, but it is ill suited to managing dynamic, loosely structured information sources. The volume of data channeled through the Internet and large intranets today requires not only indexing new resources 'on the fly' but also restructuring the semantic model as new concepts and varying points of view spring up.
One possible approach to topic map construction involves recycling of structured or semi-structured data and exploiting pre-existing knowledge sources for semantic indexing. This implies that a semantic model is available in advance, to which information instances are attached in a top-down manner.
However, the large majority of electronic resources available today are unstructured. This has generated a need to extract semantic information and build topic maps from unrestricted text for which the vocabulary and conceptual models are not known in advance.
In contrast to a top-down approach, in this latter data-driven approach a document collection is not a passive repository of topic instances but rather a tool for discovery from which semantic categories are made to emerge through inductive methods. This approach has been applied within the context of 'monitor corpora' such as the electronic version of the Wall Street Journal. The periodical analysis of these corpora is aimed at monitoring a given domain in view of detecting emerging topics, for a technology watch task, for instance.
The work we present in this paper describes an inductive method for building topic maps from unrestricted text. Our approach does not use pre-existing knowledge sources, but rather exploits regularities of word pattern distribution within a collection of documents.
The work presented in this paper has been carried out within the framework of the Alliances project. This project involves two groups (the Language, Information and Representations group and the Architectures and Models for Interaction group) of the Human-Machine Communication Department of LIMSI [1] and the natural language processing group at XEROX's European research center (XRCE). The aim of the project, which has been commissioned by the French ministry of research [2] and a non-governmental organization (Foundation Charles Léopold Mayer for the Progress of Humankind [3]), is to construct worldwide associative networks to collect and capitalize experience on a variety of different subjects such as energy efficiency, education, socio-economy of solidarity, etc. This knowledge is then used to define perspectives for action. The networks are implemented in the form of electronic discussion forums centered on a given subject. The different countries participating in the debate host local discussion forums on the subject and then participate in worldwide forums aimed at articulating the knowledge gathered at a local level.
The Alliances project thus involves capitalizing knowledge in a very distributed environment and from multiple viewpoints.
The idea of using inductive methods to extract semantic information from large corpora has been the focus of much recent research. Most work pursued in this direction seeks to find semantic relatedness between words on the basis of shared distribution patterns. The hypothesis underlying this approach is that words that co-occur in similar contexts are semantically related within the corpus under study.
This line of work has led to two different approaches in semantic acquisition. The first approach is aimed at identifying words which often appear together in the same contexts. This is the direction pursued in particular by Church [church-et-hanks90], who compares the probability of occurrence of word pairs to the probability of each word occurring separately in order to infer mutual information [4] between pairs of words. Word pairs with a high mutual information score provide potential semantic relations. The second approach aims at calculating groups of words which share similar cooccurrence patterns but do not necessarily appear simultaneously in the same contexts. This is the direction pursued in Grefenstette's [grefenstette94a] and Hindle's work [hindle90]. Thus, while the first approach produces collocation patterns, the second approach produces equivalence classes of words, that is to say, words which can be substituted in certain contexts and are therefore partially synonymous.
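As an illustration of the first approach, the mutual information score of note [4] can be sketched as follows. This is a minimal sketch under simplifying assumptions, not the procedure of [church-et-hanks90]: probabilities are estimated as the fraction of contexts containing a word or a pair, and the function name `mutual_information` and the toy contexts are invented for the example.

```python
import math
from collections import Counter

def mutual_information(contexts):
    """Return {(x, y): log2(P(x,y) / (P(x) P(y)))} for word pairs seen together.

    `contexts` is a list of word lists. Probabilities are estimated as the
    fraction of contexts in which a word (or an unordered pair) occurs,
    which is a simplifying assumption made for this sketch.
    """
    n = len(contexts)
    word_counts = Counter()
    pair_counts = Counter()
    for context in contexts:
        words = sorted(set(context))          # count each word once per context
        word_counts.update(words)
        for i, x in enumerate(words):
            for y in words[i + 1:]:
                pair_counts[(x, y)] += 1      # keys are alphabetically ordered pairs
    scores = {}
    for (x, y), joint in pair_counts.items():
        p_xy = joint / n
        p_x = word_counts[x] / n
        p_y = word_counts[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy contexts, invented for illustration only.
toy_contexts = [
    ["money", "exchange"],
    ["money", "exchange"],
    ["money", "energy"],
    ["energy", "policy"],
]
mi_scores = mutual_information(toy_contexts)
```

A positive score indicates that two words occur together more often than their individual frequencies would predict, and a negative score indicates the opposite.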
The same distributional hypothesis underlies most work aimed at finding similarity measures between documents within the Information Retrieval (IR) paradigm [salton-et-mcgill83].
Although the different approaches cited above all fit into a distributional framework and are based on the same vector representation of words and contexts [5], the definition of a context, in other words the granularity of cooccurrence, varies. This leads to different kinds of categorizations.
In IR, the document is most commonly used as a cooccurrence unit. However, in semantic acquisition document-level cooccurrence is too coarse-grained: documents can vary enormously in length and manifest internal variation in terms of subject and genre. As a result, words that cooccur in the same document are not necessarily related in any relevant way.
Therefore, a more restrictive definition is commonly used: a structural unit such as a paragraph, or a graphical unit such as a window of n words.
The advantage of this graphical approach is that no Natural Language Processing (NLP) is needed for calculating word cooccurrence; commonly, only a stop list is used to filter out non-content words (prepositions, deictics, etc.).
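A window-based cooccurrence count of this kind can be sketched in a few lines. The sketch below is illustrative only: the stop list, the function name `window_cooccurrences` and the choice to apply the window to the content words that remain after filtering are all assumptions of the example.

```python
from collections import Counter

# Illustrative stop list; a real one would be much longer.
STOP_WORDS = {"the", "of", "in", "a", "and", "is", "for"}

def window_cooccurrences(tokens, window=3, stop_words=STOP_WORDS):
    """Count unordered pairs of content words lying within a span of `window` content words."""
    content = [t.lower() for t in tokens if t.lower() not in stop_words]
    counts = Counter()
    for i, word in enumerate(content):
        # The window covers the current word and the (window - 1) words after it.
        for other in content[i + 1:i + window]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts

tokens = "the exchange of money in the social economy".split()
pairs = window_cooccurrences(tokens, window=3)
```

With a window of three words, `exchange` and `economy` are too far apart to be counted as cooccurring, whereas `exchange` and `money` are counted once.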
More recently, the availability of probabilistic part-of-speech taggers, robust noun-phrase extractors, data-oriented parsing (DOP) and stochastic tree grammars [bod95] has led to approaches which further restrict the definition of a context through linguistic constraints. These NLP tools identify syntactic dependencies between words (for instance between a subject and a verb, or a noun and its modifiers) which enable the recognition of units of meaning (sentence, noun phrase, etc.). This information is then exploited to implement more focused definitions of cooccurrence contexts than those used in graphical window-based approaches. Hindle [hindle90], for instance, extracts noun similarities from contexts defined as predicate-argument structures. Grefenstette's system SEXTANT [grefenstette94b] for extracting corpus-specific semantic relations from very large corpora also uses a set of linguistic features for defining cooccurrence contexts.
In the work presented here, we have also used limited linguistic information to define the contexts by which word similarity is calculated. We use the tool Zellig [habert-et-fabre99] to construct semantic classes on the basis of word distribution within normalized syntactic trees. In the following section we explain how we have used this approach to build topic maps.
Zellig is a text mining tool aimed at automating the construction of semantic classes from corpora of arbitrary length based on a distributional framework. It has been used on specialized corpora in nuclear energy, in the field of coronary diseases within the European Medical Language Processing project Menelas [zweigenbaum-et-al94] and on political texts (the French president Mitterrand's broadcast speeches and interviews between 1981 and 1988) [fabre-et-habert98].
Zellig recycles parse trees produced by two NP (noun phrase) extractors: AlethIPGN (developed within the European Eureka GRAAL project) and Lexter [bourigault93]. It has also been adapted to work on sentence-level parse trees generated by the dependency extractor XRCE, developed at XEROX. All three of these software tools are robust parsers which can analyze corpora of arbitrary length. Zellig reduces the output of these tools to elementary dependency trees, which reflect the essential binary relations between content words. These binary relations constitute the contexts on which word similarity is judged using a weighted Jaccard measure [6].
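The similarity computation can be sketched with the unweighted form of the measure given in note [6]. The weighting scheme used by Zellig is not described here, so the sketch below omits it. Apart from the contexts argent social and échange social mentioned in note [7], the contexts are invented for illustration, and the tilde marks the position of the word in its elementary context.

```python
def jaccard(contexts_m, contexts_n):
    """Unweighted Jaccard measure a / (a + b + c) between two sets of contexts."""
    m, n = set(contexts_m), set(contexts_n)
    a = len(m & n)   # contexts shared by both words
    b = len(m - n)   # contexts unique to the first word
    c = len(n - m)   # contexts unique to the second word
    total = a + b + c
    return a / total if total else 0.0

# Hypothetical elementary contexts for three words.
word_contexts = {
    "argent":  {"~ social", "~ public", "circulation de ~"},
    "échange": {"~ social", "~ commercial", "circulation de ~"},
    "énergie": {"~ solaire", "~ public"},
}
sim = jaccard(word_contexts["argent"], word_contexts["échange"])
```

In this toy data, argent and échange share two of their four distinct contexts, giving a score of 0.5, while argent and énergie share only one of four.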
On the basis of this similarity score, Zellig computes a graph aimed at displaying sub-groupings of similar words. The words constitute the nodes of the graph, and an edge links two words when the number of contexts they share reaches a chosen threshold. Zellig also computes the connected components (the sub-graphs in which there is a path between every pair of distinct nodes) and the k-cliques (the sub-graphs in which there is an edge between each node and every other node of the sub-graph). From a topological point of view, connected components and k-cliques are the most relevant parts of the graph, and represent subsets of words which are strongly interwoven. Figure 1 [7] shows a k-clique computed from Lexter parse trees from the analysis of a corpus of the Alliances project.
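The graph construction described above can be sketched as follows. Zellig's own algorithms are not detailed in this paper, so the sketch substitutes standard techniques: a depth-first traversal for connected components and the Bron-Kerbosch enumeration of maximal cliques. The function names, the threshold value and the shared-context counts are assumptions made for the example.

```python
def build_graph(shared, threshold):
    """Adjacency sets linking word pairs that share at least `threshold` contexts."""
    graph = {}
    for (u, v), n_shared in shared.items():
        graph.setdefault(u, set())
        graph.setdefault(v, set())
        if n_shared >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph

def connected_components(graph):
    """Sets of nodes such that a path exists between every pair of members."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components

def maximal_cliques(graph, r=frozenset(), p=None, x=frozenset()):
    """Bron-Kerbosch enumeration (without pivoting) of maximal cliques."""
    if p is None:
        p = frozenset(graph)
    if not p and not x:
        yield set(r)
    for v in list(p):
        yield from maximal_cliques(graph, r | {v}, p & graph[v], x & graph[v])
        p = p - {v}
        x = x | {v}

# Hypothetical numbers of shared contexts between word pairs.
shared = {
    ("money", "exchange"): 3,
    ("money", "currency"): 2,
    ("exchange", "currency"): 2,
    ("energy", "oil"): 2,
    ("money", "energy"): 1,
}
graph = build_graph(shared, threshold=2)
components = connected_components(graph)
cliques = [c for c in maximal_cliques(graph) if len(c) >= 3]
```

With a threshold of two shared contexts, the link between money and energy is dropped, so the toy graph splits into two connected components, one of which is also a clique of three words.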
Zellig also computes similarity lists for each word of the corpus, that is to say a list of a word's 'nearest neighbours' in terms of shared cooccurrence patterns.
Finding semantic similarities between words on the basis of shared context patterns addresses an important issue in topic map construction, as it allows words that express the same concept to be aggregated automatically. Such aggregations constitute potential topics. Conversely, it also provides a way of aggregating similar contexts based on the similarity of their word profiles. Such semantically related contexts can provide very fine-grained anchors for content-based navigation.
However, constructing a topic map involves not only a process of aggregation of semantically related occurrences but also a characterization process. This is necessary to address the dual nature of topics: from an extensional point of view, a topic is described by its occurrences, but a topic also has an intensional side which is expressed through the notion of its subject identity. Therefore, in topic map construction it is important to have a method not only for grouping occurrences on the basis of semantic similarities but also for characterizing the similarities. We have chosen a dimension scaling method (correspondence analysis) to produce a geometric representation of the conceptual space of a topic map in which similarities between topics can be measured and explained in terms of semantic dimensions.
Correspondence analysis [benzecri92] (like other related dimension-reducing techniques such as principal components analysis or factor analysis) aims at producing low-dimensional representations of object-by-attribute data [8] (words by Zellig contexts in our model). These techniques are thus used to distill synthetic information from large object-by-attribute matrices. The dimensions of the reduced space are linear combinations of the attributes of the full dimensional space. The dimensions can therefore be considered as 'synthetic' or artificial concepts defined by a set of correlated attributes. These techniques thus address the characterization problem mentioned above, in the sense that the resulting reduced space can explain the similarities among objects in terms of dimensions or abstract attributes. Dimensional scaling has the effect of reducing the initial description space, where words are described by all the contexts, to a smaller semantic space where word similarity is approximated by values on this reduced number of dimensions. Latent Semantic Indexing (LSI) [deerwester-et-al90] is based on a similar dimension-reduction technique aimed at replacing individual words as the descriptors of documents by synthetic or 'latent' concepts that can be expressed by one or several words.
The correlations between the original descriptors and the new variables or dimensions, also called loadings, provide an indication of the influence a descriptor has in the constitution of the dimension and thus are explanatory of the structure of the conceptual space. They can then be used to label and interpret the dimensions.
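The reduction step can be sketched with a textbook formulation of correspondence analysis based on a singular value decomposition, as mentioned in note [8]. This is a minimal sketch assuming NumPy is available and that every word and every context occurs at least once. The function name and the toy word-by-context counts are invented, and the column coordinates are used here only as a rough stand-in for the loadings discussed above.

```python
import numpy as np

def correspondence_analysis(counts, n_dims=2):
    """Principal coordinates of rows (words) and columns (contexts) of a count table."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row masses (words)
    c = P.sum(axis=0)                     # column masses (contexts)
    # Standardized residuals: D_r^(-1/2) (P - r c^T) D_c^(-1/2)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sigma) / np.sqrt(r)[:, None]      # word coordinates
    cols = (Vt.T * sigma) / np.sqrt(c)[:, None]   # context coordinates
    inertia = sigma ** 2                          # inertia carried by each axis
    return rows[:, :n_dims], cols[:, :n_dims], inertia[:n_dims]

# Hypothetical counts: 4 words (rows) by 3 contexts (columns).
counts = np.array([
    [4, 0, 1],
    [3, 1, 0],
    [0, 5, 2],
    [1, 4, 3],
])
rows, cols, inertia = correspondence_analysis(counts, n_dims=2)
```

Words with similar context profiles receive nearby coordinates in `rows`, and the contexts with the largest absolute coordinates on an axis are natural candidates for labelling that axis.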
Correspondence analysis allows us on the one hand, to produce a spatial embedding of the word associations generated by Zellig and on the other hand to characterize the semantic axes of a space in which similarities between words can be represented as (inverse) distances.
An important property of the resulting geometric representation is that it provides a continuous model for word meanings. Words which have a semantic intersection in a given corpus appear in the same region of the space. This is in contrast to a discrete approach to word meaning such as that of WordNet, where meanings are represented as nodes in a network.
The discrete approach of WordNet is adapted when it is important to discriminate between clearly distinct poles of a word's meaning. WordNet has thus been used for word sense disambiguation in order to increase precision in IR queries [sussna93]. However, a model for describing the fluctuations of word meanings within different points of view requires, on the contrary, a continuous model where it is possible to represent a gradual blending from one meaning to another. This continuous approach has been pursued in particular in [grefenstette94a], [fuchs94] and [habert-et-al2001].
Topics are the building blocks of the topic map standard and are groups of occurrences related to a given subject.
Our work focuses on describing document collections in terms of their lexical content, so words are the building blocks of our model and are defined as topics. A word's normalized form (stem) constitutes the base name of the topic. Its occurrences are the locations of its associated contexts in the corpus.
As explained in the preceding section, a web of word associations for a specific corpus is computed by Zellig. This provides on the one hand, a list of nearest neighbours for each word of the corpus, and on the other hand groupings of words (connected components and k-cliques) which represent semantically consistent sub areas of the graph of word associations.
The Topic Map standard enables the definition of relations between topics through association constructs. We have therefore modeled these different types of relationships as topic map associations. We have defined 3 association types: neighbour, connected components and k-cliques.
The neighbour association connects a word with each word of its similarity list, as computed by Zellig on the basis of a Jaccard score. An instance of the neighbour association is created for a given word with each one of the words on its similarity list. The neighbour association therefore represents a binary relation, but it is important to note that the relation is not reciprocal. The similarity list of a given word is computed by 'locally' focusing on a pivotal word and computing the words that share the most contexts with it. It is only by crossing local similarity lists that Zellig finds groups of words that are 'reciprocally nearest neighbours' (k-cliques). Therefore, in a neighbour relation, the roles of the two members are not symmetrical.
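The non-reciprocal nature of the neighbour relation can be seen in a small sketch. The code below builds a similarity list for each pivotal word from an unweighted Jaccard score and keeps only the top k entries. The function name, the cut-off k and the context sets are assumptions made for the example.

```python
def similarity_lists(word_contexts, k=1):
    """For each hub word, its k nearest neighbours ranked by Jaccard score."""
    def jaccard(m, n):
        union = len(m | n)
        return len(m & n) / union if union else 0.0

    lists = {}
    for hub, hub_contexts in word_contexts.items():
        scored = [
            (jaccard(hub_contexts, contexts), other)
            for other, contexts in word_contexts.items()
            if other != hub
        ]
        # Highest score first; ties broken alphabetically for reproducibility.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        lists[hub] = [other for score, other in scored[:k] if score > 0]
    return lists

# Hypothetical context identifiers for three words.
word_contexts = {
    "debt":     {"c1"},
    "money":    {"c1", "c2", "c3"},
    "exchange": {"c2", "c3", "c4"},
}
lists = similarity_lists(word_contexts, k=1)
# One (hub, neighbour) pair per neighbour association instance.
associations = [(hub, nb) for hub, nbs in lists.items() for nb in nbs]
```

Here money is the nearest neighbour of debt, but the nearest neighbour of money is exchange, so the pair (debt, money) yields an association while (money, debt) does not.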
The topic map standard allows the distinction of the roles played by the different members in an association by means of anchor roles.
We have distinguished these two roles in our model by defining two different anchor roles: hub, representing the pivotal word from which the nearest neighbours are computed, and neighbour, for its nearest neighbour.
In addition, a neighbour association has a number of occurrences constituted by the contexts shared by the two words of the association. Navigating through these occurrences provides a way to interpret the association between two words, as a neighbour of a given word can be similar to it along different axes of that word's meaning. As associations are not reified in the topic map standard, in order to attach occurrences to an association, it is necessary to define a topic that points to the association as its subject indicator. The occurrences are thus assigned to the association topic.
Connected components and k-cliques are also association types. Each connected component and k-clique generated by Zellig is defined as an instance of an association of that type and it groups all the words of the k-clique or CC. The members of these associations thus all play the same role.
It is important to note that the name of a CC or k-clique is not given automatically by Zellig. The labelling and characterization of a CC and k-clique requires a human interpretation which we consider essential to the categorization process.
The focus of the work we present here is to define a topic map from different points of view, in order to characterize the diverging conceptualizations reflected in distributed, unstructured, heterogeneous document collections.
Unstable document collections like the collection of resources that make up the Alliances project cannot be described by one single conceptual model, as many different points of view are represented. This makes the Alliances project typical of the distributed resources available on the web today, where information from multiple sources is interchanged within web pages, newsgroups, etc. with very little (or no) editorial process. Individuals contributing to these pools of information may possess different representations, and will often have a different understanding of the meaning of terms, which reflects the fact that they have fundamentally different concepts for understanding a domain. This is especially the case in non-technical domains, where boundaries between word meanings are less clear-cut, and the potential vagueness of the underlying concepts is greater. An example is political or social debate, where key words are not understood in the same way and are a constant source of disagreement as different parties try to redefine the debate in their own terms.
Our approach to constructing a topic map from multiple viewpoints is based on the hypothesis that it is possible to identify semantic convergences or divergences according to the stability or variation of words' cooccurrence patterns. By comparing the web of associations of a given word (its 'nearest neighbours' in terms of shared cooccurrence patterns) in different parts of a corpus, one can identify differences in point of view within a document collection.
Therefore our approach consists in partitioning a document collection into different views. We define a view as a bounded subset of resources which are extracted from the document collection and grouped in terms of a set of criteria. In the present study we have used external criteria such as 'geographic origin' in order to create views on the document collection. The subset of texts which make up each view are then analysed by Zellig as well as by the dimension scaling technique described above. The semantic relations resulting from the analysis of each view are integrated into a topic map with the aim of enabling navigation in terms of multiple viewpoints. The aim is also to compare and measure differences in the topology of word association across these different views in order to assess the stability of a given topic.
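The partitioning step can be sketched as a simple grouping of documents by an external criterion. The document fields (`origin`, `text`) and the function name are hypothetical. The region codes follow the four geographic views used later in the paper.

```python
from collections import defaultdict

def partition_into_views(documents, criterion):
    """Group document texts into views keyed by the value of an external criterion."""
    views = defaultdict(list)
    for doc in documents:
        views[doc[criterion]].append(doc["text"])
    return dict(views)

# Hypothetical forum contributions tagged with their geographic origin.
documents = [
    {"origin": "LA", "text": "first contribution"},
    {"origin": "E", "text": "second contribution"},
    {"origin": "LA", "text": "third contribution"},
    {"origin": "USC", "text": "fourth contribution"},
    {"origin": "I", "text": "fifth contribution"},
]
views = partition_into_views(documents, "origin")
```

Each resulting view is then analysed separately, so that the same word can receive a different web of associations in each view.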
As described in the preceding section, we have structured the topic map on the basis of word associations. However, a word's associations may vary from one view to another and this needs to be reflected in the resulting topic map.
Constraining the web of associations of a given word to a certain view is possible through a key feature of the topic map standard, which is the notion of scope. As defined in the standard, scope defines the domain of validity of the characteristics associated with a topic (its name, its occurrences). It can also constrain the domain of validity of an association between topics. We have thus defined the associations between words (neighbours, CC, KC) as valid within the scope of a view. The scope can be defined explicitly by a topic or group of topics (called 'scoping topics') or left unconstrained.
We have chosen to define a scoping topic called 'semantic space' which is a geometric representation of the description space of a view yielded through the correspondence analysis of the word by context matrix.
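A scoped neighbour association of this kind could be serialized in XTM 1.0 syntax roughly as sketched below, using Python's standard `xml.etree.ElementTree` module. The topic identifiers (`neighbour-association`, `semantic-space-LA`, `money`, `exchange`) and the helper functions are hypothetical names chosen for the example, not the identifiers used in our topic map.

```python
import xml.etree.ElementTree as ET

XTM = "http://www.topicmaps.org/xtm/1.0/"
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("", XTM)
ET.register_namespace("xlink", XLINK)

def topic_ref(parent, topic_id):
    """Append a <topicRef> pointing at a topic of the same document."""
    ref = ET.SubElement(parent, f"{{{XTM}}}topicRef")
    ref.set(f"{{{XLINK}}}href", f"#{topic_id}")
    return ref

def neighbour_association(hub, neighbour, scope):
    """Build an <association> with hub and neighbour roles, valid within one scope."""
    assoc = ET.Element(f"{{{XTM}}}association")
    topic_ref(ET.SubElement(assoc, f"{{{XTM}}}instanceOf"), "neighbour-association")
    topic_ref(ET.SubElement(assoc, f"{{{XTM}}}scope"), scope)
    for role, player in (("hub", hub), ("neighbour", neighbour)):
        member = ET.SubElement(assoc, f"{{{XTM}}}member")
        topic_ref(ET.SubElement(member, f"{{{XTM}}}roleSpec"), role)
        topic_ref(member, player)
    return assoc

# Hypothetical identifiers: 'money' is the hub, 'exchange' its neighbour,
# and 'semantic-space-LA' the scoping topic of one view.
assoc = neighbour_association("money", "exchange", "semantic-space-LA")
xtm_text = ET.tostring(assoc, encoding="unicode")
```

The same pair of words can appear in several such associations, one per view, each carrying a different scoping topic.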
The dimension reduction produced by correspondence analysis can be considered a form of abstraction by which the full dimensional space described by all contexts is reduced to a smaller space of semantic axes where word-topics can be positioned by specifying their value along each dimension's axis.
The advantage of defining scopes as conceptual spaces is that the topological structure of the space provides an explanatory way of representing similarities between topics in a view, as it provides a metric for the topic map in which distances have a semantic interpretation.
Our method of semantic analysis leads to a series of snapshots of a word's associations in different views.
As described in the preceding section, these word associations structure the topic map and provide navigation paths within a given view.
However our aim is not only to enable navigation within particular views, but also to contrast different views and thus to measure the stability of a topic across different scopes or conceptual spaces.
By comparing the associative patterns of a word in different views we aim to determine whether there is an underlying semantic structure which is invariant across a range of different views; this common structure would indicate that there is a strong intersection of the word's meaning across different views. On the contrary, if the patterns of association manifest a great degree of variation, this indicates that the meaning of a word in different views is disjoint.
One can draw an analogy between this aim and the process of 3D model reconstruction from a sequence of two-dimensional images: two-dimensional images are related to their corresponding three-dimensional structure in an intricate manner, as many features are distorted and a dimension is lost through the process of projection.
In an analogous way, correspondence analysis, like a camera, is a projective engine that generates low-dimensional representations. Building a conceptual model of heterogeneous document collections such as that of the Alliances project can only be done in an incremental way, by extracting partial views and creating conceptual sub-spaces through correspondence analysis. The problem is then how to compare projective sub-spaces of different dimensions and how to keep track of an object's topological relations across different subspaces in order to assess the object's stability.
The method we have designed consists in comparing the associations of a word-topic in different scopes by projecting the word-topic of each scope into a common space [9].
Figure 2 shows the projection into a common space of the word-topic money from 4 different scopes (the scopes correspond to a partitioning of the corpus into 4 views in terms of geographic origin: Latin America (LA), Europe (E), United States + Canada (USC), Indonesia (I)). The dispersion of the points of the resulting configuration can be interpreted as an instability of the topic. This contrasts with Figure 3, where the configuration of the word-topic group in the common projected space is much more centered. In this second case, our hypothesis is that the word's meaning fluctuates less across different views.
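One simple way to quantify such dispersion is the mean distance of a word-topic's projected points to their centroid. This is an assumed measure offered as a sketch, not the one used to produce the figures, and the coordinates below are invented rather than read from them.

```python
import math

def dispersion(points):
    """Mean Euclidean distance of the points to their centroid."""
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    return sum(math.dist(p, centroid) for p in points) / len(points)

# Invented coordinates of two word-topics in the common space, one point per view.
money = {"LA": (1.0, 0.0), "E": (-1.0, 0.0), "USC": (0.0, 1.0), "I": (0.0, -1.0)}
group = {"LA": (0.1, 0.0), "E": (-0.1, 0.0), "USC": (0.0, 0.1), "I": (0.0, -0.1)}

money_dispersion = dispersion(list(money.values()))
group_dispersion = dispersion(list(group.values()))
```

Under this measure, a larger value suggests that the word-topic is less stable across views, so in the invented data money would be judged less stable than group.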
The interface that we have implemented is designed to enable navigation through different viewpoints and to track the stability of a topic's associations across different views. Currently we generate an HTML version of the topic map with XSLT stylesheets from the XTM format.
The user accesses the topic map through a list of word-topics that can be sorted either by frequency or in alphabetical order. When a word-topic is selected, the corresponding associations are displayed: both the connected components and k-cliques of which the word-topic is a node, and its nearest neighbours. The scope of these associations is also presented to the user; in other words, the system indicates to the user in what conceptual space these associations occur. The user then selects a conceptual space to explore. By clicking on a nearest neighbour of the selected word-topic, the user can initiate navigation among the occurrences in the corpus shared by both the word-topic and the nearest neighbour.
Finally, for each word-topic, a graphical representation of the common space where the instances of the topic in different views are projected is displayed as an SVG graphic.
We are exploring issues related to the semantic acquisition and structuring of very distributed and dynamic resources within the Topic Maps paradigm. These issues are becoming increasingly important to deal with the new forms of knowledge interchange that have emerged as a result of the development of the world wide web. The quantity and variety of data available on the web today opens new prospects in knowledge acquisition and structuring that require a dynamic, inductive approach to automatically detect semantic similarities from disparate resources.
We feel that text mining methods based on regularities of word pattern associations provide a robust method for extracting semantic information, as these methods do not rely on structured or pre-categorized data. In this paper we have presented a method for building Topic Maps using inductive natural language processing techniques.
Furthermore, we have shown that by keeping track of words' association patterns it is possible to detect fluctuations in words' meanings which can reveal different points of view. We exploit these fluctuations to build a Topic Map that enables navigation from multiple viewpoints. These viewpoints are reflected in the resulting Topic Map as different scopes of a topic.
We feel that the possibility of contextualizing a topic through the notion of scope constitutes one of the key features of the Topic Map standard, and makes it especially well suited (as opposed to RDF, for instance) to structuring distributed document collections which cannot be described by a stable ontology.
Finally, we show how to model scopes as conceptual spaces based on a geometric structure through the use of a dimension-reducing technique (correspondence analysis). We feel that the method we present for embedding topics in geometric spaces has important applications. Firstly, in Topic Map visualization, it provides a metric for constructing 2-D or 3-D representations of the conceptual space of a topic map. Secondly, in Topic Map merging, our method provides geometric constraints for comparing and merging conceptual spaces. These constraints are based on the topological structure of the space and therefore provide a way of performing semantic merging. We believe that finding methods for merging and constructing conceptual spaces from partial views is one of the goals of the semantic web.
[bod95] Bod, R. The Problem of Computing the Most Probable Tree in Data Oriented Parsing and Stochastic Tree Grammars. Proceedings, EACL'95. Dublin. 1995.
[bourigault93] Bourigault, D. An Endogeneous Corpus-Based Method for Structural Noun Phrase Disambiguation. Proceedings, 6th Conference of the European Chapter of the Association for Computational Linguistics (EACL'93), pp. 81--86. Utrecht. 1993.
[church-et-hanks90] Church, K. W., Hanks, P. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics. 1990.
[deerwester-et-al90] Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science. Vol. 41, pp. 391--407. 1990.
[fabre-et-habert98] Fabre, C., Habert, B. Acquisition de relations entre mots pour une lecture sémantique de corpus. In proceedings JADT'98, pp. 273--282. 1998.
[fuchs94] Fuchs, C. The challenges of continuity for a linguistic approach to semantics. In Continuity in linguistic semantics. Eds. Catherine Fuchs and Bernard Victorri, pp. 93--110. John Benjamins. Amsterdam. 1994.
[grefenstette94a] Grefenstette, G. Corpus-derived first, second and third order affinities. EURALEX. Amsterdam. 1994.
[grefenstette94b] Grefenstette, G. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publisher. Dordrecht, The Netherlands. 1994.
[habert-et-fabre99] Habert, B., Fabre, C. Elementary dependency trees for identifying corpus-specific semantic classes. Computers and the Humanities. Vol 33, n. 3, pp. 207--219. 1999.
[habert-et-al2001] Habert, B., Folch, H., Illouz, G. Mots stables et mots mouvants. Sémiotiques. Num. 17, pp. 121--148. 2001.
[hindle90] Hindle, D. Noun classification from predicate argument structures. Proceedings, 28th Annual Meeting of the Association for Computational Linguistics (ACL'90), pp. 268--275. Berkeley, CA. 1990.
[salton-et-mcgill83] Salton, G., McGill, M. J. Introduction to Modern Information Retrieval. McGraw-Hill. New York. 1983.
[1] LIMSI (www.limsi.fr) is a research lab of the French foundation for research (CNRS) specialized in mechanical engineering and human-machine communication.
[2] Alliances is an RNRT project (www.telecom.gouv.fr/rnrt/index_net.htm).
[3] www.fph.ch.
[4] The mutual information of two words x and y is $\log_2 \frac{P(x,y)}{P(x)\,P(y)}$, where P(x,y) is the joint probability of the events x and y, and P(x) and P(y) are the probabilities of each individual event.
[5] In a distributional approach, words and contexts can be modeled by a symmetrical two-way table. Each row of the table represents a word and each column represents a context of the corpus. The value in each matrix position represents the frequency of the corresponding word in the corresponding context.
[6] The Jaccard similarity measure between two objects m and n (words, in our case) is defined by the formula $a/(a+b+c)$, where a is the number of shared attributes (contexts), b the number of unique attributes possessed by object m and c the number of unique attributes possessed by object n.
[7] Note on each edge the contexts which are shared by the words (the nodes). The tilde stands for the edge node. For instance, tilde social between nodes argent and échange implies the two elementary contexts argent social and échange social
[8] This is done by a singular value decomposition to reduce the matrix to its principal axes.
[9] Projection into a common space is generated from the correspondence analysis of the matrix constructed by the 100 most frequent words of the overall vocabulary of all 4 scopes and the contexts (elementary dependencies) associated to these words in each view.