XML 2003

Towards semantic interoperability of XML vocabularies. A brief note.

Abstract

This short research note sketches a direction for linking ontologies to information extraction from XML documents. Only minimal assumptions about ontologies are made. We propose XPath/XSLT-based rules, bound to concepts from a chosen ontology, as a declarative specification of a mechanism capable of 'lifting' content fragments from markup-specific encodings to an independent, concept-tagged extraction format.


Table of Contents

1. Introduction
2. Ontologies
3. A sketch of the proposed approach
4. Future work
Biography

1. Introduction

We address the issue of common semantics in XML documents created using different vocabularies. Today an abundance of (local) information exchange standards exists, created within different industries. The standards partially overlap semantically, yet they express potentially shareable information in non-unified ways. The resulting lack of inter-standard interoperability will not disappear overnight. A common, universally shared standard, such as UBL, is not yet in place.

There are reasons for the multiplicity of 'local' standards: they evolve out of previous practices of electronic data processing within industry groups, respecting the need for continued interoperability with legacy applications. Metaphorically speaking, a number of islands emerged from the sea following their own natural processes; there is an opportunity for trade between the islands before a complete continent emerges.

Complete translational equivalence between different vocabularies is likely not achievable in the general case: documents will contain instances of non-overlapping concepts, or they may depend on information items from a locally shared context. Our approach therefore concentrates on deriving value from semantic equivalence between subsets of content items.

The standards work because communicating parties agree on the details of a chosen markup convention. The markup acquires procedural semantics defined by how elements and attributes are actually used.

As previously mentioned, differences in expressive power between markup standards make it impossible, in general, to translate between them faithfully and mechanically. We therefore concentrate on information extraction from a given markup standard to a user-defined form synchronized with a user-chosen ontology. We concern ourselves with 'lifting' content from a standard-specific encoding into an independent form. The resulting form may be useful for search engines, data classification, and data mining. The important point is that the extracted content can be further processed in alignment with the ontology underlying the extraction, independently of the details of the source standards. Thus any subsequent data processing API and transformation rules can remain more stable than the details of the input set of standards.

2. Ontologies

Ontologies as computational entities have been studied extensively within Artificial Intelligence (AI) and its various subdomains since the late 1960s. They have been developed to support reasoning systems with domain models. Various forms of ontologies have been used in Natural Language Processing, including Machine Translation and Intelligent Information Retrieval, in Expert Systems, and so on. While AI is not as fashionable as it once was, the Knowledge Representation work should still serve as a source of insights already gained, to avoid reinventing the wheel for the Semantic Web. (Similarly, the Software Agents literature should be consulted in Web Services pursuits.)

An ontology, on a logical level, is a collection of "concepts" and their relationships. The selected concepts form a conceptualization of a given domain. Although English nouns are typically used to name concepts, it is important to remember that the concepts are not intended to represent words but classes of things; their familiar names are used only for mnemonic value, like Java class names. In fact, some conventions augment concept names with further suffixes, e.g. CAR_1, CAR/VEHICLE, etc., where the chosen concept name is polysemous and by itself not precise enough to name a clear-cut concept. The fundamental relation partially ordering concepts is generalization, typically called IS-A. The relation is analogous to the SUPERCLASS relation of Object-Oriented languages; it organizes concepts hierarchically. Other relations such as PART-OF are also commonly used. It is the network of relations that constitutes the meaning of concepts and not their names (similarly, the meaning of a Java class is constituted by how it is used and how it is related to other classes).

From the implementation point of view, an ontology is a (networked) database of concepts and relations, augmented by access mechanisms (APIs) that allow navigation among concepts.
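The implementation view can be illustrated with a minimal sketch. The `Ontology` class, its method names, and the sample concepts other than CAR/VEHICLE are assumptions made here for illustration only; the note itself prescribes no particular ontology design or API.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """A toy in-memory ontology: concept names plus IS-A and PART-OF links."""
    is_a: dict = field(default_factory=dict)     # concept -> set of direct parents
    part_of: dict = field(default_factory=dict)  # part -> set of wholes

    def add_concept(self, name, parents=()):
        # Register the concept and make sure every parent is known too.
        self.is_a.setdefault(name, set()).update(parents)
        for parent in parents:
            self.is_a.setdefault(parent, set())

    def add_part_of(self, part, whole):
        self.part_of.setdefault(part, set()).add(whole)

    def ancestors(self, name):
        """All concepts reachable by following IS-A links upward."""
        seen = set()
        stack = list(self.is_a.get(name, ()))
        while stack:
            concept = stack.pop()
            if concept not in seen:
                seen.add(concept)
                stack.extend(self.is_a.get(concept, ()))
        return seen

    def subsumes(self, general, specific):
        """True if `specific` is `general` or IS-A `general`, possibly transitively."""
        return general == specific or general in self.ancestors(specific)


# Hypothetical sample concepts; names carry mnemonic value only.
onto = Ontology()
onto.add_concept("VEHICLE")
onto.add_concept("CAR/VEHICLE", parents=["VEHICLE"])
onto.add_concept("SPORTS_CAR", parents=["CAR/VEHICLE"])
onto.add_concept("WHEEL")
onto.add_part_of("WHEEL", "CAR/VEHICLE")
```

Navigation here amounts to walking the IS-A links, so a query for VEHICLE instances could also accept data tagged SPORTS_CAR.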

3. A sketch of the proposed approach

Our approach, based on sharing ontologies, remains agnostic as to the detailed design of these ontologies. Through this "least commitment" strategy we want the method to survive the expected evolution of Semantic Web ontologies. The approach depends only on the most fundamental design principles of ontologies, such as generalization hierarchies among concepts.

Exact synonymy

The simplest case of semantic equivalence between fragments of documents is the exact synonymy of namespace-qualified element names. Exact synonymy entails that the same convention is used to produce and parse the element's content. Thus both "ns1:OrderDate" and "ns2:DateOfOrder" may contain a single text node holding the characters of a date serialized according to a common standard. We expect, however, that exact synonymy will be more of an exception than a rule when semantically aligning existing standards.
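Exact synonymy can be sketched as a plain lookup from namespace-qualified names to a concept. In this sketch the namespace URIs, the concept name ORDER_DATE, the wrapper elements, and the function name are all hypothetical; only the two element names come from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs standing in for two vocabularies.
NS1 = "http://example.org/ns1"
NS2 = "http://example.org/ns2"

# Exact synonyms: namespace-qualified element names bound to one concept.
# ElementTree spells qualified names as "{namespace-uri}local-name".
SYNONYMS = {
    f"{{{NS1}}}OrderDate": "ORDER_DATE",
    f"{{{NS2}}}DateOfOrder": "ORDER_DATE",
}


def lift_synonyms(xml_text):
    """Return (concept, text) pairs for every element whose name is a known synonym."""
    root = ET.fromstring(xml_text)
    return [
        (SYNONYMS[el.tag], (el.text or "").strip())
        for el in root.iter()
        if el.tag in SYNONYMS
    ]


doc1 = f'<o:Order xmlns:o="{NS1}"><o:OrderDate>2003-12-09</o:OrderDate></o:Order>'
doc2 = f'<p:PO xmlns:p="{NS2}"><p:DateOfOrder>2003-12-09</p:DateOfOrder></p:PO>'
```

Both documents lift to the same concept-tagged pair, because the content convention (a date in a common serialization) is assumed to be shared.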

Parent context

In the more general case, data that are instances of common concepts are encoded in non-isomorphic structures. Our method depends on XPath and XSLT mechanisms to recognize these structures and to extract content data. For example, <FirstName> is not equivalent to <First> unless the latter is a child of <Name>.
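The parent-context condition can be tried out with the limited XPath subset in Python's standard library. The sample documents and the `first_names` helper are assumptions for illustration; a full XPath or XSLT processor would be used in practice.

```python
import xml.etree.ElementTree as ET

# One location path per encoding of the same concept. A bare <First> is only
# accepted when its parent is <Name>.
FIRST_NAME_PATHS = [".//FirstName", ".//Name/First"]


def first_names(xml_text):
    """Collect the text of every element recognized as a first name."""
    root = ET.fromstring(xml_text)
    found = []
    for path in FIRST_NAME_PATHS:
        found.extend((el.text or "").strip() for el in root.findall(path))
    return found


# Hypothetical sample documents from three vocabularies.
doc_a = "<Customer><FirstName>Ada</FirstName></Customer>"
doc_b = "<Customer><Name><First>Ada</First><Last>Lovelace</Last></Name></Customer>"
doc_c = "<Race><First>Red Rum</First></Race>"  # <First> outside <Name>: not a first name
```

Here `doc_a` and `doc_b` yield the same value, while the `<First>` in `doc_c` is ignored because its parent is not `<Name>`.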

Context + positional disambiguation

<USDPrice> may be equivalent to <Price currency="USD"> (attribute disambiguation). <Town> may be equivalent to ADDRESS/ADDRLINE[2].
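Both kinds of disambiguation can be written as XPath predicates. The sketch below uses the XPath subset supported by Python's standard library; the concept names, the `RULES` table, and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Concept -> alternative location paths, one per known encoding.
RULES = {
    "USD_PRICE": [".//USDPrice", ".//Price[@currency='USD']"],  # attribute predicate
    "TOWN": [".//Town", ".//ADDRESS/ADDRLINE[2]"],              # positional predicate
}


def extract(concept, xml_text):
    """Return the text of every element matching any path bound to `concept`."""
    root = ET.fromstring(xml_text)
    values = []
    for path in RULES[concept]:
        values.extend((el.text or "").strip() for el in root.findall(path))
    return values


# Hypothetical order document using the attribute and positional encodings.
doc = """
<ORDER>
  <Price currency="EUR">9.20</Price>
  <Price currency="USD">10.00</Price>
  <ADDRESS>
    <ADDRLINE>1 Main Street</ADDRLINE>
    <ADDRLINE>Springfield</ADDRLINE>
  </ADDRESS>
</ORDER>
"""
```

The positional rule is fragile by nature: it encodes a convention of the source vocabulary (the town goes on the second address line), which is exactly the kind of knowledge the rules are meant to record explicitly.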

Given an ontology (covering an appropriate domain of discourse) and a vocabulary, we can declaratively describe where instances of the ontology's concepts are to be found in documents created within the vocabulary as well as how to extract their content.

We propose rules of the form {concept, match, extract}, built with XPath expressions and XSLT match patterns, to specify extraction of conceptually qualified data instances. Such rules can be used to automatically create XSLT stylesheets that 'lift' pieces of content from a vocabulary-specific form to an ontology-specific form, which can be further processed according to the meanings from the chosen ontology.

The rules would be similar to XSLT templates, with XPath patterns locating concept instances and template bodies defining new markup in which to re-express the located information.

The application mechanism, however, will in general be different from XSLT recursive template matching and application. It seems likely that XSLT stylesheets can be derived from the rule set based on extraction specification and the properties of the employed ontology.
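One possible derivation of a stylesheet from a rule set is sketched below. It is only an illustration under strong assumptions: it maps each rule to one XSLT 1.0 template and relies on ordinary recursive template application, ignores the ontology's properties, and invents the `Rule` class, the `to_stylesheet` function, and the output element names `lifted` and `instance`. The mechanism eventually proposed may differ.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from xml.sax.saxutils import quoteattr

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


@dataclass(frozen=True)
class Rule:
    concept: str  # concept name from the chosen ontology
    match: str    # XSLT match pattern locating concept instances
    extract: str  # XPath expression, relative to the match, selecting the content


def to_stylesheet(rules):
    """Emit XSLT 1.0 source text with one template per rule."""
    lines = [
        f'<xsl:stylesheet version="1.0" xmlns:xsl="{XSL_NS}">',
        '  <xsl:template match="/">',
        '    <lifted><xsl:apply-templates/></lifted>',
        '  </xsl:template>',
        '  <!-- Suppress text that no rule claims. -->',
        '  <xsl:template match="text()"/>',
    ]
    for rule in rules:
        # quoteattr escapes the value and picks a safe quote character.
        lines += [
            f'  <xsl:template match={quoteattr(rule.match)}>',
            f'    <instance concept={quoteattr(rule.concept)}>',
            f'      <xsl:value-of select={quoteattr(rule.extract)}/>',
            '    </instance>',
            '  </xsl:template>',
        ]
    lines.append('</xsl:stylesheet>')
    return "\n".join(lines)


# Hypothetical rules built from the examples in this section.
rules = [
    Rule("USD_PRICE", "Price[@currency='USD']", "."),
    Rule("TOWN", "ADDRESS/ADDRLINE[2]", "."),
    Rule("FIRST_NAME", "Name/First", "."),
]
stylesheet = to_stylesheet(rules)

# Sanity check: the generated text is at least well-formed XML.
parsed = ET.fromstring(stylesheet)
```

The generated text would then be handed to an XSLT processor; Python's standard library does not include one, so the sketch stops at checking well-formedness.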

4. Future work

The direction briefly sketched in this Research Note needs to be developed in much greater detail and applied to real-life cases taken from existing business document standards. We also plan to present and test a software implementation of the method soon.

Biography

Jacek R. Ambroziak, Ph.D., started Ambrosoft, Inc. in February 2002, following almost a decade with Sun Microsystems Labs and Sun's XML Technology Center. His new company specializes in high-performance XML computing on the Java platform, with Gregor, the new-generation XSLT processor, being the current flagship technology. While at Sun, Jacek started and led the XSLTC project and built one of the first XML search engines on the basis of his earlier JavaHelp search engine.