Abstract
Interoperability between ontologies is a big, if not the single biggest, issue in B2B data exchange. For the near (and probably distant) future there will not be a single, unifying, widely accepted B2B vocabulary. Therefore we will need mappings between different ontologies. Since these mappings are inherently situational, and the context is very complex, we cannot expect computers to create more than a small part of those mappings. We need tools to leverage the intelligence of human business experts. We need portable, reusable and standardized mappings. Topic Maps are an excellent vehicle to provide those 'Business Maps'.
First I want to give a quick summary of my presentation at XML2001 in Orlando; this presentation is a continuation of it, and I will not presume everyone here attended my presentation there.
We have lots of data and descriptions of those data. Take for instance the abundance of vocabularies for B2B exchange like xCBL, FinXML, FpML et cetera. Those vocabularies can be seen as ontologies. Older EDI technologies such as X.12 and EDIFACT are also ontologies. Besides those 'industry-strength' solutions, there are lots of tailor-made data exchanges between companies, often using nothing more than simple ASCII comma-separated files. Together with their documentation, those ASCII files also constitute ontologies. And even within larger companies many different ontologies exist within the different legacy databases of the different departments. Those different data sources present huge interoperability problems.
One of those interoperability problems is finding out which data items from different sources are the same. To do that, we need to compare the meanings of those data items. This means we have to look up data definitions (metadata) for different data sources and compare those data definitions. Comparing human-made definitions is a tough job. Different companies or departments may come up with very different definitions for things that really are the same, and with very similar definitions for things that are very different in reality.
First of all, hard as we try, mistakes and obscurities occur in what we write down in our data definitions. Second, in making data definitions we may find that a lot of data aren't that well defined to start with. In other words, when we make data definitions for a data source this sometimes is the first attempt to define the data at all, and when there already is a definition, it is often not precise enough. Third, when we make a definition like 'an employee is a person working at a company', we introduce many new words ('person', 'work', 'company') from natural language. When meanings in natural language aren't precise, those definitions aren't going to be precise either. We should just not think we can fix meanings once and for all in any but the most limited contexts.
In mapping B2B vocabularies we have to distinguish between type and token mappings. I use the linguistic terms ‘type’ and ‘token’ here; the IT terms ‘class’ and ‘instance’ or the mathematical ‘set’ and ‘element’ are very similar.
A token item in a B2B exchange is for instance:
<amount currency="USD">200.00</amount>
or:
USD;20000;
in some non-XML format.
The type of these tokens is amount. What matters is not so much the type itself, but the type description. An example of a type description is: “Amount is the total amount due excluding VAT for all goods ordered at webwarung.com”. Additional type description data is: “amount is in USD, EUR, JPY or IDR” and “amount has a maximum of two decimals”. Type description data are, of course, metadata. There is an important distinction between ‘type’ and ‘set’ or ‘class’ here. The mathematical notion of set (and the IT notion of ‘class’) presupposes an enumerable (though maybe infinite) number of elements. Common set notations are:
A = {1, 2, 3}
or:
B = { x ∈ N | x > 3 }
In mathematics everything is necessarily well-defined. In natural language, things are not so clear-cut. There is nothing in the word ‘love’ which requires the existence of a fixed and well-defined set of occurrences of ‘love’. Though the type-token distinction is similar to the set-element and class-instance distinctions, type-token does not have the mathematical rigor of the other two. This is an important point, since the IT tools used in B2B derive from the mathematical realm, and the business descriptions used in B2B derive from the natural language realm. So there is a clash of cultures here. The IT side presupposes mathematical rigor, whereas the business side does not. Of course, the solution is simple: we have to make business descriptions more precise, so we can use them in algorithms. But note: this solution is the archetypical response from the IT department, wholly from the mathematical, rigorous point of view. The business analyst might have a different solution: why don’t the IT tools allow more flexibility and less rigor? Or, as end-users usually phrase it: why doesn’t the system do what I want it to do!
Back to mappings and conversions. A token-to-token conversion is in a B2B exchange context expressible in XSLT, or a programming language for non-XML domains. There is however no common tool to do type-to-type mappings with. There are not even many standards for type descriptions: usually data descriptions are distributed in some table-like format, usually containing things like name, data type and textual definition. However, nothing is guaranteed and the formats in which this is done are wildly different. The situation might improve somewhat through Dublin Core and similar initiatives, though Dublin Core is not specifically directed at this type of metadata.
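To make the token-to-token case concrete, here is a minimal Python sketch that converts the semicolon-separated token shown earlier into its XML counterpart. It rests on an assumption the text does not state: that the flat format carries the amount in minor units with two implied decimals, so that 20000 stands for 200.00. The function name is hypothetical.

```python
from decimal import Decimal
from xml.sax.saxutils import quoteattr


def flat_to_xml(token: str) -> str:
    """Convert a token such as 'USD;20000;' into an <amount> element.

    Assumption: the flat format stores the amount in minor units
    (two implied decimals), so 20000 means 200.00.
    """
    currency, minor_units = token.rstrip(";").split(";")
    amount = Decimal(minor_units) / Decimal(100)
    # quoteattr adds the surrounding quotes and escapes the value.
    return f"<amount currency={quoteattr(currency)}>{amount:.2f}</amount>"


print(flat_to_xml("USD;20000;"))
```

Note that this handles only the syntactic side. Whether the two amounts mean the same thing (with or without VAT, for which goods) is exactly the type-level question that such code cannot answer.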
Type-to-type mappings are the problem this paper addresses. On the type level we usually consider semantical and contextual information, whereas token mappings focus more on syntactical aspects of conversion.
In this part I want to explore current solutions in XML and Knowledge Management. The first, which I shall call 'the naive approach' is: Let's make a new vocabulary which covers everything, then let everybody use that vocabulary. Probably everybody thought of this sometime and found out it does not work in practice. Multiple vocabularies are a fact of life.
Another approach to interoperability is the use of Published Subject Indicators (PSI) as used in Topic Maps. The basic idea is to make public libraries of unique ID's for things. In our vocabularies we incorporate PSI's, and then we can compare the terms in our vocabularies. In an informal example:
Topic: 'United States of America'; PSI: US
Topic: 'Verenigde Staten van Amerika'; PSI: US
The PSI's in the English and Dutch topics allow us to conclude that both topics are the same. Note that this really just shifts the problem from vocabularies to public libraries. In general we can say this approach is successful if the problem space consists of clearly delimited entities and there is a widely accepted canonical public library. Examples of areas where this approach will work are ISO currency and country codes.
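The comparison by PSI can be sketched in a few lines of Python. This is only an illustration of the idea, not of any Topic Map API: the topics are plain tuples, the ISO country codes stand in for PSI's, and all names are hypothetical.

```python
from collections import defaultdict

# Hypothetical topics from two vocabularies: (topic name, PSI).
topics = [
    ("Netherlands", "NL"),
    ("Nederland", "NL"),
    ("Japan", "JP"),
]


def group_by_psi(topics):
    """Group topic names that carry the same published subject indicator."""
    groups = defaultdict(list)
    for name, psi in topics:
        groups[psi].append(name)
    return dict(groups)


same_subject = group_by_psi(topics)
print(same_subject["NL"])  # both names refer to the same subject
```

The grouping is trivial precisely because the hard work, agreeing on what 'NL' identifies, has been done by the public library.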
Now back to a 'real world' example. In actual mappings between ontologies, we often do not really establish semantic equivalence in the true sense needed for PSI's. When we have found we can use 'CustomerAddress' of our trading partner as the 'billing_address' in our online billing application, we stop. We do not need to find out whether they are truly equivalent in all circumstances: there is no direct business need for it, and therefore the boss doesn't pay to find this out. Solutions like PSI's do not work here, because PSI's require true semantic equivalence. The interesting observation here is that most real world mappings are unidirectional: we translate from a source ontology to a destination ontology for a specific business process. For instance, an order goes from buyer to supplier. It does not go back (though a different document such as an invoice or order confirmation might go back). So for an order only a translation from the buyer's ontology to the supplier's ontology is needed. This unidirectional nature of business exchange means that often we do not establish equivalence relationships, but subset relationships between ontologies. In the above example, 'CustomerAddress' is a subset of 'billing_address'. Every instance of 'CustomerAddress' is a valid instance of 'billing_address'. We do not know whether the reverse is true, and in this example we do not need to know either.
It might be tempting to conclude that we simply have to make a mapping between every two ontologies we use. That, however, is going too far. Even when we do not always establish true semantic equivalence relationships, the mappings we make are certainly for a great part reusable. What we need to do is capture knowledge about the mapping process itself. We need to store the fact that we can use 'CustomerAddress' as 'billing_address' in this particular context. Then, when someone else needs to find out whether 'CustomerAddress' can be used as 'InvoiceAddress' in a different context, they can use this information. When we store this kind of information, we could facilitate the process of mapping ontologies through the use of semi-automated tools which show existing mappings for items in our ontology that we need to map onto another ontology. The human expert making the mapping can still make all the relevant choices and provide new mappings where existing ones can't be reused. Such semi-automated tools could then generate a new mapping, which also can be stored to provide information for the next one. It would also become much easier to exchange information about mappings without having to provide full one-on-one equivalence relationships.
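A semi-automated tool of this kind could be built around something as simple as the following Python sketch. It is an illustration under assumptions, not a proposed design: the class names, the `kind` values and the free-text context are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemMapping:
    source: str            # item in the source ontology
    target: str            # item in the destination ontology
    context: str           # business context in which the mapping was made
    kind: str = "subset"   # "subset" (unidirectional) or "equivalence"


class MappingStore:
    """Keeps earlier mappings so a human expert can review and reuse them."""

    def __init__(self):
        self._mappings = []

    def add(self, mapping):
        self._mappings.append(mapping)

    def suggestions_for(self, source_item):
        """Return all stored mappings that start from the given source item."""
        return [m for m in self._mappings if m.source == source_item]


store = MappingStore()
store.add(ItemMapping("CustomerAddress", "billing_address", "online billing"))

for m in store.suggestions_for("CustomerAddress"):
    print(f"{m.source} was used as {m.target} in context '{m.context}' ({m.kind})")
```

The store only suggests. Deciding whether 'CustomerAddress' may also serve as 'InvoiceAddress' in a new context remains the analyst's job, and the outcome of that decision becomes one more stored mapping.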
A lot of the research effort in this area goes to automatic mapping. The basic idea is (usually) to describe data on a type level in such detail at design time that new mappings can be established automatically at run-time. Note that this is an enormously complicated task. It is an important one too: I think it will take some time before we can utter the next sentence happily: “Hey, that’s funny, yesterday my computer sent $50.000,= to a company I’ve never heard of!” Or to put it otherwise: I do not want my computer to do that yet!
Let’s look at an example to see how complex the task can get. Best Insurance and CarCovered.Com are merging. Best Insurance doesn’t sell car insurance, so there is a huge chance for cross-sales of car insurance to the customer base of Best Insurance. We want to use Best’s customers as prospects for new sales, but we don’t want to send offers for car insurance to Best customers who - coincidentally - already have CarCovered.Com insurance: that makes us look real stupid. Questions a marketeer will ask in such a situation:
Do we expect overlap in customer base? Is there regional or industrial overlap?
Can we find most of the doubles? (With all the different name spellings, double addresses, PO boxes versus regular addresses etc.)
How bad is it if we overlap? How much effort do we put in finding overlaps? Is 2% remaining existing customers acceptable? Or 0.2%?
How much money does it cost to remove double addresses? How stupid do we look when we make mistakes?
etc. etc
Only then can the marketeer decide whether Best’s customer data are actually usable as prospect data for CarCovered.Com. So how could a computer decide whether ‘Customer’ can map onto ‘Prospect’? This takes a whole new level of AI. Of course, I’ve chosen a real hard example, and there are other simpler contexts. My personal experience as a B2B consultant however is that things usually appear simpler than they are in B2B data exchanges.
Of course it’s fine in research to tackle problems which are hard, and may not have a solution in the near future. For the time being that is where automatic mapping will stay: in the lab. What is needed in the short to medium term is not so much automatic mapping but tools to leverage human brains: tools which enable business analysts to make new type-to-type mappings quickly and reliably, using human intelligence.
This model shows what is necessary to store when we want to store the knowledge in mappings. First we must distinguish the documents which are exchanged in B2B exchanges and the items which make up those documents. The precise structure of the document can be stored as a DTD or Schema in the document class if necessary. Items can belong to domains (data types) which in turn can be specializations of other domains. In the left-hand classes we would store the information on the business documents in use by relevant business partners. There is no need to store the full definitions of the data; it is sufficient to store identifiers that uniquely identify the documents, items therein, and domains in use. After all, the full definitions are probably already stored somewhere; simply copying them would only introduce unnecessary redundancy.
On the right-hand side of the model we have the actual mappings. First, the fact that some (usually two) documents are mapped onto each other is stored in the class document-mapping. The document-mapping is related to one or more item-mappings. The item-mappings store not only the identifiers of the mapped items, but also the kind of mapping: is this an equivalence relationship, which is potentially bidirectional? Or is it a unidirectional mapping, which means it is at least a subset-superset relationship? We can also store conversions. For instance, if the destination document allows last names of only 25 characters, but the source allows last names of indefinite length, we could specify that names need to be truncated to 25 characters, or that this is an error. We could also store more complex transformations. They should, however, be readable for ordinary humans, which would rule out XSLT ‘as is’. The intended users of a tool based on this model are business analysts, not XML programmers. (An intelligent tool could of course store XSLT for a large class of transformations and show the results in natural language.) Last, we can store domain conversions, so we wouldn’t have to store the same YYYYMMDD to DD-MM-YYYY conversion for every date in the document.
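The two conversions mentioned above, truncating last names and converting dates once per domain, might look as follows in Python. This is a sketch only: the function names, the `strict` flag and the `domain_conversions` registry are hypothetical, and a real tool would present such rules to the analyst in readable form rather than as code.

```python
from datetime import datetime


def fit_last_name(name: str, max_len: int = 25, strict: bool = False) -> str:
    """Fit a last name into the destination field.

    With strict=True an overlong name is reported as an error
    instead of being truncated.
    """
    if len(name) > max_len:
        if strict:
            raise ValueError(f"last name exceeds {max_len} characters")
        return name[:max_len]
    return name


def convert_date(value: str) -> str:
    """Domain conversion from YYYYMMDD to DD-MM-YYYY."""
    return datetime.strptime(value, "%Y%m%d").strftime("%d-%m-%Y")


# One conversion per domain, reused by every item that belongs to it.
domain_conversions = {"date": convert_date}

print(domain_conversions["date"]("20020314"))
print(len(fit_last_name("A" * 30)))
```

Keeping the date rule at the domain level means an invoice date, a delivery date and a due date all pick up the same conversion without it being stored three times.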
The central class is context. This is context in the broadest sense; this class could store information on B2B vocabulary, region, country, company, business unit, timeframe and whatever is necessary for the mapping under scrutiny. Context would apply to all other classes. If unspecified, an item mapping would usually inherit its context from the document mapping in which it is contained. If the document mapping is concerned with mapping some Invoice to some billing_doc, the item mappings will be valid only for this mapping unless specified otherwise. However, it does make perfect sense to store context for item mappings. We could for instance store that the ‘last_name’ of trading partner one always maps onto ‘GivenName’ of trading partner two. Context also applies to the left-hand side, though maybe we would only want to store the trading partner involved here. Of course, the Context Drivers of ebXML would constitute a good starting point for defining context.
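The inheritance rule for context can be expressed compactly. In the following Python sketch an item mapping without its own context falls back to that of the enclosing document mapping; the class names and the dictionary keys used for context are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentMapping:
    source: str
    target: str
    context: dict


@dataclass
class ItemMapping:
    source: str
    target: str
    context: Optional[dict] = None  # None means: inherit from the document mapping


def effective_context(item: ItemMapping, document: DocumentMapping) -> dict:
    """Return the item's own context, or the document mapping's if none is set."""
    return document.context if item.context is None else item.context


invoice_map = DocumentMapping("Invoice", "billing_doc", {"process": "billing"})
inherited = ItemMapping("CustomerName", "company_name")
explicit = ItemMapping("last_name", "GivenName",
                       {"partners": "one -> two", "validity": "always"})

print(effective_context(inherited, invoice_map))
print(effective_context(explicit, invoice_map))
```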
The data model maps quite nicely onto core Topic Map constructs, to yield what I will call ‘Business Maps’.
Table 1.
| Mapping Model | Example | Topic Map construct |
|---|---|---|
| document | Invoice | topic |
| item | CustomerName | topic |
| document-to-document mapping | Invoice maps to bill | association |
| item-to-item mapping | CustomerName maps to company_name | association |
| context | Vocabulary, Company, Region, Industry, ... | scope |
| external document description | www.bizwords.org\invoice | occurrence (role: business document description) |
| external item definition | www.bizwords.org\amount | occurrence (role: definition) |
| external item datatype | www.bizwords.org\date | occurrence (role: datatype) |
| external item example | www.bizwords.org\amount\example | occurrence (role: example) |
| vocabulary identifier + document identifier or item identifier | www.bizwords.org\amount | subject identity |
What’s more, Topic Maps offer the facility to merge two distinct Topic Maps. This is an excellent way to compare separate, portable B2B mappings. When we have two Business Maps, say one from Sales Europe and one from Sales Asia, and we want to make a new Map for Sales South America, we can merge the existing maps. Any references to external message items will be merged when they have the same subject indicator. Fortunately it is relatively easy to establish stable and unique subject indicators for vocabulary items: one can assume that the vocabularies act as namespaces, i.e. each item in a vocabulary has a unique name, and one can assume the vocabularies to be identifiable by a URI. Suffixing the item name to the URI will do the job of establishing unique subject indicators for vocabulary items. (There are other ways to merge topics based on names, but this is the most robust and appropriate one in this case.)
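As a rough illustration of merging on subject indicators, the Python sketch below represents each Business Map as a dictionary keyed by subject indicator. This is a strong simplification of real Topic Map merging, and the vocabulary URI, the local item names and both function names are hypothetical.

```python
def subject_indicator(vocabulary_uri: str, item_name: str) -> str:
    """Build a subject indicator by suffixing the item name to the vocabulary URI."""
    return f"{vocabulary_uri.rstrip('/')}/{item_name}"


def merge_maps(*maps):
    """Merge maps keyed by subject indicator.

    Entries with the same subject indicator are combined into one.
    """
    merged = {}
    for business_map in maps:
        for indicator, mapped_items in business_map.items():
            merged.setdefault(indicator, set()).update(mapped_items)
    return merged


VOCAB = "http://www.bizwords.org"  # hypothetical vocabulary URI
amount = subject_indicator(VOCAB, "amount")
date = subject_indicator(VOCAB, "date")

sales_europe = {amount: {"total_due"}}
sales_asia = {amount: {"invoice_amount"}, date: {"inv_date"}}

merged = merge_maps(sales_europe, sales_asia)
print(sorted(merged[amount]))
```

After the merge, everything either region knows about 'amount' sits under one key, which is what makes the result a useful starting point for a third map.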
Once the Business Maps are merged, we can use the scope to filter out the business processes we are interested in. So Business Maps can provide an easy and flexible way to reuse knowledge stored in mappings: they are portable, reusable mappings come true.
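Filtering by scope could work along the following lines. The Python sketch assumes that each association carries its scope as a set of context themes and that an association is kept when it contains all requested themes. Both the data and that matching rule are assumptions; the Topic Map standards leave the interpretation of scope to the application.

```python
# Hypothetical associations from a merged Business Map.
associations = [
    {"source": "CustomerName", "target": "company_name",
     "scope": {"Sales Europe", "invoicing"}},
    {"source": "CustomerName", "target": "prospect_name",
     "scope": {"Sales Asia", "marketing"}},
    {"source": "Invoice", "target": "bill",
     "scope": {"Sales Asia", "invoicing"}},
]


def filter_by_scope(associations, required):
    """Keep the associations whose scope contains every required theme."""
    required = set(required)
    return [a for a in associations if required <= a["scope"]]


for a in filter_by_scope(associations, {"invoicing"}):
    print(a["source"], "->", a["target"])
```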
Quite a few things remain for this vision to come true. The best thing of course would be an accepted standard for Business Maps. Having this, we would be able to exchange mappings with all companies using this standard. Note that this is a far less ambitious and more tenable goal than establishing a single unifying B2B ontology. This approach could also prove to be a viable way to achieve inter-company interoperability, still a big problem in the world of ever-merging large companies... We would need tools to support the Business Maps: querying, filtering, importing and exporting, and creating and editing them. And possibly we would need a description of the properties of applications processing Business Maps. The Topic Map standards do not say a lot about what scope means (intentionally so, this is up to the applications). A description of what ‘context’ means in this model would therefore be appropriate. All in all this model could greatly facilitate human-mediated ontology mapping.
Interoperability between ontologies is one of the (if not the) most important problems in B2B data exchange. For the time being, making mappings will mainly be a human job. Therefore we need a way to leverage human intelligence to make all the required B2B mappings. Portable, reusable mappings would accomplish this. Those mappings would need to store information on business document mappings and the context that applies to those mappings. Topic Maps are an excellent vehicle to store such information, thus yielding “Business Maps”.
Samples of Business Maps are available at http://www.marcdegraauw.com/itm