XML 2003

Know What Your Schemas Mean

Semantic Information Management for XML Assets

Abstract

If you find yourself faced with hundreds of XML Schemas in your organization, you need an architecture that can help you overcome the challenges of redundant and overlapping schemas. More importantly, you need to be able to identify and reuse existing schemas rather than write new ones. Ideally, you should be able to discover and use enterprise or industry-standard schemas when available. Where you must nonetheless create a new schema, you want to be able to follow standards for terminology and structure.

This complex situation is common, as large numbers of schemas are developed for messaging middleware, Web Service remote procedure calls, application persistence, and other purposes. Often, developers in different departments write schemas without shared standards. Even where enterprise-wide standards exist, distinct companies often develop their own standards and then later must interoperate for B2B integration or after a corporate merger. Worse, automated tools generate schemas with such ease that hundreds of schemas are often produced for various business purposes, without proper consideration of the need for standards.

One solution to this is using semantics for information management. This approach is centered on an information model, built using ontological standards such as OWL (Web Ontology Language). This information model is a structured representation of the business domain. Enterprise assets such as schemas -- whether XML, relational or even COBOL flat files -- are mapped to this model as a way of representing their semantics.

At this point, the semantic architecture can solve many of the asset-management problems. A search algorithm begins with business concepts in the central information model and discovers all schema elements and attributes which represent these concepts. When there are multiple schemas for a business concept, the semantic approach merges or eliminates redundancies. If that is not possible, semantically-based transformation-generation algorithms create XSLT to seamlessly integrate otherwise incompatible XML documents.

To aid in developing schemas for a new business need, the system identifies schemas for reuse, finding standard schemas wherever possible. When no existing schema is available, the system generates new schemas whose terminology and structure conform to enterprise standards.

In this presentation, participants will learn how to manage information with a semantic architecture in which all enterprise assets are unified. Although the focus will be on W3C XML Schemas, the presentation will also discuss the application of these semantic techniques to other XML schema languages, such as DTD, as well as non-XML assets, such as relational databases. The presentation will include examples and real-life enterprise case studies to bring these techniques to life.

Table of Contents

1. Introduction
2. A Semantic Architecture
2.1. The Central Information Model
2.2. Mapping
2.3. Semantic Discovery
2.4. An Example
2.5. Example Schemas
2.5.1. Schema Alpha--Power Plants
2.5.2. Schema Beta--Power Plant
2.5.3. Schema Gamma--Plants
3. Other Applications
3.1. Generating Schemas
3.2. Transformation
3.3. Non-XSD Schemas
3.4. Web Services
4. Conclusion
Bibliography
Glossary
Biography

1. Introduction

As enterprises develop applications, each developer often creates their interfaces, message formats, and schemas independently, according to the immediate needs of the project. This may result in thousands of items of metadata expressing fundamentally the same information in subtly incompatible ways. A semantically-based system, in which these enterprise metadata assets have their meaning expressed by mapping to a semantic model, provides the power to organize and manage the metadata, allowing for rapid reuse and integration.

Searching for existing schemas is better than creating new ones each time one is needed for an EAI message or an application interface. In many cases, schemas for a given domain have been defined by an industry body or by a central data group in the enterprise. Even if no standard exists, finding an existing schema makes efficient use of already-expended development resources, and makes for easier integration with other enterprise systems in the future.

Yet this search can be costly, since the identifiers in schemas typically do not fully express the meaning of the document elements and attributes. To take a simple example, an EAI message may require a schema describing electric power plants that includes the plant's capacity. The desired schema may express this as power, capacity, watts, mW, or perhaps, when schemas are machine-generated from existing metadata, something incomprehensible like Q2903. If we are lucky and a manual search turns up power, then only an inspection of the schema as a whole shows whether the schema describes power plants, automobiles, or something else. The units involved must also be understood: the power attribute may express watts, megawatts, or even horsepower, and often this can be determined only by examining sample data in XML instances. After a possibly lengthy search, we may find some schemas that seem suitable.

Every time an analyst or developer finds the schemas needed for a given EAI message or application interface, he or she must go through the process of revealing the meaning of each XSD element, attribute, and type. Writing up findings in a report is of little use for the next search. But if the procedure of locating schemas can be made reusable by formalizing the semantics (the meaning) of the schemas, development expenses would shrink along with project schedules, and the quality of the result would rise as human errors are eliminated. In this way, a frustrating and lengthy process of learning the metadata anew each time is replaced by an automated system.

2. A Semantic Architecture

The road to automatic schema discovery starts with a semantic architecture centered on an information model. This model is built using the principles of ontology, the science of formally describing real-world entities and the relationships between them.

2.1. The Central Information Model

In an ontological model, classes describe sets of real-world entities with common properties. For example, PowerPlant is the set of power plants and Energy represents the concept of energy as described in physics. Moreover, a class can inherit from another class: for example, State inherits from Region. Properties describe the links between these entities. For example, a PowerPlant has an output linking it to a unit of Energy. It is also possible to have properties with values that are simple strings and numbers, such as the name property of PowerPlant and the wattHours property of Energy. The properties are constrained and linked together by business rules. Examples of business rules include:

  • Cardinality: Each PowerPlant has one and only one name and one location, but multiple powerLines.

  • Uniqueness: No two PowerPlants share an identifier.

  • Enumeration of possible values: A PowerPlant's type must be one of Hydro, Nuclear, Coal, Gas, or Wind.

  • Mathematical or other logical relationships: kilowatts = watts / 1000
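
The classes, properties, and business rules above could be captured in code roughly as follows. This is a minimal Python sketch rather than an OWL encoding; the names PlantType, Power, PowerPlant, plant_type, and check_unique_identifiers are illustrative assumptions and are not taken from any standard or from the system described in this paper.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class PlantType(Enum):
    """Enumeration rule: the only permitted plant types."""
    HYDRO = "Hydro"
    NUCLEAR = "Nuclear"
    COAL = "Coal"
    GAS = "Gas"
    WIND = "Wind"


@dataclass
class Power:
    """Class Power; watts is the stored property."""
    watts: float

    @property
    def kilowatts(self) -> float:
        # Mathematical rule from the text: kilowatts = watts / 1000
        return self.watts / 1000


@dataclass
class PowerPlant:
    """Class PowerPlant: one identifier, one name, one location, many power lines."""
    identifier: str
    name: str
    location: str
    plant_type: PlantType
    capacity: Power
    power_lines: List[str] = field(default_factory=list)


def check_unique_identifiers(plants: List[PowerPlant]) -> None:
    """Uniqueness rule: no two PowerPlants share an identifier."""
    seen = set()
    for plant in plants:
        if plant.identifier in seen:
            raise ValueError(f"duplicate identifier: {plant.identifier}")
        seen.add(plant.identifier)
```

Cardinality is expressed here by the field types (a single string versus a list), and the enumeration rule is enforced when a value such as PlantType("Hydro") is constructed. A real ontology language states such rules declaratively instead.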

Note that ontological concepts describe the real-life business environment: unlike schemas, which describe how strings must be arranged in an XML document, an ontological model is targeted at expressing a common business view of the concepts important to the organization or the industry which uses the model.

Ontological models as described above are now appearing more and more frequently in various industries. The theoretical basis for the science of ontology has been building in the academic world for decades, and in the last ten years the practical applications of ontology have been emerging in the business world. Recently, ontological modeling received the imprimatur of the World Wide Web Consortium, which standardized the XML encoding of ontologies in the Web Ontology Language (OWL), the successor to the DARPA Agent Markup Language + Ontology Inference Layer (DAML+OIL) (see [OWL03]).

2.2. Mapping

The model expresses the business domain, and the schemas express the data formats used for data transmission or storage. To assign meaning to the schemas, the analyst maps them to the model. This is done by mapping XSD complex types or simple types to classes, and XSD elements and attributes to properties.

The mappings themselves add significant value to the semantic architecture. In fact, every integration project involves the discovery of the meaning of each schema element, but this is generally done informally and written up in a word processor document, if at all. The formal method which the semantic architecture demands is simply a reusable way of capturing this information, so that semantic values become reusable in all future projects both by humans and by automated processes.

The XML encoding of mappings is not as standardized as OWL is for ontology. However, the Resource Description Framework (RDF) is a standard format generally applicable to expressing relationships between concepts, and so can be used for linking schema concepts to ontological concepts.
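
One way to picture such mappings is as subject-predicate-object triples in the spirit of RDF. The sketch below uses plain Python tuples rather than an RDF library. The alpha: and model: prefixes, the path-like item names, and the mapsTo predicate are hypothetical and do not come from any RDF vocabulary; the attribute names follow the Schema Alpha listing in Section 2.5.1.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical identifiers: "alpha:..." names a construct in schema Alpha,
# "model:..." names a class or property in the information model.
MAPPINGS: List[Triple] = [
    ("alpha:region/power-plant", "mapsTo", "model:PowerPlant"),
    ("alpha:region/power-plant/@location", "mapsTo", "model:PowerPlant.location"),
    ("alpha:region/power-plant/@capacity", "mapsTo", "model:PowerPlant.capacity"),
    ("alpha:region/power-plant/@annual-output", "mapsTo", "model:PowerPlant.annualOutput"),
]


def meaning_of(schema_item: str, triples: List[Triple]) -> List[str]:
    """Return the model concepts that a schema construct is mapped to."""
    return [obj for subj, pred, obj in triples
            if subj == schema_item and pred == "mapsTo"]
```

Because the mapping is data rather than prose in a report, both people and programs can query it again in later projects.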

2.3. Semantic Discovery

The model, the schemas, and the mappings between them together form a system for automated discovery of the schemas that meet the needs of a given project, whether for messaging, application interfaces, or storage. The automated search begins with the real-world classes and properties that must be included in the schemas. The algorithm traces the mappings in reverse, finding schemas holding the relevant types, elements, and attributes that express these meanings. Accidental similarities in terminology will not show unwanted schemas, nor will differences in terminology cause relevant schemas to be lost in the search.
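
The reverse tracing described here can be sketched as an inverted index from model concepts to the schemas that express them. This is a simplified illustration, not the algorithm of any particular product. The mapping table is hypothetical, and the Vegetation class with its height property is an assumed stand-in for whatever botanical concept a schema about trees would be mapped to.

```python
from typing import Dict, List, Set

# Hypothetical mappings: schema name -> {schema construct: model concept}.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "Alpha": {
        "power-plant": "PowerPlant",
        "@location": "PowerPlant.location",
        "@capacity": "PowerPlant.capacity",
    },
    "Beta": {
        "plant": "PowerPlant",
        "name": "PowerPlant.name",
        "watts": "PowerPlant.capacity",
    },
    "Gamma": {
        "plant": "Vegetation",
        "name": "Vegetation.name",
        "size": "Vegetation.height",
    },
}


def discover(required: Set[str], mappings: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the schemas that express every required model concept."""
    # Invert the mappings: model concept -> schemas that express it.
    index: Dict[str, Set[str]] = {}
    for schema, items in mappings.items():
        for concept in items.values():
            index.setdefault(concept, set()).add(schema)
    # A schema qualifies only if it appears under every required concept.
    candidates = [index.get(concept, set()) for concept in required]
    return sorted(set.intersection(*candidates)) if candidates else []
```

With this table, a search for PowerPlant and PowerPlant.capacity returns Alpha and Beta. Gamma is skipped even though its root element is also called plant, and Alpha is found even though it uses the name power-plant, because the search compares meanings and not identifiers.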

In some cases, no schema can be found with all the needed elements. However, there may be a schema which does include functionally equivalent elements, when business rules are taken into account. When two properties have been declared equivalent, or one property can be converted to another with a mathematical business rule, then a schema can be reused, as long as a simple transformation is done.

2.4. An Example

An example will illustrate the process of building a semantic architecture and using it for semantic discovery of schemas (see Figure 1 and Section 2.5, “Example Schemas”). For illustrative purposes, this example is much simpler than a typical enterprise deployment, but it does illustrate the principles involved.

In this example, there are three schemas, called Alpha, Beta, and Gamma. Their identifiers do not convey their meaning, but we need to find schemas suitable for EAI messages in a specific application for the energy industry. Alpha and Beta both describe power plants, although they arrange the information in different ways. The structure is different, as are the units used for power and energy. Gamma has a root element plant, but this is misleading: it is meant to describe trees, yet nothing in the schema makes this clear.

The schemas are mapped to the model. For example, schema Alpha's power-plant element, with attributes location, output, and capacity, is mapped to class PowerPlant with properties location, annualOutput, and capacity. This mapping expresses, for example, not only the simple fact that the XML attribute output has the meaning of "output", but also that this is the annual output of the plant.

Further, by mapping to ontological properties of classes Energy and Power, the units that are intended in the XML attributes are described and converted where necessary. In our case, if the capacity is to be expressed in watts, the necessary arithmetic conversion, which is encoded in the model, allows the utilization of a schema whose capacity uses kilowatts instead.
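
The unit conversion can be sketched as a lookup of arithmetic rules keyed by a pair of model properties. The property names Power.watts and Power.kilowatts, the RULES table, and the convert function are illustrative assumptions; only the rule kilowatts = watts / 1000 comes from the model described above, and its inverse follows from it.

```python
from typing import Callable, Dict, Optional, Tuple

# Conversion rules held in the model, keyed by (source property, target property).
RULES: Dict[Tuple[str, str], Callable[[float], float]] = {
    ("Power.watts", "Power.kilowatts"): lambda w: w / 1000,
    ("Power.kilowatts", "Power.watts"): lambda kw: kw * 1000,
}


def convert(value: float, source: str, target: str) -> Optional[float]:
    """Convert a value between two model properties; None means no rule applies."""
    if source == target:
        return value
    rule = RULES.get((source, target))
    return rule(value) if rule else None


# A schema whose capacity is mapped to kilowatts can serve a request for watts.
watts = convert(1500.0, "Power.kilowatts", "Power.watts")
```

When convert returns None, the two properties are not related by any rule, and the schema cannot be reused for that requirement.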

Figure 1. Semantic Discovery of Schemas

2.5. Example Schemas

2.5.1. Schema Alpha--Power Plants

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" 
    elementFormDefault="qualified" 
    attributeFormDefault="unqualified">
  <xs:element name="region">
    <xs:complexType>
      <xs:sequence maxOccurs="unbounded">
        <xs:element name="power-plant">
          <xs:complexType>
            <xs:attribute name="location" type="xs:string" 
                  use="required"/>
            <xs:attribute name="capacity" type="xs:double" 
                  use="optional"/>
            <xs:attribute name="annual-output" type="xs:double" 
                  use="optional"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

2.5.2. Schema Beta--Power Plant

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" 
    elementFormDefault="qualified" 
    attributeFormDefault="unqualified">
  <xs:simpleType name="plantTypes">
    <xs:restriction base="xs:string">
      <xs:enumeration value="hydro"/>
      <xs:enumeration value="coal"/>
      <xs:enumeration value="nuclear"/>
      <xs:enumeration value="gas"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:element name="plant">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="watts" type="xs:integer"/>
        <xs:element name="Wh" type="xs:integer"/>
      </xs:sequence>
      <xs:attribute name="type" type="plantTypes" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

2.5.3. Schema Gamma--Plants

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" 
    elementFormDefault="qualified" 
    attributeFormDefault="unqualified">
  <xs:simpleType name="plantTypes">
    <xs:restriction base="xs:string">
      <xs:enumeration value="tree"/>
      <xs:enumeration value="fern"/>
      <xs:enumeration value="grass"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:element name="plant">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="size">
          <xs:complexType>
            <xs:simpleContent>
              <xs:extension base="xs:integer">
                <xs:attribute name="unit" type="xs:string" default="m"/>
              </xs:extension>
            </xs:simpleContent>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="type" type="plantTypes" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
            

3. Other Applications

3.1. Generating Schemas

This architecture is ideal for locating the schema which answers the need of a specific project. Since a typical large organization has hundreds of schemas related to any problem domain, it is quite likely that a suitable one will be found. When no such schema is found, the semantic architecture can generate a schema, structuring complex and simple types according to the relevant classes, and structuring elements and attributes according to the properties. For example, if the schema must represent a power plant with name and capacity, then the starting point is the class PowerPlant and the properties name, capacity, and kilowatts; using the semantic mappings, a schema is constructed appropriately.
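
As a rough illustration of schema generation, the sketch below builds a small XSD from a class name and a set of properties, using Python's standard xml.etree.ElementTree module. The function generate_schema and the choice to render every property as a child element are assumptions made for brevity; a real generator would also apply enterprise naming and structural standards.

```python
import xml.etree.ElementTree as ET
from typing import Dict

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)  # serialize with the conventional xs: prefix


def generate_schema(class_name: str, properties: Dict[str, str]) -> str:
    """Build an XSD with one element for the class and one child per property.

    properties maps each property name to an XSD built-in type such as "xs:string".
    """
    schema = ET.Element(f"{{{XS}}}schema", {"elementFormDefault": "qualified"})
    element = ET.SubElement(schema, f"{{{XS}}}element", {"name": class_name})
    complex_type = ET.SubElement(element, f"{{{XS}}}complexType")
    sequence = ET.SubElement(complex_type, f"{{{XS}}}sequence")
    for prop, xsd_type in properties.items():
        ET.SubElement(sequence, f"{{{XS}}}element", {"name": prop, "type": xsd_type})
    return ET.tostring(schema, encoding="unicode")


# The class PowerPlant with a name and a capacity expressed in kilowatts.
xsd = generate_schema("PowerPlant", {"name": "xs:string", "kilowatts": "xs:double"})
```

Because the element names are taken from the model instead of being invented per project, every generated schema uses the same terminology for the same concept.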

3.2. Transformation

As mentioned above, when a schema does not exactly meet the requirements, it can sometimes nonetheless be repurposed through transformations derived from business rules in the model. The semantic architecture can also be leveraged for transforming between two schemas, as in an EAI scenario where the output of one application must be transformed to the input of another. This usage of the architecture was the subject of my talk at XML Conference 2002 and is described in detail in my recent EAI/Business Intelligence Journal article (see [JF02] and [JF03]).

3.3. Non-XSD Schemas

This technique can be applied to all enterprise data which is described by schemas, including data other than XML. For relational data, tables take the role of complex types, and columns take the role of elements or attributes. Likewise, for UML diagrams describing application interfaces, classes take the role of complex types and attributes or associations take the role of elements and attributes. The semantic discovery technique can be constrained to search out a specific type of schema, or to find all schemas of any language that express the required meaning.
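
The sketch below suggests how one mapping store might cover several schema languages, with discovery optionally constrained to one of them. The Mapping record, the kind labels, and the relational table PLANTS with its column CAPACITY_KW are hypothetical; only Alpha and Beta correspond to schemas shown in this paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Mapping:
    asset: str    # schema, table, or diagram name
    kind: str     # "xsd", "relational", or "uml"
    item: str     # element, attribute, column, or association
    concept: str  # class or property in the information model


# Hypothetical mappings across XML and non-XML assets.
MAPPINGS = [
    Mapping("Alpha", "xsd", "power-plant/@capacity", "PowerPlant.capacity"),
    Mapping("Beta", "xsd", "plant/watts", "PowerPlant.capacity"),
    Mapping("PLANTS", "relational", "CAPACITY_KW", "PowerPlant.capacity"),
]


def find_assets(concept: str, kind: Optional[str] = None) -> List[str]:
    """Find assets expressing a concept, optionally restricted to one schema language."""
    return sorted({m.asset for m in MAPPINGS
                   if m.concept == concept and (kind is None or m.kind == kind)})
```

Calling find_assets("PowerPlant.capacity") searches every language, while passing kind="relational" restricts the result to database tables.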

3.4. Web Services

Another application of the semantic schema discovery architecture is Web Service discovery for B2B or internal integration. Web Services are meant to be loosely integrated, which should mean that developers can create their Services independently and integrate with each other. However, Web Services are often developed in isolation in separate departments or organizations, and must later be used together. Just as the semantic architecture can be used for locating schemas, so too can it be used for locating Web Services that use the schemas as the input or output of their operations. In this way, a UDDI search can go beyond taxonomies or the syntax of WSDL to a full search on the meaning of the desired Service. (For more detail, see Borenstein and Fox, "Semantic Discovery for Web Services" [JBJF03].)
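
A search on the meaning of a Service might look roughly like the following, where each operation is described by the model concepts that its input and output schemas express. The service names, operation names, and concept sets are invented for illustration, and the sketch does not use UDDI or WSDL APIs.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Operation:
    service: str
    name: str
    inputs: FrozenSet[str]   # model concepts expressed by the input schema
    outputs: FrozenSet[str]  # model concepts expressed by the output schema


# Hypothetical services; both expose an operation with the same name.
OPERATIONS = [
    Operation("PlantRegistry", "getPlant",
              frozenset({"PowerPlant.name"}),
              frozenset({"PowerPlant.location", "PowerPlant.capacity"})),
    Operation("Arboretum", "getPlant",
              frozenset({"Vegetation.name"}),
              frozenset({"Vegetation.height"})),
]


def find_operations(available: Set[str], wanted: Set[str]) -> List[str]:
    """Return operations callable with the available concepts that yield the wanted ones."""
    return [f"{op.service}.{op.name}" for op in OPERATIONS
            if op.inputs <= available and wanted <= op.outputs]
```

Both operations are named getPlant, so a search on names alone could not tell them apart, whereas a search on concepts returns only the operation whose schemas describe power plants.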

4. Conclusion

Reusing existing schemas can be nearly impossible when a manual search is required. A semantic architecture automates the discovery of schemas or the generation of new schemas, reducing costs and development time, easing integration, and increasing quality and reusability. With the semantic architecture in place, analyses of schemas are made reusable for future applications, including not only semantic search but also transformation and Web Service integration.

Bibliography

[TBL01] Tim Berners-Lee, "The Semantic Web," Scientific American, May 2001 (also at http://www.sciam.com/2001/0501issue/0501berners-lee.html).

[JBJF03] Joram Borenstein and Joshua Fox, "Semantic Discovery for Web Services," Web Services Journal, April 2003 (also at http://www.sys-con.com/webservices/articleprint.cfm?id=507).

[JF02] Joshua Fox, "Generating XSLT with a Semantic Hub Transformations for the Semantic Web," XML Conference, Dec. 2002 (also at http://www.idealliance.org/papers/xml02/dx_xml02/papers/04-04-05/04-04-05.pdf).

[JF03] Joshua Fox, "Central Information Models for Data Transformation," EAI/Business Intelligence Journal, May 2003 (also at http://bijonline.com/PDF/May03Fox.pdf).

[OWL03] "OWL Web Ontology Language Use Cases and Requirements: W3C Candidate Recommendation 18 August 2003," ed. Jeff Heflin (http://www.w3.org/TR/webont-req/).

Glossary

DAML+OIL

DARPA Agent Markup Language + Ontology Inference Layer

OWL

Web Ontology Language

RDF

Resource Description Framework

Biography

Joshua Fox serves as Senior Software Architect at Unicorn Solutions, working on the Unicorn system for semantic information management. Fox's previous experience includes the design and development of large-scale distributed Internet systems for collaboration over the Internet. He has published and lectured extensively in the field of software engineering and data management.