XML Europe 2002

An open, metadata-driven architecture for managing XML development resources across a large complex organization

Abstract

Effective implementation of an XML-driven interoperability architecture within a large organization depends crucially on the ability to make a diverse and distributed collection of XML-related design and development resources available to developers. Although it is tempting to do this by introducing some form of centralized, managed repository, experience shows that centrally imposed comprehensive solutions are very unlikely to be effective and accepted across a large and diverse organization such as a national government, a large government agency, or a corporation made up of many companies acquired at different times and having distinctive technology assets. Each part of these organizations has its own local needs and priorities, and needs to be free to decide how best to organize its development efforts, IT strategy, change control and project management. This paper outlines a standards-based, metadata-driven approach to this key problem in large-scale interoperability support, covering general issues and selected technical strategies.

Although informed by my experience advising UK Government and other clients on these issues, this paper represents my own opinions, not the policy or strategy of any organization. Policy and consultation documents regarding the UK Government Interoperability Framework and related technical issues can be found on the UKGovTalk™ website[5].


Table of Contents

1. Introduction
2. A Framework for Interoperability
3. XML Schemas in Data Integration
4. Metadata for XML schemas
4.1. Technology neutral metadata for XML schemas?
4.1.1. Use the XML representation recommended for Dublin Core?
4.2. Concluding Remarks
Acknowledgements
Bibliography
Glossary
Biography

1. Introduction

Many organizations are facing the problems involved in layering XML-driven interoperability within a complex network of existing IT systems. There are no instant fixes or easy answers! However, the value of sharing best practice and lessons learned is well established, and it is in this spirit that I hope this paper will be received.

The first section discusses general issues concerning the development of a standards based framework for interoperability between diverse systems within a large heterogeneous organization. The second section discusses specific issues regarding the development and deployment of a distributed library of XML schemas to support interoperability using a loosely-coupled XML messaging architecture. The final section looks in more detail at a typical technical problem in this context: providing a standard for metadata to integrate a loosely coupled, distributed collection of information into an organized electronic library - in this case, a library of XML schemas.

2. A Framework for Interoperability

An essential first step in promoting interoperability within a complex organization is to decide on a compatible set of applicable public domain standards. There are interesting issues (both organizational and technical) in that area, but this paper is about what you do next after those base standards have been agreed. The discussion below is based on a scenario where the agreed set of standards includes the use of XML documents for inter-system messaging, with the message structure and content defined using XML schemas. (The exact nature of the schemas used to describe the XML messages is not at issue here, though it is probably worth noting that the author's experience is with policy and prototype development using W3C XML Schema.)

Even if two systems can both read & write fluent XML, you don't get interoperability unless the vocabulary and structure of the information being exchanged is also understood - that is, understood fully, so the receiving system knows what it signifies and what to do with it, not just accepted mechanically at the level of XML parsing. Various technical fixes in this area have been proposed recently under the banner of "The Semantic Web"; but in fact, this is a technology independent problem, which has been well known ever since message based application integration was first invented. One way of visualizing the design space for this aspect of interoperability is by placing system-to-system interfaces within a two-by-two matrix: black box vs white box, and thin layer vs thick layer:

Table 1.

            | Thick layer | Thin layer
------------+-------------+-----------
Black box   |             |
White box   |             |

Black box interfaces make no assumptions about the internal structure and state of the interoperating systems, whereas white box interfaces make free use of such inside knowledge. Thin layer interfaces are simple and carry relatively little information within the interface, whereas thick layer interfaces may be complex, and contain large amounts of information. All combinations are possible (and in practice, there are also many kinds of intermediate situations - which does not destroy the value of the simple picture as a framework for design trade-off discussions).

For example, an interface may consist of simple concise queries which make detailed assumptions about the internal structure of a database - this is a white box, thin layer interface. HTTP (just the protocol itself, not the information passed within it) is a black box, thin layer interface. The rich interfaces often found between component modules of complex proprietary software applications are good examples of thick layer, white box interfaces. The rich messaging model developed by the HL7 project[1] is an example of a thick layer, black box interface.

You might conclude that the right thing to do is to push interfaces as far as possible towards being thin layer, black box; however, this also pushes towards an impoverished message vocabulary which simply cannot say very much. White box interfaces are a no-go area for the scenarios envisaged in this paper (a constantly changing heterogeneous population of systems just cannot interoperate that way). So, the core problem is managing the trade-offs involved in determining just how thick or thin an interface needs to be to be fit for its intended purpose. The key requirements in practical terms are:

  • enable heterogeneous computer systems to interoperate using common standards

  • reduce data duplication and data conversion

  • enable reuse of common components across different application interfaces using similar information, and for e-services using syndicated information in any way

  • generally reduce risks and costs of system/application integration

  • support conformance specification and conformance testing

It can be very tempting at this point to wheel in one of the many development methodologies or information modelling disciplines available, and promote it across the whole organization as a unifying force and a "right way". However, I am sure that many of you have had bitter experiences of such efforts; not that these methodologies and disciplines are wrong in themselves, but the amount and kind of change required to implement any one of them uniformly across a diverse and complex organization is immense, and can also have unpredictable effects on the costs and risks of ongoing projects and services.

So what alternatives are there? There are no magic bullets, but there are ways of saying enough about what should happen concerning interoperability, without propagating unwieldy change programmes down into the interoperating systems themselves. In fact, the development methods and data standards used for interoperability specifications need to be themselves "black box" interfaces between the design principles and methodologies used by the teams maintaining each interoperating system. The general problem needs a whole book or more to do it justice - in this paper, I will concentrate on one important aspect: the management of XML schemas describing messages interchanged between systems.

3. XML Schemas in Data Integration

As with any standards-based infrastructure technology, while it is possible to use XML to implement open, modular, loosely coupled interfaces, it is also possible to use XML to implement highly proprietary, tightly coupled interfaces which are very resistant to incremental development to meet future requirements. Unfortunately, the second of these is a lot easier to achieve than the first! Open interfaces need to be clearly specified and documented on several levels. For an XML-based messaging interface to be used across a complex organization, this would include:

  • the base standard for messages, i.e. XML

  • the base standard for specifying message structure and content, for example, W3C XML Schema[3] (schemas beat DTDs easily for this application, because of their modularity and facilities for datatype specification - though hopefully these XML technologies will converge over the next few years)

  • data standards for commonly used information items, for example

    • addresses and personal identification for individuals

    • dates, times, locations

    • names and key properties of departments or companies within the organization

    • names and key properties for business partners and third parties such as registration authorities and web service providers

  • agreed standards for structuring schemas to facilitate reuse across the organization

  • agreed standards for structuring application-to-application transactions using sequences of messages

The level of complexity in these standards depends on the degree of complexity and heterogeneity within the domain in which the messages are expected to be effective, and also on the nature and purpose of any information models underpinning the data standards and schema structures adopted. Two interestingly contrasting examples of this kind of modelling are the relatively heavyweight approach, based on comprehensive detailed modelling, adopted by the Health Level 7 organization[1], and the lighter weight approach, based on a small core model, taken by the UK Government Schema Group's High Level Architecture subgroup[2] - and please note that heavy and light are intended to be descriptive, not synonyms for good or bad.

Whatever modelling approach is taken, developers of new system-to-system applications need access to schemas embodying these standards, and ideally other design resources such as design patterns, transaction protocols, etc. This paper concentrates on XML messages and schemas - the broader picture is discussed as it applies to UK Government in my colleague Paul Spencer's paper at this conference.

Concerning XML schemas, the ideal architecture is one that gives developers seamless access to multiple levels of schema libraries, with facilities to navigate the (potentially large) schema collection using a variety of criteria and implicit relationships. These navigation requirements are discussed further in the next section. The requisite relationships between local and other libraries are summarized in the following figure, where the arrows represent the logical flow of information from different libraries as required to design or validate a schema under development. The local part of this diagram will be repeated many times across the whole organization; some of these may be interrelated in their turn.

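[Figure: relationships between local and other schema libraries; arrows show the logical flow of information needed to design or validate a schema under development]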

Managing the interrelationships between schemas indicated by dotted arrows in the figure is of course important. It is also very important as a practical implementation issue that the cost of entry for a relatively small node within this structure should not be too high - and that the skills burden for maintaining parts of the library should also not be high. Larger departments or companies within the organization may well use a full strength registry/repository system, but smaller ones may not need to - and others may integrate schema management into an existing information storage and maintenance organization.

The key to making this work is that this distributed schema library must itself become an exemplar of the loosely coupled architecture and interoperability standards which it supports. That is, the various nodes of the distributed library interoperate by means of XML messages, designed using a common information model, and implemented using agreed schemas for message structure and content.

The detail of such an information model and the related XML messages is beyond the scope of this paper. However, it is a near certainty that any implementation will use metadata as a key design component, and the final section of this paper discusses the metadata appropriate for XML schemas in this kind of scenario.

4. Metadata for XML schemas

Metadata is not as well known as it should be in the XML community (though it is growing in popularity and understanding), so it deserves a short explanation. The commonest short definition of metadata is that it is "data about data". Metadata principally supports searching and retrieval, by making selected, simple information available to applications looking for information resources. The principle is rather like using a collection of videotapes or CDs. If you had to play each one to see what was recorded on it, then finding a specific film or track would be a long, tedious job. If you have a label on each item giving the title and a list of contents, then picking out the right one becomes very much easier. With a larger collection, a searchable list or subject index, together with shelving that keeps similar recordings together, makes it easier again. This is the basic motivation behind the kind of metadata discussed in this section. There is a wide range of XML metadata technologies; see [6].
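The labelling analogy can be made concrete with a minimal sketch in Python. The record fields (`identifier`, `title`, `subject`), the `urn:example:` identifiers and the sample entries are illustrative assumptions, not part of any published standard. The only point is that a search runs over small metadata records without opening the resources they describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataRecord:
    """A label for one resource; the field names are illustrative assumptions."""
    identifier: str  # where the resource can be found (hypothetical URNs below)
    title: str       # the "label on the box"
    subject: tuple   # keywords, playing the role of a subject index


# Hypothetical catalogue entries, invented for this sketch.
CATALOGUE = [
    MetadataRecord("urn:example:schema:address", "Address types", ("address", "person")),
    MetadataRecord("urn:example:schema:datetime", "Date and time types", ("date", "time")),
    MetadataRecord("urn:example:schema:org", "Organization details", ("department", "company")),
]


def find_by_subject(catalogue, keyword):
    """Return identifiers of records whose subject keywords include the keyword."""
    keyword = keyword.lower()
    return [record.identifier for record in catalogue if keyword in record.subject]


print(find_by_subject(CATALOGUE, "Address"))  # ['urn:example:schema:address']
```

The resources themselves are never read: everything needed to answer the query is in the records.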

The design of a metadata standard for XML schemas carries a number of requirements. This is not the place for a detailed requirements specification, but the key points include:

  • Metadata pertains to schema documents. These are often referred to as schemas; however, it is important to remember in this context the distinction between what constitutes a schema as used for validation (whether W3C or other XML schema), and an XML document containing all or part of the specification of that schema.

  • Metadata on schema documents needs to support navigation relevant to users of schemas, within a user interface driven by the metadata alone. For example, for a given schema document, the metadata should include identifiers for other schema documents on which it depends. (With respect to W3C XML Schema, these would be the schema documents given explicit schemaLocation references in the schema document.)

  • As for any electronic information resource, the metadata needs to indicate rights such as copyright, and relationships such as one schema document being a variant of another.
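The dependency requirement above can be sketched in Python, assuming W3C XML Schema, where the top-level `xs:import`, `xs:include` and `xs:redefine` elements carry `schemaLocation` attributes. The sample schema document, its namespaces and its file names are hypothetical, and the function name `schema_dependencies` is invented for this illustration.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema document, used only for illustration.
SCHEMA_DOC = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="urn:example:claim">
  <xs:import namespace="urn:example:address" schemaLocation="address.xsd"/>
  <xs:include schemaLocation="claim-types.xsd"/>
  <xs:element name="Claim" type="xs:string"/>
</xs:schema>
"""

# Top-level schema components that may point at another schema document.
REFERENCING_TAGS = {f"{{{XS}}}{name}" for name in ("import", "include", "redefine")}


def schema_dependencies(schema_text):
    """List the schemaLocation values given explicitly in a schema document."""
    root = ET.fromstring(schema_text)
    locations = []
    for child in root:
        if child.tag in REFERENCING_TAGS:
            location = child.get("schemaLocation")
            if location is not None:  # schemaLocation is optional on xs:import
                locations.append(location)
    return locations


print(schema_dependencies(SCHEMA_DOC))  # ['address.xsd', 'claim-types.xsd']
```

A metadata record for this schema document could then list these locations, or stable identifiers derived from them, as the documents on which it depends.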

The Dublin Core metadata elements[7] are a natural starting point for the definition of a more precise metadata standard, since they are the product of extensive consultation and experience in managing information resources. When developing the metadata standard for XML schemas for use in UK Government, however, the starting point was not Dublin Core itself but the existing draft e-Government Metadata Standard (e-GMS)[4], which is in turn based on Dublin Core.

Defining an e-GMS derived standard for XML schemas required precise definition of the usage of each element and element refinement as applied to schemas, together with specification of standard values or value sets for some elements. (Note that there is a potentially confusing clash of terminology between metadata elements in Dublin Core and elements in XML documents.)

Some issues emerging in the course of the design of this metadata standard are described in the following subsections.

4.1. Technology neutral metadata for XML schemas?

Should the metadata for XML schemas be defined in XML only, or, following the principles of Dublin Core, as a technology-neutral set of metadata elements and refinements with a representation in XML?

The decision was made to define a technology neutral metadata set, because although metadata is at present always embedded in the schema document, future systems may use standoff metadata. In addition, it was felt to be beneficial to allow the possibility of alternative XML representations of the metadata, by having a technology neutral base standard.
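The difference between embedded and standoff metadata can be sketched in Python. The sketch assumes the embedded metadata sits inside `xs:annotation`/`xs:appinfo`, which is the extension point W3C XML Schema provides for application information; the paper does not say where the metadata is actually placed. The `urn:example:metadata` namespace and the `Metadata`, `Title`, `Creator` and `Record` element names are hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
MD = "urn:example:metadata"  # hypothetical namespace, invented for this sketch

# Hypothetical schema document carrying embedded metadata.
SCHEMA_DOC = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:md="urn:example:metadata">
  <xs:annotation>
    <xs:appinfo>
      <md:Metadata>
        <md:Title>Address types</md:Title>
        <md:Creator>Example Department</md:Creator>
      </md:Metadata>
    </xs:appinfo>
  </xs:annotation>
  <xs:element name="Address" type="xs:string"/>
</xs:schema>
"""


def embedded_metadata(schema_text):
    """Read the embedded metadata into a plain dict (a technology-neutral view).

    Repeated elements would overwrite each other here; a fuller version
    would collect lists of values.
    """
    root = ET.fromstring(schema_text)
    block = root.find(f"{{{XS}}}annotation/{{{XS}}}appinfo/{{{MD}}}Metadata")
    if block is None:
        return {}
    return {child.tag.split("}", 1)[1]: (child.text or "").strip() for child in block}


def standoff_record(schema_uri, metadata):
    """Build a separate metadata element that points at the schema it describes."""
    record = ET.Element(f"{{{MD}}}Record", {"about": schema_uri})
    for name, value in metadata.items():
        ET.SubElement(record, f"{{{MD}}}{name}").text = value
    return record


metadata = embedded_metadata(SCHEMA_DOC)
record = standoff_record("urn:example:schema:address", metadata)
print(metadata)
print(ET.tostring(record, encoding="unicode"))
```

Because the dict in the middle is independent of either XML form, the same metadata can be embedded, held apart from the schema document, or given a different XML representation.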

4.1.1. Use the XML representation recommended for Dublin Core?

Should the XML representation designed in the first instance for these XML schemas follow the proposed recommendations from the Dublin Core Metadata Initiative[8][9]?

Although the principle of following an existing standard was attractive, the decision was made not to follow this representation for the first release of the standard, for the following reasons:

  • A mixed content model is used for qualified elements. Mixed content models are problematic in data/message oriented applications, since once the structure of the XML is validated using a schema, it is often processed by applications which do not understand mixed content.

  • RDF is used in the Dublin Core proposal, which in this situation introduces added namespace complexity without adding much value to the metadata representation. In addition, RDF is relatively unlikely to be used for any other reason by schema developers, and if it is used in the metadata that introduces either data which is not well understood, or an additional learning/training burden.
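The mixed content concern in the first point can be illustrated with a short Python sketch. The markup is a generic, hypothetical example and is not taken from the Dublin Core proposal; the `Date`, `Scheme` and `Value` element names are invented. It shows how a handler that reads child elements as name/value fields silently loses character data that sits beside those children.

```python
import xml.etree.ElementTree as ET

# Element-only content: the value and its qualifier each have their own element.
ELEMENT_ONLY = "<Date><Scheme>ISO8601</Scheme><Value>2002-05-20</Value></Date>"

# Mixed content: character data sits alongside a child element.
MIXED = "<Date><Scheme>ISO8601</Scheme>2002-05-20</Date>"


def naive_fields(xml_text):
    """Read child elements as name/value fields, as a simple message handler might."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


def stray_text(xml_text):
    """Collect character data found directly before, between or after child elements."""
    root = ET.fromstring(xml_text)
    pieces = [root.text] + [child.tail for child in root]
    return [piece.strip() for piece in pieces if piece and piece.strip()]


print(naive_fields(ELEMENT_ONLY))  # {'Scheme': 'ISO8601', 'Value': '2002-05-20'}
print(naive_fields(MIXED))         # {'Scheme': 'ISO8601'}
print(stray_text(MIXED))           # ['2002-05-20']
```

Both documents are well-formed, but only the element-only form delivers the date to the field-oriented handler.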

Following from these decisions, the initial XML representation has been designed using the following principles:

  • A metadata processor can always identify occurrences of a particular e-GMS element by finding occurrences of the corresponding XML element.

  • From the perspective of the schema developer, an e-GMS element refinement sometimes provides additional information (for example, temporal coverage), and sometimes clarifies the significance of the element refined (for example, Owner as a refinement of Creator). Within the XML representation, refinements providing additional information are contained within the refined element; other refinements are placed in separate occurrences of the XML elements. A uniform syntax is used for both, using subelements rather than attributes.
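The two principles above can be sketched in Python. The representation itself had not been published at the time of writing, so the markup below is only a guess at one shape consistent with them; it is not the e-GMS syntax. The `urn:example:metadata` namespace, the `Value` wrapper element and the sample values are all hypothetical.

```python
import xml.etree.ElementTree as ET

MD = "urn:example:metadata"  # hypothetical namespace, not the e-GMS one

# Hypothetical markup: values and refinements are always subelements,
# never attributes and never mixed content.
METADATA = """
<md:Metadata xmlns:md="urn:example:metadata">
  <md:Creator><md:Value>Example Department</md:Value></md:Creator>
  <md:Creator><md:Owner>Example Agency</md:Owner></md:Creator>
  <md:Coverage>
    <md:Value>Example Region</md:Value>
    <md:Temporal>2002</md:Temporal>
  </md:Coverage>
</md:Metadata>
"""


def local_name(element):
    """Strip the namespace part from an ElementTree tag."""
    return element.tag.split("}", 1)[1]


def occurrences(metadata_text, element_name):
    """Find every occurrence of one metadata element by its XML element name."""
    root = ET.fromstring(metadata_text)
    found = []
    for element in root.findall(f"{{{MD}}}{element_name}"):
        found.append({local_name(part): part.text for part in element})
    return found


print(occurrences(METADATA, "Creator"))
# [{'Value': 'Example Department'}, {'Owner': 'Example Agency'}]
print(occurrences(METADATA, "Coverage"))
# [{'Value': 'Example Region', 'Temporal': '2002'}]
```

In this reading, a refinement that changes the significance of the element (Owner) appears in its own occurrence of `Creator`, while a refinement that adds information (temporal coverage) sits inside the `Coverage` occurrence it refines. Either way, a processor finds all the relevant data by looking for the one XML element name.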

4.2. Concluding Remarks

This is the first XML representation to be designed for a metadata standard applying the e-GMS to a specific kind of information. It is also a metadata standard intended to work within specific usage scenarios for schemas. At the time of writing, the technology-neutral standard and first XML representation are complete in first draft, but not yet published. Technical evaluation has not yet started, but is expected to yield results at about the time of the conference. The progress in this work, and hopefully some results, will be described in the conference presentation.

Acknowledgements

I would like to thank the High Level Architecture subgroup of the UK Government Schema Group for the high quality of the formal and offline discussions in that forum.

Bibliography

[1] Health Level Seven organization website, http://www.hl7.org

[2] Benson, Wrightson et al, e-Services Development Framework Primer, draft for trials and consultations, available on the UK GovTalk™ website, http://www.govtalk.gov.uk

[3] XML Schema Recommendation 2001, W3C, available via http://www.w3c.org/XML/Schema

[4] e-Government Metadata Standard, UK Government (Office of the e-Envoy), draft available via the UK GovTalk™ website, http://www.govtalk.gov.uk

[5] The GovTalk website, containing consultation and policy papers concerning interoperability, published by the Office of the e-Envoy, which is part of the UK Government Cabinet Office: http://www.govtalk.gov.uk . Note that this website is designed to support work in progress not archival publication, and that drafts published for comment are not generally available between the end of the consultation period and subsequent publication of an updated version. Enquiries should be addressed to the current contact emails for specific topics, provided on the website, or to ukgovtalk@citu.gsi.gov.uk

[6] Ayers, Wrightson et al, Professional XML Metadata, Wrox Press 2001, ISBN 186100451-6

[7] Dublin Core specifications and documentation are available from http://www.dublincore.org

[8] Dave Beckett, Eric Miller, Dan Brickley, Expressing Simple Dublin Core in RDF/XML, Dublin Core Metadata Initiative Proposed Recommendation, http://www.dublincore.org/documents/2001/11/28/dcmes-xml/

[9] Stefan Kokkelink, Roland Schwänzl, Expressing Qualified Dublin Core in RDF/XML, Dublin Core Metadata Initiative Proposed Recommendation, http://www.dublincore.org/documents/2001/11/30/dcq-rdf-xml/

Glossary

e-GMS

e-Government Metadata Standard

Biography

Principal Consultant

Ann Wrightson has specialized in generic coding, SGML & XML since 1979. She is well known in the XML/SGML field, presenting at conferences and participating actively in the continued development of international standards for XML/SGML-family technology. Following a varied and successful early career in electronic publishing, Ann spent ten years lecturing, researching, and consulting in an academic context, including, in 1998, developing the first UK postgraduate course in XML technology. Moving back to industry, she was employed by a major UK publisher as an XML/SGML technical authority, and as a consultant by a leading-edge XML technology development company. Ann is also a founding board member of KnoW, a non-profit research and development organization bringing together industrial, commercial, governmental and academic partners for projects furthering the development of Web technologies. Ann is now a Principal Consultant with alphaXML Ltd, a specialist consultancy providing R&D and implementation support services to government and industry.