Abstract
XML has become a 'lingua ubiquita' but not yet a 'lingua franca': everyone is using it, but with too many of their 'own', costly, dialects, even within particular enterprises, and without the gains so often promised. It is a standard that risks imploding, unless senior management and IT services step in to impose tightly-defined, semantic, business vocabularies as well as clear policies for the identification and naming of business objects (whether documents, processes, dossiers or other logical objects).
This is not an 'IT' issue but a business assets management issue. The European Parliament, working together with a number of other European public administrations, has developed a management-driven approach to the introduction of the XML family of standards, that is shaping a new "information architecture", in which business objects come center stage. Drawing on work from his book of the same title [Information Architecture], and the experience of IT policy management, the presentation will underline the importance of:
comprehensive and coherent object identification and naming conventions (irrespective of platform or technology), that should not be left to the implementation stage, but rather defined as enterprise policy. The value of the 'Core Components Model' of ebXML and UBL methodology will be demonstrated;
'logical object persistence', by which any business object can be identified and, where necessary, dereferenced to a context-specific representation and addressed by any system or process, without the objects 'losing' their identity within a particular context. Particular attention will be paid to pilot projects that have assessed the value of ISO Topic Maps 'Published Subject Indicators' (PSI);
'object mapping', as a necessary complement to application and business process mapping. Together, the three form a virtuous triangle necessary for building any sustainable long-term information architecture, and ensuring that an enterprise's information assets never become locked into a 'proprietary' XML vocabulary.
This paper highlights one of the key concerns in work undertaken by the European Parliament together with other public administrations in Europe regarding the use of XML-centred application development. Although the family of XML standards certainly provides a robust grammar and syntax, substantial work would be required in developing both a methodology for the use of XML and a semantic framework within which interoperability might be achieved in different layers.
One major concern was the rapid explosion of specific mark-up languages that XML, as a "meta-language", permits. Freed of the constraints of a particular tag-set (as in HTML), developers felt comfortable developing as many tags as they deemed necessary within the context of their particular domain or project.
This flexibility has, however, become recognised as a potential - and major - problem. XML tag sets that purport to offer a standardised mark-up language can themselves become proprietary standards if defined too closely to the context of a particular project. Such languages, if poorly supported or poorly documented, present major dangers of proprietary lock-in on a scale sometimes greater than that of the proprietary standards favoured by commercial products.
There is a growing recognition that while XML offers a standardised grammar and syntax, it was not designed to offer common semantics. This work still needs to be done, and the Parliament is following closely the work of the Electronic Business Markup Language (ebXML) initiative, now sponsored jointly by the United Nations Centre for Trade Facilitation and Electronic Business (UN/CEFACT) and the Organisation for the Advancement of Structured Information Standards (OASIS). In particular, its work in creating the Core Components Technical Specification (CCTS) standard, together with associated methodologies, seems to offer the most robust way forward.
The European Parliament, together with the other institutions and agencies of the European Union, is faced with an additional layer of complexity, unprecedented in scale, because of its multilingual regime: it has been working in 20 languages since May 2004.
There would seem to be a paradox between ensuring that any XML vocabulary actually carries some semantics, as a way of offering clues to the meaning of the content it encapsulates, and the need to re-use the same XML tag-set across 20 different languages of content.
Again, the ebXML CCTS offers some hope in identifying and managing core business terminology and semantics, but in doing so it requires context-specific meaning to be built from those core components, in order to develop applications that respond to particular enterprise needs.
The development of a multi-layered "Information Architecture", in which all of these issues are very explicitly separated out, offers the most flexible and scalable infrastructure while remaining rooted in extremely stable foundations.
Many large organisations have attempted to tackle the problem of interoperability by imposing an approved, standard set of software, development tools and platforms.
There are several major drawbacks with this approach. It can be anti-competitive and anti-innovative, and it is simply not possible to keep the rest of the world out: ever fewer enterprises can afford to cut themselves off from partners, or even competitors, in this way. Further, if your only tool is a hammer, every problem looks like a nail: problem-solving starts being driven by the toolset and not by business needs.
Most of all, however, this approach does not in fact improve interoperability at all, as the same tools are used in such radically different ways, particularly in text-intensive production systems. Agreeing, for example, on a particular word-processing format in order to exchange content only offers easier access to further processing; it does not give any clues as to intended use or structure.
In common with many large enterprises, the European Parliament quickly assimilated at least the superficial importance of XML. Awareness raising and training programs evolved as early as 1999, and XML has appeared on the radar of most application developers.
The institution has continually faced the dangers of two forms of reductionism in its understanding of the XML standard, and is by no means unique in this.
Firstly, XML has been chanted as a management mantra, recognised as important but essentially considered an "IT thing".
Secondly, because XML offers a means towards syntactic interoperability, that has been the initial - and sometimes exclusive - objective of some early XML development projects. Whatever application or project development methodology has been used, generation of XML has been seen as merely a grammatical objective. One example is the near-automatic generation of an XML DTD or Schema from a Unified Modeling Language (UML) model, without consideration of the naming of elements or indeed of the re-use of those elements in another environment.
With different XML development efforts under way in parallel, there was a growing realisation that different XML element names were being used to express the same semantic intent:
<title>, <titre>, <titel>
One response to this challenge was simply to point to XSLT as a mechanism for ironing out little local difficulties. Little attention was paid at that stage to calculating the impact of the additional processing overhead that such transformations might involve, particularly in mission-critical document assembly on-the-fly.
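The kind of renaming transformation referred to above can be sketched as follows. This is a minimal illustration only: it uses Python's standard library as a stand-in for an XSLT stylesheet, and the mapping table `CANONICAL_NAMES`, the function name and the sample documents are assumptions made for the example, not part of any Parliament system.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from locally chosen element names to one agreed name.
CANONICAL_NAMES = {"titre": "title", "titel": "title"}


def normalise_element_names(xml_text: str) -> str:
    """Rename every element whose tag appears in CANONICAL_NAMES."""
    root = ET.fromstring(xml_text)
    for element in root.iter():  # iter() visits the root and all descendants
        element.tag = CANONICAL_NAMES.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


# Illustrative sample documents (invented for this sketch).
french = "<doc><titre>Rapport annuel</titre></doc>"
german = "<doc><titel>Jahresbericht</titel></doc>"

print(normalise_element_names(french))
print(normalise_element_names(german))
```

Every document that passes through such a step is parsed, walked and re-serialised, which is precisely the processing overhead that went uncounted at that stage.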
Further comparisons of parallel development efforts threw up another class of problem, that of inconsistent granularity. Whilst a first application encapsulated a name as:
<name>Peter Brown</name>
A second application would offer a further degree of granularity:
<name>
  <firstName>Peter</firstName>
  <familyName>Brown</familyName>
</name>
Whilst XSLT could be used to effect a transformation from B to A by folding together the separate XML elements, the operation obviously cannot be performed in the other direction, particularly in a multi-lingual and multi-cultural environment where any number of conventions of name presentation occur. There are simply no stable and reliable syntactic or algorithmic rules that allow such an effort to be undertaken.
“Wasn’t XML supposed to address this sort of issue?” The honest answer must be no. XML offers a standardised grammar and syntax, but it does not offer a path or methodology towards semantic convergence or interoperability.
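The asymmetry between the two directions can be made concrete with a short sketch. Python again stands in for XSLT here, and the function names and the second sample name are illustrative assumptions. Folding the granular form into the simple form is mechanical. Going the other way yields only a list of candidates, with nothing in the data to say which one is right, or which half is the family name.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def fold_name(xml_text: str) -> str:
    """Collapse the child elements of <name> into a single text value (B to A)."""
    name = ET.fromstring(xml_text)
    parts = [child.text.strip() for child in name if child.text]
    folded = ET.Element("name")
    folded.text = " ".join(parts)
    return ET.tostring(folded, encoding="unicode")


def candidate_splits(full_name: str) -> List[Tuple[str, str]]:
    """List every way of cutting a name into two non-empty parts (A to B)."""
    words = full_name.split()
    return [
        (" ".join(words[:i]), " ".join(words[i:]))
        for i in range(1, len(words))
    ]


granular = "<name><firstName>Peter</firstName><familyName>Brown</familyName></name>"
print(fold_name(granular))

# An invented four-word name: three possible cut points, none of them certain.
for first, second in candidate_splits("Maria del Carmen Ruiz"):
    print(repr(first), "|", repr(second))
```

Even the single split available for a two-word name is only a guess, since some naming conventions place the family name first.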
One response has been to develop specific controlled vocabularies encapsulated in agreed XML element and tag sets or even more formally in XML Schema. It was hoped that a central registry, complete with repository and an authority wielding a big stick, would strong-arm developers and project managers into conformance.
Defining, maintaining and updating a central repository is a daunting task, even with the necessary authority and resources. A successful repository will consist of five essential ingredients:
content;
an infrastructure that makes the content available and useable;
an agreed methodology for maintaining and extending the initial content;
an authority endowed with the responsibility and means to take and impose decisions regarding the repository’s content and management;
a communications strategy that promotes knowledge, understanding and use of the repository
While the first ingredient may be an IT issue (but see below), and the second may need to be provided by IT services, all the other components are business management issues and should be handled as such.
Whilst the principles regarding any future repository may have been agreed, there was not yet agreement regarding the content. Some argued for a tightly-managed set of XML Schema, others merely for a reference set of Best Practice guidelines for different aspects of XML development, implementation and use. The “XML Framework” that started to take shape shied away from imposing a centrally-controlled set of Schema, partly for pragmatic reasons: it is difficult to set or enforce a set of rules after the horse has bolted.
XML projects were already up and running, some at departmental level beyond the purview of central IT services. There was a recognition that programmers and project leaders will inevitably approach XML solutions in the context of their own needs and would balk at the prospect of being tied down to a particular set of rules. For any rule-set or Framework to gain acceptance would therefore require a “variable geometry” of controlled and managed extensibility together with tangible benefits, rather than hard and fast rules: the carrot rather than the stick.
Although the Framework already establishes an important expression of collaborative intent, it falls short, in this model, of being a useful and useable business asset. The next phase in the development of the Framework is intended to provide a normative reference. Although this could be a set of registered XML Schema, a number of policy questions have arisen that must first be addressed.
Many issues related to the use of XML are management - rather than technology - issues. It is important for senior technologists and management alike to argue, from an early stage, for more business involvement in key policy decisions, decisions which IT services alone are neither competent nor scoped to address.
This may involve creating a (or formalising an existing) network of middle to senior managers from across all key business areas to drive these policies. They, not XML developers, should be central to the most important aspect of interoperability - semantics. This involves addressing and deciding on some key issues.
The core business of the European Parliament is producing documents, whether texts that form an integral part of future European Union (EU) legislation, or accompanying texts that influence the legislation (opinions, questions, debates, and so forth). It is a common scenario – for the Parliament at least – for a “block” of text to start life as one distinct document, only to be incorporated and distinguished later as a fragment in a second document, and later still for the same text to be melded into the text of a third document. One document can be composed of a set of other documents.
It was inevitable therefore that, sooner or later, someone was going to pose the naïve question, “what, then, is a document?” The ensuing debate was necessary to determine a strategy for identifying and managing documents, however they are defined. Importantly, there was – and there is now – a clear semantic agreement over key terminology, including concepts such as document, dossier, file, fragment, etc.
This in turn helps make a clear distinction between different global object classes to be used in subsequent modelling.
As an example:
one object type, which the Parliament refers to, provisionally, as a “logical” or “abstract” object, is nothing more than an abstract container that defines a particular concept, and attributes to it an immutable set of metadata;
another object type consists of a particular “representation” or state of the logical object: for example, “version 1-7, in English, in a specific word processing format”;
another object type represents processes affecting another object (such as a document) in its life-cycle (“drafted by…”; “reviewed by…”; “approved on…”);
yet another object type might represent the assembly or collection of a set of other objects at a particular moment (what the European Parliament calls a “dossier object”).
This approach thus addresses the functional necessity both to distinctly identify different object types and avoid unnecessary confusion over what, exactly, is a document.
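The four object types listed above can be sketched as simple data structures. This is a hedged illustration only: the class and field names, the identifiers and the sample values are assumptions made for the example and do not reproduce the Parliament's actual model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class LogicalObject:
    """Abstract container: a concept with an immutable set of metadata."""
    identifier: str
    concept: str


@dataclass
class Representation:
    """One particular state of a logical object."""
    of: LogicalObject
    version: str
    language: str
    file_format: str


@dataclass
class ProcessEvent:
    """A life-cycle step affecting another object ("drafted by", "approved on")."""
    target: LogicalObject
    action: str
    actor: str


@dataclass
class Dossier:
    """A collection of other objects assembled at a particular moment."""
    identifier: str
    assembled_on: str
    members: List[LogicalObject] = field(default_factory=list)


# Hypothetical sample data.
report = LogicalObject("obj-001", "annual report")
english = Representation(report, "1-7", "en", "word-processing")
drafted = ProcessEvent(report, "drafted by", "rapporteur")
dossier = Dossier("dossier-001", "2004-05-01", [report])

print(english.of.identifier, drafted.action, len(dossier.members))
```

Keeping the logical object frozen while its representations, events and dossiers remain ordinary mutable records mirrors the distinction drawn above between a stable identity and its changing states.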
It equally became clear that to model these different object types as formal XML Schema – often with unforeseen or unpredictable combinations – was going to be a gargantuan task with considerable maintenance overhead.
An early concern therefore was to be able to identify object containers and, hopefully, at least model and normalise them: how to identify them, how to describe them (using metadata) and how to manage the relationships between simple "core" objects and more complex constructs.
The European Parliament has been concerned by the debate in the Technical Architecture Group (TAG) of the World Wide Web Consortium (W3C) and has been looking to that forum to offer guidance in establishing a clear distinction between object identification requirements, as distinct from object addressing and resolution mechanisms for actually finding a particular object’s representation(s). This issue has been the subject of a collaborative project between the EU institutions, which has examined different object identification, addressing and resolution approaches, including the Uniform Resource Name (URN), XML Namespaces, Topic Map Published Subject Identifiers, as well as specific implementations such as PURL and Handle.
Which approach is favoured will be as much a question of its potential use within the framework of the family of XML standards (in particular preparations for the use of XLink, XQuery, XPath and XML Fragments), as it will be for other aspects of semantic interoperability. It will also be a function of the need to be able to distinguish between and manage higher-level sets of abstract "logical identifiers" as well as (or even instead of) the constantly shifting and more detailed view of actual content.
This will be of particular importance in developing policies to enable so-called "Semantic Web" technologies, defining and encapsulating relationships between objects and making assertions regarding meaning. In the public sector in Europe, these needs are further emphasised by a series of Use Cases concerned with greater public access (not just to content, but also to information about it); an increased interest in Web Services; and a desire to ensure object identification and naming policies can be managed independently of any specific content management system.
One person's metadata is another person's data. Initial research indicated that too much metadata was being defined in highly context-dependent situations. Making metadata re-useable in different contexts thus becomes as difficult as re-using content in different contexts if the “atomic” building blocks are not sufficiently distinct. The Parliament therefore is tying the task of identifying metadata to the task of uniquely identifying different object types, whether documents or other logical objects.
Whilst XML Schema provide an extremely robust validation model for document creation, much of the “semantics” of a particular element can only be understood – and used – in context. The building blocks only become semantic building blocks in the context in which they are actually used. This does not detract from the need to identify certain “components” in an abstract manner that allows consistency in a core vocabulary and at the same time allows them to be used in different contexts: an abstract "Date" could be a CreationDate, a PublishingDate, a ReleaseDate, a RegistrationDate, depending on context.
Again, it is not amongst the design goals of XML to offer such understanding, but the hierarchical model of the XML Schema implies a clear parent-child relationship between content elements. If a particular element has a different meaning in a different context, so be it. That then requires a mechanism and infrastructure that allows “building blocks” to be defined and maintained independently – and probably upstream – from specific XML Schema.
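The idea of an abstract building block that is defined once, upstream of any Schema, and then qualified by context can be sketched as follows. The sketch is loosely inspired by the notion of core components. It does not implement the CCTS data model, and the class names and the definition text are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreComponent:
    """A context-free entry in the core vocabulary."""
    name: str
    definition: str


@dataclass(frozen=True)
class ContextualComponent:
    """A core component qualified for use in one particular context."""
    core: CoreComponent
    qualifier: str

    @property
    def element_name(self) -> str:
        # The context-specific name is derived from the core name, never redefined.
        return f"{self.qualifier}{self.core.name}"


# Hypothetical definition text, for illustration only.
DATE = CoreComponent("Date", "A particular calendar day")

contextual = [
    ContextualComponent(DATE, qualifier)
    for qualifier in ("Creation", "Publishing", "Release", "Registration")
]

for component in contextual:
    print(component.element_name, "->", component.core.name)
```

Because each context-specific name points back to the same core entry, a system that does not know "RegistrationDate" can still discover that it is dealing with a "Date".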
This concern, coupled with the reality of projects under way, convinced GRI to examine the benefits of the ebXML “Core Components Model”, or CCTS.
A methodology is needed in order to set about discovering and modelling core components – the “atomic level” components that unambiguously define your key business terminology and vocabulary – and then managing them in such a manner that they are useable. The approach using Core Components matched very closely the functional requirements that had been identified regarding object identification, description, composition and re-use. Because of its “atomic” nature, a reference infrastructure can be built that provides not just a stable reference vocabulary, but that permits the use of entries in that vocabulary in object identifiers, metadata, XML elements, XML attribute names and/or values, and so forth. In this manner all potentially re-useable semantic terms are made available to all systems and infrastructures, XML based or not.
This has a clear advantage of preparing the ground for more thorough development of “all-XML” systems by providing the building blocks, together with naming and identification conventions for them, whenever and wherever they might be used. As such, a “Core Component” that is today only used, for example, to identify a particular metadata wrapper of a logical object, could tomorrow be equally used as part of the construction of an XPath statement used in a dynamic hyperlinking system or a sophisticated Topic Map semantic web navigation tool. Semantic interoperability is a built-in function of the approach.
If objects are to be identified in a coherent manner, an identification scheme needs to be established. Because many users regularly cite document references and names, it is necessary to establish as a functional requirement that any object identifiers be “humanly retainable” and mnemonic. In addition, such citations can form the essence of the “logical object” (very close to, if not identical to, the concept of “Subject identifier” in the International Standards Organisation (ISO) “Topic Map” and related XTM standards, of which more below). Such logical object identifiers tend to be extremely stable, whereas the particular set of resources that represent the logical object is by its nature very volatile. Whilst URIs might indeed be the axioms of Web Architecture, there is a case to be made for a more stable system of Object Identifiers as the axiom of an Information Architecture[1].
Further, as the Parliament has opted for an approach based on a “structured identifier”, it is intended that each “element” that comprises part of an identifier (for example “object type” or “originating authority”) will itself be an instance of a controlled value set of an object class (the object type superclass and a subset of the “actor class”, respectively, in the example above).
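A structured identifier of this kind can be sketched as follows. Only the "object type" and "originating authority" parts come from the description above. The year and sequence parts, the controlled value sets, the authority codes and the slash-separated layout are all assumptions made for the example, not the Parliament's actual scheme.

```python
from dataclasses import dataclass

# Hypothetical controlled value sets; in practice these would be maintained
# as instances of the object type class and the actor class respectively.
OBJECT_TYPES = {"report", "opinion", "question"}
ORIGINATING_AUTHORITIES = {"EP", "COM", "CSL"}


@dataclass(frozen=True)
class StructuredIdentifier:
    """An identifier whose parts are each drawn from a controlled value set."""
    object_type: str
    originating_authority: str
    year: int       # assumed part, for illustration
    sequence: int   # assumed part, for illustration

    def __post_init__(self) -> None:
        if self.object_type not in OBJECT_TYPES:
            raise ValueError(f"unknown object type: {self.object_type}")
        if self.originating_authority not in ORIGINATING_AUTHORITIES:
            raise ValueError(f"unknown authority: {self.originating_authority}")

    def __str__(self) -> str:
        # A short, mnemonic form that a person can cite and remember.
        return (
            f"{self.originating_authority}/{self.object_type}/"
            f"{self.year}/{self.sequence:04d}"
        )


identifier = StructuredIdentifier("opinion", "EP", 2004, 17)
print(identifier)
```

Validating each part against its value set when the identifier is created is one way of keeping the scheme a matter of enterprise policy rather than of individual implementations.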
The Parliament noted a strong tendency to define metadata needs in particular contexts. For example, a system receiving official documents from another institution requires a set of metadata that in fact represents two (possibly more) sets of information about different and distinct objects that happen to be associated together in a particular context (for example: a document, that document’s particular state upon reception; and the transmission/reception process).
By separating out the distinct objects involved, it has been possible also to strip down metadata to the minimum set required to uniquely distinguish a particular (and equally stripped down) object.
By identifying the distinct metadata requirements for each object type, it has thus been possible to maximise re-useability, particularly if the metadata elements themselves are treated as distinct object types.
HTTP content negotiation, implemented in some Web server architectures, gives a glimpse of the sort of functional requirements that the Parliament needs to address: a URL is requested by a web browser and, depending on the language preference settings indicated in the user's browser, the server will “negotiate” which, of available language versions, should be delivered to the client.
In this model, a language “version” (delivered in HTML) is but one of many possible representations of a particular logical document. In the Information Architecture model that the Parliament is advancing, each logical document (or indeed any logical object) may have 0..n representations, based on a whole range of attributes or facets: language, version, access-rights, format, support medium. Resolution architectures will thus need to take into account any range of different associations between a logical object and its representations. In looking at whether the HTTP header might include metadata about the object being accessed, the TAG has tangentially glanced off this issue. The XRI and Topic Map Published Subjects initiatives in OASIS are another source of encouragement, but it will ultimately be a matter of debate and policy whether these considerations of an information architecture are taken on board within the standards organisations or left to implementors.
The multi-lingual, document-oriented and text-rich environment of the European Parliament inevitably presents a major modelling challenge as regards establishing a Core Vocabulary, available to all systems developers. Encouraged by the support given by the UBL for the ebXML-inspired Core Components Technical Specification, the Parliament is currently investigating the most appropriate methodology for developing its vocabulary. Although the vocabulary, by its nature, will contain and maintain elements and definitions independently of the context in which they might be used, it is inevitable that the process by which these core elements are discovered and identified will be through a systematic approach to higher-level business process modelling. More and more modelling tools and environments themselves are making their modelled processes available as an XML application, and this in turn will provide the hooks to the core vocabulary.
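The resolution step described above, from one logical object to one of its representations, can be sketched in the spirit of HTTP language negotiation. This is an illustration under stated assumptions: the `resolve` function, the facet names, the identifier `report-42` and the `example.org` locations are all hypothetical, and only two of the facets mentioned (language and format) are modelled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Representation:
    """One concrete representation of a logical object."""
    logical_id: str
    language: str
    file_format: str
    location: str


def resolve(
    logical_id: str,
    representations: List[Representation],
    preferred_languages: List[str],
    file_format: str,
) -> Optional[Representation]:
    """Return the representation in the most preferred available language."""
    candidates = [
        r for r in representations
        if r.logical_id == logical_id and r.file_format == file_format
    ]
    for language in preferred_languages:  # ordered from most to least preferred
        for candidate in candidates:
            if candidate.language == language:
                return candidate
    return None  # a logical object may have no matching representation


# Hypothetical catalogue of representations.
REPRESENTATIONS = [
    Representation("report-42", "en", "html", "https://example.org/report-42.en.html"),
    Representation("report-42", "fr", "html", "https://example.org/report-42.fr.html"),
    Representation("report-42", "fr", "pdf", "https://example.org/report-42.fr.pdf"),
]

chosen = resolve("report-42", REPRESENTATIONS, ["de", "fr", "en"], "html")
print(chosen.location if chosen else "no representation available")
```

The identifier passed to `resolve` never changes, whatever is added to or removed from the catalogue, which is the separation between identification and addressing that the model depends on.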
[Identity Crisis] “Curing the Web's Identity Crisis — Subject Indicators for RDF”, Paper by Steve Pepper, Ontopia http://www.ontopia.net/topicmaps/materials/identitycrisis.html
[Information Architecture] “Information Architecture with XML — A Management Strategy”, by Peter Brown http://www.XMLbyStealth.net
[Published Subjects] “Published Subjects: Introduction and Basic Requirements”, OASIS Technical Committee Recommendation http://www.oasis-open.org/committees/download.php/2897/pubsubj-pt1-1.01-cs.pdf
[Universal Resource Identifiers] “Universal Resource Identifiers — Axioms of Web Architecture”, by Tim Berners-Lee http://www.w3.org/DesignIssues/Axioms.html
[Web Architecture] “Architecture of the World Wide Web”, First Edition, Working Draft 9 December 2003. Latest edition at http://www.w3.org/TR/webarch/
Core Components Technical Specification (CCTS)
Electronic Business Markup Language (ebXML)
European Union (EU)
International Standards Organisation (ISO)
Organisation for the Advancement of Structured Information Standards (OASIS)
Technical Architecture Group (TAG)
Unified Modeling Language (UML)
United Nations Centre for Trade Facilitation and Electronic Business (UN/CEFACT)
Uniform Resource Name (URN)
World Wide Web Consortium (W3C)
[1] Allusion is made here to the paper by Tim Berners-Lee, used in the work of the TAG, “Universal Resource Identifiers — Axioms of Web Architecture” [Universal Resource Identifiers].