Keywords: Change Management, Design, Middleware, Schema, Metadata
Biography
Jim Gabriel is the inventor of CortexML, the world's first generic XML Evolution Management product. Jim has an MBA from the School of Life.
Version and source control for schemas and schema objects is badly needed by us all, especially in complex, multi-enterprise development situations. This paper describes how to support true versioning of schemas and fine-grained schema objects. An important conclusion is that schema evolution and schema development are essentially synonymous, in that the applicable techniques and requisite technology are the same for both.
1. Introduction
2. Why we need to version schemas
2.1 Extending the scope of a schema
2.2 Changing constraints
2.3 Bug-fixing
2.4 Enabling collaborative development
3. Techniques commonly used
3.1 Version attribute
3.1.1 Schema level
3.1.2 Root element level
3.2 Comments
3.3 Naming conventions or file system location
3.4 Namespace
3.5 Database / Repository
4. Designing for evolution
5. Enabling evolution with technology
The title of this paper suggests a technical introduction to the nebulous art of schema versioning. Indeed, that is what was originally intended. And yet this paper also poses a question, for the answer is by no means straightforward. In addressing the various technical options for identifying and differentiating multiple versions of the same schema, it is hard not to conclude that we are perhaps asking the wrong question. The schema versioning challenge facing us appears to be less about how to fit schema versioning into the current technology, and a great deal more about inventing the requisite technology to satisfy all the goals of schema versioning.
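In reaching a conclusion, this paper discusses four aspects of the schema versioning challenge: why we need to version schemas, the techniques commonly used, designing for evolution, and enabling evolution with technology.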
The most commonly used techniques for differentiating between schema versions are generally well understood. The longer-term issues of various strategies and the technical shortcomings of schema languages (as far as versions are concerned) are often less well understood. Accordingly, this paper focuses in greater detail on the latter—that is, the considerations and technical challenges that designers need to be aware of—and less on the former.
This paper uses XML Schema as the basis for technical discussion, although most of the principles discussed are applicable to other flavors of schema language.
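This section sets the scene for the discussion in the remaining sections. We need to version schemas for the following reasons: extending the scope of a schema, changing constraints, bug-fixing, and enabling collaborative development.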
Extensions to the scope of a schema are defined as any extensions not adversely affecting the existing scope. Typically, this means extending the namespace with new element (and other object) declarations in such a way that previously valid documents remain valid.
Extending the scope of a schema is usually only possible if the schema or schema family has been designed with evolution in mind. For example, when similar processing is required for disparate document types, and the designer can predict that new document types will be added to the application scope in the course of time, the family of schemas controlling the documents in the process is usually highly modular, so that new types can be added without disrupting the existing types. A business case for this requirement can be found in the distribution and sale of content from multiple content suppliers to mobile phone users through the telco portal (for example the Vodafone Live! portal using the VCML markup language designed by digitalML Ltd.). We already know what is being sold to users today, from MMS services to sports news, but cannot predict what the next commercial option will be.
A much-favored method for extending the scope of a schema is to create a root level in a high-level schema that functions as a container object; the container defines its own content model via a contentType definition (usually an element, although it is also possible with an attribute). The allowed contentTypes are constrained by type-specific schemas which can be included. Extending existing schema families with new contentTypes is a relatively simple exercise, although let us not forget that a schema or schema fragment is one small part of a much larger application. Web services and other software modules are also required to handle the processing of the XML constrained by the schema.
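The container pattern described above can be sketched from the processing side as follows. This is a minimal illustration, not the design of any particular schema family: the `container` element, the `contentType` attribute, the payload element names, and the handler registry are all hypothetical, and the contentType is carried in an attribute purely for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical container instance; all element and attribute names are
# assumptions made for this sketch.
DOC = """<container contentType="sportsNews">
  <sportsNews><headline>Cup final tonight</headline></sportsNews>
</container>"""

# Hypothetical registry: one handler per contentType the processor knows.
HANDLERS = {
    "sportsNews": lambda payload: "news:" + payload.findtext("headline", ""),
    "mms": lambda payload: "mms:" + payload.get("id", ""),
}


def process(xml_text):
    """Dispatch on the contentType declared by the container element."""
    root = ET.fromstring(xml_text)
    content_type = root.get("contentType")
    handler = HANDLERS.get(content_type)
    if handler is None:
        # A new contentType was added to the schema family, but this
        # processor has not been updated to handle it.
        raise ValueError(f"no processor for contentType {content_type!r}")
    payload = root.find(content_type)
    if payload is None:
        raise ValueError(f"container has no <{content_type}> payload")
    return handler(payload)
```

Adding a new contentType then means adding a type-specific schema and a matching handler. A processor that lacks the handler fails explicitly, which is the situation that version identification is meant to prevent.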
Versioning is important in order to identify to updated processors that the updated schemas are available and the new contentTypes can be processed without error.
Changing constraints means changing the schema in a way that adversely affects existing documents: documents that were valid before the change may no longer be valid after it. Examples of such a change are: making a previously optional element mandatory, applying a modified (and stricter) complex type to an element, changing the name of any object.
Changes are usually made to constraints due to scope-creep during development, or when the requirements change in a way that could not be foreseen by the application designer. Both have a similar, catastrophic impact. For example, a company merger forces an ID element that used to contain a unique integer to be constrained by a complex type that includes an integer, a time stamp, and a code representing the source of the data, all of which are mandatory (minOccurs=1, maxOccurs=1). We know immediately that nothing in the pre-modification application will validate in the post-modification application.
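The effect of the ID change can be illustrated with a small check. The Python standard library has no XML Schema validator, so the function below is a hand-written stand-in for one; the child element names (`value`, `timestamp`, `source`) and the sample documents are assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET

# Instance written against the original schema: ID holds a bare integer.
OLD_INSTANCE = "<order><ID>4711</ID></order>"

# Instance written against the modified schema (child names are hypothetical).
NEW_INSTANCE = (
    "<order><ID>"
    "<value>4711</value>"
    "<timestamp>2003-05-01T12:00:00</timestamp>"
    "<source>ACME</source>"
    "</ID></order>"
)

# Each child must occur exactly once (minOccurs=1, maxOccurs=1).
REQUIRED_CHILDREN = ("value", "timestamp", "source")


def id_valid_after_change(xml_text):
    """Stand-in for validation of the ID element against the new complex type."""
    id_elem = ET.fromstring(xml_text).find("ID")
    if id_elem is None:
        return False
    return all(len(id_elem.findall(name)) == 1 for name in REQUIRED_CHILDREN)
```

Under these assumptions `id_valid_after_change(OLD_INSTANCE)` is false and `id_valid_after_change(NEW_INSTANCE)` is true: every document produced before the change is rejected afterwards.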
Versioning is important to enable processors to locate and use the correct schemas for validation, or — and this is where it becomes altogether more difficult to manage — the correct web services.
Bug-fixing schemas usually presents the same issues as changing constraints. Bug-fixing a schema is a particularly difficult exercise, however, not because of the edits to the schema itself (generally easy) but because of the place in the overall scheme of things that the schema occupies. An object declared in one place in one schema can be referenced in hundreds of places at runtime in the XML-based system for which the schema was designed. If we assume that an erroneous situation in the first place causes us to describe the change as a bug-fix, predicting the domino effect of the fix is considerably more difficult than for a known, working situation such as described above for changing constraints.
Consider the analogy of mainstream software development, with conventional programming effort using third-generation programming languages, compilers and a version control system. Versioning a schema because of a bug-fix is important to provide the same level of information to your developers that you would expect of a software development system: release notes, patch numbers, majorVersion.minorVersion identifiers, regression test results, and so on.
Collaborative development with schemas is defined as development shared by a team of two or more developers. The schema, associated XML constructs such as XSLT files, and the software that depends on the constraints described by the schema are worked on by multiple developers. This is by far the most important reason for applying good version control to schemas when developing software systems that use those schemas, and this is exacerbated by the numbers: the more developers, the more organizations involved (multiple teams), and the higher the likelihood of change, the greater the need for a watertight strategy for versioning schemas.
Since schemas are text files containing declarations, they are easy to break, their contents are difficult to control, and the impact of change is difficult to predict. There is no automated way of charting the dependencies between a declaration in a schema and every instance of every object relying on that declaration. In a project management sense with multiple developers, changing the definition of any object can often mean that somebody has a lot of clearing up to do, with the attendant (and usually invisible) budget drain. XML declarations are not objects that can be dealt with as if they were source code. XML objects from a programmer’s perspective are schemas, transformations, and so on, all of which provide a container for multiple references to single objects, duplicated to the nth degree.
Managing team development of schemas and associated other files in complex XML-driven environments is generally limited to the management of the XML objects produced by the developers. For example, in a system that relies on families of schemas, developers work on a schema, check it into a repository of some kind, and then other developers check it out and use it to develop associated pieces of the puzzle (java classes, transformations, and so on).
Applying version and source control is possible at the level of the container in which the schema is placed. When the management happens at the container level, the granularity is too coarse for a semantic understanding of the actual contents to be supported, which makes canonical comparisons impossible. Such a system therefore has no control over what the teams of developers build against the schema, which causes an exponential growth in the number of risks and potential errors per XML object compared with the situation where only one developer is working on the system.
When the process of schema design and implementation is spread across multiple team members in multiple locations, you need an owner, and an agreement in the field to honor a set of published schemas. Modifications are presented as change requests to the owner, who may or may not make the requested changes. When changes occur, a new cut of the schemas is published and deployed to the field.
The technology to support such a system relies heavily on processes and people. Schemas can be kept in a source control system (preferably one with strong support for versions), good programming practices can be applied to comments and case numbering, a UML model (or equivalent) should be maintained in a central place, visible to the organization, and everybody must agree the basic standards of schema design:
The bottom line is this: without an infrastructure to manage XML schema objects at a much finer granularity than the schema containing the objects, and thus allow us to sensibly version more than the container file of the schema itself, storing schemas in a CVS-like system is about the only applicable solution.
The following techniques are most commonly used to differentiate versions: version attributes, comments, naming conventions or file system location, namespaces, and a database or repository.
There are two commonly used version attributes: one on the schema declaration itself (schema level) and one on the root element of the instance (root element level).
The schema declaration version attribute we get ‘for free’ (although there are no constraints or guidelines for its use); the attribute on the root element we must declare ourselves. Generally, it is highly recommended to identify the version of the schema both in the schema itself and in the root element of instances that validate against that schema.
The advantage of the schema level version attribute is that applications can read the value and behave accordingly. It is, or should be, a Write Once, Read Many versioning action. Furthermore, if documents are validated against a previous schema version and persisted (as in an XML-compliant content management system), provided that the difference between the previous schema version and the new schema version does not render the document invalid, the documents do not need to be modified to handle the new version attribute. Be aware that validators ignore the schema level version attribute, because there are no recommended constraints for its use.
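As a minimal sketch of an application reading the schema-level attribute and behaving accordingly, the following uses only the Python standard library. The sample schema and the version value "1.3" are invented for the illustration; the `version` attribute on `xs:schema` itself is part of XML Schema.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Invented sample schema carrying a schema-level version attribute.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" version="1.3">
  <xs:element name="order" type="xs:string"/>
</xs:schema>"""


def schema_version(xsd_text):
    """Return the value of the version attribute on xs:schema, or None."""
    root = ET.fromstring(xsd_text)
    if root.tag != f"{{{XS}}}schema":
        raise ValueError("not an XML Schema document")
    return root.get("version")
```

Because validators ignore the attribute, any meaning attached to the returned string (ordering, compatibility) is a convention that the application has to define for itself.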
When the version information is captured in an attribute on the root level element, the only way to enforce it is to make sure that instances set the value correctly. The version attribute in the instance can help applications locate the correct schema version at runtime. Unfortunately, this is only possible by first parsing the document, which carries a potential performance penalty. Another drawback of this method is that creating a new version of a schema can force potentially unnecessary modifications when instances are not adversely affected by the changes to the schema, unless conventions are adopted that honor compatibility for later version numbers.
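One way to limit the parsing cost is to stop as soon as the root start tag has been seen. The sketch below does that with `iterparse`, and adds one possible compatibility convention of the kind mentioned above. The attribute name `version`, the `major.minor` format, and the rule that a processor accepts equal major numbers with an equal or lower minor number are all assumptions, not something the schema language prescribes.

```python
import io
import xml.etree.ElementTree as ET


def root_version(xml_bytes, attr="version"):
    """Read the version attribute from the root element without building the tree."""
    # The first "start" event is always the root element, and its
    # attributes are already available at that point.
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start",)):
        return elem.get(attr)
    return None


def is_compatible(instance_version, processor_version):
    """Hypothetical convention: same major number, minor not newer than the processor."""
    i_major, i_minor = (int(part) for part in instance_version.split("."))
    p_major, p_minor = (int(part) for part in processor_version.split("."))
    return i_major == p_major and i_minor <= p_minor
```

With such a convention, a processor built for 2.3 accepts instances marked 2.1 unchanged and rejects instances marked 3.0, so a minor schema revision does not force edits to existing instances.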
Comments are defined as any form of annotation, be that inline comments, appinfos or documentation objects. Designers should avoid comments in schemas for capturing application-critical version information (e.g. <!-- version=1.53 -->). Conversely, designers should make full use of comments to document their schemas: the purpose of important constructs, modification histories, author details, and so on.
Comments, however they are implemented, offer the same scope for identifying schema versions as version attributes (see the previous section), with the significant disadvantage that developers are forced to create more complicated parsers. The main advantage, if there is one, is that comments can be as complex or as simple as a designer wishes. All conventions for their use must be defined, implemented in the processing application, and made public, however. Whatever the complexity, and wherever they are created, an application must always parse the schema or the instance (or both) in order to discover the version information and behave accordingly.
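The extra parsing burden can be seen in a small sketch. It assumes the `version=1.53` comment convention from the example above, placed before the schema element; the regular expression that interprets the comment is exactly the kind of private convention that would have to be defined and published.

```python
import re
from xml.dom import minidom

# Sample schema using the comment convention; the content is invented.
SCHEMA = """<!-- version=1.53 -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order" type="xs:string"/>
</xs:schema>"""

# Private convention: "version=" followed by dotted digits.
VERSION_RE = re.compile(r"version\s*=\s*([0-9]+(?:\.[0-9]+)*)")


def version_from_comments(xml_text):
    """Scan top-level comments for a version marker; return None if absent."""
    doc = minidom.parseString(xml_text)
    for node in doc.childNodes:
        if node.nodeType == node.COMMENT_NODE:
            match = VERSION_RE.search(node.data)
            if match:
                return match.group(1)
    return None
```

Note that the sketch only inspects comments at the top level of the document. A comment placed elsewhere, or written in a slightly different form, is silently missed.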
A common trick for identifying a schema version is to use a naming convention or to store it in a different folder on the file system. A schema called order004.xsd is the previous version to the schema called order005.xsd. More succinctly, a timestamp is also commonly used.
Using naming conventions or a different file system location is fraught with problems, however. For example, creating a new version of a schema will force modifications to all instances that were validated by the previous version. Any processing application must know how to interpret the various names or locations, while the reason for the newer version may remain unclear from the name or location. Managing such development between versions is typically error-prone, particularly in more complex environments where the modified schema is imported by another schema. The ‘domino effect’ can be difficult to quantify.
On the other hand, when you need to enforce a new schema version and strictly differentiate between the new version and the previous version in an application, changing the name or the location of the schema is an excellent, if somewhat final, method.
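A processing application that relies on a naming convention has to encode the convention somewhere. The following sketch assumes the `order004.xsd` style used above (a base name followed by a zero-padded sequence number); the pattern and the function names are illustrative only.

```python
import re

# Assumed convention: <base><sequence number>.xsd, for example order004.xsd
NAME_RE = re.compile(r"^(?P<base>[A-Za-z_]+)(?P<num>\d+)\.xsd$")


def parse_name(filename):
    """Split a schema file name into (base, sequence number), or return None."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    return match.group("base"), int(match.group("num"))


def latest(filenames, base):
    """Return the file name with the highest sequence number for the given base."""
    versions = []
    for name in filenames:
        parsed = parse_name(name)
        if parsed is not None and parsed[0] == base:
            versions.append((parsed[1], name))
    return max(versions)[1] if versions else None
```

The code can order the versions, but it cannot tell why order005.xsd replaced order004.xsd, or whether the two are compatible.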
When you need to enforce a new schema version and strictly differentiate between the new version and the previous version in an application, changing the targetNamespace in the schema declaration provides what is arguably the most elegant method for identifying and differentiating between schema versions. Any schema including the modified schema must be updated to honor the new targetNamespace, and (if relevant) all document instances that validate against the previous version must be modified. Thus it is possible to cleanly and efficiently introduce the incompatibility that prevents mistakes when processing at runtime.
Conversely, if you need an application and existing document instances to tolerate the new version, changing the targetNamespace is not to be recommended.
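The clean break introduced by a new targetNamespace can be sketched from the processor's side: the namespace of the root element selects the processing path, and anything unrecognized is rejected outright. The namespace URIs and processor names below are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, one per incompatible schema version.
PROCESSORS = {
    "http://example.com/ns/order/v1": "process_v1",
    "http://example.com/ns/order/v2": "process_v2",
}


def root_namespace(xml_text):
    """Return the namespace URI of the root element ('' if it has none)."""
    tag = ET.fromstring(xml_text).tag  # ElementTree form: "{uri}local"
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def select_processor(xml_text):
    """Pick the processor registered for the instance's namespace."""
    namespace = root_namespace(xml_text)
    if namespace not in PROCESSORS:
        raise ValueError(f"unsupported schema namespace: {namespace!r}")
    return PROCESSORS[namespace]
```

An instance in the old namespace can never be handed to the new processor by accident, which is the intended incompatibility. The cost is that every instance and every including schema has to change when the namespace does.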
Managing different versions of schemas side-by-side requires a structured approach to cataloguing and storing the schemas concerned. Inevitably, once an environment exceeds a relatively low level of complexity, a database or repository mechanism becomes very useful. Schema repositories are not a particularly new concept, and various examples are available commercially. Repositories are useful during development, particularly for collaborative development across one or more teams. In an up-and-running runtime environment, repositories are less commonly needed, however.
The greatest limitation when using a repository is that the level of granularity is almost invariably at the level of the entire schema itself, or at the level of any meaningful (as opposed to arbitrary) fragment that the developer chooses to store. Repositories as such are schema storage systems enriched with the kind of functionality one would expect from any mature database management system. It is usually possible to capture semantic information about modifications between versions, and important project management information such as who performed the last modification, and when. At runtime, any processing application that needs a schema must be programmed to know where and how to find it.
In the interests of completeness, this section addresses the subject of evolving ontologies, or versioning XML vocabularies, over and above what is discussed in Paragraph 15, above. While versioning XML vocabularies is not the main focus of this paper, it is one of the primary reasons for wanting to version schemas in a highly data-centric (as opposed to document-centric) application environment. As this subject requires more space and attention than I can give it in this paper, I refer readers to the following excellent discussion of the extension of XML languages: Versioning XML Vocabularies by David Orchard.
The most important technique for evolving XML vocabularies is to allow for extensibility in the original design. This is typically achieved through a careful use of wildcards, allowing extensions through namespaces, allowing applications to ignore unknown objects, and forcing applications to understand unknown objects when no other option is available. In any environment that uses an extensible container language (SOAP is a good example), the rules described by David Orchard are applicable.
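The ignore and must-understand behaviors can be sketched for a processor that knows a single namespace. Everything specific here is an assumption: the namespace URIs, the element names, and in particular the unqualified `mustUnderstand` attribute, which is a hypothetical marker loosely modeled on the SOAP header mechanism rather than part of any schema language.

```python
import xml.etree.ElementTree as ET

KNOWN_NS = "http://example.com/ns/order"  # hypothetical vocabulary namespace

DOC = """<o:order xmlns:o="http://example.com/ns/order"
         xmlns:x="http://example.com/ns/ext">
  <o:id>4711</o:id>
  <x:giftWrap>true</x:giftWrap>
</o:order>"""


def split_tag(tag):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    if tag.startswith("{"):
        namespace, _, local = tag[1:].partition("}")
        return namespace, local
    return "", tag


def process(xml_text):
    """Collect known children; ignore unknown ones unless flagged mandatory."""
    known, ignored = {}, []
    for child in ET.fromstring(xml_text):
        namespace, local = split_tag(child.tag)
        if namespace == KNOWN_NS:
            known[local] = child.text
        elif child.get("mustUnderstand") == "true":
            # The sender insists on this extension and we do not know it.
            raise ValueError(f"cannot process mandatory extension {child.tag}")
        else:
            ignored.append(child.tag)
    return known, ignored
```

For the sample document the processor keeps `id` and skips the `giftWrap` extension, so instances written against a later, extended vocabulary remain processable by the older application.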
Before we discuss how to enable schema evolution with technology, let me digress a little, for I am reminded at this juncture of the mind-puzzle presented to me as a high-school student. In an experiment designed to research the aptitude of the students for ‘lateral’ thought (loosely defined as the ability to see ingenious and unconventional solutions to given problems) we were regularly asked a series of questions. The supposition was that in a high-school community composed of science students at one end of the spectrum and arts students at the opposite end of the spectrum, those with primarily Alpha brains or primarily Beta brains were less likely to think ‘laterally’. Those rare few who were as capable in science subjects as the arts, due to the strong mix of Alpha and Beta thought processes bubbling away in their heads, were more likely to think laterally.
The mind-puzzle that comes to me now is this: in a house with a basement, a large boulder fills at least half the basement room. The stairs up to ground-level are narrow and not capable of supporting the weight of more than two people. The door at the top of the stairs is much narrower than the boulder. You need to remove the boulder from the house with minimum damage to the house and the boulder, and the question is “How?”
Science students tended towards breaking the boulder into small pieces despite the preferred end state of a complete boulder, and carrying them out of the house. Reassembly was, after all, a viable option. Arts students largely agreed, although some suggested moving to a new house and others still proposed decorating the boulder and turning it into a ‘feature’. Those gifted with the ability to think laterally suggested digging a large hole in the basement floor and carrying the dirt outside, thus allowing the boulder to sink below the surface and out of sight. Voilà!
Why this is relevant to the subject of schema versioning will become apparent at the end of this section, so read on, dear reader.
The technical infrastructure that is needed to enable schema evolution must ensure the following:
Furthermore, the technology must allow for and manage access to multiple versions, provide full support for multiple developer — and multiple team — development, and manage schema objects at a very fine level of granularity (and reusability). The granularity must be set at the level of every constituent object: attribute, simple type, complex type, element, and so on. The granularity cannot possibly be managed at the schema level.
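To make the idea of object-level granularity concrete, the following sketch compares two versions of a schema declaration by declaration rather than file by file. It is only an illustration under simplifying assumptions: it looks at named top-level declarations, its fingerprint ignores attribute order and insignificant whitespace but compares prefixed names such as `xs:integer` as plain text, and both sample schemas are invented.

```python
import xml.etree.ElementTree as ET

V1 = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" version="1.0">
  <xs:element name="order" type="OrderType"/>
  <xs:complexType name="OrderType">
    <xs:sequence>
      <xs:element name="id" type="xs:integer"/>
      <xs:element name="note" type="xs:string" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""

V2 = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" version="1.1">
  <xs:element name="order" type="OrderType"/>
  <xs:complexType name="OrderType">
    <xs:sequence>
      <xs:element name="id" type="xs:integer"/>
      <xs:element name="note" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <xs:simpleType name="Code">
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
</xs:schema>"""


def fingerprint(elem):
    """Order-insensitive for attributes, whitespace-insensitive for text."""
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        (elem.text or "").strip(),
        tuple(fingerprint(child) for child in elem),
    )


def index(xsd_text):
    """Map (kind, name) of each named top-level declaration to its fingerprint."""
    root = ET.fromstring(xsd_text)
    return {
        (child.tag.rsplit("}", 1)[-1], child.get("name")): fingerprint(child)
        for child in root
        if child.get("name")
    }


def diff(old_xsd, new_xsd):
    """Report added, removed and changed declarations between two versions."""
    old, new = index(old_xsd), index(new_xsd)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

For the two samples, the result names the `OrderType` complex type as changed and the `Code` simple type as added, while the `order` element is reported as untouched. That is the kind of statement a file-level version number cannot make.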
The essential concept when evolving schemas is that this is a development activity, usually carried out by developers who are highly experienced and accustomed to using sophisticated application development environments. The available tools for their non-XML development work typically offer links with source and version control systems, modeling environments, code generators, and the like. Consistency checks and compile-time error control are necessities, not luxuries. All these are needed in any technical infrastructure for evolving XML schemas.
Evolving a schema at its simplest level requires little more than a text editor. However, in a complex environment, just as with any application development environment, we need to know that changes are correct, that schemas are complete, and that all objects at any level of granularity are consistent. We need to know, when multiple developers are sharing their work, that object definitions are truly single-source. There should be transaction control, locking strategies, inter-developer communication, version control and source control.
When multiple developers collaborate on the development of schemas, and parallel changes occur to the same object, conflict resolution is necessary when checked-in objects are integrated and made visible to other developers in the system. Conflict resolution is also necessary when developers upgrade their work-in-progress project spaces to new build levels of the central repository.
Ideally, ‘what if’ impact analysis allows project managers to quantify the cost of any given modification, and scope the work resulting from the change. Generators should produce the various deployment objects (xsd, xslt, and so on) because nobody should modify schemas by editing an xsd file. Schemas should be the auto-generated end result of the exercise, and not the starting point. The ‘file’ paradigm is anathema. A build manager should be able to produce the usable results at one press of a button.
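A very small part of such an impact analysis can be sketched for a single schema file: given the name of a type, list the top-level declarations that refer to it. The sketch makes strong simplifying assumptions. It matches `type`, `ref` and `base` attributes by local name only (prefixes are stripped, so namespaces are ignored), it reports direct references only, and it says nothing about instances, transformations or code that depend on the schema. The sample schema is invented.

```python
import xml.etree.ElementTree as ET

SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="IdType">
    <xs:sequence><xs:element name="value" type="xs:integer"/></xs:sequence>
  </xs:complexType>
  <xs:complexType name="OrderType">
    <xs:sequence><xs:element name="id" type="IdType"/></xs:sequence>
  </xs:complexType>
  <xs:element name="order" type="OrderType"/>
  <xs:element name="invoice">
    <xs:complexType>
      <xs:sequence><xs:element name="id" type="IdType"/></xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

REFERENCE_ATTRS = ("type", "ref", "base")


def dependents(xsd_text, target):
    """Names of top-level declarations that directly reference `target`."""
    hits = set()
    for top in ET.fromstring(xsd_text):
        for node in top.iter():  # includes the top-level declaration itself
            if any(node.get(attr, "").split(":")[-1] == target
                   for attr in REFERENCE_ATTRS):
                hits.add(top.get("name") or top.tag)
    return sorted(hits)
```

Asking for the dependents of `IdType` in the sample returns `OrderType` and `invoice`. A change to `IdType` also reaches `order` indirectly through `OrderType`, which this sketch does not follow.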
When the build manager presses the magic button and generates all deployment objects for that particular version of the model (schemas, transformations, et al), the integrity of the results must be a given. That is, the build manager must know that the results are consistent, complete, and correct.
This, fundamentally, is the heart of the problem. Versioning schemas is not about setting attributes, interpreting namespaces, using repository technology and the like. It is a much bigger problem. We ask ourselves, “How to version schemas?” and devise ways of making the technology recognize different flavors of what is essentially the same schema. We should be asking ourselves, “Why, and when, is it necessary to version schemas?” Remember that versioning schemas in a simple situation is never a problem. It is only above a certain level of complexity that we experience real pain in not being able to evolve schemas in a controlled way. The following question must therefore be, “Why is the complexity unmanageable and what can we do about it?”
The complexity is unmanageable when the following factors apply:
These factors are par for the course in most business-to-business marketplace application environments. Ultimately, we care about evolving schemas in a controlled way in complex situations because of a set of problems for which we have long ago found elegant solutions in a conventional programming environment but which have not been adequately addressed for XML: version control, source control, support for team development, locking strategies, controlled integration and build processes, very fine level of object granularity, true single-source techniques, compilers and code-checkers, and so on. (For all these to be possible, schema evolution must a priori be a model-driven exercise, which requires a robust and extensible repository-based engine.)
When we consider why we would want to version schemas, it is clear that schema evolution and schema development are essentially synonymous in their goals. But rather like the boulder in the basement problem, it is all very well coming up with sophisticated solutions to the problem of removing the boulder, but when the boulder is removed in pieces we cannot claim to have found the most elegant solution; with schema versioning, none of the conventional means available to us during schema development can be said to be optimal or wholly elegant. Finally, even when we apply the lateral solution of digging a hole and sinking the boulder out of sight, we have to concede that we have failed; the challenge, after all, was to remove the boulder from the basement and technically speaking it is still in the basement. The moral in this analogy is that however we end up versioning schemas, we need to address the right problem. The problem is not how to fit schema versioning into the current technology, but rather how to create the requisite technology to satisfy all the goals of schema versioning.