XML Europe 2003

Managing Multiple XML Schemas in the UK's Inland Revenue

Abstract

In 1997, the UK Government established a policy that aimed to have all local and central government services web-enabled by 2005. For the Inland Revenue (IR), this meant offering submission mechanisms for organisations wishing to provide tax returns over the Internet. For the 400 or so software developers that provide payroll and financial applications to UK businesses, this meant developing XML interfaces to allow their products to link to the IR's systems.

There were three primary challenges in the work to develop and subsequently maintain the IR's online submissions capability: metadata management; publication of controlled documentation; and provision of validation services. The first of these was the result of the need to generate about thirty schemas describing the complex and arcane business rules that govern tax submissions. Knowledge of these rules was spread amongst numerous geographically-separate business experts. This - and the fact that the specifications had to be followed by 400 software companies - meant that it was necessary to construct and maintain an authoritative repository of metadata.

The second challenge arose primarily out of a lack of XML expertise amongst both the IR's business experts and the technical people in the software companies. So, whilst the creation and publication of XML Schema (and XBRL taxonomies) was essential, both creators and consumers of the specifications needed much more documentation, e.g., business rules, instance documents, descriptive samples, human-readable XML, spreadsheets and other helpful outputs to get them operational quickly. And all those outputs needed to be managed and communicated across the community as the standards themselves went through multiple revisions. To achieve this, a series of APIs were written to generate any requisite output from the metadata repository. These outputs were made available in a controlled fashion via a web page.

The third challenge was to develop online services that could validate data submitted to the IR. As XML Schema can only capture the simplest cross-field validation rules, it was also necessary to maintain code fragments alongside the metadata. A special generator was developed to build and deploy validation services. These were used extensively by the software developers who needed to test the new IR interfaces in their products.

In working with the Inland Revenue, DecisionSoft developed a new approach to maintaining interface definitions and so ensuring semantic interoperability. This was achieved by storing metadata at the highest level of abstraction in a central maintainable metadata repository from which any XML asset, code engine or documentation set could be generated on demand. Understanding the limitations of XML Schema as a source of metadata allowed DecisionSoft to reduce technical complexity and make management of XML assets less cumbersome.

Table of Contents

1. Background
2. Business requirements of the project
2.1. Data and documentation requirements
2.2. Process requirements
3. Key elements of the technical solution
3.1. A common data set
3.2. Documentation and services for third-party software developers
3.3. Data validation processes
4. Development of an interface specification repository
4.1. X-Meta - the metadata repository
4.2. Defining data constraints
4.3. Validation generation
4.4. Integrating with intermediaries
5. Latest developments
5.1. Live transformation
5.2. EDIFACT validation
5.3. XBRL
6. Summary
Biography

1. Background

The United Kingdom's Inland Revenue is responsible, under the overall direction of UK Treasury ministers, for the efficient administration of income tax, tax credits, corporation tax, capital gains tax, petroleum revenue tax, inheritance tax, national insurance contributions and stamp duties. It has taken the lead in the Government's initiative to provide all UK citizens with "joined-up" Internet access to all government services.

The Inland Revenue maintains many forms for use in communicating with taxpayers. These support the full range of the Inland Revenue's business, including maintenance of the National Insurance system, and calculation and collection of corporate and personal taxes. In 1999, the UK Government committed to making all the Inland Revenue's forms available on-line, through the World Wide Web, by 2005. Of these services, one of the most complex was the end-of-year employer payroll submission, a series of forms known as P35, P14 and P38A. Online submission of these forms was expected to originate from Web-based HTML forms (for small employers) and third-party payroll systems (for medium and large employers). Submissions were to be made as XML via a central government gateway, then passed through Revenue-based validation and translation layers and into two of the Government's largest systems - the Computerised Operation of Pay As You Earn (COP), which handles payroll taxes, and NIRS/2, which maintains National Insurance records. The whole initiative was to be known as the Filing by Internet (FBI) project.

Some of the Inland Revenue's submissions processes had already been automated, with a number of returns already being accepted as magnetic media (tape or diskette submissions) or EDI (EDIFACT and related submissions). However, the change from specialist serial data interfaces to the more easily accessible, nested, XML structures meant that the work of converting paper forms into electronic format had to be approached afresh for FBI.

2. Business requirements of the project

2.1. Data and documentation requirements

The FBI work had to be based on the latest paper forms for conventional submission. Developing XML representations of these documents required the generation of many XML Schemas, some of which were to be interdependent. However, these were only a small part of the documentation that was needed for the project.

Historically, one of the biggest burdens on the Inland Revenue has been the transcription of data from paper forms into the organisation's back-office systems. In recent years, it has encouraged the development of third-party software systems to automate this process, first for magnetic media and EDI submissions, and now for the FBI project. For FBI, the Inland Revenue wanted to provide a full development package for software vendors, consisting of schemas, documentation and data samples, and crucially, an online test service for the new software systems.

2.2. Process requirements

The FBI development process was to be a complex one requiring:

  • collation of business rules for handling data;

  • establishment of expectations for data format in the back office systems;

  • design, generation and publication of schemas with ancillary documentation and examples;

  • review of schemas and business rules;

  • creation of validation code;

  • definition of test data sets;

  • building of an online validation engine; and

  • generation of FBI/back-office interface code.

These processes required the involvement of a wide variety of geographically separate parties, many of them outside the Inland Revenue.

For example, business rules for validation of the submitted data had to be made explicit. The rules had to be drawn from a number of sources: systems staff familiar with EDI submissions; domain experts in the Inland Revenue and the National Insurance Contributions Office (NICO); the Electronic Business Unit (EBU) staff responsible for implementing Filing By Internet (FBI); and legal staff charged with updating tax collection practices. Once the business rules had been determined, XML structures could be designed to reflect business requirements. At this stage, a package consisting of XML Schema, validation requirements and sample data could be produced for third-party vendors as initial specifications for their submission applications. The schema package would be subject to review externally by vendors, and internally by domain and legal experts. The review process might be repeated several times while submission and processing mechanisms were clarified. Once a final schema had been agreed, a range of systems would be built:

  • extensions to existing third-party payroll systems to provide for online submissions;

  • an online validation service for the testing of third-party submission applications;

  • a validation mechanism for receipt of data submitted via the government gateway; and

  • a translation mechanism for converting incoming XML into formats acceptable to the Inland Revenue's back office systems.

The timetable for the FBI project was constrained by the need to fit in with existing form submission dates. In the time between the March 2000 Budget, which effectively laid down the requirements for the 2001 employer submissions, and the filing dates in Spring 2001, the FBI mechanisms had to be specified and implemented in full. Given the complexity of the development process, meeting this deadline necessitated a radical new design approach.

3. Key elements of the technical solution

The data, documentation and process requirements of the FBI project required the development of three core capabilities:

  1. metadata management;

  2. publication of controlled documentation; and

  3. provision of validation services.

3.1. A common data set

The Inland Revenue realised that the extension of their submission mechanisms from manual to magnetic media, to EDI, and now to XML, would put great strains on systems and business departments. A cumbersome review mechanism was used to bring the Inland Revenue's "corporate memory" to bear on each new structure and change. This was clearly unsustainable in the long run. So DecisionSoft decided to centralise forms data definitions in a "common data set", or metadata repository, which would provide a single source of format information and business rules for a given data element. A piece of data, such as a personal tax code, might be used in a number of different forms, each with variant formats and differing validation rules, with the potential for different versions for each tax year. In addition, each piece of data could have different formats in its paper, magnetic media, EDIFACT, and XML versions. Rather than attempt to impose uniformity on a series of complicated, mission-critical, systems, the Common Data Set would centralise the Inland Revenue's "corporate memory" and provide visibility for inconsistencies between implementations and uses of the same underlying datatypes.
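As a rough illustration of the idea, a single data element with its per-form, per-year and per-format variants might be modelled as follows. This is a minimal sketch and not the actual repository design: the class names, field names, form labels and patterns are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(frozen=True)
class Variant:
    """One concrete definition of a data element (field names are illustrative)."""
    form: str       # hypothetical form label, e.g. "FORM_A"
    tax_year: int   # e.g. 2001
    fmt: str        # "paper", "magnetic", "EDIFACT" or "XML"
    pattern: str    # format constraint, written here as a regular expression


@dataclass
class DataElement:
    """A single entry in the common data set, holding every known variant."""
    name: str
    variants: List[Variant] = field(default_factory=list)

    def lookup(self, form: str, tax_year: int, fmt: str) -> Optional[Variant]:
        """Return the variant registered for this form, year and format, if any."""
        for v in self.variants:
            if (v.form, v.tax_year, v.fmt) == (form, tax_year, fmt):
                return v
        return None

    def distinct_patterns(self) -> Set[str]:
        """More than one pattern here signals inconsistent definitions in use."""
        return {v.pattern for v in self.variants}


# Invented example values; these are not real Inland Revenue formats.
tax_code = DataElement("TaxCode", [
    Variant("FORM_A", 2001, "XML", r"[0-9]{1,4}[A-Z]"),
    Variant("FORM_A", 2001, "EDIFACT", r"[0-9]{1,4}[A-Z]"),
    Variant("FORM_B", 2001, "XML", r"[0-9A-Z]{1,6}"),
])
```

Holding every variant under one element is what makes inconsistencies visible: here `tax_code.distinct_patterns()` returns two patterns, showing that the two hypothetical forms disagree about the same underlying datatype.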

3.2. Documentation and services for third-party software developers

Given the complexity of metadata and the number of disparate parties involved in the development process, a controlled publication process was essential. All documentation had to be managed and communicated across the community as the standards themselves went through multiple revisions. To achieve this a series of plugins were written to generate any requisite output from the metadata repository. These outputs were made available in a controlled fashion via a web page.

3.3. Data validation processes

The Inland Revenue had two main requirements for data validation. First, all data passed to the IR from the Government Gateway had to be validated fully before being passed to its back-office systems. Second, most of the validation mechanisms had to be made available as public services so that third-party developers could test their submission software.

In order of increasing complexity there are, effectively, five different stages of validating an XML document. These are:

  1. Well-formed XML

  2. Structural validity (per XSDL Part 1 or DTD)

  3. Data-type validity (XSDL Part 2)

  4. Co-constraint validity ("cross-field rules")

  5. Document external validation ("name found in external database")

Data being submitted to the Inland Revenue clearly had to pass all these tests. As a starting point, incoming data was expected to be schema-compliant, that is, each data element had to meet the requirements specified in the XML Schema (and therefore pass the first three stages above). At the next level, internal business rules specified the relationship between data elements. These rules might include consistency checking for sub-total values, or rules which require certain elements to have specified values depending on the value of other elements. Lastly, submissions which were internally consistent had to be validated against actual data already held in the IR's systems.

For the Inland Revenue, the validation was further complicated by the need to support test data (which would not be expected to pass stage five) and the requirement to include the possibility of additional rules at stage four for live document validation (which would not be applied to test data).

For the FBI project, the Inland Revenue decided on implementing a sequence of five different validation process levels, each one dependent in turn on passing the previous level. They were:

  1. Passes schema validation [implies Stages 1, 2, 3]

  2. Passes public co-constraint validation [implies Stage 4]

  3. Successfully submits through Government Gateway

  4. Passes remote database validation [implies Stage 5]

  5. Passes private co-constraint validation [implies Stage 4]

For the public developer test service, the first two levels were provided over the Web, with further access to the same service through the Government Gateway to allow developers to reach level three.

For the validation of live submissions, all five levels were tested in turn, with failure at any level resulting in immediate rejection of the submission.
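The fail-fast sequencing of levels can be sketched as an ordered list of checks, where the first failure ends processing. This is an illustrative sketch only: the function names are assumptions, the well-formedness test stands in for full schema validation, and the remaining levels are placeholders that always pass.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Tuple

Check = Callable[[str], bool]


def is_well_formed(document: str) -> bool:
    """Stand-in for level 1; a real service would also check structure and datatypes."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


def always_pass(document: str) -> bool:
    """Placeholder for levels whose real logic is outside this sketch."""
    return True


LIVE_LEVELS: List[Tuple[str, Check]] = [
    ("schema validation", is_well_formed),
    ("public co-constraint validation", always_pass),
    ("Government Gateway submission", always_pass),
    ("remote database validation", always_pass),
    ("private co-constraint validation", always_pass),
]

# The public developer test service exposed only the first two levels over the Web.
PUBLIC_WEB_LEVELS = LIVE_LEVELS[:2]


def run_levels(document: str, levels: List[Tuple[str, Check]]) -> Optional[str]:
    """Run checks in order; return the name of the first failing level, or None."""
    for name, check in levels:
        if not check(document):
            return name  # immediate rejection, later levels are never attempted
    return None
```

A malformed submission such as `run_levels("<a><b></a>", LIVE_LEVELS)` is rejected at the first level, so the more expensive later checks are not run.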

4. Development of an interface specification repository

4.1. X-Meta - the metadata repository

DecisionSoft's X-Meta product suite was used to provide a Common Data Set for the Inland Revenue. It consisted of:

  • a component metadata database with input and output APIs

  • a client application for editing metadata

  • a set of output generator plugins

  • web-based repository and validation services

The metadata database provided a single point of input for the definition of datatypes and business rules for - potentially - all forms used by the Inland Revenue. Individual datatypes were versioned by format, tax year, and form. These could be compiled into data structures for each of the formats commonly used by the Inland Revenue (EDIFACT, XML, GFF).

The generators matched the datatype definitions with validation instructions, documentation and usage information. Formal structural descriptions would be generated automatically in the appropriate format (XSDL or XDR schemas for XML, MIG documents for EDI).
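To give a flavour of what a generator plugin does, the following sketch emits an XML Schema simple type from a datatype definition held as plain values. It is an assumption-laden illustration, not X-Meta's plugin API: the function name, the type name and the pattern are invented, and only Python's standard library is used.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)  # serialise schema elements with the usual xs: prefix


def simple_type(name: str, base: str, pattern: str) -> str:
    """Render one datatype definition as an xs:simpleType restricted by a pattern."""
    st = ET.Element(f"{{{XS}}}simpleType", {"name": name})
    restriction = ET.SubElement(st, f"{{{XS}}}restriction", {"base": base})
    ET.SubElement(restriction, f"{{{XS}}}pattern", {"value": pattern})
    return ET.tostring(st, encoding="unicode")


# Hypothetical datatype; the name and pattern are made up for the example.
schema_fragment = simple_type("ExampleCodeType", "xs:string", "[A-Z]{2}[0-9]{4}")
```

Because the schema fragment is derived from the stored definition rather than written by hand, a second generator reading the same values could produce the matching documentation or sample data without the two drifting apart.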

Over the period of the FBI project a number of different "stakeholders" have been identified, each with different documentation requirements. Business process owners within the Inland Revenue needed a formal description of business rules; government interoperability requirements mandated a full set of XSDL schemas; external developers required fully indexed specifications; internal developers needed extensive test data. Over time, new generators have been added to provide a mutually consistent range of outputs including:

  • comparative datatype documentation in, e.g., spreadsheets;

  • automatically-generated schemas;

  • schema-valid sample data messages;

  • descriptive sample data messages;

  • test data sets;

  • submission validation code;

  • interface translation code; and

  • interface specifications.

This architecture was developed in order to speed up the creation of schemas and other documentation, by replacing the conventional process of manually-updated Microsoft Word files and repeated proof-reading. Automating the production of documentation allowed the design-review cycle to be dramatically shortened while still including external stakeholder forums.

By adding plugins for generating validation code and regression test data, deployment of test services could be properly synchronised with publication of documentation, and the deployment cycle itself cut from a matter of months to less than an hour. This was critical for the Inland Revenue, whose legislation and tax-year driven annual cycle effectively guarantees that all its submission mechanisms will undergo revision at least once a year.

The metadata repository was thus used to store not only the full structural definition of each schema and instance, but also the validation expressions required to validate instances at each level, together with English descriptions of the validation rules. Storing validation expressions in X-Meta's repository provided centralised management of the coding, documentation and testing of the submission validation mechanism. This separated the creation of instance-specific rules from the development of the validation engine, so that validation expressions could be updated and tested quickly to take schema changes into account. In addition to reducing the time taken to establish the completed validation system, this separation of function improved the accuracy of validation code and reduced the likelihood of gaps in implementation and testing.

4.2. Defining data constraints

The key to the validation services was to control all validation functionality through code and properties held in the metadata database. For the Inland Revenue, the first validation level (schema validation) was implicit in the metadata which described the schemas. Thus, a datatype defined in the metadata repository might have a "pattern" property which would in due course be published as part of the schema which referenced that datatype. But the next level of validation, public co-constraint validation, could not be defined within an XSDL schema. For instance, a simple data constraint such as "if gender is male then statutory maternity pay must be zero" had to refer to both the gender and statutory maternity pay nodes within the structural definition. This was handled by the creation of a boolean expression, stored in the metadata repository and linked to the relevant structural node or datatype. This would be expressed in Java, XSLT or XPath - or any other language appropriate to the downstream implementation of the rule.
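The maternity pay constraint quoted above might look like this when written as a boolean expression in Python. This is a sketch under stated assumptions: the element names `Employee`, `Gender` and `StatutoryMaternityPay` are hypothetical and are not taken from the published schemas.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal


def smp_rule(employee: ET.Element) -> bool:
    """True when the constraint holds: a male employee must have zero SMP."""
    gender = employee.findtext("Gender")
    # A missing SMP element is treated as zero in this sketch.
    smp = Decimal(employee.findtext("StatutoryMaternityPay", default="0"))
    return gender != "male" or smp == 0
```

The rule reads two sibling nodes at once, which is exactly what a per-element datatype or pattern cannot express.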

The boolean expression would be a simplified version of the code which would be required to validate the specific co-constraint. For instance, the node references would be relative rather than absolute - in other words, the expression would assume that the context of the rule - the location in the structure from where the code was being run - was already known.

Secondly, there would be no attempt made to handle the navigation between one co-constraint and the next, as the validator moved through the list of validatable rules: this would be handled independently, by generic navigation code which would crawl over the structure of the instance document, firing validation rules as it reached the appropriate node.
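The division of labour between context-relative expressions and generic navigation can be sketched as a small tree walker. The names below are assumptions made for illustration: rules are keyed by a hypothetical element name, and each expression receives its context node and uses only relative paths.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Tuple

# A rule pairs its English description with a context-relative boolean expression.
Rule = Tuple[str, Callable[[ET.Element], bool]]


def walk_and_validate(root: ET.Element, rules_by_tag: Dict[str, List[Rule]]) -> List[str]:
    """Visit every element in document order, firing the rules attached to its tag."""
    failures = []
    for node in root.iter():
        for description, expression in rules_by_tag.get(node.tag, []):
            if not expression(node):  # node is the rule's context
                failures.append(description)
    return failures


# Hypothetical rule set; element names are invented for the example.
RULES: Dict[str, List[Rule]] = {
    "Employee": [
        ("SMP must be zero when gender is male",
         lambda e: e.findtext("Gender") != "male"
         or float(e.findtext("SMP", default="0")) == 0),
    ],
}
```

The expression never needs to know where `Employee` sits in the document, so the same rule keeps working if the surrounding structure changes.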

4.3. Validation generation

Within the metadata repository, then, a given data constraint (or "business rule") was tied to the appropriate part of the document structure and associated with documentation rubric and other information (such as whether to be fired for test or live services). As part of the process of generating the full definition of a given interface, these data constraints would be exported to the outputs API as a list of validation expressions. At the time of export each expression would be associated with its contextual information and from that information validation code would be generated. This validation code would provide for the navigation through the target instance document and the firing of each individual validation expression at the appropriate point in the traversal.

The validation code would then be passed to an architectural wrapper layer, where the validation object would be created in the format required by the relevant target architecture, as for instance, a Java class, COM object or DLL. The resulting validation objects could then be registered and made available as appropriate to the validation server for test or live services.
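A highly simplified version of this generation step is sketched below. Stored expressions, their context and their rubric are turned into the source of a validator class, which is then compiled and instantiated. The record layout, element names and the second rule are invented for the example, and a generated Python class stands in here for the Java class, COM object or DLL of the real system.

```python
import xml.etree.ElementTree as ET

# Hypothetical export from the repository: expression, context, rubric, target services.
CONSTRAINTS = [
    {"context": "Employee",
     "expression": "node.findtext('Gender') != 'male' or float(node.findtext('SMP', '0')) == 0",
     "rubric": "SMP must be zero for male employees",
     "services": {"test", "live"}},
    {"context": "Employee",
     "expression": "node.findtext('Surname') is not None",
     "rubric": "Surname must be present",
     "services": {"live"}},
]


def generate_validator(class_name, constraints, service):
    """Emit source for a class that navigates the document and fires each rule."""
    lines = [
        f"class {class_name}:",
        "    def validate(self, root):",
        "        failures = []",
    ]
    for c in constraints:
        if service not in c["services"]:
            continue  # rule is not fired for this service
        lines += [
            f"        for node in root.iter({c['context']!r}):",
            f"            if not ({c['expression']}):",
            f"                failures.append({c['rubric']!r})",
        ]
    lines.append("        return failures")
    return "\n".join(lines) + "\n"


def build(class_name, constraints, service):
    """Wrapper layer for this sketch: compile the generated source and instantiate it."""
    namespace = {}
    exec(generate_validator(class_name, constraints, service), namespace)
    return namespace[class_name]()
```

Calling `build("TestValidator", CONSTRAINTS, "test")` and `build("LiveValidator", CONSTRAINTS, "live")` yields two validators from the same stored rules, differing only in which rules are fired.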

4.4. Integrating with intermediaries

The Inland Revenue's main objective in developing their "Common Data Set" was to support the needs of the third party private sector software developers whose investment in the development and annual update of submission systems was critical to the success of Filing By Internet.

The repository-driven architecture allowed third parties to be supported without reference to developers and business experts within the Inland Revenue, and ensured that each interface definition was:

  • defined by a full range of schemas, documentation and validation services;

  • version consistent as to both documentation and validation service; and

  • delivered on time.

For a given software vendor, initial definitions of a service could be obtained from the public document repository, in the form of sample instance documents, PDF specifications and sets of schemas. Product development for the vendor could, if wished, incorporate authoritative validation DLLs from the same source, generated, like the documentation, from the same version of the underlying metadata. As development continued, beta systems could be tested against the Inland Revenue's Third Party Validation Service (TPVS), driven by validation objects generated from the metadata repository. The same validation objects could also be made available behind the Vendor Single Integrated Proving Service (VSIPS) mechanism for product accreditation, as well as the live validation service used on live data being submitted through the Government Gateway.

5. Latest developments

5.1. Live transformation

Because X-Meta stores structures for XML and non-XML formats, it can be used to describe the relationship between incoming (XML) data and target (back-office) formats. As the Inland Revenue implementation continues, we expect to have to generate code to convert incoming submissions into the formats mandated by the back-office systems, cutting out the need to write interface specifications and translate those into code. This code would be run by a processing engine in a load-balanced environment designed to handle multiple transactions per second.
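As an indication of what such generated conversion code might do, the sketch below applies a mapping from incoming XML elements to a fixed-width record. Everything in it is an assumption made for illustration: the element names, the field widths and the fixed-width target are invented, and the real back-office formats are not described in this paper.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# (relative path in the incoming XML, width of the target field); all hypothetical.
MAPPING: List[Tuple[str, int]] = [("Surname", 12), ("TaxCode", 6)]


def to_fixed_width(record: ET.Element, mapping: List[Tuple[str, int]]) -> str:
    """Convert one XML record into a space-padded fixed-width line."""
    parts = []
    for path, width in mapping:
        value = record.findtext(path, default="")
        if len(value) > width:
            raise ValueError(f"{path} exceeds {width} characters")
        parts.append(value.ljust(width))  # pad on the right with spaces
    return "".join(parts)
```

Since the mapping is data rather than code, it could in principle be held in the repository alongside the structures it connects and regenerated whenever either side changes.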

5.2. EDIFACT validation

I mention automated validation of EDIFACT submissions here only because of the light it shines, by contrast, on XML practices. In the XML world, machine-readable schemas and readily-available, open source, validating parsers provide a level of interoperability which other data formats can only envy. As we have seen above, even in the XML world there is a shortfall between the requirement for full validation of co-constraints and the ability of XSDL as currently constituted to validate more than structural and datatype-based parameters.

In the EDIFACT world, data specifications are encapsulated in the Message Implementation Guidelines (MIG) which are, effectively, negotiated between two transacting parties and expressed in human-readable - but not machine-readable - language. Compared with the ability to express XML data constraints - at all levels - as machine-readable metadata, this imposes tremendous costs both for maintenance of standards and updating of interface and processing code.

In what we believe is one of the first examples of XML technology leaking back to legacy technologies, we are now working with the Inland Revenue to add to their EDIFACT operations the ability to create metadata-based data definitions and automated validation mechanisms.

5.3. XBRL

The latest phase of the FBI project is developing a service to allow the online submission of Corporation Tax returns. A strategic decision has been taken by the Inland Revenue to use Extensible Business Reporting Language (XBRL) to encode these documents. XBRL taxonomies are defined within the metadata repository and a new generator plugin has been added to cope with the specific requirements of XBRL definitions. In due course, the project is expected to be expanded to support XBRL data constraints based on the Inland Revenue's internal business rules.

Validating XBRL, however, is considerably more complex than validating XML documents, and a full treatment would be outside the scope of this paper. At its simplest, XBRL's use of linkbases introduces an extra level of consistency requirements between taxonomy and instance. None of these types of validation are currently available in either the commercial or open source markets, and we are now developing a set of XBRL tools to provide generic XBRL validation. For further detail, please see http://xbrl.decisionsoft.com.

6. Summary

The Inland Revenue's implementation of a model-driven design and validation system has already delivered major benefits by rationalising the design process and drastically reducing the time taken to create documentation and validation services. As the implementation progresses, we expect further development to continue to reduce the amount of work required of the Inland Revenue's domain experts and simplify the continuing maintenance of the FBI service.

As a metadata repository for e-business systems development, the Inland Revenue's "Common Data Set" has three unique features that yield key business benefit. Firstly, the metadata is held at the highest level of abstraction. Most commercial tools only hold metadata in specific pre-defined formats, such as XSDL schema. This limits the use to which metadata can be put. In this case, however, the data and data structures are described in a simple and generic format which doesn't unduly limit the form of any input or output.

The second benefit of the system is that the repository tightly couples datatypes and structures with business process descriptions, transformation mappings, validation rules and code fragments. This significantly eases the task of maintaining and amending the metadata.

The third benefit is that client interfaces were designed with the non-XML specialist in mind. Building and maintaining the repository doesn't require any specialist XML knowledge. On the outputs side, whilst essential XML Schemas (or XBRL taxonomies) are generated, it is DecisionSoft's experience that both creators and consumers of the specifications need significant additional documentation - e.g. business rules, instance documents, descriptive samples, human-readable XML, spreadsheets and other helpful outputs - to get them operational quickly.

In working with the United Kingdom's Inland Revenue, DecisionSoft has developed a new approach to maintaining interface definitions and so ensuring semantic interoperability. The unique feature of the system implemented here is that it stores metadata at the highest level of abstraction in a central repository from which any XML asset, code engine or documentation set can be generated on demand.

Biography

Philip Allen founded DecisionSoft to develop commercial and Open Source software to support the integration of e-commerce systems. DecisionSoft's products are used to model data structures and business constraints for XML and EDI, automating schema maintenance and streamlining development and validation for clients with extensive e-commerce systems.

Philip actually started out trading financial derivatives for County Bank Limited (now NatWest Markets) and managing the US Dollar interest rate and currency swap desk for Swiss Bank Corporation, then based at the top of the World Trade Center in New York. In 1987 he joined a software startup that eventually listed on NASDAQ as CATS Software Inc. He spent ten years there working (inter alia) on product management, product development and release management, before becoming VP of Customer Services in Palo Alto, California.

In 1997 he moved to Oxford, England, where he founded DecisionSoft. DecisionSoft now counts the UK's Inland Revenue and other central government departments amongst its clients. He holds a First Class degree (MA) from Cambridge University.