
Managing Multiple XML Schemas in the UK’s Inland Revenue
by Philip Allen

Copyright © 2003 by DecisionSoft Limited

Revision History
Revision 1.2 (9 September 2004)
Table of Contents
Abstract
Background
Business requirements of the project
    Data and documentation requirements
    Process requirements
Key elements of the technical solution
    A common data set
    Documentation and services for third-party software developers
    Data validation processes
Development of an interface specification repository
    X-Meta - the metadata repository
    Defining data constraints
    Validation generation
    Integrating with intermediaries
Latest developments
    Live transformation
    EDIFACT validation
    XBRL
Summary

Abstract
In 1997, the UK Government established a policy that aimed to have all local and central government services web-enabled by 2005. For the Inland Revenue (IR), this meant offering submission mechanisms for organisations wishing to provide tax returns over the Internet. For the 400 or so software developers that provide payroll and financial applications to UK businesses, this meant developing XML interfaces to allow their products to link to the IR’s systems.

There were three primary challenges in the work to develop and subsequently maintain the IR’s online submissions capability: metadata management; publication of controlled documentation; and provision of validation services. The first of these was the result of the need to generate about thirty schemas describing the complex and arcane business rules that govern tax submissions. Knowledge of these rules was spread amongst numerous geographically-separate business experts. This - and the fact that the specifications had to be followed by 400 software companies - meant that it was necessary to construct and maintain an authoritative repository of metadata.
The second challenge arose primarily out of a lack of XML expertise amongst both the IR’s business experts and the technical people in the software companies. So, whilst the creation and publication of XML Schema (and XBRL taxonomies) was essential, both creators and consumers of the specifications needed much more documentation, e.g., business rules, instance documents, descriptive samples, human-readable XML, spreadsheets and other helpful outputs to get them operational quickly. And all those outputs needed to be managed and communicated across the community as the standards themselves went through multiple revisions. To achieve this, a series of APIs were written to generate any requisite output from the metadata repository. These outputs were made available in a controlled fashion via a web page.
The third challenge was to develop online services that could validate data submitted to the IR. As XML Schema can only capture the simplest cross-field validation rules, it was also necessary to maintain code fragments alongside the metadata. A special generator was developed to build and deploy validation services. These were used extensively by the software developers who needed to test the new IR interfaces in their products.

In working with the Inland Revenue, DecisionSoft developed a new approach to maintaining interface definitions and so ensuring semantic interoperability. This was achieved by storing metadata at the highest level of abstraction in a central maintainable metadata repository from which any XML asset, code engine or documentation set could be generated on demand. Understanding the limitations of XML Schema as a source of metadata allowed DecisionSoft to reduce technical complexity and make management of XML assets less cumbersome.

Background
The United Kingdom’s Inland Revenue is responsible, under the overall direction of UK Treasury ministers, for the efficient administration of income tax, tax credits, corporation tax, capital gains tax, petroleum revenue tax, inheritance tax, national insurance contributions and stamp duties. It has taken the lead in the Government’s initiative to provide all UK citizens with "joined-up" Internet access to all government services.

The Inland Revenue maintains many forms for use in communicating with taxpayers. These support the full range of the Inland Revenue’s business, including maintenance of the National Insurance system, and calculation and collection of corporate and personal taxes. In 1999, the UK Government committed to making all the Inland Revenue’s forms available online, through the World Wide Web, by 2005. Of these services, one of the most complex was the end-of-year employer payroll submission, a series of forms known as P35, P14 and P38A. Online submission of these forms was expected to originate from Web-based HTML forms (for small employers) and third-party payroll systems (for medium and large employers). Submissions were to be made as XML via a central government gateway, then passed through Revenue-based validation and translation layers and into two of the Government’s largest systems - the Computerised Operation of Pay As You Earn (COP), which handles payroll taxes, and NIRS/2, which maintains National Insurance records. The whole initiative was to be known as the Filing by Internet (FBI) project.

Some of the Inland Revenue’s submissions processes had already been automated, with a number of returns already being accepted as magnetic media (tape or diskette submissions) or EDI (EDIFACT and related submissions). However, the change from specialist serial data interfaces to the more easily accessible, nested, XML structures meant that the work of converting paper forms into electronic format had to be approached afresh for FBI.

Business requirements of the project

Data and documentation requirements


The FBI work had to be based on the latest paper forms for conventional submission. Developing XML representations of these documents required the generation of many XML Schemas, some of which were to be interdependent. However, these were only a small part of the documentation that was needed for the project.

Historically, one of the biggest burdens on the Inland Revenue has been the transcription of data from paper forms into the organisation’s back-office systems. In recent years, it has encouraged the development of third-party software systems to automate this process, first for magnetic media and EDI submissions, and now for the FBI project. For FBI, the Inland Revenue wanted to provide a full development package for software vendors, consisting of schemas, documentation and data samples, and crucially, an online test service for the new software systems.

Process requirements
The FBI development process was to be a complex one requiring:

• collation of business rules for handling data;
• establishment of expectations for data format in the back office systems;
• design, generation and publication of schemas with ancillary documentation and examples;
• review of schemas and business rules;
• creation of validation code;
• definition of test data sets;
• building of an online validation engine; and
• generation of FBI/back-office interface code.

These processes required the involvement of a wide variety of geographically separate parties, many of them outside the Inland Revenue.
For example, business rules for validation of the submitted data had to be made explicit. The rules had to be drawn from a number of sources: systems staff familiar with EDI submissions; domain experts in the Inland Revenue and the National Insurance Contributions Office (NICO); the Electronic Business Unit (EBU) staff responsible for implementing Filing By Internet (FBI); and legal staff charged with updating tax collection practices. Once the business rules had been determined, XML structures could be designed to reflect business requirements. At this stage, a package consisting of XML Schema, validation requirements and sample data could be produced for third-party vendors as initial specifications for their submission applications. The schema package would be subject to review externally by vendors, and internally by domain and legal experts. The review process might be repeated several times while submission and processing mechanisms were clarified. Once a final schema had been agreed, a range of systems would be built:

• extensions to existing third-party payroll systems to provide for online submissions;
• an online validation service for the testing of third-party submission applications;
• a validation mechanism for receipt of data submitted via the government gateway; and
• a translation mechanism for converting incoming XML into formats acceptable to the Inland Revenue’s back office systems.

The timetable for the FBI project was constrained by the need to fit in with existing form submission dates. In the time between the March 2000 Budget, which effectively laid down the requirements for the 2001 employer submissions, and the filing dates in Spring 2001, the FBI mechanisms had to be specified and implemented in full. Given the complexity of the development process, meeting this deadline necessitated a radical new design approach.

Key elements of the technical solution
The data, documentation and process requirements of the FBI project required the development of three core capabilities:

1. metadata management;
2. publication of controlled documentation; and
3. provision of validation services.

A common data set


The Inland Revenue realised that the extension of their submission mechanisms from manual to magnetic media, to EDI, and now to XML, would put great strains on systems and business departments. A cumbersome review mechanism was used to bring the Inland Revenue’s "corporate memory" to bear on each new structure and change. This was clearly unsustainable in the long run. So DecisionSoft decided to centralise forms data definitions in a "common data set", or metadata repository, which would provide a single source of format information and business rules for a given data element. A piece of data, such as a personal tax code, might be used in a number of different forms, each with variant formats and differing validation rules, with the potential for different versions for each tax year. In addition, each piece of data could have different formats in its paper, magnetic media, EDIFACT, and XML versions. Rather than attempt to impose uniformity on a series of complicated, mission-critical, systems, the Common Data Set would centralise the Inland Revenue’s "corporate memory" and provide visibility for inconsistencies between implementations and uses of the same underlying datatypes.
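
To make the idea concrete, the sketch below models one way a common-data-set entry might be structured: a single underlying datatype carrying per-form, per-year and per-format variants. The class and field names are illustrative assumptions, not X-Meta’s actual model.

    import java.util.regex.Pattern;

    /**
     * Hypothetical common-data-set entry: one variant of an underlying
     * datatype (e.g. "PersonalTaxCode"), scoped to a form, tax year and
     * physical format. All names here are illustrative.
     */
    public class DatatypeVariant {
        String datatypeName;  // e.g. "PersonalTaxCode"
        String form;          // e.g. "P14"
        int taxYear;          // e.g. 2001
        String format;        // "paper", "magnetic-media", "EDIFACT" or "XML"
        Pattern lexicalRule;  // format-specific lexical constraint
        String businessRule;  // human-readable description of the constraint

        /** Checks a raw value against this variant's lexical rule. */
        boolean accepts(String value) {
            return lexicalRule.matcher(value).matches();
        }
    }

Holding every variant against one named datatype is what makes the inconsistencies described above visible: two forms using the same datatype with different lexical rules show up as two entries under one name.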

Documentation and services for third-party software developers
Given the complexity of metadata and the number of disparate parties involved in the development process, a controlled publication process was essential. All documentation had to be managed and communicated across the community as the standards themselves went through multiple revisions. To achieve this a series of plugins were written to generate any requisite output from the metadata repository. These outputs were made available in a controlled fashion via a web page.

Data validation processes


The Inland Revenue had two main requirements for data validation. All data passed to the IR from the Government Gateway had to be validated fully before being passed to its back-office systems; and most of the validation mechanisms had to be made available as public services to support third-party developers in testing their submission software.
In order of increasing complexity there are, effectively, five different stages
of validating an XML document. These are:

1. Well-formed XML

2. Structural validity (per XSDL Part 1 or DTD)

3. Data-type validity (XSDL Part 2)

4. Co-constraint validity ("cross-field rules")

5. Document external validation ("name found in external database")

Data being submitted to the Inland Revenue clearly had to pass all these tests. As a starting point, incoming data was expected to be schema-compliant, that is, each data element had to meet the requirements specified in the XML Schema (and therefore pass the first three stages above). At the next level, internal business rules specified the relationship between data elements. These rules might include consistency checking for sub-total values, or rules which require certain elements to have specified values depending on the value of other elements. Lastly, submissions which were internally consistent had to be validated against actual data already held in the IR’s systems.
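
As an illustration of the first three stages, the following minimal Java sketch parses a submission (well-formedness) and then validates it against a schema (structural and datatype validity). The file names are invented for the example; stages four and five required the separately generated code described below.

    import java.io.File;
    import javax.xml.XMLConstants;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.stream.StreamSource;
    import javax.xml.validation.Schema;
    import javax.xml.validation.SchemaFactory;
    import javax.xml.validation.Validator;

    public class StagedValidation {
        public static void main(String[] args) throws Exception {
            File instance = new File("p35-return.xml"); // illustrative file names
            File schemaFile = new File("p35.xsd");

            // Stage 1: well-formedness - parsing fails if the XML is malformed.
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            dbf.newDocumentBuilder().parse(instance);

            // Stages 2 and 3: structural and datatype validity against the XSD.
            SchemaFactory sf =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = sf.newSchema(schemaFile);
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(instance)); // throws on failure

            // Stages 4 and 5 (co-constraints, external look-ups) cannot be
            // expressed in XSDL and are handled by separately generated code.
            System.out.println("Instance passed stages 1-3.");
        }
    }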
For the Inland Revenue, the validation was further complicated by the need
to support test data (which would not be expected to pass stage five) and
the requirement to include the possibility of additional rules at stage four
for live document validation (which would not be applied to test data).

For the FBI project, the Inland Revenue decided to implement a sequence of five different validation process levels, each one dependent in turn on passing the previous level. They were:
1. Passes schema validation [implies Stages 1, 2, 3]
2. Passes public co-constraint validation [implies Stage 4]
3. Successfully submits through Government Gateway
4. Passes remote database validation [implies Stage 5]
5. Passes private co-constraint validation [implies Stage 4]

For the public developer test service, the first two levels were provided over
the Web, with further access to the same service through the Government
Gateway to allow developers to reach level three.

[Figure: validation path for submissions. Test submissions pass through TPVS (levels 1 and 2) and the Government Gateway (level 3); live submissions then undergo live validation (levels 4 and 5) and internal validation before reaching COP and NIRS/2.]

For the validation of live submissions, all five levels were tested in turn,
with failure at any level resulting in immediate rejection of the submission.

Development of an interface specification repository

X-Meta - the metadata repository


DecisionSoft’s X-Meta product suite was used to provide a Common Data
Set for the Inland Revenue. It consisted of:

• a component metadata database with input and output APIs

• a client application for editing metadata

• a set of output generator plugins

• web-based repository and validation services

[Figure: the X-Meta suite. Component metadata feeds processing plugins and a documentation generator; outputs flow to the X-Index repository and to xVTS validation and processing servers.]

The metadata database provided a single point of input for the definition of datatypes and business rules for - potentially - all forms used by the Inland Revenue. Individual datatypes were versioned by format, tax year, and form. These could be compiled into data structures for each of the formats commonly used by the Inland Revenue (EDIFACT, XML, GFF). The generators matched the datatype definitions with validation instructions, documentation and usage information. Formal structural descriptions would be generated automatically in the appropriate format (XSDL or XDR schemas for XML, MIG documents for EDI).

Over the period of the FBI project a number of different "stakeholders" have been identified, each with different documentation requirements. Business process owners within the Inland Revenue needed a formal description of business rules; government interoperability requirements mandated a full set of XSDL schemas; external developers required fully indexed specifications; internal developers needed extensive test data. Over time, new generators have been added to provide a mutually-consistent range of outputs (a sketch of one possible plugin contract follows the list) including:

• comparative datatype documentation in, e.g., spreadsheets;
• automatically-generated schemas;
• schema-valid sample data messages;
• descriptive sample data messages;
• test data sets;
• submission validation code;
• interface translation code; and
• interface specifications.
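
Each of these outputs lends itself to a plugin with a small, uniform contract. The Java sketch below is an assumption about what such a contract might look like; the names are invented, and the real X-Meta APIs are not published here.

    /** A versioned, read-only view of the repository (illustrative). */
    interface MetadataSnapshot { }

    /** Hypothetical contract shared by every output generator plugin. */
    public interface GeneratorPlugin {
        String outputType();                        // e.g. "xsd", "sample-instance", "spreadsheet"
        byte[] generate(MetadataSnapshot metadata); // emit one artifact from one snapshot
    }

Keeping every generator behind one contract is what lets schemas, documentation and validation code remain version-consistent: all are derived from the same snapshot of the metadata.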

This architecture was developed in order to speed up the creation of schemas and other documentation, by replacing the conventional process of manually-updated Microsoft Word files and repeated proof-reading. Automating the production of documentation allowed the design-review cycle to be dramatically shortened while still including external stakeholder forums.

By adding plugins for generating validation code and regression test data, deployment of test services could be properly synchronised with publication of documentation, and the deployment cycle itself cut from a matter of months to less than an hour. This was critical for the Inland Revenue, whose legislation and tax-year driven annual cycle effectively guarantees that all its submission mechanisms will undergo revision at least once a year.

[Figure: design process and automated publication. Re-usable components, datatypes, elements and schemas pass through internal and forum review into the X-Meta central data model; automated publication then produces schemas and non-technical documentation, validation objects, regression testing and TPVS deployment on a one-hour cycle.]

The metadata repository was thus used to store not only the full structural definition of each schema and instance, but also the validation expressions required to validate instances at each level, together with English descriptions of the validation rules. Storing validation expressions in X-Meta’s repository provided centralised management of the coding, documentation and testing of the submission validation mechanism. This separated the creation of instance-specific rules from the development of the validation engine, so that validation expressions could be updated and tested quickly to take schema changes into account. In addition to reducing the time taken to establish the completed validation system, this separation of function improved the accuracy of validation code and reduced the likelihood of gaps in implementation and testing.

Defining data constraints


The key to the validation services was to control all validation functionality through code and properties held in the metadata database. For the Inland Revenue, the first validation level (schema validation) was implicit in the metadata which described the schemas. Thus, a datatype defined in the metadata repository might have a "pattern" property which would in due course be published as part of the schema which referenced that datatype. But the next level of validation, public co-constraint validation, could not be defined within an XSDL schema. For instance, a simple data constraint such as "if gender is male then statutory maternity pay must be zero" had to refer to both the gender and statutory maternity pay nodes within the structural definition. This was handled by the creation of a boolean expression, stored in the metadata repository and linked to the relevant structural node or datatype. This would be expressed in Java, XSLT or XPath - or any other language appropriate to the downstream implementation of the rule.
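
As an illustration, the gender rule above might be stored as a relative XPath boolean and compiled for execution as follows. The element names are invented for the example, not the IR’s actual tags.

    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathExpression;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Node;

    public class CoConstraintRule {
        // Relative expression, evaluated against a context node supplied by
        // the navigator: "if Gender is 'M', StatutoryMaternityPay must be zero".
        private static final String RULE =
            "not(Gender = 'M') or number(StatutoryMaternityPay) = 0";

        private final XPathExpression compiled;

        public CoConstraintRule() throws Exception {
            XPath xpath = XPathFactory.newInstance().newXPath();
            compiled = xpath.compile(RULE);
        }

        /** Returns true if the constraint holds at the given context node. */
        public boolean check(Node context) throws Exception {
            return (Boolean) compiled.evaluate(context, XPathConstants.BOOLEAN);
        }
    }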

The boolean expression would be a simplified version of the code which would be required to validate the specific co-constraint. For instance, the node references would be relative rather than absolute - in other words, the expression would assume that the context of the rule - the location in the structure from where the code was being run - was already known. Secondly, there would be no attempt made to handle the navigation between one co-constraint and the next, as the validator moved through the list of validatable rules: this would be handled independently, by generic navigation code which would crawl over the structure of the instance document, firing validation rules as it reached the appropriate node.
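
A minimal sketch of such navigation code, assuming rules are keyed by the element name they are anchored to (CoConstraintRule is the class sketched above; the keying scheme is an assumption):

    import java.util.List;
    import java.util.Map;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;

    public class RuleNavigator {
        // Rules keyed by anchoring element name (an illustrative scheme).
        private final Map<String, List<CoConstraintRule>> rulesByElement;

        public RuleNavigator(Map<String, List<CoConstraintRule>> rulesByElement) {
            this.rulesByElement = rulesByElement;
        }

        /** Depth-first crawl, firing any rules anchored at each element. */
        public void crawl(Node node) throws Exception {
            if (node instanceof Element) {
                List<CoConstraintRule> rules = rulesByElement.get(node.getNodeName());
                if (rules != null) {
                    for (CoConstraintRule rule : rules) {
                        if (!rule.check(node)) {
                            System.err.println("Rule failed at <" + node.getNodeName() + ">");
                        }
                    }
                }
            }
            for (Node child = node.getFirstChild(); child != null;
                    child = child.getNextSibling()) {
                crawl(child);
            }
        }
    }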

Validation generation
Within the metadata repository, then, a given data constraint (or "business rule") was tied to the appropriate part of the document structure and associated with documentation rubric and other information (such as whether to be fired for test or live services). As part of the process of generating the full definition of a given interface, these data constraints would be exported to the outputs API as a list of validation expressions. At the time of export each expression would be associated with its contextual information and from that information validation code would be generated. This validation code would provide for the navigation through the target instance document and the firing of each individual validation expression at the appropriate point in the traversal.

The validation code would then be passed to an architectural wrapper layer, where the validation object would be created in the format required by the relevant target architecture, as for instance, a Java class, COM object or DLL. The resulting validation objects could then be registered and made available as appropriate to the validation server for test or live services.
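
In Java terms, each generated validation object might satisfy a contract along the following lines; this interface is an assumption for illustration, with the COM and DLL wrappers exposing an equivalent operation on their own architectures.

    import java.util.List;
    import org.w3c.dom.Document;

    /** Hypothetical contract for a generated validation object. */
    public interface ValidationObject {
        String interfaceId();                      // which interface/version this object validates
        List<String> validate(Document instance);  // an empty list means "valid"
    }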

[Figure: validation generation. Lists of business rule constraints and validation expressions flow from the X-Meta central data model through an architectural wrapper layer, producing EJBs, Java classes, COM objects, DLLs and Web services for the live service, TPVS and intermediary systems.]

Integrating with intermediaries


The Inland Revenue’s main objective in developing their "Common Data Set" was to support the needs of the third party private sector software developers whose investment in the development and annual update of submission systems was critical to the success of Filing By Internet.

The repository-driven architecture allowed third parties to be supported without reference to developers and business experts within the Inland Revenue, and ensured that each interface definition was:

• defined by a full range of schemas, documentation and validation services;
• version consistent as to both documentation and validation service; and
• delivered on time.

For a given software vendor, initial definitions of a service could be obtained from the public document repository, in the form of sample instance documents, PDF specifications and sets of schemas. Product development for the vendor could, if wished, incorporate authoritative validation DLLs drawn from the same source and drawn, like the documentation, from the same version of the underlying metadata. As development continued, beta systems could be tested against the Inland Revenue’s Third Party Validation Service (TPVS), driven by validation objects generated from the metadata repository. The same validation objects could also be made available behind the Vendor Single Integrated Proving Service (VSIPS) mechanism for product accreditation, as well as the live validation service used on live data being submitted through the Government Gateway.

[Figure: the X-Meta central data model feeds the X-Index public repository (documents, schemas and samples), architecture-specific validation objects for TPVS, VSIPS and the live validation service behind the Government Gateway, and the IR back-office systems. Vendors or intermediaries progress from specification through product development, regression testing and product accreditation to customers using the product.]

Latest developments

Live transformation
Because X-Meta stores structures for XML and non-XML formats, it can be used to describe the relationship between incoming (XML) data and target (back-office) formats. As the Inland Revenue implementation continues, we expect to have to generate code to convert incoming submissions into the formats mandated by the back-office systems, cutting out the need to write interface specifications and translate those into code. This code would be run by a processing engine in a load-balanced environment designed to handle multiple transactions per second.
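
One plausible shape for such generated conversion code, assuming the mapping is emitted as an XSLT stylesheet (the file names are invented for the example):

    import java.io.File;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class LiveTransform {
        public static void main(String[] args) throws Exception {
            // Stylesheet generated from the repository's structural mappings.
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new File("p35-to-cop.xsl")));
            // Convert one incoming submission into a back-office format.
            t.transform(new StreamSource(new File("p35-return.xml")),
                        new StreamResult(new File("p35-return.cop")));
        }
    }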

EDIFACT validation
I mention automated validation of EDIFACT submissions here only because of the light it shines, by contrast, on XML practices. In the XML world, machine-readable schemas and readily-available, open source, validating parsers provide a level of interoperability which other data formats can only envy. As we have seen above, even in the XML world there is a shortfall between the requirement for full validation of co-constraints and the ability of XSDL as currently constituted to validate more than structural and datatype-based parameters.

In the EDIFACT world, data specifications are encapsulated in the Message Implementation Guidelines (MIG) which are, effectively, negotiated between two transacting parties and expressed in human-readable - but not machine-readable - language. Compared with the ability to express XML data constraints - at all levels - as machine-readable metadata, this imposes tremendous costs both for maintenance of standards and updating of interface and processing code.

In what we believe is one of the first examples of XML technology leaking back to legacy technologies, we are now working with the Inland Revenue to add to their EDIFACT operations the ability to create metadata-based data definitions and automated validation mechanisms.

XBRL
The latest phase of the FBI project is developing a service to allow the online submission of Corporation Tax returns. A strategic decision has been taken by the Inland Revenue to use Extensible Business Reporting Language (XBRL) to encode these documents. XBRL taxonomies are defined within the metadata repository and a new generator plugin has been added to cope with the specific requirements of XBRL definitions. In due course, the project is expected to be expanded to support XBRL data constraints based on the Inland Revenue’s internal business rules.
Validating XBRL, however, is considerably more complex than validating XML documents, and a full treatment would be outside the scope of this paper. At its simplest, XBRL’s use of linkbases introduces an extra level of consistency requirements between taxonomy and instance. None of these types of validation are currently available in either the commercial or open source markets, and we are now developing a set of XBRL tools to provide generic XBRL validation. For further detail, please see http://xbrl.decisionsoft.com.


Summary
The Inland Revenue’s implementation of a model-driven design and validation system has already delivered major benefits by rationalising the design process and drastically reducing the time taken to create documentation and validation services. As the implementation progresses, we expect further development to continue to reduce the amount of work required of the Inland Revenue’s domain experts and simplify the continuing maintenance of the FBI service.
As a metadata repository for e-business systems development, the Inland Revenue’s "Common Data Set" has three unique features that yield key business benefits. Firstly, the metadata is held at the highest level of abstraction. Most commercial tools only hold metadata in specific pre-defined formats, such as XSDL schema. This limits the use to which metadata can be put. In this case, however, the data and data structures are described in a simple and generic format which doesn’t unduly limit the form of any input or output.
The second benefit of the system is that the repository tightly couples
datatypes and structures with business process descriptions, transformation
mappings, validation rules and code fragments. This significantly eases the
task of maintaining and amending the metadata.

The third benefit is that client interfaces were designed with the non-XML specialist in mind. Building and maintaining the repository doesn’t require any specialist XML knowledge. On the outputs side, whilst essential XML Schemas (or XBRL taxonomies) are generated, it is DecisionSoft’s experience that both creators and consumers of the specifications need significant additional documentation - e.g. business rules, instance documents, descriptive samples, human-readable XML, spreadsheets and other helpful outputs - to get them operational quickly.
In working with the United Kingdom’s Inland Revenue, DecisionSoft has developed a new approach to maintaining interface definitions and so ensuring semantic interoperability. The unique feature of the system implemented here is that it stores metadata at the highest level of abstraction in a central repository from which any XML asset, code engine or documentation set can be generated on demand.

