Philip Allen
Managing Multiple XML Schemas in the UK’s Inland Revenue
by Philip Allen
Revision History
$Revision: 1.2 $ $Date: 2004/09/09 10:48:00 $ Revised by: $Author: plega $
Table of Contents
Abstract
Background
Business requirements of the project
    Data and documentation requirements
    Process requirements
Key elements of the technical solution
    A common data set
    Documentation and services for third-party software developers
    Data validation processes
Development of an interface specification repository
    X-Meta - the metadata repository
    Defining data constraints
    Validation generation
    Integrating with intermediaries
Latest developments
    Live transformation
    EDIFACT validation
    XBRL
Summary
Abstract
In 1997, the UK Government established a policy that aimed to have all
local and central government services web-enabled by 2005. For the Inland
Revenue (IR), this meant offering submission mechanisms for organisations
wishing to provide tax returns over the Internet. For the 400 or so
software developers that provide payroll and financial applications to UK
businesses, this meant developing XML interfaces to allow their products
to link to the IR’s systems.
There were three primary challenges in the work to develop and subsequently
maintain the IR’s online submissions capability: metadata management;
publication of controlled documentation; and provision of validation
services. The first of these was the result of the need to generate
about thirty schemas describing the complex and arcane business rules that
govern tax submissions. Knowledge of these rules was spread amongst numerous
geographically separate business experts. This - and the fact that
the specifications had to be followed by 400 software companies - meant
that it was necessary to construct and maintain an authoritative repository
of metadata.
The second challenge arose primarily out of a lack of XML expertise
amongst both the IR’s business experts and the technical people in the
software companies. So, whilst the creation and publication of XML Schema
(and XBRL taxonomies) was essential, both creators and consumers of
the specifications needed much more documentation, e.g., business rules,
instance documents, descriptive samples, human-readable XML, spreadsheets
and other helpful outputs to get them operational quickly. And all
those outputs needed to be managed and communicated across the community
as the standards themselves went through multiple revisions. To
achieve this, a series of APIs was written to generate any requisite
output from the metadata repository. These outputs were made available in a
controlled fashion via a web page.
The third challenge was to develop online services that could validate data
submitted to the IR. As XML Schema can only capture the simplest cross-
field validation rules, it was also necessary to maintain code fragments
alongside the metadata. A special generator was developed to build and
deploy validation services. These were used extensively by the software
developers who needed to test the new IR interfaces in their products.
In working with the Inland Revenue, DecisionSoft developed a new approach
to maintaining interface definitions and so ensuring semantic
interoperability. This was achieved by storing metadata at the highest level
of abstraction in a central maintainable metadata repository from which
any XML asset, code engine or documentation set could be generated on
demand. Understanding the limitations of XML Schema as a source of
metadata allowed DecisionSoft to reduce technical complexity and make
management of XML assets less cumbersome.
Background
The United Kingdom’s Inland Revenue is responsible, under the overall
direction of UK Treasury ministers, for the efficient administration of
income tax, tax credits, corporation tax, capital gains tax, petroleum revenue
tax, inheritance tax, national insurance contributions and stamp duties. It
has taken the lead in the Government’s initiative to provide all UK citizens
with "joined-up" Internet access to all government services.
The Inland Revenue maintains many forms for use in communicating with
taxpayers. These support the full range of the Inland Revenue’s business,
including maintenance of the National Insurance system, and calculation
and collection of corporate and personal taxes. In 1999, the UK Government
committed to making all the Inland Revenue’s forms available online,
through the World Wide Web, by 2005. Of these services, one of the
most complex was the end-of-year employer payroll submission, a series
of forms known as P35, P14 and P38A. Online submission of these forms
was expected to originate from Web-based HTML forms (for small employers)
and third-party payroll systems (for medium and large employers).
Submissions were to be made as XML via a central government gateway,
then passed through Revenue-based validation and translation layers
and into two of the Government’s largest systems - the Computerised
Operation of Pay As You Earn (COP), which handles payroll taxes, and NIRS/2,
which maintains National Insurance records. The whole initiative was to
be known as the Filing by Internet (FBI) project.
Some of the Inland Revenue’s submissions processes had already been
automated, with a number of returns already being accepted as magnetic
media (tape or diskette submissions) or EDI (EDIFACT and related
submissions). However, the change from specialist serial data interfaces to
the more easily accessible, nested, XML structures meant that the work of
converting paper forms into electronic format had to be approached afresh
for FBI.
Business requirements of the project
Historically, one of the biggest burdens on the Inland Revenue has been the
transcription of data from paper forms into the organisation’s back-office
systems. In recent years, it has encouraged the development of third-party
software systems to automate this process, first for magnetic media and
EDI submissions, and now for the FBI project. For FBI, the Inland Revenue
wanted to provide a full development package for software vendors,
consisting of schemas, documentation and data samples, and crucially, an
online test service for the new software systems.
Process requirements
The FBI development process was to be a complex one requiring:
The timetable for the FBI project was constrained by the need to fit in with
existing form submission dates. In the time between the March 2000 Budget,
which effectively laid down the requirements for the 2001 employer
submissions, and the filing dates in Spring 2001, the FBI mechanisms had
to be specified and implemented in full. Given the complexity of the
development process, meeting this deadline necessitated a radical new design
approach.
Key elements of the technical solution
The data, documentation and process requirements of the FBI project
required the development of three core capabilities:
1. metadata management;
1. Well-formed XML
Data being submitted to the Inland Revenue clearly had to pass all
these tests. As a starting point, incoming data was expected to be
schema-compliant, that is, each data element had to meet the requirements
specified in the XML Schema (and therefore pass the first three stages
above). At the next level, internal business rules specified the relationship
between data elements. These rules might include consistency checking
for sub-total values, or rules which require certain elements to have
specified values depending on the value of other elements. Lastly,
submissions which were internally consistent had to be validated against
actual data already held in the IR’s systems.
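
A rule of the second kind, relating several data elements to one another, might look like the following minimal sketch. The field names (`tax`, `national_insurance`, `total_due`, `final_return`, `cessation_date`) are hypothetical and are not taken from any Inland Revenue form; the sketch only illustrates a sub-total consistency check and a conditional-value check of the sort described above.

```python
from decimal import Decimal


def check_business_rules(record: dict) -> list:
    """Return a list of rule failures for one submission record.

    All field names are illustrative assumptions, not real form fields.
    """
    errors = []

    # Sub-total consistency: the declared total must equal the sum of its parts.
    parts = Decimal(record["tax"]) + Decimal(record["national_insurance"])
    if parts != Decimal(record["total_due"]):
        errors.append("total_due does not equal tax + national_insurance")

    # Conditional requirement: the value of one element constrains another.
    if record.get("final_return") == "yes" and not record.get("cessation_date"):
        errors.append("cessation_date is required when final_return is 'yes'")

    return errors


consistent = {"tax": "100.50", "national_insurance": "20.25",
              "total_due": "120.75", "final_return": "no"}
inconsistent = {"tax": "100.50", "national_insurance": "20.25",
                "total_due": "120.00", "final_return": "yes"}
```

Each element in `inconsistent` could be individually well-typed and so pass a schema check, yet the record as a whole breaks both rules.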
For the Inland Revenue, the validation was further complicated by the need
to support test data (which would not be expected to pass stage five) and
the requirement to include the possibility of additional rules at stage four
for live document validation (which would not be applied to test data).
For the FBI project, the Inland Revenue decided on implementing a
sequence of five different validation process levels, each one dependent in
turn on passing the previous level. They were:
1. Passes schema validation [implies Stages 1, 2, 3]
2. Passes public co-constraint validation [implies Stage 4]
For the public developer test service, the first two levels were provided over
the Web, with further access to the same service through the Government
Gateway to allow developers to reach level three.
[Figure: validation flow for test and live submissions. Recoverable labels: test submissions (initial, final, live); Levels 1, 2; live validation; Levels 4, 5; internal validation; COP, NIRS/2.]
For the validation of live submissions, all five levels were tested in turn,
with failure at any level resulting in immediate rejection of the submission.
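
The "each level depends on passing the previous one" behaviour can be sketched as a short pipeline that stops at the first failure. This is an illustrative assumption about control flow only: the check functions are stand-ins rather than real validators, and the levels beyond the two named above are left as placeholders.

```python
from typing import Callable, List, Optional, Tuple

# (level number, label, check). The checks used below are stand-ins.
Level = Tuple[int, str, Callable[[str], bool]]


def first_failing_level(submission: str, levels: List[Level],
                        up_to: int) -> Optional[int]:
    """Run levels 1..up_to in order and stop at the first failure.

    Returns the failing level number, or None if every level run has passed.
    """
    for number, _label, check in sorted(levels, key=lambda lv: lv[0]):
        if number > up_to:
            break
        if not check(submission):
            return number  # immediate rejection: later levels are never run
    return None


calls = []  # records which levels were actually executed


def make_check(number: int, result: bool) -> Callable[[str], bool]:
    """Build a stand-in check that logs its level and returns a fixed result."""
    def check(_submission: str) -> bool:
        calls.append(number)
        return result
    return check


levels = [
    (1, "schema validation", make_check(1, True)),
    (2, "public co-constraint validation", make_check(2, False)),
    (3, "placeholder level 3", make_check(3, True)),
    (4, "placeholder level 4", make_check(4, True)),
    (5, "placeholder level 5", make_check(5, True)),
]

live_result = first_failing_level("<Return/>", levels, up_to=5)
```

Under these assumptions a public test service would call the same function with `up_to=2`, and a live service with `up_to=5`.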
Development of an interface specification repository
[Figure: X-Meta architecture. Recoverable labels: X-Meta, component metadata, processing plugins, X-Index, xVTS.]
The metadata database provided a single point of input for the definition
of datatypes and business rules for - potentially - all forms used by the
Inland Revenue. Individual datatypes were versioned by format, tax year,
and form. These could be compiled into data structures for each of the
formats commonly used by the Inland Revenue (EDIFACT, XML, GFF).
The generators matched the datatype definitions with validation
instructions, documentation and usage information. Formal structural
descriptions would be generated automatically in the appropriate format (XSDL
or XDR schemas for XML, MIG documents for EDI).
• interface specifications.
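
To make the idea of compiling one datatype definition into several outputs concrete, here is a minimal sketch. The `Datatype` fields, the example datatype and its pattern are hypothetical and do not reflect X-Meta's actual data model; only the XML Schema `simpleType`, `restriction` and `pattern` constructs are standard.

```python
from dataclasses import dataclass
from xml.sax.saxutils import quoteattr


@dataclass(frozen=True)
class Datatype:
    name: str       # datatype name
    base: str       # XML Schema built-in type, e.g. "string"
    pattern: str    # regular expression constraining the value
    rubric: str     # plain-English description for documentation
    tax_year: str   # version keys, following the text: tax year and form
    form: str


def to_xsd(dt: Datatype) -> str:
    """Generate an XML Schema simpleType fragment from the definition."""
    return (
        f"<xs:simpleType name={quoteattr(dt.name)}>\n"
        f"  <xs:restriction base={quoteattr('xs:' + dt.base)}>\n"
        f"    <xs:pattern value={quoteattr(dt.pattern)}/>\n"
        f"  </xs:restriction>\n"
        f"</xs:simpleType>"
    )


def to_documentation(dt: Datatype) -> str:
    """Generate a one-line human-readable description from the same definition."""
    return f"{dt.name} ({dt.form}, {dt.tax_year}): {dt.rubric}"


# A toy repository keyed by (name, tax year, form); all values are invented.
repository = {}


def register(dt: Datatype) -> None:
    repository[(dt.name, dt.tax_year, dt.form)] = dt


register(Datatype("EmployeeRef", "string", "[A-Z]{2}[0-9]{6}",
                  "Two letters followed by six digits.", "2000-01", "P14"))

entry = repository[("EmployeeRef", "2000-01", "P14")]
xsd = to_xsd(entry)
doc = to_documentation(entry)
```

Because both outputs are derived from the same entry, a change to the pattern or rubric reaches the schema and the documentation together.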
By adding plugins for generating validation code and regression test data,
deployment of test services could be properly synchronised with publication
of documentation, and the deployment cycle itself cut from a matter
of months to less than an hour. This was critical for the Inland Revenue,
whose legislation and tax-year driven annual cycle effectively guarantees
that all its submission mechanisms will undergo revision at least once a
year.
[Figure: design process and automated publication. Recoverable labels: design process, internal review, forum review, datatypes, re-usable components, automated publication, schemas and non-technical documentation, 1 hour cycle.]
The metadata repository was thus used to store not only the full structural
definition of each schema and instance, but also the validation expressions
required to validate instances at each level, together with English
descriptions of the validation rules. Storing validation expressions in X-Meta’s
repository provided centralised management of the coding, documentation
and testing of the submission validation mechanism. This separated the
creation of instance-specific rules from the development of the validation
engine, so that validation expressions could be updated and tested quickly
to take schema changes into account. In addition to reducing the time taken
to establish the completed validation system, this separation of function
improved the accuracy of validation code and reduced the likelihood of
gaps in implementation and testing.
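
One way to picture this separation of rules from engine is a rule record that carries both the executable expression and its English description, so that documentation and validation are derived from the same entry. The sketch below is an assumption about shape only: the rule identifier, the field names and the use of a Python callable as the "expression" are all illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    rule_id: str
    level: int                                    # validation level at which the rule applies
    description: str                              # English text for published documentation
    expression: Callable[[Dict[str, str]], bool]  # returns True when the data passes


RULES: List[Rule] = [
    Rule(
        rule_id="R1",
        level=2,
        description="If a leaving date is given, a start date must also be given.",
        expression=lambda d: not d.get("leaving_date") or bool(d.get("start_date")),
    ),
]


def published_rules(rules: List[Rule]) -> List[str]:
    """Documentation view: one line of English per rule."""
    return [f"{r.rule_id} (level {r.level}): {r.description}" for r in rules]


def failures(rules: List[Rule], data: Dict[str, str], level: int) -> List[str]:
    """Engine view: identifiers of the rules at this level that the data fails."""
    return [r.rule_id for r in rules
            if r.level == level and not r.expression(data)]
```

The engine (`failures`) knows nothing about any particular form, so a rule can be added or amended without touching it.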
sion, stored in the metadata repository and linked to the relevant structural
node or datatype. This would be expressed in Java, XSLT or XPath - or
any other language appropriate to the downstream implementation of the
rule.
Validation generation
Within the metadata repository, then, a given data constraint (or "business
rule") was tied to the appropriate part of the document structure and
associated with documentation rubric and other information (such as whether
it was to be fired for test or live services). As part of the process of
generating the full definition of a given interface, these data constraints
would be exported to the outputs API as a list of validation expressions. At
the time of export each expression would be associated with its contextual
information, and from that information validation code would be generated.
This validation code would provide for the navigation through the target
instance document and the firing of each individual validation expression at
the appropriate point in the traversal.
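
The traversal-and-fire behaviour described above can be sketched as follows. Here a closure built from a table of rules stands in for generated validation code, and the element names and the rule itself are hypothetical; the aim is only to show expressions being bound to a point in the document structure and fired when the traversal reaches it.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Tuple

# (failure message, expression evaluated against the context element)
Rule = Tuple[str, Callable[[ET.Element], bool]]


def build_validator(rules_by_path: Dict[str, List[Rule]]
                    ) -> Callable[[ET.Element], List[str]]:
    """Return a function that walks a document and fires rules by element path."""
    def validate(root: ET.Element) -> List[str]:
        errors: List[str] = []

        def walk(node: ET.Element, path: str) -> None:
            # Fire every expression bound to this point in the structure.
            for message, expression in rules_by_path.get(path, []):
                if not expression(node):
                    errors.append(f"{path}: {message}")
            for child in node:
                walk(child, f"{path}/{child.tag}")

        walk(root, "/" + root.tag)
        return errors

    return validate


# Hypothetical rule: within each Employee, Net must equal Pay minus Tax.
rules = {
    "/Return/Employee": [
        ("Net must equal Pay minus Tax",
         lambda e: int(e.findtext("Net"))
         == int(e.findtext("Pay")) - int(e.findtext("Tax"))),
    ],
}

DOC = """<Return>
  <Employee><Pay>100</Pay><Tax>20</Tax><Net>80</Net></Employee>
  <Employee><Pay>100</Pay><Tax>20</Tax><Net>90</Net></Employee>
</Return>"""

errors = build_validator(rules)(ET.fromstring(DOC))
```

Only the second `Employee` element breaks the rule, so `errors` holds a single message for the `/Return/Employee` context.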
The validation code would then be passed to an architectural wrapper layer,
where the validation object would be created in the format required by the
relevant target architecture: for instance, a Java class, COM object or
DLL. The resulting validation objects could then be registered and made
available as appropriate to the validation server for test or live services.
[Figure: generation of validation objects and documents from the X-Meta central data model. Recoverable labels: list of business rule constraints; validation expressions; architecture-specific validation objects; documents, schemas and samples; IR back office systems.]
Latest developments
Live transformation
Because X-Meta stores structures for XML and non-XML formats, it can
be used to describe the relationship between incoming (XML) data and
target (back-office) formats. As the Inland Revenue implementation
continues, we expect to have to generate code to convert incoming submissions
into the formats mandated by the back-office systems, cutting out the
need to write interface specifications and translate those into code. This
code would be run by a processing engine in a load-balanced environment
designed to handle multiple transactions per second.
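
As a rough sketch of what such a conversion might amount to, the following maps fields of an incoming XML record onto a fixed-width output line using a declarative mapping table. The element names, the field widths and the fixed-width target are all assumptions made for illustration; the actual back-office formats are not described here.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# (child element name, output field width) - an invented mapping for illustration
MAPPING: List[Tuple[str, int]] = [("Surname", 10), ("Reference", 8), ("Pay", 6)]


def to_fixed_width(record: ET.Element, mapping: List[Tuple[str, int]]) -> str:
    """Convert one XML record into a fixed-width line, field by field."""
    fields = []
    for name, width in mapping:
        value = (record.findtext(name) or "").strip()
        if len(value) > width:
            raise ValueError(f"{name} exceeds {width} characters")
        fields.append(value.ljust(width))  # pad each field to its declared width
    return "".join(fields)


record = ET.fromstring(
    "<Employee><Surname>Smith</Surname>"
    "<Reference>AB123456</Reference><Pay>1200</Pay></Employee>"
)
line = to_fixed_width(record, MAPPING)
```

Keeping the mapping as data rather than code is what would let it be held alongside the structures it relates.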
EDIFACT validation
I mention automated validation of EDIFACT submissions here only because
of the light it shines, by contrast, on XML practices. In the XML
world, machine-readable schemas and readily available, open source,
validating parsers provide a level of interoperability which other data formats
can only envy. As we have seen above, even in the XML world there is a
shortfall between the requirement for full validation of co-constraints and
the ability of XSDL as currently constituted to validate more than
structural and datatype-based parameters.
XBRL
The latest phase of the FBI project is developing a service to allow the
online submission of Corporation Tax returns. A strategic decision has been
taken by the Inland Revenue to use Extensible Business Reporting
Language (XBRL) to encode these documents. XBRL taxonomies are defined
within the metadata repository and a new generator plugin has been added
to cope with the specific requirements of XBRL definitions. In due course,
the project is expected to be expanded to support XBRL data constraints
based on the Inland Revenue’s internal business rules.
Validating XBRL, however, is considerably more complex than validating
XML documents, and a full treatment would be outside the scope of
this paper. At its simplest, XBRL’s use of linkbases introduces an extra
level of consistency requirements between taxonomy and instance. None
of these types of validation is currently available in either the commercial
or open source markets, and we are now developing a set of XBRL
tools to provide generic XBRL validation. For further detail, please see
http://xbrl.decisionsoft.com .
Summary
The Inland Revenue’s implementation of a model-driven design and
validation system has already delivered major benefits by rationalising the
design process and drastically reducing the time taken to create
documentation and validation services. As the implementation progresses, we expect
further development to continue to reduce the amount of work required of
the Inland Revenue’s domain experts and simplify the continuing
maintenance of the FBI service.
As a metadata repository for e-business systems development, the Inland
Revenue’s "Common Data Set" has three unique features that yield key
business benefit. Firstly, the metadata is held at the highest level of
abstraction. Most commercial tools only hold metadata in specific pre-defined
formats, such as XSDL schema. This limits the use to which metadata can
be put. In this case, however, the data and data structures are described in
a simple and generic format which doesn’t unduly limit the form of any
input or output.
The second benefit of the system is that the repository tightly couples
datatypes and structures with business process descriptions, transformation
mappings, validation rules and code fragments. This significantly eases the
task of maintaining and amending the metadata.
The third benefit is that client interfaces were designed with the non-XML
specialist in mind. Building and maintaining the repository doesn’t require
any specialist XML knowledge. On the outputs side, whilst essential XML
Schemas (or XBRL taxonomies) are generated, it is DecisionSoft’s
experience that both creators and consumers of the specifications need
significant additional documentation - e.g. business rules, instance documents,
descriptive samples, human-readable XML, spreadsheets and other helpful
outputs - to get them operational quickly.
In working with the United Kingdom’s Inland Revenue, DecisionSoft has
developed a new approach to maintaining interface definitions and so
ensuring semantic interoperability. The unique feature of the system
implemented here is that it stores metadata at the highest level of abstraction in
a central repository from which any XML asset, code engine or
documentation set can be generated on demand.