
Data Management & Warehousing

WHITE PAPER

Process Neutral Data Modelling


DAVID M WALKER
Version: 1.0
Date: 10/02/2009

Data Management & Warehousing

138 Finchampstead Road, Wokingham, Berkshire, RG41 2NU, United Kingdom

http://www.datamgmt.com

Table of Contents
Synopsis
Intended Audience
About Data Management & Warehousing
Introduction
The Problem
    The Example Company
    The Real World
The Customer Paradigm
Requirements of a Data Warehouse Data Model
    Assumptions
    Requirements
The Data Model
    Major Entities
    Type Tables
    Band Tables
    Property Tables
    Event Tables
    Link Tables
    Segment Tables
The Sub-Model
    History Tables
    Occurrences and Transactions
Implementation Issues
    The ‘Party’ Special Case
    Partitioning
    Data Cleansing
    Null Values
    Indexing Strategy
    Enforcing Referential Integrity
    Data Insert versus Data Update
    Row versus Set Based Loading in ETL
    Disk Space Utilisation
    Implementation Effort
Data Commutativity
Data Model Explosion and Compression
    How big does the data model get?
    Can the data model be compressed?
Which Results to Store?
The Holistic Approach
Summary
Appendix 1 – Data Modelling Standards
    General Conventions
    Table Conventions
    Column Conventions
    Index Conventions
    Standard Table Constructs
    Sequence Numbers For Primary Keys
Appendix 2 – Understanding Hierarchies
    Sales Regions
    Internal Organisation Structure
Appendix 3 – Industry Standard Data Models
Appendix 4 – Information Sparsity
Appendix 5 – Set Processing Techniques
Appendix 6 – Standing on the shoulders of giants


Further Reading
    Overview Architecture for Enterprise Data Warehouses
    Data Warehouse Governance
    Data Warehouse Project Management
    Data Warehouse Documentation Roadmap
    How Data Works
List of Figures
Copyright


Synopsis
This paper describes in detail the process for creating an enterprise data warehouse physical
data model that is less susceptible to change. Change is one of the largest on-going costs in
a data warehouse and therefore reducing change reduces the total cost of ownership of the
system. This is achieved by removing business process specific data and concentrating on
core business information.

The white paper examines why data-modelling style is important and how issues arise when
using a data model for reporting. It discusses a number of techniques and proposes a specific
solution. The techniques should be considered when building a data warehouse solution even
when an organisation decides against using the specific solution.

This paper is intended for a technical audience and project managers involved with the
technical aspects of a data warehouse project.

Intended Audience
Reader                   Recommended Reading
Executive                Synopsis
Business Users           Synopsis
IT Management            Synopsis
IT Strategy              Entire Document
IT Project Management    Entire Document
IT Developers            Entire Document

About Data Management & Warehousing


Data Management & Warehousing is a specialist consultancy in data warehousing, based in
Wokingham, Berkshire in the United Kingdom. Founded in 1995 by David M Walker, our
consultants have worked for major corporations around the world, including in the US, Europe,
Africa and the Middle East. Our clients are invariably large organisations with a pressing need
for business intelligence. We have worked in many industry sectors but have specialists in
telcos, manufacturing, retail, finance and transport, as well as technical expertise in many of
the leading technologies.

For further information visit our website at: http://www.datamgmt.com

Crossword Clue: Expert Gives Us Real Understanding (4 letters)


Introduction
Commissioning a data warehouse system is a major undertaking. Organisations will invest
significant capital in the development of the system. The data model is always a major
consideration and many projects will spend a significant part of the budget on developing and
re-working the initial data model.

Unfortunately, projects often fail to look at the maintenance costs of the data model that
they develop. A data model that is fit for purpose when developed will rapidly become an
expensive overhead if it needs to change when the source systems change. The cost
involved is not only in the change to the data model but also in the changes to the ETL
processes that feed it.

This problem is exacerbated by the fact that changes to the data model may be made in a
way that is inconsistent with the original design approach. The data model loses transparency
and becomes even more difficult to maintain.

For many large data warehouse solutions it is not uncommon, within a short time of going
live, to have one resource permanently assigned to maintaining the data model and several
more assigned to managing the change in the associated ETL.

By understanding the problem and using techniques imported from other areas of systems
and software development, as well as change management techniques, it is possible to
define a method that will greatly reduce this overhead.

This white paper sets out an example of the issues from which to develop a statement of
requirements for the data model and then demonstrates a number of techniques which, when
used together, can address those requirements in a sustainable way.


The Problem
Data modelling is the process of defining the database structures in which to hold information.
To understand the Process Neutral Data Modelling approach this paper first looks at why
these database structures have such an impact on the data warehouse.

In order to demonstrate the issues with creating a data model for a data warehouse, more
experienced readers are asked to bear with the necessarily simplistic examples that follow.

The Example Company


A company supplies and installs widgets. There are a number of different widget types,
each having a name and specific colour. Each individual widget has a unique serial
number and can have a number of red lamps and a number of green lamps plugged
into it. The widgets are installed into cabinets at customer sites and from time to time
engineers come in and change the relative numbers of red and green lamps. Cabinets
are identified by the customer name and a customer cabinet number. For operational
systems the data model might look something like this [1]:

Figure 1 - Initial Operational System Data Model [2]

This simple data model describes both the widget and the cabinet and provides the
current combinations. It does not provide any historical context: “What was the
previous configuration and when was it changed?”

Historical data can be recorded by simply adding start date and end date columns to
each of the main tables. This provides the ability to report on the historical
configuration [3]. In order to facilitate this a separate reporting environment would be
set up, because retaining history in the operational system would unacceptably reduce
the operational system’s performance. There are three consequences of doing this:

• Queries are now more complex. In order to report the information for a given
date the query has to allow for the required date being between the start date
and the end date of the record in each of the tables. The extra complexity
slows the execution of the query.

• The volume of data stored has also increased. The storage of dates has a
minor impact on the size of each row but this is small when compared to the
number of additional rows that need to be stored [4].

• Data has to be moved from the operational system to the reporting system
via an extract, transform and load (ETL) process. This process has to extract
the data from the operational system, compare the records to the current
records in the reporting system to determine if there are any changes and, if
so, make the required adjustments to the existing record (e.g. updating the
end date) and insert the new record. Already the process is more complex
and time-consuming than simply copying the data across [5].

[1] Data models in this document are illustrative and should therefore be viewed as suitable for
making specific points rather than as complete production-quality solutions. Some errors exist
to explicitly demonstrate certain issues.
[2] There are several conventions for data modelling. In this and subsequent diagrams the link
with a 1 and ∞ represents a one-to-many relationship where the ‘1’ record is a primary key field
and the ‘∞’ represents the foreign key field.
[3] Note that the ‘WIDGET_LOCATIONS’ table requires an additional field called
‘INSTALL_SEQUENCE’ to allow for the case where a widget is re-installed in a cabinet.

Figure 2 - Initial Reporting System Data Model

When the reporting system is built it accurately reflects the current business
processes and operational systems, and provides historical data. From a systems
management perspective there is now an additional database, and a series of ETL or
interface scripts that have to be run reliably every day.

The systems architecture may be further enhanced so that the reporting system
becomes a data warehouse and the users make their queries on data marts, or sets
of tables where the data has been re-structured in order to simplify the users’ query
environment. The data marts typically use star-schema or snowflake-schema data
modelling techniques or tool-specific storage strategies [6]. This adds an additional layer
of ETL to move data between the data warehouse and the data mart.

However the company doesn’t stop here. The product development team create a
new type of widget. This new widget allows amber lamps and can optionally be
mounted in a rack that is in turn mounted in a cabinet. The IT director also insists that
the new OLTP application be more flexible for other future developments.

[4] Assume that everything remains the same except that widgets are moved around (i.e. there are no
new widgets and no new cabinet/customer combinations); then the WIDGET_LOCATIONS table grows in
direct proportion to the number of changes. If each widget were modified in some way once a month
then the reporting system table would be twelve times bigger than the operational system table after one
year, and this is before any other change is handled.
[5] Additional functionality such as data cleansing will also impact the complexity of the ETL and affect
performance.
[6] This is accepted good practice; the design and implementation of data marts is outside the scope
of this paper.


These business process changes result in a new data model for the operational
system.

Figure 3 - Second Version Operational System Data Model

The reporting system is also now a live system with a large amount of historical
information. It too has to be re-designed. The operational system will be implemented to
meet the business requirements and timescales regardless of whether the reporting
system is ready. It also may not be possible to create the history required for the new
data model when it is changed [7].

If a data mart is built from the data warehouse there are two impacts: firstly the
data mart model will need to be changed to exploit the new data, and secondly the
change to the data warehouse model will require the data mart ETL to be modified
regardless of any changes to the data mart data model.

The example company does not stop here, however, as senior management decide to
acquire a smaller competitor. The new subsidiary has its own systems that reflect
its own business processes. The data warehouse was built with a promise of
providing integrated management reporting, so there is an expectation that the
data from the new source system will be quickly and seamlessly integrated into the
data warehouse. From a technical perspective this could present issues around
mapping the new source system data model to the existing data warehouse data
model, critical information data types [8], duplication of keys [9], etc. that all cause
problems with the integration of data and therefore slow down the processing.

Within a few short iterations of change it is possible to see the dramatic impact on the
data warehouse and that the system is likely to run into issues.

[7] A common example of this is an organisation that captures the fact that an individual is married or not.
Later the organisation decides to capture the name of the partner if someone is married. It is not
possible to create the historical information systemically, so for a period of time the system has to
support the continued use of the marital status and then possibly run other activities, such as outbound
calling, to complete the missing historical data.
[8] The example database assumed that the serial number was numeric and used it as a primary key, but
what happens if the acquired company uses alphanumeric serial numbers?
[9] If both companies use numbers starting from 1 for their customer IDs then there will be two customers
who have the same ‘unique’ ID, and customers that have two ‘unique’ IDs.


The Real World


The example above is designed to illustrate some of the issues that affect data
warehouse data modelling. In reality business and technical analysts will handle some
of these issues in the design phase but how big is the data-modelling problem in the
real world?

• A UK transport industry organisation has three mainframes, each of which is
only allowed to perform one release a quarter. Each system also feeds the
data warehouse. As a consequence the mainframe feeds require validation
and change every month. Whilst the main data comes from these three
systems there are sixty-five other Unix-based operational systems that feed
the data warehouse, and data from several hundred desktop-based
applications also provides input. Most of these source systems do not
have good change control or governance procedures to assist in impact
analysis. Change for this organisation is business as usual.

• A global ERP vendor supplies a system with over five thousand database
objects and typically makes a major release every two years, a ‘dot’ release
every six months, and has numerous patches and fixes in between each
major release. This type of ERP system is in use in nearly every major
company and the data is a critical source for most data warehouses.

• A global food and drink manufacturer that came into existence as a result of
numerous mergers and acquisitions, and also divested some assets, found
itself with one hundred and thirty-seven general ledger instances in ten
countries with seventeen different ERP packages. Even where the ERP
packages were the same they were not necessarily using the same version of
the package. The business intelligence requirement was for a single data
warehouse and a single data model.

• A European Telco purchased a three-hundred-table ‘industry standard’
enterprise data model from a major business intelligence vendor and then
spent two years analysing it before they started the implementation. Within
six months of implementation they had changed some sixty percent of the
tables as a result of analysis omissions.

• A UK-based banking and insurance business outsources all of its product
management to business partners and only maintains the unified customer
management systems (website, call centres and marketing). As a result
nearly all of the ‘source systems’ are external to the organisation, and whilst
there are contractual agreements about the format and data remaining fixed,
in practice there is significant regular change in the format and information
provided to both operational and reporting systems.

Obviously these issues cannot be fixed just by creating the correct data model for the
data warehouse [10], but the objective of the data model design should be twofold:

• To ensure that all the required data can be stored effectively in the data
warehouse.

• To ensure that the design of the data model does not impose cost and, where
possible, actively reduces the cost of change on the system.

[10] Data Management & Warehousing have published a number of other white papers that are available
at http://www.datamgmt.com and look at other aspects of data warehousing, addressing some of these
issues. See Further Reading at the end of this document for more details.


The Customer Paradigm


Data warehouse developments often start with a requirements gathering exercise. This may
take the form of interviews or workshops where people try to define what the customer is. If a
number of different parts of the business are involved then the definition of customer soon
becomes confused and controversial, and this negatively impacts the project. Most organisations
have a sales funnel that describes the process of capturing, qualifying, converting and
retaining customers.

• Marketing say that the customer is anyone and everyone that they
communicate with.

• The sales teams view the customer as those organisations in their qualified
lead database or for whom they have account management responsibility
post-sales.

• The customer services team are clear that the customer is only those
organisations who have purchased a product and, where appropriate, have
purchased a support agreement as well.

• Other questions are asked in the workshops, such as “What about customers
who are also suppliers or partners?” and “How do we deal with customers who
have gone away and then come back after a long period of time?”

Figure 4 - The Sales Funnel

The most common solutions that are created as a result either add ‘flag’ or
‘indicator’ columns to the customer table to represent each category, or create multiple
tables for the different categories required and repeat the data in each of the tables.

This example clearly demonstrates that the business process is being embedded into the
data model. The current business process definition(s) of customer are defining how the data
model is created. What has been forgotten is that these ‘customers’ exist outside the
organisation and it is their interaction with different parts of the organisation that defines their
status as customer, supplier, etc. In legal documents there is the concept of a ‘party’,
where a party is a person or group of persons that compose a single entity that can be
identified as one for the purposes of the law [11]. This definition is one that should be borrowed
and used in the data model.

If users query a data mart that is loaded with data extracted from the transaction repository,
and data marts are built for a specific team or function that only requires one definition of the
data [12], then the current definition can be used to build that data mart and different definitions
used for other departments.

[11] http://en.wikipedia.org/wiki/Party_(law)
[12] This also allows flexibility as, when business processes change, it is possible at a cost to change the
rules by which data is extracted. The cost of change is much lower than trying to rebuild the data
warehouse and data mart with a new definition.


As a result of this approach two questions are common:

• Isn’t one of the purposes of building a data warehouse to have a single version of the
truth?
Yes. There is a single version of the truth in the data warehouse and this single
version is perpetuated into the data marts; the difference is that the information in the
data mart is qualified. Asking the question “How many customers do we have?”
should get the answer “Customer Services have X active service contract customers”
and not the answer “X” without any further qualification.
• What happens if different teams or departments have different data?
People within the organisation work within different processes and with the same
terminology but often different definitions. It is unlikely and impractical in the short
term to change this, although it is possible that in the long term the data warehouse
project will help with the standardisation process. In the meantime it is an education
process to ensure that answers are qualified. It is important to recognise that different
departments legitimately have different definitions and therefore to recognise and
understand the differences, rather than fighting about who is right.

It might be argued that there are too many differences to put all individuals and organisations
in a single table; this and other issues will be discussed later in the paper.


Requirements of a Data Warehouse Data Model


Having looked at the problems that can affect a data warehouse data model it is possible to
describe the requirements that should be placed on any data model design.

Assumptions
1. The data model is for use in the architectural component called the transaction
13
repository or data warehouse.

2. As the data model is used in the data warehouse it will not be a place where
users go to query the data; instead users will query separate dependent data
marts.

3. As the data model is used in the data warehouse data will be extracted from it
to populate the data marts by ETL tools.

4. As the data model is used in the data warehouse the data will be loaded into it
from the source systems by ETL tools.

5. Direct updates (i.e. not through formally released ETL processes) will be
prohibited; instead a separate application or applications will exist as a
surrogate source.

6. The data model will not be used in a ‘mixed mode’ where some parts use one
data modelling convention and other parts use another. (This is generally bad
practice with any modelling technique but is often the outcome where the
responsibility for data modelling is distributed or re-assigned over time.)

Requirements
1. The data model will work on any standard business intelligence relational
database [14]. This is to ensure that it can be deployed on any current platform
and, if necessary, re-deployed on a future platform.

2. The data model will be process neutral, i.e. it will not reflect current business
processes, practices or dependencies but instead will store the data items and
relationships as defined by their use at the point in time when the information is
acquired.

3. The data model will use a design pattern [15], i.e. a general reusable solution to a
commonly occurring problem. A design pattern is not a finished design but a
description or template for how to solve a problem that can be used in many
different situations.

[13] For further information on Transaction Repositories see the Data Management & Warehousing white
paper ”An Overview Architecture For Enterprise Data Warehouses”.
[14] A typical list would (at the time of writing) include IBM DB2, Microsoft SQL Server, Netezza, Oracle,
Sybase, Sybase IQ, and Teradata. For the purposes of this document it implies compliance with at least
the SQL92 standard.
[15] http://en.wikipedia.org/wiki/Software_design_pattern


4. Convention over configuration [16]: This is a software design paradigm which
seeks to decrease the number of decisions that developers need to make,
gaining simplicity but not necessarily losing flexibility. It can be applied
successfully to data modelling, reducing the number of decisions the data
modeller makes by ensuring that tables and columns use a standard naming
convention and are populated and queried in a consistent fashion. This also
has a significant impact on the efforts of an ETL developer.

5. The design should also follow the DRY (Don’t Repeat Yourself) principle. This
is a process philosophy aimed at reducing duplication. The philosophy
emphasises that information should not be duplicated, because duplication
increases the difficulty of change, may decrease clarity, and leads to
opportunities for inconsistency [17].

6. The data model should be significantly static over a long period of time, i.e.
there should not be a need to add or modify tables on a regular basis. In this
case there is a difference between designed and implemented: it is possible to
have designed a table but not to implement it until it is actually required. This
does not affect the static nature of the data model, as the placeholder already
exists.
7. The data model should store data at the lowest possible level [18] and avoid the
storage of aggregates.

8. The data model should support the best use of platform-specific features whilst
not compromising the design [19].

9. The data model should be completely time-variant, i.e. it should be possible to
reconstruct the information at any available point in time [20].

10. The data model should act as a communication tool to aid the refinement of
requirements and an explanation of possibilities.

[16] For further information see http://en.wikipedia.org/wiki/Convention_over_Configuration and
http://softwareengineering.vazexqi.com/files/pattern.html. The Ruby on Rails framework
(http://www.rubyonrails.org/) makes extensive use of this principle.
[17] DRY is a core principle of Andy Hunt and Dave Thomas's book The Pragmatic Programmer. They
apply it quite broadly to include "database schemas, test plans, the build system, even documentation."
When the DRY principle is applied successfully, a modification of any single element of a system does
not change other logically unrelated elements. Additionally, elements that are logically related all change
predictably and uniformly, and are thus kept in sync. (http://en.wikipedia.org/wiki/DRY). This does not
automatically imply database normalisation, but database normalisation is one method for ensuring
‘dryness’.
[18] This is the origin of the term ‘Transaction Repository’ rather than ‘Data Warehouse’ in Data
Management & Warehousing documentation. The transaction repository stores the lowest level of data
that is practical and/or available. (See An Overview Architecture for Enterprise Data Warehouses.)
[19] This turns out to be both simple and very effective. For Oracle the most common features that need
support include partitioning and materialized views. For Sybase IQ and Netezza there is a preference for
inserts over updates due to their internal storage mechanisms. For all databases there is variation in
indexing strategies. These and other features should be easily accommodated.
[20] Also known as temporal. Most data warehouses are not linearly time-variant but quantum time-variant.
If a status field is updated three times in a day and the data warehouse reflects all changes then it is
linearly time-variant. If a data warehouse holds the first and last values only, because a batch process
loads it once a day, then it is quantum time-variant where the quantum is, in this case, one day.
Quantum time-variant solutions can only resolve data to the level of the quantum unit of measure.


The Data Model


Having defined the requirements for the data model it is now possible to start
looking at what is needed to design it. This is done by breaking down the tables
that will be created into different groups depending on how they are used. The section below
discusses the main elements of the data model. There are some basics, such as naming
conventions, standard short names and the keys used in the data model, that are not described
here; a complete set of data modelling rules and example models can be found in the appendices.

Major Entities
Party is, as described in the customer paradigm section above, an example of a type of
table within the Process Neutral Data Modelling method known as a ‘Major Entity’.
These are tables that deliver the placeholders for all major subject areas of the data
model and around which other information is grouped. Each business transaction will
relate to a number of major entities. Some major entities are global, i.e. they apply to all
types of organisation (e.g. Calendar), and a number of major entities are
industry specific (e.g. for Telco, Manufacturing, Retail, Banking, etc.). It would be very
unusual for an organisation to need a major entity that was not industry-wide. Below is
a list of some of the most common:

• Calendar
Every data warehouse will need a calendar. It should always contain data to
the day level and never to parts of the day. In some cases there is a need to
support sub-types of calendar for non-Gregorian calendars [21].

• Party
Every organisation will have dealings between parties. This will normally
include three major sub-types: individuals, organisations (any formal
organisation such as a company, charity, trust, partnership, etc.) and
organisational units (the components within an organisation, including the
system owner’s organisation).

• Geography
The information about where. This is normally sub-typed into two components,
address and location. Address information is often limited to postal addresses [22]
whilst location is normally described by the longitude and latitude via GPS co-
ordinates. Other specialist geographic models exist that may need to be taken
into account [23].

• Product_Service (also known as Product or as Service)


This is the catalogue of the products and/or services that an organisation
supplies.

• Account
Every customer will have at least one account if financial transactions are
involved (even those organisations that do not think they currently use the
concept of account will do so as accounting systems always have the concept
of a customer with one or more accounts).

[21] See http://www.qppstudio.net/footnotes/non-gregorian.htm for various calendars; notably 2008 is the
Muslim Year 1429 and the Jewish Year 5768.
[22] Some countries, such as the UK, have validated lists of all addresses (see the UK Post Office
Postcode Address File at http://www.royalmail.com/portal/rm/jump2?mediaId=400085&catId=400084).
[23] Network Rail in the UK use an Engineers Line Reference, which is based on a linear reference model
and refers to a known distance from a fixed point on a track. In Switzerland there is an entire national
co-ordinate system (http://en.wikipedia.org/wiki/Swiss_coordinate_system).


• Electronic_Address
Any electronic address such as a telephone number, email address, web
address, IP address etc. This is normally sub-typed by the categories used.

• Asset (also known as Equipment)
A physical object that can be uniquely identified (normally by a serial number or
similar). This may be used or incorporated in a PRODUCT_SERVICE, or sold
to a customer, etc. In the example company, Cabinet, Rack and Widget were all
examples of Asset, whilst Widget Type was an example of PRODUCT_SERVICE.

• Component
A physical object that cannot be uniquely identified by a serial number but has
a part number and is used in the make-up of either an asset or a product
service. In the example company there was no particular record of the serial
numbers of the lamps; however, they would all have had a part number that
described the type of lamp to be used.

• Channel
A conceptual route to market (e.g. direct, indirect, web-based, call-centre, etc.).

• Campaign
A marketing exercise that is designed to promote the organisation, e.g. the
running of a series of adverts on the television.

• Campaign Activities
The running of a specific advert as part of a larger campaign.

• Contract
Depending on the type of business the relationship between the organisation
and its supplier or its customer may require the concept of a contract as well as
that of an account.

• Tariff (also known as Price_List)
A set of charges and discounts that can be applied to product services at a
point in time.

This list is not comprehensive, but if an organisation can effectively describe its major
entities and combine this information with the interactions between them (the
occurrences or transactions) then it has the basis of a very successful data
warehouse.

Major Entities can have any meaningful name provided it is not a reserved word in the
database or (as will be seen below) a reserved word within the design pattern of
Process Neutral Data Modelling.

Some readers familiar with the concepts of star schemas and data marts will
also recognise that these are very close to the basic dimensions that most data marts
use. This should come as no surprise, as these are the major data items of any
business regardless of their business processes or of their specific industry sector, and
a data mart is only a simplification of the data presented for the user. This effect is
called “natural star schemas” and will be explored in more detail later.


Lifetime Value
The next decision is which columns (attributes) should be included in the table.
Much like the processes involved in normalising a database [24], the objective is to
minimise duplication of data, and there is also a requirement to minimise updates.
To this end the attributes that are included should have ‘lifetime value’,
i.e. they should remain constant once they have been inserted into the database.
This means that variable data needs to be handled elsewhere.

Using some of the major entities above as examples:

Calendar:
    Lifetime Value Attributes: Date, Public Holiday Flag

Geography:
    Lifetime Value Attributes: Address Line 1, Address Line 2, City,
    Postcode [25], County, Country
    Non-Lifetime Value Attributes: Population

Party (Individuals):
    Lifetime Value Attributes: Forename, Surname [26], Date of Birth,
    Date of Death, Gender [27], State ID Number
    Non-Lifetime Value Attributes: Marital Status, Number of Children, Income

Party (Organisations):
    Lifetime Value Attributes: Name, Start Date, End Date,
    State ID Number
    Non-Lifetime Value Attributes: Number of Employees, Turnover,
    Shares Issued

Account:
    Lifetime Value Attributes: Account Number, Start Date, End Date
    Non-Lifetime Value Attributes: Balance

Other than this lifetime value requirement for columns, every table must comply with the
general rules for any table. For example every table will have a key column that uses
the table short name made up of six characters and the suffix _DWK [28], a TIMESTAMP
column and an ORIGIN column.

[24] http://en.wikipedia.org/wiki/Database_normalization: Database normalization is a technique for
designing relational database tables to minimize duplication of information and, in so doing, to safeguard
the database against certain types of logical or structural problems, namely data anomalies.
[25] This may occasionally be a special case as postal services do, from time to time, change postal codes
that are normally static.
[26] There is a specific special case that deals with the change of name for married women that will be
dealt with in the section ‘The Party Special Case’ later.
[27] One insurance company had to deal with updatable genders due to the fact that underwriting rules
require assessment based on birth gender and not gender as a result of re-assignment surgery.
Therefore for marketing it had to handle ‘current’ gender and for underwriting it had to deal with ‘birth’
gender.
[28] See the data modelling rules appendix for how this name is created.
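To make these conventions concrete, below is a minimal sketch of what the PARTIES
major entity might look like as SQL DDL. It is an illustration only: the attribute columns
are taken from the lifetime value examples above, the data types and sizes are
assumptions, and only the PARTIE short name, the _DWK key and the TIMESTAMP and
ORIGIN columns follow the stated conventions (TIMESTAMP is a reserved word on some
platforms and may need quoting):

    CREATE TABLE PARTIES (
        PARTIE_DWK            INTEGER      NOT NULL, -- key: six character short name + _DWK
        PARTIE_FORENAME       VARCHAR(50),           -- lifetime value attributes only;
        PARTIE_SURNAME        VARCHAR(50),           -- variable data such as marital status
        PARTIE_DATE_OF_BIRTH  DATE,                  -- is handled elsewhere (see Property
        PARTIE_DATE_OF_DEATH  DATE,                  -- Tables below)
        PARTIE_GENDER         CHAR(1),
        TIMESTAMP             DATE         NOT NULL, -- when the row was loaded
        ORIGIN                VARCHAR(30)  NOT NULL, -- the source system the row came from
        PRIMARY KEY (PARTIE_DWK)
    );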


Type Tables
There is often a need to categorise information into discrete sets of values. The valid
set of categories will probably change over time and therefore each category record
also needs to have lifetime value. Examples of this categorisation have already
occurred with some of the major entities:

• Party: Individual, Organisation, Organisation Unit
• Geography: Postal Address, Location
• Electronic Address: Telephone, E-Mail

To support this, and to comply with the requirement for convention over configuration, all
_TYPES tables of this format have a standard data model as follows:

• The table will have the same name as the major entity but with the suffix
_TYPES (e.g. PARTY_TYPES, GEOGRAPHY_TYPES, etc.).
• The table will always have a key column that uses the six character short code
and the _DWK suffix.
• The table will have a _TYPE column that is the type name.
• The table will have a _DESC column that is a description of the type.
• The table will have a _GROUP column that groups certain types together.
• The table will have a _START_DATE column and a _END_DATE column.

This is a type table in its entirety. If a table needs more information (i.e. columns) then
this is not a _TYPES table and must not have the _TYPES extension, as it does not
comply with the rules for a _TYPES table.
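Under convention over configuration the DDL for any _TYPES table can be written down
almost mechanically from these rules. A minimal sketch for PARTY_TYPES follows (the
data types and sizes are assumptions for illustration; the column names follow the rules
above and the PARTYP short code used in the example data below):

    CREATE TABLE PARTY_TYPES (
        PARTYP_DWK             INTEGER      NOT NULL, -- six character short code + _DWK
        PARTY_TYPE             VARCHAR(30)  NOT NULL, -- the type name
        PARTY_TYPE_DESC        VARCHAR(255),          -- a description of the type
        PARTY_TYPE_GROUP       VARCHAR(30)  NOT NULL, -- groups certain types together
        PARTY_TYPE_START_DATE  DATE         NOT NULL, -- mandatory (see note below)
        PARTY_TYPE_END_DATE    DATE,                  -- null while the type is still current
        PRIMARY KEY (PARTYP_DWK)
    );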

Examples of data in _TYPES tables might include:

PARTY_TYPES

Column                  Row 1         Row 2          Row 3          Row 4
PARTYP_DWK              1             2              3              4
PARTY_TYPE              INDIVIDUAL    LTD COMPANY    PARTNERSHIP    DIVISION
PARTY_TYPE_GROUP        INDIVIDUAL    ORGANISATION   ORGANISATION   ORGANISATION UNIT
PARTY_TYPE_START_DATE   01-JAN-1900   01-JAN-1900    01-JAN-1900    01-JAN-1900
PARTY_TYPE_END_DATE     (null)        (null)         (null)         (null)
PARTY_TYPE_DESC         (1) An individual. (2) A company in which the liability of the
                        members in respect of the company’s debts is limited. (3) A
                        business owned by two or more people who are personally liable
                        for all business debts. (4) A division of a larger organisation.

Figure 5 - Example data for PARTY_TYPES

The start date has little initial value in this context, although it is a
mandatory field [29] and therefore has to be completed with a date before the earliest
party in this example. Legal types of organisation do change over time and so it is
possible that the start and end dates of these will become significant.

These types do not describe the type of role that the party is performing (i.e. Customer,
Supplier, etc.); they describe the type of the party (e.g. Individual, etc.). Describing the
role comes later. The type and group columns are repeated for INDIVIDUAL, as there is
no hierarchy of information for this value but the field is mandatory.

[29] Start Dates in _TYPES tables are mandatory as, with only a few exceptions, they are required
information. In order to be consistent they therefore have to be mandatory for all _TYPES tables.


GEOGRAPHY_TYPES

Column                       Row 1          Row 2
GEOTYP_DWK                   1              2
GEOGRAPHY_TYPE               POSTAL         LOCATION
GEOGRAPHY_TYPE_GROUP         POSTAL         LOCATION
GEOGRAPHY_TYPE_START_DATE    01-JAN-1900    01-JAN-1900
GEOGRAPHY_TYPE_END_DATE      (null)         (null)
GEOGRAPHY_TYPE_DESC          (1) An address as supported by the postal service.
                             (2) A point on the surface of the earth defined by its
                             longitude and latitude.

Figure 6 - Example Data for GEOGRAPHY_TYPES

The start date in this context has little initial value, although it is a mandatory field and
therefore has to be completed with a date.

These types do not describe the type of role that the geography is performing (i.e.
home address, work address, etc.); they describe the type of the geography (postal
address, point location, etc.).

The type and group columns are repeated for both values, as there is no hierarchy of
information for them.

CALENDAR_TYPES

The convention over configuration design aspect allows for this table; however, it is
rarely needed and can therefore be omitted. This is an example where a table can be
described as designed (i.e. it is known exactly what it looks like) but not implemented.

_TYPES tables will appear in other parts of the data model but they will always have
the same function and format.

The consequence of this design re-use is that implementing an application to manage
the source of _TYPE data is easy [30]. The system that manages the type data needs to
have a single table with the same columns as a standard _TYPES table and an
additional column called, for example, DOMAIN. This DOMAIN column has the target
system table name (e.g. PARTY_TYPES) in it. The ETL then simply maps the data
from the source system to the target system where the DOMAIN equals the target table
name. This is an example of re-use generating a significant saving in the
implementation.

[30] This is a good use of a Warehouse Support Application as defined in “An Overview Architecture for
Enterprise Data Warehouses”.
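As an indication of how simple the resulting ETL mapping is, the sketch below loads
PARTY_TYPES from a hypothetical source table called TYPE_SOURCE that has the
standard _TYPES columns plus the DOMAIN column (all source names here are
illustrative assumptions):

    INSERT INTO PARTY_TYPES
           (PARTYP_DWK, PARTY_TYPE, PARTY_TYPE_DESC, PARTY_TYPE_GROUP,
            PARTY_TYPE_START_DATE, PARTY_TYPE_END_DATE)
    SELECT  TYPE_DWK, TYPE_NAME, TYPE_DESC, TYPE_GROUP,
            TYPE_START_DATE, TYPE_END_DATE
    FROM    TYPE_SOURCE
    WHERE   DOMAIN = 'PARTY_TYPES';  -- the DOMAIN column selects the target table

The same statement, with only the table name and the DOMAIN literal changed, loads
every other _TYPES table.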


Band Tables
Whilst _TYPES tables classify information into discrete values, it is sometimes
necessary to classify information into ranges or bands, i.e. between one value and
another. The classic example of this is telephone calls, which are classified as ‘Off-
Peak Rate’ if they are between 00:00 and 07:59 or between 18:00 and 23:59. Calls
between 08:00 and 17:59 are classified as ‘Peak Rate’ and charged at a premium.

_BANDS is a special case of the _TYPES table and would store the data as follows:

Column                       Row 1            Row 2          Row 3
TIMBAN_DWK                   1                2              3
TIME_BAND                    Early Off Peak   Peak           Late Off Peak
TIME_BAND_START_VALUE [31]   0                480            1080
TIME_BAND_END_VALUE          479              1079           1439
TIME_BAND_DESC               Early Off Peak   Peak           Late Off Peak
TIME_BAND_GROUP              Off Peak         Peak           Off Peak
TIME_BAND_START_DATE         01-JAN-1900      01-JAN-1900    01-JAN-1900
TIME_BAND_END_DATE           (null)           (null)         (null)

Figure 7 - Example data for TIME_BANDS

Once again the _BANDS table has a standard format, as follows:

• The table will have the same name as the major entity but with the suffix
_BANDS (e.g. TIME_BANDS, etc.).
• The table will always have a key column that uses the six character short code
and the _DWK suffix.
• The table will have a _BAND column that is the band name.
• The table will have a _START_VALUE and an _END_VALUE column that represent the
starting and finishing values of the band.
• The table will have a _DESC column that is a description of the band.
• The table will have a _GROUP column that groups certain bands together.
• The table will have a _START_DATE column and a _END_DATE column.

The table has to comply with this convention in order to be given the _BANDS suffix.

[31] Note that values are stored as a number of minutes since midnight.
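Classifying data against a _BANDS table is then a simple range join. The sketch below
assumes a hypothetical CALLS table with the call start time held as minutes since
midnight; only the TIME_BANDS columns are as defined above:

    SELECT c.CALL_ID,
           b.TIME_BAND_GROUP                -- e.g. 'Peak' or 'Off Peak'
    FROM   CALLS c
    JOIN   TIME_BANDS b
      ON   c.CALL_START_MINUTES BETWEEN b.TIME_BAND_START_VALUE
                                    AND b.TIME_BAND_END_VALUE;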


Property Tables
In the discussion of major entities and lifetime value, the data that failed to meet the
lifetime value principle was omitted from the major entity tables; however, it still needs
to be stored. This is handled via a property table. Property tables also help to support
the extensibility aspects of the data model.

If we use PARTY as an example then, as already identified, marital status does not
possess lifetime value and therefore is not included in the major entity. Everyone starts
as single, some marry, some divorce and some are widowed; these ‘status changes’
occur through the lifetime of the individual.

To deal with this problem the property table can be modelled as follows:

Figure 8 - Party Properties Example

As can be seen from the example above, in order to handle the properties two new tables
are created. The first is the PARTY_PROPERTIES table itself and the second a
supporting PARTY_PROPERTY_TYPES table.

In order to store the marital status of an individual a set of data needs to be entered in
the PARTY_PROPERTY_TYPES table:

TYPE          GROUP
Single        Marital Status
Married       Marital Status
Divorced      Marital Status
Co-Habiting   Marital Status

Figure 9 - Example Party Property Data

The description, start and end date would be filled in appropriately. Note that the start
and end date here represent the start and end date of the type and not that of the
individual’s use of that type [32].

It is now possible to insert a row in the PARTY_PROPERTIES table that references the
individual in the PARTY table and the appropriate PARTY_PROPERTY_TYPES record
(e.g. ‘Married’). The PARTY_PROPERTIES table can also hold the start date and end
date of this status and, optionally, where appropriate, a text or numeric value that
relates to that property.

[32] The need for start and end dates on such items is often questioned; however, experience shows that
legislation changes supposedly static values in most countries over the lifetime of the data warehouse.
For example in December 2005 the UK permitted a new type of relationship called a civil partnership.
http://en.wikipedia.org/wiki/Civil_partnerships_in_the_United_Kingdom.


This means that not only the current marital status can be stored but also historical
information.

PARTY_DWK [33]   PARTY_PROPERTY_DWK   START_DATE    END_DATE
John Smith       Single               01-Jan-1970   02-Feb-1990
John Smith       Married              03-Feb-1990   04-Mar-2000
John Smith       Divorced             05-Mar-2000   06-Apr-2005
John Smith       Co-Habiting          07-Apr-2005

Figure 10 - Example data for PARTY_PROPERTIES

The data shown here describes the complete history of an individual, with the last row
showing the current state as the START_DATE is before ‘today’ and the END_DATE is
null. There is also nothing to prevent future information from being held. If John Smith
announces that he is going to get married on a specific date in the future then the
current record can have its end date set appropriately and a new record added.
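Reconstructing the value of a property at any point in time is then a matter of date-range
predicates. A sketch, asking for each party’s marital status on an arbitrary date (the key
column names follow the naming conventions, but the PARPRT short code for
PARTY_PROPERTY_TYPES and the property column names are assumptions):

    SELECT pp.PARTIE_DWK,
           ppt.PARTY_PROPERTY_TYPE          -- e.g. 'Married'
    FROM   PARTY_PROPERTIES pp
    JOIN   PARTY_PROPERTY_TYPES ppt
      ON   pp.PARPRT_DWK = ppt.PARPRT_DWK
    WHERE  ppt.PARTY_PROPERTY_TYPE_GROUP = 'Marital Status'
      AND  pp.PARTY_PROPERTY_START_DATE <= DATE '2003-06-15'
      AND (pp.PARTY_PROPERTY_END_DATE   >= DATE '2003-06-15'
           OR pp.PARTY_PROPERTY_END_DATE IS NULL);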

If another property is required (e.g. Number of Children) then no change is required to
the data model. New rows are entered into the PARTY_PROPERTY_TYPES table:

TYPE     GROUP
Male     Number of Children
Female   Number of Children

Figure 11 - Example Data for PARTY_PROPERTY_TYPES

This allows data to be added to the PARTY_PROPERTIES as follows:

PARTY_DWK    PARTY_PROPERTY_DWK   START_DATE    END_DATE      VALUE
John Smith   Single               01-Jan-1970   02-Feb-1990
John Smith   Married              03-Feb-1990   04-Mar-2000
John Smith   Divorced             05-Mar-2000   06-Apr-2005
John Smith   Co-Habiting          07-Apr-2005
John Smith   Male                 09-Jun-2001                 1
John Smith   Female               10-Jul-2002                 1

Figure 12 - Example Data for PARTY_PROPERTIES

In fact any number of new properties can be added to the tables as business processes
and source systems change and new data requirements come about.

The effect of this method, when compared to other methods of modelling this
information, is to create very narrow (i.e. not many columns), long (i.e. many rows)
tables instead of much wider, shorter tables. However the properties table
is very effective. Firstly, unlike the example, the two _DWK columns are integers [34], as
are the start and end dates. Many of the _VALUE fields will be NULL, and those that
are not will be predominantly numeric rather than text values.

The PARTY_PROPERTY_TYPE acts as a natural partitioning key in those databases
that support table partitions. This method is very effective in terms of performance and
storage of data in databases that use column or vector type storage.

[33] Text from the related table is used in the _DWK column rather than the numeric key for clarity in these
examples.
[34] Integers are better than text strings for a number of reasons: they usually require less storage and
there is less temptation to mix the requirements of identification and description (a problem clearly
illustrated by car registration numbers in the UK). Keys are more reliable when implemented as integers
because databases often have key generation mechanisms that deliver unique values. Integers do not
suffer from upper/lower case ambiguities and can never contain special characters or ambiguities caused
by different padding conventions (trailing spaces or leading zeros).


The increase in the number of rows is normally less than expected when compared
to more conventional data modelling techniques that store duplicated rows for changed
data. The example above has seven rows of data. The alternative approach of repeated
sets of data requires six rows of data and considerably more storage because of the
duplicated data:

PARTY_DWK    START_DATE    END_DATE      MARITAL_STATUS   UNKNOWN   MALE    FEMALE
                                                          CHILD     CHILD   CHILD
John Smith   01-Jan-1970   02-Feb-1990   Single           0         0       0
John Smith   03-Feb-1990   08-Jun-2001   Married          0         0       0
John Smith   09-Jun-2001   09-Jul-2002   Married          0         1       0
John Smith   10-Jul-2002   04-Mar-2000   Married          0         1       1
John Smith   05-Mar-2000   06-Apr-2005   Divorced         0         1       1
John Smith   07-Apr-2005                 Co-Habiting      0         1       1

Figure 13 - Example Data for PARTY_PROPERTIES

The other main objection to this technique is often described as the cost of matrix
transformation of the data, that is, the changing of the data from columns into rows in
the ETL to load the data warehouse and then changing the rows back into columns in the
ETL to load the data mart(s). This objection is normally due to a lack of knowledge of
appropriate ETL techniques that can make this very efficient, such as the use of SQL set
operations like ‘UNION’, ‘MINUS’ and ‘INTERSECT’.
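As an indication of the set-based approach, the sketch below pivots property rows back
into columns for a data mart load in a single pass, using CASE expressions inside
aggregates rather than row-by-row processing (names as in the earlier examples, with
the PARPRT short code and property column names assumed; date-range handling is
omitted for brevity):

    SELECT pp.PARTIE_DWK,
           MAX(CASE WHEN ppt.PARTY_PROPERTY_TYPE_GROUP = 'Marital Status'
                    THEN ppt.PARTY_PROPERTY_TYPE END)        AS MARITAL_STATUS,
           SUM(CASE WHEN ppt.PARTY_PROPERTY_TYPE = 'Male'
                    THEN pp.PARTY_PROPERTY_VALUE ELSE 0 END) AS MALE_CHILDREN,
           SUM(CASE WHEN ppt.PARTY_PROPERTY_TYPE = 'Female'
                    THEN pp.PARTY_PROPERTY_VALUE ELSE 0 END) AS FEMALE_CHILDREN
    FROM   PARTY_PROPERTIES pp
    JOIN   PARTY_PROPERTY_TYPES ppt
      ON   pp.PARPRT_DWK = ppt.PARPRT_DWK
    GROUP  BY pp.PARTIE_DWK;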

Event Tables
An event table is almost identical to a property table except that instead of having
_START_DATE and _END_DATE columns it has a single column _EVENT_DATE. It
also has the appropriate _EVENT_TYPES table. The table name has a suffix of
_EVENTS. For example a wedding is an event (happens at a single point in time), but
‘being married’ is a property (happens over a period of time). Events can be stored in
property tables simply by storing the same value in both the start date and end date
columns and this is a more common solution than creating a separate table. The use of
_EVENTS tables is usually limited to places where events form a significant part of the
data and the cost of storing the extra field becomes significant.

It should be noted that an _EVENTS table is only required where the event may occur
many times (e.g. a wedding date) rather than for information that can only happen once
(e.g. first wedding date), which would be stored in the appropriate major entity as, once
set, it would have lifetime value.

Figure 14 - Party Events Example

_EVENTS tables are a special case of _PROPERTIES tables.
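A minimal sketch of an events table and its type table, following the naming standards
in Appendix 1 (the exact columns are assumptions, and the standard TIMESTAMP and
ORIGIN columns are omitted for brevity):

    CREATE TABLE PARTY_EVENT_TYPES (
        PARTY_EVENT_TYPE_DWK  INTEGER      NOT NULL PRIMARY KEY,
        PARTY_EVENT_TYPE_DESC VARCHAR(255) NOT NULL
    );

    CREATE TABLE PARTY_EVENTS (
        PARTY_DWK            INTEGER NOT NULL REFERENCES PARTIES (PARTY_DWK),
        PARTY_EVENT_TYPE_DWK INTEGER NOT NULL
            REFERENCES PARTY_EVENT_TYPES (PARTY_EVENT_TYPE_DWK),
        EVENT_DATE           DATE    NOT NULL,  -- a single date, not start/end
        EVENT_NUMERIC_VALUE  FLOAT,
        EVENT_TEXT_VALUE     VARCHAR(255)
    );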


Link Tables
Up to this point attributes of a single record within a major entity have been examined. It is
also possible that records within a major entity will relate to other records in the
same major entity (e.g. John Smith is married to Jane Smith, both of whom are records
within the PARTIES table). This is called a peer-to-peer relationship and is stored in a
table with the suffix _LINKS and the appropriate _LINK_TYPES table.

Figure 15 - Party Links Example

The significant difference in a _LINK table is that there are two relationships from the
major entity (in this case PARTIES).

This also allows hierarchies to be stored so that:

• John Smith (Individual) works in Sales (Organisational Unit)
• Sales (Organisation Unit) is a division of ACME Enterprises (Organisation)

where ‘works in’ and ‘is a division of’ are examples of the _LINK_TYPE.

It should also be noted that there is a priority to the relationship because one of the
linking fields is the main key (in this case PARTIE_DWK) and the other is the linked
key (in this case LINKED_PARTIE_DWK). There are two options. One is to store the
relationship in both directions (e.g. John Smith is married to Jane Smith and Jane
Smith is married to John Smith). This can be made complete with a reversing view [35]
but defeats both the ‘Convention over Configuration’ principle and the ‘DRY (Don’t
Repeat Yourself)’ principle. The second method is to have a convention and only
store the relationship in one direction (e.g. John Smith is married to Jane Smith,
therefore the convention could be that the male is stored in the main key and the
female in the linked key).
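A minimal sketch of such a reversing view, assuming the key names used above and
hypothetical names for the remaining columns:

    -- All the same columns as the underlying table, with the two keys swapped.
    CREATE VIEW PARTY_LINKS_REVERSED AS
    SELECT LINKED_PARTIE_DWK AS PARTIE_DWK,
           PARTIE_DWK        AS LINKED_PARTIE_DWK,
           PARTY_LINK_TYPE_DWK,
           LINK_START_DATE,
           LINK_END_DATE
    FROM   PARTY_LINKS;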

[35] A reversing view is one that has all the same columns as the underlying table except that the two key
columns are swapped around. In this example PARTIE_DWK would be swapped with
LINKED_PARTIE_DWK.


Segment Tables
The final type of information that might be required about a major entity is the
segment: a collection of records from the major entity that share something in
common where no more detail is known. The most common business example is the
market segmentation performed on customers. Such segments are normally the result
of detailed statistical analysis, the outcome of which is then stored.

In our example John Smith and Jane Smith could both be part of a segment of
married people along with any number of other individuals for whom it is known that
they are married but there is no information about when or to whom they are married.

Where the _LINKS table provided the peer-to-peer relationship the segment provides
the peer group relationship.

Figure 16 - Party Segments Example
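A minimal sketch of the segment tables behind this example, following the naming
standards in Appendix 1 (the columns are assumptions; the standard TIMESTAMP and
ORIGIN columns are omitted for brevity):

    CREATE TABLE PARTY_SEGMENT_TYPES (
        PARTY_SEGMENT_TYPE_DWK  INTEGER      NOT NULL PRIMARY KEY,
        PARTY_SEGMENT_TYPE_DESC VARCHAR(255) NOT NULL
    );

    CREATE TABLE PARTY_SEGMENTS (
        PARTY_DWK              INTEGER NOT NULL REFERENCES PARTIES (PARTY_DWK),
        PARTY_SEGMENT_TYPE_DWK INTEGER NOT NULL
            REFERENCES PARTY_SEGMENT_TYPES (PARTY_SEGMENT_TYPE_DWK),
        SEGMENT_START_DATE     DATE,  -- may be NULL where the date is unknown
        SEGMENT_END_DATE       DATE   -- NULL while membership is current
    );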


The Sub-Model
The major entities and the six supporting data structures (_TYPES, _BANDS,
_PROPERTIES, _EVENTS, _LINKS, and _SEGMENTS) provide sufficient design pattern
structure to hold a large part of the information in the data warehouse. This is known as a
Major Entity Sub-Model. Significantly the information that has been stored for a single major
entity sub-model is very close to the typical dimensions of a data mart. This design pattern
provides complete temporal support and the ability to re-construct a dimension or dimensions
based on a given set of business rules.

The set of a major entity and the supporting structures is known as a sub-model. For example
the designed PARTY sub-model consists of:

• PARTIES

• PARTY_TYPES
• PARTY_BANDS

• PARTY_PROPERTIES
• PARTY_PROPERTY_TYPES

• PARTY_EVENTS
• PARTY_EVENT_TYPES

• PARTY_LINKS
• PARTY_LINK_TYPES

• PARTY_SEGMENTS
• PARTY_SEGMENT_TYPES

In a given implementation, a subset of these tables might represent the PARTY sub-model actually built.

Importantly what has not been provided is the relationships between major entities and the
business transactions that occur as a result of the interaction between major entities.


History Tables
Extending the example above it is noticeable that the party does not contain any
address information; this is held in the geography major entity. This is also another
example where current business processes and requirements may change. At the
outset the source system may provide a contract address and a billing address. A
change in process may require the capture of additional information e.g. contact
addresses and installation addresses.

In practice the only difference between this type of relationship between major entities
and the _LINKS relationship is that instead of two references to the same major entity
there is one relationship to each of two major entities.

The data model is therefore relatively simple to construct:

Figure 17 – Party Geography History Example

There is one minor semantic difference between links and histories. _LINKS tables join
back on to the major entity and therefore one half of the relationship has to be given
priority. In a _HISTORY table there is no need for priority as each of the two attributes
is associated with a different major entity.

Finally note that in this example the major entity is shown without the rest of its sub-model, which can be assumed.
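A minimal sketch of the history table in Figure 17, assuming a GEOGRAPHIES major entity
and the naming standards in Appendix 1 (the related _TYPES table and the standard audit
columns are omitted for brevity):

    CREATE TABLE PARTY_GEOGRAPHY_HISTORY (
        PARTY_DWK          INTEGER NOT NULL REFERENCES PARTIES (PARTY_DWK),
        GEOGRAPHY_DWK      INTEGER NOT NULL REFERENCES GEOGRAPHIES (GEOGRAPHY_DWK),
        PARTY_GEOGRAPHY_HISTORY_TYPE_DWK INTEGER NOT NULL,  -- e.g. billing vs contact address
        HISTORY_START_DATE DATE,
        HISTORY_END_DATE   DATE  -- NULL while the relationship is current
    );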


Occurrences and Transactions

The final part of the data model is to build up the occurrence or transaction tables. In
the data mart these are most akin to the fact tables, although as this is a relational
model they may occur outside a pure star relationship. Like the major entities there is
no standard suffix or prefix, just a meaningful name.

To demonstrate what is required, an example from a retail bank is described. The
example is not nearly as complex as a real bank, but it is necessarily longer and more
complex than most examples in order to demonstrate a number of features. Banking
has been chosen because the concepts will be familiar to most readers. The example
only looks at some core banking functions and not at activities such as marketing or
specialist products such as insurance.

The Example

The bank has a number of regions and a central ‘premium’ account function that
caters for some business customers. Each region has a number of branches.
Branches have a manager and a number of staff. Each branch manager reports
to a regional manager.

If a customer has a personal account then the account manager is a branch
personal account manager; however if the individual has a net worth in excess of
USD1M the branch manager acts as the account manager. Personal accounts
have contact and statement addresses and a range of telephone numbers, e-mails,
addresses, etc.

If the account belongs to a business with less than USD1M turnover then the
account manager is a business account manager at the branch who reports to
the branch manager. If the account belongs to a business with a turnover of
between USD1M and USD10M then the account manager is an individual at the
regional office who reports to the regional manager. If the account belongs to a
business with a turnover more than USD10M then the account managers at the
central office are responsible for the account. Businesses have contact and
statement addresses as well as a number of approved individuals who can use
the company account and contact details for them.

Branch and account managers periodically review the banding of accounts by
income for individuals and turnover for companies, and if they are likely to move
band in the coming year then they are added to the appropriate (future) category.
Note that this is only partially fact based, the rest being based on subjective input
from account managers.

The bank offers a range of services including current, loan and deposit accounts,
credit and debit cards, EPOS (for business accounts only), foreign exchange,
etc.

The bank has a number of channels including branches, a call centre service, a
web service and the ability to use ATMs for certain transactions.

The bank offers a range of transaction types including cash, cheque, standing
order, direct debit, interest, service charges, etc.


After the close of business on the last working day of each month the starting
and ending balances, the average daily balance and any interest is calculated for
each account.

On a daily basis the exposure (i.e. sum of all account balances) is calculated for
each customer along with a risk factor that is a number between 0 and 100 that
is influenced by a number of factors that are reviewed from time to time by the
risk management department. Risk factors might include sudden large deposits
or withdrawals, closure of a number of accounts, long-term non-use of an
account, etc. that might influence account managers’ decisions.

Every transaction that is made is recorded every day and has three associated
dates, the date of the transaction, the date it appeared on the system and the
cleared date.

De-constructing the example

The bank has a number of regions and a central ‘premium’ account function that
caters for some business customers. Each region has a number of branches.
Branches have a manager. Each branch manager reports to a regional manager.

• The bank itself must be held as an organisation.
• The regions and the central ‘premium’ account function are held as organisation units [36].
• The bank and the regions have links.
• The branches are held as organisational units.
• The regions and the branches have links.
• The branches have addresses via a history table.
• The branches have electronic addresses via a history table.
• There are a number of roles stored as organisation units.
• These roles and the individuals have links.
• The roles may have addresses via a history table.
• The roles may have electronic addresses via a history table.
• The individuals may have addresses via a history table.
• The individuals have electronic addresses via a history table.

At this point only existing major entities and history tables have been used. Also
this information would be re-usable in many places just like the conformed
dimensions concept of star schemas but with more flexibility.

If a customer has a personal account then the account manager is a branch
personal account manager; however if the individual has a net worth in excess of
USD1M the branch manager acts as the account manager. Personal accounts
have contact and statement addresses and a range of telephone numbers, e-mails,
etc.

• Customers are held as Parties, either Individuals or Organisations.
• Customers have addresses via a history table.
• Customers have electronic addresses via a history table.
• Accounts are held in the Accounts major entity.
• Customers are related to accounts via a history table.
• Branches are related to accounts via a history table.
• Accounts are associated with a role via a history table.
• An individual’s net worth is generated elsewhere and stored as a property of the party.

[36] See Appendix 2 – Understanding Hierarchies for an explanation as to why the regions are
organisational units and not geography.


• A high net worth individual is a member of a similarly named segment.
• The accounts may have addresses via a history table.
• The accounts may have electronic addresses via a history table.

If the account belongs to a business with less than USD1M turnover then the
account manager is a business account manager at the branch who reports to
the branch manager. If the account belongs to a business with a turnover of
between USD1M and USD10M then the account manager is an individual at the
regional office who reports to the regional manager. If the account belongs to a
business with a turnover over USD10M then the account managers at the central
office are responsible for the account. Businesses have contact and statement
addresses as well as a number of approved individuals who can use the
company account, and contact details for them.

• Businesses are held as parties.
• The business turnover is held as a party property.
• The category membership based on turnover is held as a segment.
• The businesses may have addresses via a history table.
• The businesses may have electronic addresses via a history table.

Branch and account managers periodically review the banding of accounts by
income for individuals and turnover for companies, and if they are likely to move
band in the coming year then they are added to the appropriate (future) category.
Note that this is only partially fact based, the rest being based on subjective input
from account managers.

• There is a need to allow manual input via a warehouse support application for the party segments.

At this point only the PARTY, ADDRESS and ELECTRONIC ADDRESS sub-models
and the associated _HISTORY tables have been used.

The bank offers a range of services including current, loan and deposit accounts,
credit and debit cards, EPOS (for business accounts only), foreign exchange, etc.

• The product services are held in the product service major entity.
• The product services are associated with an account via a history.

The bank has a number of channels including branches, a call centre service, a
web service and the ability to use ATMs for certain transactions.

• The channels are held in the channels major entity.
• The ability to use a channel for a specific product service is held in the history that relates the two major entities.

This adds the PRODUCT_SERVICE and CHANNEL major entities into the
model.

The bank offers a range of transaction types including cash, cheque, standing
order, direct debit, interest, service charges, etc.

• This requires a TRANSACTION_TYPE table that will be added to the transaction table, which has not yet been defined.

After the close of business on the last working day of each month the starting
and ending balances, the average daily balance and any interest is calculated for
each account.

• This is stored as an account property (it may be an event).


On a daily basis the exposure (i.e. sum of all account balances) is calculated for
each customer along with a risk factor that is a number between 0 and 100 that
is influenced by a number of factors that are reviewed from time to time by the
risk management department. Risk factors might include sudden large deposits
or withdrawals, closure of a number of accounts, long-term non-use of an
account, etc. that might influence account managers’ decisions.

• The exposure is stored as a party property (or event).
• The party risk factor is stored as a party property.

Everything that is required to describe the transaction table is now available.

Every transaction that is made is recorded every day and has three associated
dates, the date of the transaction, the date it appeared on the system and the
cleared date.

• The Transaction Table will have the following columns (a sketch of the resulting table follows below):
o Transaction Date
o Transaction System Date
o Transaction Cleared Date
o From Account
o To Account
o Transaction Type
o Amount
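A minimal sketch of the resulting table, using assumed names that follow the standards
in Appendix 1 (the standard TIMESTAMP and ORIGIN columns are omitted for brevity):

    CREATE TABLE RETAIL_BANKING_TRANSACTIONS (
        TRANSACTION_DATE         DATE          NOT NULL,
        TRANSACTION_SYSTEM_DATE  DATE          NOT NULL,
        TRANSACTION_CLEARED_DATE DATE,
        FROM_ACCOUNT_DWK         INTEGER       NOT NULL REFERENCES ACCOUNTS (ACCOUNT_DWK),
        TO_ACCOUNT_DWK           INTEGER       NOT NULL REFERENCES ACCOUNTS (ACCOUNT_DWK),
        TRANSACTION_TYPE_DWK     INTEGER       NOT NULL REFERENCES TRANSACTION_TYPES (TRANSACTION_TYPE_DWK),
        AMOUNT                   DECIMAL(18,2) NOT NULL  -- always positive; direction comes from the two account keys
    );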

This would complete the model for the example. There are some interesting
features to examine. The first is that all amounts would be positive. This is
because for a credit to an account the ‘from account’ would be the sending party
and the ‘to account’ would be the customer’s account. For a debit the ‘to account’
would be the recipient and the ‘from account’ would be the customer’s account.

This has a number of effects. Firstly it complies with the DRY (Don’t Repeat
Yourself) principle and means that extra data is not stored for the transaction. It
also means that a collection of account information not related to any current
party (e.g. a customer at another bank) is built up. This information is useful in
the analysis of fraud, churn, market share, competitive analysis, etc.

For a customer analysis data mart the data can be extracted and converted into
the positive credit/negative debit arrangement required by the users.

The payment of bank charges and interest would also have accounts, and in a
different data mart this information could be used to look at profitability,
exposure, etc.

The process has used seven major entities’ sub-models, an additional type table
and an occurrence or transaction table. Storing this information should
accommodate and absorb almost any change in business process or source
system without the need to change the data warehouse model and will allow
multiple data marts to be built from a single data warehouse quickly and easily.
In effect the type tables act as metadata for how to use and extend the data
model rather than defining the business process explicitly in the data model,
hence the name process neutral data modelling.

It also demonstrates the ability of the data model to support the requirements
process. By knowing the major entities and using a storyboard approach similar
to the example above, and familiar as an approach to agile developers, it is
possible to quickly and easily identify business, data and query requirements.


[Figure 18 shows the example bank data model. A Party sub-model (including
Individuals, Organisations, Organisation Units and Roles) is related via history tables
to an Addresses sub-model (including Postal Address and Point Location) and an
Electronic Addresses sub-model (including Telephone Numbers, E-Mail Addresses and
Telex). Further history tables relate these to the Accounts, Channel and Product
Service sub-models, which all feed the central Retail Banking Transactions table with
its Transaction Types and Calendar sub-model.]

Figure 18 - The Example Bank Data Model


The model above has been almost fully described by this document: the self-similar
modelling for all the sub-model components has been covered, along with the history
tables, most of the retail banking transactions and some of the lifetime attributes of the
major entities. Completing the model simply requires the remaining attributes to be added.

Two other effects that will influence the creation of data marts from this model can also be
seen. Firstly the creation of dimensions will revolve around the de-normalisation of the
attributes required from each of the major entities into one of the two dimensions
associated with account, as these carry the hierarchies for the customer, account manager,
etc.

The second effect is that of the natural star schema. It is clear from this diagram that the fact
tables will be based around the ‘Retail Banking Transactions’ table. As has already been
stated there are several data marts that can be built from this fact table, probably at different
levels of aggregation and with different dimensions.

The occurrence or transaction table above is one of perhaps twenty that a large enterprise
would require along with approximately thirty _HISTORY tables. This would be combined with
around twenty major entity sub models to create an enterprise data warehouse data model.

Readers familiar with the Data Management & Warehousing white paper ‘How Data
Works’ [37], which describes natural star schemas in more detail along with a technique
called left-to-right entity diagrams, will see a correlation as follows:

Level  Description
1      _TYPE and _BAND tables; simple, small volume reference data.
2      Major entities; complex, low volume data.
3      Some major entities that are dependent on others, along with _PROPERTIES and _SEGMENTS tables; less complex but with greater volume.
4      _HISTORY tables and some occurrence or transaction tables.
5      Occurrence or transaction tables; significant volume but low complexity data.

Figure 19 - Volume & Complexity Correlations

[37] Available for download from http://www.datamgmt.com/whitepapers


Implementation Issues
The use of a process neutral data model and a design pattern is meant to ease the design of
a system, but there will always be exceptions and things that need further explanation in order
to fit them into the solution. Much of this section refers to ETL issues that can only be briefly
described in this context [38].

The ‘Party’ Special Case

The examples throughout this document have used the PARTY table as a major entity,
but in practice this is one of the more difficult tables to deal with. The first issue is that
in many cases a name does not have lifetime value, for example when a woman
marries or divorces and changes her name, or when a company renames itself [39].
Also, individual names often have multiple parts (title, forename, surname).

There is also a requirement to track some form of state identity number. In the United
Kingdom an individual has a National Insurance number and in the United States a
Social Security number; other numbers (e.g. passport, ID card, etc.) are simply stored
as properties. Organisations have other numbers (companies have registration
numbers, and charities and trusts have different registration numbers, but VAT numbers
are properties as they can and do change).

Another minor issue is that people have a date of birth and a date of death. This is
simply resolved: the date of birth is the individual start date and the date of death is the
individual end date, although this terminology can sometimes prove controversial.

The solution to the PARTY special case depends on the database technology being
used. If the database supports the creation of views and the ‘UNION ALL’ SQL
operator [40] then the preferred solution is as follows:

Create the INDIVIDUALS table as follows:

• PARTY_DWK
• PARTY_TYPE_DWK
• TITLE
• FORENAME
• CURRENT_SURNAME [41]
• PREVIOUS_SURNAME
• MAIDEN_SURNAME
• DATE_OF_BIRTH
• DATE_OF_DEATH
• STATE_ID_NUMBER
• Other lifetime attributes as required

[38] Data Management & Warehousing provide consultancy on ETL design and techniques to ensure that
data warehouses can be loaded effectively regardless of the data modelling approach used.

[39] Interestingly, in Scotland, which has different regulations from England & Wales, birth, marriage and
death certificates (also known as vital records) have, since 1855, recognised the importance of knowing
the birth names of everyone on the certificate. For example, a wedding certificate will give the groom’s
mother’s maiden name, and a married woman’s death certificate will also feature her maiden name.
Effectively the birth name has lifetime value and all other names are additional information.
http://www.scotlandspeople.gov.uk/content/help/index.aspx?r=554&628

[40] Nearly all business intelligence databases support this functionality.

[41] CURRENT_ and PREVIOUS_ are reserved prefixes; see Appendix 1 – Data Modelling Standards.


Create the ORGANISATIONS table as follows:

• PARTY_DWK
• PARTY_TYPE_DWK
• CURRENT_ORGANISATION_NAME
• PREVIOUS_ORGANISATION_NAME
• START_DATE
• END_DATE
• STATE_ID_NUMBER
• Other lifetime attributes as required

Create the ORGANISATION_UNITS table as follows:

• PARTY_DWK
• PARTY_TYPE_DWK
• CURRENT_ORGANISATION_UNIT_NAME
• PREVIOUS_ORGANISATION_UNIT_NAME
• START_DATE
• END_DATE
• Other lifetime attributes as required

This can then be mapped to a view called PARTIES as follows:

PARTIES          INDIVIDUALS                  ORGANISATIONS                ORGANISATION_UNITS
PARTY_DWK        PARTY_DWK                    PARTY_DWK                    PARTY_DWK
PARTY_TYPE_DWK   PARTY_TYPE_DWK               PARTY_TYPE_DWK               PARTY_TYPE_DWK
CURRENT_NAME     FORENAME + CURRENT_SURNAME   CURRENT_ORGANISATION_NAME    CURRENT_ORGANISATION_UNIT_NAME
PREVIOUS_NAME    FORENAME + PREVIOUS_SURNAME  PREVIOUS_ORGANISATION_NAME   PREVIOUS_ORGANISATION_UNIT_NAME
START_DATE       DATE_OF_BIRTH                START_DATE                   START_DATE
END_DATE         DATE_OF_DEATH                END_DATE                     END_DATE
STATE_ID_NUMBER  STATE_ID_NUMBER              STATE_ID_NUMBER              Null

Figure 20 - PARTIES view mapping

It should be noted that:

• The PARTY_DWK must be unique across all the tables.
• The PARTY_TYPE_DWK will be a single value in the INDIVIDUALS table.
• The ORGANISATION_UNITS STATE_ID_NUMBER will be null in the view.
• The ‘+’ sign represents concatenation and should include a space between words.
• Other attributes can be included in the view as deemed appropriate.

Where possible it is often beneficial to create this as a materialized view so that it can
be indexed and used as a primary key to the other tables.
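A minimal sketch of the PARTIES view itself, based on the column mapping in Figure 20.
The ‘+’ concatenation is written here with the ANSI ‘||’ operator, and the NULL is cast to
match the assumed VARCHAR(32) type of the other STATE_ID_NUMBER columns:

    CREATE VIEW PARTIES AS
    SELECT PARTY_DWK,
           PARTY_TYPE_DWK,
           FORENAME || ' ' || CURRENT_SURNAME  AS CURRENT_NAME,
           FORENAME || ' ' || PREVIOUS_SURNAME AS PREVIOUS_NAME,
           DATE_OF_BIRTH                       AS START_DATE,
           DATE_OF_DEATH                       AS END_DATE,
           STATE_ID_NUMBER
    FROM   INDIVIDUALS
    UNION ALL
    SELECT PARTY_DWK, PARTY_TYPE_DWK,
           CURRENT_ORGANISATION_NAME, PREVIOUS_ORGANISATION_NAME,
           START_DATE, END_DATE, STATE_ID_NUMBER
    FROM   ORGANISATIONS
    UNION ALL
    SELECT PARTY_DWK, PARTY_TYPE_DWK,
           CURRENT_ORGANISATION_UNIT_NAME, PREVIOUS_ORGANISATION_UNIT_NAME,
           START_DATE, END_DATE, CAST(NULL AS VARCHAR(32))
    FROM   ORGANISATION_UNITS;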

Whilst the PARTIES table needs all these techniques they can also be used in part on
other major entities if required.

The alternate strategy where UNION ALL views are not available is to create a single
table including all the columns and use those columns that are appropriate as required
by the query.


Partitioning
The Party Special Case is an example of vertical partitioning, i.e. tables that are split
based on the different columns required for the different types. Queries require a view
across the information in order to be able to access all the information.

[Figure 21 shows three vertically partitioned tables, each pairing the common data
columns with the columns specific to Individuals, Organisations and Organisation
Units respectively.]

Figure 21 - Vertically Partitioned Data

Tables can also be horizontally partitioned, i.e. whilst the table structure remains the
same the table is split on some data item that changes, most commonly the date. This
sometimes requires a view to be able to access all the information but is more
commonly implemented in the database architecture itself.

[Figure 22 shows a horizontally partitioned table split into common data for January,
February and March.]

Figure 22 - Horizontally Partitioned Data

If both horizontal and vertical partitioning are used together this is known as matrix
partitioning. This is uncommon.

In Process Neutral Data Modelling, as a consequence of the approach, vertical
partitioning, if required, usually occurs on tables with a _TYPE and uses the _TYPE as
the partitioning key. Horizontal partitioning happens almost exclusively on the
transaction tables and should be based on the _START_DATE, which has lifetime
value and is not updated (unlike the _END_DATE, which is updated).

Horizontal partitioning is not effective and often not supported on MPP platforms that
hash the data internally to multiple nodes. Column or vector storage databases render
horizontal partitions meaningless as a storage strategy.
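Where the platform does support declarative partitioning, a minimal sketch of a
date-partitioned transaction table might look as follows. The PARTITION BY RANGE
syntax shown is Oracle-style and varies by database, and the table definition is the
assumed one from the banking example:

    CREATE TABLE RETAIL_BANKING_TRANSACTIONS (
        TRANSACTION_DATE     DATE          NOT NULL,  -- the partitioning key
        FROM_ACCOUNT_DWK     INTEGER       NOT NULL,
        TO_ACCOUNT_DWK       INTEGER       NOT NULL,
        TRANSACTION_TYPE_DWK INTEGER       NOT NULL,
        AMOUNT               DECIMAL(18,2) NOT NULL
    )
    PARTITION BY RANGE (TRANSACTION_DATE) (
        PARTITION P200901 VALUES LESS THAN (DATE '2009-02-01'),
        PARTITION P200902 VALUES LESS THAN (DATE '2009-03-01'),
        PARTITION P200903 VALUES LESS THAN (DATE '2009-04-01')
    );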


Data Cleansing
Data cleansing itself is outside the scope of this document; however the model must
make allowances for it. In particular, if data is to be cleaned or standardized then the
original data must also be stored. To this end every column that is to be modified in this
way should have an additional column with the prefix STANDARDIZED_ added to it.

For example there may be a column in a table called NAME that holds ‘Fred Bloggs’
in mixed case, with two spaces between the words and a trailing space. The cleansing
routine would replace multiple white spaces with a single space character, remove
leading and trailing white space, and then convert the text to uppercase, producing
‘FRED BLOGGS’. The result would be stored in the column STANDARDIZED_NAME,
leaving the original data in NAME. This technique should be used wherever data
cleansing takes place.

If this column is created then it must always be populated even if the individual row has
not changed. This is because the fact that there is no change is information in itself and
also to avoid the need on load and extraction to determine whether the original or
cleansed data should be used.
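A minimal sketch of the standardization step, assuming a hypothetical staging table
STG_PARTIES with NAME and STANDARDIZED_NAME columns and a database that
provides REGEXP_REPLACE (the available cleansing functions vary by platform):

    UPDATE STG_PARTIES
    SET    STANDARDIZED_NAME =
           UPPER(TRIM(REGEXP_REPLACE(NAME, ' +', ' ')));
    -- collapse runs of spaces, strip leading/trailing space, then uppercase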

Null Values
Process neutral data modelling does not require many nulls in the database at all and
they should be avoided wherever possible. All _END_DATES must allow nulls. Some
_START_DATES will need to allow nulls. The _VALUE columns must also allow nulls.
Other than these cases the principles of lifetime value should ensure the data model
requires few other columns that allow a null value.

Indexing Strategy
The data warehouse should only be indexed where necessary, i.e. on primary and
foreign keys and one or two other essential columns needed for good performance
when extracting information into the data marts. Users do not query the data warehouse
directly, and therefore the indexes should be aimed at ensuring that the ETL is as
effective as possible.

Enforcing Referential Integrity

Data warehouse projects often have long debates on whether referential integrity
should be enforced in the data warehouse. The discussion centres on the cost of
inserts and updates when referential integrity is enabled, and the consequent slowing
down of the data warehouse load.

Whilst on the face of it removing referential integrity is an attractive proposition, the
question has to be asked where the cost of doing so lies, because nothing is free. The
cost comes in two places: firstly in the extra code required to enforce referential
integrity outside the database, which has to be built into the process before loading,
and secondly in the cost of handling the data quality issues when something is missed.

Where a holistic view of the data warehouse processing is taken, regardless of the data
modelling technique used, it becomes apparent that disabling referential integrity is
more expensive than enabling it and designing processes to accommodate it. Process
Neutral Data Models should always have referential integrity enabled unless there is a
specific case for individual tables that means it cannot be done.


There is also an approach of disabling referential integrity, loading data and then re-
enabling referential integrity. This is acceptable as long as any issues are resolved but
in practice many systems ignore issues and ultimately this affects the quality and
therefore the longevity of the system.

Finally it is possible to write ETL that always complies with referential integrity, even
when there is missing data, using a technique called ‘Defensive Programming’. For
example, if a type is missing from a _TYPE table it is possible to write the value into the
_TYPE table before inserting the data into the main table, creating a row in the _TYPE
table where the description, group, etc. are set to ‘Unknown’. This allows all data to be
processed and data quality metrics to be run (‘How much of my reference data is
unknown?’), provides early warning of unplanned changes in the source system, and
allows users, via the data maintenance application, to fix reference data in a timely
fashion without impacting the load process.
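A minimal sketch of this defensive step, assuming a hypothetical staging table
STG_PARTIES and the PARTY_TYPES layout used in this document:

    -- Insert a placeholder row for any incoming type key not yet present in
    -- the _TYPE table, so the main insert never violates referential integrity.
    INSERT INTO PARTY_TYPES (PARTY_TYPE_DWK, PARTY_TYPE_DESC, PARTY_TYPE_GROUP)
    SELECT DISTINCT s.PARTY_TYPE_DWK, 'Unknown', 'Unknown'
    FROM   STG_PARTIES s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   PARTY_TYPES t
                       WHERE  t.PARTY_TYPE_DWK = s.PARTY_TYPE_DWK);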

Data Insert versus Data Update

The process neutral data model requires very few updates, the notable exception being
the _END_DATE column. This is useful for database platforms that perform better with
fewer updates, such as the MPP appliance platforms and the column storage/vector
platforms. On (traditional) database platforms where insert and update have equal cost,
the fact that one method is preferred is of no consequence.

Where the data warehouse platform favours inserts it is preferable that the processing
of the data and any staging that is update intensive is performed in the ETL tool or a
dedicated staging database (depending on the architectural constraints and platform
choices made by the organisation) outside the data warehouse.

Row versus Set Based Loading in ETL

Most ETL is written to perform Row Based Loading, i.e.

    For each row
        For each column
            Process Data
        Next column
    Next row

This technique is common because it uses procedural language techniques familiar to
most developers, and ETL tools provide procedural language interfaces. However
relational databases were written with set theory in mind and have high performance
set operator commands such as UNION, MINUS and INTERSECT that can greatly
reduce the processing needed to load any data model, but especially process neutral
data models.
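A minimal sketch of set-based loading, assuming a hypothetical staging table with the
same layout as the target: a single MINUS (the Oracle spelling of the ANSI EXCEPT
operator) finds the delta instead of a row-by-row comparison loop:

    INSERT INTO PARTY_PROPERTIES
        (PARTY_DWK, PARTY_PROPERTY_TYPE_DWK, PROPERTY_START_DATE, PROPERTY_TEXT_VALUE)
    SELECT PARTY_DWK, PARTY_PROPERTY_TYPE_DWK, PROPERTY_START_DATE, PROPERTY_TEXT_VALUE
    FROM   STG_PARTY_PROPERTIES
    MINUS  -- keep only rows not already in the target
    SELECT PARTY_DWK, PARTY_PROPERTY_TYPE_DWK, PROPERTY_START_DATE, PROPERTY_TEXT_VALUE
    FROM   PARTY_PROPERTIES;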

A description of the basic principles of set processing can be found in Appendix 5.


Disk Space Utilisation

Using this approach means that initially more disk space is used than with other
modelling techniques, when the reference data and the first values of the properties
structure are created.

Since a change to a property only affects the individual value rather than the entire row,
as the data warehouse grows each change uses less space, and the total disk space
used therefore drops below that used by the other techniques.

Over the lifetime of the data warehouse it is unlikely that either approach will see
significant cost or significant savings in disk space.

Implementation Effort
The method chosen for the data modelling can have a significant effect on the effort
involved in building a data warehouse. Process Neutral Data Modelling typically has the
following characteristics when compared to more traditional approaches:

• Simpler and quicker requirements gathering
This comes about because the major entities, and therefore the frame of reference, can
exist before the detailed data requirements exist, and it is therefore possible to use
them as a communication tool to aid the gathering of requirements.

• Quick data warehouse and data mart design
The data warehouse model, and data marts using natural star schemas, are quickly
drawn out of the modelling technique.

• Easy build and configuration of reference data management applications
This effect is a result of the self-similar modelling, which significantly reduces the build
effort.

• Longer initial build cycles on the ETL
It takes time to develop optimum algorithms for performance and reuse based on a
site-specific set of tools and platforms.

• Shorter later build cycles on the ETL
The time taken for later cycles is hugely reduced because of the reuse designed in
during the earlier stages.

• Reduced maintenance costs
The long term maintenance cost of data warehouses is rarely measured; however the
ability of this technique to accommodate rapid change ensures that maintenance costs
will be significantly lower than with other approaches.

• Simple database sizing
The self-similar nature of the data model makes it easy to size the database. All _TYPE
and _BAND tables can be ignored for sizing purposes, a ratio can be worked out for the
number of rows between a major entity and its _PROPERTIES table, the column widths
are fixed, etc. This greatly reduces the DBA overhead.


Data Commutativity
In mathematics there is a concept of commutativity [42]: the ability to change the order of
something without changing the end result. For example 2 + 1 is the same as 1 + 2 and is
therefore commutative, whereas 2 - 1 is not the same as 1 - 2 and is therefore not
commutative. In general data is not commutative; however it is hierarchical and can
therefore be derived in one direction.

A common question asked about process neutral data modelling is: with so many places
that data can be held, which is the right place? The answer is simple: data should be held
at the most detailed level possible.

Links: detailed knowledge of the relationship, e.g. John Smith was married to Jane Smith
between 01-Jan-2000 and 01-Jul-2005.

Properties: detailed knowledge of part of the relationship, e.g. John Smith was married
between 01-Jan-2000 and 01-Jul-2005.

Events: less detailed knowledge of part of the relationship, e.g. John Smith’s wedding
was on 01-Jan-2000.

Segments: minimal knowledge of the relationship, e.g. John Smith was married at some
point in time.

Data can be derived moving down this list (from links towards segments) but cannot be
derived moving back up it.

Figure 23 - Data Commutativity

Since data will be extracted into data marts the ETL that performs the extraction should
consolidate the information to the appropriate level for that data mart. It is important to note
that this is also a core part of the change management process for the data warehouse.

For example, suppose the initial system used as a source collects data at the segment
level. A new system is commissioned to replace it that collects data at the link level. The
new data can immediately be loaded at the link level and then extracted to the data mart
at the segment level. Over time the initial system is de-commissioned and all the
information is gathered at the link level. At this point the extract can be updated to supply
the data mart with information at the link, property or event level as required.

It is against the DRY (Don’t Repeat Yourself) principle to store the derived data at every level
in the data warehouse unless there is some specific added value that is provided by doing so.

[42] http://en.wikipedia.org/wiki/Commutative


Data Model Explosion and Compression

Two commonly asked questions are:

• How big will this data model get, especially if every major entity has ten supporting tables [43]?
• Can’t all the type and band tables be put in a single table; indeed, can’t all the properties, events, links and segments be merged into a single table too?

Before answering these two specific questions it is important to make some observations
about the process of the data modelling.

The objective of a data model is to create a clear, structured environment in which to store
data. Every data modeller will have their own preferences for the way in which they design the
data model. Process neutral data models strive to find a balance between:

• Re-use of design patterns that provide consistency of information and algorithm
• Clarity of model that aids understanding
• Size of model that affects maintainability
• Performance of the system that affects usability

It is possible for data modellers to change the rules that they apply to the data model
however before doing so the data modeller should understand the effect on the overall
balance of the system. Projects inevitably fail when the balance is lost and one of these
aspects overrides all the others. Projects should always enforce a single data modelling style.

How big does the data model get?

Since the approach uses a design pattern it is possible to design tables and not use
them. Experience shows that about fifteen to twenty major entities will be needed with,
on average, five supporting tables each. Combined with about fifty history and
occurrence or transaction tables, the data model will be around one hundred and fifty
tables in total.

This compares very favourably with other data warehouse models. Large data
warehouses that have been in production usually exceed this number. Smaller and
newer data warehouses often start with fewer but quickly grow to this sort of size. The
advantage of this approach is that nearly everything that comes along in the future has
already been designed into the solution, therefore there is no long-term data model size
increase provided the model is properly managed.

Can the data model be compressed?

It is possible, from an implementation point of view, to reduce the number of tables by
combining, for example, all the _TYPE tables into a single table, but this rarely benefits
the solution. The data model as described in this document can be indexed and have
referential integrity applied, and each table has a clear and unambiguous meaning.
Combining any of these tables loses the transparency of the solution and impacts its
performance. It is, with some thought, possible to reduce the entire data model to fewer
than ten tables; it is also then virtually impossible to understand the finished data model.

[43] A major entity could have a type table, a band table, a property table with its own type table, an event
table with its own type table, a link table with its own type table and a segment table with its own type
table: a total of ten tables excluding the major entity itself.


Which Results to Store?

Results are the outcome of some processing performed within the operational system that
creates information. A data warehouse will often be faced with the question of whether to
store the results, or to store sufficient data to reproduce the algorithm and therefore
regenerate the results. For example:

• In a bank, the interest calculation
• In a telephone company, the call rating calculation
• In an airline, the frequent flyer points
• In a manufacturing company, the sales person’s commission

In general the purpose of a data warehouse is as a reporting system: a mirror of the
information in the operational system, albeit in a different format. By design it should
therefore take the results as calculated elsewhere and store them to allow reporting. It is
also important to understand the complexity of the systems that generate the results.

A typical Telco billing process will handle billions of unrated call data records through complex
algorithms in order to generate the rated call record and consequently the bill. Furthermore
the billing systems allow rapid change in the rules used for billing so that the company can
bring new products to market quickly.

Given the engineering that has gone into building high performance billing systems and the
amount of change in billing requirements it is impractical to try and reliably recreate the billing
process in the ETL process. It is therefore important to know what factors were used in the
rating of an individual record (e.g. time bands, distance, number types, etc) but not exactly
how they were applied. The accurate storing of the results generated elsewhere is the
objective of the data warehouse.

This approach can be extended to a general principle that data warehouses should store the
results of batch processes in source systems rather than try to reproduce the algorithms that
generated the result sets.

This has consequences for data quality. Users in the example given above might perceive
data quality issues if the sum of the rated calls does not equal the billed amount. There are
two possible causes for this.

The first cause is inaccurate ETL, which of course is a data warehouse problem that has to
be resolved.

The second cause is an issue in the batch process in the source system. This second type of
issue often goes un-detected because users in the operational system look at individual bills,
whilst in the data warehouse they are likely to analyse across multiple bills. There is also no
simple remedy for this problem. If data is loaded and reconciled against the source system(s)
differences will be found. It may not be possible for the data warehouse to resolve them all.
Instead they must all be explained and the users of the system educated to understand how
the differences between the source systems and the data marts come about. As a result the
users may consider changing the source system or business process to get more accurate
information.


The Holistic Approach

Using a process neutral data modelling technique enforces a holistic approach to
developing the data warehouse solution. There are constant trades between the efficiency
of load, storage and query in the execution of the system. There are also trades between
the cost of bespoke development and re-development when compared to the use of
convention over configuration techniques.

Even if the data models themselves are not used there is much benefit from using the
techniques as a method of analysis for the data modelling. Given that major entities have
types, bands, properties, events, links and segments it becomes much easier to ensure all the
data that might be required has been analysed and discussed in the requirements stage.

Unlike most data modelling approaches this method has a basis in the understanding of
enterprise architecture, and it is therefore possible to tune the data model to get the optimal
overall solution for the specific situation, because the impact of changing one aspect (e.g. ETL
loading) at the expense of another (e.g. data maintenance) can be clearly seen.

The holistic approach also requires technical discipline and good change control, as should
any other method. The use of this approach often highlights the failure of an organisation in
these areas, and this means that some organisations will choose other methods that do not
directly expose these failures. Unfortunately hiding these failures does not mean there is
no impact, just that the impact is hidden until it becomes critical and causes problems.


Summary
This white paper has looked at an example company and how the data models of operational
systems within that company evolve. Using this example it has been possible to study the
impact those changes have on the reporting and data warehousing solutions.

To mitigate the impact of these changes the use of a process neutral data model has been
examined. This method creates a data model that stores the core business data in a format
that is abstracted from the current operational systems.

The technique also takes advantage of the benefits of using convention over configuration to
define standard format tables and lifetime value principles to implement a DRY or “Don’t
Repeat Yourself” concept that makes the data model easily understood. Of course there is no
perfect solution to developing a data model and so the implementation issues associated with
this technique are also examined.

Combining the techniques described in the white paper allows a data model to be quickly and
easily developed that is easily understood and that will lower the total cost of ownership
because it is not so susceptible to change.


Appendix 1 – Data Modelling Standards

The data modelling standards outlined below are the ones used by Data Management &
Warehousing. Where a choice has been made (plural vs. singular, for example) it is not
important what the choice is, but it is important that a choice has been made and that it is
documented and enforced.

General Conventions
All table and column names must use uppercase letters, the digits 0-9 and an
underscore ‘_’ to replace a space. No other characters are allowed. This is for
database compatibility reasons [44].

Table and column names must be no longer than 30 characters including underscores.
This is also for database compatibility reasons.

Table and column names should be in English. Regardless of where in the world the
system is operating, the majority of source systems will have English table names, and
the amount of time lost trying to translate and match table and column names, compared
with visual inspection and quick comparison in the same language, is significant [45].

Table Conventions
Table Names are always plural; a table is a collection of zero, one or more rows

Every table should have a short name or alias. The short name is created using the
following rules:

Every short name is six characters long.

If a table name is less than six characters the short name is the table name right
padded with ‘Z’ until it is six characters long
E.g. BILLS becomes BILLSZ

If a table name is made up of one word of six or more characters then the short name
is the first six characters
E.g. ACCOUNTS becomes ACCOUN

If a table name is made up of two words then the first three characters of each word
are used to create the short name
E.g. ACCOUNT_TRANSACTIONS becomes ACCTRA

If a table name is made up of three words then the first two characters of each word
are used to create the short name
E.g. CALL_DISTANCE_BAND becomes CADIBA

If a table name is made up of four words then the first two characters of the first two
words and the first character of the third and fourth words are used to make up the
short name
E.g. CALL_DISTANCE_BAND_GROUPS becomes CADIBG

[44] Database Identifier Lengths Comparison:
https://test.kuali.org/confluence/display/KULRICE/Database+Table+and+Column+Name+Standards

[45] Taking this further there are minor differences between UK and US English (e.g. COLOUR vs.
COLOR) and therefore, strictly speaking, the data model should be in US English.


If a table name is made up of five words then the first two characters of the first word
and the first character of the second, third, fourth and fifth words are used to make up
the short name
E.g. THE_QUICK_BROWN_FOX_JUMPED becomes THQBFJ

If a table name is made up of six or more words then the first character of each of the
first six words is used to make up the short name
E.g. THE_QUICK_BROWN_FOX_JUMPED_OVER becomes TQBFJO

If there are any conflicts as a result of this then they should be resolved and
documented by the data modeller.

Table Suffixes.
There are a series of table name suffixes that are reserved for specific functions;
these are:

_TYPES [Alternate shorter name _TYP]

A _TYPES table provides a classification of the associated table data into discrete
values. The singular form of the table that is being classified always prefixes the
_TYPES (e.g. PARTIES is classified by PARTY_TYPES, PARTY_PROPERTIES is
classified by PARTY_PROPERTY_TYPES, etc.). Where the table being classified
has more than one classification, the attribute being classified is added between the
table name and the _TYPES (e.g. PARTY_GENDER_TYPES is classifying GENDER
in the PARTIES table). A _TYPES table can be associated with any other table
except for another _TYPES or a _BANDS table.

_BANDS (a _TYPES special case) [Alternate shorter name _BAN]

A _BANDS table provides a classification of the associated table data into a range of
values. The singular form of the table that is being classified always prefixes the
_BANDS. Where the table being classified has more than one classification, the
attribute being classified is added between the table name and the _BANDS. A
_BANDS table can be associated with any other table except for the following types:
_TYPES, _BANDS, _PROPERTIES, _EVENTS, _LINKS and _SEGMENTS.

_PROPERTIES [Alternate shorter name _PRO]

A _PROPERTIES table provides time variant data storage support for non-
lifetime value attributes of a major entity. The singular form of the table that is
being supported always prefixes the _PROPERTIES. (e.g. PARTIES is
supported by PARTY_PROPERTIES, PRODUCTS is supported by
PRODUCT_PROPERTIES, etc.). _PROPERTIES tables can only be
associated with major entities and always have a related _TYPES table.

_EVENTS (a _PROPERTIES special case) [Alternate shorter name _EVE]

A _EVENTS table provides time variant data storage support for non-lifetime
value attributes of a major entity that occur more than once but at a point in
time rather than over a period of time (which is covered by _PROPERTIES).
The singular form of the table that is being supported always prefixes the
_EVENTS. (e.g. PARTIES is supported by PARTY_EVENTS, PRODUCTS is
supported by PRODUCT_EVENTS, etc.) _EVENTS tables can only be
associated with one major entity table and always have a related _TYPES
table.


_LINKS [Alternate shorter name _LIN]

A _LINKS table provides time-variant peer-to-peer relationship support between two
records within the same major entity. The singular form of the table that is being
supported always prefixes the _LINKS (e.g. PARTIES is supported by PARTY_LINKS,
PRODUCTS is supported by PRODUCT_LINKS, etc.). _LINKS tables can only be
associated with major entities and always have a related _TYPES table.

_SEGMENTS [Alternate shorter name _SEG]

A _SEGMENTS table provides time-variant peer group support for records within the
same major entity. The singular form of the table that is being supported always
prefixes the _SEGMENTS (e.g. PARTIES is supported by PARTY_SEGMENTS,
PRODUCTS is supported by PRODUCT_SEGMENTS, etc.). _SEGMENTS tables can
only be associated with major entities and always have a related _TYPES table.

_HISTORY [Alternate shorter name _HIS]

A _HISTORY table provides time-variant peer-to-peer relationship support between
two records in different major entities. The singular form of each of the two major
entity tables that are being supported always prefixes the _HISTORY (e.g. PARTY
and GEOGRAPHY are supported by PARTY_GEOGRAPHY_HISTORY, etc.).
_HISTORY tables can only be associated with two major entities and always have a
related _TYPES table.

Column Conventions
Column Names are always singular; a column is a single element within a row.

Stand Alone Columns

There are a number of columns that are added to every table:

TIMESTAMP

A timestamp for each record is held. This is either the date and time that the
row was created, or subsequently when it was last modified. If two systems
update any part of a row within one load process only the last modification is
preserved, and no count of modifications is maintained. The data type of a
timestamp must be TIMESTAMP where supported by the database or DATE
otherwise.

ORIGIN

This is used to identify what made the last change to the record. This should be the
name of the ETL process or mapping that performed the insert or last update. If two
systems update any part of a row within one load process only the last updating
system is preserved, and no count of modifications is maintained. It is important to
note that the origin only reflects the last process in the chain to insert or update a
record: a record may come from multiple source systems and pass through many ETL
processes before being inserted into the database. The ORIGIN is set to the last ETL
process, and the ETL tool must then contain the audit trail back to the previous
system, and so on. The data type and format of the ORIGIN column must be
VARCHAR(32) [46]. This approach is known as tracking the data lineage.
[46] This document uses ANSI SQL92 standards. Other databases may use other data types, e.g. Oracle
would use VARCHAR2(32).


Column Suffixes
Standard extensions added to column names

_DWK

The use of _DWK indicates a Data Warehouse Key – a key generated and
maintained within the Data Warehouse, allowing the use of the words _ID,
_CODE, _NUMBER, etc to denote identifiers brought in from the source data.
Every table must use a _DWK surrogate key rather than any source system
key that may change when the source system is changed. All _DWK are
integer data type.

_TIME

Any field that has the suffix _TIME must contain a time. This information is
stored in a TIME data type if available, otherwise it is stored in the DATE data
type with the date component set to ‘01-JAN-1900’. This is to allow arithmetic
to be performed on time fields.

_DATE

Any field that has the suffix _DATE must contain information stored in the
DATE data type.

_START_DATE

The _START_DATE can have two types of value:

Value Type                          Meaning
Start Date before the current date  An event that has actually happened
Start Date after the current date   An event that is certain to happen at a point in the future

Figure 24 - _START_DATE Rules

It should be noted that it is impossible to obtain some _START_DATE values and
therefore, whilst not strictly compliant with the definition, a NULL might have to be
allowed to represent unknown data. The alternative is to enter a default value, but this
is to be avoided as it may bias aggregate results.

_END_DATE

The _END_DATE can have three types of value:

Value Type                        Meaning
Null                              A status with no planned change of status
End Date before the current date  An event that has actually happened
End Date after the current date   An event that is planned to happen

Figure 25 - _END_DATE Rules
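
Taken together, these rules allow simple point-in-time queries. A minimal
sketch follows (the table name and as-at date are assumed examples; whether
the boundary test uses > or >= is a local convention):

    -- Illustrative only: party properties in force as at 1 January 2009.
    SELECT *
    FROM   PARTY_PROPERTIES
    WHERE  PARTY_PROPERTY_START_DATE <= DATE '2009-01-01'
    AND    (PARTY_PROPERTY_END_DATE IS NULL
            OR PARTY_PROPERTY_END_DATE > DATE '2009-01-01');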

_EVENT_DATE

An _EVENT_DATE is always found in an _EVENTS table instead of a
_START_DATE and an _END_DATE. It represents the date on which the
event took place.

© 2009 Data Management & Warehousing Page 47


White Paper - Process Neutral Data Modelling

_DESC

The description fields are free text fields that describe the record. They should
not be relied upon for queries; instead keys and appropriate joins should be
used. The standard data type and format for a description is VARCHAR(255).

_NUMERIC_VALUE

Holds a floating point number for use in _PROPERTIES, _LINKS, _SEGMENTS
and _HISTORY tables.

_TEXT_VALUE

Holds VARCHAR(32) text for use in _PROPERTIES, _LINKS, _SEGMENTS and
_HISTORY tables.

Column Data Type and Sizes

Short text columns should be VARCHAR(32)
Long text columns should be VARCHAR(255)
Numbers should be INTEGER unless specifically required to be otherwise
Dates should have a data type of DATE
Times should have a data type of TIME where supported, otherwise DATE
MEMO, LONG, BLOB, CLOB should be avoided at all costs

Column Prefixes
There are also a number of standard column prefixes

STANDARDIZED_

Standardized fields are fields that have been cleaned in some way.

CURRENT_

The CURRENT_ prefix denotes a current value that might not have lifetime
value in a major entity, such as SURNAME in the PARTIES table.

PREVIOUS_

The PREVIOUS_ prefix denotes the value of a field held in a CURRENT_ field
prior to the last update. No further history is kept of this value; it is always the
value before the one held in the CURRENT_ field.

LINKED_

Used where two foreign keys from the same table are used in a _LINKS table.

Column Null / Not Null

All columns should be NOT NULL unless otherwise specified (e.g. in
_END_DATE and _VALUE columns).


Column Name Abbreviations

Due to the 30-character limit, occasionally column names have to be abbreviated.
The following abbreviations are acceptable for suffixes:

Abbreviation Long Description


_B _BAND
_BSV _BAND_START_VALUE
_BEV _BAND_END_VALUE
_BD _BAND_DESC
_BG _BAND_GROUP
_BSD _BAND_START_DATE
_BED _BAND_END_DATE
_T _TYPE
_TD _TYPE_DESC
_TG _TYPE_GROUP
_TSD _TYPE_START_DATE
_TED _TYPE_END_DATE
_PSD _PROPERTY_START_DATE
_PED _PROPERTY_END_DATE
_PNV _PROPERTY_NUMERIC_VALUE
_PTV _PROPERTY_TEXT_VALUE
_PT _PROPERTY_TYPE
_PTD _PROPERTY_TYPE_DESC
_PTG _PROPERTY_TYPE_GROUP
_PTSD _PROPERTY_TYPE_START_DATE
_PTED _PROPERTY_TYPE_END_DATE
_ED _EVENT_DATE
_ENV _EVENT_NUMERIC_VALUE
_ETV _EVENT_TEXT_VALUE
_ET _EVENT_TYPE
_ETD _EVENT_TYPE_DESC
_ETG _EVENT_TYPE_GROUP
_ETSD _EVENT_TYPE_START_DATE
_ETED _EVENT_TYPE_END_DATE
_LSD _LINK_START_DATE
_LED _LINK_END_DATE
_LNV _LINK_NUMERIC_VALUE
_LTV _LINK_TEXT_VALUE
_LT _LINK_TYPE
_LTD _LINK_TYPE_DESC
_LTG _LINK_TYPE_GROUP
_LTSD _LINK_TYPE_START_DATE
_LTED _LINK_TYPE_END_DATE
_SSD _SEGMENT_START_DATE
_SED _SEGMENT_END_DATE
_SNV _SEGMENT_NUMERIC_VALUE
_STV _SEGMENT_TEXT_VALUE
_ST _SEGMENT_TYPE
_STD _SEGMENT_TYPE_DESC
_STG _SEGMENT_TYPE_GROUP
_STSD _SEGMENT_TYPE_START_DATE
_STED _SEGMENT_TYPE_END_DATE
_HSD _HISTORY_START_DATE
_HED _HISTORY_END_DATE
_HNV _HISTORY_NUMERIC_VALUE
_HTV _HISTORY_TEXT_VALUE
_HT _HISTORY_TYPE
_HTD _HISTORY_TYPE_DESC
_HTG _HISTORY_TYPE_GROUP
_HTSD _HISTORY_TYPE_START_DATE
_HTED _HISTORY_TYPE_END_DATE
Figure 26 - Column Name Abbreviations


The following abbreviations are acceptable for prefixes:

Abbreviation Long Description


CUR_ CURRENT_
LIN_ LINKED_
PRE_ PREVIOUS_
STA_ STANDARDIZED_

For large projects and models it is sometimes useful to consider using all
abbreviations from the outset.

Index Conventions
Where databases use administrator-defined indexes the following conventions should
be used.

Primary Key Index

Primary Key indexes should be named PK_XXXXXX where XXXXXX is the
six-character short table name.

Foreign Key Index

Foreign Key indexes should be named FK_XXXXXX_YYYYYY_N where
XXXXXX is the six-character short table name of the table with the primary key,
YYYYYY is the six-character short table name of the table with the foreign key
and N represents a sequence number between 1 and 9 for the index.

Unique Key Index

Unique Key indexes should be named UK_XXXXXX_N where XXXXXX is the
six-character short table name of the table being indexed and N represents a
sequence number between 1 and 9 for the index.

Non-Unique Key Index

Non-unique Key indexes should be named NK_XXXXXX_N where XXXXXX is
the six-character short table name of the table being indexed and N represents a
sequence number between 1 and 9 for the index. Note that if more than nine
indexes are needed then there is something wrong.
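
For example (the six-character short names PARTYS for PARTIES and PRTPRP
for PARTY_PROPERTIES are assumed purely for illustration, as is the PARTY_ID
source system identifier column):

    -- Illustrative only: assumed short names PARTYS and PRTPRP.
    CREATE UNIQUE INDEX PK_PARTYS ON PARTIES (PARTY_DWK);
    CREATE UNIQUE INDEX UK_PARTYS_1 ON PARTIES (PARTY_ID);
    CREATE INDEX FK_PARTYS_PRTPRP_1 ON PARTY_PROPERTIES (PARTY_DWK);
    CREATE INDEX NK_PRTPRP_1 ON PARTY_PROPERTIES (PARTY_PROPERTY_TYPE_DWK);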

Standard Table Constructs

The following provide the standard table definitions for each of the table constructs:

_TYPES
Column Data Type Length Optional
_TYPE_DWK INTEGER NOT NULL
_TYPE VARCHAR 32 NOT NULL
_TYPE_DESC VARCHAR 255 NOT NULL
_TYPE_GROUP VARCHAR 32 NOT NULL
_TYPE_START_DATE DATE NOT NULL
_TYPE_END_DATE DATE NULL
TIMESTAMP DATE NOT NULL
ORIGIN VARCHAR 32 NOT NULL
Figure 27 - Standard _TYPES table


_BANDS
Column Data Type Length Optional
_BAND_DWK INTEGER NOT NULL
_BAND VARCHAR 32 NOT NULL
_BAND_START_VALUE NUMBER NOT NULL
_BAND_END_VALUE NUMBER NULL
_BAND_DESC VARCHAR 255 NOT NULL
_BAND_GROUP VARCHAR 32 NOT NULL
_BAND_START_DATE DATE NOT NULL
_BAND_END_DATE DATE NULL
TIMESTAMP DATE NOT NULL
ORIGIN VARCHAR 32 NOT NULL
Figure 28 - Standard _BANDS table

_PROPERTIES
Column Data Type Length Optional
_DWK INTEGER NOT NULL
_PROPERTY_TYPE_DWK INTEGER NOT NULL
_PROPERTY_TEXT_VALUE VARCHAR 32 NULL
_PROPERTY_NUMERIC_VALUE NUMBER NULL
_PROPERTY_START_DATE DATE NOT NULL
_PROPERTY_END_DATE DATE NULL
TIMESTAMP DATE NOT NULL
ORIGIN VARCHAR 32 NOT NULL
Figure 29 - Standard _PROPERTIES table

_EVENTS
Column Data Type Length Optional
_DWK INTEGER NOT NULL
_EVENT_TYPE_DWK INTEGER NOT NULL
_EVENT_TEXT_VALUE VARCHAR 32 NULL
_EVENT_NUMERIC_VALUE NUMBER NULL
_EVENT_DATE DATE NOT NULL
TIMESTAMP DATE NOT NULL
ORIGIN VARCHAR 32 NOT NULL
Figure 30 - Standard _EVENTS table

_LINKS
Column Data Type Length Optional
_DWK INTEGER NOT NULL
LINKED_ _DWK INTEGER NOT NULL
_LINK_TYPE_DWK INTEGER NOT NULL
_LINK_TEXT_VALUE VARCHAR 32 NULL
_LINK_NUMERIC_VALUE NUMBER NULL
_LINK_START_DATE DATE NOT NULL
_LINK_END_DATE DATE NULL
TIMESTAMP DATE NOT NULL
ORIGIN VARCHAR 32 NOT NULL
Figure 31 - Standard _LINKS table


_SEGMENTS
Column Data Type Length Optional
_DWK INTEGER NOT NULL
_SEGMENT_TYPE_DWK INTEGER NOT NULL
_SEGMENT_TEXT_VALUE VARCHAR 32 NULL
_SEGMENT_NUMERIC_VALUE NUMBER NULL
_SEGMENT_START_DATE DATE NOT NULL
_SEGMENT_END_DATE DATE NULL
TIMESTAMP DATE NOT NULL
ORIGIN VARCHAR 32 NOT NULL
Figure 32 - Standard _SEGMENTS table

_HISTORY
Column Data Type Length Optional
_DWK INTEGER NOT NULL
_DWK INTEGER NOT NULL
_HISTORY_TYPE_DWK INTEGER NOT NULL
_HISTORY_TEXT_VALUE VARCHAR 32 NULL
_HISTORY_NUMERIC_VALUE NUMBER NULL
_HISTORY_START_DATE DATE NOT NULL
_HISTORY_END_DATE DATE NULL
TIMESTAMP DATE NOT NULL
ORIGIN VARCHAR 32 NOT NULL
Figure 33 - Standard _HISTORY table
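
To make the constructs concrete, the following is a minimal ANSI SQL92 DDL
sketch of the _PROPERTIES construct from Figure 29 applied to the PARTIES
major entity. The PARTY prefix on each column is assumed from the naming
conventions above, and NUMERIC stands in for the NUMBER type shown in the
figures:

    -- Sketch only; some databases require the TIMESTAMP column name
    -- to be quoted as it is a reserved word.
    CREATE TABLE PARTY_PROPERTIES
    (
        PARTY_DWK                     INTEGER      NOT NULL,
        PARTY_PROPERTY_TYPE_DWK       INTEGER      NOT NULL,
        PARTY_PROPERTY_TEXT_VALUE     VARCHAR(32),
        PARTY_PROPERTY_NUMERIC_VALUE  NUMERIC,
        PARTY_PROPERTY_START_DATE     DATE         NOT NULL,
        PARTY_PROPERTY_END_DATE       DATE,
        TIMESTAMP                     DATE         NOT NULL,
        ORIGIN                        VARCHAR(32)  NOT NULL
    );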

Sequence Numbers For Primary Keys

The major entities require a sequence number to populate the _DWK field. Each major
entity should have its own sequence.

The _TYPES and _BANDS tables all need a sequence to populate their _DWK field.
This should be a single sequence that is shared amongst all _TYPES and _BANDS
tables. This has two effects: it prevents a larger number of sequences being created
than necessary, and it also means that reference data cannot inadvertently be joined to
other reference data.

_PROPERTIES, _EVENTS, _SEGMENTS and _HISTORY tables do not need a primary
key.

Occurrence or transaction tables do not normally need a primary key. If they do then,
like major entities, each one should have its own sequence.

It is often asked if the CALENDAR table should have a _DWK column or if the date is
sufficient. Either approach will work; however, for consistency the use of a _DWK is
preferred47.

47 Some organisations compromise by using the Julian Day Number, i.e. the integer part of the date (see
http://en.wikipedia.org/wiki/Julian_day), as a surrogate key that obscures the underlying information from
the users but aids development. This does, from time to time, risk inconsistencies.
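
A minimal sketch of this arrangement follows, using Oracle-style syntax since
sequences are not part of SQL92; the sequence names, example type values and
mapping name are all assumed:

    -- Illustrative only (Oracle-style syntax).
    CREATE SEQUENCE PARTY_DWK_SEQ;      -- one sequence per major entity
    CREATE SEQUENCE REFERENCE_DWK_SEQ;  -- shared by all _TYPES and _BANDS

    INSERT INTO PARTY_TYPES
        (PARTY_TYPE_DWK, PARTY_TYPE, PARTY_TYPE_DESC, PARTY_TYPE_GROUP,
         PARTY_TYPE_START_DATE, TIMESTAMP, ORIGIN)
    VALUES
        (REFERENCE_DWK_SEQ.NEXTVAL, 'INDIVIDUAL', 'An individual person',
         'PARTY TYPES', SYSDATE, SYSDATE, 'M_LOAD_PARTY_TYPES');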


Appendix 2 – Understanding Hierarchies


Hierarchies are an essential part of any reporting system and yet there are two common
mistakes that regularly affect their implementation.

Sales Regions
Most businesses will have a ‘geographic’ structure of some type - sales region being a
prime example. The title ‘sales region’ and the names of the elements (e.g. country
names, state names, city names) add to the confusion in implying that this is a
geographic hierarchy, but this is wrong; it is an organisational hierarchy.

Whilst in concept the business allocates resource to cover different geographic regions,
the practicalities of running the business soon overtake the situation. A series of
exceptions is soon created, with some accounts being looked after by people out of
region and some accounts being looked after by non-geographic functions. There is no
direct geographical relationship, just a use of geographic names for familiarity.

It is possible to associate the organisational structure via a history table to the
addresses of clients, but this is of little value. The requirement should be to accurately
report the hierarchy and not to become the system of record for how the sales teams
are organised. It is important to remember that management teams will subjectively
change this structure as their resources permit.

Internal Organisation Structure


The second common mistake is to treat roles and individuals within an organisation
structure as being synonymous. Below is a typical organisation chart:

Figure 34 - Typical Organisation Hierarchies

This is often stored as Jack Doe reporting to John Smith, etc. However, the people in
the organisation structure are not the hierarchy. What needs to be stored is the role
as an organisation unit:

Figure 35 - Stored Organisational Hierarchy


The role hierarchy is significantly less dynamic than the people within it, and
organisation changes are much more controlled, as the business chooses when to re-
structure but does not choose when staff join or leave. Most large organisations will
have a personnel or human resources department that manages the organisation
hierarchy, and if they use a Human Resource Management System it is likely that
every role will have a unique ID and a documented position in the hierarchy.

It is then possible to relate the roles as organisational units to the individuals so:

Individual    Type        Organisational Unit
John Smith    Works as    Sales Manager
Jack Doe      Works as    Sales Executive 1
etc.

Figure 36 - Relating Individuals to Roles

This has the added advantage of dealing with temporary resources and also with the
transition of resources (e.g. when someone is moving from one team to another and
fulfils two roles for a short period of time).
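
A hedged sketch of how this might be stored using the _HISTORY construct
described earlier; the table and column names, key values, dates and the
assumed ‘Works as’ history type (DWK 7) are examples only:

    -- Illustrative only: an individual (DWK 101) moving between roles,
    -- briefly holding two at once during the transition.
    INSERT INTO INDIVIDUAL_ORGANISATION_HISTORY
        (INDIVIDUAL_DWK, ORGANISATION_DWK, HISTORY_TYPE_DWK,
         HISTORY_START_DATE, HISTORY_END_DATE, TIMESTAMP, ORIGIN)
    VALUES (101, 12, 7, DATE '2008-01-01', DATE '2009-02-28',
            CURRENT_DATE, 'M_LOAD_HR');  -- old role, now end-dated

    INSERT INTO INDIVIDUAL_ORGANISATION_HISTORY
        (INDIVIDUAL_DWK, ORGANISATION_DWK, HISTORY_TYPE_DWK,
         HISTORY_START_DATE, HISTORY_END_DATE, TIMESTAMP, ORIGIN)
    VALUES (101, 15, 7, DATE '2009-02-01', NULL,
            CURRENT_DATE, 'M_LOAD_HR');  -- new role, open-ended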


Appendix 3 – Industry Standard Data Models


A number of organisations offer industry standard data models. These can be broken down
into two types of provider:

• Vendors such as IBM, Oracle, Sybase and Teradata, who all provide some form of
standard data model for some industry sectors. These models usually started from
a client project and then went through a period of internal refinement before
becoming ‘productized’
• Industry Organisations such as TMForum in the telecommunications industry that
have decided that there is value in building an industry wide common data model

Both types of provider usually supply logical models and these can be used in one of two
ways:

• As a reference data model used for the accumulated industry knowledge
• As a real implementation data model

A logical data model will need to be converted to a physical model, and it is in this
conversion process, when the physical data model is created, that process neutral data
modelling can be used.

As an example, one of the best described and available data models is the Information
Framework (SID)48 from TMForum.org49, an industry association focused on transforming
business processes, operations and systems for managing and monetizing on-line
Information, Communications and Entertainment services.

The Information Framework provides the foundation of a "common language" that allows
common representation, as well as a standardized meaning for the relationships that exist
among logical entities. For example, a common definition of what is a "customer" and how it
relates to other elements, such as mailing address, purchase order, billing records, trouble
tickets, and so on.

This is an ideal basis for a process neutral data model as there is a defined set of major
entities with lifetime values and relationships.

The model is broken down into a number of domains:

• Market / Sales
• Product
• Customer
• Service
• Resources
• Supplier / Partner
• Common Business
• Enterprise

Within these there are a number of subject areas

48 TMForum Information Framework (SID):
http://www.tmforum.org/InformationFramework/1684/home.html
49 TMForum: http://www.tmforum.org/browse.aspx


Figure 37 - TMForum Information Framework (SID) Version 8.0 overview50

At first glance this appears to offer up a view of the world incompatible with the design
objectives of a process neutral data model (e.g. because there is a customer and a supplier
rather than just a party). This is an incorrect assumption. There are two ways in which the
Information Framework can be used with the approach described in this document.

The first method is to trust the Information Framework and implement it as a process neutral
data model. Therefore there is no major entity called Party; instead there is one called
Customer and one called Supplier. This approach trusts the reference model to have thought
through the industry specific lifetime value issues, and requires the designer to be satisfied
that it will be fit for purpose. In the specific case of the TMForum Information Framework this
is a safe assumption as it is widely peer reviewed and widely used by industry experts. This
is, however, not always true of all vendor data models.

The second method is to use the Information Framework as a point of reference and create a
process neutral data model that covers all the described entities and attributes of the
Information Framework. In this case a Party entity would exist, and the attributes and
associated properties, links, etc. would be validated to ensure that all information held in the
Information Framework could be stored in the resulting data model.

Both of these approaches have been successfully used with the TMForum Information
Framework and could be used with other industry standard data models. The choice of
approach will often depend on the quality of the reference data model, the likelihood of
change and the needs of the business.

50 The SID model is copyright TMForum.org and was taken from
http://www.tmforum.org/sdata/content/PracticesStandards/sid/default.aspx


Appendix 4 – Information Sparsity


Once all the processing is complete and data quality issues are resolved, how much
information does a data warehouse actually have? The answer is inevitably less than the
organisation believes. The rise of social networking sites has helped to develop
understanding in this area. It shows that most people (80%) have fewer than 100 friends51
and that people do not know as much as they think about those friends. One simple test is to
assume that you have 100 friends and assess for what percentage of your friends you know
the following information:

• First Name & Last Name
• Middle Names
• Birthday (e.g. 6 July)
• Date of Birth (6 July 1966)
• Partner's Name
• Home Address
• Home Telephone Number
• Home e-Mail Address
• Work Address
• Work Telephone Number
• Work e-Mail Address
• Mobile Number
• All of the above

The chances are that you will not know the answer to any question for 100% of your friends.
(What is Mrs Smith’s first name, she lives two doors down and looks after the cat when you
are away?)

The situation is also one that deteriorates rapidly. As a result of reading this white paper you
might decide to contact your 100 best friends and get all the above information. Your friends
are tolerant of your request and provide you with all this information. In six months' time you
decide to update your address book and you contact your tolerant friends again to check that
all the details are still correct. The chances are that at least twenty percent of your friends will
have changed some part of the information over the six months52.

The use of synchronisation tools, social network and personal address book sites53 has
improved the automation of change notification. It is now possible to update your own details
on a service and for that to automatically update the records of your friends who also use the
service, but the change rate is still high.

51 From a survey by RapLeaf:
http://www.marketingvox.com/more-women-than-men-on-social-networks-have-more-friends-than-men-
do-038384/
52 The percentage varies with age and socio-economic factors. Middle age and high incomes improve
stability and reduce the percentage of change; youth and older age, as well as lower incomes, increase
the amount of change. This was dramatically demonstrated in the UK with the introduction of the
community charge or “poll tax” (http://en.wikipedia.org/wiki/Community_Charge) between 1990 and
1993. Local authorities were responsible for collecting the basic household information and struggled to
maintain an accurate list of households. Whilst there was quite a lot of deliberate avoidance that cannot
be factored in, there was also regularly 20% of notified change in any single month.
53 Sites include Facebook, Bebo, Plaxo, LinkedIn and Naymz. Many of these sites are now adding
features that allow you to better qualify and quantify these friends into true friends, acquaintances, etc.


This issue transfers into the data warehouse environment. If an individual cannot keep track
of their friends then how does a business keep track of its customers? Businesses only get
informed of changes when the customer requires something. For example, if you register a
‘pay as you go’ mobile telephone and when doing so are required to provide an address, do
you bother to update the records when you change address? However, if you later need
something sent from the telephone company then you will contact them to update their
records.

One Telco data warehouse team attempted to measure how poor the address data was. They
decided to look at post-paid customers, who receive a printed bill each month. The method
was obvious and simple once it was identified: they went to the mailroom and asked how
many bills were returned by the postal service. The answer was about 25,000 per month, or
1% of the bills generated. They also discovered that a team handled the returned mail and by
various methods updated the records in the main billing system. Therefore each month 1% of
the post-paid customer data expired. Pre-paid customers, the much larger proportion of the
total customer base, would have much less reason to update their information and therefore a
much larger percentage of expired data.

The process neutral data model aids this situation in two simple ways:

• The first way in which it helps is that there is a separate record for each piece of
information (home address is stored separately from work address within the
PARTY_ADDRESS_HISTORY, etc.) and this means that it is easy to maintain each
of the different pieces of information without impacting the others.

• The second way in which it helps is that each piece of information has its own
TIMESTAMP. It is therefore possible to exclude information based on its age, as
the sketch below shows.
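
A minimal sketch of such an age-based filter (the column names are assumed from
the conventions described earlier, and the six-month cut-off date is an example
only):

    -- Illustrative only: addresses still current and confirmed recently.
    SELECT PARTY_DWK, ADDRESS_DWK
    FROM   PARTY_ADDRESS_HISTORY
    WHERE  HISTORY_END_DATE IS NULL         -- still the current address
    AND    TIMESTAMP >= DATE '2008-08-10';  -- stamped in the last six months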


Appendix 5 – Set Processing Techniques


This appendix outlines the basic principles involved in writing set based techniques; it is not
comprehensive but provides the basic flow used in the technique. This technique assumes
that the database supports set operators and that there is a staging area where a previous
copy of the table can be held (on the first day the previous copy exists but is empty). It is not
an exhaustive description. Change capture techniques such as this offer some of the biggest
opportunities to improve data warehouse load times. In this example the table is called TABLE.

1. PREVIOUS_TABLE exists from the previous run of the process or is created empty
   for the very first run
2. A copy of the source system table is taken to create CURRENT_TABLE
3. The INS_UPD_TABLE is created as CURRENT_TABLE MINUS PREVIOUS_TABLE
4. The DEL_UPD_TABLE is created as PREVIOUS_TABLE MINUS CURRENT_TABLE
5. The INS_TABLE is created as INS_UPD_TABLE MINUS DEL_UPD_TABLE
6. The DEL_TABLE is created as DEL_UPD_TABLE MINUS INS_UPD_TABLE
7. The UPD_TABLE is created as DEL_UPD_TABLE INTERSECT INS_UPD_TABLE
8. The PREVIOUS_TABLE is dropped and the CURRENT_TABLE renamed to
   PREVIOUS_TABLE

The three resulting tables are then applied as follows:

INS_TABLE   Inserted as new records into the appropriate tables
DEL_TABLE   Updates the end-date on existing records in the appropriate tables
UPD_TABLE   Processed to end-date existing records and create new records in the
            appropriate tables

Figure 38 - Set Processing Technique
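
A minimal sketch of this flow in SQL follows; it assumes a single key column
KEY_COL (an assumed example name) and uses Oracle-style syntax, where MINUS
corresponds to the ANSI EXCEPT operator. Note that steps 5 to 7 compare on the
key columns only, so that changed rows are classified as updates rather than as
both inserts and deletes:

    -- Illustrative only (Oracle-style syntax).
    CREATE TABLE INS_UPD_TABLE AS
        SELECT * FROM CURRENT_TABLE
        MINUS
        SELECT * FROM PREVIOUS_TABLE;       -- step 3: new or changed rows

    CREATE TABLE DEL_UPD_TABLE AS
        SELECT * FROM PREVIOUS_TABLE
        MINUS
        SELECT * FROM CURRENT_TABLE;        -- step 4: gone or changed rows

    CREATE TABLE INS_TABLE AS
        SELECT KEY_COL FROM INS_UPD_TABLE
        MINUS
        SELECT KEY_COL FROM DEL_UPD_TABLE;  -- step 5: pure inserts

    CREATE TABLE DEL_TABLE AS
        SELECT KEY_COL FROM DEL_UPD_TABLE
        MINUS
        SELECT KEY_COL FROM INS_UPD_TABLE;  -- step 6: pure deletes

    CREATE TABLE UPD_TABLE AS
        SELECT KEY_COL FROM DEL_UPD_TABLE
        INTERSECT
        SELECT KEY_COL FROM INS_UPD_TABLE;  -- step 7: updates

    DROP TABLE PREVIOUS_TABLE;              -- step 8
    RENAME CURRENT_TABLE TO PREVIOUS_TABLE;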


Appendix 6 – Standing on the shoulders of giants


"Bernard of Chartres used to say that we are like dwarfs on the shoulders of giants, so that
we can see more than they, and things at a greater distance, not by virtue of any sharpness
of sight on our part, or any physical distinction, but because we are carried high and raised up
54
by their giant size."

Process Neutral Data Modelling may seem a large leap from more widely discussed methods
of data modelling for a data warehouse but it has been used for over fifteen years by some of
the largest organisations in the world.

The techniques in this document have been influenced by a number of people:

• Ralph Kimball
Creator of the data mart concept and the need to deliver simple, easy to use
information to business users

• Bill Inmon
Known as the father of data warehousing whose approach required a normalised
database in which to store the lowest level of information

• Paul Winder (formerly of Oracle Corp)
Creator of Oracle's Telco Reference Data Model and responsible for the abstraction
of major entities within that model to allow it to be used across many different Telcos

• Ward Cunningham
Owner of c2.com (home of the Portland Pattern Repository) and signatory of the agile
manifesto. Ward also worked with a number of the author’s contemporaries at
Sequent Computers in the early and mid 1990s.

• David Heinemeier Hansson
Ruby on Rails designer who conceived and implemented many of the concepts used
in “convention over configuration” from a coding perspective.

• Andy Hunt and Dave Thomas
Authors of the book The Pragmatic Programmer in which DRY (Don’t Repeat
Yourself) is a core principle.

And many others who will no doubt feel that they should have been included and to whom the
author can only apologise for their omission.

54 This quote, often attributed to Isaac Newton, is by John of Salisbury, from his 1159 Metalogicon.
http://en.wikipedia.org/wiki/Standing_on_the_shoulders_of_giants


Further Reading
Data Management & Warehousing have published a number of white papers on data
warehousing and related issues. The following papers are available for download from
http://www.datamgmt.com

Overview Architecture for Enterprise Data Warehouses

This is the first of a series of papers published by Data Management & Warehousing to
look at the implementation of Enterprise Data Warehouse solutions in large
organisations using a design pattern approach. A design pattern provides a generic
approach, rather than a specific solution. It describes the steps that architecture, design
and build teams will have to go through in order to implement a data warehouse
successfully within their business.

This particular document looks at what an organisation will need in order to build and
operate an enterprise data warehouse in terms of the following:

* The framework architecture
What components are needed to build a data warehouse, and how do they fit together

* The toolsets
What types of products and skills will be used to develop a system

* The documentation
How do you capture requirements, perform analysis and track changes in scope of a
typical data warehouse project

This document is, however, an overview and therefore subsequent documents deal
with specific issues in detail.

Data Warehouse Governance


An organisation that is embarking on a data warehousing project is undertaking a long-
term development and maintenance programme of a computer system. This system
will be critical to the organisation and cost a significant amount of money, therefore
control of the system is vital. Governance defines the model the organisation will use to
ensure optimal use and re-use of the data warehouse and enforcement of corporate
policies (e.g. business design, technical design, and application security) and ultimately
derive value for money.

This paper has identified five sources of change to the system and the aspects of the
system that these sources of change will influence in order to assist the organisation to
develop standards and structures to support the development and maintenance of the
solution. These standards and structures must then evolve, as the programme
develops to meet its changing needs.

“Documentation is not understanding, process is not discipline, formality is not skill”

The best governance must only be an aid to the development and not an end in itself.
Data warehouses are successful because of good understanding, discipline and the
skill of those involved. On the other hand, a system built to a template without
understanding, discipline and skill will inevitably fail to meet the users’ needs and,
sooner rather than later, will be left on the shelf or maintained at a very high cost but
with little real use.

© 2009 Data Management & Warehousing Page 61


White Paper - Process Neutral Data Modelling

Data Warehouse Project Management

Data warehouse projects pose a specific set of challenges for the project manager.
Whilst most IT projects are developments to support a well-defined pattern of work, a
data warehouse is, by design, there to support users asking ad hoc questions of the
data available to the business. It is also a project that will have more interfaces and
more change than any other system within the organisation.

Projects often have poorly set expectations in terms of timescales, the likely return on
investment, the vendors’ promises for tools or the expectations set between the
business and IT within an organisation. They also have large technical architectures
and resourcing issues that need to be handled.

This document will outline the building blocks of good project control, including the
definition of phases, milestones, activities, tasks, issues, enhancements, test cases,
defects and risks, and will discuss how they can be managed and when, using an event
horizon, the project manager can expect to get information.

To help manage these building blocks this paper will look at the types of tools and
technology that are available and how they can be used to assist the project manager.
It also looks at how these tools fit into methodologies.

The final section of the paper looks at how effective project leadership and
estimating can improve the chances of success for a project. This includes
understanding the roles of the executive sponsor, project manager, technical architect
and senior business analyst along with the use of different leadership styles,
organisational learning and team rotation.

Data Warehouse Documentation Roadmap


All projects need documentation and many companies provide templates as part of a
methodology. This document describes the templates, tools and source documents
used by Data Management & Warehousing. It serves two purposes:

• For projects using other methodologies or creating their own set of documents to
use as a checklist. This allows the project to ensure that the documentation covers
the essential areas for describing the data warehouse.
• To demonstrate our approach to our clients by describing the templates and
deliverables that are produced.

Documentation, methodologies and templates are inherently both incomplete and
flexible. Projects may wish to add, change, remove or ignore any part of any document.
Some may also believe that aspects of one document would sit better in another. If this
is the case then users of this document and these templates are encouraged to change
them to fit their needs.

Data Management & Warehousing believes that the approach or methodology for
building a data warehouse should be to use a series of guides and checklists. This
ensures that small teams of relatively skilled resources developing the system can
cover all aspects of the project whilst being free to deal with the specific issues of their
environment to deliver exceptional solutions, rather than a rigid methodology that
ensures that large teams of relatively unskilled staff can meet a minimum standard.


How Data Works


Every business believes that their data is unique. However, the storage and
management of that data uses similar methods and technologies across all
organisations. As a result the same issues of consistency, performance and quality
occur across all organisations. The commercial difference between organisations is not
whether they have data issues but how they react to them in order to improve the data.

This paper examines how data is structured and then looks at characteristics such as
the data model depth, the data volumes and the data complexity. Using these
characteristics it is possible to look at the effects on the development of reporting
structures, the types of data models used in data warehouses, the design and build of
interfaces (especially ETL for data warehouses), data quality and query performance.
Once the effects are understood it is possible for programmes and projects to reduce
(but never remove) the impact of these characteristics resulting in cost savings for the
business.

This paper also introduces concepts created by Data Management & Warehousing
including:

• Left to right entity diagrams
• Data Model Depth
• Natural Star Schemas
• The Data Volume and Complexity graph
• Incremental Phase Benefit Model


List of Figures
Figure 1 - Initial Operational System Data Model...................................................................... 6 
Figure 2 - Initial Reporting System Data Model......................................................................... 7 
Figure 3 - Second Version Operational System Data Model..................................................... 8 
Figure 4 - The Sales Funnel .................................................................................................... 10 
Figure 5 - Example data for PARTY_TYPES .......................................................................... 17 
Figure 6 - Example Data for GEOGRAPHY_TYPES .............................................................. 18 
Figure 7 - Example data for TIME_BANDS ............................................................................. 19 
Figure 8 - Party Properties Example ....................................................................................... 20 
Figure 9 - Example Party Property Data ................................................................................. 20 
Figure 10 - Example data for PARTY_PROPERTIES............................................................. 21 
Figure 11 - Example Data for PARTY_PROPERTY_TYPES.................................................. 21 
Figure 12 - Example Data for PARTY_PROPERTIES ............................................................ 21 
Figure 13 - Example Data for PARTY_PROPERTIES ............................................................ 22 
Figure 14 - Party Events Example........................................................................................... 22 
Figure 15 - Party Links Example ............................................................................................. 23 
Figure 16 - Party Segments Example ..................................................................................... 24 
Figure 17 – Party Geography History Example ....................................................................... 26 
Figure 18 - The Example Bank Data Model ............................................................................ 31 
Figure 19 - Volume & Complexity Correlations ....................................................................... 32 
Figure 20 - PARTIES view mapping........................................................................................ 34 
Figure 21 - Vertically Partitioned Data..................................................................................... 35 
Figure 22 - Horizontally Partitioned Data ................................................................................ 35 
Figure 23 - Data Commutativity............................................................................................... 39 
Figure 24 - _START_DATE Rules .......................................................................................... 47 
Figure 25 - _END_DATE Rules............................................................................................... 47 
Figure 26 - Column Name Abbreviations ................................................................................ 49 
Figure 27 - Standard _TYPES table ........................................................................................ 50 
Figure 28 - Standard _BANDS table ....................................................................................... 51 
Figure 29 - Standard _PROPERTIES table ............................................................................ 51 
Figure 30 - Standard _EVENTS table ..................................................................................... 51 
Figure 31 - Standard _LINKS table ......................................................................................... 51 
Figure 32 - Standard _SEGMENTS table ............................................................................... 52 
Figure 33 - Standard _HISTORY table.................................................................................... 52 
Figure 34 - Typical Organisation Hierarchies .......................................................................... 53 
Figure 35 - Stored Organisational Hierarchy ........................................................................... 53 
Figure 36 - Relating Individuals to Roles................................................................................. 54 
Figure 37 - TMForum Information Framework (SID) Version 8.0 overview............................. 56 
Figure 38 - Set Processing Technique .................................................................................... 59 

Copyright
© 2009 Data Management & Warehousing. All rights reserved. Reproduction not permitted
without written authorisation. References to other companies and their products use
trademarks owned by the respective companies and are for reference purposes only.

Some terms and definitions taken from Wikipedia

Crossword Answer: Expert Gives Us Real Understanding - GURU
