You are on page 1of 18

Data Warehouse Project Management

Listen, Listen, Listen to the Stakeholders

By Hari Mailvaganam

A paramount determining factor in the success of data warehousing is the input of


stakeholders. Data warehousing is very unique to an organization, its business processes,
systems architecture and decision support needs.

Project management for data warehousing allows for large amounts of user input and at
all phases of the project. There are commercial software products tailored for data
warehouse project management. A good project plan lists the critical tasks that must be
performed and when each task should be started and completed. It identifies who is to
perform the tasks, describes deliverables to be created and identifies milestones for
measuring progress.

There are a number of publications on data warehousing project management. The


standard bearer publication is Ralph Kimball's Data Warehouse Lifecycle Toolkit.

Over the years I have worked with different project management methodologies and
processes for data warehousing. In this article I have listed a summary of a project
management methodology that I have developed that also ties in with the Rational
Unified Process (RUP). This is especially useful if the data warehousing project is part of
a greater development effort which follows RUP.

The program management framework for data warehousing follows tested


methodologies. The methodology described below follows the project management
workflow of the Unified Process.

Figure 1. Rational Unified Process; Copyright IBM


The project plan should include presentations at regular intervals, say monthly, to
management and stakeholders. The presentation will include:

• Discussion on any challenges and determine if project is on schedule;


• Review of activities and priorities to be achieved before next meeting;
• Contingency plans to make up time and address problems;
• An open question-and-answer period followed by a summary.

Phases of the Unified Process

The phases in project management following UP, Figure 1, are:

1. Inception
2. Elaboration
3. Construction
4. Transition

The project lifecycle is composed over time into four sequential phases, each concluded
by a major milestone, Figure 2.

The RFP is produces at the end of the inception phase.

Figure 2. Phases in the Unified Process; © IBM

1. Inception Phase

• Project planning and evaluation.


• Requirements gathering.
• Define features the system must support.
• Identify the stakeholders who oversee the system use.
• Users and other applications (i.e. actors) that will interface with the system.
• Define framework to identify business rules.
• List the events that the system must be aware of and respond to.
• List the constraints and risks place on the project.
• Define product selection criteria.
• Define project management environment.
• Prepare and obtain approval of sponsors for project mandate and approach.
• Define data warehouse objectives.
• Define query library/location matrix.
• Define query libraries.
• Establish meta-data.
• Prepare migration specifications.
• Develop query usage monitoring.
• Technical infrastructure assessment.
• Current technology assessment.
• Release of RFP.

The project plan must also place importance to:

• Change control.
• Risk assessment.
• Training.
• Scope agreements.
• Resources.
• Communication.

2. Elaboration Phase

• Review of RFPs and selecting winning candidate.


• Determine size of the data warehouse.
• Determine complexity and cleanliness of the data.
• Identify number of source databases and their characteristics.
• Select DSS tools.
• Determine network requirements.
• Conduct training.
• Integration of data.
• Plan quality assurance testing procedures.

3. Construction Phase

• Implement components.
• Systems integration.
• Conduct testing.
• Plan deployment.
• Develop support material.

4. Transitions Phase

• Roll-out of modules.
• Managing change-control requests.
• Conduct beta testing to validate the new system against user expectations.
• Perform converting operational databases.
• Evaluate performance and benchmarking.
• Evaluate that deployment baselines are complete.
• Obtain stakeholder feedback on deployment baselines consistency with evaluation
criteria.
• Fine-tune product based on feedback.
• Release product to users.
• Product release milestones.
Please contact us if you have any questions or suggestions.

What is Data Cleansing?

The elevator pitch: "Data cleansing ensures that undecipherable data does not enter the
data warehouse. Undecipherable data will affect reports generated from the data
warehouse via OLAP, Data Mining and KPI's."

A very simple example of where data cleansing would be utilized is how dates are stored
in separate applications. Example: 11th March 2007 can be stored as '03/11/07' or
'11/03/07' among other formats. A data warehousing project would require the different
date formats to be transformed to a uniform standard before being entered in the data
warehouse.

Why Extract, Transform and Load (ETL)?

Extract, Transform and Load (ETL) refers to a category of tools that can assist in
ensuring that data is cleansed, i.e. conforms to a standard, before being entered into the
data warehouse. Vendor supplied ETL tools are considerably more easy to utilized for
managing data cleansing on an ongoing basis. ETL sits in front of the data warehouse,
listening for incoming data. If it comes across data that it has been programmed to
transform, it will make the change before loading the data into the data warehouse.

ETL tools can also be utilized to extract data from remote databases either through
automatically scheduled events or via manual intervention. There are alternatives to
purchasing ETL tools and that will depend on the complexity and budget for your project.
Database Administrators (DBAs) can write scripts to perform ETL functionality which
can usually suffice for smaller projects. Microsoft's SQL Server comes with a free ETL
tool called Data Transforming Service (DTS). DTS is pretty good for a free tool but it
does has limitations especially in the ongoing administration of data cleansing.

Example of ETL vendors are Data Mirror, Oracle, IBM, Cognos and SAS. As with all
product selections, list what you think you would require from an ETL tool before
approaching a vendor. It may be worthwhile to obtain the services of consultants that can
assist with the requirements analysis for product selection.
Figure 1. ETL sits in front of Data Warehouses

How important is Data Cleansing and ETL to the success of Data Warehousing
Projects?

ETL is often out-of-sight and out-of-mind if the data warehouse is producing the results
that match stakeholders expectations. As a results ETL has been dubbed the silent killer
of data warehousing projects. Most data warehousing projects experience delays and
budget overruns due to unforeseen circumstances relating to data cleansing.

How to Plan for Data Cleansing?

It is important is start mapping out the data that will be entered into the data warehouse as
early as possible. This may change as the project matures but the documentation trail will
come in extremely valuable as you will need to obtain commitments from data owners
that they will not change data formats without prior notice.

Create a list of data that will require Extracting, Transforming and Loading. Create a
separate list for data that has a higher likelihood of changing formats. Decide on whether
you need to purchase ETL tools and set aside an overall budget. Obtain advice from
experts in the field and evaluate if the product fits into the overall technical hierarchy of
your organization.
Planning for a Data Warehouse
Data Warehousing have been voted the most complex and resource intensive software
projects by CIOs of Fortune 1000 companies. Enterprise Resource Planning (ERP)
implementations top the list.

Data Warehousing comes in all shapes and sizes, which bear a direct relationship to cost
and time involved. The approach to starting a Data Warehousing project will vary and the
steps listed below are summary of some of the points to consider.

1) Get Professional Advice

Data Warehousing makes a great difference is the lead has been there and done that. It
will save a huge bundle to get professional advice upfront. Endless meeting times can be
saved and the risk of an abandon data warehousing project can be reduced.

2) Plan the Data

Know what metrics you want to measure in the Data Warehouse and ensure that there is
the appropriate data to provide for the analysis. If you wish to obtain periodic Key
Performance Index (KPI) data for shipping logistics, make sure that the appropriate data
is piped into the data warehouse.

3) Who will use the Data Warehouse

The power Data Warehouse consumers are business and financial managers. Data
Warehouses are meant to deliver clear indications on how the business is performing. Plot
out the expected users for the Data Warehouse in the enterprise. Gauge that they will have
the appropriate reports in a format which is quickly understandable. Ensure that planning
exercises are conducted in advance to accumulate scenarios on how the Data Warehouse
will be used. Always remember that data has to be presented attractively and in a format
business managers will feel comfortable. Text files with lines of numbers will not suffice!

4) Intergration to External Applications

Most Data Warehousing projects sink or swim by their ability to extract data from
external applications. Enterprises have a slew of applications either developed inhouse or
obtain from a vendor. Conceptually, your Data Warehouse will act as the heart to diverse
applications running in the enterprise. All important data will flow in or out of the Data
Warehouse.

5) Technology, What Technology?

At the bits and bytes level, a Data Warehouse is a souped up database. It will be built
from one of the major Relational Database Management System (DBMS) vendors like
Oracle, IBM, Microsoft. Open source databases, like mySQL, can also support Data
Warehousing with the right support in place.
Data Warehousing RFP
Much of a data warehousing project's success is determined at the onset. A data
warehousing request for proposal (RFP) is a powerful tool in dictating the project's tone.

This article gives an overview of some of the components that are covered in an RFP.
Data warehousing projects are unique to an organization's business requirements,
computer resources and budget. Please contact us if you have any questions.

1. GENERAL INFORMATION FOR VENDORS


 Lists the Project Vision
 Project timelines
 Nature of contract
 Scope
Reasoning: Give an overview of project and format for the RFP process.
2. SYSTEM BACKGROUND
Description of the Technical Environment
a) Description on the Client and Server specifications at the Ministry.
b) Breakdown of user types for the data warehouse:
User Types
Description Number of Users
(Examples Only)

Novice User Uses “canned” reports and queries. x

Frequent user, edits reports, queries


Power User directly and builds cubes for multi- x
dimensional analysis.
c) Software tools for the data warehouse:
• ETL Tools
• Report and Query Tools
• Data Analysis
• Data Visualization Tools
Reasoning: Give an overview of the expected installation. Also specify points of particular
interest.
3. SYSTEM REQUIREMENTS
 List of General Requirements
→ Give an overview of the data warehouse environment.
Reasoning: An overview of the rational for the data warehouse implementation.
 Mandatory Requirements
a) List if the software is Client/Server based or web-based. Described what
functions are available via the Internet browser. Describe what plug-ins (if
any) are necessary for the client machines.
b) State that the vendor must communicate with the Ministry’s database and
applications. Ask the vendor how they propose to do this.
c) The vendor must be committed to its products remaining compatible with
future versions of Oracle. Ask the vendor to offer documents as evidence of
commitment to this environment.
d) The vendor’s system must enable the overall optimization of the Ministry’s
environment:
I) Ask for how the vendor’s system minimizes user response
time while maximizing the use of system resources.
ii) Ask the vendor to explain the system’s architecture for
data selection, processing, formatting.
iii) Ask for evidence that network traffic is minimized and how
the need to process large data sets on the client is also
minimized.
Reasoning: Based on user interviews and technical review, these are the must
have features for the data warehouse installation.
 Strongly Desired Features
Examples:
→ The vendor’s system must allow users to schedule reports.
→ Does your system permit users to schedule at a certain time
and date? Daily? Weekly? Based on an event?
Describe how scheduling is accomplished and the options for
scheduling reports.
Reasoning: Based on the requirements analysis, the items listed here are
additional features that will be good to have but not mandatory.

 Desired Features
Examples:
→ Does your system possess a redundant query control?
This feature returns a saved version of a report whenever an
identical report has been executed earlier – the report, output
format, and raw data all remain constant in the interim. Describe
how this works in your system.
What are the limits in terms of how far back reports are archived?
Reasoning: Following the rational of “Strongly Desired Features” above, this
section lists a lower desired set of requirements.
4. INFORMATION REQUIRED FROM VENDORS
 Product Profile
o Product name
o Product description
o Current release level
o Date current release level was available
o Projected date for the next release candidate
o Current products install base
o Industry verticals of installation
o Number of installations
o Breakdown of installations/users
Reasoning: Gauge vendor’s product specifications and maturity.
 Consulting
o Describe the level of involvement the consultants will have at each
stage of the project.
o What is the experience of the consultants?
o Location of Consultants.
o Please forward resumes of consultants.
Reasoning: Evaluate experience and competency of the vendor’s team.
Also enquire on the consultants’ geographic location, this will help evaluate
possible response rate.
 Business Partners
o Ask the vendor to state which products are fully integrated to your solution.
o Describe the interconnectivity and interface requirements.
Reasoning: Enquire what other products can integrate well with the solution,
e.g. Crystal Reports, Cognos. Also, the vendor will state the ease at which
integration can occur.
 Full feature of the business intelligence solution and products
o Ask what makes the product stand out.
Reasoning: This will give the vendor a chance to distinguish their products.
Further business intelligence capabilities such as OLAP, Data mining can be evaluated.
 Application areas
o Describe the areas the solution supports i.e. budgeting, financial
planning etc.
Reasoning: Evaluate which areas the data warehousing solution can serve.
The “Mandatory Features” section above will inform the vendor the area which
are necessary for the Ministry. This section gives further information on the
vendors solution and possible areas which the Ministry can use in the future.
 Web-based and standalone clients
o Query if the solution supports both web-based and standalone clients.
Ask which functionality is not supported in both versions (i.e. what
works in a standalone client and not web-based and vice versa)
Reasoning: The architecture is of a fundamental importance. This section will
enquire if the vendor’s solution will match the Ministry requirements.
 Pricing and Licensing Model
o How much does the product cost, including initial training, support
and consulting?
o
Reasoning: Gauge the cost of the solution and other indirect costs.
 System Installation and Tool Administration
Examples:
o Please describe the platform and system requirements of your product.
o For web-based clients, are there specific requirements on the type and
version of browser that may be used (e. g. Netscape vs. Internet Explorer)?
o Please describe the overall architecture of your product (e. g.
Client/ Server, two-tier/ three-tier architecture, etc.), including how
your product functions in a geographically distributed environment.
o Please describe the steps involved in installing your product.
o Please list the administrative functions that are included in your
product (e. g. defining roles, backup/ recovery, etc.).
o How do tool administrators detect when your product requires additional
hardware or requires human intervention (e. g. to restart a process, etc.).
o How often are major/ minor software upgrades available? How
would we be notified of a new software release?
o Please describe how standalone (I. e. non-web-based) clients are
distributed to end-user desktops. How are software upgrades
distributed to end-users?
o Are there ways to make the upgrade process transparent to the
end-users?
Reasoning: The above questions and statements will give further information
on the vendor’s solution regarding the installation and administration of your
product.
 Security
o How are end-users and administrators authenticated to your
product?
o How do end-users and administrators gain access to reports, data
sets, etc. (e. g. via roles)?
o Please fully describe how reports and data are protected from
unauthorized users.
o Please describe how row-level access to the data is attained.
o What level of encryption is used on passwords, data, reports, etc.?
o How does the existence of a firewall change the security architecture?

o How does the web-based client differ in terms of security from the
standalone client?
Reasoning: Questions on security must also pass the security criteria set
by regulators or industry standards .
 Performance and Scalability
o Does the product supports a substantial increase in data size,
frequency and complexity of end-user queries?
o How is performance measured within your product?
Reasoning: Evaluation of how the product can scale to future expectations.

Data Warehouse Design


Data warehousing is more an art-form than cookie cutter science. The business variables
and technical risks are very unique to each installation. The business users have different
goals and expectations. Data warehousing is more often successful than not if there is a
reservoir of data warehousing expertise in-house.

This article will focus on the data warehousing design methodologies most commonly
proposed. These designs are in an evolving flux as business needs and technical cost
change.

Quite often the design chosen will be a combination of the methodologies below and
additional requirements - the data warehouse design third way. I am a proponent of the
third way data warehousing design. Third way takes into account the business specifics
and needs of the installing company and technical resources available. It uses the best
design patterns of both methodologies plus additional requirements unique to the
business.

The two major design methodologies of data warehousing are from Ralph Kimball and
Bill Inmon. The design methodologies developed by Kimball and Inmon have lines
drawn in the sand.

Both Kimball and Inmon view data warehousing as separate from OLTP and Legacy
applications.

Kimball views data warehousing as a constituency of data marts. Data marts are focused
on delivering business objectives for departments in the organization. And the data
warehouse is a conformed dimension of the data marts. Hence a unified view of the
enterprise can be obtain from the dimension modeling on a local departmental level.
Figure 1. Kimball's Data Warehousing Design Methodology

Inmon beliefs in creating a data warehouse on a subject-by-subject area basis. Hence the
development of the data warehouse can start with data from the online store. Other
subject areas can be added to the data warehouse as their needs arise. Point-of-sale (POS)
data can be added later if management decides it is necessary.

The data mart is the creation of a data warehouse's subject area.

Figure 2. Inmon's Data Warehouse Design Methodology

There are pros and cons to both approaches. And there are third ways that can be unique
to an enterprise's needs.
Data Life Cycle Management
Data becomes active as soon as it is of interest to an organization. Data life cycle begins
with a business need for acquiring data. Active data are referenced on a regular basis
during day-to-day business operations. Over time, this data loses its importance and is
accessed less often, gradually losing its business value, and ending with its archival or
disposal.

Figure 1. Data Life Cycle in Enterprises


Active Data

Active data is of business use to an organization. The ease of access for business users to
active data is an absolute necessity in order to run an efficient business.

The simple, but critical principle, that all data moves through life-cycle stages is key to
improving data management. By understanding how data is used and how long it must be
retained, companies can develop a strategy to map usage patterns to the optimal storage
media, thereby minimizing the total cost of storing data over its life cycle.

The same principles apply when data is stored in a relational database, although the
challenge of managing and storing relational data is compounded by complexities
inherent in data relationships. Relational databases are a major consumer of storage and
are also among the most difficult to manage because they are accessed on a regular basis.
Without the ability to manage relational data effectively, relative to its use and storage
requirements, runaway database growth will result in increased operational costs, poor
performance, and limited availability for the applications that rely on these databases.
The ideal solution is to manage data stored in relational databases as part of an overall
enterprise data management solution.
Inactive Data

Data are put out to pasture once they are no longer active. i.e. there are no longer needed
for critical business tasks or analysis.

Prior to the mid-nineties, most enterprises achieved data in Microfilms and tape back-
ups.

There are now technologies for data archival such as Storage Area Networks (SAN),
Network Attached Storage (NAS) and Hierarchical Storage Management. These
storage systems can maintain referential integrity and business context.

Data Warehouse Project Management


Listen, Listen, Listen to the Stakeholders

By Hari Mailvaganam

A paramount determining factor in the success of data warehousing is the input of


stakeholders. Data warehousing is very unique to an organization, its business processes,
systems architecture and decision support needs.

Project management for data warehousing allows for large amounts of user input and at
all phases of the project. There are commercial software products tailored for data
warehouse project management. A good project plan lists the critical tasks that must be
performed and when each task should be started and completed. It identifies who is to
perform the tasks, describes deliverables to be created and identifies milestones for
measuring progress.

There are a number of publications on data warehousing project management. The


standard bearer publication is Ralph Kimball's Data Warehouse Lifecycle Toolkit.

Over the years I have worked with different project management methodologies and
processes for data warehousing. In this article I have listed a summary of a project
management methodology that I have developed that also ties in with the Rational
Unified Process (RUP). This is especially useful if the data warehousing project is part of
a greater development effort which follows RUP.

The program management framework for data warehousing follows tested


methodologies. The methodology described below follows the project management
workflow of the Unified Process.
Figure 1. Rational Unified Process; Copyright IBM

The project plan should include presentations at regular intervals, say monthly, to
management and stakeholders. The presentation will include:

• Discussion on any challenges and determine if project is on schedule;


• Review of activities and priorities to be achieved before next meeting;
• Contingency plans to make up time and address problems;
• An open question-and-answer period followed by a summary.

Phases of the Unified Process

The phases in project management following UP, Figure 1, are:

1. Inception
2. Elaboration
3. Construction
4. Transition

The project lifecycle is composed over time into four sequential phases, each concluded
by a major milestone, Figure 2.

The RFP is produces at the end of the inception phase.

Figure 2. Phases in the Unified Process; © IBM


1. Inception Phase

• Project planning and evaluation.


• Requirements gathering.
• Define features the system must support.
• Identify the stakeholders who oversee the system use.
• Users and other applications (i.e. actors) that will interface with the system.
• Define framework to identify business rules.
• List the events that the system must be aware of and respond to.
• List the constraints and risks place on the project.
• Define product selection criteria.
• Define project management environment.
• Prepare and obtain approval of sponsors for project mandate and approach.
• Define data warehouse objectives.
• Define query library/location matrix.
• Define query libraries.
• Establish meta-data.
• Prepare migration specifications.
• Develop query usage monitoring.
• Technical infrastructure assessment.
• Current technology assessment.
• Release of RFP.

The project plan must also place importance to:

• Change control.
• Risk assessment.
• Training.
• Scope agreements.
• Resources.
• Communication.

2. Elaboration Phase

• Review of RFPs and selecting winning candidate.


• Determine size of the data warehouse.
• Determine complexity and cleanliness of the data.
• Identify number of source databases and their characteristics.
• Select DSS tools.
• Determine network requirements.
• Conduct training.
• Integration of data.
• Plan quality assurance testing procedures.

3. Construction Phase

• Implement components.
• Systems integration.
• Conduct testing.
• Plan deployment.
• Develop support material.

4. Transitions Phase

• Roll-out of modules.
• Managing change-control requests.
• Conduct beta testing to validate the new system against user expectations.
• Perform converting operational databases.
• Evaluate performance and benchmarking.
• Evaluate that deployment baselines are complete.
• Obtain stakeholder feedback on deployment baselines consistency with evaluation
criteria.
• Fine-tune product based on feedback.
• Release product to users.
• Product release milestones.

Data Modeling and Mining


Modeling for data mining

By Hari Mailvaganam

Data mining is conducted against data accumulated in OLTP repositories, data


warehouses, data marts and archived data. The steps for data mining follows the
following pattern:

• data extraction
• data cleansing
• modeling data
• applying data mining algorithm
• pattern discovery
• data visualization

Data extraction and data cleansing can be eased with good data lifecycle management
policies. Very often a data warehousing project will ensure that data extraction and meta-
data standards are pre-defined in an organization.

Data models for operational and archived data are different from data mining models.
Data stored referentially in operational systems are designed for transactional speed.
Figure 1. Data Extraction for Data Mining

In data mining a unified table view is created where data of interest is stored. Most data
mining vendors offer the ability to extract data from repositories and transfer to the data
mining database.

The table view below shows an example for an retailer's data mining database.

Table 1. Table View of Data Mining Database

Not all of the data found in the data mining table view will have relevance. An example is
the first column in Table 1, which has identifier values.

Other data may hold hidden patterns that can be discovered after relevancy is captured,
often with external data sources. An example from Table 1 is the telephone number
column. At first glance it would seem this data set would be insignificant to the data
mining process. However, useful information can be obtained from telephone numbers,
such as telephone exchange location or cell phone usage. This can be obtained from
external data sources which will grade and score the telephone number data set in Table
1.
Figure 2. The Steps taken in Data Mining