
2.1. Data warehouse
Consider a bank that has different OLTP applications for handling accounts and deposits, loans,
and credit cards. Let us assume that the OLTP applications for handling accounts and deposits
run on mainframes and store their data in an IBM DB2 database. The OLTP application for loans
uses an Oracle database for data storage. The credit card application uses a totally different
database to store its data. The information about the bank employees is stored in different files.
If the bank decides to do a market analysis, it will require data from all these systems. To do the
analysis, the BI application requires the data from these different applications to be stored in a
single location, i.e. the data warehouse.

Data Warehouse can be defined as a Subject-Oriented, Integrated, Time-Variant, Non-volatile
collection of data, enabling management decision making. A data warehouse is specifically
structured for dynamic queries (Queries that are not pre-processed and are prepared and
executed at run time), and fast and efficient business analytics. The data warehouse will act as
the source of data for reports and analytics.

Features of data warehouse

Subject-oriented

A data warehouse (DW or DWH) can be used to analyse different subject areas. For example, in
a retail enterprise, 'Sales' can be a particular subject; 'Order processing' could be another
subject, etc.

An Enterprise Data Warehouse (EDW) typically stores data from all or most of the subject areas
of an enterprise, thereby enabling cross-functional analysis and enterprise reporting. Data marts
(discussed later in this document) help analyse a particular subject area.

Integrated
Data from multiple source systems can be loaded into a data warehouse. Hence a DW is integrated
in nature: it contains data arriving from multiple source systems.

For example, consider a company located in multiple cities across the country. The business uses
different source systems to capture transaction data in each city, and the data from all
these systems is integrated and stored in the data warehouse.

Time-variant
Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months,
6 months or 12 months ago, or even older data, from a data warehouse. This contrasts with a
transaction system, where often only the most recent data is kept.

Non-volatile
Data in the data warehouse is seldom over-written and never deleted. The data is static, mostly
read-only, and retained for future reporting.
2.2. Need For Data warehouse
Scenario 1:

ABC Pvt. Ltd is a company with branches in Mumbai, Delhi, Chennai and Bangalore. Each
branch has a separate operational system.
ABC's Sales Vice President (VP) wants to get a quarterly sales report.
(Note: Quarterly means every 3 months)

Here the problem is that the company operates in multiple locations, and in each location it
uses a different system to capture business data. To analyse how the company as a whole is
performing, we need to have all the data in a single data storage system, preferably a relational
database. The Sales VP can then look into the single storage area to get knowledge of the entire
company's operation.

Solution:

Here we integrate data from the four different source systems located in Mumbai, Delhi,
Chennai and Bangalore and load it into a single data store, the data warehouse. The data is
also saved in a format that will help the Sales VP to do his analysis. Now the Sales VP can use
the data warehouse to get information on, and make inferences about, the business operations.
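
The integration step in this scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the branch rows, the in-memory list standing in for the warehouse, and the function names load_warehouse and quarterly_sales are all hypothetical, and "quarter" is taken to mean a calendar quarter.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales rows as they might arrive from each branch system:
# (sale_date, amount)
branch_sales = {
    "Mumbai":    [(date(2013, 4, 10), 1200.0), (date(2013, 7, 2), 800.0)],
    "Delhi":     [(date(2013, 5, 21), 950.0)],
    "Chennai":   [(date(2013, 6, 30), 400.0), (date(2013, 8, 15), 650.0)],
    "Bangalore": [(date(2013, 9, 1), 1000.0)],
}

def load_warehouse(sources):
    """Integrate every branch's rows into one list, tagging each row with its branch."""
    warehouse = []
    for branch, rows in sources.items():
        for sale_date, amount in rows:
            warehouse.append(
                {"branch": branch, "sale_date": sale_date, "amount": amount}
            )
    return warehouse

def quarterly_sales(warehouse):
    """Total sales per (year, calendar quarter) across all branches."""
    totals = defaultdict(float)
    for row in warehouse:
        quarter = (row["sale_date"].month - 1) // 3 + 1
        totals[(row["sale_date"].year, quarter)] += row["amount"]
    return dict(totals)

warehouse = load_warehouse(branch_sales)
report = quarterly_sales(warehouse)
```

Because every row carries its branch name, the same integrated data can also be totalled per branch without going back to the four source systems.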

The below figure depicts the solution provided


Scenario 2:

One-Stop-Shopping Super Market has a huge operational (OLTP) database.


Whenever Executives try to generate a report from the OLTP system, the OLTP system
becomes slow and data entry operators have to wait for some time.

Here the problem is that the company stores its operational data in a huge operational database.
If the business uses the same database for its reporting needs, the system becomes slow due to
overload, and this affects business operations. The below diagram gives an idea about this
problem.

When management uses the operational database for reporting, data entry operators have to
wait until the operational database is free of reporting jobs.
Solution:

To avoid this problem, we can load data from the operational database into a separate data
warehouse at specified time intervals (typically during non-business hours), and thus leave the
operational database free for the data entry operators. The data warehouse will support the
reporting needs without impacting the operational database.

The below diagram gives an overview about the solution

Scenario 3:

Connect-all is a leading mobile communication service provider in the country. It offers
post-paid and prepaid mobile services to its customers under various tariff plans.
The management wants to find out the usage pattern of its various services by
customers in different age groups and from different economic backgrounds for the past ten years.
Here the problem is that to do such an analysis, the data for the past ten years must be present
in the database. If that volume of data were stored in the operational database, it would
make the OLTP system very slow.
If the OLTP database stored all the transactional history data for ten years,
even simple operations like insert, update or delete would take a lot of time to execute.

Solution:

To avoid this problem, we can store the history data required for analysis in a separate data
store. This will make sure that the OLTP system is not slowed down by the history data. We can
also make sure that only the data required for analysis is brought forward to the data warehouse.
For example, the contact address of each and every customer will be stored in the OLTP system,
but it is not required for analysis, so we do not have to bring it into the data warehouse.

In this scenario, the details of each and every call made by a customer might not be required for
analysis. Let us assume that only the total call duration and the number of calls made per week
are required for analysis. In that case only the summarized information has to be stored in the
data warehouse.
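
The weekly summarization described above can be sketched as follows. The call records, the customer ids and the function name summarize_weekly are hypothetical, and a "week" is assumed here to mean an ISO calendar week; a real warehouse might define its weeks differently.

```python
from collections import defaultdict
from datetime import date

# Hypothetical call detail records from the OLTP system:
# (customer_id, call_date, duration_in_seconds)
call_records = [
    ("C001", date(2014, 1, 6), 120),
    ("C001", date(2014, 1, 8), 300),
    ("C001", date(2014, 1, 14), 60),
    ("C002", date(2014, 1, 7), 45),
]

def summarize_weekly(records):
    """Roll call details up to one row per customer per ISO week."""
    summary = defaultdict(lambda: {"calls": 0, "total_seconds": 0})
    for customer_id, call_date, seconds in records:
        iso = call_date.isocalendar()          # (ISO year, ISO week, weekday)
        key = (customer_id, iso[0], iso[1])
        summary[key]["calls"] += 1
        summary[key]["total_seconds"] += seconds
    return dict(summary)

weekly = summarize_weekly(call_records)
```

Only the rows in weekly would be loaded into the data warehouse; the individual call records stay in the OLTP system.
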
To conclude, businesses require a data warehouse for the following reasons:

• Integrated information from heterogeneous sources

• Multiple ways to view business performance

• Low cycle time, faster analytics

• Performance optimization - OLTP systems get overloaded with large analytical queries

• OLTP systems are not built to hold history data

Characteristics of a Data warehouse

• Stores large volumes of data used frequently by DSS (Decision Support Systems)

• Is maintained separately from operational databases

• Is relatively static, with infrequent updates

• Contains data integrated from several, possibly heterogeneous, operational databases

• Supports queries that process large data volumes

2.3. Data warehouse Architecture


Data warehouse Architecture
The source of the data in the data warehouse might be different databases or file systems from
different OLTP applications, data files from external vendors etc. The different data sources
might be in different physical locations. To bring the data to a common location, the data might
have to be transmitted over the network.

The different source systems might not be available all the time. There might be time constraints
like 'Data from one application can be read only between 1 am and 5 am' and 'Data from a second
application can be read only between 6 pm and 10 pm'. Due to these timing constraints, we
maintain a 'Staging area', a temporary area where we store data after extraction from the
sources.

Once all the data from different sources are available in the staging area, we do the necessary
data cleansing operations, integrate the data, transform data (perform calculations,
summarization etc.), and finally load the data into the data warehouse.
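
The flow from staging area to warehouse can be sketched as a small pipeline. Everything named here is an assumption made for illustration: the two source names, the row layout, the cleansing rule (trim and title-case the customer name, reject unparsable amounts) and the dictionary standing in for the warehouse. Real ETL tools perform the same steps at much larger scale.

```python
# Hypothetical raw extracts, as copied into the staging area from two sources.
staging_area = {
    "loans_system": [
        {"cust": " alice ", "amount": "1000.50"},
        {"cust": "BOB", "amount": "n/a"},   # bad amount: rejected during cleansing
    ],
    "cards_system": [
        {"cust": "Alice", "amount": "250"},
        {"cust": "bob", "amount": "75.25"},
    ],
}

def cleanse(row):
    """Standardise the customer name and parse the amount; None if unusable."""
    try:
        amount = float(row["amount"])
    except ValueError:
        return None
    return {"customer": row["cust"].strip().title(), "amount": amount}

def integrate(staging):
    """Combine cleansed rows from every source, recording where each came from."""
    combined = []
    for source, rows in staging.items():
        for row in rows:
            clean = cleanse(row)
            if clean is not None:
                combined.append({**clean, "source": source})
    return combined

def transform(rows):
    """Summarise: total amount per customer across all sources."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals

def load(warehouse, totals):
    """Load the summarised rows into the (in-memory) warehouse table."""
    warehouse.update(totals)

data_warehouse = {}
load(data_warehouse, transform(integrate(staging_area)))
```

Keeping cleanse, integrate, transform and load as separate functions mirrors the order of operations in the paragraph above and makes each step easy to check on its own.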

Data warehouse Building Blocks

Source systems (Systems of Record) are where the data comes from.

E.g. Online transaction processing, Mainframe, External applications

Extract, Transform and Load (ETL) jobs move data between different data stores, applying
business transformations and cleansing rules.
ETL jobs are used to move data from source to stage, and then from stage to target DW.
Examples of ETL tools are: Informatica, IBM DataStage, Ab Initio

Staging Area is a temporary storage area for the source data. A staging area is required in Data
warehouse architecture due to timing reasons. Data from all the different source systems must
be available before data can be integrated into the Data Warehouse. Due to varying business
cycles, data processing cycles, hardware and network resource limitations and geographical
factors, it is not feasible to extract data from all Operational databases at exactly the same time.
So the source data from each source system is first copied to the staging area. Once all the
required data is available in the staging area, it can be integrated into the data warehouse.

Data Warehouse (DW) - Example of DW databases: Teradata, IBM Netezza, Oracle

Metadata repository is a database used to store metadata. Metadata is data about data.
Metadata describes what (tables, columns, etc.) is available and where.

For example, consider a customer table which has information about the customer. Here the
metadata about the customer table might have information like the table name, column names,
data types of columns, constraints (e.g. the column cust_gender should only have values 'male'
or 'female'), format rules (e.g. the customer name should be in the format
First-name Middle-name Last-name), and how often the table is updated.
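
As an illustration, the metadata for the customer table could be held in a simple structure and used to check incoming rows. The column name cust_name, the data types, the refresh frequency and the function name violations are assumptions made for this sketch; only cust_gender and the two rules come from the example above.

```python
import re

# Hypothetical metadata entry for the customer table described above.
customer_table_metadata = {
    "table_name": "customer",
    "refresh_frequency": "daily",   # assumed value, for illustration only
    "columns": {
        "cust_name": {
            "data_type": "VARCHAR(100)",
            # Format rule: First-name Middle-name Last-name
            "format": r"^[A-Za-z]+ [A-Za-z]+ [A-Za-z]+$",
        },
        "cust_gender": {
            "data_type": "VARCHAR(6)",
            # Constraint: only these two values are allowed
            "allowed_values": {"male", "female"},
        },
    },
}

def violations(row, metadata):
    """Return the names of the columns in `row` that break a metadata rule."""
    bad = []
    for name, rules in metadata["columns"].items():
        value = row.get(name, "")
        if "allowed_values" in rules and value not in rules["allowed_values"]:
            bad.append(name)
        elif "format" in rules and not re.match(rules["format"], value):
            bad.append(name)
    return bad
```

For instance, violations({"cust_name": "Asha Kumar", "cust_gender": "F"}, customer_table_metadata) reports both columns, because the name has only two parts and 'F' is not an allowed value.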

The goal of the metadata repository is to provide a central, easily accessible, and easily
updateable repository of complete and accurate metadata about the entire BI application. The
metadata will provide both developers and end-users with full documentation and support for all
OLAP analysis. In addition, thorough metadata is expected to result in higher overall
data quality and data completeness in the data warehouse.

Operational Data Store (ODS) is a repository of current and integrated operational data used for
analysis - explained in detail below.

Data mart is a miniature data warehouse that supports the requirements of a particular
department or business process – explained in detail below.

OLAP Analysis is complex data analysis performed through active user queries.

E.g. What were the top 5 selling products in 2013? What were the total sales in Q1 of FY 2013-14?
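
The first question above can be expressed as a single aggregate query. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the sales table, its columns and its rows are hypothetical, and "top selling" is assumed to mean the highest number of units sold.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, sale_date TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Rice", "2013-01-15", 40),
        ("Tea", "2013-02-03", 25),
        ("Soap", "2013-03-20", 30),
        ("Rice", "2013-06-11", 35),
        ("Sugar", "2013-07-07", 20),
        ("Milk", "2013-09-19", 50),
        ("Salt", "2013-11-02", 10),
        ("Tea", "2014-01-05", 99),   # outside 2013, so the query ignores it
    ],
)

# Aggregate units per product for 2013 and keep the five largest totals.
top_5_products_2013 = conn.execute(
    """
    SELECT product, SUM(quantity) AS units_sold
    FROM sales
    WHERE sale_date BETWEEN '2013-01-01' AND '2013-12-31'
    GROUP BY product
    ORDER BY units_sold DESC
    LIMIT 5
    """
).fetchall()
```

Queries of this kind scan and aggregate many rows at once, which is why they are better run against the data warehouse than against the OLTP system.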

2.4. Operational Data Store and Data Marts


Operational Data Store
An operational data store (ODS) is a data store used to analyse near real-time data (real time
means as and when things are happening) - to solve day-to-day problems. The content may
span many subject areas, but little history is retained.
It is used for integrating disparate data from multiple sources, so that business operations,
analysis and reporting can be carried out while business operations are happening. In some data
warehouse designs, ODS is also used as a source of data to a DW.
ODS is designed for queries on transactional data. The queries that are run on ODS are
relatively simple queries applied on small amounts of data (such as finding the status of a
customer order), rather than the complex queries that run on large amounts of data typical of the
data warehouse.

For example, consider a bank which needs answers to operational questions like “How many
defaulters were there for loan payments in the last month?” or an Insurance company which
needs information on “What payments have been received in the last 30 days”. Such analysis
can be done using an ODS.
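
An ODS query of this kind is typically a simple filter over recent data. The sketch below assumes a hypothetical list of payment rows and a function name payments_in_last_days; the 30-day window is taken to include both the cutoff date and the current date.

```python
from datetime import date, timedelta

# Hypothetical payment rows held in the ODS.
payments = [
    {"payment_id": "P1", "received_on": date(2014, 3, 25), "amount": 500.0},
    {"payment_id": "P2", "received_on": date(2014, 3, 5), "amount": 120.0},
    {"payment_id": "P3", "received_on": date(2014, 2, 10), "amount": 900.0},
]

def payments_in_last_days(rows, today, days=30):
    """Return the payments received within `days` days up to and including `today`."""
    cutoff = today - timedelta(days=days)
    return [row for row in rows if cutoff <= row["received_on"] <= today]

recent = payments_in_last_days(payments, today=date(2014, 3, 31))
```

The query touches only a small, recent slice of the data, in contrast with the multi-year aggregations run on a data warehouse.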

Comparison between ODS and DW

An ODS is generally used for near real time analysis, while a data warehouse is used for OLAP
analysis.

ODS generally contains operational data while data warehouse has enterprise wide information
including history.

The queries run on ODS are relatively simple queries on small amounts of data, rather than the
complex queries on large amounts of data typical of the data warehouse.

An ODS is updated (loaded) more frequently than a DW.

Data marts

A Data Warehouse contains data from multiple subject areas (Departments) in an organization. A
data mart is a simple form of a data warehouse that is focused on a single subject (or functional
area), such as Sales, Finance, or Marketing. Data marts are often built and controlled by a single
department within an organization. Given their single subject focus, data marts usually draw data
from only a few sources. The sources could be internal operational systems, a central data
warehouse, or external data.

Consider an insurance company which sells property insurance, automobile insurance, and
commercial insurance. It can create a data warehouse which contains data from all the subject
areas (property insurance, automobile insurance and commercial insurance), in which case the
warehouse can be called an 'Enterprise Data Warehouse' or EDW. This can be used for
overall analysis. The company can also (or instead) decide to create separate data stores for
each insurance type, in which case the analysis will be limited to the respective insurance type.
Each such store is a data mart: it contains data that belongs to one specific subject area in the
organization.
Data Mart and Data Warehouse - A comparison

A data warehouse, unlike a data mart, deals with multiple subject areas and is typically
implemented and controlled by a central organizational unit such as the Corporate Information
Technology (IT) group. Often, it is called a central or enterprise data warehouse. Typically, a data
warehouse assembles data from multiple source systems.

The below table summarizes the basic differences between a data warehouse and a data mart:

2.5. Data warehousing approaches/ architectures


There are two popular approaches:

• Top down approach

• Bottom up approach

Other architectures in use include:

• Virtual Data Warehouse

• Enterprise Data Warehouse

• Data marts

• Distributed Data mart

Top Down approach:


This approach was proposed by Bill Inmon. In this approach, an Enterprise Data Warehouse is
built first, and data marts are later derived from it. In this approach, Data Warehouse is designed
in Normalized form. The below figure gives an idea about the Top down approach. “3NF EDW”
represents the Enterprise Data warehouse that stores data in 3rd Normal Form (highly
normalized).

Bottom-up approach:
This approach was proposed by Ralph Kimball. In the Bottom up approach, data marts for each
subject area are built first. The Data Warehouse is built later by combining/connecting the
various data marts through shared dimensions called Conformed Dimensions (explained in
detail in the 'Dimensional Modeling' module of this course).
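
As a rough illustration of how a shared dimension connects separately built data marts, the sketch below joins two hypothetical marts through one product dimension. The marts, the column names and the function name drill_across are all assumptions; the full treatment of conformed dimensions belongs to the Dimensional Modeling module.

```python
# Conformed dimension shared by both marts (hypothetical data).
product_dim = {1: "Rice", 2: "Tea"}

# Two separately built data marts, each keyed by the same product_id.
sales_mart = [
    {"product_id": 1, "units_sold": 75},
    {"product_id": 2, "units_sold": 25},
]
inventory_mart = [
    {"product_id": 1, "units_in_stock": 20},
    {"product_id": 2, "units_in_stock": 60},
]

def drill_across(dim, sales, inventory):
    """Combine facts from both marts through the shared product dimension."""
    stock = {row["product_id"]: row["units_in_stock"] for row in inventory}
    return [
        {
            "product": dim[row["product_id"]],
            "units_sold": row["units_sold"],
            "units_in_stock": stock[row["product_id"]],
        }
        for row in sales
    ]

combined = drill_across(product_dim, sales_mart, inventory_mart)
```

The join works only because both marts identify products in the same way; that shared definition is what makes the dimension conformed.
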
Enterprise Data Warehouse

Some organizations might decide to build an Enterprise Data Warehouse for all their reporting
needs. In this case, all reports and analysis are done from the EDW.

Data marts
Smaller businesses may decide to build data marts for all their BI requirements. A data mart can
be less expensive to implement than a data warehouse, which makes it more practical for small
businesses.

In this case the organisation has a separate data mart for the analysis requirements of each of
its departments.

Virtual Data warehouse


In this approach the reports and analysis are sourced directly from the operational databases. The
source systems themselves are treated as the data warehouse, and the reporting tool has to handle
the data complexities. While economically viable for very small scale businesses, this is not a
good approach for reporting or analytics.

Dependent and Independent data marts

Data marts in the Top down approach are called 'Dependent data marts', whereas data marts in
the Bottom up approach are called 'Independent data marts'. Dependent data marts draw data
from a central data warehouse that has already been created. Independent data marts, in
contrast, are standalone systems that draw data directly from operational or external sources of
data, or both.
