You are on page 1of 27

Data Warehousing

ACPCE
 Outline
What is data warehousing
The benefit of data warehousing
Differences between OLTP and data warehousing
The architecture of data warehouse
The main components
Data flows
Tools and technologies
Integration
The importance of managing meta-data
Data marts
 What is data warehousing?
data warehousing is subject-oriented,
integrated, time-variant, and non-volatile
collection of data in support of management’
s decision-making process.
a data warehouse is data management and
data analysis
data webhouse is a distributed data
warehouse that is implement over the web
with no central data repository
goal: is to integrate enterprise wide corporate
data into a single reository from which users
 What is data warehousing?
Subject-oriented WH is organized around the major subjects of the
enterprise..rather than the major application areas.. This is reflected in
the need to store decision-support data rather than application-
oriented data
Integrated because the source data come together from different
enterprise-wide applications systems. The source data is often
inconsistent using..The integrated data source must be made
consistent to present a unified view of the data to the users
Time-variant the source data in the WH is only accurate and valid at
some point in time or over some time interval. The time-variance of
the data warehouse is also shown in the extended time that the data is
held, the implicit or explicit association of time with all data, and the
fact that the data represents a series of snapshots
Non-volatile data is not update in real time but is refresh from OS
on a regular basis. New data is always added as a supplement to DB,
rather than replacement. The DB continually absorbs this new data,
incrementally integrating it with previous data
 The benefits of data
warehousing
The potential benefits of data
warehousing are high returns on
investment..
substantial competitive advantage..
increased productivity of corporate
decision-makers..
 The difference bewteen
OLTP and data warehousing
A DBMS built for online transaction
processing (OLTP) is generally
regarded as unsuitable for data
warehousing because each system is
designed with a differing set of
requirements in mind

example: OLTP systems are design to maximize the


transaction processing capacity, while data warehouses are
designed to support ad hoc query processing
comparision of OLTP systems and data
warehousing system
OLTP systems Data warehousing
systems
Hold current data Holds historical data
Stores detailed data Stores detailed, lightly, and highly
Data is dynamic summarized data
Repetitive processing Data is largely static
High level of transaction throughput Ad hoc, unstructured, and heuristic
Predictable pattern of usage processing
Transaction-driven Medium to how level of transaction
throughput
Application-orented
Unpredictable pattern of usage
Supports day-to-day decisions
Analysis driven
Serves large number of clerical/operation
users Subject-oriented
supports strategic decisions
Serves relatively how number of
managerial users
 Problems
Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
Data homogenization
High demand for resources
Data ownership
High maintenance
Long-duration projects
Complexity of integration
 The architecture
Operati
onal
Reporting, query,
data
source High application development,
1
Query Manage sum and EIS(executive
Oper Meta- mariz information system) tools
ation data ed
Lightl
al data
y
data
sum
sourc Load
mariz
e2
Oper Manager ed
ation data
al Detail
data ed DBMS OLAP(online analytical
sourc data Warehouse Manager processing) tools
en
Oper
ation
al
data
store
(ods)
Operational data store
(ODS)
Arch Data mining
ive/b
acku
p
End-user
data
access
Typical architecture of a data tools
warehouse
 The main components
Operational data sources for the DW is supplied
from mainframe operational data held in first generation
hierarchical and network databases, departmental data held in
proprietary file systems, private data held on workstaions and
private serves and external systems such as the Internet,
commercially available DB, or DB assoicated with and
organization’s suppliers or customers
Operational datastore(ODS) is a repository of
current and integrated operational data used for analysis. It is
often structured and supplied with data in the same way as the
data warehouse, but may in fact simply act as a staging area
for data to be moved into the warehouse
 The main components
load manager also called the frontend component, it
performance all the operations associated with the extraction
and loading of data into the warehouse. These operations
include simple transformations of the data to prepare the data
for entry into the warehouse
warehouse manager performs all the operations
associated with the management of the data in the warehouse.
The operations performed by this component include analysis
of data to ensure consistency, transformation and merging of
source data, creation of indexes and views, generation of
denormalizations and aggregations, and archiving and backing-
up data
 The main components
query manager also called backend component, it
performs all the operations associated with the management of
user queries. The operations performed by this component
include directing queries to the appropriate tables and
scheduling the execution of queries
detailed, lightly and lightly summarized data,
archive/backup data
meta-data
end-user access tools can be categorized into five main
groups: data reporting and query tools, application
development tools, executive information system (EIS) tools,
online analytical processing (OLAP) tools, and data mining
tools
 Data flows
Inflow- The processes associated with the extraction, cleansing, and
loading of the data from the source systems into the data warehouse.

upflow- The process associated with adding value to the data in the
warehouse through summarizing, packaging , packaging, and distribution of
the data

downflow- The processes associated with archiving and backing-up


of data in the warehouse

outflow- The process associated with making the data availabe to the
end-users

Meta-flow- The processes associated with the management of the


meta-data
Operati Reporting, query,application
onal Warehouse development, and EIS
data Manager (executive information system)
source High
1 Meta- tools
Meta- sum
data flow mariz
Lightl ed
Inflo y data Outflo
w sum w
Load Warehouse Manager
mariz OLAP (online
Oper Manage ed Query Manage
Upflo analytical
ation r data
al w processing) tools
Detail
data ed DBMS
sourc data
en
Oper
ation
al
data
store Data mining tools
(ods)
End-user
Downflo access
Arch w tools
ive/b
acku
p
data
Information flows of a data
warehouse
 Tools and Technologies
The critical steps in the construction of a
data warehouse:
a. Extraction
b. Cleansing
c. Transformation
after the critical steps, loading the results into
target system can be carried out either by
separate products, or by a single, categories:
code generators
database data replication tools
dynamic transformation engines
 Data Warehouse DBSM
(integration)
due to the maturity of such products, most
relational databases will integrate predictably
with other types of software
The reqirements for data warehouse RDBMS
Load performance
Load processing
Data quality management
Query perfomance
Terabyte scalability
Mass user scalability
Networked data warehouse
Warehouse administration
Integrated dimensional analysis
Advanced query funtionlity
 The importance of managing
meta-data(integration)
The integration of meta-data, that is ”data about data”
Meta-data is used for a variety of purposes and the
management of it is a critical issue in achieving a fully
integrated data warehouse
The major purpose of meta-data is to show the pathway back
to where the data began, so that the warehouse administrators
know the history of any item in the warehouse
The meta-data associated with data transformation and loading
must describe the source data and any changes that were
made to the data
The meta-data associated with data management describes
the data as it is stored in the warehouse
The meta-data is required by the query manager to generate
appropriate queries, also is associated with the user of queries
The major integration issue is how to synchronize the various
types of meta-data use throughout the data warehouse. The
challenge is to synchronize meta-data between different
products from different vendors using different meta-data
stores
Two major standards for meta-data and modeling in the areas
of data warehousing and component-based development-MDC
(Meta Data Coalition) and OMG(Object Management Group)
 Administration and
Management Tools
a data warehouse requires tools to support the
administration and management of such complex
enviroment.
for the various types of meta-data and the day-to-
day operations of the data warehouse, the
administration and management tools must be
capable of supporting those tasks:
monitoring data loading from multiple sources
data quality and integrity checks
managing and updating meta-data
monitoring database performance to ensure efficient query
response times and resource utilization
auditing data warehouse usage to provide user chargeback
information
replicating, subsetting, and distributing data
maintaining effient data storage management
purging data;
archiving and backing-up data
implementing recovery following failure
security management
 Data mart
data mart a subset of a data
warehouse that supports the
requirements of particular department
or business function
The characteristics that differentiate
data marts and data warehouses
include:
a data mart focuses on only the requirements of users
associated with one department or business function
data marts do not normally contain detailed operational data,
unlike data warehouses
as data marts contain less data compared with data
warehouses, data marts are more easily understood and
navigated
Operati
onal Warehouse
data Manager High
source
1 sum Reporting, query,application developmen
Meta-
Oper data mariz and EIS(executive information system)
ation Lightl ed tools
al y data
data sum Query
sourc Load Warehouse
mariz Manager Manag
e2 Manage ed e
Oper
ation r data
al Detail OLAP(online analytical
data
ed
data
DBMS processing) tools
sourc
en
Oper
ation
al
data
store Data mining
(ods) (First
(Third
Tier)
Tier)
Operational data store Arch
(ODS) ive/b End-user
acku
p access
tools
data Data
summ
Mart
arized
data
Sum
(Relati
mari
onal
zed
databa
data
se)
(Mult
i-
dime (Second
nsio Tier)
Typical data warehouse adn data mart n
data
architecture base
)
Reasons for creating a data
mart
To give users access to the data they need to analyze most
often
To provide data in a form that matches the collective view of
the data by a group of users in a department or business
function
To improve end-user response time due to the reduction in the
volume of data to be accessed
To provide appropriately structured data as ditated by the
requirements of end-user access tools
Normally use less data so tasks such as data cleansing,
loading, transformation, and integration are far easier, and
hence implementing and setting up a data mart is simpler than
establishing a corporate data warehouse
The cost of implementing data marts is normally less than that
required to establish a data warehouse
The potential users of a data mart are more clearly defined and
can be more easily targeted to obtain support for a data mart
project rather than a corporate data warehouse project
data marts issues
data mart functionality the capabilities of data
marts have increased with the growth in their popularity

data mart size the performance deteriorates


as data marts grow in size, so need to reduce the
size of data marts to gain improvements in
performance
data mart load performance two critical
components: end-user response time and data
loading performance to increment DB updating so
that only cells affected by the change are updated
and not the entire MDDB structure
users’ access to data in multiple marts one
approach is to replicate data between different data marts or,
alternatively, build virtual data mart it is views of several
physical data marts or the corporate data warehouse tailored to
meet the requirements of specific groups of users
data mart internet/intranet access it’s products sit
between a web server and the data analysis product.Internet/intranet
offers users low-cost access to data marts and the data WH using
web browsers.

data mart administration organization can not easily


perform administration of multiple data marts, giving rise to issues
such as data mart versioning, data and meta-data consistency and
integrity, enterprise-wide security, and performance tuning . Data mart
administrative tools are commerciallly available

data mart installation data marts are becoming


increasingly complex to build. Vendors are offering products
referred to as ”data mart in a box” that provide a low-cost
source of data mart tools

You might also like