The term Data Warehouse was coined by Bill Inmon, the "father of data warehousing", in 1990, which he defined in the following way:
- Data warehousing is combining data from multiple and usually varied sources into one comprehensive and easily manipulated database.
- Common ways of accessing a data warehouse include queries, analysis and reporting.
- Because data warehousing creates one database in the end, the number of sources can be anything you want it to be, provided that the system can handle the volume.
- The final result, however, is homogeneous data, which can be more easily manipulated.
- Data warehousing is commonly used by companies to analyze trends over time.
- In other words, companies may very well use data warehousing to view day-to-day operations, but its primary function is facilitating strategic planning resulting from long-term data overviews.
- From such overviews, business models, forecasts, and other reports and projections can be made.
- Because the data stored in data warehouses is intended to provide more overview-like reporting, it is routinely read-only for the people who query it.
- This is not to say that data warehousing involves data that is never updated.
- On the contrary, the data stored in data warehouses is updated all the time. It is the reporting and the analysis that take more of a long-term view.
- Data warehousing is not the be-all and end-all for storing all of a company's data.
- Rather, data warehousing is used to house the necessary data for specific analysis.
- More comprehensive data storage requires different capacities that are more static and less easily manipulated than those used for data warehousing.
- Data warehousing is typically used by larger companies analyzing larger sets of data for enterprise purposes.
- Smaller companies wishing to analyze just one subject, for example, usually access data marts, which are much more specific and targeted in their storage and reporting.
- Data warehousing often includes smaller amounts of data grouped into data marts.
- In this way, a larger company might have at its disposal both data warehousing and data marts, allowing users to choose the source and functionality depending on current needs.
- Data Warehousing is open to an almost limitless range of definitions.
- Simply put, Data Warehouses store an aggregation of a company's data.
- Data Warehouses are an important asset for organizations to maintain efficiency, profitability and competitive advantages. Organizations collect data through many sources - Online, Call Center, Sales Leads, Inventory Management.
- The data collected have degrees of value and business relevance.
- As data is collected, it is passed through a 'conveyor belt', called Data Life Cycle Management.

Overview of Data Warehousing Infrastructure
- The pre-Data Warehouse zone provides the data for data warehousing. Data Warehouse designers determine which data contains business value for insertion.
- OLTP (online transaction processing) databases are where operational data are stored. OLTP databases can reside in transactional software applications such as Enterprise Resource Planning (ERP), Supply Chain, Point of Sale, and Customer Service software. OLTPs are designed for transaction speed and accuracy.
- Metadata ensures the integrity and accuracy of data entering the data lifecycle process.
- Metadata ensures that data has the right format and relevance.
- Organizations can take preventive action to reduce the cost of the ETL stage by having a sound metadata policy.
- The commonly used terminology to describe metadata is "data about data".
  

- Before data enters the data warehouse, the extraction, transformation and loading (ETL) process cleans the data and ensures that it passes the data quality threshold.
- ETLs are also responsible for running scheduled tasks that extract data from OLTPs.
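The extract, transform and clean steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`order_id`, `amount`, `region`), the in-memory source rows and the 0.5 quality threshold are all hypothetical and do not come from any particular ETL product.

```python
from typing import Iterable, List, Optional

# Hypothetical rows as they might come out of an OLTP system.
OLTP_ROWS = [
    {"order_id": "1", "amount": "19.90", "region": " north "},
    {"order_id": "2", "amount": "n/a", "region": "South"},
    {"order_id": "3", "amount": "5.00", "region": "south"},
]


def extract(rows: Iterable[dict]) -> List[dict]:
    """Extract: copy rows out of the source so the source is not modified."""
    return [dict(row) for row in rows]


def transform(row: dict) -> Optional[dict]:
    """Transform and clean one row; return None if it cannot be repaired."""
    try:
        amount = float(row["amount"])
    except ValueError:
        return None
    return {
        "order_id": int(row["order_id"]),
        "amount": amount,
        "region": row["region"].strip().lower(),
    }


def run_etl(rows: Iterable[dict], min_quality: float = 0.5) -> List[dict]:
    """Return cleaned rows only if enough of the batch passes cleaning."""
    extracted = extract(rows)
    cleaned = [t for t in map(transform, extracted) if t is not None]
    quality = len(cleaned) / len(extracted) if extracted else 1.0
    if quality < min_quality:
        raise ValueError(f"batch quality {quality:.2f} is below the threshold")
    return cleaned


warehouse_rows = run_etl(OLTP_ROWS)
```

Here the row with the unparseable amount is dropped during cleaning, and the batch is still accepted because two of three rows pass the assumed threshold. A real ETL job would also log rejected rows and run on a schedule.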
- Many of the concepts and practices of data warehousing have existed for years, but it is only within the last few years that the term has acquired "buzzword" status.
- While it is true that software is available for automating some of the data warehouse processing, a data warehouse is not a product - it is not something that can be purchased from a vendor.
- Rather, it is a model of a corporation's data, put together in such a way that it answers the corporation's business questions.
   
  

 
Data warehouse      Operational system
subject oriented    application oriented
integrated          multiple diverse sources
time-variant        real-time, current
nonvolatile         updateable
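The contrast between "updateable, current" and "nonvolatile, time-variant" can be sketched in Python. The class names `OperationalStore` and `WarehouseStore` and the customer-balance example are hypothetical. The sketch assumes that an operational system overwrites the current value, while a warehouse appends a timestamped snapshot and never changes earlier ones.

```python
from datetime import date


class OperationalStore:
    """Updateable: holds only the current value for each key."""

    def __init__(self):
        self.current = {}

    def update(self, key, value):
        self.current[key] = value  # the previous value is overwritten


class WarehouseStore:
    """Nonvolatile and time-variant: snapshots are appended, never changed."""

    def __init__(self):
        self.snapshots = []  # list of (as_of, key, value) tuples

    def load(self, as_of, key, value):
        self.snapshots.append((as_of, key, value))

    def history(self, key):
        return [(d, v) for d, k, v in self.snapshots if k == key]


oltp, dw = OperationalStore(), WarehouseStore()
for as_of, balance in [(date(2024, 1, 31), 100), (date(2024, 2, 29), 140)]:
    oltp.update("cust-1", balance)
    dw.load(as_of, "cust-1", balance)

print(oltp.current["cust-1"])  # only the latest balance survives
print(dw.history("cust-1"))    # both month-end snapshots are kept
```

Because the warehouse side keeps every snapshot, a trend over time can be read from it, which the operational side cannot provide on its own.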
- One of the fundamental assumptions of data warehousing is that operational data needs to be stored separately in a different format in order to support data analysis.
- This diagram illustrates the fact that different sets of users access the data, using different sets of applications and for different purposes.
       

   
"Most organizations need two different data environments, one optimized for operational applications and one optimized for informational applications. For example, operational applications and databases are typically optimized for fast response time and typically cannot tolerate the impact on response time created when accessed by an informational application. The two types of applications are fundamentally different. If the same data environment is used to support both, the performance, capability and benefit of both will be compromised."
Within a data warehouse implementation itself, the following types of data will be required to support typical uses:
- Real-time Data
- Reconciled Data
- Derived Data
- Changed Data
- Metadata
Typical operation                 Update      Report                  Analyze
Level of analytical requirements  Low         Medium                  High
Screens                           Unchanging  User-defined            User-defined
Amount of data per transaction    Small       Small to large          Large
Data level                        Detail      Detail to summary       Mostly summary
Age of data                       Current     Historical and current  Historical, current and projected
Orientation                       Records     Records                 Arrays
Data warehousing systems target at least three different types of applications:
- personal productivity
- query and reporting
- planning and analysis
- Personal productivity applications, such as spreadsheets, statistical packages and graphics tools, are useful for manipulating and presenting data on individual PCs. Developed for a standalone environment, these tools address applications requiring only small volumes of warehouse data.
- Query and reporting applications deliver warehouse-wide data access through simple, list-oriented queries, and the generation of basic reports. These reports provide a view of historical data but do not address the enterprise need for in-depth analysis and planning.
- Planning and analysis applications address such essential business requirements as budgeting, forecasting, product line and customer profitability, sales analysis, financial consolidations and manufacturing mix analysis - applications that use historical, projected and derived data.
- "These planning and analysis requirements, referred to as on-line analytical processing (OLAP) applications, share a set of user requirements that cannot be met by applying query tools against the historical data maintained in the warehouse repository.
- The planning and analysis function mandates that the organization look not only at past performance but, more importantly, at the future performance of the business.
- It is essential to create operational scenarios that are shaped by the past, yet also include planned and potential changes that will impact tomorrow's corporate performance.
- The combined analysis of historical data with future projections is critical to the success of today's corporation."
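A rough Python sketch of combining historical data with a projection follows. The sales figures are invented for illustration, and the projection rule (last value plus the average yearly change) is an assumption chosen for brevity. It is not a method prescribed by the quoted text or by any OLAP product.

```python
from collections import defaultdict

# Hypothetical historical facts: (year, region, sales).
HISTORY = [
    (2021, "north", 100.0), (2022, "north", 110.0), (2023, "north", 120.0),
    (2021, "south", 80.0), (2022, "south", 70.0), (2023, "south", 90.0),
]


def by_region(facts):
    """Roll the facts up into {region: {year: sales}}."""
    cube = defaultdict(dict)
    for year, region, sales in facts:
        cube[region][year] = cube[region].get(year, 0.0) + sales
    return cube


def project_next_year(series):
    """Naive projection: last value plus the average yearly change."""
    years = sorted(series)
    if len(years) < 2:
        raise ValueError("need at least two years of history")
    first, last = years[0], years[-1]
    avg_change = (series[last] - series[first]) / (last - first)
    return last + 1, series[last] + avg_change


cube = by_region(HISTORY)
for region, series in cube.items():
    year, value = project_next_year(series)
    series[year] = value  # derived, projected data sits beside the history

# Company-wide totals for the last actual year and the projected year.
totals = {year: sum(s[year] for s in cube.values()) for year in (2023, 2024)}
```

The point of the sketch is that historical, derived and projected values live in the same structure, so one analysis can cover both past and planned performance.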
- Data warehousing is being hailed as one of the most strategically significant developments in information processing in recent times.
- One of the reasons for this is that it is seen as part of the answer to information overload.
- Has a subject area orientation
- Integrates data from multiple, diverse sources
- Allows for analysis of data over time
- Adds ad hoc reporting and enquiry
- Provides analysis capabilities to decision makers
- Relieves the development burden on IT
- Provides improved performance for complex analytical queries
- Relieves processing burden on transaction oriented databases
- Allows for a continuous planning process
- Converts corporate data into strategic information
- The processes run from left to right, with a feedback loop from the users. One of the very clear lessons of data warehousing is that you don't build one in the way you build a house. Iteration and refinement are vital. The key is to start small, and then evolve the data warehouse as the needs develop.
- Flexibility and the ability to adapt to changing business needs are essential. Some vendors are beginning to talk about tools for automating maintenance. For this to happen, the management of the metadata needs to become more tightly integrated into the data warehousing process.
The actual design process for developing a data warehouse runs from right to left in this diagram.
- talk to the users
- determine their needs in terms that can be measured
- design a database to support those needs
- document the data descriptions and other attributes (this will ultimately include data sources, time stamps, data meanings that change over time, etc.)
- design the logic for translating data from various sources into an integrated data store
- write the code for extracting data from the various sources and transforming it into the data warehouse, with updates to the metadata
- and finally, package the procedures to handle scheduling, management and maintenance
- The functions that are desired as part of a data warehousing solution are shown in the accompanying diagram.
- This illustrates the flow of data from originating sources to the user, and includes management and implementation aspects.
- It starts with access mechanisms for retrieving data from heterogeneous operational data sources.
- That data is replicated via a transformation model and stored in the data warehouse.
- The definition of data elements in the data warehouse and in the data sources, and the transformation rules that relate them, are referred to as 'metadata'.
- Metadata is the means by which the end-user finds and understands the data in the warehouse.
- The data transformation and movement processes are executed whenever an update to the warehouse data is desired.
- Different parts of the warehouse may require updates at different times, some at regular intervals such as weekly or monthly, and some on specified dates.
- There should be a capability to manage and automate the processes required to perform these functions.
- Particularly in a multi-vendor environment, adopting an architecture with open interfaces would facilitate the integration of the products that implement these functions.
- Quality consulting services can be an important factor in assuring a successful and cost-effective implementation.

- The simplest definition of the term metadata is "data about data".
- A data warehouse unlocks the data held in corporate databases but only if business users are able to find out about the data and information objects (queries, analyses, reports, etc.) that are stored there.
- Such a facility is known as metadata or an information directory.
- An information directory needs to hold more than the names of the tables with their elements and data types.
- It needs to hold information for technical tasks as well as for business tasks.
- Much of the data needed by technical users will exist in a variety of places, such as program libraries, DBMS system catalogs, CASE tools, etc.
- Some of these places will include information of interest to business users as well, such as data descriptions and meanings.
- As Colin White points out, "one key objective of an information directory is to be able to integrate this diverse set of metadata, and then provide easy access to it for data warehouse developers, administrators, and business users."
- The information directory holds information about the design of the data warehouse, including a history of changes that occur over time.
- It includes all the information necessary for administering the data warehouse - authorization, archiving, backups, building data collections, etc.
- It also specifies the data acquisition rules, including scheduling, sources, transformations and cleanups, etc. And it maintains a log of data collection as well as data access operations.
- An important component of a complete data warehousing solution is one that integrates and synchronizes the various sources of metadata.
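As a rough illustration of what an information directory entry might hold, here is a minimal Python sketch. The class names (`ElementEntry`, `InformationDirectory`), the fields and the `net_sales` example are all hypothetical. They simply mirror the kinds of information listed above: business meaning, source, transformation rule, schedule and a history of changes.

```python
from dataclasses import dataclass, field


@dataclass
class ElementEntry:
    """One warehouse data element as described in the directory."""
    name: str
    data_type: str
    meaning: str          # business description
    source: str           # where the value comes from
    transformation: str   # rule applied during acquisition
    schedule: str         # when the element is refreshed
    change_log: list = field(default_factory=list)


class InformationDirectory:
    """Holds entries and answers business and technical questions."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def describe(self, name):
        """Business view: what the element means."""
        return self._entries[name].meaning

    def lineage(self, name):
        """Technical view: source, transformation rule and schedule."""
        e = self._entries[name]
        return (e.source, e.transformation, e.schedule)

    def record_change(self, name, note):
        """Keep a history of design changes for the element."""
        self._entries[name].change_log.append(note)


directory = InformationDirectory()
directory.register(ElementEntry(
    name="net_sales",
    data_type="decimal",
    meaning="Invoiced sales minus returns",
    source="orders.amount",
    transformation="sum(amount) - sum(returns)",
    schedule="monthly",
))
directory.record_change("net_sales", "returns excluded from total")
```

A business user would call `describe`, while a developer or administrator would call `lineage`. Keeping both views on one entry is one way to read the requirement that the directory serve technical and business tasks alike.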

- The term replication is used to describe the process of managing copies of data, and has traditionally been applied in the field of distributed databases and in client/server environments.
- Its application in data warehousing is relatively recent and somewhat specialized, due in part to the aggregation and denormalization that occurs.
- Also, a data warehouse will contain multiple instances of the same set of data elements as snapshots, each with a different timestamp.
- The more difficult replication problems in data warehousing occur in tables where the current values are merely to be updated if they have changed.
In a research paper written at Stanford University, entitled Efficient Snapshot Differential Algorithms for Data Warehousing, the authors state that "detecting and extracting modifications from information sources is an integral part of data warehousing." They report that "there are essentially three ways to detect and extract modifications: The application running on top of the source is altered to send the modifications to the warehouse. A system log file is parsed to obtain the relevant modifications to the application. The modifications are inferred by comparing a current source snapshot with an earlier one. We call the problem of detecting differences between two source snapshots the snapshot differential problem; it is the problem we address in this paper."
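The third approach, comparing a current snapshot with an earlier one, can be sketched naively in Python. This is a simple keyed comparison that assumes both snapshots fit in memory and that every row has a unique key. It is not one of the efficient algorithms from the paper, and the sample rows are invented.

```python
def snapshot_diff(old, new):
    """Compare two snapshots given as {key: row} dictionaries.

    Returns three dictionaries: rows inserted since the old snapshot,
    rows deleted since then, and rows whose content changed.
    """
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = {k: old[k] for k in old.keys() - new.keys()}
    updates = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return inserts, deletes, updates


# Hypothetical source snapshots keyed by a customer id.
earlier = {1: ("Ann", "Oslo"), 2: ("Bob", "Rome"), 3: ("Cy", "Lima")}
current = {1: ("Ann", "Oslo"), 2: ("Bob", "Paris"), 4: ("Di", "Cairo")}

inserts, deletes, updates = snapshot_diff(earlier, current)
```

Only the three changed rows need to be sent to the warehouse. The unchanged row for key 1 is skipped, which is the reason for computing a differential rather than reloading the whole source.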
- An important part of replication is to determine the schedule - what parts of the data warehouse get written or refreshed when?
- Some parts of the data warehouse will get updated with a regular frequency, such as monthly or weekly.
- Other parts will get new snapshots written according to a calendar of significant dates, such as around census and registration dates.
- The replication schedule needs to be automated, and documented in the information directory as well as in the system manual.
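One way to picture an automated replication schedule is as a small table of rules that is checked each day. The Python sketch below is only an assumption about how such a schedule could be represented. The part names, the rule format and the dates are hypothetical.

```python
from datetime import date

# Hypothetical schedule: warehouse part -> refresh rule.
SCHEDULE = {
    "sales_summary": {"kind": "weekly", "weekday": 0},   # Mondays
    "finance_totals": {"kind": "monthly", "day": 1},     # 1st of the month
    "enrolment_snapshot": {
        "kind": "dates",                                 # significant dates
        "dates": {date(2024, 3, 15), date(2024, 10, 15)},
    },
}


def parts_due(today, schedule=SCHEDULE):
    """Return the warehouse parts whose refresh falls on `today`."""
    due = []
    for part, rule in schedule.items():
        if rule["kind"] == "weekly" and today.weekday() == rule["weekday"]:
            due.append(part)
        elif rule["kind"] == "monthly" and today.day == rule["day"]:
            due.append(part)
        elif rule["kind"] == "dates" and today in rule["dates"]:
            due.append(part)
    return sorted(due)
```

For example, `parts_due(date(2024, 1, 1))` returns both the weekly and the monthly part, because that day is a Monday and the first of the month. Keeping the rules in one data structure also makes it easy to document them in the information directory.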
- One of the benefits of data warehousing is consistency - analyzing the same data in different ways always adds up to the same totals.
- This is not possible with operational systems since real-time updates to the data can take place between doing the first analysis and doing a later one.
- However, in planning the replication schedule, special care needs to be taken to ensure that the operational data is in a stable, non-volatile state during the replication process.
- Synchronizing the various parts of the data warehouse is important, or inconsistencies can still arise.


 
- The Data Warehousing Institute has created what they call a "New Roadmap To Data Warehousing" in the form of a poster.
- It is a comprehensive although not exhaustive compilation of data warehousing components and tools organized into categories of use.
- The different categories are summarized in the following diagram.
- Larry Greenfield has compiled and organized links on the world wide web to a comprehensive list of data warehousing resources, at URL:
- http://www.dwinfocenter.org/ (used to be: http://pwp.starnetinc.com/larryg/index.html)
- There are listings for vendors of end user tools as well as for infrastructure technology, as shown in the following list of headings.
End User Tools
- Report and Query
- OLAP / Multidimensional Databases
- Financial, Marketing, and Supply Chain Analysis
- Executive Information Systems
- Data Mining
- Document Retrieval
- Geographic Information Systems
- Decision Analysis
- Statistics
- Process Modeling
- Information Filtering
- Industry Specific Tools
- Other End User Decision Support Tools
Infrastructure Technology
- Data Extraction, Cleaning, Loading
- Information Catalogs
- Databases for Data Warehousing
- Query and Load Accelerators
- Middleware
- Other Database Tools
- Hardware
GOAL
- The goal of Data Warehousing is to generate front-end analytics that will support business executives and operational managers.
CONCLUSION
- An operational database is designed primarily to support day-to-day operations.
- A data warehouse is designed to support strategic decision making.