• Why Warehouse?
• What is a Warehouse?
• Data Warehouse Architecture
• Introduction to Data Mining
• Introducing Data Warehousing and Mining in your organization
 
Peter Drucker believes ...

The two most important people in the 21st Century will be the CFO (managing the Cash Flow) and the CIO (managing the Information Flow)
 
Data, Data everywhere yet ...

• I can't find the data I need
  - data is scattered over the network
  - many versions, subtle differences
• I can't get the data I need
  - need an expert to get the data
• I can't understand the data I found
  - available data poorly documented
• I can't use the data I found
  - results are unexpected
  - data needs to be transformed from one form to other
 
Why Data Warehousing?

• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• What is the most effective distribution channel?
• What product promotions have the biggest impact on revenue?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?

Decision Support

• Putting Information technology to help the knowledge worker make faster and better decisions
  - Which of my customers are most likely to go to the competition?
  - What product promotions have the biggest impact on revenue?
  - How did the share price of software companies correlate with profits over last 10 years?
 
Decision Support

• Used to manage and control business
• Data is historical or point-in-time
• Optimized for inquiry rather than update
• Use of the system is loosely defined and can be ad-hoc
• Used by managers and end-users to understand the business and make judgements


Evolution of Decision Support

• 60's: Batch reports
  - hard to find and analyze information
  - inflexible and expensive, reprogram every request
• 70's: Terminal based DSS and EIS
• 80's: Desktop data access and analysis tools
  - query tools, spreadsheets, GUIs
  - easy to use, but access only operational db
• 90's: Data warehousing with integrated OLAP engines and tools
 
What are the users saying...

• Data should be integrated across the enterprise
• Summary data had a real value to the organization
• Historical data held the key to understanding data over time
• What-if capabilities are required


What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference

[Forrester Research, April 1996]

What is a Data Warehouse?

A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context.

[Barry Devlin]

Data Warehousing Market

• Hardware -- servers, storage, clients
• Warehouse -- DBMS
• Tools
• Systems Integration and Consulting
• Market growing from
  - $2B in 1995 to $8B in 1998 [Meta Group]
  - $1.5B in 1995 to $6.9B in 1999 [Gartner Group]
 
[Figure: chart not recoverable from the source. Source: Meta Group]

Warehouses are Very Large Databases

[Figure: chart not recoverable from the source]

Very Large Data Bases

• Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
• Petabytes -- 10^15 bytes: Geographic Information Systems
• Exabytes -- 10^18 bytes: National Medical Records
• Zettabytes -- 10^21 bytes: Weather images
• Yottabytes -- 10^24 bytes: Intelligence Agency Videos



Data Warehousing -- It is a process

• Technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible
• A decision support database maintained separately from the organization's operational database
 
Data Warehouse

• A data warehouse is a
  - subject-oriented
  - integrated
  - time-varying
  - non-volatile
  collection of data that is used primarily in organizational decision making.
  -- Bill Inmon, Building the Data Warehouse, 1996
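
The four properties in the definition above can be illustrated with a minimal sketch. The class names, fields, and sample values below are hypothetical and chosen only for illustration; this is one possible reading of the definition, not a prescribed design.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)      # frozen: a loaded row is never modified (non-volatile)
class CustomerSnapshot:
    customer_id: int         # subject-oriented: keyed by the business subject
    as_of: date              # time-varying: every row carries its snapshot date
    balance: float           # integrated: one unit and format for all sources


class CustomerHistory:
    """Append-only store: snapshots are added, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, snapshot):
        self._rows.append(snapshot)

    def balance_as_of(self, customer_id, when):
        """Return the most recent balance recorded on or before `when`."""
        rows = [r for r in self._rows
                if r.customer_id == customer_id and r.as_of <= when]
        if not rows:
            return None
        return max(rows, key=lambda r: r.as_of).balance


# Hypothetical sample data.
history = CustomerHistory()
history.load(CustomerSnapshot(1, date(1996, 1, 31), 100.0))
history.load(CustomerSnapshot(1, date(1996, 2, 29), 250.0))
```

Because old snapshots are kept, a question such as "what was the balance in mid-February?" can still be answered after later loads.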


OLTP

• Database Systems have been used traditionally for OLTP
  - clerical data processing tasks
  - detailed, up to date data
  - structured repetitive tasks
  - read/update a few records
  - isolation, recovery and integrity are critical
• Will call these operational systems

Operational Systems

• Run the business in real time
• Based on up-to-the-second data
• Optimized to handle large numbers of simple read/write transactions
• Optimized for fast response to predefined transactions
• Used by people who deal with customers, products -- clerks, salespeople etc.
• They are increasingly used by customers

Examples of Operational Data

Data               | Industry           | Usage                        | Technology                                              | Volumes
Customer File      | All                | Track Customer Details       | Legacy application, flat files, main frames             | Small-medium
Account Balance    | Finance            | Control account activities   | Legacy applications, hierarchical databases, mainframe  | Large
Point-of-Sale data | Retail             | Generate bills, manage stock | Client/Server, relational databases                     | Very Large
Call Record        | Telecommunications | Billing                      | Legacy application, hierarchical database, mainframe    | Very Large
Production Record  | Manufacturing      | Control Production           | New applications, relational databases, AS/400          | Medium
 
Application-Orientation vs. Subject-Orientation

Application-Orientation: Loans, Credit Card, Trust, Savings
Subject-Orientation: Customer, Vendor, Product, Activity
 
OLTP vs. Data Warehouse

• OLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse
• Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
  - e.g., average amount spent on phone calls between 9AM-5PM in California during the month of December
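
The example query above can be sketched with Python's built-in `sqlite3` module. The `calls` table, its columns, and the sample rows are assumptions made for illustration only; a real warehouse would use its own schema and a much larger fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE calls (
        state     TEXT,     -- location dimension
        call_date TEXT,     -- time dimension, ISO format YYYY-MM-DD
        call_hour INTEGER,  -- hour of day, 0-23
        amount    REAL      -- measure: amount spent on the call
    )
""")
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?, ?)",
    [
        ("CA", "1996-12-03", 10, 2.0),   # matches all three conditions
        ("CA", "1996-12-15", 16, 4.0),   # matches all three conditions
        ("CA", "1996-12-20", 20, 9.0),   # outside 9AM-5PM
        ("CA", "1996-11-30", 11, 7.0),   # not December
        ("NY", "1996-12-05", 12, 5.0),   # not California
    ],
)

# Restrict on three dimensions (place, month, hour), then aggregate.
row = conn.execute("""
    SELECT AVG(amount)
    FROM calls
    WHERE state = 'CA'
      AND strftime('%m', call_date) = '12'
      AND call_hour >= 9 AND call_hour < 17
""").fetchone()
average_amount = row[0]
print(average_amount)
```

The query filters on several dimensions at once and scans many rows to return one aggregate, which is the access pattern an OLTP system is not tuned for.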
 
OLTP vs. Data Warehouse

• Complex Data Warehouse queries would degrade performance of operational DBMS
• Data Warehouse requires historical data; not typically maintained by operational databases
• Decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources: operational DBMS, external sources, legacy systems
• Different sources typically use different representations, code and format which have to be reconciled
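
A minimal sketch of reconciling representations from different sources. The gender codes and date formats below are hypothetical examples of source-specific encodings; an actual integration would take its mappings from the documented source systems.

```python
from datetime import datetime

# Hypothetical source-specific encodings for the same attribute.
GENDER_CODES = {"m": "M", "male": "M", "1": "M",
                "f": "F", "female": "F", "0": "F"}

# Hypothetical date formats, tried in this order.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%y")


def reconcile_gender(value):
    """Map any known source code to the warehouse code 'M' or 'F'."""
    return GENDER_CODES.get(str(value).strip().lower())


def reconcile_date(value):
    """Try each known source format and return an ISO date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unknown format: leave for manual review
```

Unknown codes and formats return `None` instead of a guess, so bad values can be reported rather than loaded silently.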

OLTP vs Data Warehouse

OLTP                 | Warehouse (DSS)
Application Oriented | Subject Oriented
Used to run business | Used to analyze business
Detailed data        | Summarized and refined
Current up to date   | Snapshot data
Isolated Data        | Integrated Data
Repetitive access    | Ad-hoc access
Clerical User        | Knowledge User (Manager)

 
OLTP vs Data Warehouse

OLTP                                  | Data Warehouse
Performance Sensitive                 | Performance relaxed
Few Records accessed at a time (tens) | Large volumes accessed at a time (millions)
Read/Update Access                    | Mostly Read (Batch Update)
No data redundancy                    | Redundancy present
Database Size 100 MB - 100 GB         | Database Size 100 GB - few terabytes



OLTP vs Data Warehouse

OLTP                                             | Data Warehouse
Transaction throughput is the performance metric | Query throughput is the performance metric
Thousands of users                               | Hundreds of users
Managed in entirety                              | Managed by subsets

 
Criteria to Select a Warehouse

• load/index time
• query response time
• database size requirements/limitations
• quality
• ratio of raw data size to full database size (including indices, temp space, etc.)
• parallel capabilities
• price
• company DBMS standardization policy

 
Data Warehouse Architecture

[Figure: architecture diagram not recoverable from the source]
 
Data Extraction and Cleansing

• Extract data from existing operational and legacy data
• Issues:
  - Sources of data for the warehouse
  - Data quality at the sources
  - Merging different data sources
  - Data Transformation
  - How to propagate updates (on the sources) to the warehouse
  - Terabytes of data to be loaded
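
One common way to approach the update-propagation issue is an incremental load driven by a high-water mark, sketched below. This is an illustrative assumption, not a method the slides prescribe; the `updated_at` field, the row layout, and the sample values are hypothetical.

```python
def incremental_load(source_rows, warehouse, last_loaded):
    """Copy rows changed after `last_loaded` into `warehouse`.

    source_rows: iterable of dicts with 'id', 'updated_at' and payload fields.
    warehouse:   list that receives new row versions (append-only).
    Returns the new high-water mark to use for the next run.
    """
    high_water = last_loaded
    for row in source_rows:
        if row["updated_at"] > last_loaded:
            warehouse.append(dict(row))   # keep a copy, never overwrite history
            high_water = max(high_water, row["updated_at"])
    return high_water


# Hypothetical source rows; 'updated_at' is a simple increasing counter here.
source = [
    {"id": 1, "updated_at": 10, "balance": 100},
    {"id": 2, "updated_at": 25, "balance": 40},
    {"id": 1, "updated_at": 30, "balance": 120},
]
warehouse = []
mark = incremental_load(source, warehouse, last_loaded=20)
```

Only rows changed since the last run are moved, which matters when the full source is too large to reload each time.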

 
Extraction and Transformation Tools

• Carleton Corporation -- Passport
• Evolutionary Technologies Inc. -- Extract
• Informatica -- OpenBridge
• Information Builders Inc. -- EDA Copy Manager
• Platinum Technology -- InfoRefiner
• Prism Solutions -- Prism Warehouse Manager
 
Scrubbing Data

• Sophisticated transformation tools.
• Used for cleaning data to improve its quality
• Clean data is vital for the success of the warehouse
• Example
  - Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are the same person
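
The name example above can be sketched with approximate string matching from Python's standard `difflib` module. The threshold of 0.8 and the token-by-token comparison are assumptions for illustration; commercial scrubbing tools use far more elaborate rules.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matches(name, canonical, threshold=0.8):
    """True if any token of `name` is close enough to `canonical`."""
    tokens = name.replace(".", " ").split()
    return any(similarity(tok, canonical) >= threshold for tok in tokens)


variants = ["Seshadri", "Sheshadri", "Sesadri",
            "Seshadri S.", "Srinivasan Seshadri"]
flags = [matches(v, "Seshadri") for v in variants]
print(flags)
```

A match on one token only suggests that two records may refer to the same person; a real scrubbing step would also compare other fields such as address before merging.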
 
Scrubbing Tools

• Apertus -- Enterprise/Integrator
• Vality -- IPE
• Postal Soft

 
The Data Warehouse

• Heart of the data warehouse is the data itself!
• Single version of the truth
• Corporate memory
• Data is organized in a way that represents business -- subject orientation

 
Warehouse Products

• Computer Associates -- CA-Ingres
• Hewlett-Packard -- Allbase/SQL
• Informix -- Informix, Informix XPS
• Microsoft -- SQL Server
• Oracle -- Oracle7, Oracle Parallel Server
• Red Brick -- Red Brick Warehouse
• SAS Institute -- SAS
• Software AG -- ADABAS
• Sybase -- SQL Server, IQ, MPP
 
Users have different views of Data

• Tourists: Browse information harvested by farmers
• Farmers: Harvest information from known access paths
• Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data

[Figure labels: OLAP; Organizationally structured]
 

From the Data Warehouse to Data Marts

[Figure: levels running from Data (more history, normalized, detailed) up to Information (less): Organizationally Structured (Data Warehouse), Departmentally Structured, Individually Structured]
 
OLAP: 3 Tier DSS

• Database Layer: Store atomic data in industry standard Data Warehouse.
• Application Logic Layer: Generate SQL execution plans in the OLAP engine to obtain OLAP functionality.
• Presentation Layer: Obtain multi-dimensional reports from the DSS Client.
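
A minimal sketch of the middle tier's role: turning a request for a multi-dimensional report into SQL that runs against the warehouse. The `build_rollup_sql` helper, the `sales` table, and the sample rows are hypothetical; a real OLAP engine generates far more elaborate execution plans.

```python
import sqlite3


def build_rollup_sql(table, dimensions, measure):
    """Generate a GROUP BY query for the requested dimensions.

    Names are interpolated directly, so they must come from trusted
    metadata, never from end-user input.
    """
    dims = ", ".join(dimensions)
    return (f"SELECT {dims}, SUM({measure}) AS total "
            f"FROM {table} GROUP BY {dims} ORDER BY {dims}")


# Bottom tier: atomic data in a relational store (hypothetical sample rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("East", "A", 10.0), ("East", "A", 5.0),
    ("East", "B", 7.0), ("West", "A", 3.0),
])

# Middle tier: the same data reported at two levels of detail.
report = conn.execute(
    build_rollup_sql("sales", ["region", "product"], "revenue")).fetchall()
by_region = conn.execute(
    build_rollup_sql("sales", ["region"], "revenue")).fetchall()
```

Dropping a dimension from the list rolls the report up to a coarser level without any change to the stored data.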
 
Strengths of OLAP

• It is a powerful visualization tool
• It provides fast, interactive response times
• It is good for analyzing time series
• It can be useful to find some clusters and outliers
• Many vendors offer OLAP tools
 
Reporting Tools
• Andyne Computing -- GQL
• Brio -- BrioQuery
• Business Objects -- Business Objects
• Cognos -- Impromptu
• Information Builders Inc. -- Focus for Windows
• Oracle -- Discoverer2000
• Platinum Technology -- SQL*Assist, ProReports
• PowerSoft -- InfoMaker
• SAS Institute -- SAS/Assist
• Software AG -- Esperant
• Sterling Software -- VISION:Data
 
OLAP and Executive Information Systems

• Andyne Computing -- Pablo
• Arbor Software -- Essbase
• Cognos -- PowerPlay
• Comshare -- Commander OLAP
• Holistic Systems -- Holos
• Information Advantage -- AXSYS, WebOLAP
• Informix -- Metacube
• MicroStrategy -- DSS/Agent
• Oracle -- Express
• Pilot -- LightShip
• Planning Sciences -- Gentium
• Platinum Technology -- ProdeaBeacon, Forest & Trees
• SAS Institute -- SAS/EIS, OLAP
• Speedware -- Media
 
4GLs, GUI Builders, and PC Databases

• Information Builders -- Focus
• Lotus -- Approach
• Microsoft -- Access, Visual Basic
• MITI -- SQR/Workbench
• PowerSoft -- PowerBuilder
• SAS Institute -- SAS/AF

 
Data Mining works with Warehouse Data

• Data Warehousing provides the Enterprise with a memory
• Data Mining provides the Enterprise with intelligence

 
We want to know ...

• Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
• Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?
• If I raise the price of my product by Rs. 2, what is the effect on my ROI?
• If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?
• If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
• Which of my customers are likely to be the most loyal?

Application Areas

Industry               | Application
Finance                | Credit Card Analysis
Insurance              | Claims, Fraud Analysis
Telecommunication      | Call record analysis
Transport              | Logistics management
Consumer goods         | Promotion analysis
Data Service providers | Value added data
Utilities              | Power usage analysis

 
Data Mining in Use

• The US Government uses Data Mining to track fraud
• A Supermarket becomes an information broker
• Basketball teams use it to track game strategy
• Cross Selling
• Warranty Claims Routing
• Holding on to Good Customers
• Weeding out Bad Customers

 

• Marketing efforts based on targeting the most likely customers empower companies to achieve their goals with remarkable precision and substantially lower costs.


What makes data mining possible?

• Advances in the following areas are making data mining deployable:
  - data warehousing
  - better and more data (i.e., operational, behavioral, and demographic)
  - the emergence of easily deployed data mining tools and
  - the advent of new data mining techniques.
  -- Gartner Group