You are on page 1of 27

DATA WAREHOUSING

AND
DATA MINING

February 22, 2009

ICS426: Introduction

Course Overview

Introduction

Data Preporcessing

DW and OLAP

Data Mining

February 22, 2009

ICS426: Introduction

Motivation

Data flood

Automated data collection tools and mature database technology


lead to tremendous amounts of data stored in databases, data
warehouses and other information repositories
There is a tremendous increase in the amount of data recorded
and stored on digital media

February 22, 2009

We are producing over two exabites (10^18) of data per year


Storage capacity, for a fixed price, appears to be doubling
approximately every 9 months
Data stored in worlds databases doubles every 20 months
Other growth rate estimates even higher
ICS426: Introduction

Data, Data everywhere - yet ...


I cant find the data I need

data is scattered over the network


many versions, subtle differences

I cant get the data I need

I cant understand the data I found

available data poorly documented

I cant use the data I found

February 22, 2009

need an expert to get the data

results are unexpected


data needs to be transformed from one
form to other

ICS426: Introduction

Motivation

Very little data will ever be looked at by a human. We are drowning


in data, but starving for knowledge! The greatest problem of today
is how to teach people to ignore the irrelevant, how to refuse to
know things, before they are suffocated. For too many facts are as
bad as none at all.

Knowledge Discovery is NEEDED to make sense and use of data.

Solution: Data warehousing and data mining

Data warehousing and On-Line Analytical Processing (OLAP)


Extraction of interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases

February 22, 2009

ICS426: Introduction

Knowledge Discovery (KDD) Process

February 22, 2009

ICS426: Introduction

KDD Process: Several Key Steps

Learning the application domain

relevant prior knowledge and goals of application

Creating a target data set: data selection

Data cleaning and preprocessing: (may take 60% of effort!)

Data reduction and transformation

Data mining

summarization, classification, regression, association, clustering

Pattern evaluation and knowledge presentation

Use of discovered knowledge

February 22, 2009

ICS426: Introduction

What is a Data Warehouse?


A single, complete and consistent
store of data obtained from a variety
of different sources made available to
end users in a what they can
understand and use in a business
context.
[Barry Devlin]

February 22, 2009

ICS426: Introduction

What are the users saying...

Data should be integrated across the


enterprise
Summary data has a real value to the
organization
Historical data holds the key to
understanding data over time
What-if capabilities are required

February 22, 2009

ICS426: Introduction

What is Data Warehousing?

Information

A process of transforming data into


information and making it available
to users in a timely enough manner
to make a difference
[Forrester Research, April 1996]

Data
February 22, 2009

ICS426: Introduction

10

Evolution

60s: Batch reports


hard to find and analyze information
inflexible and expensive, reprogram every new request
70s: Terminal-based DSS and EIS (executive information systems)
still inflexible, not integrated with desktop tools
80s: Desktop data access and analysis tools
query tools, spreadsheets, GUIs
easier to use, but only access operational databases
90s: Data warehousing with integrated OLAP engines and tools
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information
systems

February 22, 2009

ICS426: Introduction

11

Very Large Data Bases

Terabytes -- 10^12 bytes:

Walmart -- 24 Terabytes

Petabytes -- 10^15 bytes:

Geographic Information Systems

Exabytes -- 10^18 bytes:

National Medical Records

Zettabytes -- 10^21 bytes:

Weather images

Zottabytes -- 10^24 bytes:

Intelligence Agency Videos

February 22, 2009

ICS426: Introduction

12

Data Warehousing -- It is a process

February 22, 2009

Technique for assembling and managing


data from various sources for the purpose
of answering business questions. Thus
making decisions that were not previous
possible
A decision support database maintained
separately from the organizations
operational database

ICS426: Introduction

13

Data Warehouse

A data warehouse is a

subject-oriented

integrated

time-varying

non-volatile

collection of data that is used primarily in organizational decision


making.
-- Bill Inmon, Building the Data Warehouse 1996

February 22, 2009

ICS426: Introduction

14

Data Warehouse Architecture


Relational
Databases
Optimized Loader
ERP
Systems

Extraction
Cleansing
Data Warehouse
Engine

Purchased
Data

Legacy
Data
February 22, 2009

Analyze
Query

Metadata Repository
ICS426: Introduction

15

Data Warehouse for Decision Support & OLAP

Putting Information technology to help the knowledge worker


make faster and better decisions

February 22, 2009

Which of my customers are most likely to go to the


competition?
What product promotions have the biggest impact on revenue?
How did the share price of software companies correlate with
profits over last 10 years?

ICS426: Introduction

16

Decision Support

Used to manage and control business

Data is historical or point-in-time

Optimized for inquiry rather than update

Use of the system is loosely defined and can be ad-hoc

Used by managers and end-users to understand the business and


make judgements

February 22, 2009

ICS426: Introduction

17

Data Mining works with Warehouse Data

Data Warehousing provides the Enterprise


with a memory

Data Mining provides the Enterprise


with intelligence

February 22, 2009

ICS426: Introduction

18

Why Data Mining

Credit ratings/targeted marketing:

Identify likely responders to sales promotions

Fraud detection

Given a database of 100,000 names, which persons are the


least likely to default on their credit cards?

Which types of transactions are likely to be fraudulent, given


the demographics and transactional history of a particular
customer?

Customer relationship management:

Which of my customers are likely to be the most loyal, and


which are most likely to leave for a competitor? :
Data Mining helps extract such information

February 22, 2009

ICS426: Introduction

19

Why DM: A producer wants to know.


Which
Whichare
areour
our
lowest/highest
margin
lowest/highest margin
customers
customers??
What
Whatisisthe
themost
most
effective
effectivedistribution
distribution
channel?
channel?

What
Whatproduct
productpromprom-otions
-otionshave
havethe
thebiggest
biggest
impact
impacton
onrevenue?
revenue?

February 22, 2009

Who
Whoare
aremy
mycustomers
customers
and
what
products
and what products
are
arethey
theybuying?
buying?

Which
Whichcustomers
customers
are
most
are mostlikely
likelyto
togo
go
to
tothe
thecompetition
competition??
What
Whatimpact
impactwill
will
new
newproducts/services
products/services
have
haveon
onrevenue
revenue
and
andmargins?
margins?
ICS426: Introduction

20

What is Data Mining?

Data mining: a misnomer?

Alternative names

Knowledge discovery (mining) in databases (KDD), knowledge


extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc

Many Definitions

Non-trivial extraction of implicit, previously unknown and


potentially useful information from huge amount of data
Exploration & analysis, by automatic or semi-automatic means,
of large quantities of data in order to discover meaningful
patterns

February 22, 2009

ICS426: Introduction

21

Data Mining: Confluence of Multiple Disciplines

20x20 ~ 2^400 10^120 patterns

February 22, 2009

ICS426: Introduction

22

Some basic operations

Predictive:

Regression

Classification

Collaborative Filtering

Descriptive:

Clustering / similarity matching

Association rules and variants

Deviation detection

February 22, 2009

ICS426: Introduction

23

Applications

Banking: loan/credit card approval

Customer relationship management:

identify likely responders to promotions

Fraud detection: telecommunications, financial transactions

identify those who are likely to leave for a competitor.

Targeted marketing:

predict good customers based on old customers

from an online stream of event identify fraudulent events

Manufacturing and production:

automatically adjust knobs when process parameter changes

February 22, 2009

ICS426: Introduction

24

Applications

Medicine: disease outcome, effectiveness of treatments

analyze patient disease history: find relationship between


diseases

Molecular/Pharmaceutical: identify new drugs

Scientific data analysis:

identify new galaxies by searching for sub clusters

Web site/store design and promotion:

February 22, 2009

find affinity of visitor to pages and modify layout

ICS426: Introduction

25

The course
(4)

DS
DS

OLAP
(2)

(3)

DP

DW

DM

Association

DS

(5)

Classification (6)
Clustering

(7)

DS = Data source
DW = Data warehouse
DM = Data Mining
DP = Data processing
February 22, 2009

ICS426: Introduction

26

END

February 22, 2009

ICS426: Introduction

27

You might also like