
Database Management Systems

Chapter 8
Data Warehouses and Data Mining

Jerry Post
Copyright © 2003
Sequential Storage and Indexes
• We picture tables as simple rows and columns, but they cannot be stored this way.
  • It takes too many operations to find an item.
  • Insertions require reading and rewriting the entire table.

ID  LastName   FirstName  DateHired
1   Reeves     Keith      1/29/98
2   Gibson     Bill       3/31/98
3   Reasoner   Katy       2/17/98
4   Hopkins    Alan       2/8/98
5   James      Leisha     1/6/98
6   Eaton      Anissa     8/23/98
7   Farris     Dustin     3/28/98
8   Carpenter  Carlos     12/29/98
9   O'Connor   Jessica    7/23/98
10  Shields    Howard     7/13/98

Operations on Sequential Tables
• Read entire table
  • Easy and fast
• Sequential retrieval
  • Easy and fast for one order.
• Random Read/Sequential
  • Very weak
  • Probability of any row = 1/N
  • Sequential retrieval: expected number of reads
    $$EV = \sum_{i=1}^{N} i \cdot \frac{1}{N} = \frac{1}{N} \cdot \frac{N(N+1)}{2} = \frac{N+1}{2}$$
  • 1,000,000 rows means 500,000 retrievals per lookup!
• Delete
  • Easy
• Insert/Modify
  • Very weak

Row  Prob.  # Reads
A    1/N    1
B    1/N    2
C    1/N    3
D    1/N    4
E    1/N    5

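The expected-value formula can be checked numerically. The following is a minimal sketch, assuming every row is equally likely to be requested and that finding row i costs i reads; the function name `expected_reads` is made up for this illustration.

```python
def expected_reads(n: int) -> float:
    """Average number of reads to find one row by scanning from the top,
    assuming each of the n rows is requested with probability 1/n."""
    # Sum of i * (1/n) for i = 1..n, computed as (1 + 2 + ... + n) / n.
    return sum(range(1, n + 1)) / n


def closed_form(n: int) -> float:
    """The closed form from the slide: (N + 1) / 2."""
    return (n + 1) / 2


small = expected_reads(5)            # five rows, A through E
large = expected_reads(1_000_000)    # roughly 500,000 reads per lookup
```

For five rows the average is 3 reads; for one million rows it is 500,000.5, which is the "500,000 retrievals per lookup" figure.
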
Insert into Sequential Table
• Insert Inez:
  • Find insert location.
  • Copy top to new file.
  • At insert location, add row.
  • Copy rest of file.

Original file (stored in LastName order):
ID  LastName   FirstName  DateHired
8   Carpenter  Carlos     12/29/98
6   Eaton      Anissa     8/23/98
7   Farris     Dustin     3/28/98
2   Gibson     Bill       3/31/98
4   Hopkins    Alan       2/8/98
5   James      Leisha     1/6/98
9   O'Connor   Jessica    7/23/98
3   Reasoner   Katy       2/17/98
1   Reeves     Keith      1/29/98
10  Shields    Howard     7/13/98

New file after the insert:
ID  LastName   FirstName  DateHired
8   Carpenter  Carlos     12/29/98
6   Eaton      Anissa     8/23/98
7   Farris     Dustin     3/28/98
2   Gibson     Bill       3/31/98
4   Hopkins    Alan       2/8/98
11  Inez       Maria      1/15/99
5   James      Leisha     1/6/98
9   O'Connor   Jessica    7/23/98
3   Reasoner   Katy       2/17/98
1   Reeves     Keith      1/29/98
10  Shields    Howard     7/13/98

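The copy-and-insert steps can be sketched in a few lines. This is only an illustration, assuming the "file" is an in-memory list kept in LastName order; `insert_sorted_copy` and the read/write counters are hypothetical names, not part of any DBMS.

```python
def insert_sorted_copy(rows, new_row, key):
    """Build a new 'file' by copying every row and adding new_row at its sorted spot.

    Returns the new list plus the number of rows read and written, to show
    that a single insert touches the entire table.
    """
    new_file, reads, writes, inserted = [], 0, 0, False
    for row in rows:
        reads += 1
        if not inserted and key(new_row) < key(row):
            new_file.append(new_row)      # at the insert location, add the row
            writes += 1
            inserted = True
        new_file.append(row)              # copy the existing row
        writes += 1
    if not inserted:                      # new row belongs at the end
        new_file.append(new_row)
        writes += 1
    return new_file, reads, writes


staff = [
    (8, "Carpenter", "Carlos", "12/29/98"), (6, "Eaton", "Anissa", "8/23/98"),
    (7, "Farris", "Dustin", "3/28/98"), (2, "Gibson", "Bill", "3/31/98"),
    (4, "Hopkins", "Alan", "2/8/98"), (5, "James", "Leisha", "1/6/98"),
    (9, "O'Connor", "Jessica", "7/23/98"), (3, "Reasoner", "Katy", "2/17/98"),
    (1, "Reeves", "Keith", "1/29/98"), (10, "Shields", "Howard", "7/13/98"),
]
new_staff, reads, writes = insert_sorted_copy(
    staff, (11, "Inez", "Maria", "1/15/99"), key=lambda r: r[1]
)
```

Adding one row to a ten-row table costs ten reads and eleven writes, which is why inserts are rated "very weak" for sequential storage.
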
Binary Search
• Given a sorted list of names, how do you find Jones?
• Sequential search
  • Jones = 10 lookups
  • Average = 15/2 = 7.5 lookups
  • Min = 1, Max = 14
• Binary search
  • Find midpoint (14 / 2) = 7
  • Jones > Goetz
  • Jones < Kalida
  • Jones > Inez
  • Jones = Jones (4 lookups)
• Max ≈ log2(N), rounded up
  • N = 1000: Max = 10
  • N = 1,000,000: Max = 20

Sorted list (14 entries); the number in parentheses is the order in which binary search probes the entry:
Adams
Brown
Cadiz
Dorfmann
Eaton
Farris
Goetz (1)
Hanson
Inez (3)
Jones (4)
Kalida (2)
Lomax
Miranda
Norman

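The probe sequence can be reproduced with a short sketch. It assumes a zero-based Python list and midpoint rounding down, which happens to visit the same entries as the slide; `binary_search` is an illustrative name.

```python
def binary_search(names, target):
    """Return (index, probes): the position of target and every entry examined."""
    lo, hi = 0, len(names) - 1
    probes = []
    while lo <= hi:
        mid = (lo + hi) // 2          # midpoint of the remaining range
        probes.append(names[mid])
        if names[mid] == target:
            return mid, probes
        if names[mid] < target:
            lo = mid + 1              # target sorts after the midpoint
        else:
            hi = mid - 1              # target sorts before the midpoint
    return -1, probes                 # not found


names = ["Adams", "Brown", "Cadiz", "Dorfmann", "Eaton", "Farris", "Goetz",
         "Hanson", "Inez", "Jones", "Kalida", "Lomax", "Miranda", "Norman"]
index, probes = binary_search(names, "Jones")
```

Here `probes` comes out as Goetz, Kalida, Inez, Jones: four lookups instead of the ten a sequential scan needs.
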
Indexed Sequential Storage
• Common uses
  • Large tables.
  • Need many sequential lists.
  • Some random search, with one or two key columns.
• Mostly replaced by B+-Tree.

Data table:
Address  ID  LastName   FirstName  DateHired
A11      1   Reeves     Keith      1/29/98
A22      2   Gibson     Bill       3/31/98
A32      3   Reasoner   Katy       2/17/98
A42      4   Hopkins    Alan       2/8/98
A47      5   James      Leisha     1/6/98
A58      6   Eaton      Anissa     8/23/98
A63      7   Farris     Dustin     3/28/98
A67      8   Carpenter  Carlos     12/29/98
A78      9   O'Connor   Jessica    7/23/98
A83      10  Shields    Howard     7/13/98

Indexed for ID and LastName:
ID  Pointer      LastName   Pointer
1   A11          Carpenter  A67
2   A22          Eaton      A58
3   A32          Farris     A63
4   A42          Gibson     A22
5   A47          Hopkins    A42
6   A58          James      A47
7   A63          O'Connor   A78
8   A67          Reasoner   A32
9   A78          Reeves     A11
10  A83          Shields    A83

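The pointer idea can be sketched as follows. This is a simplified illustration, assuming storage is a dictionary keyed by address and the index is a sorted list of (key, pointer) pairs; real systems keep both on disk, and `find_by_last_name` is a made-up helper.

```python
import bisect

# Rows stored at their addresses, as in the data table.
storage = {
    "A11": (1, "Reeves", "Keith", "1/29/98"), "A22": (2, "Gibson", "Bill", "3/31/98"),
    "A32": (3, "Reasoner", "Katy", "2/17/98"), "A42": (4, "Hopkins", "Alan", "2/8/98"),
    "A47": (5, "James", "Leisha", "1/6/98"), "A58": (6, "Eaton", "Anissa", "8/23/98"),
    "A63": (7, "Farris", "Dustin", "3/28/98"), "A67": (8, "Carpenter", "Carlos", "12/29/98"),
    "A78": (9, "O'Connor", "Jessica", "7/23/98"), "A83": (10, "Shields", "Howard", "7/13/98"),
}

# The LastName index: key values in sorted order, each with a pointer to the row.
last_name_index = sorted((row[1], address) for address, row in storage.items())
index_keys = [key for key, _ in last_name_index]


def find_by_last_name(name):
    """Binary-search the index, then follow the pointer to the stored row."""
    i = bisect.bisect_left(index_keys, name)
    if i < len(index_keys) and index_keys[i] == name:
        return storage[last_name_index[i][1]]
    return None


hopkins = find_by_last_name("Hopkins")
```

The data rows never move; only the small index has to stay sorted, and each additional search column gets its own index.
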
Index Options: Bitmaps and Statistics
• Bitmap index
  • A compressed index designed for non-primary key columns. Bit-wise operations can be used to quickly match WHERE criteria.
• Analyze statistics
  • By collecting statistics about the actual data within the index, the DBMS can optimize the search path. For example, if it knows that only a few rows match one of your search conditions in a table, it can apply that condition first, reducing the amount of work needed to join tables.

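To illustrate how bit-wise operations match WHERE criteria, here is a minimal uncompressed sketch. The sample rows, the column names Category and Region, and the helper `build_bitmaps` are all assumptions made for this example; production bitmap indexes also compress the bit strings.

```python
# Hypothetical rows: (Category, Region). Row position doubles as the bit position.
rows = [("Bird", "East"), ("Cat", "West"), ("Bird", "West"),
        ("Dog", "East"), ("Cat", "East")]


def build_bitmaps(values):
    """Map each distinct value to an integer whose bit i is set when row i has it."""
    bitmaps = {}
    for position, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << position)
    return bitmaps


category_bitmaps = build_bitmaps(r[0] for r in rows)
region_bitmaps = build_bitmaps(r[1] for r in rows)

# WHERE Category = 'Bird' AND Region = 'West' becomes a single bit-wise AND.
match = category_bitmaps["Bird"] & region_bitmaps["West"]
matching_rows = [i for i in range(len(rows)) if (match >> i) & 1]
```

Only row 2 has both bits set, so it is the only row that needs to be fetched.
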
Problems with Indexes
• Each index must be updated when rows are inserted, deleted or modified.
• Changing one row of data in a table with many indexes can result in considerable time and resources to update all of the indexes.
• Steps to improve performance
  • Index primary keys
  • Index common join columns (usually primary keys)
  • Index columns that are searched regularly
  • Use a performance analyzer

Data Warehouse
[Figure: Operations data from the OLTP database (3NF tables) and flat files moves through a daily data transfer into the data warehouse (star configuration), which supports predefined reports and interactive data analysis.]

Data Warehouse Goals
• Existing databases are optimized for Online Transaction Processing (OLTP).
• Online Analytical Processing (OLAP) requires fast retrievals, and only bulk writes.
• Different goals require different storage, so build a separate data warehouse to use for queries.
• Extraction, Transformation, Transportation (ETT)
• Data analysis
  • Ad hoc queries
  • Statistical analysis
  • Data mining (specialized automated tools)

Extraction, Transformation, and Transportation (ETT)
[Figure: ETT flow from source systems into the data warehouse.]
• Source: Customers; transaction data from diverse systems.
• Transformations
  • Convert Client to Customer
  • Apply standard product numbers
  • Convert currencies
  • Fix region codes
• Data warehouse: All data must be consistent.

OLTP v. OLAP
[Comparison table not recoverable from the source.]

Multidimensional Cube
[Figure: Pet Store Item Sales cube. Dimensions: Category, Customer Location, Time (Sale Date). Measure: Amount = Quantity*Sale Price.]

Sales Date: Time Hierarchy
Levels (highest to lowest):
  Year
  Quarter
  Month
  Week
  Day
• Roll-up: To get higher-level totals
• Drill-down: To get lower-level details

Star Design
• Fact Table
  • Sales: Quantity, Amount=SalePrice*Quantity
• Dimension Tables
  • Products
  • Sales Date
  • Customer Location

Snowflake Design
Merchandise(ItemID, Description, QuantityOnHand, ListPrice, Category)
Sale(SaleID, SaleDate, EmployeeID, CustomerID, SalesTax)
OLAPItems(SaleID, ItemID, Quantity, SalePrice, Amount)
Customer(CustomerID, Phone, FirstName, LastName, Address, ZipCode, CityID)
City(CityID, ZipCode, City, State)

Dimension tables can join to other dimension tables.

OLAP Computation Issues
• Compute Quantity*Price in base query, then add to get $23.00.
• If you use Calculated Measure in the Cube, it will add first and multiply second to get $45.00, which is wrong.

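The difference between the two orders of operation can be shown with two sale lines. The quantities and prices below are hypothetical values chosen only because they reproduce the $23.00 and $45.00 totals; the figures behind the original slide are not available.

```python
from decimal import Decimal

# Hypothetical sale lines: (Quantity, Price). Assumed values, not from the source.
sale_lines = [(3, Decimal("5.00")), (2, Decimal("4.00"))]

# Correct: multiply within each row in the base query, then add the row amounts.
multiply_then_add = sum(qty * price for qty, price in sale_lines)

# Wrong: a calculated measure that sums each column first, then multiplies the sums.
total_quantity = sum(qty for qty, _ in sale_lines)
total_price = sum(price for _, price in sale_lines)
add_then_multiply = total_quantity * total_price
```

Row by row the amount is 15.00 + 8.00 = 23.00, while summing first gives 5 * 9.00 = 45.00.
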
OLAP Data Browsing
[Screenshot not recoverable from the source.]

Microsoft Pivot Table
[Screenshot not recoverable from the source.]

OLAP in SQL 99
• GROUP BY two columns
  • Gives you totals for each month within each category.
  • You do not get super-aggregate totals for the category, or the month, or the overall total.

SELECT Category, Month(SaleDate) AS Month,
       Sum(Quantity*SalePrice) AS Amount
FROM Sale INNER JOIN (Merchandise INNER JOIN SaleItem
       ON Merchandise.ItemID = SaleItem.ItemID)
     ON Sale.SaleID = SaleItem.SaleID
GROUP BY Category, Month(SaleDate);

Category  Month  Amount
Bird      1      $135.00
Bird      2      $45.00
Bird      3      $202.50
Bird      6      $67.50
Bird      7      $90.00
Bird      9      $67.50
Cat       1      $396.00
Cat       2      $113.85
Cat       3      $443.70
Cat       4      $2.25

SQL ROLLUP
SELECT Category, Month…, Sum …
FROM …
GROUP BY ROLLUP (Category, Month...)

Category  Month   Amount
Bird      1       135.00
Bird      2       45.00
…
Bird      (null)  607.50
Cat       1       396.00
Cat       2       113.85
…
Cat       (null)  1293.30
…
(null)    (null)  8451.79

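What ROLLUP adds to a plain GROUP BY can be emulated in a few lines. This sketch uses the six Bird rows from the GROUP BY result and only the two Cat rows visible on the ROLLUP slide, so the Bird subtotal matches 607.50 but the Cat subtotal and grand total here are smaller than the slide's figures. The function `rollup` is an illustration, not a database API.

```python
from collections import defaultdict
from decimal import Decimal

# (Category, Month, Amount): Bird rows are complete, Cat rows are a partial sample.
monthly = [
    ("Bird", 1, "135.00"), ("Bird", 2, "45.00"), ("Bird", 3, "202.50"),
    ("Bird", 6, "67.50"), ("Bird", 7, "90.00"), ("Bird", 9, "67.50"),
    ("Cat", 1, "396.00"), ("Cat", 2, "113.85"),
]


def rollup(rows):
    """Emulate GROUP BY ROLLUP (Category, Month); None stands for (null)."""
    totals = defaultdict(Decimal)
    for category, month, amount in rows:
        value = Decimal(amount)
        totals[(category, month)] += value   # detail level
        totals[(category, None)] += value    # super-aggregate per category
        totals[(None, None)] += value        # grand total
    return dict(totals)


totals = rollup(monthly)
bird_subtotal = totals[("Bird", None)]
```

ROLLUP produces subtotals only along the listed column order (category, then overall); it does not produce per-month totals across categories.
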
Missing Values Cause Problems
If there are missing values in the groups, it can be difficult to identify the super-aggregate rows.

Category  Month   Amount
Bird      1       135.00
Bird      2       45.00
Bird      (null)  32.00    <- Missing date
Bird      (null)  607.50   <- Super-aggregate
Cat       1       396.00
Cat       2       113.85
Cat       (null)  1293.30
(null)    (null)  8451.79

GROUPING Function
SELECT Category, Month…, Sum …,
       GROUPING (Category) AS Gc,
       GROUPING (Month) AS Gm
FROM …
GROUP BY ROLLUP (Category, Month...)

GROUPING returns 1 when the column has been aggregated away in a super-aggregate row, and 0 when the value (even a null) comes from the data.

Category  Month   Amount   Gc  Gm
Bird      1       135.00   0   0
Bird      2       45.00    0   0
…
Bird      (null)  32.00    0   0
Bird      (null)  607.50   0   1
Cat       1       396.00   0   0
Cat       2       113.85   0   0
…
Cat       (null)  1293.30  0   1
…
(null)    (null)  8451.79  1   1

CUBE Option
SELECT Category, Month, Sum, GROUPING (Category) AS Gc,
       GROUPING (Month) AS Gm
FROM …
GROUP BY CUBE (Category, Month...)

Category  Month   Amount   Gc  Gm
Bird      1       135.00   0   0
Bird      2       45.00    0   0
Bird      (null)  32.00    0   0
Bird      (null)  607.50   0   1
Cat       1       396.00   0   0
Cat       2       113.85   0   0
Cat       (null)  1293.30  0   1
(null)    1       1358.8   1   0
(null)    2       1508.94  1   0
(null)    3       2362.68  1   0
(null)    (null)  8451.79  1   1

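CUBE totals every combination of the grouping columns, and a GROUPING-style flag records which columns were aggregated away (1) versus taken from the data (0). The sketch below emulates both on a small hypothetical data set that includes one row with a missing month; the rows, the `cube` function, and the result layout are assumptions for illustration only.

```python
from itertools import combinations

# Hypothetical detail rows; Month is None where the sale date is missing.
detail = [
    {"Category": "Bird", "Month": 1, "Amount": 135.0},
    {"Category": "Bird", "Month": 2, "Amount": 45.0},
    {"Category": "Bird", "Month": None, "Amount": 32.0},
    {"Category": "Cat", "Month": 1, "Amount": 396.0},
]


def cube(rows, dims):
    """Emulate GROUP BY CUBE(dims). Keys are (group values, grouping flags)."""
    result = {}
    for size in range(len(dims) + 1):
        for kept in combinations(dims, size):
            for row in rows:
                key = tuple(row[d] if d in kept else None for d in dims)
                flags = tuple(0 if d in kept else 1 for d in dims)  # 1 = aggregated away
                result[(key, flags)] = result.get((key, flags), 0) + row["Amount"]
    return result


cube_totals = cube(detail, ("Category", "Month"))
missing_month = cube_totals[(("Bird", None), (0, 0))]   # real row with a null month
bird_subtotal = cube_totals[(("Bird", None), (0, 1))]   # super-aggregate over months
```

Both rows display as Bird / (null), but the flags separate them: the first is the 32.00 of sales with no date, the second is the Bird subtotal across all months.
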
GROUPING SETS: Hiding Details
SELECT Category, Month, Sum
FROM …
GROUP BY GROUPING SETS
  ( ROLLUP (Category),
    ROLLUP (Month),
    ()
  )

Category  Month   Amount
Bird      (null)  607.50
Cat       (null)  1293.30
…
(null)    1       1358.8
(null)    2       1508.94
(null)    3       2362.68
(null)    (null)  8451.79

SQL OLAP Analytical Functions
Function(s)                  Purpose
VAR_POP, VAR_SAMP            variance
STDDEV_POP, STDDEV_SAMP      standard deviation
COVAR_POP, COVAR_SAMP        covariance
CORR                         correlation
REGR_R2                      regression r-square
REGR_SLOPE, REGR_INTERCEPT   regression data (many)

SQL RANK Functions
SELECT Employee, SalesValue,
       RANK() OVER (ORDER BY SalesValue DESC) AS rank,
       DENSE_RANK() OVER (ORDER BY SalesValue DESC) AS dense
FROM Sales
ORDER BY SalesValue DESC, Employee;

Employee  SalesValue  rank  dense
Jones     18,000      1     1
Smith     16,000      2     2
Black     16,000      2     2
White     14,000      4     3

DENSE_RANK does not skip numbers.

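The two ranking rules differ only in what happens after a tie. A minimal sketch, assuming the values are already sorted in descending order; `rank_and_dense_rank` is a hypothetical helper name.

```python
def rank_and_dense_rank(values_desc):
    """Return (rank, dense_rank) lists for values already sorted descending."""
    ranks, dense = [], []
    for i, value in enumerate(values_desc):
        if i > 0 and value == values_desc[i - 1]:
            ranks.append(ranks[-1])        # ties share the same rank
            dense.append(dense[-1])
        else:
            ranks.append(i + 1)            # RANK jumps to the row position
            dense.append(dense[-1] + 1 if dense else 1)  # DENSE_RANK adds one
    return ranks, dense


sales_values = [18000, 16000, 16000, 14000]   # Jones, Smith, Black, White
ranks, dense = rank_and_dense_rank(sales_values)
```

After the tie at 16,000, RANK gives White a 4 because three rows precede it, while DENSE_RANK gives a 3.
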
SQL OLAP Windows
SELECT Category, SaleMonth, MonthAmount,
       AVG(MonthAmount)
         OVER (PARTITION BY Category
               ORDER BY SaleMonth ASC ROWS 2 PRECEDING)
         AS MA
FROM qryOLAPSQL99
ORDER BY SaleMonth ASC;

Category  SaleMonth  MonthAmount  MA
Bird      200101     1500.00
Bird      200102     1700.00
Bird      200103     2000.00      1600.00
Bird      200104     2500.00      1850.00
Cat       200101     4000.00
Cat       200102     5000.00
Cat       200103     6000.00      4500.00
Cat       200104     7000.00      5500.00

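A moving average over a row frame can be sketched directly. One caveat: in standard SQL, `ROWS 2 PRECEDING` is shorthand for a frame from two rows back through the current row, whereas the MA values in the table (for example 1600.00 = (1500 + 1700) / 2) match an average of the two prior rows only, which would correspond to a frame ending at `1 PRECEDING`. The sketch computes both so the difference is visible; `moving_avg` and its parameters are illustrative names.

```python
def moving_avg(values, start_offset, end_offset):
    """Average rows i-start_offset .. i-end_offset within one partition.

    The frame is clipped at the start of the partition; None marks an empty frame.
    """
    result = []
    for i in range(len(values)):
        lo = max(0, i - start_offset)
        hi = i - end_offset
        frame = values[lo:hi + 1] if hi >= lo else []
        result.append(sum(frame) / len(frame) if frame else None)
    return result


bird = [1500, 1700, 2000, 2500]   # MonthAmount for 200101..200104
cat = [4000, 5000, 6000, 7000]

through_current = moving_avg(bird, 2, 0)   # ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
prior_two_only = moving_avg(bird, 2, 1)    # ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING
```

The PARTITION BY clause corresponds to calling the function once per category, so Cat amounts never leak into the Bird averages.
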
Ranges: OVER
SELECT SaleDate, Value,
       SUM(Value) OVER (ORDER BY SaleDate) AS running_sum,
       SUM(Value) OVER (ORDER BY SaleDate RANGE
                        BETWEEN UNBOUNDED PRECEDING
                        AND CURRENT ROW) AS running_sum2,
       SUM(Value) OVER (ORDER BY SaleDate RANGE
                        BETWEEN CURRENT ROW
                        AND UNBOUNDED FOLLOWING) AS remaining_sum
FROM …;

• Sum1 (running_sum) computes the total from the beginning through the current row.
• Sum2 (running_sum2) does the same thing, but lists the range explicitly.
• Sum3 (remaining_sum) computes the total from the current row through the end of the query.

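The running and remaining totals can be emulated on a short list. This sketch assumes one row per SaleDate (with RANGE, rows sharing a date would be treated as peers and summed together) and uses three hypothetical daily values.

```python
from itertools import accumulate

daily_values = [1000, 1500, 2000]   # hypothetical values, already in SaleDate order

# UNBOUNDED PRECEDING .. CURRENT ROW: cumulative total up to and including each row.
running_sum = list(accumulate(daily_values))

# CURRENT ROW .. UNBOUNDED FOLLOWING: grand total minus everything before the row.
grand_total = sum(daily_values)
remaining_sum = [grand_total - running + value
                 for running, value in zip(running_sum, daily_values)]
```

For every row, running_sum plus remaining_sum minus the row's own value equals the grand total, which is a quick consistency check on window results.
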
LAG and LEAD Functions
LAG or LEAD: (Column, # rows, default)

SELECT SaleDate, Value,
       LAG (Value, 1, 0) OVER (ORDER BY SaleDate) AS prior_day,
       LEAD (Value, 1, 0) OVER (ORDER BY SaleDate) AS next_day
FROM …
ORDER BY SaleDate

SaleDate   Value  prior_day  next_day
1/1/2003   1000   0          1500      <- Prior is 0 from default value
1/2/2003   1500   1000       2000
1/3/2003   2000   1500       2300
…
1/31/2003  3500   3200       0

Not part of standard yet? But are in SQL Server and Oracle.

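LAG and LEAD simply look a fixed number of rows backward or forward in the ordered result and substitute the default when they run off either end. A minimal sketch on the first three days only, so the final next_day falls back to the default; `lag` and `lead` are illustrative helpers, not library functions.

```python
def lag(values, offset=1, default=None):
    """Value from `offset` rows earlier, or `default` when there is no such row."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lead(values, offset=1, default=None):
    """Value from `offset` rows later, or `default` when there is no such row."""
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]


values = [1000, 1500, 2000]          # 1/1/2003 through 1/3/2003, in SaleDate order
prior_day = lag(values, 1, 0)
next_day = lead(values, 1, 0)
```

Subtracting prior_day from the current value gives the day-over-day change without a self-join.
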
Data Mining
• Goal: To discover unknown relationships in the data that can be used to make better decisions.

Tools built on databases, by the kind of question they answer:
Tool         Purpose
Reports      Transactions and operations
Queries      Specific ad hoc questions
OLAP         Aggregate, compare, drill down
Data Mining  Unknown relationships

Exploratory Analysis
• Data Mining usually works autonomously.
  • Supervised/directed
  • Unsupervised
  • Often called a bottom-up approach that scans the data to find relationships
• Some statistical routines, but they are not sufficient
  • Statistics relies on averages
  • Sometimes the important data lies in more detailed pairs

Common Techniques
• Classification/Prediction/Regression
• Association Rules/Market Basket Analysis
• Clustering
  • Data points
  • Hierarchies
• Neural Networks
• Deviation Detection
• Sequential Analysis
  • Time series events
  • Websites
• Textual Analysis
• Spatial/Geographic Analysis

Classification Examples
• Examples
  • Which borrowers/loans are most likely to be successful?
  • Which customers are most likely to want a new item?
  • Which companies are likely to file bankruptcy?
  • Which workers are likely to quit in the next six months?
  • Which startup companies are likely to succeed?
  • Which tax returns are fraudulent?

Classification Process
• Clearly identify the outcome/dependent variable.
• Identify potential variables that might affect the outcome.
  • Supervised (modeler chooses)
  • Unsupervised (system scans all/most)
• Use sample data to test and validate the model.
• System creates weights that link independent variables to outcome.

Income  Married  Credit History  Job Stability  Success
50000   Yes      Good            Good           Yes
25000   Yes      Bad             Bad            No
75000   No       Good            Good           No

Classification Techniques
• Regression
• Bayesian Networks
• Decision Trees (hierarchical)
• Neural Networks
• Genetic Algorithms
• Complications
  • Some methods require categorical data
  • Data size is still a problem

Association/Market Basket
• Examples
  • What items are customers likely to buy together?
  • What Web pages are closely related?
  • Others?
• Classic (early) example:
  • Analysis of convenience store data showed customers often buy diapers and beer together.
  • Importance: Consider putting the two together to increase cross-selling.

Association Details (two items)
• Rule evaluation (A implies B)
  • Support for the rule is measured by the percentage of all transactions containing both items: P(A ∩ B)
  • Confidence of the rule is measured by the share of transactions with A that also contain B: P(B | A)
  • Lift is the potential gain attributed to the rule: the effect compared to other baskets without the effect. If it is greater than 1, the effect is positive:
    • Lift = P(A ∩ B) / ( P(A) P(B) ) = P(B | A) / P(B)
• Example: Diapers implies Beer
  • Support: P(D ∩ B) = 0.6, with P(D) = 0.7 and P(B) = 0.5
  • Confidence: P(B | D) = P(D ∩ B) / P(D) = 0.6 / 0.7 = 0.857
  • Lift: P(B | D) / P(B) = 0.857 / 0.5 = 1.714

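The three measures follow directly from the probabilities. A minimal sketch using the Diapers and Beer figures from the example; `rule_metrics` is a hypothetical helper name.

```python
def rule_metrics(p_a, p_b, p_a_and_b):
    """Support, confidence and lift for the rule 'A implies B'."""
    support = p_a_and_b                 # P(A ∩ B)
    confidence = p_a_and_b / p_a        # P(B | A)
    lift = confidence / p_b             # P(B | A) / P(B)
    return support, confidence, lift


# Diapers implies Beer: P(D) = 0.7, P(B) = 0.5, P(D ∩ B) = 0.6
support, confidence, lift = rule_metrics(0.7, 0.5, 0.6)
```

A lift above 1 means that baskets with diapers contain beer more often than baskets in general; a lift of exactly 1 would mean the two purchases are independent.
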
Association Challenges
• If an item is rarely purchased, any other item bought with it seems important. So combine items into categories.

Individual items:
Item      Freq.
1" nails  2%
2" nails  1%
3" nails  1%
4" nails  2%
Lumber    50%

Combined into categories:
Item           Freq.
Hardware       15%
Dim. Lumber    20%
Plywood        15%
Finish lumber  15%

• Some relationships are obvious.
  • Burger and fries.
• Some relationships are meaningless.
  • Hardware store found that toilet rings sell well only when a new store first opens. But what does it mean?

Cluster Analysis
• Examples
  • Are there groups of customers? (If so, we can cross-sell.)
  • Do the locations for our stores have elements in common? (So we can search for similar clusters for new locations.)
  • Do our employees (by department?) have common characteristics? (So we can hire similar, or dissimilar, people.)
• Problem: Many dimensions and large datasets

[Figure: clusters of data points showing large intercluster distance and small intracluster distance.]

Geographic/Location
• Examples
  • Customer location and sales comparisons
  • Factory sites and cost
  • Environmental effects
• Challenge: Map data, multiple overlays
