Chapter 8
Data Warehouses and
Data Mining
Jerry Post
Copyright © 2003
Sequential Storage and Indexes
We picture tables as
ID  LastName  FirstName  DateHired
Operations on Sequential Tables
Read entire table
Easy and fast
Sequential retrieval
Easy and fast for one order.
Random Read/Sequential
Very weak
Probability of any row = 1/N
Sequential retrieval: $EV = \sum_{i=1}^{N} i \cdot \frac{1}{N} = \frac{1}{N} \sum_{i=1}^{N} i = \frac{1}{N} \cdot \frac{N(N+1)}{2} = \frac{N+1}{2}$
1,000,000 rows means 500,000 retrievals per lookup!
Delete
Easy
Insert/Modify
Very weak
Row  Prob.  # Reads
A    1/N    1
B    1/N    2
C    1/N    3
D    1/N    4
E    1/N    5
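The expected-value result for a random lookup in a sequential table can be checked numerically. The following is a minimal sketch; the function name `expected_reads` is made up for illustration, and it assumes every row is equally likely to be requested.

```python
from fractions import Fraction


def expected_reads(n: int) -> Fraction:
    """Average number of reads to reach a random row in a table of n rows.

    Row i costs i reads and is requested with probability 1/n,
    so the expected value is (1 + 2 + ... + n) / n.
    """
    return Fraction(sum(range(1, n + 1)), n)


print(expected_reads(5))                 # 3
print(float(expected_reads(1_000_000)))  # 500000.5
```

The exact value for one million rows is (N + 1)/2 = 500,000.5, which the slide rounds to 500,000.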
Insert into Sequential Table
Insert Inez:
Find insert location.
Copy top to new file.
At insert location, add row.
Original table (first rows):
ID  LastName   FirstName  DateHired
8   Carpenter  Carlos     12/29/98
6   Eaton      Anissa     8/23/98
7   Farris     Dustin     3/28/98
2   Gibson     Bill       3/31/98
4   Hopkins    Alan       2/8/98
5   James      Leisha     1/6/98
New table:
ID  LastName   FirstName  DateHired
8   Carpenter  Carlos     12/29/98
6   Eaton      Anissa     8/23/98
7   Farris     Dustin     3/28/98
2   Gibson     Bill       3/31/98
4   Hopkins    Alan       2/8/98
11  Inez       Maria      1/15/99
5   James      Leisha     1/6/98
9   O'Connor   Jessica    7/23/98
3   Reasoner   Katy       2/17/98
1   Reeves     Keith      1/29/98
10  Shields    Howard     7/13/98
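The insert steps above can be sketched in a few lines. This is only an illustration: a Python list stands in for the sequential file, the function name `insert_sequential` is hypothetical, and only some of the employee rows from the slide are used.

```python
def insert_sequential(rows, new_row, key):
    """Return a new 'file' (list) with new_row placed in key order.

    Every existing row is copied, which is why inserts into a
    sequential table are expensive.
    """
    new_file = []
    inserted = False
    for row in rows:
        # Insert location: just before the first row with a larger key.
        if not inserted and key(new_row) < key(row):
            new_file.append(new_row)
            inserted = True
        new_file.append(row)  # copy the existing row to the new file
    if not inserted:
        new_file.append(new_row)  # the new row sorts after all others
    return new_file


employees = [
    (8, "Carpenter", "Carlos", "12/29/98"),
    (6, "Eaton", "Anissa", "8/23/98"),
    (7, "Farris", "Dustin", "3/28/98"),
    (2, "Gibson", "Bill", "3/31/98"),
    (5, "James", "Leisha", "1/6/98"),
]
result = insert_sequential(
    employees, (11, "Inez", "Maria", "1/15/99"), key=lambda r: r[1]
)
print([r[1] for r in result])
# ['Carpenter', 'Eaton', 'Farris', 'Gibson', 'Inez', 'James']
```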
Binary Search
Given a sorted list of names.
How do you find Jones?
Sequential search
Jones = 10 lookups
Average = 15/2 = 7.5 lookups
Min = 1, Max = 14
Binary search
Find midpoint (14 / 2) = 7
Jones > Goetz
Jones < Kalida
Jones > Inez
N = 1000: Max = 10
N = 1,000,000: Max = 20
Sorted list (numbers in parentheses mark the order of the binary search lookups):
Adams
Brown
Cadiz
Dorfmann
Eaton
Farris
Goetz (1)
Hanson
Inez (3)
Jones (4)
Kalida (2)
Lomax
…
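The lookup counts on this slide can be reproduced with a short sketch. The slide shows only twelve of its fourteen names, so the last two entries below ("Morris" and "Nunez") are hypothetical fillers; the helper names are also made up for illustration.

```python
def binary_search(names, target):
    """Return (index, lookups) for target in a sorted list; index is -1 if absent."""
    lo, hi = 0, len(names) - 1
    lookups = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        lookups += 1
        if names[mid] == target:
            return mid, lookups
        if names[mid] < target:
            lo = mid + 1   # target is in the upper half
        else:
            hi = mid - 1   # target is in the lower half
    return -1, lookups


def max_lookups(n):
    """Worst-case number of lookups for a binary search over n sorted rows."""
    return n.bit_length()  # equals floor(log2(n)) + 1 for n >= 1


names = ["Adams", "Brown", "Cadiz", "Dorfmann", "Eaton", "Farris", "Goetz",
         "Hanson", "Inez", "Jones", "Kalida", "Lomax",
         "Morris", "Nunez"]  # last two names are assumed, not from the slide

print(binary_search(names, "Jones"))  # (9, 4): Goetz, Kalida, Inez, Jones
print(max_lookups(1000))              # 10
print(max_lookups(1_000_000))         # 20
```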
Indexed Sequential Storage
Common uses
Large tables.
Data table:
Address  ID  LastName  FirstName  DateHired
A11      1   Reeves    Keith      1/29/98
…
A78      9   O'Connor  Jessica    7/23/98
A83      10  Shields   Howard     7/13/98
ID index:
ID  Pointer
1   A11
2   A22
3   A32
4   A42
5   A47
6   A58
7   A63
8   A67
9   A78
10  A83
LastName index:
LastName   Pointer
Carpenter  A67
Eaton      A58
Farris     A63
Gibson     A22
Hopkins    A42
James      A47
O'Connor   A78
Reasoner   A32
Reeves     A11
Shields    A83
Indexed for ID and LastName
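To make the pointer idea concrete, here is a minimal sketch of an index lookup. A dictionary stands in for disk storage, each index is a sorted list of (key, pointer) pairs, and only the three data rows visible on the slide are included. The names `storage`, `id_index`, `name_index`, and `lookup` are illustrative assumptions, not part of any DBMS.

```python
from bisect import bisect_left

# Rows stored at the addresses shown on the slide (a dict stands in for disk).
storage = {
    "A11": (1, "Reeves", "Keith", "1/29/98"),
    "A78": (9, "O'Connor", "Jessica", "7/23/98"),
    "A83": (10, "Shields", "Howard", "7/13/98"),
}

# Each index holds (key, pointer) pairs sorted by key.
id_index = [(1, "A11"), (9, "A78"), (10, "A83")]
name_index = [("O'Connor", "A78"), ("Reeves", "A11"), ("Shields", "A83")]


def lookup(index, key):
    """Binary-search the index for key, then follow the pointer to the row."""
    keys = [k for k, _ in index]
    pos = bisect_left(keys, key)
    if pos < len(index) and index[pos][0] == key:
        return storage[index[pos][1]]
    return None


print(lookup(name_index, "Reeves"))  # (1, 'Reeves', 'Keith', '1/29/98')
print(lookup(id_index, 10))          # (10, 'Shields', 'Howard', '7/13/98')
print(lookup(id_index, 4))           # None (row not loaded in this sketch)
```

The same data rows are reachable in two different orders without being stored twice; only the small index lists are duplicated.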
Index Options: Bitmaps and Statistics
Bitmap index
A compressed index designed for non-primary key columns.
Bit-wise operations can be used to quickly match WHERE criteria.
Analyze statistics
By collecting statistics about the actual data within the index, the DBMS can optimize the search path. For example, if it knows that only a few rows match one of your search conditions in a table, it can apply that condition first.
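The bit-wise matching idea can be illustrated with plain integers used as bitmaps. This is a simplified, uncompressed sketch rather than how any particular DBMS stores a bitmap index; the column values and the helper names `build_bitmaps` and `rows_of` are hypothetical.

```python
def build_bitmaps(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    bitmaps = {}
    for row, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps


def rows_of(bitmap):
    """List the row numbers whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical low-cardinality columns, one entry per row.
category = ["Bird", "Cat", "Dog", "Cat", "Bird", "Cat"]
state = ["CA", "NY", "CA", "CA", "NY", "CA"]

cat_bits = build_bitmaps(category)
state_bits = build_bitmaps(state)

# WHERE Category = 'Cat' AND State = 'CA' becomes a single bit-wise AND.
match = cat_bits["Cat"] & state_bits["CA"]
print(rows_of(match))  # [3, 5]
```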
Problems with Indexes
Each index must be updated when rows are inserted, deleted or modified.
Changing one row of data in a table with many indexes can result in considerable time and resources to update all of the indexes.
Steps to improve performance
Data Warehouse
[Figure: Operations data, OLTP database (3NF tables), and flat files; daily data transfer; data warehouse (star configuration); predefined reports and interactive data analysis.]
Data Warehouse Goals
Existing databases optimized for Online Transaction Processing (OLTP)
Online Analytical Processing (OLAP) requires fast retrievals,
Data analysis
Ad hoc queries
Statistical analysis
Data mining (specialized automated tools)
Extraction, Transformation, and Transportation (ETT)
[Figure: transformation steps. Customers; Convert Client to Customer; Apply standard product numbers; Convert currencies.]
OLTP v. OLAP
Multidimensional Cube
[Figure: cube of Pet Store Item Sales. Dimensions: Category, Customer Location, Time (Sale Date). Cell value: Amount = Quantity*Sale Price.]
Sales Date: Time Hierarchy
Levels: Year, Quarter, Month, Week, Day
Roll-up: To get higher-level totals
Drill-down: To get lower-level details
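Roll-up along the time hierarchy amounts to regrouping the same detail rows under a coarser key. The sketch below uses hypothetical sales rows and a made-up `roll_up` helper; the Week level is left out because weeks do not nest cleanly inside months.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (sale date, amount) detail rows.
sales = [
    (date(2003, 1, 15), 100.0),
    (date(2003, 2, 3), 50.0),
    (date(2003, 4, 20), 75.0),
    (date(2004, 1, 5), 25.0),
]

# One grouping key per level of the hierarchy.
LEVELS = {
    "year": lambda d: (d.year,),
    "quarter": lambda d: (d.year, (d.month - 1) // 3 + 1),
    "month": lambda d: (d.year, d.month),
    "day": lambda d: (d.year, d.month, d.day),
}


def roll_up(rows, level):
    """Total the amounts at the requested level of the time hierarchy."""
    totals = defaultdict(float)
    for sale_date, amount in rows:
        totals[LEVELS[level](sale_date)] += amount
    return dict(totals)


print(roll_up(sales, "quarter"))  # {(2003, 1): 150.0, (2003, 2): 75.0, (2004, 1): 25.0}
print(roll_up(sales, "year"))     # {(2003,): 225.0, (2004,): 25.0}
```

Drill-down is the reverse move: calling `roll_up` with a finer level such as "month" or "day" on the same rows.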
Star Design
[Figure: star schema. Fact Table: Sales (Quantity, Amount=SalePrice*Quantity). Dimension Tables: Customer Location.]
Snowflake Design
City: CityID, ZipCode, City, State
Merchandise: ItemID, Description, QuantityOnHand, ListPrice, Category
Sale: SaleID, SaleDate, EmployeeID, CustomerID, SalesTax
Customer: CustomerID, Phone, FirstName, LastName, Address, ZipCode, CityID
OLAPItems: SaleID, ItemID, Quantity, SalePrice
OLAP Computation Issues
Compute Quantity*Price in base query, then add to get $23.00
If you use Calculated Measure in the Cube, it will add first and multiply second to get $45.00, which is wrong.
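The difference between the two orders of computation is easy to see with two rows. The slide's underlying data is not shown, so the quantities and prices below are assumed values chosen because they reproduce the $23.00 and $45.00 totals.

```python
# Assumed (Quantity, Price) rows; not taken from the slide.
rows = [(3, 5.00), (2, 4.00)]

# Correct: multiply within each row, then add the row amounts.
sum_of_products = sum(q * p for q, p in rows)

# Wrong: add each column first, then multiply the totals.
product_of_sums = sum(q for q, _ in rows) * sum(p for _, p in rows)

print(sum_of_products)  # 23.0
print(product_of_sums)  # 45.0
```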
OLAP Data Browsing
21
D Microsoft Pivot Table
A
T
A
B
A
S
E
22
OLAP in SQL 99
GROUP BY two columns
Gives you totals for each month within each category.
…
ON Sale.SaleID = SaleItem.SaleID
GROUP BY Category, Month(SaleDate);
Category  Month  Amount
Bird      1      $135.00
Bird      2      $45.00
Bird      3      $202.50
Bird      6      $67.50
Bird      7      $90.00
Bird      9      $67.50
SQL ROLLUP
SELECT Category, Month…, Sum …
FROM …
GROUP BY ROLLUP (Category, Month...)
Category  Month   Amount
Bird      1       135.00
Bird      2       45.00
…
(null)    (null)  8451.79
Missing Values Cause Problems
If there are missing values in the groups, it can be difficult to identify the super-aggregate rows.
Category  Month   Amount
Bird      (null)  32.00    Missing date
Bird      (null)  607.50   Super-aggregate
Cat       1       396.00
Cat       2       113.85
…
Cat       (null)  1293.30
GROUPING Function
SELECT Category, Month…, Sum …,
CUBE Option
SELECT Category, Month, Sum, GROUPING (Category) AS Gc,
GROUPING (Month) AS Gm
FROM …
GROUP BY CUBE (Category, Month...)
Category  Month   Amount   Gc  Gm
Bird      1       135.00   0   0
Bird      2       45.00    0   0
…
Bird      (null)  32.00    0   0
Bird      (null)  607.50   0   1
Cat       1       396.00   0   0
Cat       2       113.85   0   0
…
Cat       (null)  1293.30  0   1
(null)    1       1358.8   1   0
(null)    2       1508.94  1   0
(null)    3       2362.68  1   0
…
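The role of the GROUPING flags can be imitated in a few lines. This sketch assumes the standard meaning of GROUPING (1 when the column has been aggregated away in a super-aggregate row, 0 otherwise) and uses three hypothetical detail rows, one with a missing month; the `cube` helper is made up for illustration.

```python
from collections import defaultdict
from itertools import product


def cube(rows):
    """Total amounts for every combination of (category, month).

    rows: iterable of (category, month, amount).
    Returns {(category, month, gc, gm): total}, where gc or gm is 1 when
    that column was aggregated away, as GROUPING() would report.
    """
    totals = defaultdict(float)
    for category, month, amount in rows:
        for gc, gm in product((0, 1), repeat=2):
            key = (None if gc else category, None if gm else month, gc, gm)
            totals[key] += amount
    return dict(totals)


# Hypothetical detail rows; the second has a missing (null) month.
detail = [("Bird", 1, 135.00), ("Bird", None, 32.00), ("Cat", 1, 396.00)]
result = cube(detail)

print(result[("Bird", None, 0, 0)])  # 32.0  -> ordinary group with a missing month
print(result[("Bird", None, 0, 1)])  # 167.0 -> super-aggregate over all Bird months
print(result[(None, None, 1, 1)])    # 563.0 -> grand total
```

Both Bird rows show a null month, but the flag columns separate the group whose date is missing from the super-aggregate.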
SQL OLAP Analytical Functions
VAR_POP, VAR_SAMP: variance
STDDEV_POP, STDDEV_SAMP: standard deviation
COVAR_POP, COVAR_SAMP: covariance
CORR: correlation
REGR_R2: regression r-square
REGR_SLOPE, REGR_INTERCEPT: regression data (many)
SQL RANK Functions
SELECT Employee, SalesValue,
RANK() OVER (ORDER BY SalesValue DESC) AS rank,
DENSE_RANK() OVER (ORDER BY SalesValue DESC) AS dense
FROM Sales
ORDER BY SalesValue DESC, Employee;
Employee  SalesValue  rank  dense
Jones     18,000      1     1
Black     16,000      2     2
Smith     16,000      2     2
White     14,000      4     3
DENSE_RANK does not skip numbers
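The difference between RANK and DENSE_RANK can be traced by hand with a short sketch. The function name is made up for illustration, and the input is assumed to be sorted in descending order already, as the ORDER BY clause would do.

```python
def rank_and_dense_rank(values):
    """Return (value, rank, dense_rank) for values sorted in descending order.

    rank jumps to the row position after a tie; dense_rank only
    increases by one for each new distinct value.
    """
    out = []
    rank = dense = 0
    previous = None
    for position, value in enumerate(values, start=1):
        if value != previous:
            rank = position
            dense += 1
            previous = value
        out.append((value, rank, dense))
    return out


for row in rank_and_dense_rank([18000, 16000, 16000, 14000]):
    print(row)
# (18000, 1, 1)
# (16000, 2, 2)
# (16000, 2, 2)
# (14000, 4, 3)
```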
SQL OLAP Windows
SELECT Category, SaleMonth, MonthAmount,
AVG(MonthAmount)
OVER (PARTITION BY Category
ORDER BY SaleMonth ASC ROWS 2 PRECEDING)
AS MA
FROM qryOLAPSQL99
Category  SaleMonth  MonthAmount  MA
Bird      200103     2000.00      1600.00
Bird      200104     2500.00      1850.00
…
Cat       200101     4000.00
Cat       200102     5000.00
Cat       200103     6000.00      4500.00
Cat       200104     7000.00      5500.00
…
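A moving average over a row-based window can be sketched as follows. In standard SQL, `ROWS 2 PRECEDING` is shorthand for a frame running from two rows back through the current row, and that is the reading assumed here; the `moving_average` helper is illustrative, and the input is the Cat partition from the slide.

```python
def moving_average(values, preceding=2):
    """Average each value with up to `preceding` earlier rows (current row included)."""
    averages = []
    for i in range(len(values)):
        window = values[max(0, i - preceding): i + 1]
        averages.append(sum(window) / len(window))
    return averages


cat_month_amounts = [4000.00, 5000.00, 6000.00, 7000.00]  # Cat, 200101 to 200104
print(moving_average(cat_month_amounts))
# [4000.0, 4500.0, 5000.0, 6000.0]
```

Under this reading the third Cat month averages 5000.00. The 4500.00 and 5500.00 in the slide's table equal the mean of the two earlier months alone, so that output may have been produced with a frame that excludes the current row.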
Ranges: OVER
SELECT SaleDate, Value,
SUM(Value) OVER (ORDER BY SaleDate) AS running_sum,
SUM(Value) OVER (ORDER BY SaleDate RANGE
BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) AS running_sum2,
SUM(Value) OVER (ORDER BY SaleDate RANGE
BETWEEN CURRENT ROW
AND UNBOUNDED FOLLOWING) AS remaining_sum
FROM …;
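The running and remaining sums can be mimicked with cumulative totals. This sketch assumes one value per distinct SaleDate, already in date order (with tied dates, a RANGE frame would also include the peer rows); the sample values and function names are hypothetical.

```python
from itertools import accumulate


def running_sum(values):
    """Sum from the first row through the current row."""
    return list(accumulate(values))


def remaining_sum(values):
    """Sum from the current row through the last row."""
    return list(accumulate(reversed(values)))[::-1]


daily_values = [1000, 1500, 2000]  # assumed values, one per date
print(running_sum(daily_values))    # [1000, 2500, 4500]
print(remaining_sum(daily_values))  # [4500, 3500, 2000]
```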
LAG and LEAD Functions
LAG or LEAD: (Column, # rows, default)
SaleDate   Value  Lag   Lead
1/1/2003   1000   0     1500
1/2/2003   1500   1000  2000
1/3/2003   2000   1500  2300
…
1/31/2003  3500   3200  0
Not part of the SQL 99 standard, but available in SQL Server and Oracle.
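The (Column, # rows, default) arguments map directly onto list offsets. The sketch below is a plain-Python imitation, not any vendor's implementation; it uses the first three values from the table, so the last Lead value falls back to the default because no later row is loaded.

```python
def lag(values, offset=1, default=0):
    """Value from `offset` rows earlier, or `default` when there is none."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lead(values, offset=1, default=0):
    """Value from `offset` rows later, or `default` when there is none."""
    n = len(values)
    return [values[i + offset] if i + offset < n else default
            for i in range(n)]


values = [1000, 1500, 2000]
print(lag(values))   # [0, 1000, 1500]
print(lead(values))  # [1500, 2000, 0]
```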
Data Mining
Goal: To discover unknown relationships in the data that can be used to make better decisions.
Exploratory Analysis
Data Mining usually works autonomously.
Supervised/directed
Unsupervised
Often called a bottom-up approach that scans the data to find relationships
Some statistical routines, but they are not sufficient
Statistics relies on averages
Sometimes the important data lies in more detailed pairs
Common Techniques
Classification/Prediction/Regression
Association Rules/Market Basket Analysis
Clustering
Data points
Hierarchies
Neural Networks
Deviation Detection
Sequential Analysis
Time series events
Websites
Textual Analysis
Spatial/Geographic Analysis
Classification Examples
Examples
Which borrowers/loans are most likely to be successful?
Which customers are most likely to want a new item?
Which companies are likely to file bankruptcy?
Which workers are likely to quit in the next six months?
Classification Process
Clearly identify the outcome/dependent variable.
Identify potential variables that might affect the outcome.
Supervised (modeler chooses)
Unsupervised (system scans all/most)
… outcome.
Income  Married  Credit History  Job Stability  Success
50000   Yes      Good            Good           Yes
25000   Yes      Bad             Bad            No
75000   No       Good            Good           No
Classification Techniques
Regression
Bayesian Networks
Complications
Some methods require categorical data
Data size is still a problem
Association/Market Basket
Examples
What items are customers likely to buy together?
What Web pages are closely related?
Others?
Classic (early) example:
Association Details (two items)
Rule evaluation (A implies B)
Support for the rule is measured by the percentage of all transactions containing both items: P(A ∩ B)
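Support as defined above is a simple count over the transactions. The baskets below are hypothetical, and the function name `support` is chosen only for illustration.

```python
def support(transactions, a, b):
    """Fraction of all transactions that contain both item a and item b."""
    both = sum(1 for basket in transactions if a in basket and b in basket)
    return both / len(transactions)


# Hypothetical market baskets.
baskets = [
    {"nails", "lumber"},
    {"nails"},
    {"lumber", "plywood"},
    {"nails", "lumber", "plywood"},
]
print(support(baskets, "nails", "lumber"))  # 0.5
```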
Association Challenges
If an item is rarely purchased, any other item bought with it seems important. So combine items into categories.
Item      Freq.
1" nails  2%
2" nails  1%
3" nails  1%
4" nails  2%
Lumber    50%
Item           Freq.
Hardware       15%
Dim. Lumber    20%
Plywood        15%
Finish lumber  15%
Hardware store found that toilet rings sell well only when a new store first opens. But what does it mean?
Cluster Analysis
Examples
Are there groups of customers? (If so, we can cross-sell.)
Do the locations for our stores have elements in common? (So we can search for similar clusters for new locations.)
Do our employees (by department?) have common characteristics? (So we can hire similar, or dissimilar, people.)
Problem: Many dimensions and large datasets
[Figure: two clusters, labeled "Large intercluster distance" and "Small intracluster distance".]
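The two distances named in the figure can be computed for a toy example. This is a sketch under stated assumptions: Euclidean distance, the mean pairwise distance as the intracluster measure, and the distance between centroids as the intercluster measure. Other definitions are common, and the points and function names are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def mean_intracluster(cluster):
    """Average distance between every pair of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def intercluster(c1, c2):
    """Distance between the centroids of two clusters."""
    return dist(centroid(c1), centroid(c2))


# Two hypothetical, well-separated clusters of 2-D points.
cluster_a = [(0, 0), (0, 2), (2, 0), (2, 2)]
cluster_b = [(10, 10), (10, 12), (12, 10), (12, 12)]

print(mean_intracluster(cluster_a))        # small: points sit close together
print(intercluster(cluster_a, cluster_b))  # large: clusters are far apart
```

A good clustering keeps the first number small relative to the second.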
Geographic/Location
Examples
Customer location and sales comparisons
Factory sites and cost
Environmental effects
Challenge: Map data, multiple overlays