Database Management Systems
Chapter 8: Data Warehouses and Data Mining
Jerry Post
Copyright © 2003

Sequential Storage and Indexes

- We picture tables as simple rows and columns, but they cannot be stored this way.
  - It takes too many operations to find an item.
  - Insertions require reading and rewriting the entire table.

ID  LastName   FirstName  DateHired
1   Reeves     Keith      1/29/98
2   Gibson     Bill       3/31/98
3   Reasoner   Katy       2/17/98
4   Hopkins    Alan       2/8/98
5   James      Leisha     1/6/98
6   Eaton      Anissa     8/23/98
7   Farris     Dustin     3/28/98
8   Carpenter  Carlos     12/29/98
9   O'Connor   Jessica    7/23/98
10  Shields    Howard     7/13/98

Operations on Sequential Tables

- Read entire table
  - Easy and fast
- Sequential retrieval
  - Easy and fast for one order.
- Random Read/Sequential
  - Very weak
    - Probability of any row = 1/N
    - Sequential retrieval
    - 1,000,000 rows means 500,000 retrievals per lookup!
- Delete
  - Easy
- Insert/Modify
  - Very weak

Row  Prob.  # Reads
A    1/N    1
B    1/N    2
C    1/N    3
D    1/N    4
E    1/N    5
…    1/N    i

$$EV = \sum_{i=1}^{N} i \cdot \frac{1}{N} = \frac{1}{N} \sum_{i=1}^{N} i = \frac{1}{N} \cdot \frac{N(N+1)}{2} = \frac{N+1}{2}$$

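The expected-value formula above can be checked numerically. This is a minimal sketch; the function name `expected_reads` is an illustrative assumption, not part of the slides.

```python
def expected_reads(n: int) -> float:
    """Average number of reads to find one row by sequential scan.

    Row i costs i reads and every row is equally likely (probability 1/n),
    so the expected value is (1 + 2 + ... + n) / n, which equals (n + 1) / 2.
    """
    return sum(range(1, n + 1)) / n


small = expected_reads(5)          # the five-row table A..E
large = expected_reads(1_000_000)  # the million-row case from the slide
```

For one million rows the result is 500,000.5, which the slide rounds to 500,000 retrievals per lookup.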
Insert into Sequential Table

- Insert Inez:
  - Find insert location.
  - Copy top to new file.
  - At insert location, add row.
  - Copy rest of file.

Before (sorted by LastName):

ID  LastName   FirstName  DateHired
8   Carpenter  Carlos     12/29/98
6   Eaton      Anissa     8/23/98
7   Farris     Dustin     3/28/98
2   Gibson     Bill       3/31/98
4   Hopkins    Alan       2/8/98
5   James      Leisha     1/6/98
9   O'Connor   Jessica    7/23/98
3   Reasoner   Katy       2/17/98
1   Reeves     Keith      1/29/98
10  Shields    Howard     7/13/98

After:

ID  LastName   FirstName  DateHired
8   Carpenter  Carlos     12/29/98
6   Eaton      Anissa     8/23/98
7   Farris     Dustin     3/28/98
2   Gibson     Bill       3/31/98
4   Hopkins    Alan       2/8/98
11  Inez       Maria      1/15/99
5   James      Leisha     1/6/98
9   O'Connor   Jessica    7/23/98
3   Reasoner   Katy       2/17/98
1   Reeves     Keith      1/29/98
10  Shields    Howard     7/13/98

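The copy-and-insert steps can be sketched as follows. This is an illustration only: a Python list stands in for the file, and the function name `insert_sequential` is an assumption made for the example.

```python
def insert_sequential(rows, new_row, key):
    """Insert new_row into a sorted 'file' by copying every row to a new file."""
    new_file = []
    inserted = False
    for row in rows:
        # Copy rows until the insert location is found, then add the new row.
        if not inserted and key(new_row) < key(row):
            new_file.append(new_row)
            inserted = True
        new_file.append(row)  # copy the rest of the file unchanged
    if not inserted:
        new_file.append(new_row)  # new row sorts after every existing row
    return new_file


# (ID, LastName) pairs in LastName order, as on the slide.
staff = [(8, "Carpenter"), (6, "Eaton"), (7, "Farris"), (2, "Gibson"),
         (4, "Hopkins"), (5, "James"), (9, "O'Connor"), (3, "Reasoner"),
         (1, "Reeves"), (10, "Shields")]
updated = insert_sequential(staff, (11, "Inez"), key=lambda r: r[1])
```

Every existing row is read and rewritten once, which is why a single insert costs as much as copying the whole table.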
Binary Search

- Given a sorted list of names.
- How do you find Jones?
- Sequential search
  - Jones = 10 lookups
  - Average = 15/2 = 7.5 lookups
  - Min = 1, Max = 14
- Binary search
  - Find midpoint (14 / 2) = 7
  - Jones > Goetz
  - Jones < Kalida
  - Jones > Inez
  - Jones = Jones (4 lookups)
- Max = log2 (N)
  - N = 1000: Max = 10
  - N = 1,000,000: Max = 20

Sorted list (14 entries), with the order of binary-search lookups:

     Adams
     Brown
     Cadiz
     Dorfmann
     Eaton
     Farris
1    Goetz
     Hanson
3    Inez
4    Jones
2    Kalida
     Lomax
     Miranda
     Norman

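The lookup sequence above can be reproduced with a short sketch. The function name and the lookup counter are assumptions added for illustration; the midpoint rule used here (integer division of the low and high positions) is one common choice and happens to visit the same names as the slide.

```python
NAMES = ["Adams", "Brown", "Cadiz", "Dorfmann", "Eaton", "Farris", "Goetz",
         "Hanson", "Inez", "Jones", "Kalida", "Lomax", "Miranda", "Norman"]


def binary_search(names, target):
    """Return (position, lookups); position is -1 when target is absent."""
    lo, hi = 0, len(names) - 1
    lookups = 0
    while lo <= hi:
        mid = (lo + hi) // 2          # midpoint of the remaining range
        lookups += 1
        if names[mid] == target:
            return mid, lookups
        if names[mid] < target:
            lo = mid + 1              # target is in the upper half
        else:
            hi = mid - 1              # target is in the lower half
    return -1, lookups


position, lookups = binary_search(NAMES, "Jones")
sequential_lookups = NAMES.index("Jones") + 1  # scan from the top of the list
```

Here `lookups` is 4 (Goetz, Kalida, Inez, Jones), against 10 for the sequential scan.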
Indexed Sequential Storage

- Common uses
  - Large tables.
  - Need many sequential lists.
  - Some random search--with one or two key columns.
- Mostly replaced by B+-Tree.

Data, indexed for ID and LastName:

Address  ID  LastName   FirstName  DateHired
A11      1   Reeves     Keith      1/29/98
A22      2   Gibson     Bill       3/31/98
A32      3   Reasoner   Katy       2/17/98
A42      4   Hopkins    Alan       2/8/98
A47      5   James      Leisha     1/6/98
A58      6   Eaton      Anissa     8/23/98
A63      7   Farris     Dustin     3/28/98
A67      8   Carpenter  Carlos     12/29/98
A78      9   O'Connor   Jessica    7/23/98
A83      10  Shields    Howard     7/13/98

ID index:

ID  Pointer
1   A11
2   A22
3   A32
4   A42
5   A47
6   A58
7   A63
8   A67
9   A78
10  A83

LastName index:

LastName   Pointer
Carpenter  A67
Eaton      A58
Farris     A63
Gibson     A22
Hopkins    A42
James      A47
O'Connor   A78
Reasoner   A32
Reeves     A11
Shields    A83

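A rough sketch of how a sorted index with pointers supports random lookups. A dictionary keyed by address stands in for the data file, and the names `storage`, `last_name_index`, and `find_by_last_name` are assumptions made for this example.

```python
from bisect import bisect_left

# Address -> row, standing in for the stored data file.
storage = {
    "A11": (1, "Reeves", "Keith", "1/29/98"),
    "A22": (2, "Gibson", "Bill", "3/31/98"),
    "A32": (3, "Reasoner", "Katy", "2/17/98"),
    "A42": (4, "Hopkins", "Alan", "2/8/98"),
    "A47": (5, "James", "Leisha", "1/6/98"),
    "A58": (6, "Eaton", "Anissa", "8/23/98"),
    "A63": (7, "Farris", "Dustin", "3/28/98"),
    "A67": (8, "Carpenter", "Carlos", "12/29/98"),
    "A78": (9, "O'Connor", "Jessica", "7/23/98"),
    "A83": (10, "Shields", "Howard", "7/13/98"),
}

# The index is a sorted list of (key, pointer) pairs.
last_name_index = sorted((row[1], addr) for addr, row in storage.items())


def find_by_last_name(name):
    """Binary-search the index, then follow the pointer to the row."""
    pos = bisect_left(last_name_index, (name,))
    if pos < len(last_name_index) and last_name_index[pos][0] == name:
        return storage[last_name_index[pos][1]]
    return None


hopkins = find_by_last_name("Hopkins")
```

The data rows never move; only the small index has to be kept in sorted order.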
Index Options: Bitmaps and Statistics

- Bitmap index
  - A compressed index designed for non-primary key columns. Bit-wise operations can be used to quickly match WHERE criteria.
- Analyze statistics
  - By collecting statistics about the actual data within the index, the DBMS can optimize the search path. For example, if it knows that only a few rows match one of your search conditions in a table, it can apply that condition first, reducing the amount of work needed to join tables.

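To illustrate the bit-wise matching idea, here is a small sketch that uses one Python integer per column value as an uncompressed bitmap. Real bitmap indexes are compressed and DBMS-specific; the column data and function names below are hypothetical.

```python
def build_bitmaps(values):
    """Map each distinct column value to an integer whose bit i is set
    when row i holds that value."""
    bitmaps = {}
    for row_number, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_number)
    return bitmaps


def matching_rows(bitmap):
    """List the row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]


# Hypothetical non-key columns for six rows.
category = ["Bird", "Cat", "Dog", "Cat", "Bird", "Cat"]
state = ["CA", "CA", "NY", "NY", "CA", "NY"]

category_bitmaps = build_bitmaps(category)
state_bitmaps = build_bitmaps(state)

# WHERE Category = 'Cat' AND State = 'NY' becomes a single bit-wise AND.
hits = matching_rows(category_bitmaps["Cat"] & state_bitmaps["NY"])
```

An OR condition would use `|` on the same bitmaps, so combined criteria never need to touch the table rows until the final matches are known.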
Problems with Indexes

- Each index must be updated when rows are inserted, deleted, or modified.
- Changing one row of data in a table with many indexes can result in considerable time and resources to update all of the indexes.
- Steps to improve performance
  - Index primary keys
  - Index common join columns (usually primary keys)
  - Index columns that are searched regularly
  - Use a performance analyzer

Data Warehouse

[Figure: data flow from operations data and the OLTP database (3NF tables), plus flat files, through a daily data transfer into the data warehouse (star configuration), which serves predefined reports and interactive data analysis.]

Data Warehouse Goals

- Existing databases optimized for Online Transaction Processing (OLTP)
- Online Analytical Processing (OLAP) requires fast retrievals, and only bulk writes.
  - Different goals require different storage, so build a separate data warehouse to use for queries.
  - Extraction, Transformation, Transportation (ETT)
- Data analysis
  - Ad hoc queries
  - Statistical analysis
  - Data mining (specialized automated tools)

Extraction, Transformation, and Transportation (ETT)

Transaction data from diverse systems:
- Customers: Convert Client to Customer
- Apply standard product numbers
- Convert currencies
- Fix region codes

Data warehouse: All data must be consistent.

OLTP v. OLAP

Multidimensional Cube

[Figure: cube of Pet Store item sales with three dimensions: Category, Customer Location, and Time (Sale Date). Measure: Amount = Quantity*Sale Price.]

Sales Date: Time Hierarchy

Levels: Year, Quarter, Month, Week, Day

- Roll-up: To get higher-level totals
- Drill-down: To get lower-level details

Star Design

- Fact Table: Sales
  - Quantity
  - Amount=SalePrice*Quantity
- Dimension Tables
  - Products
  - Sales Date
  - Customer Location

Snowflake Design

- OLAPItems: SaleID, ItemID, Quantity, SalePrice, Amount
- Merchandise: ItemID, Description, QuantityOnHand, ListPrice, Category
- Sale: SaleID, SaleDate, EmployeeID, CustomerID, SalesTax
- Customer: CustomerID, Phone, FirstName, LastName, Address, ZipCode, CityID
- City: CityID, ZipCode, City, State

Dimension tables can join to other dimension tables.

OLAP Computation Issues

Compute Quantity*Price in base query, then add to get $23.00.
If you use Calculated Measure in the Cube, it will add first and multiply second to get $45.00, which is wrong.

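The difference between the two orders of computation can be seen in a small sketch. The two sale lines below are hypothetical values chosen so that the totals match the $23.00 and $45.00 quoted above; the slide's own data is not reproduced here.

```python
# Hypothetical sale lines: (Quantity, Price).
items = [(3, 5.00), (2, 4.00)]

# Correct: multiply on each row in the base query, then add.
correct_amount = sum(quantity * price for quantity, price in items)

# Wrong: add each column first, then multiply the two totals.
total_quantity = sum(quantity for quantity, _ in items)
total_price = sum(price for _, price in items)
wrong_amount = total_quantity * total_price
```

The first approach gives 23.0 and the second gives 45.0, because the sum of products is not the product of sums.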
OLAP Data Browsing

Microsoft Pivot Table

OLAP in SQL 99

GROUP BY two columns gives you totals for each month within each category. You do not get super-aggregate totals for the category, or the month, or the overall total.

SELECT Category, Month(SaleDate) AS Month,
       Sum(Quantity*SalePrice) AS Amount
FROM Sale INNER JOIN (Merchandise INNER JOIN SaleItem
       ON Merchandise.ItemID = SaleItem.ItemID)
     ON Sale.SaleID = SaleItem.SaleID
GROUP BY Category, Month(SaleDate);

Category  Month  Amount
Bird      1      $135.00
Bird      2      $45.00
Bird      3      $202.50
Bird      6      $67.50
Bird      7      $90.00
Bird      9      $67.50
Cat       1      $396.00
Cat       2      $113.85
Cat       3      $443.70
Cat       4      $2.25

SQL ROLLUP

SELECT Category, Month…, Sum …
FROM …
GROUP BY ROLLUP (Category, Month...)

Category  Month   Amount
Bird      1       135.00
Bird      2       45.00
…
Bird      (null)  607.50
Cat       1       396.00
Cat       2       113.85
…
Cat       (null)  1293.30
…
(null)    (null)  8451.79

Missing Values Cause Problems

If there are missing values in the groups, it can be difficult to identify the super-aggregate rows.

Category  Month   Amount
Bird      1       135.00
Bird      2       45.00
…
Bird      (null)  32.00    Missing date
Bird      (null)  607.50   Super-aggregate
Cat       1       396.00
Cat       2       113.85
…
Cat       (null)  1293.30
(null)    (null)  8451.79

GROUPING Function

SELECT Category, Month…, Sum …,
       GROUPING (Category) AS Gc,
       GROUPING (Month) AS Gm
FROM …
GROUP BY ROLLUP (Category, Month...)

GROUPING returns 1 when the column has been aggregated away in a super-aggregate row and 0 otherwise, so a category subtotal has Gm = 1 while a row with a genuinely missing month has Gm = 0.

Category  Month   Amount   Gc  Gm
Bird      1       135.00   0   0
Bird      2       45.00    0   0
…
Bird      (null)  32.00    0   0
Bird      (null)  607.50   0   1
Cat       1       396.00   0   0
Cat       2       113.85   0   0
…
Cat       (null)  1293.30  0   1
…
(null)    (null)  8451.79  1   1

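As a rough illustration of what ROLLUP with GROUPING produces, the sketch below emulates it in plain Python. It follows the SQL standard's convention that GROUPING(column) is 1 when that column has been aggregated away. The function name `rollup` and the four sample rows are assumptions for the example, not the slide's data set.

```python
from collections import defaultdict


def rollup(rows):
    """Emulate GROUP BY ROLLUP (Category, Month) with GROUPING flags.

    Returns tuples of (Category, Month, Amount, Gc, Gm).
    """
    detail = defaultdict(float)
    by_category = defaultdict(float)
    total = 0.0
    for category, month, amount in rows:
        detail[(category, month)] += amount
        by_category[category] += amount
        total += amount
    # Detail rows: neither column is aggregated away.
    result = [(c, m, amt, 0, 0) for (c, m), amt in detail.items()]
    # Category subtotals: Month is aggregated away, so Gm = 1.
    result += [(c, None, amt, 0, 1) for c, amt in by_category.items()]
    # Grand total: both columns are aggregated away.
    result.append((None, None, total, 1, 1))
    return result


# Hypothetical sales; one Bird sale has a missing (None) month.
sales = [("Bird", 1, 135.00), ("Bird", 2, 45.00),
         ("Bird", None, 32.00), ("Cat", 1, 396.00)]
result = rollup(sales)
```

Both Bird rows with a null month appear in the output, but only the subtotal carries Gm = 1, which is how a report can tell them apart.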
CUBE Option

SELECT Category, Month, Sum, GROUPING (Category) AS Gc,
       GROUPING (Month) AS Gm
FROM …
GROUP BY CUBE (Category, Month...)

Category  Month   Amount   Gc  Gm
Bird      1       135.00   0   0
Bird      2       45.00    0   0
…
Bird      (null)  32.00    0   0
Bird      (null)  607.50   0   1
Cat       1       396.00   0   0
Cat       2       113.85   0   0
…
Cat       (null)  1293.30  0   1
(null)    1       1358.80  1   0
(null)    2       1508.94  1   0
(null)    3       2362.68  1   0
…
(null)    (null)  8451.79  1   1

GROUPING SETS: Hiding Details

SELECT Category, Month, Sum
FROM …
GROUP BY GROUPING SETS
  ( ROLLUP (Category),
    ROLLUP (Month),
    ()
  )

Category  Month   Amount
Bird      (null)  607.50
Cat       (null)  1293.30
…
(null)    1       1358.80
(null)    2       1508.94
(null)    3       2362.68
…
(null)    (null)  8451.79

SQL OLAP Analytical Functions

VAR_POP, VAR_SAMP            variance
STDDEV_POP, STDDEV_SAMP      standard deviation
COVAR_POP, COVAR_SAMP        covariance
CORR                         correlation
REGR_R2                      regression r-square
REGR_SLOPE, REGR_INTERCEPT   regression data (many)

SQL RANK Functions

SELECT Employee, SalesValue,
       RANK() OVER (ORDER BY SalesValue DESC) AS rank,
       DENSE_RANK() OVER (ORDER BY SalesValue DESC) AS dense
FROM Sales
ORDER BY SalesValue DESC, Employee;

Employee  SalesValue  rank  dense
Jones     18,000      1     1
Black     16,000      2     2
Smith     16,000      2     2
White     14,000      4     3

DENSE_RANK does not skip numbers.

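The ranking query can be tried with Python's built-in sqlite3 module. This sketch assumes the bundled SQLite is version 3.25 or later, which is when window functions were added; the aliases `sales_rank` and `dense_sales_rank` are chosen for the example and differ slightly from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Employee TEXT, SalesValue INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?)",
    [("Jones", 18000), ("Smith", 16000), ("Black", 16000), ("White", 14000)],
)

# RANK leaves a gap after ties; DENSE_RANK does not.
ranked = conn.execute(
    """
    SELECT Employee, SalesValue,
           RANK() OVER (ORDER BY SalesValue DESC) AS sales_rank,
           DENSE_RANK() OVER (ORDER BY SalesValue DESC) AS dense_sales_rank
    FROM Sales
    ORDER BY SalesValue DESC, Employee
    """
).fetchall()
conn.close()
```

White is fourth by RANK because two employees tie for second, but third by DENSE_RANK.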
SQL OLAP Windows

SELECT Category, SaleMonth, MonthAmount,
       AVG(MonthAmount)
         OVER (PARTITION BY Category
               ORDER BY SaleMonth ASC ROWS 2 PRECEDING)
         AS MA
FROM qryOLAPSQL99
ORDER BY SaleMonth ASC;

Category  SaleMonth  MonthAmount  MA
Bird      200101     1500.00
Bird      200102     1700.00
Bird      200103     2000.00      1600.00
Bird      200104     2500.00      1850.00
…
Cat       200101     4000.00
Cat       200102     5000.00
Cat       200103     6000.00      4500.00
Cat       200104     7000.00      5500.00
…

Ranges: OVER

SELECT SaleDate, Value,
       SUM(Value) OVER (ORDER BY SaleDate) AS running_sum,
       SUM(Value) OVER (ORDER BY SaleDate RANGE
                        BETWEEN UNBOUNDED PRECEDING
                        AND CURRENT ROW) AS running_sum2,
       SUM(Value) OVER (ORDER BY SaleDate RANGE
                        BETWEEN CURRENT ROW
                        AND UNBOUNDED FOLLOWING) AS remaining_sum
FROM …;

- Sum1 (running_sum) computes total from beginning through current row.
- Sum2 (running_sum2) does the same thing, but more explicitly lists the rows.
- Sum3 (remaining_sum) computes total from current row through end of query.

LAG and LEAD Functions

LAG or LEAD: (Column, # rows, default)

SELECT SaleDate, Value,
       LAG (Value, 1, 0) OVER (ORDER BY SaleDate) AS prior_day,
       LEAD (Value, 1, 0) OVER (ORDER BY SaleDate) AS next_day
FROM …
ORDER BY SaleDate

SaleDate   Value  prior_day  next_day
1/1/2003   1000   0          1500
1/2/2003   1500   1000       2000
1/3/2003   2000   1500       2300
…
1/31/2003  3500   3200       0

The first prior_day is 0 from the default value.
Not part of standard yet? But are in SQL Server and Oracle.

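LAG and LEAD can also be tried with Python's built-in sqlite3 module, assuming the bundled SQLite is version 3.25 or later. The table name `DailySales` and the three sample rows are assumptions for this sketch; they are a shortened stand-in for the slide's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DailySales (SaleDate TEXT, Value INTEGER)")
conn.executemany(
    "INSERT INTO DailySales VALUES (?, ?)",
    [("2003-01-01", 1000), ("2003-01-02", 1500), ("2003-01-03", 2000)],
)

# Arguments are (column, number of rows to look back or ahead, default).
shifted = conn.execute(
    """
    SELECT SaleDate, Value,
           LAG(Value, 1, 0) OVER (ORDER BY SaleDate) AS prior_day,
           LEAD(Value, 1, 0) OVER (ORDER BY SaleDate) AS next_day
    FROM DailySales
    ORDER BY SaleDate
    """
).fetchall()
conn.close()
```

The first row has no prior day and the last row has no next day, so both fall back to the default of 0.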
Data Mining

- Goal: To discover unknown relationships in the data that can be used to make better decisions.

Tools built on databases:
- Reports: Transactions and operations
- Queries: Specific ad hoc questions
- OLAP: Aggregate, compare, drill down
- Data Mining: Unknown relationships

Exploratory Analysis

- Data Mining usually works autonomously.
  - Supervised/directed
  - Unsupervised
  - Often called a bottom-up approach that scans the data to find relationships
- Some statistical routines, but they are not sufficient
  - Statistics relies on averages
  - Sometimes the important data lies in more detailed pairs

Common Techniques

- Classification/Prediction/Regression
- Association Rules/Market Basket Analysis
- Clustering
  - Data points
  - Hierarchies
- Neural Networks
- Deviation Detection
- Sequential Analysis
  - Time series events
  - Websites
- Textual Analysis
- Spatial/Geographic Analysis

Classification Examples

- Examples
  - Which borrowers/loans are most likely to be successful?
  - Which customers are most likely to want a new item?
  - Which companies are likely to file bankruptcy?
  - Which workers are likely to quit in the next six months?
  - Which startup companies are likely to succeed?
  - Which tax returns are fraudulent?

Classification Process

- Clearly identify the outcome/dependent variable.
- Identify potential variables that might affect the outcome.
  - Supervised (modeler chooses)
  - Unsupervised (system scans all/most)
- Use sample data to test and validate the model.
- System creates weights that link independent variables to outcome.

Income  Married  Credit History  Job Stability  Success
50000   Yes      Good            Good           Yes
25000   Yes      Bad             Bad            No
75000   No       Good            Good           No

Classification Techniques

- Regression
- Bayesian Networks
- Decision Trees (hierarchical)
- Neural Networks
- Genetic Algorithms
- Complications
  - Some methods require categorical data
  - Data size is still a problem

Association/Market Basket

- Examples
  - What items are customers likely to buy together?
  - What Web pages are closely related?
  - Others?
- Classic (early) example:
  - Analysis of convenience store data showed customers often buy diapers and beer together.
  - Importance: Consider putting the two together to increase cross-selling.

Association Details (two items)

- Rule evaluation (A implies B)
  - Support for the rule is measured by the percentage of all transactions containing both items: P(A ∩ B)
  - Confidence of the rule is measured by the transactions with A that also contain B: P(B | A)
  - Lift is the potential gain attributed to the rule: the effect compared to other baskets without the effect. If it is greater than 1, the effect is positive:
    - P(A ∩ B) / ( P(A) P(B) )
    - P(B|A) / P(B)
- Example: Diapers implies Beer
  - Support: P(D ∩ B) = .6, P(D) = .7, P(B) = .5
  - Confidence: P(B|D) = .857 = P(D ∩ B)/P(D) = .6/.7
  - Lift: P(B|D) / P(B) = 1.714 = .857 / .5

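The three measures follow directly from the probabilities in the Diapers-implies-Beer example. A minimal sketch, with the function name `rule_metrics` chosen for illustration:

```python
def rule_metrics(p_a, p_b, p_both):
    """Evaluate the rule 'A implies B' from three probabilities.

    p_a: share of transactions containing A
    p_b: share of transactions containing B
    p_both: share of transactions containing both A and B
    """
    support = p_both              # P(A and B)
    confidence = p_both / p_a     # P(B | A)
    lift = confidence / p_b       # P(B | A) / P(B)
    return support, confidence, lift


# Diapers implies Beer, using the figures from the example.
support, confidence, lift = rule_metrics(p_a=0.7, p_b=0.5, p_both=0.6)
```

Rounded to three decimals, confidence is 0.857 and lift is 1.714; a lift above 1 means beer is more likely in baskets that contain diapers than in baskets overall.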
Association Challenges

- If an item is rarely purchased, any other item bought with it seems important. So combine items into categories.

Item      Freq.      Item           Freq.
1" nails  2%         Hardware       15%
2" nails  1%         Dim. Lumber    20%
3" nails  1%         Plywood        15%
4" nails  2%         Finish lumber  15%
Lumber    50%

- Some relationships are obvious.
  - Burger and fries.
- Some relationships are meaningless.
  - Hardware store found that toilet rings sell well only when a new store first opens. But what does it mean?

Cluster Analysis

- Examples
  - Are there groups of customers? (If so, we can cross-sell.)
  - Do the locations for our stores have elements in common? (So we can search for similar clusters for new locations.)
  - Do our employees (by department?) have common characteristics? (So we can hire similar, or dissimilar, people.)
- Problem: Many dimensions and large datasets

[Figure: clusters of data points, showing a large intercluster distance and a small intracluster distance.]

Geographic/Location

- Examples
  - Customer location and sales comparisons
  - Factory sites and cost
  - Environmental effects
- Challenge: Map data, multiple overlays
