
DATA WAREHOUSING

AND

DATA MINING
A Comprehensive guide for students and IT Professionals
(Choice Based Credit System (CBCS) Pattern) New Syllabus
(For B.Sc. Computer Science, B.Sc. Software Computer Science, B.Sc. ISM, B.Sc. IT,
B.Sc. Software System, B.Sc. Software Engineering, BCA, M.Sc. Computer Science,
M.Sc. Information Technology, M.Sc. Information System and Management, M.Sc.
Software Engineering, MCA, B.E. CSE, B.Tech. IT, M.E. CSE, M.Tech. IT, M.Phil., and
IT Professionals)

By

Dr. P. Rizwan Ahmed, MCA, M.Sc., M.A., M.Phil., Ph.D.


Head of the Department
Department of Computer Applications and
PG Department of Information Technology
Mazharul Uloom College,
Ambur - 635 802, Vellore Dist. Tamil Nadu.

CONTENTS
Preface
Acknowledgement
PART- I
DATA MINING
Chapter 1

Introduction

1.1 An Expanding universe of data


1.2 Information as a production factor
1.3 KDD and data mining
1.4 Data Mining vs query tools
1.5 Data Mining in Marketing
1.6 Practical applications of data mining
1.7 Learning
1.8 Self-learning computer systems
1.9 Machine learning
1.9.1 Why machine learning is done?
1.10 Machine learning and the methodology of science
1.10.1 Differences between Data Mining and Machine Learning
1.11 Concept Learning
Summary
Review Questions
Chapter 2

Data Mining and the Data Warehouse

2.1 Data Warehouse: Definitions


2.2 Why do we need Data Warehouse?
2.3 Designing decision support systems
2.3.1 Hardware and software products of a decision support system
2.4 Integration with data mining
2.5 Client/server and data warehousing
2.6 Multi-processing machines
2.7 Cost justification
Summary
Review Questions

Chapter 3

Knowledge Discovery Process

3.1 Introduction
3.2 Data selection
3.3 Cleaning
3.4 Coding
3.5 Data mining
3.5.1 Preliminary analysis of the data set using traditional query tools
3.5.1.1 Visualization techniques
3.5.1.2 Likelihood and distance
3.5.1.3 OLAP tools
3.5.1.4 K-nearest neighbor
3.5.1.5 Decision Trees
3.5.1.6 Association Rules
3.5.1.7 Neural networks
3.5.1.8 Genetic algorithms
3.6 Reporting
Summary
Review Questions
Chapter 4

KDD Environment

4.1 Different forms of knowledge


4.2 KDD environment
4.3 Ten golden rules
Summary
Review Questions
Chapter 5

Real life applications

5.1 Customer profiling


5.2 Predicting bid behavior of pilots
5.3 Discovering foreign key relationships
Summary
Review Questions
Chapter 6

Formal aspects of learning algorithm

6.1 Learning as compression of data sets

6.2 Information content of a message


6.3 Noise and redundancy
6.4 Significance of noise
6.5 Fuzzy databases
6.6 The traditional theory of the relational database
6.7 From relations to tables
6.7.1 From keys to statistical dependencies
6.8 Denormalization
6.9 Data mining primitives
Summary
Review Questions
Chapter 7

Data Mining

7.1 Introduction
7.2 Data
7.3 Information
7.4 Knowledge
7.5 Historical Note: Many names of Data Mining
7.6 Data Mining
7.6.1 Some of the definitions of Data Mining
7.7 Why Data Mining
7.8 Why Data Mining is Important?
7.9 Uses of Data Mining
7.10 Data Mining Models
7.10.1 Verification Model
7.10.2 Discovery Model
7.11 Development of data mining
7.12 Applications of Data Mining
7.12.1 Healthcare
7.12.2 Finance
7.12.3 Retail Industry
7.12.4 Telecommunication
7.12.5 Text Mining and Web Mining
7.12.6 Higher Education
7.13 Basic Data Mining Tasks / Taxonomy of data mining tasks
7.13.1 Prediction methods
7.13.2 Descriptive methods
7.14 Data Mining Vs Database
7.15 Data Mining Vs KDD
7.16 Steps in Data Mining Process / Steps involved in KDD


7.17 Architecture of a typical data mining system
7.18 Future Trends
7.18.1 Data Trends
7.18.2 Hardware Trends
7.18.3 Network Trends
7.18.4 Scientific Computing Trends
7.18.5 Business Trends
7.19 Major issues in Data Mining / Data Mining Issues
7.20 Data Mining Metrics
7.21 Social Implications of Data Mining
7.22 Data Mining from a database Perspective
Summary
Review Questions
Chapter 8

Advanced Databases

8.1 Various kinds of data / Types of Data


8.1.1 Flat files
8.1.2 Relational Databases
8.1.3 Data Warehouses
8.1.4 Transaction Databases
8.1.5 Object oriented databases
8.1.6 Temporal Databases
8.1.7 Text and Multimedia Databases
8.1.8 Spatial Databases
8.1.9 Time-Series Databases
8.1.10 World Wide Web (WWW)
8.1.11 Heterogeneous databases
Summary
Review Questions
Chapter 9

Data Mining Functionalities, Classification and Case Study

9.1 Data Mining Functionalities


9.2 Pattern Interesting / Interestingness of Patterns
9.2.1 Interestingness measures:
9.2.2 Objective vs. subjective interestingness measures
9.3 Classification of Data Mining Systems

9.4 Data Mining Task Primitives


9.5 Why Data Mining Primitives and Languages?
9.6 Integration of data mining system with a database or Data warehouse system
9.6.1 No Coupling
9.6.2 Loose Coupling
9.6.3 Semitight coupling
9.6.4 Tight coupling
9.7 Case Study
9.7.1 Customer Attrition: Case Study
9.7.2 Assessing Credit Risk : Case Study
9.7.3 Successful e-commerce - Case Study
Summary
Review Questions
Chapter 10

Overview of Data Mining Techniques-I

10.1 Data Mining Techniques


10.1.1 Cluster Analysis
10.1.2 Induction
10.1.3 Decision Trees
10.1.4 Rule induction
10.1.5 Nearest Neighbour
10.1.6 Neural networks
10.2 Data Mining Application Examples
Summary
Review Questions
Chapter 11

Overview of Data Mining Techniques-II

11.1 Introduction
11.2 A Statistical Perspective on Data Mining
11.2.1 Point Estimation
11.2.2 Models Based on Summarization
11.2.3 Bayes Theorem
11.2.4 Hypothesis Testing
11.2.5 Regression and Correlation
11.3 Similarity Measures
11.4 Decision Trees
11.5 Neural Networks

11.6 Genetic Algorithms


Summary
Review Questions
Chapter 12

Data Preprocessing

12.1 Introduction
12.2 Why preprocess the data / Need for preprocessing
12.3 Data Preprocessing Techniques / Major Tasks in Data Preprocessing
12.4 Data Cleaning
12.4.1 Missing Data / Values
12.4.1.1 Methods of handling missing data
12.4.2 Noisy Data
12.4.2.1 How to Handle Noisy Data?
12.4.3 Outlier Analysis
12.4.4 Regression
12.5 Data Cleaning as a Process
12.5.1 Discrepancy detection
12.5.2 Discrepancy Detection Tools
12.5.3 Data Transformation
12.5.4 Data Transformation Tools
12.6 Data Integration
12.6.1 Issues to be considered in Data Integration
12.6.1.1 Schema integration
12.6.1.2 Reduction
12.6.1.3 Detecting and resolving data value conflicts
12.6.2 Handling Redundant Data in Data Integration
12.7 Data Transformation
12.7.1 Methods of Data Normalization
12.7.1.1 Min-max normalization
12.7.1.2 z-score normalization
12.7.1.3 Normalization by decimal scaling
12.8 Data Reduction
12.8.1 Data Reduction Strategies
12.8.1.1 Data Cube Aggregation
12.8.1.2 Attribute Subset Selection
12.8.1.3 Dimensionality Reduction

12.8.1.4 Numerosity Reduction


12.8.1.5 Data Discretization and concept hierarchy generation
Data discretization
12.9 Data Mining Query Languages (DMQL)
Summary
Review Questions
Chapter 13

Association Rules

13.1 Association Rules


13.2 Large Item sets
13.3 Basic Algorithm
13.3.1 Apriori Algorithm
13.3.2 Partitioning
13.4 Parallel and Distributed Algorithms
13.4.1 Data parallelism
13.4.2 Task parallelism
13.5 Comparing Approaches
13.6 Incremental Rules
13.7 Advanced Association Rule Techniques
13.7.1 Generalized association rules
13.7.2 Multiple-level association rules
13.7.3 Quantitative association rules
13.7.4 Using Multiple Minimum Supports
13.8 Measuring the Quality of Rules
Summary
Review Questions

Chapter 14

Concept Description: Generalization and Characterization

14.1 Concept Description


14.2 Data Generalization and Summarization-based
14.2.1 Data Generalization
14.2.2 Characterization: Data Cube Approach
14.2.3 Attribute oriented induction for data characterization

14.2.4 Efficient Implementation of Attribute-Oriented Induction


14.3 Analytical characterization: Analysis of attribute relevance
14.4 Mining class comparisons: Discriminating between different classes
Mining Class Comparisons
14.5 Descriptive Data Summarization / Mining descriptive
statistical measures in large databases
14.5.1 Measuring the Central Tendency
14.5.2 Measuring the Dispersion of Data
14.5.3 Graphics Displays of basic Statistical Description

Summary
Review Questions
Chapter 15

Mining Frequent Patterns, Associations & Correlations

15.1 Mining Association Rules in Large Databases


15.1.1 Market Basket Analysis: A Motivating Example
15.1.2 Association Rule: Basic Concepts
15.1.3 Association Rule Mining: A Road Map
15.1.4 Mining Frequent Itemsets: the Key Step
15.2 Mining single-dimensional Boolean association rules from transactional databases:
Efficient and Scalable Frequent Itemset Mining Methods
15.2.1 Apriori Algorithm
15.2.2 Generating Association Rules from Frequent Itemsets
15.2.3 Methods to Improve Apriori's Efficiency
15.2.4 Mining Frequent Patterns without Candidate Generation
15.2.5 Principles of Frequent Pattern Growth
15.3 Mining various kinds of Association Rules
15.3.1 Mining multilevel association rules from transactional databases:
Multiple-Level Association Rules
15.3.2 Mining multidimensional association rules from transactional databases
and data warehouse
15.4 From Association Mining to Correlation Analysis
15.5 Constraint-Based Association Mining

Summary
Review Questions
Chapter 16

Classification

16.1 Introduction
16.1.1 Classification algorithms based on the categorization:


Issues in Classification
16.2 Statistical-Based Algorithms
16.2.1 Regression
16.2.2 Bayesian classification
16.2.3 Naive Bayes Classifier
16.3 Distance-Based Algorithms
16.3.1 Simple Approach
16.3.2 K Nearest Neighbors
16.4 Decision Tree-Based Algorithms
16.4.1 C4.5
16.4.2 CART
16.4.2.1 Scalable DT techniques
16.5 Neural Network-Based Algorithms
16.5.1 Propagation
16.5.2 NN supervised learning
16.5.3 Radial Basis Function Networks
16.5.4 Perceptron
16.6 Rule-Based Algorithms
16.6.1 Generating Rules from a DT
16.6.2 Generating Rules from a Neural Net
16.6.3 Generating Rules without a DT or NN
16.7 Combining Techniques
Summary
Review Questions
Chapter 17

Classification and Prediction

17.1 Classification
17.1.1 Classification: A Two-Step Process
17.1.2 Prediction
17.1.3 Issues regarding classification and prediction
17.1.4 Comparing Classification and Prediction Methods
17.2 Classification by decision tree induction
17.2.1 Decision Tree Induction
17.2.2 Attribute Selection Measure
17.2.3 Information Gain (ID3/C4.5)
17.2.4 Gini Index (IBM IntelligentMiner)
17.2.5 Extracting Classification Rules from Trees
17.2.6 Avoid Overfitting in Classification

17.2.7 Enhancements to basic decision tree induction


17.2.8 Classification in Large Databases
17.3 Bayesian Classification: Introduction
17.3.1 Bayesian Classification: Why?
17.3.2 Bayesian Classification
17.3.3 Bayesian Theorem
17.3.4 Naive Bayes Classifier
17.3.5 Bayesian Belief Networks
17.3.6 Training Bayesian Belief Networks
17.4 Rule Based Classification
17.4.1 Using IF-THEN Rules for Classification
17.4.2 Rule Extraction from a Decision Tree
17.4.3 Rule induction using a Sequential Covering Algorithm
17.4.4 Rule Quality Measures
17.5 Classification by backpropagation
17.6 Classification based on concepts from association rule mining/
Association-Based Classification / Classification by association Rules
17.7 Lazy Learners (or Learning from Your Neighbors)
17.7.1 k-Nearest Neighbor
17.7.2 Case-Based Reasoning (CBR)
17.8 Other Classification Methods
17.8.1 Genetic Algorithms
17.8.2 Rough Set Approach
17.8.3 Fuzzy Sets Approaches
17.9 Prediction
17.10 Classification accuracy
17.10.1 Classification Accuracy: Estimating Error Rates
Summary
Review Questions
Chapter 18

Clustering

18.1 Introduction
18.2 Similarity and Distance Measures
18.3 Outliers
18.4 Hierarchical Algorithms
18.4.1 Agglomerative Algorithms
18.5 Partitional Algorithms
18.5.1 Minimum spanning tree
18.5.2 Squared Error Clustering Algorithm
18.5.3 K-means clustering


18.5.4 Nearest neighbor algorithm
18.5.5 PAM Algorithm
18.5.5.1 CLARA
18.5.5.2 CLARANS
18.5.6 Clustering with genetic algorithms
18.5.7 Clustering With Neural Networks
18.5.7.1 Self-Organizing Feature Maps
18.6 Clustering Large Databases
18.6.1 BIRCH
18.6.2 DBSCAN
18.6.3 CURE Algorithm
18.7 Comparison of Clustering Algorithm
Summary
Review Questions
Chapter 19

Cluster Analysis

19.1 What is Cluster Analysis?


19.2 General Applications of Clustering
19.3 Examples of Clustering Applications
19.4 What is Good Clustering?
19.5 Requirements of Clustering in Data Mining
19.6 Types of Data in Cluster Analysis
19.6.1 Interval-valued variables
19.6.2 Binary Variables
19.6.3 Nominal, Ordinal, and Ratio-Scaled Variables.
19.7 A Categorization of Major Clustering Methods
19.7.1 Major Clustering Approaches
19.8 Partitioning Methods: Basic Concept
19.8.1 K-Means Clustering Method
19.8.2 K-Medoids Clustering Method
19.8.2.1 Comparison between K-means and K-medoids
19.8.3 PAM
19.8.4 CLARA
19.9 Hierarchical Methods
19.9.1 Types of Hierarchical Clustering Methods
19.9.1.1 Agglomerative Hierarchical Clustering
19.9.1.2 Divisive Hierarchical Clustering
19.9.2 BIRCH

19.9.3 CURE
19.9.4 ROCK
19.9.5 CHAMELEON
19.10 Density-Based Methods
19.10.1 DBSCAN
19.10.2 OPTICS
19.10.3 DENCLUE
19.11 Grid-Based Methods
19.11.1 STING
19.11.2 WaveCluster
19.11.3 CLIQUE
19.12 Model-Based Clustering Methods
19.12.1 Expectation Maximization (EM)
19.12.2 Conceptual clustering
19.12.3 Neural network approaches
19.13 Outlier Analysis
19.13.1 Outlier Discovery: Statistical Approaches
19.13.2 Outlier Discovery: Distance-Based Approach
19.13.3 Outlier Discovery: Deviation-Based Approach
Summary
Review Questions
Chapter 20

Advanced Topics (Mining Complex types of data)

20.1 Multidimensional analysis and descriptive mining of complex data objects


20.1.1 Generalization of Structured Data
20.1.2 Generalizing Spatial and Multimedia Data
20.1.3 Generalizing Object Data
20.1.4 Generalization-based Mining of Plan Databases by Divide and Conquer
20.2 Spatial Data Mining
20.2.1 Dimensions and Measures in Spatial Data Warehouse
20.2.2 Mining Spatial Association and Co-location Patterns
20.2.3 Spatial Classification and Spatial Trend Analysis
20.3 Mining multimedia databases
20.3.1 Similarity Search in Multimedia Data
20.3.2 Multidimensional Analysis of Multimedia Data
20.4 Mining time-series and sequence data
20.4.1 Time-series database
20.4.2 Mining Time-Series and Sequence Data: Trend analysis

20.4.3 Estimation of Trend Curve


20.4.4 Discovery of Trend in Time-Series
20.4.5 Multidimensional Indexing
20.4.6 Subsequence Matching
20.4.7 Query Languages for Time Sequences
20.5 Text Mining / Mining text databases
20.5.1 Text Data Analysis and Information Retrieval
20.5.2 Text Indexing Techniques
20.5.3 Text Mining Approaches
20.6 Mining the World-Wide Web / Web Mining
Chapter 21

Applications and Trends in Data Mining

21.1 Applications of Data Mining


21.1.1 Data Mining for Financial Data Analysis
21.1.2 Data Mining for Retail Industry
21.1.3 Data Mining for Telecommunication Industry
21.1.4 Biomedical Data Mining and DNA Analysis
21.1.5 Data Mining Applications in Sales/Marketing
21.1.6 Data Mining Applications in Banking / Finance
21.1.7 Data Mining Applications in Health Care and Insurance
21.2 Data mining system products and research prototypes
21.2.1 How to choose a data mining system?
21.2.2 Examples of Data Mining Systems
21.3 Additional themes on data mining
21.3.1 Theoretical Foundations of Data Mining
21.3.2 Statistical Data Mining
21.4 Social impact of data mining
21.5 Trends in data mining
Summary
Review Questions
PART II
DATA WAREHOUSING

Chapter 22

Data warehousing

22.1 Introduction
22.2 Characteristics of Data Warehouse
22.3 Need for Data Warehousing


22.4 Why Separate Data Warehouse?
22.5 Difference between Operational databases and Data Warehouses
22.6 Difference between OLTP and Data warehouse
22.7 Benefits of Data Warehousing
22.8 Future of data warehouse
22.9 Limitations of Data Warehouse
22.10 Applications of Data Warehousing
22.11 Advantages of Data Warehousing
22.12 Data Warehousing Tools
Summary
Review Questions
Chapter 23

Data Warehousing Components

23.1 Overall Architecture


23.2 Data warehouse database
23.3 Sourcing, acquisition, cleanup, and transformation tools
23.4 Metadata
23.5 Access tools
23.5.1 Query and reporting tools
23.5.2 Application
23.5.3 OLAP
23.5.4 Data mining
23.6 Data marts
23.7 Data warehouse administration and management
Summary
Review Questions
Chapter 24

From Data warehousing to data mining

24.1 Data warehouse usage


24.1.1 Three kinds of data warehouse applications
24.2 Information processing Online Analytical Processing
24.2.1 Advantages of OLAM
24.2.2 Architecture of On-Line Analytical Mining
24.2.3 Comparison between OLAP and OLAM
Summary
Review Questions

Chapter 25

Data Warehouse Architecture

25.1 Data Warehouse architecture


25.1.1 Steps for the design and construction of data warehouse
25.1.2 Data Warehouse Design Process
25.1.3 Three Tier Data Warehouse Architecture
25.1.3.1 Enterprise Warehouse
25.1.3.2 Data Mart
25.1.3.3 Virtual data warehouse
25.2 Data warehouse Back-End Tools and Utilities
25.3 Metadata Repository
25.4 OLAP Engine
25.4.1 Relational OLAP (ROLAP)
25.4.2 Multidimensional OLAP (MOLAP)
25.4.3 Hybrid OLAP (HOLAP)
25.4.4 Specialized Servers
Summary
Review Questions
Chapter 26

Data Warehouse Implementation

26.1 Data Warehouse Implementation


26.1.1 Efficient Computation of Data Cubes
26.1.2 Cube Operation
26.1.3 Indexing OLAP Data: Bitmap Index
26.1.4 Indexing OLAP Data: Join Indices
26.1.5 Efficient Processing OLAP Queries
Summary
Review Questions
Chapter 27

Mapping the data warehouse to a multiprocessor architecture

27.1 Relational database technology for data warehouse


27.1.1 Types of parallelism
27.1.2 Data partitioning
27.2 Database architecture for parallel processing
27.2.1 Shared-memory architecture
27.2.2 Shared-disk architecture
27.2.3 Shared-nothing architecture

27.2.4 Combined architecture


27.3 Parallel RDBMS features
27.4 Alternative technologies
27.5 Parallel DBMS Vendors
27.5.1 Oracle
27.5.2 Informix
27.5.4 Sybase
27.5.5 Microsoft
Summary
Review Questions
Chapter 28

Reporting and Query Tools and Applications

28.1 Tool categories
28.1.1 Reporting tools
28.1.2 Managed Query Tools
28.1.3 Executive information tools
28.1.4 OLAP tools
28.1.5 Data mining tools
28.2 Need for application
28.3 Cognos Impromptu
28.4 Applications
28.4.1 PowerBuilder
Summary
Review Questions
Chapter 29

On-Line Analytical Processing (OLAP)

29.1 Introduction
29.2 Need for OLAP
29.3 Multidimensional data model
29.3.1 From Tables and Spreadsheets to Data Cubes
29.4 OLAP Guidelines / OLAP Product Evaluation Rules
29.5 Data Warehouse Schema / OLAP Schema
29.5.1 Star Schema
29.5.2 Star Schema Keys
29.5.3 Advantages of Star schema
29.5.4 Snow Flake Schema

29.5.5 Fact Constellation


29.6 Concept hierarchies
29.7 OLAP operation in the Multidimensional Data Model
29.8 Multidimensional versus Multirelational OLAP
29.9 Categorization of OLAP Tools
29.10 OLAP Tools and the Internet
29.11 Difference between OLTP and OLAP
29.12 Comparison of DBMS, OLAP, and Data Mining
Summary
Review Questions
Chapter 30

Security

30.1 Introduction
30.2 Requirements
30.2.1 User Access
30.2.2 Legal Requirements
30.2.3 Audit Requirements
30.2.4 Network Requirements
30.2.5 Data Movement
30.2.6 Documentation
30.2.7 High-Security Environments
30.3 Performance Impact of Security
30.3.1 Views
30.3.2 Data Movement
30.4 Security Impact on Design
30.4.1 Application Development
30.4.2 Database Design
30.4.3 Testing
Summary
Review Questions
Chapter 31

Service Level Agreement (SLA)

31.1 Introduction
31.2 Definition of Types of System
31.3 Defining the SLA
31.3.1 User Requirements
31.3.2 System Requirements
Summary
Review Questions
Chapter 32

Operating the data warehouse

32.1 Introduction
32.2 Day-to-Day Operations of the Data Warehouse
32.3 Overnight Processing
Summary
Review Questions
Chapter 33

Capacity Planning

33.1 Process
33.2 Estimating the Load
33.2.1 Initial Configuration
33.2.2 How much CPU bandwidth
33.2.3 How Much Memory
33.2.4 How much disk?
Summary
Review Questions
Chapter 34

Tuning and testing the data warehouse

34.1 Tuning the Data Load


34.2 Prioritized Tuning Steps
34.3 Tuning Queries
34.3.1 Fixed queries
34.3.2 Ad hoc queries
34.4 Testing the Data Warehouse
34.4.1 Introduction
34.4.2 The Testing Terminologies
34.4.3 Testing the operational environment
34.4.5 Testing the database
34.4.5.1 Testing database manager and monitoring tools
34.4.5.2 Testing database features
34.4.5.3 Testing database performance
34.5 Testing the Application
Summary

Review Questions
Chapter 35

Backup and Recovery

35.1 Introduction
35.1.1 Types of Backup
35.2 Data Warehouse Recovery Models
35.3 Define Backup and Recovery Strategy
35.4 Security Impact on Design of Data Warehouse
35.4.1 Application Development
35.4.2 Database Design
35.4.3 Testing
35.5 Disaster Recovery
Summary
Review Questions
APPENDIX A: Glossary
APPENDIX B: Two marks Questions with Answers
APPENDIX C: Past University Question Papers
BIBLIOGRAPHY