Data Mining

This book is a part of the course by Jaipur National University, Jaipur.
This book contains the course content for Data Mining.
JNU, Jaipur
First Edition 2013
The content in the book is copyright of JNU. All rights reserved.
No part of the content may in any form or by any electronic, mechanical, photocopying, recording, or any other
means be reproduced, stored in a retrieval system or be broadcast or transmitted without the prior permission of
the publisher.
JNU makes reasonable endeavours to ensure content is current and accurate. JNU reserves the right to alter the
content whenever the need arises, and to vary it at any time without prior notice.
I/JNU OLE
Index
I. Contents ...................................................................... II
II. List of Figures ........................................................ VII
III. List of Tables ...........................................................IX
IV. Abbreviations .......................................................... X
V. Case Study .............................................................. 159
VI. Bibliography ......................................................... 175
VII. Self Assessment Answers ................................... 178
Book at a Glance
Contents
Chapter I ....................................................................................................................................................... 1
Data Warehouse – Need, Planning and Architecture ............................................................................... 1
Aim ................................................................................................................................................................ 1
Objectives ...................................................................................................................................................... 1
Learning outcome .......................................................................................................................................... 1
1.1 Introduction ............................................................................................................................................. 2
1.2 Need for Data Warehousing ..................................................................................................................... 4
1.3 Basic Elements of Data Warehousing ...................................................................................................... 5
1.4 Project Planning and Management .......................................................................................................... 6
1.5 Architecture and Infrastructure ................................................................................................................ 8
1.5.1 Infrastructure ...........................................................................................................................11
1.5.2 Metadata ................................................................................................................................. 13
1.5.3 Metadata Components ........................................................................................................... 14
Summary ..................................................................................................................................................... 17
References ................................................................................................................................................... 17
Recommended Reading ............................................................................................................................. 17
Self Assessment ........................................................................................................................................... 18
Chapter II ................................................................................................................................................... 20
Data Design and Data Representation ..................................................................................................... 20
Aim .............................................................................................................................................................. 20
Objectives .................................................................................................................................................... 20
Learning outcome ........................................................................................................................................ 20
2.1 Introduction ............................................................................................................................................ 21
2.2 Design Decision ..................................................................................................................................... 21
2.3 Use of CASE Tools ................................................................................................................................ 21
2.4 Star Schema ........................................................................................................................................... 23
2.4.1 Review of a Simple Star Schema ....................................................................... 23
2.4.2 Star Schema Keys .................................................................................................................. 24
2.5 Dimensional Modelling ......................................................................................................................... 26
2.5.1 E-R Modelling versus Dimensional Modelling ..................................................................... 26
2.6 Data Extraction ...................................................................................................................................... 26
2.6.1 Source Identification .............................................................................................................. 27
2.6.2 Data Extraction Techniques ................................................................................................... 28
2.6.3 Data in Operational Systems .................................................................................................. 28
2.7 Data Transformation .............................................................................................................................. 33
2.7.1 Major Transformation Types .................................................................................................. 34
2.7.2 Data Integration and Consolidation ....................................................................................... 36
2.7.3 Implementing Transformation ............................................................................................... 37
2.8 Data Loading .......................................................................................................................................... 38
2.9 Data Quality ........................................................................................................................................... 39
2.10 Information Access and Delivery ......................................................................................................... 40
2.11 Matching Information to Classes of Users: OLAP in Data Warehouse .............................................. 40
2.11.1 Information from the Data Warehouse ................................................................................. 41
2.11.2 Information Potential ........................................................................................................... 41
Summary ..................................................................................................................................................... 43
References ................................................................................................................................................... 43
Recommended Reading ............................................................................................................................. 43
Self Assessment ........................................................................................................................................... 44
Chapter III .................................................................................................................................................. 46
Data Mining ................................................................................................................................................ 46
Aim .............................................................................................................................................................. 46
Objectives .................................................................................................................................................... 46
Learning outcome ........................................................................................................................................ 46
3.1 Introduction ............................................................................................................................................ 47
3.2 Crucial Concepts of Data Mining .......................................................................................................... 48
3.2.1 Bagging (Voting, Averaging) ................................................................................................. 48
3.2.2 Boosting ................................................................................................................................. 49
3.2.3 Data Preparation (in Data Mining) ........................................................................................ 49
3.2.4 Data Reduction (for Data Mining) ......................................................................................... 49
3.2.5 Deployment ............................................................................................................................ 49
3.2.6 Drill-Down Analysis .............................................................................................................. 50
3.2.7 Feature Selection .................................................................................................................... 50
3.2.8 Machine Learning .................................................................................................................. 50
3.2.9 Meta-Learning ....................................................................................................................... 50
3.2.10 Models for Data Mining ...................................................................................................... 50
3.2.11 Predictive Data Mining ........................................................................................................ 52
3.2.12 Text Mining .......................................................................................................................... 52
3.3 Cross-Industry Standard Process: CRISP-DM ........................................................................................ 52
3.3.1 CRISP-DM: The Six Phases .................................................................................................. 53
3.4 Data Mining Techniques ........................................................................................................................ 55
3.5 Graph Mining ......................................................................................................................................... 55
3.6 Social Network Analysis ........................................................................................................................ 56
3.6.1 Characteristics of Social Networks ........................................................................................ 56
3.6.2 Mining on Social Networks ................................................................................................... 57
3.7 Multirelational Data Mining .................................................................................................................. 59
3.8 Data Mining Algorithms and their Types ............................................................................................... 60
3.8.1 Classification ......................................................................................................................... 61
3.8.2 Clustering ............................................................................................................................... 69
3.8.3 Association Rules ................................................................................................................... 77
Summary ..................................................................................................................................................... 80
References ................................................................................................................................................... 80
Recommended Reading ............................................................................................................................. 81
Self Assessment ........................................................................................................................................... 82
Chapter IV .................................................................................................................................................. 84
Web Application of Data Mining .............................................................................................................. 84
Aim .............................................................................................................................................................. 84
Objectives .................................................................................................................................................... 84
Learning outcome ........................................................................................................................................ 84
4.1 Introduction ............................................................................................................................................ 85
4.2 Goals of Data Mining and Knowledge Discovery ................................................................................. 86
4.3 Types of Knowledge Discovered during Data Mining .......................................................................... 86
4.4 Knowledge Discovery Process .............................................................................................................. 87
4.4.1 Overview of Knowledge Discovery Process ......................................................................... 88
4.5 Web Mining ............................................................................................................................................ 90
4.5.1 Web Analysis.......................................................................................................................... 90
4.5.2 Benefits of Web Mining ......................................................................................... 91
4.6 Web Content Mining .............................................................................................................................. 91
4.7 Web Structure Mining ............................................................................................................ 92
4.8 Web Usage Mining ................................................................................................................................. 93
Summary ..................................................................................................................................................... 95
References ................................................................................................................................................... 95
Recommended Reading ............................................................................................................................. 95
Self Assessment ........................................................................................................................................... 96
Chapter V .................................................................................................................................................... 98
Advanced Topics of Data Mining ................................................................................................. 98
Aim .............................................................................................................................................................. 98
Objectives .................................................................................................................................................... 98
Learning outcome ........................................................................................................................................ 98
5.1 Introduction ............................................................................................................................................ 99
5.2 Concepts ................................................................................................................................................. 99
5.2.1 Mechanism ........................................................................................................................... 100
5.2.2 Knowledge to be Discovered ............................................................................................... 100
5.3 Techniques of SDMKD ........................................................................................................................ 101
5.3.1 SDMKD-based Image Classification ................................................................................... 103
5.3.2 Cloud Model ........................................................................................................................ 104
5.3.3 Data Fields ........................................................................................................................... 105
5.4 Design- and Model-based Approaches to Spatial Sampling ................................................................ 106
5.4.1 Design-based Approach to Sampling ................................................................................... 106
5.4.2 Model-based Approach to Sampling .................................................................................... 107
5.5 Temporal Mining .................................................................................................................................. 107
5.5.1 Time in Data Warehouses .................................................................................................... 108
5.5.2 Temporal Constraints and Temporal Relations .................................................................... 108
5.5.3 Requirements for a Temporal Knowledge-Based Management System ............................. 108
5.6 Database Mediators .............................................................................................................................. 108
5.6.1 Temporal Relation Discovery .............................................................................................. 109
5.6.2 Semantic Queries on Temporal Data ................................................................................... 109
5.7 Temporal Data Types ............................................................................................................................110
5.8 Temporal Data Processing ....................................................................................................................110
5.8.1 Data Normalisation ...............................................................................................................111
5.9 Temporal Event Representation ............................................................................................................111
5.9.1 Event Representation Using Markov Models .......................................................................111
5.9.2 A Formalism for Temporal Objects and Repetitions .............................................................112
5.10 Classification Techniques ...................................................................................................................112
5.10.1 Distance-Based Classifier ...................................................................................................112
5.10.2 Bayes Classifier ..................................................................................................................112
5.10.3 Decision Tree ......................................................................................................................112
5.10.4 Neural Networks in Classification ......................................................................................113
5.11 Sequence Mining .................................................................................................................................113
5.11.1 Apriori Algorithm and Its Extension to Sequence Mining ..................................................113
5.11.2 The GSP Algorithm .............................................................................................................114
Summary ....................................................................................................................................................115
References ..................................................................................................................................................115
Recommended Reading ............................................................................................................................115
Self Assessment ..........................................................................................................................................116
Chapter VI .................................................................................................................................................118
Application and Trends of Data Mining .................................................................................................118
Aim .............................................................................................................................................................118
Objectives ...................................................................................................................................................118
Learning outcome .......................................................................................................................................118
6.1 Introduction ...........................................................................................................................................119
6.2 Applications of Data Mining .................................................................................................................119
6.2.1 Aggregation and Approximation in Spatial and Multimedia Data Generalisation ...............119
6.2.2 Generalisation of Object Identifiers and Class/Subclass Hierarchies ...................................119
6.2.3 Generalisation of Class Composition Hierarchies ............................................................... 120
6.2.4 Construction and Mining of Object Cubes .......................................................................... 120
6.2.5 Generalisation-Based Mining of Plan Databases by Divide-and-Conquer ......................... 120
6.3 Spatial Data Mining ............................................................................................................................. 120
6.3.1 Spatial Data Cube Construction and Spatial OLAP ............................................................ 121
6.3.2 Mining Spatial Association and Co-location Patterns ......................................................... 121
6.3.3 Mining Raster Databases ..................................................................................................... 121
6.4 Multimedia Data Mining ...................................................................................................................... 121
6.4.1 Multidimensional Analysis of Multimedia Data .................................................................. 122
6.4.2 Classification and Prediction Analysis of Multimedia Data ................................................ 122
6.4.3 Mining Associations in Multimedia Data ............................................................................ 122
6.4.4 Audio and Video Data Mining ............................................................................................. 122
6.5 Text Mining .......................................................................................................................................... 122
6.6 Query Processing Techniques .............................................................................................................. 123
6.6.1 Ways of Dimensionality Reduction for Text ......................................................... 123
6.6.2 Probabilistic Latent Semantic Indexing Schemas ................................................ 123
6.6.3 Mining the World Wide Web ............................................................................................... 124
6.6.4 Challenges ............................................................................................................................ 124
6.7 Data Mining for Healthcare Industry ................................................................................................... 124
6.8 Data Mining for Finance ...................................................................................................................... 124
6.9 Data Mining for Retail Industry ........................................................................................................... 124
6.10 Data Mining for Telecommunication ................................................................................................. 124
6.11 Data Mining for Higher Education .................................................................................................... 125
6.12 Trends in Data Mining ....................................................................................................................... 125
6.12.1 Application Exploration .................................................................................................... 125
6.12.2 Scalable Data Mining Methods ......................................................................................... 125
6.12.3 Combination of Data Mining with Database Systems, Data Warehouse Systems, and Web Database Systems ............................................................................................. 125
6.12.4 Standardisation of Data Mining Language ........................................................................ 125
6.12.5 Visual Data Mining ........................................................................................................... 126
6.12.6 New Methods for Mining Complex Types of Data .......................................................... 126
6.12.7 Web Mining ....................................................................................................................... 126
6.13 System Products and Research Prototypes ........................................................................................ 126
6.13.1 Choosing a Data Mining System ....................................................................................... 126
6.14 Additional Themes on Data Mining ................................................................................................... 128
6.14.1 Theoretical Foundations of Data Mining ........................................................................... 128
6.14.2 Statistical Data Mining ....................................................................................... 129
6.14.3 Visual and Audio Data Mining .......................................................................................... 130
6.14.4 Data Mining and Collaborative Filtering ........................................................................... 130
Summary ................................................................................................................................................... 132
References ................................................................................................................................................. 132
Recommended Reading ........................................................................................................................... 132
Self Assessment ......................................................................................................................................... 133
Chapter VII .............................................................................................................................................. 135
Implementation and Maintenance.......................................................................................................... 135
Aim ............................................................................................................................................................ 135
Objectives .................................................................................................................................................. 135
Learning outcome ...................................................................................................................................... 135
7.1 Introduction .......................................................................................................................................... 136
7.2 Physical Design Steps .......................................................................................................................... 136
7.2.1 Develop Standards ............................................................................................................... 136
7.2.2 Create Aggregates Plan ........................................................................................................ 137
7.2.3 Determine the Data Partitioning Scheme ............................................................................. 137
7.2.4 Establish Clustering Options ............................................................................................... 137
7.2.5 Prepare an Indexing Strategy ............................................................................................... 138
7.2.6 Assign Storage Structures .................................................................................................... 138
7.2.7 Complete Physical Model .................................................................................................... 138
7.3 Physical Storage ................................................................................................................................... 138
7.3.1 Storage Area Data Structures ............................................................................................... 138
7.3.2 Optimising Storage .............................................................................................................. 139
7.3.3 Using RAID Technology ..................................................................................................... 140
7.4 Indexing the Data Warehouse .............................................................................................................. 142
7.4.1 B-Tree Index ........................................................................................................................ 142
7.4.2 Bitmapped Index .................................................................................................................. 143
7.4.3 Clustered Indexes ................................................................................................................. 143
7.4.4 Indexing the Fact Table ........................................................................................................ 144
7.4.5 Indexing the Dimension Tables ........................................................................................... 144
7.5 Performance Enhancement Techniques ............................................................................................... 144
7.5.1 Data Partitioning .................................................................................................................. 144
7.5.2 Data Clustering .................................................................................................................... 145
7.5.3 Parallel Processing ............................................................................................................... 145
7.5.4 Summary Levels .................................................................................................................. 145
7.5.5 Referential Integrity Checks ................................................................................................ 146
7.5.6 Initialisation Parameters ...................................................................................................... 146
7.5.7 Data Arrays .......................................................................................................................... 146
7.6 Data Warehouse Deployment ............................................................................................................... 146
7.6.1 Data warehouse Deployment Lifecycle ............................................................................... 147
7.7 Growth and Maintenance ..................................................................................................................... 148
7.7.1 Monitoring the Data Warehouse .......................................................................................... 148
7.7.2 Collection of Statistics ......................................................................................................... 149
7.7.3 Using Statistics for Growth Planning .................................................................................. 150
7.7.4 Using Statistics for Fine-Tuning .......................................................................................... 150
7.7.5 Publishing Trends for Users ................................................................................................. 151
7.8 Managing the Data Warehouse ............................................................................................................ 151
7.8.1 Platform Upgrades ............................................................................................................... 152
7.8.2 Managing Data Growth ....................................................................................................... 152
7.8.3 Storage Management ........................................................................................................... 152
7.8.4 ETL Management ................................................................................................................ 153
7.8.5 Information Delivery Enhancements ................................................................................... 153
7.8.6 Ongoing Fine-Tuning ........................................................................................................... 153
7.9 Models of Data Mining ........................................................................................................................ 154
Summary ................................................................................................................................................... 156
References ................................................................................................................................................. 156
Recommended Reading ........................................................................................................................... 156
Self Assessment ......................................................................................................................................... 157
List of Figures
Fig. 1.1 Data warehousing ............................................................................................................................. 3
Fig. 1.2 Steps in data warehouse iteration project planning stage ................................................................. 7
Fig. 1.3 Means of identifying required information ...................................................................................... 8
Fig. 1.4 Typical data warehousing environment .......................................................................................... 10
Fig. 1.5 Overview of data warehouse infrastructure .................................................................................... 12
Fig. 1.6 Data warehouse metadata ............................................................................................................... 13
Fig. 1.7 Importance of mapping between two environments ....................................................................... 14
Fig. 1.8 Simplest component of metadata .................................................................................................... 14
Fig. 1.9 Storing mapping information in the data warehouse ...................................................................... 15
Fig. 1.10 Keeping track of when extracts have been run ............................................................................. 15
Fig. 1.11 Other useful metadata ................................................................................................................... 16
Fig. 2.1 Data design ..................................................................................................................................... 21
Fig. 2.2 E-R modelling for OLTP systems ................................................................................................... 22
Fig. 2.3 Dimensional modelling for data warehousing ................................................................................ 22
Fig. 2.4 Simple STAR schema for orders analysis ...................................................................................... 23
Fig. 2.5 Understanding a query from the STAR schema ............................................................................. 24
Fig. 2.6 STAR schema keys ......................................................................................................................... 25
Fig. 2.7 Source identification process .......................................................................................................... 27
Fig. 2.8 Data in operational systems ............................................................................................................ 28
Fig. 2.9 Immediate data extraction options .................................................................................................. 30
Fig. 2.10 Data extraction using replication technology ............................................................................... 31
Fig. 2.11 Deferred data extraction ............................................................................................................... 32
Fig. 2.12 Typical data source environment .................................................................................................. 36
Fig. 2.13 Enterprise plan-execute-assess closed loop .................................................................................. 41
Fig. 3.1 Data mining is the core of knowledge discovery process .............................................................. 47
Fig. 3.2 Steps for data mining projects ........................................................................................................ 51
Fig. 3.3 Six-sigma methodology .................................................................................................................. 51
Fig. 3.4 SEMMA .......................................................................................................................................... 51
Fig. 3.5 CRISP–DM is an iterative, adaptive process .................................................................................. 53
Fig. 3.6 Data mining techniques .................................................................................................................. 55
Fig. 3.7 Methods of mining frequent subgraphs .......................................................................................... 56
Fig. 3.8 Heavy-tailed out-degree and in-degree distributions ...................................................................... 57
Fig. 3.9 A financial multirelational schema ................................................................................................. 60
Fig. 3.10 Basic sequential covering algorithm ............................................................................................. 65
Fig. 3.11 A general-to-specific search through rule space ........................................................................... 66
Fig. 3.12 A multilayer feed-forward neural network ................................................................................... 67
Fig. 3.13 A hierarchical structure for STING clustering .............................................................................. 75
Fig. 3.14 EM algorithm ................................................................................................................................ 76
Fig. 3.15 Tabular representation of association ........................................................................................... 78
Fig. 3.16 Association Rules Networks, 3D .................................................................................................. 79
Fig. 4.1 Knowledge base .............................................................................................................................. 85
Fig. 4.2 Sequential structure of KDP model ................................................................................................ 88
Fig. 4.3 Relative effort spent on specific steps of the KD process .............................................................. 89
Fig. 4.4 Web mining architecture ................................................................................................................. 90
Fig. 5.1 Flow diagram of remote sensing image classification with inductive learning ........................... 103
Fig. 5.2 Three numerical characteristics .................................................................................................... 104
Fig. 5.3 Using spatial information for estimation from a sample .............................................................. 106
Fig. 5.4 Different layers of user query processing ..................................................................................... 109
Fig. 5.5 A Markov diagram that describes the probability of program enrolment changes ........................111
Fig. 6.1 Spatial mining ............................................................................................................................... 120
Fig. 6.2 Text mining ................................................................................................................................... 123
Fig. 7.1 Physical design process ................................................................................................................ 136
Fig. 7.2 Data structures in the warehouse .................................................................................................. 139
Fig. 7.3 RAID technology .......................................................................................................................... 141
Fig. 7.4 B-Tree index example ................................................................................................................... 143
Fig. 7.5 Data warehousing monitoring ...................................................................................................... 149
Fig. 7.6 Statistics for the users ................................................................................................................... 151
List of Tables
Table 1.1 Example of source data .................................................................................................................. 4
Table 1.2 Example of target data (Data Warehouse) ...................................................................................... 4
Table 1.3 Data warehousing elements............................................................................................................ 6
Table 2.1 Basic tasks in data transformation................................................................................................ 34
Table 2.2 Data transformation types ............................................................................................................ 36
Table 2.3 Characteristics or indicators of high-quality data ........................................................................ 40
Table 2.4 General areas where data warehouse can assist in the planning and assessment phases ............. 42
Table 3.1 The six phases of CRISP-DM ...................................................................................................... 55
Table 3.2 Requirements of clustering in data mining .................................................................................. 71
Table 5.1 Spatial data mining and knowledge discovery in various viewpoints ....................................... 100
Table 5.2 Main spatial knowledge to be discovered .................................................................................. 101
Table 5.3 Techniques to be used in SDMKD ............................................................................................. 102
Table 5.4 Terms related to temporal data ................................................................................................... 108
Abbreviations
ANN - Artificial Neural Network
ANOVA - Analysis Of Variance
ARIMA - AutoRegressive Integrated Moving Average
ASCII - American Standard Code for Information Interchange
ATM - Automated Teller Machine
BQA - Business Question Assessment
C&RT - Classification and Regression Trees
CA - California
CGI - Common Gateway Interface
CHAID - CHi-squared Automatic Interaction Detector
CPU - Central Processing Unit
CRISP-DM - Cross Industry Standard Process for Data Mining
DB - Database
DBMS - Database Management System
DBSCAN - Density-Based Spatial Clustering of Applications with Noise
DDL - Data Definition Language
DM - Data Mining
DMKD - Data Mining and Knowledge Discovery
DNA - DeoxyriboNucleic Acid
DSS - Decision Support System
DW - Data Warehouse
EBCDIC - Extended Binary Coded Decimal Interchange Code
EIS - Executive Information System
EM - Expectation-Maximisation
En - Entropy
ETL - Extraction, Transformation and Loading
Ex - Expected value
GPS - Global Positioning System
GUI - Graphical User Interface
HIV - Human Immunodeficiency Virus
HTML - Hypertext Markup Language
IR - Information retrieval
IRC - Internet Relay Chat
JAD - Joint Application Development
KDD - Knowledge Discovery and Data Mining
KDP - Knowledge Discovery Processes
KPI - Key Performance Indicator
MB - Megabytes
MBR - Memory-Based Reasoning
MRDM - Multirelational data mining
NY - New York
ODBC - Open Database Connectivity
OLAP - Online Analytical Processing
OLTP - Online Transaction Processing
OPTICS - Ordering Points to Identify the Clustering Structure
PC - Personal Computer
RAID - Redundant Array of Inexpensive Disks
RBF - Radial-Basis Function
RDBMS - Relational Database Management System
SAS - Statistical Analysis System
SDMKD - Spatial Data Mining and Knowledge Discovery
SEMMA - Sample, Explore, Modify, Model, Assess
SOLAM - Spatial Online Analytical Mining
STING - Statistical Information Grid
TM - Temporal Mediator
UAT - User Acceptance Testing
VLSI - Very Large Scale Integration
WWW - World Wide Web
XML - Extensible Markup Language
Chapter I
Data Warehouse – Need, Planning and Architecture
Aim
The aim of this chapter is to:
• introduce the concept of data warehousing
• analyse the need for data warehousing
• explore the basic elements of data warehousing
Objectives
The objectives of this chapter are to:
• discuss planning and requirements for successful data warehousing
• highlight the architecture and infrastructure of data warehousing
• describe the operational applications in data warehousing
Learning outcome
At the end of this chapter, you will be able to:
• list the components of metadata
• comprehend the concept of metadata
• understand what a data mart is
1.1 Introduction
Data warehousing combines data from various, usually diverse, sources into one comprehensive and easily queried database. Common ways of accessing a data warehouse include queries, analysis and reporting. As data warehousing creates a single database at the end, the number of sources can be anything, provided that the system can handle the volume. The final result, however, is uniform data, which can be more easily manipulated.
Definition:
Although there are several definitions of data warehouse, a widely accepted definition by Inmon (1992) is an integrated, subject-oriented and time-variant repository of information in support of management's decision-making process. According to Kimball, a data warehouse is "a copy of transaction data specifically structured for query and analysis". It is a copy of sets of transactional data, which can come from a range of transactional systems.
• Data warehousing is commonly used by companies to study trends over time. However, its primary function is facilitating strategic planning resulting from long-term data overviews. From such overviews, forecasts, business models and similar analytical tools, reports and projections can be made.
• Normally, as the data stored in data warehouses is intended to provide more overview-like reporting, the data is read-only. It is refreshed through periodic, scheduled loads rather than updated directly by end users.
• Besides being a storehouse for a large amount of data, a data warehouse must have systems in place that make it easy to access the data and use it for day-to-day operations.
• A data warehouse is sometimes said to be a major role player in a decision support system (DSS). DSS is a technique used by organisations to come up with facts, trends or relationships that can assist them in making effective decisions or creating effective strategies to accomplish their organisational goals.
• Data warehouses involve a long-term effort and are usually built in an incremental fashion. In addition to adding new subject areas, at each cycle the breadth of data content of existing subject areas is usually increased as users expand their analysis and their underlying data requirements.
• Users and applications can directly use the data warehouse to perform their analysis. Alternatively, a subset of the data warehouse data, often relating to a specific line of business and/or a specific functional area, can be exported to another, smaller data warehouse, commonly referred to as a data mart.
• Besides integrating and cleansing an organisation's data for better analysis, one of the benefits of building a data warehouse is that the effort initially spent to populate it with complete and accurate data content further benefits any data marts that are sourced from the data warehouse.
[Figure content: operational and external data sources (VAX RMS, VAX database, PC data, hardcopy) are accessed, then transformed and distributed into the data warehouse; metadata and information objects help users find and understand the data, supported by data and process flow management and automation.]
Fig. 1.1 Data warehousing
(Source: http://dheise.andrews.edu/dw/Avondale/images/fg06.gif)
The benefits of implementing a data warehouse are as follows:
• This may appear rather obvious, but it is not uncommon in an enterprise for two database systems to have two different versions of the truth. It is very rare to find a university in which everyone agrees with the financial figures of income and expenditure at each reporting time during the year.
• To speed up ad hoc reports and queries involving aggregations across many attributes, which are resource intensive. Managers require trends, sums and aggregations that allow, for example, comparing this year's performance to last year's or preparing forecasts for next year.
• To provide a system in which managers who do not have a strong technical background are able to run complex queries. If the managers are able to access the information they require, it is likely to reduce the bureaucracy around the managers.
• To provide a database that stores relatively clean data. By using a good ETL process, the data warehouse should have data of high quality. When errors are discovered, it may be desirable to correct them directly in the data warehouse and then propagate the corrections to the OLTP systems.
• To provide a database that stores historical data that may have been deleted from the OLTP systems. To improve response time, historical data is usually not retained in OLTP systems other than that which is required to respond to customer queries. The data warehouse can then store the data that is purged from the OLTP systems.
Example:
In order to store data, many application designers in every branch have made their individual decisions as to how
an application and database should be built. Thus, source systems will be different in naming conventions, variable
measurements, encoding structures and physical attributes of data.
Consider an institution that has got several branches in various countries having hundreds of students. The following
example explains how the data is integrated from source systems to target systems.
System Name      Attribute Name             Column Name                 Datatype      Values
Source system 1  Student Application Date   STUDENT_APPLICATION_DATE    NUMERIC(8,0)  11012005
Source system 2  Student Application Date   STUDN_APPLICATION_DATE      DATE          11012005
Source system 3  Application Date           APPLICATION_DATE            DATE          01NOV2005

Table 1.1 Example of source data
In the above example, the attribute name, column name, datatype and values are totally different from one source
system to another; this inconsistency in data can be avoided by integrating the data into a data warehouse with good
standards.
System Name  Attribute Name             Column Name                 Datatype  Values
Record # 1   Student Application Date   STUDENT_APPLICATION_DATE    DATE      01112005
Record # 2   Student Application Date   STUDENT_APPLICATION_DATE    DATE      01112005
Record # 3   Student Application Date   STUDENT_APPLICATION_DATE    DATE      01112005

Table 1.2 Example of target data (Data Warehouse)
In the above example of target data, attribute names, column names and datatypes are consistent throughout the target
system. This is how data from various source systems is integrated and accurately stored into the data warehouse.
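One way to picture this integration step is a small transformation routine that maps each source system's column name and date representation onto the single target convention of Table 1.2. The record layouts and parsing rules below are illustrative assumptions, not code from the book; they merely sketch how the three source formats could be standardised:

```python
from datetime import datetime

# Hypothetical source records mirroring Table 1.1: each system uses its own
# column name and date representation for the same attribute.
source_records = [
    {"system": "Source system 1", "STUDENT_APPLICATION_DATE": 11012005},  # NUMERIC(8,0), MMDDYYYY
    {"system": "Source system 2", "STUDN_APPLICATION_DATE": "11012005"},  # MMDDYYYY string
    {"system": "Source system 3", "APPLICATION_DATE": "01NOV2005"},       # DDMONYYYY string
]

# Map each known source column name to a parser returning a datetime.date.
PARSERS = {
    "STUDENT_APPLICATION_DATE": lambda v: datetime.strptime(str(v), "%m%d%Y").date(),
    "STUDN_APPLICATION_DATE":   lambda v: datetime.strptime(v, "%m%d%Y").date(),
    "APPLICATION_DATE":         lambda v: datetime.strptime(v, "%d%b%Y").date(),
}

def to_target(record):
    """Produce a target record with one standard column name and DDMMYYYY format."""
    for column, parse in PARSERS.items():
        if column in record:
            return {"STUDENT_APPLICATION_DATE": parse(record[column]).strftime("%d%m%Y")}
    raise ValueError("no known application-date column in record")

target = [to_target(r) for r in source_records]
# All three records now share one column name, type and value: 01112005.
```

After the transformation, every record carries the same column name and the same date value, which is exactly the consistency Table 1.2 illustrates.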
1.2 Need for Data Warehousing
Data warehousing is an important part, and in most cases the foundation, of business intelligence architecture. However, it is necessary to understand why a data warehouse is needed. The following points will help in understanding the need for data warehousing.
Data Integration
A data warehouse helps in combining scattered and unmanageable data into a particular format which can be easily accessed. If the required data is difficult to obtain or inconsistent, it may lead to inaccuracy in business decisions. By arranging the data properly in a particular format, it is easy to analyse it across all products by location, time and channel.
The IT staff provides the reports required from time to time through a series of manual and automated steps of
stripping or extracting the data from one source, sorting / merging with data from other sources, manually scrubbing
and enriching the data and then running reports against it.
Data warehouse serves not only as a repository for historical data but also as an excellent data integration platform.
The data in the data warehouse is integrated, subject oriented, time-variant and non-volatile to enable you to get a
360° view of your organisation.
Advanced reporting and analysis
The data warehouse is designed specifically to support querying, reporting and analysis tasks. The data model is flattened and structured by subject areas to make it easier for users to get even complex summarised information with a relatively simple query and perform multi-dimensional analysis. This has two powerful benefits – multilevel trend analysis and end-user empowerment. Multi-level trend analysis provides the ability to analyse key trends at every level across several different dimensions, for instance, Organisation, Product, Location, Channel and Time, and hierarchies within them. Most reporting, data analysis and visualisation tools take advantage of the underlying data model to provide powerful capabilities such as drill-down, roll-up, drill-across and various ways of slicing and dicing data.
The flattened data model makes it much easier for users to understand the data and write queries rather than work with potentially several hundreds of tables and write long queries with complex table joins and clauses.
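The simplicity of querying a dimensional model can be sketched with a toy star schema: one fact table joined to a couple of small dimension tables, rolled up with a single GROUP BY. The schema, table names and data below are invented for illustration, not taken from the book:

```python
import sqlite3

# A minimal, hypothetical star schema: one fact table keyed to two small
# dimension tables, held in an in-memory SQLite database.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_time    (time_key INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales  (product_key INTEGER, time_key INTEGER, amount REAL);

INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Media');
INSERT INTO dim_time    VALUES (10, 2012), (11, 2013);
INSERT INTO fact_sales  VALUES (1, 10, 100.0), (1, 11, 150.0),
                               (2, 10, 80.0),  (2, 11, 120.0);
""")

# A multilevel trend query stays a simple join plus GROUP BY: sales rolled
# up by category and year, with no long chain of joins across dozens of
# normalised tables.
rows = con.execute("""
    SELECT p.category, t.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_time    t ON t.time_key    = f.time_key
    GROUP BY p.category, t.year
    ORDER BY p.category, t.year
""").fetchall()
# rows -> [('Books', 2012, 100.0), ('Books', 2013, 150.0),
#          ('Media', 2012, 80.0),  ('Media', 2013, 120.0)]
```

Adding another dimension (Location, Channel) would only add one join and one GROUP BY column, which is why end users can drill down and roll up without writing elaborate SQL.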
Knowledge discovery and decision support
Knowledge discovery and data mining (KDD) is the automatic extraction of non-obvious, hidden knowledge from large volumes of data. For example, classification models could be used to classify members into low, medium and high lifetime value. Instead of coming up with a one-size-fits-all product, the membership can be divided into different clusters based on member profile using clustering models, and products could be customised for each cluster. Affinity groupings could be used to identify better product bundling strategies.
These KDD applications use various statistical and data mining techniques and rely on subject-oriented, summarised, cleansed and "de-noised" data, which a well-designed data warehouse can readily provide. The data warehouse also enables an Executive Information System (EIS). Executives typically cannot be expected to go through different reports trying to get a holistic picture of the organisation's performance and make decisions. They need the KPIs delivered to them. Some of these KPIs may require cross-product or cross-departmental analysis, which may be too manually intensive, if not impossible, to perform on raw data from operational systems. This is especially relevant to relationship marketing and profitability analysis. The data in the data warehouse is already prepared and structured to support this kind of analysis.
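The clustering idea mentioned above can be sketched with a tiny one-dimensional k-means over member spending figures. The member values, initial centroids and the whole routine are illustrative assumptions; real segmentation would use richer member profiles and a proper library:

```python
def kmeans_1d(values, centroids, iters=20):
    """Tiny 1-D k-means sketch: repeatedly assign each value to its nearest
    centroid, then move each centroid to the mean of its assigned values."""
    for _ in range(iters):
        clusters = {c: [] for c in range(len(centroids))}
        for v in values:
            nearest = min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Keep a centroid in place if its cluster happens to be empty.
        centroids = [sum(m) / len(m) if m else centroids[c]
                     for c, m in clusters.items()]
    return centroids, clusters

# Hypothetical annual spend per member; three seeds for low/medium/high groups.
spend = [200, 300, 2400, 2600, 9000, 9500]
centroids, clusters = kmeans_1d(spend, [0.0, 2000.0, 10000.0])
# centroids -> [250.0, 2500.0, 9250.0]; clusters 0, 1, 2 correspond to
# low-, medium- and high-value member segments.
```

Products could then be customised per segment, as the text describes, instead of offering one product to all members.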
Performance
Finally, the performance of transactional systems and query response time make the case for a data warehouse. The transactional systems are meant to do just that – perform transactions efficiently – and hence are designed to optimise frequent database reads and writes. The data warehouse, on the other hand, is designed to optimise complex querying and analysis. Some of the ad hoc queries and interactive analysis, which could be performed in a few seconds to minutes on a data warehouse, could take a heavy toll on the transactional systems and literally drag their performance down. Holding historical data in transactional systems for longer periods of time could also interfere with their performance. Hence, the historical data needs to find its place in the data warehouse.
1.3 Basic Elements of Data Warehousing
Basic elements of data warehousing are explained below.
Source System
An operational system of record whose function is to capture the transactions of the business. A source system is often called a "legacy system" in the mainframe environment. The main priorities of the source system are uptime and availability. Queries against source systems are narrow, "account-based" queries that are part of the normal transaction flow and severely restricted in their demands on the legacy system.

Staging Area
A storage area and set of processes that clean, transform, combine, de-duplicate, household, archive and prepare source data for use in the presentation server. In many cases, the primary objects in this area are a set of flat-file tables representing extracted (from the source systems) data, loading and transformation routines, and a resulting set of tables containing clean data – the Dynamic Data Store. This area does not usually provide query and presentation services.

Presentation Area
The presentation areas are the target physical machines on which the data warehouse data is organised and stored for direct querying by end users, report writers and other applications. The set of presentable data, or Analytical Data Store, normally takes the form of dimensionally modelled tables when stored in a relational database, and cube files when stored in an OLAP database.

End User Data Access Tools
End user data access tools are any clients of the data warehouse. An end user access tool can be as simple as an ad hoc query tool, or as complex as a sophisticated data mining or modelling application.

Metadata
All of the information in the data warehouse environment that is not the actual data itself. This data about data is catalogued, versioned, documented and backed up.

Table 1.3 Data warehousing elements
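The cleansing and de-duplication work done in the staging area can be sketched as a small routine over extracted flat-file rows. The field names, sample rows and cleansing rules below are illustrative assumptions, not a description of any particular product:

```python
# Hypothetical rows extracted from source-system flat files, with the usual
# staging-area defects: stray whitespace, inconsistent capitalisation and a
# duplicate business key.
raw_rows = [
    {"student_id": "S001", "name": " Alice Smith ", "country": "india"},
    {"student_id": "S001", "name": "Alice Smith",   "country": "India"},   # duplicate
    {"student_id": "S002", "name": "Bob Jones",     "country": "AUSTRALIA"},
]

def cleanse(row):
    """Trim whitespace, collapse internal spaces and standardise capitalisation."""
    return {
        "student_id": row["student_id"].strip(),
        "name": " ".join(row["name"].split()),
        "country": row["country"].strip().title(),
    }

def deduplicate(rows, key="student_id"):
    """Keep the first cleansed row seen for each business key."""
    seen, out = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            out.append(row)
    return out

clean = deduplicate([cleanse(r) for r in raw_rows])
# clean -> two rows, one per student, with consistent formatting,
# ready to load into the presentation area.
```

Only after such preparation does the data move on to the presentation area for querying, which is why the staging area itself does not serve queries.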
Planning and requirements
For successful data warehousing, proper planning and management are necessary, and all the necessary requirements must be fulfilled. Bad planning and improper project management practice are the main factors behind failures in data warehouse projects. First of all, make sure that your company really needs a data warehouse for its business support. Then prepare criteria for assessing the value expected from the data warehouse. Decide on the software for this project and identify the sources from which the data warehouse will collect its data. You also need to make rules on who will be using the data and who will operate the new systems.
1.4 Project Planning and Management
Data warehouses come in all shapes and sizes, which bears a direct relationship to the cost and time involved. The approach to starting a data warehousing project will differ accordingly, and the points listed below summarise some of the things to consider.
Get professional advice
Getting professional advice upfront will save a huge bundle. Endless meeting time can be saved, and the risk of an abandoned data warehousing project can be reduced.
Plan the data
Understand what metrics you want to measure in the data warehouse and make sure that there is appropriate data to support the analysis. If you want to obtain periodic Key Performance Indicator (KPI) data for shipping logistics, ensure that the appropriate data is piped into the data warehouse.
Who will use the Data Warehouse
The power consumers of a data warehouse are business and financial managers. Data warehouses are meant to deliver clear indications of how the business is performing. Plot out the expected users of the data warehouse in the enterprise. See to it that they will have the appropriate reports in a format which is quickly understandable. Ensure that planning exercises are conducted in advance to accumulate scenarios on how the data warehouse will be used. Always keep in mind that data has to be presented attractively, in a format that business managers will feel comfortable with. Text files with lines of numbers will not suffice!
Integration to external applications
Most data warehousing projects sink or swim by their ability to extract data from external applications. Enterprises have a slew of applications, either developed in-house or obtained from a vendor. Conceptually, your data warehouse will act as the heart of the diverse applications running in the enterprise: all important data will flow in or out of the data warehouse.
Technology
The data warehouse will be built on one of the major relational Database Management System (DBMS) products from vendors such as Oracle, IBM and Microsoft. Open source databases, such as MySQL, can also support data warehousing with the right support in place.
The data warehouse is implemented (populated) one subject area at a time, driven by specific business questions to be answered by each implementation cycle. The first and subsequent implementation cycles of the data warehouse are determined during the Business Question Assessment (BQA) stage, which may have been conducted as a separate project. At this stage in the data warehouse process, or at the start of this development/implementation project, the first (or next, if not first) subject area implementation project is planned.
The business requirements discovered in BQA, or an equivalent requirements gathering project, and, to a lesser extent, the technical requirements of the Architecture Review and Design stage (or project) are now refined through user interviews and focus sessions. The requirements should be refined to the subject area level and further analysed to yield the detail needed to design and implement a single population project, whether initial or follow-on. The data warehouse project team is expanded to include the members needed to construct and deploy the warehouse, and a detailed work plan for the design and implementation of the iteration project is developed and presented to the customer organisation for approval.
The following diagram illustrates the sequence in which steps in the data warehouse iteration project planning stage
must be conducted.
• Define detailed business and technical requirements
• Refine data model and source data inventory
• Plan iteration development project
• Obtain iteration project approval and funding
Fig. 1.2 Steps in data warehouse iteration project planning stage
(Source: http://www.gantthead.com/content/processes/9265.cfm#Description)
Collecting the requirements
The informational requirements of the organisation need to be collected within a time box. The following figure shows the typical means by which those informational requirements are identified and collected.
Fig. 1.3 Means of identifying required information
(Source: http://inmoncif.com/inmoncif-old/www/library/whiteprs/ttbuild.pdf)
Typically, informational requirements are collected by looking at:

Reports
Existing reports can usually be gathered quickly and inexpensively. In most cases, the information displayed on these reports is easily discerned. However, old reports represent yesterday's requirements, and the underlying calculation of the information may not be obvious at all.
Spreadsheets
Spreadsheets can be gathered easily by asking the DSS analyst community. As with standard reports, the information on spreadsheets can be discerned easily. The problems with spreadsheets are:
• They are very fluid; for example, important spreadsheets may have been created several months ago that are not available now.
• They change with no documentation.
• They may not be easy to gather unless the analyst creating them wants them to be gathered.
• Their structure and usage of data may be obtuse.
Other existing analysis
Through EIS and other channels, there is usually quite a bit of other useful information analysis that has been created by the organisation. This information is usually unstructured and very informal (although in many cases it is still valuable).
Live interviews
Typically, through interviews or JAD sessions, end users can describe the informational needs of the organisation. Unfortunately, JAD sessions require an enormous amount of energy to conduct and assimilate. Furthermore, the effectiveness of JAD sessions depends in no small part on the imagination and spontaneity of the end users participating in the session.
In any case, gathering the obvious and easily accessed informational needs of the organisation should be done, and the results should be factored into the data warehouse data model prior to the development of the first iteration of the data warehouse.
1.5 Architecture and Infrastructure
The architecture is the logical and physical foundation on which the data warehouse will be built. The architecture review and design stage, as the name implies, is both a requirements analysis and a gap analysis activity. It is important to assess what pieces of the architecture already exist in an organisation (and in what form), and what pieces are missing but needed to build the complete data warehouse architecture.
During the Architecture Review and Design stage, the logical data warehouse architecture is developed. The logical architecture is a configuration map of the data stores that make up the warehouse; it includes a central Enterprise Data Store, an optional Operational Data Store, one or more (optional) individual business area Data Marts, and one or more metadata stores. The metadata store(s) hold two different kinds of metadata that catalogue reference information about the primary data.
Once the logical configuration is defined, the Data, Application, Technical and Support Architectures are designed to physically implement it. The requirements of these four architectures are carefully analysed so that the data warehouse can be optimised to serve the users. Gap analysis is conducted to determine which components of each architecture already exist in the organisation and can be reused, and which components must be developed (or purchased) and configured for the data warehouse.
The data architecture organises the sources and stores of business information and defines the quality and management standards for data and metadata.
The application architecture is the software framework that guides the overall implementation of business functionality
within the Warehouse environment; it controls the movement of data from source to user, including the functions
of data extraction, data cleansing, data transformation, data loading, data refresh, and data access (reporting,
querying).
The technical architecture provides the underlying computing infrastructure that enables the data and application
architectures. It includes platform/server, network, communications and connectivity hardware/software/middleware,
DBMS, client/server 2-tier vs. 3-tier approach, and end-user workstation hardware/software. Technical architecture
design must address the requirements of scalability, capacity and volume handling (including sizing and partitioning
of tables), performance, availability, stability, chargeback, and security.
The support architecture includes the software components (for example, tools and structures for backup/recovery, disaster recovery, performance monitoring, reliability/stability compliance reporting, data archiving, and version control/configuration management) and organisational functions necessary to effectively manage the technology
investment.
Architecture review and design applies to the long-term strategy for development and refinement of the overall data warehouse, and is not conducted merely for a single iteration. This stage (or project) develops the blueprint of an encompassing data and technical structure, software application configuration, and organisational support structure for the warehouse. It forms a foundation that drives the iterative Detail Design activities: where Detail Design tells you what to do, Architecture Review and Design tells you what pieces you need in order to do it.
The architecture review and design stage can be conducted as a separate project that runs mostly in parallel with the business question assessment stage, because the technical, data, application and support infrastructure that enables and supports the storage and access of information is generally independent of the business requirements that determine which data is needed to drive the warehouse. However, the data architecture depends on receiving input from certain BQA or alternative business requirements analysis activities (such as data source system identification and data modelling); therefore, the BQA stage or similar business requirements identification activities must conclude before the Architecture stage or project can conclude.

The architecture will be developed based on the organisation's long-term data warehouse strategy, so that each future iteration of the warehouse will be provided for and will fit within the overall data warehouse architecture.
Data warehouses can be architected in many different ways, depending on the specific needs of a business. The model shown below is the "hub-and-spoke" data warehousing architecture that is popular in many organisations.
In short, data is moved from the databases used in operational systems into a data warehouse staging area, then into the data warehouse and finally into a set of conformed data marts. Data is copied from one database to another using a technology called ETL (Extract, Transform, Load).
[Figure: operational applications (customer, sales and products databases) feed the DW staging area via ETL; further ETL steps move data into the data warehouse and then into the data marts.]
Fig. 1.4 Typical data warehousing environment
(Source: http://data-warehouses.net/architecture/operational.html)
The description of the above diagram is as follows:
Operational applications
The principal reason why businesses need to create data warehouses is that their corporate data assets are fragmented across multiple, disparate application systems, running on different technical platforms in different physical locations. This situation does not enable good decision making.
When data redundancy exists in multiple databases, data quality often deteriorates. Poor business intelligence results
in poor strategic and tactical decision making.
Individual business units within an enterprise are designated as “owners” of operational applications and databases.
These “organisational silos” sometimes do not understand the strategic importance of having well integrated, non-
redundant corporate data. Consequently, they frequently purchase or build operational systems that do not integrate
well with existing systems in the business.
Data management issues have deteriorated in recent years as businesses deployed a parallel set of e-business and e-commerce applications that do not integrate with existing "full service" operational applications.
Operational databases are normally “relational” - not “dimensional”. They are designed for operational, data entry
purposes and are not well suited for online queries and analytics.
Due to globalisation, mergers and outsourcing trends, the need to integrate operational data from external organisations
has arisen. The sharing of customer and sales data among business partners can, for example, increase business
intelligence for all business partners.
The challenge for data warehousing is to be able to quickly consolidate, cleanse and integrate data from multiple,
disparate databases that run on different technical platforms in different geographical locations.
Extraction, transformation and loading
ETL technology (shown in Fig. 1.4 with arrows) is an important component of the data warehousing architecture. It is used to copy data from operational applications to the data warehouse staging area, from the DW staging area into the data warehouse and finally from the data warehouse into a set of conformed data marts that are accessible by decision makers.
The ETL software extracts data, transforms values of inconsistent data, cleanses "bad" data, filters data and loads data into a target database. The scheduling of ETL jobs is critical: should there be a failure in one ETL job, the remaining ETL jobs must respond appropriately.
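As an illustrative sketch only (using Python's sqlite3 module, with invented table and column names), a single extract, cleanse and load step might look like this:

```python
import sqlite3

def run_etl(conn):
    """Minimal ETL sketch: extract from an operational table, cleanse and
    transform the values, then load the result into a warehouse table."""
    cur = conn.cursor()
    # Extract: pull raw rows from the (hypothetical) operational source.
    rows = cur.execute("SELECT id, region, amount FROM src_sales").fetchall()
    # Transform/cleanse: standardise region codes and filter out "bad" rows.
    cleaned = [(i, r.strip().upper(), a) for (i, r, a) in rows
               if a is not None and a >= 0]
    # Load: insert the conformed rows into the warehouse target table.
    cur.executemany("INSERT INTO dw_sales VALUES (?, ?, ?)", cleaned)
    conn.commit()
    return len(cleaned)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_sales (id INTEGER, region TEXT, amount REAL);
    CREATE TABLE dw_sales  (id INTEGER, region TEXT, amount REAL);
    INSERT INTO src_sales VALUES (1, ' east ', 100.0),
                                 (2, 'West',  -5.0),   -- bad data, filtered
                                 (3, 'east',  NULL);   -- bad data, filtered
""")
loaded = run_etl(conn)
print(loaded)  # 1 row survives cleansing
```

A real ETL tool would wrap each such job in logging, scheduling and restart logic, so that a failure in one job can be handled before dependent jobs run.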
Data warehousing staging area
The data warehouse staging area is a temporary location where data from source systems is copied. A staging area is mainly required in a data warehousing architecture for timing reasons: all required data must be available before data can be integrated into the data warehouse.
Due to varying business cycles, data processing cycles, hardware and network resource limitations and geographical
factors, it is not feasible to extract all the data from all operational databases at exactly the same time.
For example, it might be reasonable to extract sales data on a daily basis; however, daily extracts might not be suitable for financial data that requires a month-end reconciliation process. Similarly, it might be feasible to extract "customer" data from a database in Singapore at noon eastern standard time, but this would not be feasible for "customer" data in a Chicago database.
Data in the data warehouse can be either persistent (remains around for a long period) or transient (only remains around temporarily).
Not all businesses require a data warehouse staging area. For many businesses, it is feasible to use ETL to copy data directly from operational databases into the data warehouse.
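The timing constraint described above can be sketched as a simple readiness check; the source names and dates below are invented purely for illustration:

```python
import datetime

# Hypothetical staging catalogue: which source extracts have landed, and when.
staging_loads = {
    "sales":    datetime.date(2013, 3, 1),   # daily extract, arrived
    "customer": datetime.date(2013, 3, 1),   # arrived
    "finance":  None,                        # month-end extract not yet run
}

def ready_to_integrate(loads, required):
    """All required data must be present in staging before the warehouse
    load can start; return the sources that are still missing."""
    missing = [src for src in required if loads.get(src) is None]
    return (len(missing) == 0, missing)

ok, missing = ready_to_integrate(staging_loads, ["sales", "customer", "finance"])
print(ok, missing)  # False ['finance']
```

The warehouse load job would be scheduled only once the check passes, which is exactly the timing role the staging area plays.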
Data marts
ETL (Extract Transform Load) jobs extract data from the data warehouse and populate one or more data marts
for use by groups of decision makers in the organisations. The data marts can be dimensional (Star Schemas) or
relational, depending on how the information is to be used and what “front end” data warehousing tools will be
used to present the information.
Each data mart can contain different combinations of tables, columns and rows from the enterprise data
warehouse.
For example, a business unit or user group that does not need much historical data might only need transactions from the current calendar year in the database. The personnel department might need to see all details about employees, whereas data such as "salary" or "home address" might not be appropriate for a data mart that focuses on sales. Some data marts might need to be refreshed from the data warehouse daily, whereas other user groups might need refreshes only monthly.
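As a hedged sketch of this idea in Python's sqlite3 (all table, column and department names here are invented), a sales-focused mart can be derived from the warehouse as a view that keeps only current-year rows and omits sensitive columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dw_employee (emp_id INTEGER, name TEXT, dept TEXT,
                              salary REAL, home_address TEXT, hire_year INTEGER);
    INSERT INTO dw_employee VALUES
        (1, 'Ann', 'Sales', 50000, '12 Elm St', 2011),
        (2, 'Bob', 'Sales', 60000, '9 Oak Ave', 2013),
        (3, 'Cam', 'HR',    55000, '3 Pine Rd', 2013);
""")
# Sales-focused mart: only current-year Sales rows, and no sensitive
# columns such as salary or home address.
conn.execute("""
    CREATE VIEW mart_sales_staff AS
    SELECT emp_id, name, dept FROM dw_employee
    WHERE dept = 'Sales' AND hire_year = 2013
""")
rows = conn.execute("SELECT name FROM mart_sales_staff").fetchall()
print(rows)  # [('Bob',)]
```

In practice each mart would be populated by its own ETL job rather than a view, but the subsetting of tables, columns and rows is the same idea.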
1.5.1 Infrastructure
A data warehouse is a 'business infrastructure'. In practical terms, it does not do anything on its own, but provides sanitised, consistent and integrated information for a host of applications and end-user tools. Therefore, the stability, availability and response time of this platform are critical. Just like a foundation pillar, its strength is core to your information management success.
Various factors to be considered for data warehouse infrastructure are as follows:
Data Warehouse Data Size
Data warehouses grow fast in terms of size. This is not only the increment to the data under your current design: a data warehouse will have frequent additions of new dimensions, attributes and measures. With each such addition the data volume can take a quantum jump, as you may bring in the entire historical data related to that additional dimensional model element. Therefore, when you estimate your data size, be on the conservative side.
Data Dynamics for Data Warehouse
The volume and frequency of data increments determine the processing speed and memory of the hardware platform. The increment of data is typically on a daily basis; however, the level of increment can differ depending upon which data you are pulling in. Most of the time, you may pull a huge amount of data from the source system into the staging area, but load a much smaller summary data set into the data warehouse.

Number of users of Data Warehouse
The number of users is essentially the number of concurrent logins on the data warehouse platform. Estimating the number of users of a data warehouse has the following complications:
• Sometimes the "user" can be an end-user tool, which may mask the actual number of users. For example, an enterprise reporting server may access the data warehouse as a few users in order to generate all the enterprise reports; after that, the actual users access the database and report repository of the enterprise reporting system and not the data warehouse itself. Similarly, you might be using an analytics system which creates its own local cube from the data warehouse; the actual users may access that cube without logging into the data warehouse. Sometimes users could be referring to a cache of a distributed data warehouse database and not to the main data warehouse.
• There is no fixed formula for calculating (and then linking) the number of users for the purpose of estimating the infrastructure needed. We assume that the data warehouse will be able to support a large number of simultaneous login threads.
[Figure: data flows from OLTP servers (pre-data warehouse) through ETL and data cleaning into the data repositories (ODS, data warehouse and data marts) and on to front-end analytics (OLAP, data mining, data visualisation and reporting), with a metadata repository spanning all stages.]
Fig. 1.5 Overview of data warehouse infrastructure
(Source: http://www.dwreview.com/Articles/Metadata.html)
1.5.2 Metadata
Metadata is data about data. Metadata has been around as long as there have been programs and data for the programs to operate on. The following figure shows metadata in a simple form.
[Figure: a metadata repository in the infrastructure tier, accessed via an application server tier from a browser.]
Fig. 1.6 Data warehouse metadata
(Source: http://t1.gstatic.com)
While metadata is not new, the role of metadata and its importance in the face of the data warehouse certainly is
new. For years, the information technology professional has worked in the same environment as metadata, but in
many ways has paid little attention to metadata. The information professional has spent a life dedicated to process
and functional analysis, user requirements, maintenance, architectures, and the like. The role of metadata has been
passive at best in this situation.
However, metadata plays a very different role in data warehouse. Relegating metadata to a backwater, passive role
in the data warehouse environment is to defeat the purpose of data warehouse. Metadata plays a very active and
important part in the data warehouse environment. The reason why metadata plays such an important and active role
in the data warehouse environment is apparent when contrasting the operational environment to the data warehouse
environment in so far as the user community is concerned.
Mapping
A basic part of the data warehouse environment is the mapping from the operational environment into the data warehouse. The mapping includes a wide variety of facets, including, but not limited to:
• mapping from one attribute to another
• conversions
• changes in naming conventions
• changes in physical characteristics of data
• filtering of data
The following figure shows the storing of the mapping in metadata for the data warehouse.
Fig. 1.7 Importance of mapping between two environments
(Source: http://www.inmoncif.com/registration/whitepapers/ttmeta-1.pdf)
It may not be obvious why mapping information is so important in the data warehouse environment. Consider the vice president of marketing who has just asked for a new report. The DSS analyst turns to the data warehouse for the data for the report. Upon inspection, the vice president proclaims the report to be fiction. The credibility of the DSS analyst suffers until the DSS analyst can prove the data in the report to be valid. The DSS analyst first looks at the validity of the data in the warehouse. If the data warehouse data has not been reported properly, then the reports are adjusted. However, if the reports have been made properly from the data warehouse, the DSS analyst is in the position of having to go back to the operational source to salvage credibility. At this point, if the mapping data has been carefully stored, then the DSS analyst can quickly and gracefully go to the operational source. However, if the mapping has not been stored, or has not been stored properly, then the DSS analyst has a difficult time defending his/her conclusions to management. The metadata store for the data warehouse is a natural place for storing mapping information.
1.5.3 Metadata Components
Basic components
The basic components of the data warehouse metadata store include the tables that are contained in the warehouse, the keys of those tables, and their attributes. The following figure shows these components of the data warehouse.
Fig. 1.8 Simplest component of metadata
(Source: http://www.inmoncif.com/registration/whitepapers/ttmeta-1.pdf)
Mapping
The typical contents of the mapping metadata stored in the data warehouse metadata store are:
• identification of source field(s)
• simple attribute-to-attribute mapping
• attribute conversions
• physical characteristic conversions
• encoding/reference table conversions
• naming changes
• key changes
• defaults
• logic to choose from multiple sources
• algorithmic changes, and so forth
Fig. 1.9 Storing mapping information in the data warehouse
(Source: http://www.inmoncif.com/registration/whitepapers/ttmeta-1.pdf)
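A mapping-metadata entry of the kind listed above might be recorded as a simple structure; the following Python sketch uses invented system and field names purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class FieldMapping:
    """One mapping-metadata entry: how a warehouse attribute was derived
    from its operational source field."""
    source_system: str
    source_field: str
    target_table: str
    target_attribute: str
    conversion: str        # e.g. an encoding/reference-table conversion
    default_value: object = None

# Illustrative entries only; the systems and fields are invented.
mappings = [
    FieldMapping("billing", "CUST_NO", "dw_customer", "customer_id",
                 "strip leading zeros"),
    FieldMapping("crm", "sex_code", "dw_customer", "gender",
                 "decode via reference table: M/F -> male/female",
                 default_value="unknown"),
]

# A DSS analyst defending a report can trace an attribute back to its source:
by_target = {(m.target_table, m.target_attribute): m for m in mappings}
m = by_target[("dw_customer", "gender")]
print(m.source_system, m.source_field)  # crm sex_code
```

In a real metadata store these entries would live in database tables rather than program objects, but the content is the same.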
Extract History
The actual history of extracts and transformations of data coming from the operational environment into the data warehouse environment is another component that belongs in the data warehouse metadata store.
Fig. 1.10 Keeping track of when extracts have been run
(Source: http://www.inmoncif.com/registration/whitepapers/ttmeta-1.pdf)
The extract history simply tells the DSS analyst when data entered the data warehouse. The DSS analyst has many uses for this type of information. One occasion is when the DSS analyst wants to know when the data in the warehouse was last refreshed. Another occasion is when the DSS analyst wants to do "what if" processing and the assertions of the analysis have changed. The DSS analyst needs to know whether the results obtained from one analysis differ from the results obtained by an earlier analysis because of a change in the assertions or a change in the data. There are many cases where the DSS analyst needs the precise history of when insertions have been made to the data warehouse.
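A minimal extract-history store can be sketched with Python's sqlite3 module (the table layout and the values are illustrative assumptions, not a prescribed design):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE extract_history (
        source      TEXT,      -- operational system extracted from
        run_at      TEXT,      -- when the extract ran (ISO timestamp)
        rows_loaded INTEGER)
""")
runs = [("orders", "2013-03-01T02:00", 12000),
        ("orders", "2013-03-02T02:00", 11500),
        ("ledger", "2013-02-28T23:00", 300)]
conn.executemany("INSERT INTO extract_history VALUES (?, ?, ?)", runs)

# The analyst's question: when was order data last refreshed?
last = conn.execute(
    "SELECT MAX(run_at) FROM extract_history WHERE source = 'orders'"
).fetchone()[0]
print(last)  # 2013-03-02T02:00
```

Because ISO timestamps sort lexically, MAX over the text column answers the "last refresh" question directly.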
Miscellaneous
Alias information consists of alternative names for attributes and keys. Alternative names often make a data warehouse environment much more "user friendly". In some cases, technicians have influenced naming conventions in ways that cause data warehouse names to be incomprehensible.
Fig. 1.11 Other useful metadata
(Source: http://www.inmoncif.com/registration/whitepapers/ttmeta-1.pdf)
In other cases, one department's names for data have been entered into the warehouse, and another department would like to have its names for the data imposed. Aliases are a good way to resolve these issues. Another useful data warehouse metadata component is status. In some cases, a data warehouse table is undergoing design; in other cases, the table is inactive or may contain misleading data. The existence of a status field is a good way to record these differences. Volumetrics are measurements about data in the warehouse. Typical volumetric information might include:
• the number of rows currently in the table
• the growth rate of the table
• the statistical profile of the table
• the usage characteristics of the table
• the indexing for the table and its structure
• the byte specifications for the table
Volumetric information is useful for the DSS analyst planning an efficient usage of the data warehouse. It is much more effective to consult volumetrics before submitting a query that will use unknown resources than it is to simply submit the query and hope for the best.
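Consulting volumetrics before querying can be sketched as follows with Python's sqlite3; the table name and the particular figures gathered are illustrative only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_sales (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO dw_sales VALUES (?, ?)",
                 [(i, float(i)) for i in range(1000)])

def volumetrics(conn, table):
    """Gather a few simple volumetric figures for a table so an analyst
    can judge a query's likely cost before submitting it."""
    n = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    cols = len(conn.execute(f"PRAGMA table_info({table})").fetchall())
    return {"rows": n, "columns": cols}

v = volumetrics(conn, "dw_sales")
print(v)  # {'rows': 1000, 'columns': 2}
```

A production metadata store would keep such figures pre-computed (and add growth rates, index details and byte sizes) rather than counting on demand.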
Aging/purge criteria are also an important component of data warehouse metadata. Looking into the metadata store for a definition of the life cycle of data warehouse data is much more efficient than trying to divine the life cycle by examining the data inside the warehouse.
Summary
• Data warehousing is combining data from various and usually diverse sources into one comprehensive and easily operated database.
• Common accessing systems of data warehousing include queries, analysis and reporting.
• Data warehousing is commonly used by companies to analyse trends over time.
• Data warehousing is an important part, and in most cases the foundation, of business intelligence architecture.
• A data warehouse helps in combining scattered and unmanageable data into a particular format, which can be easily accessed.
• The data warehouse is designed specifically to support querying, reporting and analysis tasks.
• Knowledge discovery and data mining (KDD) is the automatic extraction of non-obvious hidden knowledge from large volumes of data.
• For successful data warehousing, proper planning and management are necessary, and all necessary requirements must be fulfilled. Bad planning and improper project management practice are the main factors behind failures in data warehouse project planning.
• Data warehousing comes in all shapes and sizes, which bear a direct relationship to the cost and time involved.
• The architecture is the logical and physical foundation on which the data warehouse will be built.
• The data warehouse staging area is a temporary location where data from source systems is copied.
• Metadata is data about data. Metadata has been around as long as there have been programs and data for the programs to operate on.
• A basic part of the data warehouse environment is the mapping from the operational environment into the data warehouse.
References
• Mailvaganam, H., 2007. Data Warehouse Project Management [Online] Available at: <http://www.dwreview.com/Articles/Project_Management.html>. [Accessed 8 September 2011].
• Hadley, L., 2002. Developing a Data Warehouse Architecture [Online] Available at: <http://www.users.qwest.net/~lauramh/resume/thorn.htm>. [Accessed 8 September 2011].
• Humphries, M., Hawkins, M. W. and Dy, M. C., 1999. Data Warehousing: Architecture and Implementation, Prentice Hall Professional.
• Ponniah, P., 2001. Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals, Wiley-Interscience Publication.
• Kumar, A., 2008. Data Warehouse Layered Architecture 1 [Video Online] Available at: <http://www.youtube.com/watch?v=epNENgd40T4>. [Accessed 11 September 2011].
• Intricity101, 2011. What is OLAP? [Video Online] Available at: <http://www.youtube.com/watch?v=2ryG3Jy6eIY&feature=related>. [Accessed 12 September 2011].
Recommended Reading
• Parida, R., 2006. Principles & Implementation of Data Warehousing, Firewall Media.
• Khan, A., 2003. Data Warehousing 101: Concepts and Implementation, iUniverse.
• Jarke, M., 2003. Fundamentals of Data Warehouses, 2nd ed., Springer.
Self Assessment
1. _____________ is combining data from diverse sources into one comprehensive and easily operated database.
   a. Data warehousing
   b. Data mining
   c. Mapping
   d. Metadata
2. _____________ is commonly used by companies to analyse trends over time.
   a. Data mining
   b. Architecture
   c. Planning
   d. Data warehousing
3. ________________ is a technique used by organisations to come up with facts, trends or relationships that can help them make effective decisions.
   a. Mapping
   b. Operation analysis
   c. Decision support system
   d. Data integration
4. Which statement is false?
   a. Knowledge discovery and data mining (KDD) is the automatic extraction of non-obvious hidden knowledge from large volumes of data.
   b. The data warehouse enables an Executive Information System (EIS).
   c. The data in the data warehouse is already prepared and structured to support this kind of analysis.
   d. The metadata is designed specifically to support querying, reporting and analysis tasks.
5. Bad planning and improper ___________ practice is the main factor for failures in data warehouse project planning.
   a. project management
   b. operation management
   c. business management
   d. marketing management
6. __________ comes in all shapes and sizes, which bear a direct relationship to cost and time involved.
   a. Metadata
   b. Data mining
   c. Mapping
   d. Data warehousing
7. The ________ simply tells the DSS analyst when data entered the data warehouse.
   a. mapping
   b. components
   c. extract history
   d. miscellaneous
8. Which of the following information is useful for the DSS analyst planning an efficient usage of the data warehouse?
   a. Matrices
   b. Volumetric
   c. Algebraic
   d. Statistical
9. Which of the following criteria is an important component of data warehouse metadata?
   a. Aging/purge
   b. Mapping
   c. Data mart
   d. Status
10. _________ in the data warehouse can be either persistent or transient.
   a. Metadata
   b. Mapping
   c. Data
   d. Operational database
Chapter II
Data Design and Data Representation

Aim
The aim of this chapter is to:
• introduce the concept of spatial data mining and knowledge discovery
• analyse different techniques of spatial data mining and knowledge discovery
• explore sequence mining

Objectives
The objectives of this chapter are to:
• explicate the cloud model
• describe temporal mining
• elucidate database mediators

Learning outcome
At the end of this chapter, you will be able to:
• enlist types of temporal data
• comprehend temporal data processing
• understand the classification techniques
2.1 Introduction
Data design consists of putting together the data structures. A group of data elements forms a data structure. Logical data design includes determining the various data elements that are needed and combining the data elements into data structures. Logical data design also includes establishing the relationships among the data structures.
Observe in the following figure how the phases start with requirements gathering. The results of the requirements gathering phase are documented in detail in the requirements definition document. An essential component of this document is the set of information package diagrams. Remember that these are information matrices showing the metrics, business dimensions, and the hierarchies within individual business dimensions. The information package diagrams form the basis for the logical data design for the data warehouse. The data design process results in a dimensional data model.
[Figure: requirements gathering produces a requirements definition document containing information packages; these drive data design, which yields a dimensional model.]
Fig. 2.1 Data design
2.2 Design Decision
Following are the design decisions you have to make:
• Choosing the process: selecting the subjects from the information packages for the first set of logical structures to be designed.
• Choosing the grain: determining the level of detail for the data in the data structures.
• Identifying and conforming the dimensions: choosing the business dimensions (such as product, market, time and so on) to be included in the first set of structures and making sure that each particular data element in every business dimension is conformed to the others.
• Choosing the facts: selecting the metrics or units of measurement (such as product sale units, dollar sales, dollar revenue and so on) to be included in the first set of structures.
• Choosing the duration of the database: determining how far back in time you should go for historical data.
2.3 Use of CASE Tools
Many CASE tools are available for data modelling; you can use them to create the logical schema and
the physical schema for specific target database management systems (DBMSs).
You can use a CASE tool to define the tables, the attributes, and the relationships. You can assign the primary keys
and indicate the foreign keys. You can form the entity-relationship diagrams. All of this is done very easily using
graphical user interfaces and powerful drag-and-drop facilities. After creating an initial model, you may add fields,
delete fields, change field characteristics, create new relationships, and make any number of revisions with utmost
ease.
Another very useful function found in the CASE tools is the ability to forward-engineer the model and generate the
schema for the target database system you need to work with. Forward-engineering is easily done with these CASE
tools.
• OLTP systems capture details of events or transactions
• OLTP systems focus on individual events
• An OLTP system is a window into micro-level transactions
• Picture at detail level necessary to run the business
• Suitable only for questions at transaction level
• Data consistency, non-redundancy, and efficient data storage critical
Entity-Relationship Modelling:
• Removes data redundancy
• Ensures data consistency
• Expresses microscopic relationships
Fig. 2.2 E-R modelling for OLTP systems
• DW meant to answer questions on an overall process
• DW focus is on how managers view the business
• DW reveals business trends
• Information is centred around a business process
• Answers show how the business measures the process
• The measures to be studied in many ways along several business dimensions
Dimensional Modelling:
• Captures critical measures
• Views along dimensions
• Intuitive to business users
Fig. 2.3 Dimensional modelling for data warehousing
For modelling the data warehouse, one needs to know the dimensional modelling technique. Most of the existing
vendors have expanded their modelling case tools to include dimensional modelling. You can create fact tables,
dimension tables, and establish the relationships between each dimension table and the fact table. The result is a
STAR schema for your model. Again, you can forward-engineer the dimensional STAR model into a relational
schema for your chosen database management system.
2.4 Star Schema
Creating the STAR schema is the fundamental data design technique for the data warehouse. It is necessary to gain
a good grasp of this technique.
2.4.1 Review of a Simple STAR Schema
We will take a simple STAR schema designed for order analysis. Assume this to be the schema for a manufacturing
company and that the marketing department is interested in determining how they are doing with the orders received
by the company. The following figure shows this simple STAR schema. It consists of the orders fact table shown in
the middle of the schema diagram. Surrounding the fact table are the four dimension tables of customer, salesperson,
order date, and product. Let us begin to examine this STAR schema. Look at the structure from the point of view
of the marketing department.
[Figure content: four dimension tables — Product (product name, SKU, brand), Customer (customer name, customer code, billing address, shipping address), Order Date (date, month, quarter, year), and Salesperson (salesperson name, territory name, region name) — surround the Order Measures fact table (order dollars, cost, margin dollars, quantity sold).]
Fig. 2.4 Simple STAR schema for orders analysis
The users in this department will analyse the orders using dollar amounts, cost, profit margin, and sold quantity. This
information is found in the fact table of the structure. The users will analyse these measurements by breaking down
the numbers in combinations by customer, salesperson, date, and product. All these dimensions along which the
users will analyse are found in the structure. The STAR schema structure is a structure that can be easily understood
by the users and with which they can comfortably work. The structure mirrors how the users normally view their
critical measures along with their business dimensions.
When you look at the order dollars, the STAR schema structure intuitively answers the questions of what, when, by
whom, and to whom. From the STAR schema, the users can easily visualise the answers to these questions: For a
given amount of dollars, what was the product sold? Who was the customer? Which salesperson brought the order?
When was the order placed?
When a query is made against the data warehouse, the results of the query are produced by combining or joining one
or more dimension tables with the fact table. The joins are between the fact table and individual dimension tables.
The relationship of a particular row in the fact table is with the rows in each dimension table. These individual
relationships are clearly shown as the spikes of the STAR schema.
Take a simple query against the STAR schema. Let us say that the marketing department wants the quantity sold and
order dollars for product bigpart-1, relating to customers in the state of Maine, obtained by salesperson Jane Doe,
during the month of June. The following figure shows how this query is formulated from the STAR schema. Constraints
and filters for queries are easily understood by looking at the STAR schema.
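In SQL terms, such a query joins the fact table to each constrained dimension table. The sketch below uses SQLite with two of the four dimensions; the table and column names are illustrative, not a schema prescribed by the book:

```python
import sqlite3

# An in-memory warehouse with one fact table and two dimension tables
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE salesperson (salesperson_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders_fact (
    product_key INTEGER REFERENCES product,
    salesperson_key INTEGER REFERENCES salesperson,
    order_dollars REAL, quantity_sold INTEGER);
INSERT INTO product VALUES (1, 'bigpart-1'), (2, 'smallpart-9');
INSERT INTO salesperson VALUES (10, 'Jane Doe'), (11, 'John Roe');
INSERT INTO orders_fact VALUES (1, 10, 500.0, 5), (2, 10, 90.0, 2), (1, 11, 40.0, 1);
""")

# The star join: fact table joined to each dimension, constrained on
# dimension attributes rather than on the fact table itself
row = con.execute("""
    SELECT SUM(f.quantity_sold), SUM(f.order_dollars)
    FROM orders_fact f
    JOIN product p ON p.product_key = f.product_key
    JOIN salesperson s ON s.salesperson_key = f.salesperson_key
    WHERE p.product_name = 'bigpart-1' AND s.name = 'Jane Doe'
""").fetchone()
print(row)  # (5, 500.0)
```

The full query from the figure would add the order date and customer dimensions in exactly the same way, one join per spike of the STAR.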
[Figure content: the same STAR schema as Fig. 2.4, with the query constraints highlighted: Product name = bigpart-1, Month = June, State = Maine, Salesperson name = Jane Doe.]
Fig. 2.5 Understanding a query from the STAR schema
2.4.2 Star Schema Keys
The following figure illustrates how the keys are formed for the dimension and fact tables.
[Figure content: the STORE KEY, PRODUCT KEY, and TIME KEY of the store, product, and time dimension tables form the key of the fact table, which carries the measures dollars and units. Fact table: compound primary key, one segment for each dimension. Dimension table: generated primary key.]
Fig. 2.6 STAR schema keys

Primary Keys
Each row in a dimension table is identified by a unique value of an attribute designated as the primary key of the
dimension. In a product dimension table, the primary key identifies each product uniquely. In the customer dimension
table, the customer number identifies each customer uniquely. Similarly, in the sales representative dimension table,
the social security number of the sales representative identifies each sales representative.
Surrogate Keys
There are two general principles to be applied when choosing primary keys for dimension tables. The first principle
is illustrated by the problem caused when a product begins to be stored in a different warehouse. The product key in
such an operational system has built-in meanings: some positions in the key indicate the warehouse and other
positions indicate the product category. The first principle to follow, therefore, is: avoid built-in meanings in the
primary key of the dimension tables.
Similarly, the data of a retired customer may still be used for aggregations and comparisons by city and state.
Therefore, the second principle is: do not use production system keys as primary keys for dimension tables. Surrogate
keys are simply system-generated sequence numbers. They do not have any built-in meanings. Of course, the surrogate
keys will be mapped to the production system keys; nevertheless, they are different. The general practice is to keep
the operational system keys as additional attributes in the dimension tables. Please refer back to Fig. 2.6: the
STORE KEY is the surrogate primary key for the store dimension table. The operational system primary key for
the store reference table may be kept as just another non-key attribute in the store dimension table.
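Both principles can be sketched in a few lines: surrogate keys are plain sequence numbers, the production key (here an invented format with built-in warehouse and category meanings) is kept only as an ordinary attribute:

```python
from itertools import count

# A minimal surrogate-key generator: system-generated sequence numbers with
# no built-in meaning, mapped to the production (operational) keys.
class SurrogateKeyMap:
    def __init__(self):
        self._seq = count(1)   # generated primary keys: 1, 2, 3, ...
        self._map = {}         # production key -> surrogate key

    def key_for(self, production_key):
        # Reuse the same surrogate if this production key was seen before
        if production_key not in self._map:
            self._map[production_key] = next(self._seq)
        return self._map[production_key]

keys = SurrogateKeyMap()
# Hypothetical production key with built-in meanings (warehouse W1, category ELEC)
row = {"store_key": keys.key_for("W1-ELEC-0042"),   # surrogate primary key
       "production_key": "W1-ELEC-0042"}            # kept as a non-key attribute
print(row)  # {'store_key': 1, 'production_key': 'W1-ELEC-0042'}
```

If the product later moves to warehouse W2, only the non-key attribute changes; the surrogate key, and every fact row pointing at it, stays stable.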
Foreign Keys
Each dimension table is in a one-to-many relationship with the central fact table. So the primary key of each
dimension table must be a foreign key in the fact table. If there are four dimension tables of product, date, customer,
and sales representative, then the primary key of each of these four tables must be present in the orders fact table
as foreign keys.
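The key structure of Fig. 2.6 (compound primary key, one segment per dimension, each segment a foreign key) can be sketched with SQLite; the table names here are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
con.executescript("""
CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, store_desc TEXT);
CREATE TABLE time_dim  (time_key  INTEGER PRIMARY KEY, date TEXT);
CREATE TABLE sales_fact (
    store_key INTEGER REFERENCES store_dim (store_key),
    time_key  INTEGER REFERENCES time_dim (time_key),
    dollars   REAL,
    units     INTEGER,
    PRIMARY KEY (store_key, time_key));
INSERT INTO store_dim VALUES (1, 'Downtown');
INSERT INTO time_dim  VALUES (7, '2000-06-01');
""")
con.execute("INSERT INTO sales_fact VALUES (1, 7, 120.5, 3)")  # both parents exist
try:
    # No store with key 99: the foreign key constraint rejects the fact row
    con.execute("INSERT INTO sales_fact VALUES (99, 7, 10.0, 1)")
except sqlite3.IntegrityError:
    print("rejected: no matching dimension row")
```

The one-to-many relationship is thus enforced by the database: a fact row cannot exist without a matching row in every dimension table.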
2.5 Dimensional Modelling
Dimensional modelling gets its name from the business dimensions we need to incorporate into the logical data
model. It is a logical design technique to structure the business dimensions and the metrics that are analysed along
these dimensions. This modelling technique is intuitive for that purpose. The model has also proved to provide high
performance for queries and analysis.
The multidimensional information package diagram we discussed earlier is the foundation for the dimensional model.
Therefore, the dimensional model consists of the specific data structures needed to represent the business dimensions.
These data structures also contain the metrics or facts.
Dimensional modelling is a technique for conceptualising and visualising data models as a set of measures that are
described by common aspects of the business. Dimensional modelling has two basic concepts.
Important facts about dimensional modelling are listed below:
• A fact is a collection of related data items, consisting of measures.
• A fact is a focus of interest for the decision-making process.
• Measures are continuously valued attributes that describe facts.
• A fact is a business measure.
A dimension is a parameter over which the analysis of facts is performed; it is what gives meaning to a measure.
For example, the number of customers is a fact, and we may perform the analysis over the time dimension.
2.5.1 E-R Modelling versus Dimensional Modelling
We are familiar with data modelling for operational or OLTP systems. We adopt the Entity-Relationship (E-R)
modelling technique to create the data models for these systems. We have so far discussed the basics of the dimensional
model and find that this model is most suitable for modelling the data for the data warehouse. It is necessary to
summarise the characteristics of the data warehouse information and review how dimensional modelling is suitable
for this purpose.

2.6 Data Extraction
Two major factors differentiate the data extraction for a new operational system from the data extraction for a data
warehouse. First, for a data warehouse, it is necessary to extract data from many disparate sources. Next, for a data
warehouse, extract data on the changes for ongoing incremental loads as well as for a one-time initial full load. For
operational systems, all that is needed is one-time extractions and data conversions.
These two factors increase the complexity of data extraction for a data warehouse and, therefore, warrant the use
of third-party data extraction tools in addition to in-house programs or scripts. Third-party tools are generally more
expensive than in-house programs, but they record their own metadata. On the other hand, in-house programs
increase the cost of maintenance and are hard to maintain as source systems change.
If the company is in an industry where frequent changes to business conditions are the norm, then you may want
to minimise the use of in-house programs. Third-party tools usually provide built-in flexibility; to accommodate a
change, you simply change the input parameters of the tool in use. Effective data extraction is a key to the success
of the data warehouse, so give this issue special attention and formulate a data extraction strategy for your
data warehouse. Here is a list of data extraction issues:
• Source identification: identify source applications and source structures.
• Method of extraction: for each data source, define whether the extraction process is manual or tool-based.
• Extraction frequency: for each data source, establish how frequently the data extraction must be done: daily, weekly, quarterly, and so on.
• Time window: for each data source, denote the time window for the extraction process.
• Job sequencing: determine whether the beginning of one job in an extraction job stream has to wait until the previous job has finished successfully.
• Exception handling: determine how to handle input records that cannot be extracted.
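One way to keep such a strategy concrete is to record the decisions per source in a small, machine-readable form. The sketch below is hypothetical; every source name, value, and field is invented for illustration:

```python
# A hypothetical per-source extraction strategy, one field per issue above
extraction_strategy = {
    "order_processing": {
        "source_structure": "ORDERS table in the OLTP database",  # source identification
        "method": "tool-based",             # manual or tool-based extraction
        "frequency": "daily",               # daily, weekly, quarterly, ...
        "time_window": ("01:00", "03:00"),  # when the extraction may run
        "wait_for": "inventory_management", # job sequencing dependency
        "on_bad_record": "log and skip",    # exception handling
    },
}

def can_run(source, finished_jobs):
    """Job sequencing: a job may start only after its predecessor finished."""
    dep = extraction_strategy[source].get("wait_for")
    return dep is None or dep in finished_jobs

print(can_run("order_processing", set()))                      # False
print(can_run("order_processing", {"inventory_management"}))   # True
```

A real extraction tool would carry the same information in its own metadata; the point is that every issue in the list above becomes an explicit, per-source decision.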
2.6.1 Source Identification
Source identification, of course, encompasses the identification of all the proper data sources. It does not stop with
just the identification of the data sources. It goes beyond that to examine and verify that the identified sources will
provide the necessary value to the data warehouse.
Assume that a part of the database, maybe one of the data marts, is designed to provide strategic information on the
fulfilment of orders. For this purpose, it is necessary to store historical information about the fulfilled and pending
orders. If the orders are shipped through multiple delivery channels, one needs to capture data about these channels.
If the users are interested in analysing the orders by the status of the orders as the orders go through the fulfilment
process, then one needs to extract data on the order statuses. In the fact table for order fulfilment, one needs attributes
about the total order amount, discounts, commissions, expected delivery time, actual delivery time, and dates at
different stages of the process. One needs dimension tables for product, order disposition, delivery channel, and
customer. First, it is necessary to determine if one has source systems to provide the data needed for this
data mart. Then, from the source systems, one needs to establish the correct data source for each data element in the
data mart. Further, go through a verification process to ensure that the identified sources are really the right ones.
The following figure describes a stepwise approach to source identification for order fulfilment. Source identification
is not as simple a process as it may sound. It is a critical first process in the data extraction function. You need to go
through the source identification process for every piece of information you have to store in the data warehouse.
[Figure content: source systems (order processing, customer, product, delivery contracts, shipment tracking, inventory management) feed target data (product data, customer data, delivery channel data, disposition data, time data, order metrics) through the source identification process:
• List each data item of metrics or facts needed for analysis in fact tables.
• List each dimension attribute from all dimensions.
• For each target data item, find the source system and source data item.
• If there are multiple sources for one data element, choose the preferred source.
• Identify multiple source fields for a single target field and form consolidation rules.
• Identify a single source field for multiple target fields and establish splitting rules.
• Ascertain default values.
• Inspect source data for missing values.]
Fig. 2.7 Source identifcation process
2.6.2 Data Extraction Techniques
Business transactions keep changing the data in the source systems. In most cases, the value of an attribute in a
source system is the value of that attribute at the current time. If you look at every data structure in the source
operational systems, the day-to-day business transactions constantly change the values of the attributes in these
structures. When a customer moves to another state, the data about that customer changes in the customer table in
the source system. When two additional package types are added to the way a product may be sold, the product
data changes in the source system. When a correction is applied to the quantity ordered, the data about that order
gets changed in the source system.
Data in the source systems are said to be time-dependent or temporal. This is because source data changes with
time. The value of a single variable varies over time.
Next, take the example of the change of address of a customer for a move from New York State to California. In
the operational system, what is important is that the current address of the customer has CA as the state code. The
actual change transaction itself, stating that the previous state code was NY and the revised state code is CA, need
not be preserved. But think about how this change affects the information in the data warehouse. If the state code is
used for analysing some measurements such as sales, the sales to the customer prior to the change must be counted
in New York State and those after the move must be counted in California. In other words, the history cannot be
ignored in the data warehouse. This raises the question: how do we capture the history from the source systems? The
answer depends on how exactly data is stored in the source systems. So let us examine and understand how data is
stored in the source operational systems.
2.6.3 Data in Operational Systems
These source systems generally store data in two ways. Operational data in the source system may be thought of
as falling into two broad categories. The type of data extraction technique you have to use depends on the nature
of each of these two categories.
[Figure content: examples of how attribute values are stored in operational systems at different dates. Storing current value (attribute: customer's state of residence): the value OH on 6/1/2000 is changed to CA on 9/15/2000, to NY on 1/22/2001, and to NJ on 3/1/2001; only the latest value is kept. Storing periodic status (attribute: status of property consigned to an auction house for sale): RE (property receipted) on 6/1/2000, then ES (value estimated) added on 9/15/2000, AS (assigned to auction) added on 1/22/2001, and SL (property sold) added on 3/1/2001; the full status history is kept.]
Fig. 2.8 Data in operational systems
Current value
Most of the attributes in the source systems fall into this category. Here, the stored value of an attribute represents
the value of the attribute at this moment of time. The values are transient or transitory. As business transactions
happen, the values change. There is no way to predict how long the present value will stay or when it will get
changed next. Customer name and address, bank account balances, and outstanding amounts on individual orders
are some examples of this category. What is the implication of this category for data extraction? The value of an
attribute remains constant only until a business transaction changes it. There is no telling when it will get changed.
Data extraction for preserving the history of the changes in the data warehouse gets quite involved for this category
of data.

Periodic status
This category is not as common as the previous category. In this category, the value of the attribute is preserved as
the status every time a change occurs. At each of these points in time, the status value is stored with reference to
the time when the new value became effective. This category also includes events stored with reference to the time
when each event occurred. Look at the way data about an insurance policy is usually recorded in the operational
systems of an insurance company. The operational databases store the status data of the policy at each point of
time when something in the policy changes. Similarly, for an insurance claim, each event, such as claim initiation,
verification, appraisal, and settlement, is recorded with reference to the points in time. For operational data in this
category, the history of the changes is preserved in the source systems themselves. Therefore, data extraction for
the purpose of keeping history in the data warehouse is relatively easier. Whether it is status data or data about an
event, the source systems contain data at each point in time when any change occurred. Pay special attention to the
examples. Having reviewed the categories indicating how data is stored in the operational systems, we are now in
a position to discuss the common techniques for data extraction. When you deploy your data warehouse, the initial
data as of a certain time must be moved to the data warehouse to get it started. This is the initial load. After the
initial load, your data warehouse must be kept updated so the history of the changes and statuses is reflected in the
data warehouse. Broadly, there are two major types of data extractions from the source operational systems: “as is”
(static) data and data of revisions.
• “As is” or static data is the capture of data at a given point in time. It is like taking a snapshot of the relevant
source data at a certain point in time. For current or transient data, this capture would include all transient
data identified for extraction. In addition, for data categorised as periodic, this data capture would include
each status or event at each point in time as available in the source operational systems. Primarily, you will
use static data capture for the initial load of the data warehouse. Sometimes, you may want a full refresh of
a dimension table. For example, assume that the product master of your source application is completely
revamped. In this case, you may find it easier to do a full refresh of the product dimension table of the target
data warehouse. Therefore, for this purpose, you will perform a static data capture of the product data.
• Data of revisions is also known as incremental data capture. Strictly, it is not incremental data but the
revisions since the last time data was captured. If the source data is transient, the capture of the revisions
is not easy. For periodic status data or periodic event data, the incremental data capture includes the values
of attributes at specific times. Extract the statuses and events that have been recorded since the last date of
extract. Incremental data capture may be immediate or deferred. Within the group of immediate data capture
there are three distinct options. Two separate options are available for deferred data capture.
Immediate Data Extraction
In this option, the data extraction is real-time. It occurs as the transactions happen at the source databases and
files.
[Figure content: three immediate data extraction options move data from the source operational systems into the data staging area. Option 1: capture through transaction logs (reading the DBMS transaction log files). Option 2: capture through database triggers (output files of trigger programs). Option 3: capture in source application (extract files written by the source systems).]
Fig. 2.9 Immediate data extraction options
Immediate Data Extraction is divided into three options:
Capture through Transaction Logs
This option uses the transaction logs of the DBMSs maintained for recovery from possible failures. As each
transaction adds, updates, or deletes a row from a database table, the DBMS immediately writes entries on the log
file. This data extraction technique reads the transaction log and selects all the committed transactions. There is no
extra overhead in the operational systems because logging is already part of the transaction processing. You have
to ensure that all transactions are extracted before the log file gets refreshed: as log files on disk storage get filled
up, the contents are backed up on other media and the disk log files are reused, so ensure that all log transactions
are extracted for data warehouse updates.
If all source systems are database applications, there is no problem with this technique. But if some of your source
system data is on indexed and other flat files, this option will not work for these cases. There are no log files for
these non-database applications, so you will have to apply some other data extraction technique. While
we are on the topic of data capture through transaction logs, let us take a side excursion and look at the use of
replication. Data replication is simply a method for creating copies of data in a distributed environment. The following
figure illustrates how replication technology can be used to capture changes to source data.
[Figure content: the DBMS writes changes to the source data into transaction log files; a replication server (log transaction manager) reads the replicated log and stores the transactions in the data staging area.]
Fig. 2.10 Data extraction using replication technology
The appropriate transaction logs contain all the changes to the various source database tables. Here are the broad
steps for using replication to capture changes to source data:
• Identify the source system DB table.
• Identify and define target files in the staging area.
• Create the mapping between the source table and target files.
• Define the replication mode.
• Schedule the replication process.
• Capture the changes from the transaction logs.
• Transfer captured data from the logs to the target files.
• Verify the transfer of data changes.
• Confirm success or failure of replication.
• Document the outcome of the replication in the metadata.
• Maintain definitions of sources, targets, and mappings.
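The core of these steps can be sketched in a toy form; the log layout, table names, and staging file names below are all hypothetical:

```python
# Committed log entries for a mapped source table are transferred to a
# target "file" in the staging area; everything else is skipped.
log = [
    {"table": "customer", "op": "UPDATE", "row": {"id": 5, "state": "CA"}, "committed": True},
    {"table": "customer", "op": "INSERT", "row": {"id": 9, "state": "NY"}, "committed": False},
    {"table": "orders",   "op": "INSERT", "row": {"id": 1}, "committed": True},
]
mapping = {"customer": "staging/customer_changes"}  # source table -> target file

staging = {}
for entry in log:
    target = mapping.get(entry["table"])
    if target and entry["committed"]:   # capture committed transactions only
        staging.setdefault(target, []).append(entry)

print(len(staging.get("staging/customer_changes", [])))  # 1
```

The uncommitted insert and the unmapped orders table are both ignored, mirroring the rule that only committed transactions for defined source tables reach the staging area.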
Capture through Database Triggers
This option is applicable to source systems that are database applications. Triggers are special stored procedures
(programs) that are stored on the database and fired when certain predefined events occur. You can create trigger
programs for all events for which you need data to be captured. The output of the trigger programs is written to a
separate file that will be used to extract data for the data warehouse. For example, if you need to capture all changes
to the records in the customer table, write a trigger program to capture all updates and deletes in that table. Data
capture through database triggers occurs right at the source and is therefore quite reliable. You can capture both
before and after images. However, building and maintaining trigger programs puts an additional burden on the
development effort. Also, execution of trigger procedures during transaction processing of the source systems puts
additional overhead on the source systems. Further, this option is applicable only for source data in databases.
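Assuming an SQLite source, the trigger-based option can be sketched as follows; the table and trigger names are illustrative. Note how both the before (OLD) and after (NEW) images are available to the trigger:

```python
import sqlite3

# Every update to the customer table is written by the trigger into a
# separate change table, which the warehouse extract later reads.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE customer_changes (id INTEGER, old_state TEXT, new_state TEXT);
CREATE TRIGGER capture_update AFTER UPDATE ON customer
BEGIN
    INSERT INTO customer_changes VALUES (OLD.id, OLD.state, NEW.state);
END;
INSERT INTO customer VALUES (5, 'NY');
""")
con.execute("UPDATE customer SET state = 'CA' WHERE id = 5")
print(con.execute("SELECT * FROM customer_changes").fetchall())  # [(5, 'NY', 'CA')]
```

The change table plays the role of the "separate file" in the text: the extract program reads and empties it on each run, without touching the source table itself.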
Capture in Source Application
This technique is also referred to as application-assisted data capture. In other words, the source application is made
to assist in the data capture for the data warehouse. You have to modify the relevant application programs that write
to the source files and databases. You revise the programs to write all adds, updates, and deletes to the source files
and database tables. Then other extract programs can use the separate file containing the changes to the source data.
Unlike the previous two cases, this technique may be used for all types of source data irrespective of whether it is
in databases, indexed files, or other flat files. But you have to revise the programs in the source operational systems
and keep them maintained. This could be a formidable task if the number of source system programs is large. Also,
this technique may degrade the performance of the source applications because of the additional processing needed
to capture the changes on separate files.
Deferred Data Extraction
In the cases discussed above, data capture takes place while the transactions occur in the source operational systems.
The data capture is immediate or real-time. In contrast, the techniques under deferred data extraction do not capture
the changes in real time. The capture happens later. Refer to the figure below for deferred data extraction options.
[Figure content: two deferred data extraction options move data from the source operational systems into the data staging area. Option 1: extract programs produce extract files based on a date and time stamp. Option 2: file comparison programs compare today's extract with yesterday's extract and produce extract files based on the comparison.]
Fig. 2.11 Deferred data extraction
Two options of deferred data extraction are detailed below.
Capture Based on Date and Time Stamp
Every time a source record is created or updated, it may be marked with a stamp showing the date and time. The
time stamp provides the basis for selecting records for data extraction. Here, the data capture occurs at a later time,
not while each source record is created or updated. If you run your data extraction program at midnight every day,
each day you will extract only those records with a date and time stamp later than midnight of the previous day. This
technique works well if the number of revised records is small. This technique presupposes that all the relevant
source records contain date and time stamps. Provided this is true, data capture based on date and time stamp can
work for any type of source file. This technique captures the latest state of the source data.
Any intermediary states between two data extraction runs are lost. Deletion of source records presents a special
problem: if a source record gets deleted in between two extract runs, the information about the delete is not detected.
You can get around this by marking the source record for delete first, doing the extraction run, and then going
ahead and physically deleting the record. This means you have to add more logic to the source applications.
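The selection itself is a one-line filter. A minimal sketch, assuming every source record carries a last_updated stamp (the record layout is invented for illustration):

```python
from datetime import datetime

# Deferred capture based on a date/time stamp: select only records revised
# since the previous run. Intermediate states between runs are not seen,
# and deletes leave no record behind unless the source marks them first.
def extract_since(records, last_run):
    return [r for r in records if r["last_updated"] > last_run]

records = [
    {"id": 1, "last_updated": datetime(2001, 3, 1, 23, 50)},
    {"id": 2, "last_updated": datetime(2001, 3, 2, 9, 15)},
]
midnight = datetime(2001, 3, 2)
print([r["id"] for r in extract_since(records, midnight)])  # [2]
```

Record 1 was last updated before the previous midnight run, so only record 2 is captured; if record 1 had also been changed and then changed back between runs, neither state would be seen.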
Capture by Comparing Files
If none of the above techniques is feasible for specific source files in your environment, then consider this technique
as the last resort. This technique is also called the snapshot differential technique because it compares two snapshots
of the source data. Let us see how this technique works. Suppose you want to apply it to capture the
changes to your product data.
While performing today's data extraction for changes to product data, you do a full file comparison between today's
copy of the product data and yesterday's copy. You also compare the record keys to find the inserts and deletes.
Then you capture any changes between the two copies.
This technique necessitates keeping prior copies of all the relevant source data. Though simple and
straightforward, comparison of full rows in a large file can be very inefficient. However, this may be the only feasible
option for some legacy data sources that do not have transaction logs or time stamps on source records.
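The snapshot differential itself reduces to comparing two copies keyed by record key. A sketch with invented keys and rows:

```python
# Compare yesterday's and today's snapshots (dicts keyed by record key)
# to find the inserts, deletes, and changed records.
def snapshot_diff(yesterday, today):
    inserts = [k for k in today if k not in yesterday]
    deletes = [k for k in yesterday if k not in today]
    updates = [k for k in today if k in yesterday and today[k] != yesterday[k]]
    return inserts, deletes, updates

yesterday = {"P1": {"price": 10}, "P2": {"price": 20}}
today     = {"P1": {"price": 12}, "P3": {"price": 30}}
print(snapshot_diff(yesterday, today))  # (['P3'], ['P2'], ['P1'])
```

On a large file this full-row comparison is the expensive part; real implementations often sort both snapshots by key and compare row checksums instead of whole rows.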
2.7 Data Transformation
We have designed the data extraction function using several techniques. The extracted data, however, is still raw
data that cannot be applied to the data warehouse as it is. First, all the extracted data must be made usable in the
data warehouse. Having information that is usable for strategic decision making is the underlying principle of the
data warehouse. You know that the data in the operational systems is not usable for this purpose. Moreover, because
operational data is extracted from many old legacy systems, the quality of the data in those systems is less likely
to be good enough for the data warehouse.
Before moving the extracted data from the source systems into the data warehouse, you inevitably have to perform
various kinds of data transformations. You have to transform the data according to standards because they come
from many dissimilar source systems. You have to ensure that after all the data is put together, the combined data
does not violate any business rules.
Consider the data structures and data elements that you need in your data warehouse. Now think about all the
relevant data to be extracted from the source systems. From the variety of source data formats, data values, and
the condition of the data quality, you know that you have to perform several types of transformations to make
the source data suitable for your data warehouse. Transformation of source data encompasses a wide variety of
manipulations to change all the extracted source data into usable information to be stored in the data warehouse.
Many companies underestimate the extent and complexity of the data transformation functions. They start out with
a simple departmental data mart as the pilot project. Almost all of the data for this pilot comes from a single source
application. The data transformation just entails field conversions and some reformatting of the data structures. Do
not make the mistake of taking the data transformation functions too lightly. Be prepared to consider all the different
issues and allocate sufficient time and effort to the task of designing the transformations.
Irrespective of the variety and complexity of the source operational systems, and regardless of the extent of your
data warehouse, you will find that most of your data transformation functions break down into a few basic tasks.
Let us go over these basic tasks so that you can view data transformation from a fundamental perspective. Here is
the set of basic tasks:
Data Mining
34/JNU OLE
Selection: This takes place at the beginning of the whole process of data transformation. You select either whole
records or parts of several records from the source systems. The task of selection usually forms part of the extraction
function itself. However, in some cases, the composition of the source structure may not be amenable to selection
of the necessary parts during data extraction. In these cases, it is prudent to extract the whole record and then do
the selection as part of the transformation function.

Splitting/joining: This task includes the types of data manipulation you need to perform on the selected parts of
source records. Sometimes (uncommonly), you will be splitting the selected parts even further during data
transformation. Joining of parts selected from many source systems is more widespread in the data warehouse
environment.

Conversion: This is an all-inclusive task. It includes a large variety of rudimentary conversions of single fields for
two primary reasons: one, to standardise among the data extractions from disparate source systems, and the other,
to make the fields usable and understandable to the users.

Summarisation: Sometimes you may find that it is not feasible to keep data at the lowest level of detail in your data
warehouse. It may be that none of your users ever need data at the lowest granularity for analysis or querying. For
example, for a grocery chain, sales data at the lowest level of detail for every transaction at the checkout may not
be needed. Storing sales by product by store by day in the data warehouse may be quite adequate. So, in this case,
the data transformation function includes summarisation of daily sales by product and by store.

Enrichment: This task is the rearrangement and simplification of individual fields to make them more useful for
the data warehouse environment. You may use one or more fields from the same input record to create a better
view of the data for the data warehouse. This principle is extended when one or more fields originate from multiple
records, resulting in a single field for the data warehouse.
Table 2.1 Basic tasks in data transformation
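Two of the basic tasks from the table, selection and summarisation, can be sketched as follows. The sketch reuses the grocery-chain example from the text; the checkout transactions themselves are invented sample data.

```python
from collections import defaultdict

# Invented checkout transactions from a grocery chain's source system.
transactions = [
    {"store": "S1", "product": "P1", "date": "2000-10-11", "amount": 5.0},
    {"store": "S1", "product": "P1", "date": "2000-10-11", "amount": 3.0},
    {"store": "S2", "product": "P1", "date": "2000-10-11", "amount": 4.0},
]

# Selection: keep only the parts of each record the warehouse needs.
selected = [(t["store"], t["product"], t["date"], t["amount"])
            for t in transactions]

# Summarisation: roll checkout transactions up to sales by product
# by store by day, instead of storing the lowest level of detail.
summary = defaultdict(float)
for store, product, date, amount in selected:
    summary[(store, product, date)] += amount
# summary[("S1", "P1", "2000-10-11")] == 8.0
```

The two transactions at store S1 collapse into a single daily total, which is exactly the granularity trade-off the summarisation task makes.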
2.7.1 Major Transformation Types
When you consider a particular set of extracted data structures, you will find that the transformation functions you
need to perform on this set may be done by a combination of the basic tasks discussed. Now let us consider
specific types of transformation functions. These are the most common transformation types:
Format revisions: These revisions include changes to the data types and lengths of individual fields. In your source
systems, product package types may be indicated by codes and names in which the fields are numeric and text data
types. Again, the lengths of the package types may vary among the different source systems. It is wise to standardise
and change the data type to text to provide values meaningful to the users.

Decoding of fields: This is also a common type of data transformation. When you deal with multiple source systems,
you are bound to have the same data items described by a plethora of field values. The classic example is the coding
for gender, with one source system using 1 and 2 for male and female and another system using M and F. Also, many
legacy systems are notorious for using cryptic codes to represent business values. What do the codes AC, IN, RE,
and SU mean in a customer file? You need to decode all such cryptic codes and change these into values that make
sense to the users. Change the codes to Active, Inactive, Regular, and Suspended.

Calculated and derived values: The extracted data from the sales system contains sales amounts, sales units, and
operating cost estimates by product. You will have to calculate the total cost and the profit margin before the data
can be stored in the data warehouse. Average daily balances and operating ratios are examples of derived fields.

Splitting of single fields: Earlier legacy systems stored names and addresses of customers and employees in large
text fields. The first name, middle initials, and last name were stored as a large text in a single field. Similarly, some
earlier systems stored city, state, and Zip Code data together in a single field. You need to store individual components
of names and addresses in separate fields in your data warehouse for two reasons. First, you may improve the
operating performance by indexing on individual components. Second, your users may need to perform analysis
by using individual components such as city, state, and Zip Code.

Merging of information: This is not quite the opposite of splitting of single fields. This type of data transformation
does not literally mean the merging of several fields to create a single field of data. For example, information about
a product may come from different data sources. The product code and description may come from one data source.
The relevant package types may be found in another data source. The cost data may be from yet another source. In
this case, merging of information denotes the combination of the product code, description, package types, and cost
into a single entity.

Character set conversion: This type of data transformation relates to the conversion of character sets to an agreed
standard character set for textual data in the data warehouse. If you have mainframe legacy systems as source
systems, the source data from these systems will be in EBCDIC characters. If PC-based architecture is the choice
for your data warehouse, then you must convert the mainframe EBCDIC format to the ASCII format. When your
source data is on other types of hardware and operating systems, you are faced with similar character set conversions.

Conversion of units of measurements: Many companies today have global branches. Measurements in many
European countries are in metric units. If your company has overseas operations, you may have to convert the
metrics so that the numbers may all be in one standard unit of measurement.

Date/time conversion: This type relates to representation of date and time in standard formats. For example, the
American and the British date formats may be standardised to an international format. The date of October 11,
2000 is written as 10/11/2000 in the U.S. format and as 11/10/2000 in the British format. This date may be
standardised to be written as 11 OCT 2000.

Summarisation: This type of transformation is the creating of summaries to be loaded in the data warehouse instead
of loading the most granular level of data. For example, for a credit card company to analyse sales patterns, it may
not be necessary to store in the data warehouse every single transaction on each credit card. Instead, you may want
to summarise the daily transactions for each credit card and store the summary data instead of storing the most
granular data by individual transactions.

Key restructuring: While extracting data from your input sources, look at the primary keys of the extracted records.
You will have to come up with keys for the fact and dimension tables based on the keys in the extracted records.
When choosing keys for your data warehouse database tables, avoid keys with built-in meanings. Transform such
keys into generic keys generated by the system itself. This is called key restructuring.

Deduplication: In many companies, the customer files have several records for the same customer. Mostly, the
duplicates are the result of creating additional records by mistake. In your data warehouse, you want to keep a single
record for one customer and link all the duplicates in the source systems to this single record. This process is called
deduplication of the customer file. Employee files and, sometimes, product master files have this kind of duplication
problem.
Table 2.2 Data transformation types
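A few of these transformation types can be sketched briefly. The status and gender codes and the date standardisation follow the examples in the table; the sample name field and the use of EBCDIC code page 500 are assumptions for illustration.

```python
from datetime import datetime

# Decoding of fields: map cryptic source codes to user-meaningful values.
STATUS_CODES = {"AC": "Active", "IN": "Inactive", "RE": "Regular", "SU": "Suspended"}

def standardise_date(us_date):
    """Date/time conversion: turn a U.S.-format date (MM/DD/YYYY)
    into the standard form, e.g. 10/11/2000 -> 11 OCT 2000."""
    return datetime.strptime(us_date, "%m/%d/%Y").strftime("%d %b %Y").upper()

def split_name(full_name):
    """Splitting of single fields: break one large name field
    into first name, middle initial, and last name."""
    first, middle, last = full_name.split()
    return {"first": first, "middle": middle, "last": last}

# Character set conversion: EBCDIC bytes (code page 500 here) to text.
product_code = b"\xc1\xc2\xc3".decode("cp500")   # "ABC"

status = STATUS_CODES["SU"]                # "Suspended"
std_date = standardise_date("10/11/2000")  # "11 OCT 2000"
name = split_name("John Q Public")
```

Each function corresponds to one row of the table; a real transformation layer would apply many such rules, driven by metadata rather than hard-coded lookups.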
2.7.2 Data Integration and Consolidation
The real challenge of ETL functions is the pulling together of all the source data from many disparate, dissimilar
source systems. As of today, most data warehouses get data extracted from a combination of legacy mainframe
systems, old minicomputer applications, and some newer client/server systems. Most of these source systems do not
conform to the same set of business rules. Very often they follow different naming conventions and varied standards
for data representation. The following figure shows a typical data source environment. Notice the challenging
issues indicated in the figure.
[Figure 2.12 depicts mainframe, minicomputer, and UNIX source systems, annotated with the challenging issues
of a typical data source environment: multiple character sets (EBCDIC/ASCII), multiple data types, no default
values, conflicting business rules, inconsistent values, missing values, multiple naming standards, and incompatible
structures.]

Fig. 2.12 Typical data source environment
Integrating the data is the combining of all the relevant operational data into coherent data structures to be made
ready for loading into the data warehouse. You may need to consider data integration and consolidation as a type
of pre-process before other major transformation routines are applied. You have to standardise the names and data
representations and resolve discrepancies in the ways in which the same data is represented in different source
systems. Although time-consuming, many of the data integration tasks can be managed. However, let us go over
a couple of the more difficult challenges.
Entity Identification Problem
If you have three different legacy applications developed in your organisation at different times in the past, you
are likely to have three different customer files supporting those systems. One system may be the old order entry
system, the second the customer service support system, and the third the marketing system. Most of the customers
will be common to all three files. The same customer on each of the files may have a unique identification number.
These unique identification numbers for the same customer may not be the same across the three systems. This is a
problem of identification in which you do not know which of the customer records relate to the same customer. But
in the data warehouse you need to keep a single record for each customer. You must be able to get the activities of
the single customer from the various source systems and then match them up with the single record to be loaded to
the data warehouse. This is a common but very difficult problem in many enterprises where applications have evolved
over time from the distant past. This type of problem is prevalent where multiple sources exist for the same entities.
Vendors, suppliers, employees, and sometimes products are the kinds of entities that are prone to this type of problem.
In the above example of the three customer files, you have to design complex algorithms to match records from all
the three files and form groups of matching records. No matching algorithm can completely determine the groups.
If the matching criteria are too tight, then some records will escape the groups. On the other hand, if the matching
criteria are too loose, a particular group may include records of more than one customer. You need to get your users
involved in reviewing the exceptions to the automated procedures. You have to weigh the issues relating to your
source systems and decide how to handle the entity identification problem. Every time a data extract function is
performed for your data warehouse, which may be every day, do you pause to resolve the entity identification problem
before loading the data warehouse? How will this affect the availability of the data warehouse to your users? Some
companies, depending on their individual situations, take the option of solving the entity identification problem in
two phases. In the first phase, all records, irrespective of whether they are duplicates or not, are assigned unique
identifiers. The second phase consists of reconciling the duplicates periodically through automatic algorithms and
manual verification.
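The two-phase approach can be sketched as follows. Real matching algorithms are far more elaborate than this; the last-name-plus-zip matching rule and the sample records are deliberately simple assumptions used only to illustrate the phases.

```python
import itertools

# Customer records drawn from three hypothetical source systems.
records = [
    {"source": "orders",    "name": "J. Smith",   "zip": "10001"},
    {"source": "service",   "name": "John Smith", "zip": "10001"},
    {"source": "marketing", "name": "A. Jones",   "zip": "20002"},
]

# Phase 1: assign unique identifiers, irrespective of duplication.
ids = itertools.count(1)
for r in records:
    r["warehouse_id"] = next(ids)

# Phase 2: reconcile duplicates with a (deliberately loose) matching
# rule -- records agreeing on last name and zip code are grouped.
def match_key(r):
    return (r["name"].split()[-1].lower(), r["zip"])

groups = {}
for r in records:
    groups.setdefault(match_key(r), []).append(r["warehouse_id"])
# groups -> {("smith", "10001"): [1, 2], ("jones", "20002"): [3]}
```

Tightening `match_key` (say, requiring the full name to agree) would split the Smith group apart, and loosening it would risk merging different customers; this is the tight-versus-loose trade-off described above, and the exception groups are what the users would review.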
Multiple Source Problem
This is another kind of problem affecting data integration, although less common and less complex than the entity
identification problem. This problem results from a single data element having more than one source. For example,
suppose the unit cost of products is available from two systems. In the standard costing application, cost values are
calculated and updated at specific intervals. Your order processing system also carries the unit costs for all products.
There could be slight variations in the cost figures from these two systems.
You need to know from which system you should get the cost for storing in the data warehouse. A straightforward
solution is to assign a higher priority to one of the two sources and pick up the product unit cost from that source.
Sometimes, a straightforward solution such as this may not sit well with the needs of the data warehouse users. You
may have to select from either of the files based on the last update date. Or, in some other instances, your
determination of the appropriate source may depend on other related fields.
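Both resolution strategies, source priority and last update date, can be sketched together. The system names, cost values, and dates are invented for the example.

```python
def pick_unit_cost(candidates, prefer="standard_costing", by_date=False):
    """Resolve a data element that has more than one source: either by
    assigning a higher priority to one source, or by taking the value
    with the most recent last-update date."""
    if by_date:
        return max(candidates, key=lambda c: c["last_update"])["unit_cost"]
    for c in candidates:
        if c["system"] == prefer:
            return c["unit_cost"]
    return candidates[0]["unit_cost"]   # fall back to the first source

candidates = [
    {"system": "standard_costing", "unit_cost": 12.40, "last_update": "2000-10-01"},
    {"system": "order_processing", "unit_cost": 12.55, "last_update": "2000-10-10"},
]

by_priority = pick_unit_cost(candidates)                # 12.40
by_recency = pick_unit_cost(candidates, by_date=True)   # 12.55
```

The ISO-style date strings compare correctly as text, which keeps the recency rule simple; with other date formats, parsing would be needed first.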
2.7.3 Implementing Transformation
The complexity and the extent of data transformation strongly suggest that manual methods alone will not be enough.
You must go beyond the usual methods of writing conversion programs that you used when deploying operational
systems. The types of data transformation needed here are far more difficult and challenging.
The methods you may want to adopt depend on some significant factors. If you are considering automating most of
the data transformation functions, first consider if you have the time to select the tools, configure and install them,
train the project team on the tools, and integrate the tools into the data warehouse environment. Data transformation
tools can be expensive. If the scope of your data warehouse is modest, then the project budget may not have room
for transformation tools.
In many cases, a suitable combination of both methods will prove to be effective. Find the proper balance based on
the available time frame and the money in the budget.
Using Transformation Tools
In recent years, transformation tools have greatly increased in functionality and flexibility. Although the desired
goal for using transformation tools is to eliminate manual methods altogether, in practice this is not completely
possible. Even if you get the most sophisticated and comprehensive set of transformation tools, be prepared to use
in-house programs here and there.
Use of automated tools certainly improves efficiency and accuracy. As a data transformation specialist, you just
have to specify the parameters, the data definitions, and the rules to the transformation tool. If your input into the
tool is accurate, then the rest of the work is performed efficiently by the tool. You gain a major advantage from
using a transformation tool because of the recording of metadata by the tool. When you specify the transformation
parameters and rules, these are stored as metadata by the tool. This metadata then becomes part of the overall metadata
component of the data warehouse and may be shared by other components. When changes occur to transformation
functions because of changes in business rules or data definitions, you just have to enter the changes into the tool.
The metadata for the transformations gets automatically adjusted by the tool.
Using Manual Techniques
This was the predominant method until recently, when transformation tools began to appear in the market. Manual
techniques may still be adequate for smaller data warehouses. Here, manually coded programs and scripts perform
every data transformation. Mostly, these programs are executed in the data staging area. The analysts and programmers
who already possess the knowledge and the expertise are able to produce the programs and scripts. This method
involves elaborate coding and testing. Although the initial cost may be reasonable, ongoing maintenance may escalate
the cost. Unlike automated tools, the manual method is more likely to be prone to errors. It may also turn out that
several individual programs are required in your environment.
A major disadvantage relates to metadata. Automated tools record their own metadata, but in-house programs have
to be designed differently if you need to store and use metadata. Even if the in-house programs record the data
transformation metadata initially, each time changes occur to the transformation rules, the metadata has to be
maintained as well. This puts an additional burden on the maintenance of the manually coded transformation programs.
2.8 Data Loading
It is generally agreed that transformation functions end as soon as load images are created. The next major set of
functions consists of the ones that take the prepared data, apply it to the data warehouse, and store it in the database
there. You create load images to correspond to the target files to be loaded in the data warehouse database.
The whole process of moving data into the data warehouse repository is referred to in several ways. You must have
heard the phrases applying the data, loading the data, and refreshing the data. For the sake of clarity we will use
the phrases as indicated below:
• Initial load: populating all the data warehouse tables for the very first time
• Incremental load: applying ongoing changes as necessary in a periodic manner
• Full refresh: completely erasing the contents of one or more tables and reloading with fresh data (an initial
load is a refresh of all the tables)
As loading the data warehouse may take an inordinate amount of time, loads are generally a cause for great concern.
During the loads, the data warehouse has to be offline. You need to find a window of time when the loads may be
scheduled without affecting your data warehouse users. Therefore, consider dividing up the whole load process into
smaller chunks and populating a few files at a time. This gives you two benefits: you may be able to run the smaller
loads in parallel, and you might also be able to keep some parts of the data warehouse up and running while loading
the other parts. It is hard to estimate the running times of the loads, especially the initial load or a complete refresh.
Do test loads to verify the correctness and to estimate the running times.
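The distinction between the three load modes can be sketched against a toy in-memory "table". A real warehouse load would go through the DBMS bulk loader; this only illustrates how the modes differ, and the sales rows are invented.

```python
warehouse = {}

def initial_load(table, rows):
    """Populate a table for the very first time."""
    warehouse[table] = list(rows)

def incremental_load(table, rows):
    """Apply ongoing changes to an already-loaded table."""
    warehouse.setdefault(table, []).extend(rows)

def full_refresh(table, rows):
    """Completely erase the table's contents and reload fresh data."""
    warehouse[table] = list(rows)

initial_load("sales", [{"day": 1, "amt": 100}])
incremental_load("sales", [{"day": 2, "amt": 150}])
after_increment = len(warehouse["sales"])   # 2 rows: old plus new
full_refresh("sales", [{"day": 2, "amt": 150}])
after_refresh = len(warehouse["sales"])     # 1 row: fresh data only
```

Note that a full refresh discards everything the incremental loads had accumulated, which is why refreshes are scheduled with such care.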
2.9 Data Quality
Accuracy is associated with a data element. Consider an entity such as customer. The customer entity has attributes
such as customer name, customer address, customer state, customer lifestyle, and so on. Each occurrence of the
customer entity refers to a single customer. Data accuracy, as it relates to the attributes of the customer entity, means
that the values of the attributes of a single occurrence accurately describe the particular customer. The value of the
customer name for a single occurrence of the customer entity is actually the name of that customer. Data quality
implies data accuracy, but it is much more than that. Most cleansing operations concentrate on data accuracy only.
You need to go beyond data accuracy. If the data is fit for the purpose for which it is intended, we can then say such
data has quality. Therefore, data quality is to be related to the usage for the data item as defined by the users. Does
the data item in an entity reflect exactly what the user is expecting to observe? Does the data item possess fitness of
purpose as defined by the users? If it does, the data item conforms to the standards of data quality.
If the database records conform to the field validation edits, then we generally say that the database records are of
good data quality. But such single-field edits alone do not constitute data quality. Data quality in a data warehouse
is not just the quality of individual data items but the quality of the full, integrated system as a whole. It is more
than the data edits on individual fields. For example, while entering data about the customers in an order entry
application, you may also collect the demographics of each customer. The customer demographics are not germane
to the order entry application and, therefore, they are not given too much attention. But you run into problems when
you try to access the customer demographics in the data warehouse: the customer data as an integrated whole lacks
data quality.
The following list is a survey of the characteristics or indicators of high-quality data.
Accuracy: The value stored in the system for a data element is the right value for that occurrence of the data element.
If you have a customer name and an address stored in a record, then the address is the correct address for the customer
with that name. If you find the quantity ordered as 1000 units in the record for order number 12345678, then that
quantity is the accurate quantity for that order.

Domain integrity: The data value of an attribute falls in the range of allowable, defined values. The common example
is the allowable values being "male" and "female" for the gender data element.

Data type: The value for a data attribute is actually stored as the data type defined for that attribute. When the data
type of the store name field is defined as "text," all instances of that field contain the store name shown in textual
format and not in numeric codes.

Consistency: The form and content of a data field is the same across multiple source systems. If the product code
for product ABC in one system is 1234, then the code for this product must be 1234 in every source system.

Redundancy: The same data must not be stored in more than one place in a system. If, for reasons of efficiency, a
data element is intentionally stored in more than one place in a system, then the redundancy must be clearly
identified.

Completeness: There are no missing values for a given attribute in the system.

Duplication: Duplication of records in a system is completely resolved. If the product file is known to have duplicate
records, then all the duplicate records for each product are identified and a cross-reference created.

Conformance to business rules: The values of each data item adhere to prescribed business rules. In an auction
system, the hammer or sale price cannot be less than the reserve price. In a bank loan system, the loan balance must
always be positive or zero.

Structural definiteness: Wherever a data item can naturally be structured into individual components, the item must
contain this well-defined structure.

Data anomaly: A field must be used only for the purpose for which it is defined. If the field Address-3 is defined
for any possible third line of address for long addresses, then this field must be used only for recording the third
line of address. It must not be used for entering a phone or fax number for the customer.

Clarity: A data element may possess all the other characteristics of quality data, but if the users do not understand
its meaning clearly, then the data element is of no value to the users. Proper naming conventions help to make the
data elements well understood by the users.

Timeliness: The users determine the timeliness of the data. If the users expect customer dimension data not to be
older than one day, the changes to customer data in the source systems must be applied to the data warehouse daily.

Usefulness: Every data element in the data warehouse must satisfy some requirements of the collection of users. A
data element may be accurate and of high quality, but if it is of no value to the users, then it is totally unnecessary
for that data element to be in the data warehouse.

Adherence to data integrity rules: The data stored in the relational databases of the source systems must adhere to
entity integrity and referential integrity rules. Any table that permits null as the primary key does not have entity
integrity. Referential integrity forces the establishment of the parent-child relationships correctly. In a customer-to-
order relationship, referential integrity ensures the existence of a customer for every order in the database.
Table 2.3 Characteristics or indicators of high-quality data
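A few of these indicators, domain integrity, completeness, data type, and conformance to business rules, lend themselves to simple automated checks. The sketch below uses the gender and loan-balance examples from the table; the record layout itself is an invented assumption.

```python
def quality_problems(record):
    """Flag violations of a few data quality indicators for one record."""
    problems = []
    # Domain integrity: gender must be one of the allowable values.
    if record.get("gender") not in ("male", "female"):
        problems.append("domain integrity: gender")
    # Completeness: no missing value for the name attribute.
    if not record.get("name"):
        problems.append("completeness: name")
    # Data type: loan balance must be stored as a number.
    if not isinstance(record.get("loan_balance"), (int, float)):
        problems.append("data type: loan_balance")
    # Business rule: the loan balance must be positive or zero.
    elif record["loan_balance"] < 0:
        problems.append("business rule: loan balance must be >= 0")
    return problems

good = {"name": "A. Smith", "gender": "female", "loan_balance": 1200.0}
bad  = {"name": "",         "gender": "X",      "loan_balance": -50.0}

good_report = quality_problems(good)   # no problems
bad_report = quality_problems(bad)     # three problems flagged
```

Checks like these cover single-record quality only; indicators such as consistency across source systems or referential integrity require comparing many records, which is where the real difficulty of data quality lies.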
2.10 Information Access and Delivery
You have extracted and transformed the source data. You have the best data design for the data warehouse
repository. You have applied the most effective data cleansing methods and got rid of most of the pollution from
the source data. Using the most optimal methods, you have loaded the transformed and cleansed data into your data
warehouse database. After performing all of these tasks most effectively, if your team has not provided the best
possible mechanism for information delivery to your users, you have really accomplished nothing from the users'
perspective. As you know, the data warehouse exists for one reason and one reason alone: its function is to provide
strategic information to your users. For the users, the information delivery mechanism is the data warehouse. The
user interface for information is what determines the ultimate success of your data warehouse. If the interface is
intuitive, easy to use, and enticing, the users will keep coming back to the data warehouse. If the interface is difficult
to use, cumbersome, and convoluted, your project team may as well leave the scene.
2.11 Matching Information to Classes of Users
Let us review how information delivery from a data warehouse differs from information delivery from an operational
system.
2.11.1 Information from the Data Warehouse
You must have worked on different types of operational systems that provide information to users. The users in
enterprises make use of the information from the operational systems to perform their day-to-day work and run
the business. If we have been involved in information delivery from operational systems and we understand what
information delivery to the users entails, then what is the need for this special study on information delivery from
the data warehouse?
If the kinds of strategic information made available in a data warehouse were readily available from the source
systems, then we would not really need the warehouse. Data warehousing enables the users to make better strategic
decisions by obtaining data from the source systems and keeping it in a format suitable for querying and analysis.
2.11.2 Information Potential
It is necessary to gain an appreciation of the enormous information potential of the data warehouse. Because of
this great potential, we have to pay adequate attention to information delivery from the data warehouse. We cannot
treat information delivery in a special way unless we fully realise the significance of how the data warehouse plays
a key role in the overall management of an enterprise.
Overall Enterprise Management
In every enterprise, three sets of processes govern the overall management. First, the enterprise is engaged in planning.
Secondly, execution of the plans takes place, followed by assessment of the results of the execution. The following
figure indicates these plan–execute–assess processes.
[Figure 2.13 shows the plan–execute–assess closed loop: the data warehouse helps in planning marketing campaigns
(planning), the marketing campaigns are executed (execution), the data warehouse helps assess the results of the
campaigns (assessment), and the campaigns are enhanced based on the results, closing the loop.]

Fig. 2.13 Enterprise plan-execute-assess closed loop
Assessment of the results determines the effectiveness of the campaigns. Based on the assessment of the results,
more plans may be made to vary the composition of the campaigns or launch additional ones. The cycle of planning,
executing, and assessing continues.
It is very interesting to note that the data warehouse, with its specialised information potential, fits nicely in this
plan–execute–assess loop. The data warehouse reports on the past and helps to plan the future. Initially, the data
warehouse assists in the planning. Once the plans are executed, the data warehouse is used to assess the effectiveness
of the execution.
Information potential for business areas
We considered one isolated example of how the information potential of your data warehouse can assist in the
planning for a market expansion and in the assessment of the results of the execution of marketing campaigns for
that purpose. Following are a few general areas of the enterprise where the data warehouse can assist in the planning
and assessment phases of the management loop.
Profitability growth: To increase profits, management has to understand how the profits are tied to product lines,
markets, and services. Management must gain insights into which product lines and markets produce greater
profitability. The information from the data warehouse is ideally suited to plan for profitability growth and to assess
the results when the plans are executed.

Strategic marketing: Strategic marketing drives business growth. When management studies the opportunities for
up-selling and cross-selling to existing customers and for expanding the customer base, they can plan for business
growth. The data warehouse has great information potential for strategic marketing.

Customer relationship management: A customer's interactions with an enterprise are captured in various operational
systems. The order processing system contains the orders placed by the customer; the product shipment system,
the shipments; the sales system, the details of the products sold to the customer; the accounts receivable system,
the credit details and the outstanding balances. The data warehouse has all the data about the customer extracted
from the various disparate source systems, transformed, and integrated. Thus, your management can "know" their
customers individually from the information available in the data warehouse. This knowledge results in better
customer relationship management.

Corporate purchasing: The management can get the overall picture of corporate-wide purchasing patterns from the
data warehouse. This is where all data about products and vendors are collected after integration from the source
systems. Your data warehouse empowers corporate management to plan for streamlining the purchasing processes.

Realising the information potential: The various operational systems collect massive quantities of data on numerous
types of business transactions. But these operational systems are not directly helpful for planning and assessment
of results. The users need to assess the results by viewing the data in the proper business context.
Table 2.4 General areas where data warehouse can assist in the planning and assessment phases
43/JNU OLE
Summary
• Data design consists of putting together the data structures. A group of data elements forms a data structure.
• Logical data design includes determination of the various data elements that are needed and combination of the data elements into structures of data. Logical data design also includes establishing the relationships among the data structures.
• Many CASE tools are available for data modelling. These tools can be used for creating the logical schema and the physical schema for specific target database management systems (DBMS).
• Another very useful function found in the CASE tools is the ability to forward-engineer the model and generate the schema for the target database system you need to work with.
• Creating the STAR schema is the fundamental data design technique for the data warehouse. It is necessary to gain a good grasp of this technique.
• Dimensional modelling gets its name from the business dimensions we need to incorporate into the logical data model.
• The multidimensional information package diagram we have discussed is the foundation for the dimensional model.
• Dimensional modelling is a technique for conceptualising and visualising data models as a set of measures that are described by common aspects of the business.
• Source identification encompasses the identification of all the proper data sources. It does not stop with just the identification of the data sources.
• Business transactions keep changing the data in the source systems.
• Operational data in the source system may be thought of as falling into two broad categories.
• Irrespective of the variety and complexity of the source operational systems, and regardless of the extent of your data warehouse, you will find that most of your data transformation functions break down into a few basic tasks.
• The whole process of moving data into the data warehouse repository is referred to in several ways. You must have heard the phrases applying the data, loading the data, and refreshing the data.
References
• Han, J. and Kamber, M., 2006. Data Mining: Concepts and Techniques, 2nd ed., Diane Cerra.
• Kimball, R., 2006. The Data Warehouse Lifecycle Toolkit, Wiley-India.
• Mento, B. and Rapple, B., 2003. Data Mining and Warehousing [Online] Available at: <http://www.arl.org/bm~doc/spec274webbook.pdf>. [Accessed 9 September 2011].
• Orli, R. and Santos, F., 1996. Data Extraction, Transformation, and Migration Tools [Online] Available at: <http://www.kismeta.com/extract.html>. [Accessed 9 September 2011].
• Learndatavault, 2009. Business Data Warehouse (BDW) [Video Online] Available at: <http://www.youtube.com/watch?v=OjIqP9si1LA&feature=related>. [Accessed 12 September 2011].
• SQLUSA, 2009. SQLUSA.com Data Warehouse and OLAP [Video Online] Available at: <http://www.youtube.com/watch?v=OJb93PTHsHo>. [Accessed 12 September 2011].
Recommended Reading
• Prabhu, C. S. R., 2004. Data Warehousing: Concepts, Techniques, Products and Applications, 2nd ed., PHI Learning Pvt. Ltd.
• Ponniah, P., 2001. Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals, Wiley-Interscience Publication.
• Ponniah, P., 2010. Data Warehousing Fundamentals for IT Professionals, 2nd ed., John Wiley and Sons.
Self Assessment
1. __________ consists of putting together the data structures.
a. Data mining
b. Data design
c. Data warehousing
d. Metadata
2. Match the columns.
1. Choosing the process A. Selecting the subjects from the information packages for the first set of logical structures to be designed.
2. Choosing the grain B. Determining the level of detail for the data in the data structures.
3. Choosing the facts C. Selecting the metrics or units of measurements to be included in the first set of structures.
4. Choosing the durations of the database D. Determining how far back in time you should go for historical data.
a. 1-A, 2-B, 3-C, 4-D
b. 1-D, 2-A, 3-C, 4-B
c. 1-B, 2-A, 3-C, 4-D
d. 1-C, 2-B, 3-A, 4-D
3. Which of the following is used to define the tables, the attributes and the relationships?
a. Metadata
b. Data warehousing
c. Data design
d. CASE tools
4. Creating the __________ is the fundamental data design technique for the data warehouse.
a. STAR schema
b. Data transformation
c. Dimensional modelling
d. Data extraction
5. Each row in a dimension table is identified by a unique value of an attribute designated as the ___________ of the dimension.
a. ordinary key
b. primary key
c. surrogate key
d. foreign key
6. How many general principles are to be applied when choosing primary keys for dimension tables?
a. One
b. Two
c. Three
d. Four
7. Which of the following keys are simply system-generated sequence numbers?
a. Ordinary key
b. Primary key
c. Surrogate key
d. Foreign key
8. Which of the following is a logical design technique to structure the business dimensions and the metrics that are analysed along these dimensions?
a. Mapping
b. Data extraction
c. Dimensional modelling
d. E-R modelling
9. Which technique is adopted to create the data models for these systems?
a. E-R modelling
b. Dimensional modelling
c. Source identification
d. Data extraction
10. __________ in the source systems are said to be time-dependent or temporal.
a. Data
b. Data mining
c. Data warehousing
d. Mapping
Chapter III
Data Mining
Aim
The aim of this chapter is to:
• introduce the concept of data mining
• analyse different data mining techniques
• explore the crucial concepts of data mining
Objectives
The objectives of this chapter are to:
• explicate the cross-industry standard process
• highlight dimensional modelling
• describe the process of graph mining
• elucidate social network analysis
Learning outcome
At the end of this chapter, you will be able to:
• discuss multirelational data mining
• comprehend data mining algorithms
• understand classification, clustering and association rules
3.1 Introduction
Data mining refers to the process of finding interesting patterns in data that are not explicitly part of the data. The interesting patterns can be used to make predictions. The process of data mining is composed of several steps, including selecting data to analyse, preparing the data, applying the data mining algorithms, and then interpreting and evaluating the results. Sometimes, the term data mining refers only to the step in which the data mining algorithms are applied. This has created a fair amount of confusion in the literature. But more often the term is used to refer to the entire process of finding and using interesting patterns in data.

Data mining techniques were first applied to databases. A better term for this process is KDD (Knowledge Discovery in Databases). Benoît (2002) offers this definition of KDD (which he refers to as data mining): Data mining (DM) is a multistage process of extracting previously unanticipated knowledge from large databases, and applying the results to decision making. Data mining tools detect patterns from the data and infer associations and rules from them. The extracted information may then be applied to prediction or classification models by identifying relations within the data records or between databases. Those patterns and rules can then guide decision making and forecast the effects of those decisions.

Data mining techniques can be applied to a wide variety of data repositories including databases, data warehouses, spatial data, multimedia data, Internet or Web-based data and complex objects. A more appropriate term for describing the entire process would be knowledge discovery, but unfortunately the term data mining is what has caught on. The following figure shows data mining as a step in an iterative knowledge discovery process.
[Figure: the KDD process, showing data cleaning and integration from databases into a data warehouse, selection and transformation into task-relevant data, data mining, and pattern evaluation leading to knowledge]
Fig. 3.1 Data mining is the core of the knowledge discovery process
(Source: http://www.exinfm.com/pdffles/intro_dm.pdf)
The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps:
• Data cleaning: Also known as data cleansing, this is a phase in which noisy and irrelevant data are removed from the collection.
• Data integration: At this stage, multiple data sources, often heterogeneous, may be combined in a common source.
• Data selection: At this step, the data relevant to the analysis is decided on and retrieved from the data collection.
• Data transformation: Also known as data consolidation, this is a phase in which the selected data is transformed into forms appropriate for the mining procedure.
• Data mining: The crucial step in which clever techniques are applied to extract potentially useful patterns.
• Pattern evaluation: In this step, strictly interesting patterns representing knowledge are identified based on given measures.
• Knowledge representation: The final phase in which the discovered knowledge is visually represented to the user. This essential step uses visualisation techniques to help users understand and interpret the data mining results.
It is common to combine some of these steps together. For instance, data cleaning and data integration can be
performed together as a pre-processing phase to generate a data warehouse. Data selection and data transformation
can also be combined where the consolidation of the data is the result of the selection, or, as for the case of data
warehouses, the selection is done on transformed data.
The KDD is an iterative process. Once the discovered knowledge is presented to the user, the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results.
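The iterative KDD steps above can be sketched as a chain of small functions. This is a minimal illustration with invented toy records; the function names simply mirror the step names and stand in for the much richer operations a real pipeline would perform.

```python
raw = [" 23 ", "invalid", "41", "", "35", "19 "]

def clean(records):
    """Data cleaning: drop noisy / malformed entries."""
    return [r.strip() for r in records if r.strip().isdigit()]

def select(records, minimum=20):
    """Data selection: keep only the task-relevant records."""
    return [r for r in records if int(r) >= minimum]

def transform(records):
    """Data transformation: consolidate into a mining-ready form."""
    return sorted(int(r) for r in records)

def mine(values):
    """The 'data mining' step: extract a (deliberately trivial) pattern."""
    return {"min": values[0], "max": values[-1], "count": len(values)}

pattern = mine(transform(select(clean(raw))))
print(pattern)   # → {'min': 23, 'max': 41, 'count': 3}
```

In practice, the output of the final step feeds evaluation and representation, and the whole chain is re-run as measures, selections and sources are refined.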
Data mining derives its name from the similarities between searching for valuable information in a large database and
mining rocks for a vein of valuable ore. Both imply either sifting through a large amount of material or ingeniously
probing the material to exactly pinpoint where the values reside. It is, however, a misnomer, since mining for gold in
rocks is usually called "gold mining" and not "rock mining"; thus, by analogy, data mining should have been called "knowledge mining" instead. Nevertheless, data mining became the accepted customary term, and very rapidly became a trend that even overshadowed more general terms such as knowledge discovery in databases (KDD), which describe a more complete process. Other similar terms referring to data mining are data dredging, knowledge extraction and pattern discovery.
The ongoing remarkable growth in the field of data mining and knowledge discovery has been fuelled by a fortunate confluence of a variety of factors:
• the explosive growth in data collection, as exemplified by the supermarket scanners above
• the storing of the data in data warehouses, so that the entire enterprise has access to a reliable current database
• the availability of increased access to data from Web navigation and intranets
• the competitive pressure to increase market share in a globalised economy
• the development of off-the-shelf commercial data mining software suites
• the tremendous growth in computing power and storage capacity
3.2 Crucial Concepts of Data Mining
Some crucial concepts of data mining are explained below.
3.2.1 Bagging (Voting, Averaging)
The concept of bagging (voting for classification, averaging for regression-type problems with continuous dependent variables of interest) applies to the area of predictive data mining, to combine the predicted classifications (prediction) from multiple models, or from the same type of model for different learning data. It is also used to address the inherent instability of results when applying complex models to relatively small data sets. Suppose your data mining task is to build a model for predictive classification, and the dataset from which to train the model (the learning data set, which contains observed classifications) is relatively small. You could repeatedly sub-sample (with replacement) from the dataset, and apply, for example, a tree classifier (for example, C&RT or CHAID) to the successive samples. In practice, very different trees will often be grown for the different samples, illustrating the instability of models often evident with small data sets. One method of deriving a single prediction (for new observations) is to use all trees found in the different samples, and to apply some simple voting. The final classification is the one most often predicted by the different trees. Note that some weighted combination of predictions (weighted vote, weighted average) is also possible, and commonly used. A sophisticated (machine learning) algorithm for generating weights for weighted prediction or voting is the boosting procedure.
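The voting idea can be sketched in a few lines of Python. The weak threshold "stump" learner, the toy two-class data and the random seed are invented for this example; a real application would train a tree classifier such as C&RT or CHAID on each bootstrap sample.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Draw a sample of the same size, with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """A deliberately weak learner: threshold on x placed at the
    midpoint between the two class means in this sample."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    t = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 0 if x < t else 1

def bagged_predict(models, x):
    """Simple (unweighted) vote across all models."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

rng = random.Random(42)
# toy learning set: class 0 clusters near 1.0, class 1 near 3.0
data = ([(rng.gauss(1.0, 0.4), 0) for _ in range(20)] +
        [(rng.gauss(3.0, 0.4), 1) for _ in range(20)])

models = [train_stump(bootstrap(data, rng)) for _ in range(25)]
print(bagged_predict(models, 0.8), bagged_predict(models, 3.2))   # → 0 1
```

A weighted vote would replace the plain `Counter` with per-model weights, which is where the boosting procedure described next comes in.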
3.2.2 Boosting
The concept of boosting applies to the area of predictive data mining, to generate multiple models or classifiers (for prediction or classification), and to derive weights to combine the predictions from those models into a single prediction or predicted classification.

A simple algorithm for boosting works like this: start by applying some method (for example, a tree classifier such as C&RT or CHAID) to the learning data, where each observation is assigned an equal weight. Compute the predicted classifications, and apply weights to the observations in the learning sample that are inversely proportional to the accuracy of the classification. In other words, assign greater weight to those observations that were difficult to classify (where the misclassification rate was high), and lower weights to those that were easy to classify (where the misclassification rate was low). In the context of C&RT, for example, different misclassification costs (for the different classes) can be applied, inversely proportional to the accuracy of prediction in each class. Then apply the classifier again to the weighted data (or with different misclassification costs), and continue with the next iteration (application of the analysis method for classification to the re-weighted data).

Boosting will generate a sequence of classifiers, where each consecutive classifier in the sequence is an "expert" in classifying observations that were not well classified by those preceding it. During deployment (for prediction or classification of new cases), the predictions from the different classifiers can then be combined (for example, via voting, or some weighted voting procedure) to derive a single best prediction or classification.

Note that boosting can also be applied to learning methods that do not explicitly support weights or misclassification costs. In that case, random sub-sampling can be applied to the learning data in the successive steps of the iterative boosting procedure, where the probability for selection of an observation into the subsample is inversely proportional to the accuracy of the prediction for that observation in the previous iteration (in the sequence of iterations of the boosting procedure).
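The re-weighting loop described above can be sketched as follows. This is a simplified, AdaBoost-style illustration on one-dimensional toy data; the threshold "stump" learner and the six labelled points are invented for the example, standing in for a real tree classifier and learning set.

```python
import math

def stump_error(data, weights, t):
    # weighted error of the rule "x >= t predicts class 1"
    return sum(w for (x, y), w in zip(data, weights)
               if (1 if x >= t else 0) != y)

def fit_stump(data, weights):
    # pick the observed x value that minimises the weighted error
    return min((x for x, _ in data),
               key=lambda t: stump_error(data, weights, t))

def boost(data, rounds=5):
    n = len(data)
    weights = [1.0 / n] * n           # start with equal weights
    ensemble = []                     # (threshold, model weight) pairs
    for _ in range(rounds):
        t = fit_stump(data, weights)
        err = max(stump_error(data, weights, t), 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((t, alpha))
        # up-weight misclassified observations, down-weight the rest
        weights = [w * math.exp(alpha if (1 if x >= t else 0) != y else -alpha)
                   for (x, y), w in zip(data, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * (1 if x >= t else -1) for t, alpha in ensemble)
    return 1 if score >= 0 else 0

data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1), (3.5, 1)]
ens = boost(data)
print([predict(ens, x) for x in (1.2, 2.8)])   # → [0, 1]
```

The weighted vote in `predict` is exactly the "weighted voting procedure" mentioned above: each classifier's say is proportional to how accurate it was on the (re-weighted) learning data.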
3.2.3 Data Preparation (in Data Mining)
Data preparation and cleaning is an often neglected but extremely important step in the data mining process. The old saying "garbage in, garbage out" is particularly applicable to typical data mining projects, where large data sets collected via some automatic method (for example, via the Web) serve as the input into the analyses. Often, the method by which the data were gathered was not tightly controlled, and so the data may contain out-of-range values (for example, Income: -100), impossible data combinations (for example, Gender: Male, Pregnant: Yes), and the like. Analysing data that has not been carefully screened for such problems can produce highly misleading results, particularly in predictive data mining.
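A minimal screening pass of the kind described might look like this; the records and the two rules are invented for illustration, and a real project would check many more constraints.

```python
records = [
    {"id": 1, "income": 52000, "gender": "F", "pregnant": "no"},
    {"id": 2, "income": -100,  "gender": "M", "pregnant": "no"},   # out of range
    {"id": 3, "income": 61000, "gender": "M", "pregnant": "yes"},  # impossible combo
]

def screen(record):
    """Return a list of data-quality problems found in one record."""
    problems = []
    if record["income"] < 0:
        problems.append("income out of range")
    if record["gender"] == "M" and record["pregnant"] == "yes":
        problems.append("impossible gender/pregnant combination")
    return problems

flagged = {}
for r in records:
    problems = screen(r)
    if problems:
        flagged[r["id"]] = problems
print(flagged)
# → {2: ['income out of range'], 3: ['impossible gender/pregnant combination']}
```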
3.2.4 Data Reduction (for Data Mining)
The term Data Reduction in the context of data mining is usually applied to projects where the goal is to aggregate
or amalgamate the information contained in large datasets into manageable (smaller) information nuggets. Data
reduction methods can include simple tabulation, aggregation (computing descriptive statistics) or more sophisticated
techniques like clustering, principal components analysis, and so on.
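The simplest form of such reduction, aggregating detail rows into descriptive statistics per group, can be sketched as follows; the sales rows are invented for illustration.

```python
from collections import defaultdict

# raw transaction-level rows: (region, sale amount)
sales = [("east", 120), ("east", 80), ("west", 200),
         ("west", 150), ("west", 250), ("east", 100)]

by_region = defaultdict(list)
for region, amount in sales:
    by_region[region].append(amount)

# reduce many detail rows to one summary "nugget" per region
summary = {r: {"n": len(v), "total": sum(v), "mean": sum(v) / len(v)}
           for r, v in by_region.items()}
print(summary["east"])   # → {'n': 3, 'total': 300, 'mean': 100.0}
```

The more sophisticated methods mentioned (clustering, principal components analysis) reduce data in the same spirit, but along dimensions the analyst did not specify in advance.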
3.2.5 Deployment
The concept of deployment in predictive data mining refers to the application of a model for prediction or classification to new data. After a satisfactory model or set of models has been identified (trained) for a particular application, we usually want to deploy those models so that predictions or predicted classifications can quickly be obtained for new data. For example, a credit card company may want to deploy a trained model or set of models (for example, neural networks, a meta-learner) to quickly identify transactions which have a high probability of being fraudulent.
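As an illustration, a deployed model reduced to its fitted parameters might be applied to new transactions like this. The z-score rule and all numbers are invented for the sketch, standing in for a trained neural network or meta-learner whose parameters were stored at training time.

```python
# a "trained model" reduced to its fitted parameters: amounts more than
# z_cutoff standard deviations above the training mean are flagged
model = {"mean": 85.0, "std": 40.0, "z_cutoff": 3.0}

def score(transaction_amount, m=model):
    """Deployment: apply the stored model to a new, unseen case."""
    z = (transaction_amount - m["mean"]) / m["std"]
    return "review" if z > m["z_cutoff"] else "ok"

print([score(a) for a in (60.0, 95.0, 400.0)])   # → ['ok', 'ok', 'review']
```

The point of deployment is that `score` runs quickly on each incoming case, with no access to the original learning data.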
3.2.6 Drill-Down Analysis
The concept of drill-down analysis applies to the area of data mining, to denote the interactive exploration of data, in particular of large databases. The process of drill-down analysis begins by considering some simple break-downs of the data by a few variables of interest (for example, gender, geographic region, and so on). Various statistics, tables, histograms, and other graphical summaries can be computed for each group. Next, we may want to "drill down" to expose and further analyse the data "underneath" one of the categorisations; for example, we might want to further review the data for males from the mid-west. Again, various statistical and graphical summaries can be computed for those cases only, which might suggest further break-downs by other variables (for example, income, age, and so on). At the lowest ("bottom") level are the raw data: for example, you may want to review the addresses of male customers from one region, for a certain income group, and so on, and to offer to those customers some particular services of particular utility to that group.
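The drill-down sequence described (break down by gender, restrict to males, then break down by region, down to the raw records) can be sketched with an invented toy customer list:

```python
from collections import Counter

# toy customer records; the fields mirror the narrative above
customers = [
    {"gender": "M", "region": "mid-west", "income": "high"},
    {"gender": "M", "region": "mid-west", "income": "low"},
    {"gender": "F", "region": "east",     "income": "high"},
    {"gender": "M", "region": "east",     "income": "high"},
    {"gender": "F", "region": "mid-west", "income": "low"},
]

# top level: a simple break-down by one variable of interest
by_gender = Counter(c["gender"] for c in customers)
print(by_gender)                              # → Counter({'M': 3, 'F': 2})

# drill down: restrict to males, then break down by region
males = [c for c in customers if c["gender"] == "M"]
print(Counter(c["region"] for c in males))    # → Counter({'mid-west': 2, 'east': 1})

# bottom level: the raw records for males in the mid-west
midwest_males = [c for c in males if c["region"] == "mid-west"]
print(len(midwest_males))                     # → 2
```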
3.2.7 Feature Selection
One of the preliminary stages in predictive data mining, when the data set includes more variables than could be included (or would be efficient to include) in the actual model building phase (or even in initial exploratory operations), is to select predictors from a large list of candidates. For example, when data are collected via automated (computerised) methods, it is not uncommon that measurements are recorded for thousands or hundreds of thousands (or more) of predictors. The standard analytic methods for predictive data mining, such as neural network analyses, classification and regression trees, generalised linear models, or general linear models, become impractical when the number of predictors exceeds more than a few hundred variables.

Feature selection selects a subset of predictors from a large list of candidate predictors without assuming that the relationships between the predictors and the dependent or outcome variables of interest are linear, or even monotone. Therefore, it is used as a pre-processor for predictive data mining, to select manageable sets of predictors that are likely related to the dependent (outcome) variables of interest, for further analyses with any of the other methods for regression and classification.
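A crude univariate screen of candidate predictors might look like this; the "separation score" and the toy candidates are invented stand-ins for the more robust, non-monotone screens that real feature-selection tools apply.

```python
from statistics import mean, pstdev

def separation_score(values, labels):
    """Crude univariate screen: standardised gap between class means."""
    v0 = [v for v, y in zip(values, labels) if y == 0]
    v1 = [v for v, y in zip(values, labels) if y == 1]
    spread = pstdev(values) or 1.0
    return abs(mean(v1) - mean(v0)) / spread

labels = [0, 0, 0, 1, 1, 1]
candidates = {
    "x1": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],   # strongly related to the outcome
    "x2": [5.0, 4.8, 5.1, 5.0, 4.9, 5.2],   # essentially noise
}

ranked = sorted(candidates,
                key=lambda f: separation_score(candidates[f], labels),
                reverse=True)
print(ranked)   # → ['x1', 'x2']
```

In practice, the top of such a ranking (here `x1`) forms the manageable predictor set passed on to the model-building phase.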
3.2.8 Machine Learning
Machine learning, computational learning theory and similar terms are often used in the context of data mining to denote the application of generic model-fitting or classification algorithms for predictive data mining. Unlike traditional statistical data analysis, which is usually concerned with the estimation of population parameters by statistical inference, the emphasis in data mining (and machine learning) is usually on the accuracy of prediction (predicted classification), regardless of whether or not the "models" or techniques used to generate the predictions are interpretable or open to simple explanation. Good examples of this type of technique often applied to predictive data mining are neural networks or meta-learning techniques such as boosting, and so on. These methods usually involve the fitting of very complex "generic" models that are not related to any reasoning or theoretical understanding of underlying causal processes; instead, these techniques can be shown to generate accurate predictions or classifications in cross-validation samples.
3.2.9 Meta-Learning
The concept of meta-learning applies to the area of predictive data mining, to combine the predictions from multiple
models. It is particularly useful when the types of models included in the project are very different. In this context,
this procedure is also referred to as Stacking (Stacked Generalisation).
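A minimal stacking sketch: combine the predictions of dissimilar base models with a second-level rule. The base-model outputs and the accuracy-weighted combiner below are invented for illustration; real stacked generalisation fits a proper second-level model on the base predictions rather than a simple weighted average.

```python
# base-model predicted probabilities of class 1 for five cases, plus the
# observed outcomes used to fit the combiner (all values invented)
base_preds = {
    "tree":   [0.9, 0.2, 0.8, 0.4, 0.7],
    "neural": [0.7, 0.1, 0.9, 0.3, 0.6],
}
observed = [1, 0, 1, 0, 1]

def accuracy(preds, obs):
    """Fraction of cases where thresholding at 0.5 matches the outcome."""
    return sum((p >= 0.5) == bool(y) for p, y in zip(preds, obs)) / len(obs)

# the "meta-learner": weight each base model by its accuracy
weights = {m: accuracy(p, observed) for m, p in base_preds.items()}

def stacked_predict(case_idx):
    score = sum(w * base_preds[m][case_idx] for m, w in weights.items())
    return int(score / sum(weights.values()) >= 0.5)

print([stacked_predict(i) for i in range(5)])   # → [1, 0, 1, 0, 1]
```

The appeal when base models are very different (a tree and a neural network, say) is that the combiner can learn where each one tends to be right.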
3.2.10 Models for Data Mining
In the business environment, complex data mining projects may require the coordinated efforts of various experts, stakeholders, or departments throughout an entire organisation. In the data mining literature, various "general frameworks" have been proposed to serve as blueprints for how to organise the process of gathering data, analysing data, disseminating results, implementing results, and monitoring improvements.
One such model, CRISP (Cross-Industry Standard Process for data mining) was proposed in the mid-1990s by a
European consortium of companies to serve as a non-proprietary standard process model for data mining. This
general approach postulates the following (perhaps not particularly controversial) general sequence of steps for
data mining projects:
[Figure: the CRISP cycle, linking business understanding, data understanding, data preparation, modelling, evaluation and deployment]
Fig. 3.2 Steps for data mining projects
Another approach - the Six Sigma methodology - is a well-structured, data-driven methodology for eliminating
defects, waste, or quality control problems of all kinds in manufacturing, service delivery, management, and other
business activities. This model has recently become very popular (due to its successful implementations) in various
American industries, and it appears to be gaining favour worldwide. It postulates a sequence of so-called DMAIC steps that grew out of the manufacturing, quality improvement, and process control traditions and is particularly well suited to production environments (including "production of services," that is, service industries).
Define → Measure → Analyse → Improve → Control
Fig. 3.3 Six-sigma methodology
Another framework of this kind (actually somewhat similar to Six Sigma) is the approach proposed by SAS Institute
called SEMMA, which is focusing more on the technical activities typically involved in a data mining project.
Sample → Explore → Modify → Model → Assess
Fig. 3.4 SEMMA
All of these models are concerned with the process of how to integrate data mining methodology into an organisation,
how to “convert data into information,” how to involve important stake-holders, and how to disseminate the
information in a form that can easily be converted by stake-holders into resources for strategic decision making.
Some software tools for data mining are specifically designed and documented to fit into one of these specific frameworks.
3.2.11 Predictive Data Mining
The term predictive data mining is usually applied to data mining projects whose goal is to identify a statistical or neural network model or set of models that can be used to predict some response of interest. For example, a credit card company may want to engage in predictive data mining to derive a (trained) model or set of models (for example, neural networks, a meta-learner) that can quickly identify transactions which have a high probability of being fraudulent. Other types of data mining projects may be more exploratory in nature (for example, to identify clusters or segments of customers), in which case drill-down descriptive and exploratory methods would be applied. Data reduction is another possible objective for data mining.
3.2.12 Text Mining
While data mining is typically concerned with the detection of patterns in numeric data, very often important (for example, critical to business) information is stored in the form of text. Unlike numeric data, text is often amorphous and difficult to deal with. Text mining generally consists of the analysis of (multiple) text documents by extracting key phrases, concepts, and so on, and the preparation of the text processed in that manner for further analyses with numeric data mining techniques (for example, to determine co-occurrences of concepts, key phrases, names, addresses, product names, and so on).
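A minimal key-term extraction and co-occurrence count, of the kind the passage describes, can be sketched with the standard library; the documents and the stopword list are invented for illustration.

```python
import re
from collections import Counter
from itertools import combinations

docs = [
    "Customer reported a billing error on the new credit card",
    "Billing error traced to the card activation batch job",
    "New customer asked about credit card rewards",
]
stopwords = {"a", "on", "the", "to", "about", "new"}

def terms(doc):
    """Extract the set of content-bearing terms from one document."""
    return {w for w in re.findall(r"[a-z]+", doc.lower()) if w not in stopwords}

# term frequency across the collection (documents containing the term)
freq = Counter(t for d in docs for t in terms(d))

# pairwise co-occurrence of terms within the same document
pairs = Counter(p for d in docs for p in combinations(sorted(terms(d)), 2))

print(freq["card"])                  # → 3
print(pairs[("billing", "error")])   # → 2
```

The numeric structures produced here (frequency and co-occurrence counts) are exactly what can then be fed into the numeric data mining techniques described earlier.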
3.3 Cross-Industry Standard Process: CRISP–DM
There is a temptation in some companies, due to departmental inertia and compartmentalisation, to approach data mining haphazardly, to reinvent the wheel and duplicate effort. A cross-industry standard was clearly required that is industry-neutral, tool-neutral, and application-neutral. The Cross-Industry Standard Process for Data Mining (CRISP–DM) was developed in 1996 by analysts representing DaimlerChrysler, SPSS, and NCR. CRISP provides a non-proprietary and freely available standard process for fitting data mining into the general problem-solving strategy of a business or research unit.
According to CRISP–DM, a given data mining project has a life cycle consisting of six phases, as illustrated in Fig. 3.5. Note that the phase sequence is adaptive. That is, the next phase in the sequence often depends on the outcomes associated with the preceding phase. The most significant dependencies between phases are indicated by the arrows. For example, suppose that we are in the modelling phase. Depending on the behaviour and characteristics of the model, we may have to return to the data preparation phase for further refinement before moving forward to the model evaluation phase.

The iterative nature of CRISP is symbolised by the outer circle in Figure 3.5. Often, the solution to a particular business or research problem leads to further questions of interest, which may then be attacked using the same general process as before.
[Figure: the six CRISP–DM phases (business/research understanding, data understanding, data preparation, modelling, evaluation and deployment) connected in an iterative cycle]
Fig. 3.5 CRISP–DM is an iterative, adaptive process
Following is an outline of each phase. Although issues encountered during the evaluation phase can conceivably send the analyst back to any of the previous phases for amelioration, for simplicity we show only the most common loop, back to the modelling phase.

3.3.1 CRISP–DM: The Six Phases
The six phases of CRISP-DM are explained below:
Business understanding phase
The first phase in the CRISP–DM standard process may also be termed the research understanding phase.
• Enunciate the project objectives and requirements clearly in terms of the business or research unit as a whole.
• Translate these goals and restrictions into the formulation of a data mining problem definition.
• Prepare a preliminary strategy for achieving these objectives.

Data understanding phase
• Collect the data.
• Use exploratory data analysis to familiarise yourself with the data and discover initial insights.
• Evaluate the quality of the data.
• If desired, select interesting subsets that may contain actionable patterns.

Data preparation phase
• Prepare from the initial raw data the final data set that is to be used for all subsequent phases. This phase is very labour intensive.
• Select the cases and variables you want to analyse and that are appropriate for your analysis.
• Perform transformations on certain variables, if needed.
• Clean the raw data so that it is ready for the modelling tools.

Modelling phase
• Select and apply appropriate modelling techniques.
• Calibrate model settings to optimise results.
• Remember that often, several different techniques may be used for the same data mining problem.
• If necessary, loop back to the data preparation phase to bring the form of the data into line with the specific requirements of a particular data mining technique.

Evaluation phase
• Evaluate the one or more models delivered in the modelling phase for quality and effectiveness before deploying them for use in the field.
• Determine whether the model in fact achieves the objectives set for it in the first phase.
• Establish whether some important facet of the business or research problem has not been accounted for sufficiently.
• Come to a decision regarding use of the data mining results.

Deployment phase
• Make use of the models created: model creation does not signify the completion of a project.
• Example of a simple deployment: generate a report.
• Example of a more complex deployment: implement a parallel data mining process in another department.
• For businesses, the customer often carries out the deployment based on your model.

Table 3.1 The six phases of CRISP-DM
3.4 Data Mining Techniques
As a general data structure, graphs have become increasingly important in modelling sophisticated structures
and their interactions, with broad applications including chemical informatics, bioinformatics, computer vision,
video indexing, text retrieval, and Web analysis. Mining frequent subgraph patterns for further characterisation,
discrimination, classifcation, and cluster analysis becomes an important task. Moreover, graphs that link many
nodes together may form different kinds of networks, such as telecommunication networks, computer networks,
biological networks, and Web and social community networks.
As such networks have been studied extensively in the context of social networks, their analysis has often been
referred to as social network analysis. Furthermore, in a relational database, objects are semantically linked across
multiple relations. Mining in a relational database often requires mining across multiple interconnected relations,
which is similar to mining in connected graphs or networks. Such kind of mining across data relations is considered
multirelational data mining. Data mining techniques can be classified as shown in the following diagram.
[Figure: data mining techniques, comprising graph mining, social network analysis and multirelational data mining]
Fig. 3.6 Data mining techniques
3.5 Graph Mining
Graphs become increasingly important in modelling complicated structures, such as circuits, images, chemical
compounds, protein structures, biological networks, social networks, the Web, workflows, and XML documents. Many
graph search algorithms have been developed in chemical informatics, computer vision, video indexing, and text
retrieval. With the increasing demand on the analysis of large amounts of structured data, graph mining has become
an active and important theme in data mining. Among the various kinds of graph patterns, frequent substructures are
the very basic patterns that can be discovered in a collection of graphs. They are useful for characterising graph sets,
discriminating different groups of graphs, classifying and clustering graphs, building graph indices, and facilitating
similarity search in graph databases.
Recent studies have developed several graph mining methods and applied them to the discovery of interesting
patterns in various applications. For example, there are reports on the discovery of active chemical structures in
HIV-screening datasets by contrasting the support of frequent graphs between different classes. There have been
studies on the use of frequent structures as features to classify chemical compounds, on the frequent graph mining
technique to study protein structural families, on the detection of considerably large frequent subpathways in metabolic
networks, and on the use of frequent graph patterns for graph indexing and similarity search in graph databases.
Although graph mining may include mining frequent subgraph patterns, graph classification, clustering, and other analysis tasks, in this section we focus on mining frequent subgraphs. The following figure shows the methods for mining frequent subgraphs.
Methods for mining frequent subgraphs: the Apriori-based approach and the pattern-growth approach.
Fig. 3.7 Methods of mining frequent subgraphs
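The support-counting step both approaches share can be illustrated at its simplest level: size-one subgraphs are just labelled edges, and an Apriori-style miner would first keep those whose support (the number of graphs containing them) meets a threshold before growing larger candidates. A minimal Python sketch follows; the graph encoding and the toy chemical data are illustrative, not from the text.

```python
from collections import Counter

def frequent_edges(graphs, min_support):
    """Count labelled edges across a collection of graphs and keep those
    appearing in at least min_support graphs -- the 1-edge frequent
    subgraphs that Apriori-style miners grow into larger candidates."""
    counts = Counter()
    for edges in graphs:
        # Support counts each graph at most once, so deduplicate per graph.
        for edge in set(edges):
            counts[edge] += 1
    return {e: c for e, c in counts.items() if c >= min_support}

# Each graph is a set of (node_label, edge_label, node_label) triples.
graphs = [
    {("C", "single", "O"), ("C", "double", "O")},
    {("C", "single", "O"), ("C", "single", "N")},
    {("C", "single", "O"), ("C", "double", "O")},
]
print(frequent_edges(graphs, min_support=2))
# ("C", "single", "O") appears in all 3 graphs; ("C", "double", "O") in 2.
```

A real miner would then join or extend these frequent edges into candidate two-edge subgraphs and recount support, which is where the Apriori-based and pattern-growth approaches diverge.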
3.6 Social Network Analysis
From the point of view of data mining, a social network is a heterogeneous and multirelational data set represented
by a graph. The graph is typically very large, with nodes corresponding to objects and edges corresponding to links
representing relationships or interactions between objects. Both nodes and links have attributes. Objects may have class
labels. Links can be one-directional and are not required to be binary. Social networks need not be social in context.
There are many real-world instances of technological, business, economic, and biological social networks.
Examples include electrical power grids, telephone call graphs, the spread of computer viruses, the World Wide
Web, and co-authorship and citation networks of scientists. Customer networks and collaborative filtering problems
(where product recommendations are based on the preferences of other customers) are other examples. In biology,
examples range from epidemiological networks, cellular and metabolic networks, and food webs, to the neural
network of the nematode worm Caenorhabditis elegans (the only creature whose neural network has been completely
mapped). The exchange of e-mail messages within corporations, newsgroups, chat rooms, friendships, sex webs
(linking sexual partners), and the quintessential “old-boy” network (that is the overlapping boards of directors of
the largest companies in the United States) are examples from sociology.
3.6.1 Characteristics of Social Networks
Social networks are rarely static. Their graph representations evolve as nodes and edges are added or deleted over
time. In general, social networks tend to exhibit the following characteristics:
57/JNU OLE
• Densification power law: Previously, it was believed that as a network evolves, the number of degrees grows
linearly in the number of nodes. This was known as the constant average degree assumption. However, extensive
experiments have shown that, on the contrary, networks become increasingly dense over time, with the average
degree increasing (and hence, the number of edges growing superlinearly in the number of nodes). The
densification follows the densification power law (or growth power law), which states

e(t) ∝ n(t)^a,

where e(t) and n(t), respectively, represent the number of edges and nodes of the graph at time t, and the exponent
a generally lies strictly between 1 and 2. Note that a = 1 corresponds to constant average degree over
time, whereas a = 2 corresponds to an extremely dense graph where each node has edges to a constant fraction
of all nodes.
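The exponent a can be estimated from snapshots of a growing network by fitting a line to log e(t) versus log n(t). A small self-contained sketch, using synthetic snapshots (the function name and data are ours):

```python
import math

def densification_exponent(snapshots):
    """Least-squares slope of log e(t) against log n(t); under the
    densification power law e(t) ∝ n(t)^a, this slope estimates a."""
    xs = [math.log(n) for n, e in snapshots]
    ys = [math.log(e) for n, e in snapshots]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic snapshots with e = n**1.5: the estimate should be ~1.5,
# i.e. strictly between constant average degree (a = 1) and an
# extremely dense, clique-like graph (a = 2).
snaps = [(n, n ** 1.5) for n in (100, 200, 400, 800)]
print(round(densification_exponent(snaps), 3))  # → 1.5
```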
• Shrinking diameter: It has been experimentally shown that the effective diameter tends to decrease as the
network grows. This contradicts an earlier belief that the diameter slowly increases as a function of network
size. As an intuitive example, consider a citation network, where nodes are papers and a citation from
one paper to another is indicated by a directed edge. The out-links of a node, v (representing the papers cited by
v), are "frozen" at the moment it joins the graph. The decreasing distances between pairs of nodes consequently
appear to be the result of subsequent papers acting as "bridges" by citing earlier papers from other areas.
• Heavy-tailed out-degree and in-degree distributions: The number of out-degrees for a node tends to follow
a heavy-tailed distribution, observing the power law 1/n^a, where n is the rank of the node in the order of
decreasing out-degrees and, typically, 0 < a < 2. The smaller the value of a, the heavier the tail. This phenomenon
is captured by the preferential attachment model, where each new node attaches to an existing network by
a constant number of out-links, following a "rich-get-richer" rule. The in-degrees also follow a heavy-tailed
distribution, although it tends to be more skewed than the out-degree distribution.

Fig. 3.8 Heavy-tailed out-degree and in-degree distributions

The number of out-degrees (y-axis) for a node tends to follow a heavy-tailed distribution. The node rank (x-axis)
is defined as the order of decreasing out-degrees of the node.
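The "rich-get-richer" rule of the preferential attachment model can be simulated directly: each new node attaches its out-links to existing nodes chosen with probability proportional to their current degree, and the resulting degree sequence is heavy-tailed. A rough sketch; the parameters, seed, and seed-node bookkeeping are arbitrary choices, not from the text.

```python
import random

def preferential_attachment(n, m=2, seed=7):
    """Grow a graph of n nodes where each new node attaches up to m links
    to existing nodes chosen proportionally to their current degree.
    The flattened 'targets' pool makes degree-weighted choice trivial."""
    random.seed(seed)
    targets = [0, 1]            # degree-weighted sampling pool
    degree = {0: 1, 1: 1}       # two seed nodes
    for v in range(2, n):
        chosen = {random.choice(targets) for _ in range(m)}
        degree[v] = 0
        for u in chosen:
            degree[u] += 1
            degree[v] += 1
            targets += [u, v]   # each endpoint gains one unit of weight
    return degree

deg = preferential_attachment(2000)
top = sorted(deg.values(), reverse=True)
# Early nodes accumulate far more links than the median node -- a heavy tail.
print(top[0], top[len(top) // 2])
```

Sorting nodes by decreasing degree and plotting degree against rank on log-log axes would reproduce the shape Fig. 3.8 describes.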
3.6.2 Mining on Social Networks
The following are exemplar areas of mining on social networks: link prediction, mining customer networks for
viral marketing, mining newsgroups using networks, and community mining from multirelational networks.
Link prediction: What edges will be added to the network?
Approaches to link prediction have been proposed based on several measures for analysing the “proximity” of nodes
in a network. Many measures originate from techniques in graph theory and social network analysis. The general
methodology is as follows: All methods assign a connection weight, score(X, Y), to pairs of nodes, X and Y, based
on the given proximity measure and input graph, G. A ranked list in decreasing order of score(X, Y) is produced.
This gives the predicted new links in decreasing order of confidence. The predictions can be evaluated based on real
observations on experimental data sets. The simplest approach ranks pairs, (X, Y), by the length of their shortest path
in G. This embodies the small world notion that all individuals are linked through short chains. (Since the convention
is to rank all pairs in order of decreasing score, score(X, Y) is here defined as the negative of the shortest path
length.) Several measures use neighbourhood information. The simplest such measure is common neighbours: the
greater the number of neighbours that X and Y have in common, the more likely X and Y are to form a link in the
future. Intuitively, if authors X and Y have never written a paper together but have many colleagues in common,
they are more likely to collaborate in the future. Other measures are based on the ensemble of all paths between two
nodes. The Katz measure, for example, computes a weighted sum over all paths between X and Y, where shorter
paths are assigned heavier weights. All of the measures can be used in conjunction with higher-level approaches,
such as clustering. For instance, the link prediction method can be applied to a cleaned-up version of the graph, in
which spurious edges have been removed.
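The common-neighbours score just described can be sketched in a few lines: for every non-adjacent pair, score(X, Y) is the size of the intersection of their neighbour sets, and pairs are ranked in decreasing order of score. The toy co-authorship graph below is illustrative.

```python
from itertools import combinations

def common_neighbour_scores(adj):
    """score(X, Y) = |N(X) ∩ N(Y)| for non-adjacent pairs, ranked in
    decreasing order -- the simplest neighbourhood-based link predictor."""
    scores = []
    for x, y in combinations(sorted(adj), 2):
        if y not in adj[x]:                      # only predict new links
            scores.append((len(adj[x] & adj[y]), x, y))
    return sorted(scores, reverse=True)

# Toy co-authorship graph: A and D share two colleagues but no paper yet.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C"},
}
print(common_neighbour_scores(adj)[0])  # → (2, 'A', 'D')
```

The Katz measure would replace the intersection count with a weighted sum over all paths between X and Y, with shorter paths weighted more heavily.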
Mining customer networks for viral marketing
Viral marketing is an application of social network mining that explores how individuals can influence the buying
behaviour of others. Traditionally, companies have employed direct marketing (where the decision to market to a
particular individual is based solely on that individual's characteristics) or mass marketing (where individuals are
targeted based on the population segment to which they belong). These approaches, however, neglect the influence
that customers can have on the purchasing decisions of others.
For example, consider a person who decides to see a particular movie and persuades a group of friends to see the
same film. Viral marketing aims to optimise the positive word-of-mouth effect among customers. It can choose to
spend more money marketing to an individual if that person has many social connections. Thus, by considering
the interactions between customers, viral marketing may obtain higher profits than traditional marketing, which
ignores such interactions.
The growth of the Internet over the past two decades has led to the availability of many social networks that can be
mined for the purposes of viral marketing. Examples include e-mail mailing lists, UseNet groups, on-line forums,
Internet relay chat (IRC), instant messaging, collaborative filtering systems, and knowledge-sharing sites. Knowledge-
sharing sites allow users to offer advice or rate products to help others, typically for free. Users can rate the usefulness
or "trustworthiness" of a review, and may possibly rate other reviewers as well. In this way, a network of trust
relationships between users (known as a "web of trust") evolves, representing a social network for mining.
Mining newsgroups using networks
The situation is rather different in newsgroups on topic discussions. A typical newsgroup posting consists of one or
more quoted lines from another posting followed by the opinion of the author. Such quoted responses form “quotation
links” and create a network in which the vertices represent individuals and the links “responded-to” relationships.
An interesting phenomenon is that people more frequently respond to a message when they disagree than when they
agree. This behaviour exists in many newsgroups and is in sharp contrast to the Web page link graph, where linkage
is an indicator of agreement or common interest. Based on this behaviour, one can effectively classify and partition
authors in the newsgroup into opposite camps by analysing the graph structure of the responses.
This newsgroup classification process can be performed using a graph-theoretic approach. The quotation network
(or graph) can be constructed by building a quotation link between person i and person j if i has quoted from an
earlier posting written by j. We can then consider any bipartition of the vertices into two sets: F represents those for
an issue and A represents those against it. If most edges in a newsgroup graph represent disagreements, then the
optimum choice is to maximise the number of edges across these two sets. Because it is known that, theoretically, the
max-cut problem (that is, maximising the number of edges to cut so that a graph is partitioned into two disconnected
subgraphs) is an NP-hard problem, we need to explore some alternative, practical solutions.
In particular, we can exploit two additional facts that hold in our situation: (1) rather than being a general graph, our
instance is largely a bipartite graph with some noise edges added, and (2) neither side of the bipartite graph is much
smaller than the other. In such situations, we can transform the problem into a minimum-weight, approximately
balanced cut problem, which in turn can be well approximated by computationally simple spectral methods. Moreover,
to further enhance the classification accuracy, we can first manually categorise a small number of prolific posters
and tag the corresponding vertices in the graph. This information can then be used to bootstrap a better overall
partitioning by enforcing the constraint that those classified on one side by human effort should remain on that side
during the algorithmic partitioning of the graph.
Based on these ideas, an efficient algorithm was proposed. Experiments with newsgroup data sets on several
highly debatable social topics, such as abortion, gun control, and immigration, demonstrate that links carry less
noisy information than text. Methods based on linguistic and statistical analysis of text yield lower accuracy on
such newsgroup data sets than methods based on the link analysis described above. This is because the vocabulary
used by the opposing sides tends to be largely identical, and many newsgroup postings consist of text too brief to
facilitate reliable linguistic analysis.
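The spectral idea can be sketched as follows: for a (near-)bipartite disagreement graph, the eigenvector associated with the most negative adjacency eigenvalue takes opposite signs on the two camps, so splitting vertices by sign approximates the desired cut. This is a simplified stand-in for the minimum-weight, approximately balanced cut method the text refers to; it ignores edge weights, balance constraints, and the manually tagged seed posters.

```python
import numpy as np

def bipartition_by_disagreement(nodes, edges):
    """For a (near-)bipartite graph whose edges mark disagreement, the
    eigenvector of the adjacency matrix's most negative eigenvalue takes
    opposite signs on the two sides; a sign split approximates max-cut."""
    idx = {v: i for i, v in enumerate(nodes)}
    a = np.zeros((len(nodes), len(nodes)))
    for u, v in edges:
        a[idx[u], idx[v]] = a[idx[v], idx[u]] = 1.0
    vals, vecs = np.linalg.eigh(a)   # eigenvalues in ascending order
    vec = vecs[:, 0]                 # eigenvector of the most negative one
    side_f = {v for v in nodes if vec[idx[v]] >= 0}
    return side_f, set(nodes) - side_f

# Posters a, b, c ("for") respond only to posters d, e, f ("against").
nodes = list("abcdef")
edges = [("a", "d"), ("a", "e"), ("b", "d"),
         ("b", "f"), ("c", "e"), ("c", "f")]
f_side, a_side = bipartition_by_disagreement(nodes, edges)
print(sorted(f_side), sorted(a_side))
```

Because the eigenvector's global sign is arbitrary, which camp is labelled F and which A can flip between runs; only the split itself is meaningful.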
Community mining from multirelational networks
With the growth of the Web, community mining has attracted increasing attention. A great deal of such work has
focused on mining implicit communities of Web pages, of scientific literature from the Web, and of document citations.
In principle, a community can be defined as a group of objects sharing some common properties. Community mining
can be thought of as subgraph identification.
For example, in Web page linkage, two Web pages (objects) are related if there is a hyperlink between them. A graph
of Web page linkages can be mined to identify a community or set of Web pages on a particular topic.
Most techniques for graph mining and community mining are based on a homogeneous graph, that is, they assume
that only one kind of relationship exists between the objects. However, in real social networks, there are always
various kinds of relationships between the objects. Each relation can be viewed as a relation network. In this sense,
the multiple relations form a multirelational social network (also referred to as a heterogeneous social network).
Each kind of relation may play a distinct role in a particular task, and the different relation graphs can provide us
with different communities. To find a community with certain properties, it is necessary to identify which relation
plays an important role in such a community. Such a relation might not exist explicitly; that is, we may need to first
discover such a hidden relation before finding the community on that relation network.
Different users may be interested in different relations within a network. Thus, if we mine networks by assuming
only one kind of relation, we may end up missing out on a lot of valuable hidden community information, and
such mining may not be adaptable to the diverse information needs of various users. This brings us to the problem
of multirelational community mining, which involves the mining of hidden communities on heterogeneous social
networks.
3.7 Multirelational Data Mining
Multirelational data mining (MRDM) methods search for patterns that involve multiple tables (relations) from a
relational database. Consider the multirelational schema in the figure below, which defines a financial database.
Each table or relation represents an entity or a relationship, described by a set of attributes. Links between relations
show the relationships between them. One method to apply traditional data mining techniques (which assume that the
data reside in a single table) is propositionalisation, which converts multiple relational data into a single flat data
relation, using joins and aggregations. This, however, could lead to the generation of a huge, undesirable "universal
relation" (involving all of the attributes). Furthermore, it can result in the loss of information, including essential
semantic information represented by the links in the database design. Multirelational data mining aims to discover
knowledge directly from relational data.
Loan(loan-id, account-id, date, amount, duration, payment)
Order(order-id, account-id, to-bank, to-account, amount, type)
Account(account-id, district-id, frequency, date)
Card(card-id, disp-id, type, issue-date)
Disposition(disp-id, account-id, client-id, type)
Transaction(trans-id, account-id, date, type, operation, amount, balance, symbol)
Client(client-id, birthdate, gender, district-id)
District(district-id, name, region, #people, #lt-500, #lt-2000, #lt-10000, #gt-10000, #city, ratio-urban, avg-salary, unemploy95, unemploy96, den-entry, #crime95, #crime96)
Fig. 3.9 A financial multirelational schema
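The propositionalisation approach described above, converting multirelational data into a single flat relation by joins and aggregations, can be sketched on two of the relations from this schema. The rows and the choice of aggregates (count and total) below are invented for illustration.

```python
# Hypothetical rows from the Loan and Transaction relations of Fig. 3.9.
loans = [
    {"loan_id": 1, "account_id": 10, "amount": 5000},
    {"loan_id": 2, "account_id": 11, "amount": 9000},
]
transactions = [
    {"trans_id": 100, "account_id": 10, "amount": 700},
    {"trans_id": 101, "account_id": 10, "amount": 300},
    {"trans_id": 102, "account_id": 11, "amount": 400},
]

def propositionalise(loans, transactions):
    """Flatten the one-to-many Loan-Transaction link into a single
    relation by joining on account_id and aggregating the related
    transactions into per-loan features (count and total)."""
    flat = []
    for loan in loans:
        related = [t for t in transactions
                   if t["account_id"] == loan["account_id"]]
        flat.append({**loan,
                     "n_trans": len(related),
                     "trans_total": sum(t["amount"] for t in related)})
    return flat

for row in propositionalise(loans, transactions):
    print(row)
```

Note how the join-and-aggregate step already loses information: the individual transaction dates and types are gone, which is exactly the kind of semantic loss the text warns about.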
There are different multirelational data mining tasks, including multirelational classification, clustering, and frequent
pattern mining. Multirelational classification aims to build a classification model that utilises information in different
relations. Multirelational clustering aims to group tuples into clusters using their own attributes as well as tuples
related to them in different relations. Multirelational frequent pattern mining aims at finding patterns involving
interconnected items in different relations.
In a database for multirelational classification, there is one target relation, R_t, whose tuples are called target tuples
and are associated with class labels. The other relations are non-target relations. Each relation may have one primary
key (which uniquely identifies tuples in the relation) and several foreign keys (where a primary key in one relation
can be linked to the foreign key in another). If we assume a two-class problem, then we pick one class as the positive
class and the other as the negative class. The most important task in building an accurate multirelational classifier
is to find relevant features in different relations that help distinguish positive and negative target tuples.
3.8 Data Mining Algorithms and their Types
The data mining algorithm is the mechanism that creates mining models. To create a model, an algorithm first
analyses a set of data, looking for specific patterns and trends. The algorithm then uses the results of this analysis
to define the parameters of the mining model.
The mining model that an algorithm creates can take various forms, including:
• A set of rules that describe how products are grouped together in a transaction.
• A decision tree that predicts whether a particular customer will buy a product.
• A mathematical model that forecasts sales.
• A set of clusters that describe how the cases in a dataset are related.
Types of Data Mining Algorithms
There are different types of data mining algorithms. However, the three main types are classification,
clustering and association rules.
3.8.1 Classification
Classification is the task of generalising known structure to apply to new data. For example, an email program might
attempt to classify an email as legitimate or spam.
Preparing data for classification: The following pre-processing steps may be applied to the data to help improve
the accuracy, efficiency, and scalability of the classification process.
Data cleaning: This refers to the pre-processing of data in order to remove or reduce noise (by applying smoothing
techniques, for example) and to treat missing values (for example, by replacing a missing value with the
most commonly occurring value for that attribute, or with the most probable value based on statistics). Although
most classification algorithms have some mechanism for handling noisy or missing data, this step can help reduce
confusion during learning.
Relevance analysis: Many of the attributes in the data may be redundant. Correlation analysis can be used to identify
whether any two given attributes are statistically related. For example, a strong correlation between attributes A_1
and A_2 would suggest that one of the two could be removed from further analysis. A database may also contain
irrelevant attributes. Attribute subset selection can be used in these cases to find a reduced set of attributes such that the
resulting probability distribution of the data classes is as close as possible to the original distribution obtained using
all attributes. Hence, relevance analysis, in the form of correlation analysis and attribute subset selection, can be
used to detect attributes that do not contribute to the classification or prediction task. Including such attributes may
otherwise slow down, and possibly mislead, the learning step.
Ideally, the time spent on relevance analysis, when added to the time spent on learning from the resulting "reduced"
attribute (or feature) subset, should be less than the time that would have been spent on learning from the original
set of attributes. Hence, such analysis can help improve classification efficiency and scalability.
Data transformation and reduction: The data may be transformed by normalisation, particularly when neural
networks or methods involving distance measurements are used in the learning step. Normalisation involves scaling
all values for a given attribute so that they fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0. In
methods that use distance measurements, for example, this would prevent attributes with initially large ranges (for
example, income) from outweighing attributes with initially smaller ranges (such as binary attributes). The data can
also be transformed by generalising it to higher-level concepts. Concept hierarchies may be used for this purpose.
This is particularly useful for continuous-valued attributes. For example, numeric values for the attribute income
can be generalised to discrete ranges, such as low, medium, and high. Similarly, categorical attributes, like street,
can be generalised to higher-level concepts, like city. Because generalisation compresses the original training data,
fewer input/output operations may be involved during learning. Data can also be reduced by applying many other
methods, ranging from wavelet transformation and principal components analysis to discretisation techniques, such
as binning, histogram analysis, and clustering.
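The normalisation and generalisation steps above can be sketched briefly. The min-max scaling formula is standard; the income cut-offs for the low/medium/high ranges are invented for illustration.

```python
def min_max_normalise(values, new_min=0.0, new_max=1.0):
    """Scale values into [new_min, new_max], as done before distance-based
    learning so wide-range attributes (e.g. income) don't dominate
    narrow-range ones (e.g. binary attributes)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def discretise_income(income):
    """Generalise a numeric income to the discrete ranges low/medium/high
    (the cut-offs here are illustrative, not from the text)."""
    if income < 30000:
        return "low"
    if income < 80000:
        return "medium"
    return "high"

incomes = [20000, 40000, 100000]
print(min_max_normalise(incomes))               # → [0.0, 0.25, 1.0]
print([discretise_income(i) for i in incomes])  # → ['low', 'medium', 'high']
```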
Bayesian classification: Bayesian classifiers are statistical classifiers. They can predict class membership probabilities,
such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes'
theorem. Studies comparing classification algorithms have found a simple Bayesian classifier known as the naïve
Bayesian classifier to be comparable in performance with decision tree and selected neural network classifiers.
Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases. Naïve Bayesian
classifiers assume that the effect of an attribute value on a given class is independent of the values of the other
attributes. This assumption is called class conditional independence. It is made to simplify the computations involved
and, in this sense, is considered "naïve." Bayesian belief networks are graphical models which, unlike naïve Bayesian
classifiers, allow the representation of dependencies among subsets of attributes. Bayesian belief networks can also
be used for classification.
Bayes' theorem
Bayes' theorem is named after Thomas Bayes, a nonconformist English clergyman who did early work in
probability and decision theory during the 18th century. Let X be a data tuple. In Bayesian terms, X is considered
"evidence." As usual, it is described by measurements made on a set of n attributes. Let H be some hypothesis, such
as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X),
the probability that the hypothesis H holds given the "evidence" or observed data tuple X. In other words, we are
looking for the probability that tuple X belongs to class C, given that we know the attribute description of X.
P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
For example, suppose our world of data tuples is confined to customers described by the attributes age and income,
and that X is a 35-year-old customer with an income of $40,000. Suppose that H is the hypothesis that
our customer will buy a computer. Then P(H|X) reflects the probability that customer X will buy a computer given
that we know the customer's age and income.
In contrast, P(H) is the prior probability, or a priori probability, of H. For the above example, this is the probability
that any given customer will buy a computer, regardless of age, income, or any other information, for that matter.
The posterior probability, P(H|X), is based on more information (namely, the customer information) than the prior
probability, P(H), which is independent of X.
Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that a customer,
X, is 35 years old and earns $40,000, given that we know the customer will buy a computer. P(X) is the prior
probability of X. Using our example, it is the probability that a person from our set of customers is 35 years old
and earns $40,000.
"How are these probabilities estimated?" P(H), P(X|H), and P(X) may be estimated from the given data, as we
shall see below. Bayes' theorem is useful in that it provides a way of calculating the posterior probability, P(H|X),
from P(H), P(X|H), and P(X).
Bayes' theorem is

P(H|X) = P(X|H) P(H) / P(X)
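A quick numeric check of Bayes' theorem, P(H|X) = P(X|H) P(H) / P(X), using made-up probabilities for the buys-computer example: suppose 10% of all customers buy a computer, 20% of buyers match X's age/income profile, and 5% of all customers match that profile.

```python
def posterior(p_x_given_h, p_h, p_x):
    """Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)."""
    return p_x_given_h * p_h / p_x

# Illustrative numbers, not from the text.
p = posterior(p_x_given_h=0.2, p_h=0.1, p_x=0.05)
print(round(p, 3))  # → 0.4
```

So under these invented numbers, knowing X's profile raises the probability of a purchase from the prior 10% to a posterior 40%.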
Naïve Bayesian classification
The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
• Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an
n-dimensional attribute vector, X = (x_1, x_2, ..., x_n), depicting n measurements made on the tuple from n
attributes, respectively, A_1, A_2, ..., A_n.
• Suppose that there are m classes, C_1, C_2, ..., C_m. Given a tuple, X, the classifier will predict that X belongs to the
class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts
that tuple X belongs to the class C_i if and only if
P(C_i|X) > P(C_j|X) for 1 ≤ j ≤ m, j ≠ i.
Thus, we maximise P(C_i|X). The class C_i for which P(C_i|X) is maximised is called the maximum posteriori
hypothesis. According to Bayes' theorem,
P(C_i|X) = P(X|C_i) P(C_i) / P(X).
• As P(X) is constant for all classes, only P(X|C_i) P(C_i) needs to be maximised. If the class prior probabilities
are not known, then it is commonly assumed that the classes are equally likely, that is, P(C_1) = P(C_2) = ... =
P(C_m), and we would therefore maximise P(X|C_i). Otherwise, we maximise P(X|C_i) P(C_i). Note that the
class prior probabilities may be estimated by P(C_i) = |C_i,D| / |D|, where |C_i,D| is the number of training tuples
of class C_i in D.
• Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|C_i). In
order to reduce computation in evaluating P(X|C_i), the naïve assumption of class conditional independence is made.
This presumes that the values of the attributes are conditionally independent of one another, given the class
label of the tuple (that is, there are no dependence relationships among the attributes). Thus,
P(X|C_i) = ∏ (k = 1 to n) P(x_k|C_i) = P(x_1|C_i) × P(x_2|C_i) × ... × P(x_n|C_i).
We can easily estimate the probabilities P(x_1|C_i), P(x_2|C_i), ..., P(x_n|C_i) from the training tuples. Recall
that here x_k refers to the value of attribute A_k for tuple X. For each attribute, we look at whether the attribute is
categorical or continuous-valued. For instance, to compute P(x_k|C_i), we consider the following:
• If A_k is categorical, then P(x_k|C_i) is the number of tuples of class C_i in D having the value x_k for A_k, divided
by |C_i,D|, the number of tuples of class C_i in D.
• If A_k is continuous-valued, then we need to do a bit more work, but the calculation is pretty straightforward. A
continuous-valued attribute is typically assumed to have a Gaussian distribution with a mean μ and standard
deviation σ, defined by
g(x, μ, σ) = (1 / (√(2π) σ)) e^(-(x - μ)² / (2σ²)),
so that
P(x_k|C_i) = g(x_k, μ_Ci, σ_Ci).
These equations may appear daunting, but we simply need to compute μ_Ci and σ_Ci, which are the mean (that is,
average) and standard deviation, respectively, of the values of attribute A_k for training tuples of class C_i. We then
plug these two quantities into the first equation above, together with x_k, in order to estimate P(x_k|C_i).
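The counting procedure above, estimating P(C_i) and each P(x_k|C_i) from the training tuples and then predicting the class that maximises P(X|C_i) P(C_i), can be sketched for categorical attributes. The toy data set is invented, and we omit the smoothing of zero counts that a practical implementation would add.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(Ci) and P(xk|Ci) by counting, as in the procedure above
    (categorical attributes only; no smoothing, for clarity)."""
    priors = Counter(labels)
    cond = defaultdict(Counter)          # (class, attr index) -> value counts
    for row, c in zip(rows, labels):
        for k, v in enumerate(row):
            cond[(c, k)][v] += 1
    return priors, cond, len(labels)

def predict_nb(model, row):
    """Return the class Ci maximising P(Ci) * product of P(xk|Ci)."""
    priors, cond, n = model
    best_class, best_score = None, -1.0
    for c, nc in priors.items():
        score = nc / n                   # P(Ci)
        for k, v in enumerate(row):      # × Π P(xk|Ci)
            score *= cond[(c, k)][v] / nc
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy buys_computer data: each tuple is (age, student).
rows = [("youth", "yes"), ("youth", "no"), ("senior", "no"), ("senior", "yes")]
labels = ["yes", "no", "no", "yes"]
model = train_nb(rows, labels)
print(predict_nb(model, ("youth", "yes")))  # → yes
```

For a continuous attribute, the per-value count ratio would simply be replaced by the Gaussian density g(x_k, μ_Ci, σ_Ci) given above.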
Rule-Based Classification
Rule-based classifiers use a learned model represented as a set of IF-THEN rules.
Using IF-THEN rules for classification
Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN
rules for classification. An IF-THEN rule is an expression of the form
IF condition THEN conclusion.
An example is rule R1:
R1: IF age = youth AND student = yes THEN buys_computer = yes.
The "IF"-part (or left-hand side) of a rule is known as the rule antecedent or precondition. The "THEN"-part (or
right-hand side) is the rule consequent. In the rule antecedent, the condition consists of one or more attribute tests
(such as age = youth and student = yes) that are logically ANDed. The rule's consequent contains a class prediction
(in this case, we are predicting whether a customer will buy a computer). R1 can also be written as
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes).
If the condition (that is, all of the attribute tests) in a rule antecedent holds true for a given tuple, we say that the
rule antecedent is satisfied (or simply, that the rule is satisfied) and that the rule covers the tuple.
A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a class-labelled data set, D, let n_covers
be the number of tuples covered by R; n_correct be the number of tuples correctly classified by R; and |D| be the number
of tuples in D. We can define the coverage and accuracy of R as
coverage(R) = n_covers / |D|
accuracy(R) = n_correct / n_covers
That is, a rule's coverage is the percentage of tuples that are covered by the rule (that is, whose attribute values hold
true for the rule's antecedent). For a rule's accuracy, we look at the tuples that it covers and see what percentage of
them the rule can correctly classify.
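The coverage and accuracy definitions can be computed directly. In this sketch, a rule is a (predicate, predicted class) pair and the labelled tuples are invented for illustration.

```python
def rule_stats(rule, data):
    """coverage(R) = n_covers / |D|; accuracy(R) = n_correct / n_covers,
    where R covers a tuple when its antecedent holds for that tuple."""
    antecedent, consequent = rule
    covered = [(x, y) for x, y in data if antecedent(x)]
    n_correct = sum(1 for x, y in covered if y == consequent)
    coverage = len(covered) / len(data)
    accuracy = n_correct / len(covered) if covered else 0.0
    return coverage, accuracy

# R1: IF age = youth AND student = yes THEN buys_computer = yes
r1 = (lambda x: x["age"] == "youth" and x["student"] == "yes", "yes")
data = [
    ({"age": "youth", "student": "yes"}, "yes"),
    ({"age": "youth", "student": "yes"}, "no"),
    ({"age": "youth", "student": "no"}, "no"),
    ({"age": "senior", "student": "yes"}, "yes"),
]
print(rule_stats(r1, data))  # → (0.5, 0.5)
```

Here R1 covers two of the four tuples (coverage 0.5) and correctly classifies one of those two (accuracy 0.5).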
Rule extraction from a decision tree
Decision tree classifiers are a popular method of classification: it is easy to understand how decision trees work,
and they are known for their accuracy. Decision trees can, however, become large and difficult to interpret. In this
subsection, we look at how to build a rule-based classifier by extracting IF-THEN rules from a decision tree. In
comparison with a decision tree, the IF-THEN rules may be easier for humans to understand, particularly if the
decision tree is very large. To extract rules from a decision tree, one rule is created for each path from the root to a
leaf node. Each splitting criterion along a given path is logically ANDed to form the rule antecedent ("IF" part).
The leaf node holds the class prediction, forming the rule consequent ("THEN" part).
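The path-walking procedure just described can be sketched on a small decision tree encoded as nested tuples. The tree itself is illustrative, not from the text.

```python
def tree_to_rules(node, path=()):
    """Walk each root-to-leaf path of a small tuple-encoded decision tree;
    the ANDed tests along a path form the IF part, the leaf the THEN part."""
    if isinstance(node, str):                    # leaf: class prediction
        cond = " AND ".join(f"{a} = {v}" for a, v in path) or "TRUE"
        return [f"IF {cond} THEN buys_computer = {node}"]
    attr, branches = node                        # internal node: a split
    rules = []
    for value, child in branches.items():
        rules += tree_to_rules(child, path + ((attr, value),))
    return rules

# A toy tree: split on age, then (for youth) on student.
tree = ("age", {
    "youth": ("student", {"yes": "yes", "no": "no"}),
    "middle_aged": "yes",
    "senior": "no",
})
for rule in tree_to_rules(tree):
    print(rule)
```

The tree has four leaves, so exactly four rules come out, one per root-to-leaf path.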
Rule induction using a sequential covering algorithm
IF-THEN rules can be extracted directly from the training data (that is, without having to generate a decision tree
first) using a sequential covering algorithm. The name comes from the notion that the rules are learned sequentially
(one at a time), where each rule for a given class will ideally cover many of the tuples of that class (and hopefully
none of the tuples of other classes). Sequential covering algorithms are the most widely used approach to mining
disjunctive sets of classification rules, and form the topic of this subsection. Note that in a newer alternative approach,
classification rules can be generated using associative classification algorithms, which search for attribute-value
pairs that occur frequently in the data. These pairs may form association rules, which can be analysed and used in
classification.
There are many sequential covering algorithms. Popular variations include AQ, CN2, and the more recent RIPPER.
The general strategy is as follows:
• Rules are learned one at a time.
• Each time a rule is learned, the tuples covered by the rule are removed.
• The process repeats on the remaining tuples.
This sequential learning of rules is in contrast to decision tree induction. Because the path to each leaf in a decision
tree corresponds to a rule, we can consider decision tree induction as learning a set of rules simultaneously.
A basic sequential covering algorithm is shown in the figure below. Here, rules are learned for one class at a time.
Ideally, when learning a rule for a class, C_i, we would like the rule to cover all (or many) of the training tuples of
that class and none (or few) of the tuples from other classes. In this way, the rules learned should be of high accuracy.
The rules need not necessarily be of high coverage. This is because we can have more than one rule for a class, so
that different rules may cover different tuples within the same class.
Algorithm: Sequential covering. Learn a set of IF-THEN rules for classification.

Input:
  D, a data set of class-labelled tuples;
  Att_vals, the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:

(1) Rule_set = {}; // initial set of rules learned is empty
(2) for each class c do
(3)   repeat
(4)     Rule = Learn_One_Rule(D, Att_vals, c);
(5)     remove tuples covered by Rule from D;
(6)     Rule_set = Rule_set + Rule; // add new rule to rule set
(7)   until terminating condition;
(8) endfor
(9) return Rule_set;

Fig. 3.10 Basic sequential covering algorithm
The process continues until the terminating condition is met, such as when there are no more training tuples or the
quality of the rule returned is below a user-specified threshold. The Learn_One_Rule procedure finds the "best" rule
for the current class, given the current set of training tuples.
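A runnable sketch of what Learn_One_Rule might do, greedily appending the single attribute test that most improves rule accuracy, is shown below on a toy version of loan-application data. The representation (a rule as a dict of attribute tests) and the stopping criterion are our simplifications, not the text's specification.

```python
def rule_accuracy(rule, data, target_class):
    """Accuracy of a rule (a dict of attribute = value tests) on data:
    fraction of covered tuples whose label is target_class."""
    covered = [y for x, y in data
               if all(x[a] == v for a, v in rule.items())]
    return (sum(y == target_class for y in covered) / len(covered)
            if covered else 0.0)

def learn_one_rule(data, attributes, target_class):
    """Greedy general-to-specific search: start with an empty antecedent
    and keep appending the attribute test that most improves accuracy."""
    rule = {}
    while True:
        best, best_acc = None, rule_accuracy(rule, data, target_class)
        for attr in attributes:
            if attr in rule:
                continue
            for val in {x[attr] for x, _ in data}:
                cand = dict(rule, **{attr: val})
                acc = rule_accuracy(cand, data, target_class)
                if acc > best_acc:
                    best, best_acc = cand, acc
        if best is None:        # no conjunct improves the rule: stop
            return rule
        rule = best

data = [
    ({"income": "high", "term": "short"}, "accept"),
    ({"income": "high", "term": "long"}, "accept"),
    ({"income": "low", "term": "short"}, "reject"),
    ({"income": "low", "term": "long"}, "reject"),
]
print(learn_one_rule(data, ["income", "term"], "accept"))
# → {'income': 'high'}
```

On this toy data, the single test income = high already reaches perfect accuracy for the "accept" class, so the greedy search stops after one conjunct, mirroring the general-to-specific growth described in the text.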
Typically, rules are grown in a general-to-specific manner. We can think of this as a beam search, where we start off
with an empty rule and then gradually keep appending attribute tests to it. We append by adding the attribute test
as a logical conjunct to the existing condition of the rule antecedent. Suppose our training set, D, consists of loan
application data. Attributes regarding each applicant include their age, income, education level, residence, credit
rating, and the term of the loan. The classifying attribute is loan_decision, which indicates whether a loan is accepted
(considered safe) or rejected (considered risky). To learn a rule for the class "accept," we start off with the most
general rule possible, that is, the condition of the rule antecedent is empty. The rule is:
IF THEN loan_decision = accept.
[Figure: a search tree of candidate rules, growing from general to specific. The root is the empty rule "IF THEN loan_decision = accept"; its children each append one attribute test (for example, IF loan_term = short ..., IF income = high ..., IF loan_term = long ...); deeper nodes append further conjuncts, such as IF income = high AND age = youth THEN loan_decision = accept and IF income = high AND credit_rating = excellent THEN loan_decision = accept.]

Fig. 3.11 A general-to-specific search through rule space
We then consider each possible attribute test that may be added to the rule. These can be derived from the parameter Att_vals, which contains a list of attributes with their associated values. For example, for an attribute-value pair (att, val), we can consider attribute tests such as att = val, att ≤ val, att > val, and so on. Typically, the training data will contain many attributes, each of which may have several possible values. Finding an optimal rule set becomes computationally explosive.
Instead, Learn_One_Rule adopts a greedy depth-first strategy. Each time it is faced with adding a new attribute test (conjunct) to the current rule, it picks the one that most improves the rule quality, based on the training samples. We will say more about rule quality measures shortly. For the moment, let's say we use rule accuracy as our quality measure. Getting back to our example with the above figure, suppose Learn_One_Rule finds that the attribute test income = high best improves the accuracy of our current (empty) rule. We append it to the condition, so that the current rule becomes

IF income = high THEN loan_decision = accept.

Each time we add an attribute test to a rule, the resulting rule should cover more of the "accept" tuples. During the next iteration, we again consider the possible attribute tests and end up selecting credit_rating = excellent. Our current rule grows to become

IF income = high AND credit_rating = excellent THEN loan_decision = accept.

The process repeats, where at each step we continue to greedily grow rules until the resulting rule meets an acceptable quality level. Greedy search does not allow for backtracking: at each step, we heuristically add what appears to be the best choice at the moment. What if we unknowingly made a poor choice along the way? To lessen the chance of this happening, instead of selecting the single best attribute test to append to the current rule, we can select the best k attribute tests. In this way, we perform a beam search of width k, wherein we maintain the k best candidates overall at each step, rather than a single best candidate.
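The beam-of-width-k idea can be sketched generically. This is my own illustrative code, not the book's: beam_grow, score, and specialise are hypothetical names, and a rule is represented here as a frozenset of (attribute, value) tests.

```python
import itertools

def beam_grow(score, specialise, empty_rule, k=3, steps=2):
    """Generic beam search: at each step, specialise every rule in the beam
    by one attribute test and keep only the k best-scoring candidates,
    rather than committing to a single greedy choice."""
    beam = [empty_rule]
    for _ in range(steps):
        candidates = list(itertools.chain.from_iterable(specialise(r) for r in beam))
        if not candidates:
            break
        beam = sorted(candidates, key=score, reverse=True)[:k]
    return max(beam, key=score)
```

With k = 1 this degenerates to the greedy strategy; larger k keeps alternative partial rules alive, lessening the chance that one early poor choice ruins the final rule.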
Classification by backpropagation
Backpropagation is a neural network learning algorithm. The field of neural networks was originally kindled by psychologists and neurobiologists who sought to develop and test computational analogues of neurons. Roughly speaking, a neural network is a set of connected input/output units in which each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples. Neural network learning is also referred to as connectionist learning due to the connections between units. Neural networks involve long training times and are therefore more suitable for applications where this is feasible. They require a number of parameters that are typically best determined empirically, such as the network topology or "structure". Neural networks have been criticised for their poor interpretability. For example, it is difficult for humans to interpret the symbolic meaning behind the learned weights and of "hidden units" in the network. These features initially made neural networks less desirable for data mining.
A multilayer feed-forward neural network
The backpropagation algorithm performs learning on a multilayer feed-forward neural network. It iteratively learns a set of weights for prediction of the class label of tuples. A multilayer feed-forward neural network consists of an input layer, one or more hidden layers, and an output layer. An example of a multilayer feed-forward network is shown in the following figure.

Each layer is made up of units. The inputs to the network correspond to the attributes measured for each training tuple. The inputs are fed simultaneously into the units making up the input layer. These inputs pass through the input layer and are then weighted and fed simultaneously to a second layer of "neuronlike" units, known as a hidden layer. The outputs of the hidden layer units can be input to another hidden layer, and so on. The number of hidden layers is arbitrary, although in practice, usually only one is used. The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction for given tuples. The units in the input layer are called input units. The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic biological basis, or as output units. The multilayer neural network shown in the figure has two layers of output units.
[Figure: a network with input units x1, x2, x3, x4 in the input layer; a hidden layer whose unit j receives the weighted inputs w1j, w2j, ..., wij, ..., wnj and emits output oj; and an output layer whose unit k receives the weighted input wjk and emits output ok.]

Fig. 3.12 A multilayer feed-forward neural network
Therefore, we say that it is a two-layer neural network. (The input layer is not counted because it serves only to pass the input values to the next layer.) Similarly, a network containing two hidden layers is called a three-layer neural network, and so on. The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer. It is fully connected in that each unit provides input to each unit in the next forward layer.
Defining a network topology
Before beginning the training, the user must decide on the network topology by specifying the number of units in the input layer, the number of hidden layers (if more than one), the number of units in each hidden layer, and the number of units in the output layer.
Normalising the input values for each attribute measured in the training tuples will help speed up the learning phase. Typically, input values are normalised so as to fall between 0.0 and 1.0. Discrete-valued attributes may be encoded such that there is one input unit per domain value. For example, if an attribute A has three possible or known values, namely {a0, a1, a2}, then we may assign three input units, say I0, I1, I2, to represent A. Each unit is initialised to 0. If A = a0, then I0 is set to 1. If A = a1, I1 is set to 1, and so on. Neural networks can be used for either classification (to predict the class label of a given tuple) or prediction (to predict a continuous-valued output). For classification, one output unit may be used to represent two classes (where the value 1 represents one class and the value 0 represents the other). If there are more than two classes, then one output unit per class is used.
There are no clear rules as to the “best” number of hidden layer units. Network design is a trial-and-error process
and may affect the accuracy of the resulting trained network.
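The two input-preparation steps just described can be sketched as follows. This is an illustrative snippet under my own function names, assuming simple min-max scaling for the numeric case.

```python
def min_max_normalise(values):
    """Scale a numeric attribute's values into [0.0, 1.0] (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_unit_per_value(value, domain):
    """Encode a discrete attribute with one input unit per domain value:
    the unit for the observed value is set to 1, all others stay 0."""
    return [1.0 if value == d else 0.0 for d in domain]
```

For the attribute A with domain {a0, a1, a2}, the observed value a1 becomes the three input-unit activations [0, 1, 0].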
The initial values of the weights may also affect the resulting accuracy. Once a network has been trained, if its accuracy is not considered acceptable, it is common to repeat the training process with a different network topology or a different set of initial weights. Cross-validation techniques for accuracy estimation can be used to help decide when an acceptable network has been found. A number of automated techniques have been proposed that search for a "good" network structure. These typically use a hill-climbing approach that starts with an initial structure that is selectively modified.
Backpropagation
Backpropagation learns by iteratively processing a data set of training tuples, comparing the network's prediction for each tuple with the actual known target value. The target value may be the known class label of the training tuple (for classification problems) or a continuous value (for prediction). For each training tuple, the weights are modified so as to minimise the mean squared error between the network's prediction and the actual target value. These modifications are made in the "backwards" direction, that is, from the output layer, through each hidden layer down to the first hidden layer (hence the name backpropagation). Although it is not guaranteed, in general the weights will eventually converge, and the learning process stops. The algorithm is summarised in the following figure. The steps involved are expressed in terms of inputs, outputs, and errors, and may seem awkward if this is your first look at neural network learning. However, once you become familiar with the process, you will see that each step is inherently simple. The steps are described below.
Algorithm: Backpropagation. Neural network learning for classification or prediction, using the backpropagation algorithm.

Input:
• D, a data set consisting of the training tuples and their associated target values;
• l, the learning rate;
• network, a multilayer feed-forward network.

Output: A trained neural network.

Method:
(1) Initialise all weights and biases in network;
(2) while terminating condition is not satisfied {
(3)   for each training tuple X in D {
(4)     // Propagate the inputs forward:
(5)     for each input layer unit j {
(6)       Oj = Ij; } // output of an input unit is its actual input value
(7)     for each hidden or output layer unit j {
(8)       Ij = Σi wij Oi + θj; // compute the net input of unit j with respect to the previous layer, i
(9)       Oj = 1 / (1 + e^(−Ij)); } // compute the output of each unit j
(10)    // Backpropagate the errors:
(11)    for each unit j in the output layer
(12)      Errj = Oj(1 − Oj)(Tj − Oj); // compute the error
(13)    for each unit j in the hidden layers, from the last to the first hidden layer
(14)      Errj = Oj(1 − Oj) Σk Errk wjk; // compute the error with respect to the next higher layer, k
(15)    for each weight wij in network {
(16)      Δwij = (l) Errj Oi; // weight increment
(17)      wij = wij + Δwij; } // weight update
(18)    for each bias θj in network {
(19)      Δθj = (l) Errj; // bias increment
(20)      θj = θj + Δθj; } // bias update
(21)  } }
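The update rules above can be turned into a small runnable sketch. This is an illustrative implementation under my own naming (TinyNet, train_one), not the book's code: one hidden layer, sigmoid units, and per-tuple (case) updating of weights and biases following steps (8)-(20).

```python
import math
import random

def sigmoid(x):
    # O_j = 1 / (1 + e^(-I_j))
    return 1.0 / (1.0 + math.exp(-x))

class TinyNet:
    """One hidden layer, trained by backpropagation: Err_j = O_j(1-O_j)(T_j-O_j)
    at the output layer, errors propagated back to the hidden layer, then
    w_ij += l * Err_j * O_i and theta_j += l * Err_j."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        r = lambda: rng.uniform(-0.5, 0.5)   # small random initial weights
        self.w_ih = [[r() for _ in range(n_hidden)] for _ in range(n_in)]
        self.w_ho = [[r() for _ in range(n_out)] for _ in range(n_hidden)]
        self.b_h = [r() for _ in range(n_hidden)]
        self.b_o = [r() for _ in range(n_out)]

    def forward(self, x):
        # Propagate the inputs forward through hidden and output layers.
        h = [sigmoid(sum(x[i] * self.w_ih[i][j] for i in range(len(x))) + self.b_h[j])
             for j in range(len(self.b_h))]
        o = [sigmoid(sum(h[j] * self.w_ho[j][k] for j in range(len(h))) + self.b_o[k])
             for k in range(len(self.b_o))]
        return h, o

    def train_one(self, x, target, l=0.5):
        h, o = self.forward(x)
        # Backpropagate the errors:
        err_o = [o[k] * (1 - o[k]) * (target[k] - o[k]) for k in range(len(o))]
        err_h = [h[j] * (1 - h[j]) * sum(err_o[k] * self.w_ho[j][k] for k in range(len(o)))
                 for j in range(len(h))]
        # Update output-layer weights and biases:
        for j in range(len(h)):
            for k in range(len(o)):
                self.w_ho[j][k] += l * err_o[k] * h[j]
        for k in range(len(o)):
            self.b_o[k] += l * err_o[k]
        # Update hidden-layer weights and biases:
        for i in range(len(x)):
            for j in range(len(h)):
                self.w_ih[i][j] += l * err_h[j] * x[i]
        for j in range(len(h)):
            self.b_h[j] += l * err_h[j]
```

Repeatedly calling train_one on a training tuple drives the network's output towards the target value, which is the convergence behaviour the algorithm relies on.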
3.8.2 Clustering
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labelling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (for example, using clustering), and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups.
Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing. In business, clustering can help marketers discover distinct groups in their customer bases and characterise customer groups based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, categorise genes with similar functionality, and gain insight into structures inherent in populations. Clustering may also help in the identification of areas of similar land use in an earth observation database and in the identification of groups of houses in a city according to house type, value, and geographic location, as well as the identification of groups of automobile insurance policy holders with a high average claim cost. It can also be used to help classify documents on the Web for information discovery. Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection, where outliers (values that are "far away" from any cluster) may be more interesting than common cases.
Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce. For example, exceptional cases in credit card transactions, such as very expensive and frequent purchases, may be of interest as possible fraudulent activity. As a data mining function, cluster analysis can be used as a stand-alone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing step for other algorithms, such as characterisation, attribute subset selection, and classification, which would then operate on the detected clusters and the selected attributes or features.
Data clustering is under vigorous development. Contributing areas of research include data mining, statistics, machine learning, spatial database technology, biology, and marketing. Owing to the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research. As a branch of statistics, cluster analysis has been extensively studied for many years, focusing mainly on distance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS. In machine learning, clustering is an example of unsupervised learning. Unlike classification, clustering and unsupervised learning do not rely on predefined classes and class-labelled training examples. For this reason, clustering is a form of learning by observation, rather than learning by examples. In data mining, efforts have focused on finding methods for efficient and effective cluster analysis in large databases. Active themes of research focus on the scalability of clustering methods, the effectiveness of methods for clustering complex shapes and types of data, high-dimensional clustering techniques, and methods for clustering mixed numerical and categorical data in large databases.
Clustering is a challenging field of research, and its potential applications pose their own special requirements. The following are typical requirements of clustering in data mining:
• Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.
• Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.
• Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
• Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but also makes the quality of clustering difficult to control.
• Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.
• Incremental clustering and insensitivity to the order of input records: Some clustering algorithms cannot incorporate newly inserted data (that is, database updates) into existing clustering structures and, instead, must determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data. That is, given a set of data objects, such an algorithm may return dramatically different clusterings depending on the order of presentation of the input objects. It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input.
• High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.
• Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city's rivers and highway networks, and the type and number of customers per cluster. A challenging task is to find groups of data with good clustering behaviour that satisfy specified constraints.
• Interpretability and usability: Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and methods.

Table 3.2 Requirements of clustering in data mining
Types of data in cluster analysis
Suppose that a data set to be clustered contains n objects, which may represent persons, houses, documents, countries, and so on. Main memory-based clustering algorithms typically operate on either of the following two data structures.

Data matrix (or object-by-variable structure):
This represents n objects, such as persons, with p variables (also called measurements or attributes), such as age, height, weight, gender, and so on. The structure is in the form of a relational table, or n-by-p matrix (n objects × p variables):

    [ x11  ...  x1f  ...  x1p ]
    [ ...  ...  ...  ...  ... ]
    [ xi1  ...  xif  ...  xip ]
    [ ...  ...  ...  ...  ... ]
    [ xn1  ...  xnf  ...  xnp ]
Dissimilarity matrix (or object-by-object structure):
This stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table:

    [ 0                          ]
    [ d(2,1)  0                  ]
    [ d(3,1)  d(3,2)  0          ]
    [ ...     ...     ...        ]
    [ d(n,1)  d(n,2)  ...     0  ]

where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i, j) is a nonnegative number that is close to 0 when objects i and j are highly similar or "near" each other, and becomes larger the more they differ. Since d(i, j) = d(j, i) and d(i, i) = 0, we have the above matrix.
The rows and columns of the data matrix represent different entities, while those of the dissimilarity matrix represent the same entity. Thus, the data matrix is often called a two-mode matrix, whereas the dissimilarity matrix is called a one-mode matrix. Many clustering algorithms operate on a dissimilarity matrix. If the data are presented in the form of a data matrix, they can first be transformed into a dissimilarity matrix before applying such clustering algorithms.
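The transformation from a data matrix to a dissimilarity matrix can be sketched as follows; this is an illustrative snippet assuming Euclidean distance as the dissimilarity measure d(i, j).

```python
import math

def dissimilarity_matrix(data):
    """Turn an n-by-p data matrix (a list of n rows with p numeric values)
    into the n-by-n object-by-object matrix of Euclidean distances d(i, j).
    Only the lower triangle is computed; symmetry d(i, j) = d(j, i) fills
    the rest, and d(i, i) = 0 on the diagonal."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(data[i], data[j])))
            d[i][j] = d[j][i] = dist
    return d
```

Because the result is symmetric with a zero diagonal, only the n(n−1)/2 lower-triangle entries actually need to be stored, which is exactly the one-mode structure described above.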
Categorisation of major clustering methods
Many clustering algorithms exist in the literature. It is difficult to provide a crisp categorisation of clustering methods because these categories may overlap, so that a method may have features from several categories. Nevertheless, it is useful to present a relatively organised picture of the different clustering methods. In general, the major clustering methods can be classified into the following categories.

Partitioning methods
To achieve global optimality in partitioning-based clustering, we would require the exhaustive enumeration of all of the possible partitions. Heuristic partitioning methods work well for finding spherical-shaped clusters in small to medium-sized databases. To find clusters with complex shapes and to cluster very large data sets, partitioning-based methods need to be extended. The most well-known and commonly used partitioning methods are k-means, k-medoids, and their variations.
k-means algorithm
The k-means algorithm proceeds as follows:
First, it randomly selects k of the objects, each of which initially represents a cluster mean or centre. Each of the remaining objects is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. It then computes the new mean for each cluster. This process iterates until the criterion function converges. Typically, the square-error criterion is used, defined as

    E = Σ_{i=1..k} Σ_{p ∈ Ci} |p − mi|²

where E is the sum of the square error for all objects in the data set; p is the point in space representing a given object; and mi is the mean of cluster Ci (both p and mi are multidimensional). In other words, for each object in each cluster, the distance from the object to its cluster centre is squared, and the distances are summed. This criterion tries to make the resulting k clusters as compact and as separate as possible.
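The k-means loop just described can be sketched in a few lines of Python. This is an illustration, not library code: for determinism it seeds the means with the first k points instead of a random selection.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    """Plain k-means: seed with the first k points (a deterministic
    stand-in for random selection), then alternate assigning each object
    to its nearest mean and recomputing the means, until the assignment
    stops changing (the square-error criterion has converged)."""
    centres = [list(p) for p in points[:k]]
    assign = None
    for _ in range(max_iter):
        new_assign = [min(range(k), key=lambda c: dist2(p, centres[c]))
                      for p in points]
        if new_assign == assign:
            break
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centres[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return assign, centres
```

On two well-separated groups of 2-D points, the loop converges in a couple of iterations to one mean per group.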
k-medoids algorithm
Each remaining object is clustered with the representative object to which it is the most similar. The partitioning method is then performed based on the principle of minimising the sum of the dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion is used, defined as

    E = Σ_{j=1..k} Σ_{p ∈ Cj} |p − oj|

where E is the sum of the absolute error for all objects in the data set; p is the point in space representing a given object in cluster Cj; and oj is the representative object of Cj. In general, the algorithm iterates until, eventually, each representative object is actually the medoid, or most centrally located object, of its cluster. This is the basis of the k-medoids method for grouping n objects into k clusters.
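The absolute-error criterion E can be computed directly; a candidate swap of a representative object is then accepted only if it lowers E. This is an illustrative helper of my own, taking the dissimilarity function as a parameter.

```python
def absolute_error(points, medoids, dist):
    """E = sum over all objects p of dist(p, o_j), where o_j is the
    representative object (medoid) nearest to p."""
    return sum(min(dist(p, m) for m in medoids) for p in points)
```

For instance, on the one-dimensional objects {0, 1, 9, 10} with dist(a, b) = |a − b|, the medoid set {0, 10} gives E = 2, while the poorer set {0, 1} gives E = 17, so the swap from {0, 1} to {0, 10} would be accepted.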
The k-medoids clustering process is discussed below.
• The initial representative objects (or seeds) are chosen arbitrarily.
• The iterative process of replacing representative objects by non-representative objects continues as long as the quality of the resulting clustering is improved.
• This quality is estimated using a cost function that measures the average dissimilarity between an object and the representative object of its cluster.
• To determine whether a non-representative object, o_random, is a good replacement for a current representative object, o_j, the following four cases are examined for each of the non-representative objects, p.

Case 1: p currently belongs to representative object o_j. If o_j is replaced by o_random as a representative object and p is closest to one of the other representative objects, o_i, i ≠ j, then p is reassigned to o_i.

Case 2: p currently belongs to representative object o_j. If o_j is replaced by o_random as a representative object and p is closest to o_random, then p is reassigned to o_random.

Case 3: p currently belongs to representative object o_i, i ≠ j. If o_j is replaced by o_random as a representative object and p is still closest to o_i, then the assignment does not change.

Case 4: p currently belongs to representative object o_i, i ≠ j. If o_j is replaced by o_random as a representative object and p is closest to o_random, then p is reassigned to o_random.
Hierarchical methods
A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition holds. The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.

In general, there are two types of hierarchical clustering methods:
• Agglomerative hierarchical clustering: This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. Most hierarchical clustering methods belong to this category. They differ only in their definition of intercluster similarity.
• Divisive hierarchical clustering: This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies certain termination conditions, such as a desired number of clusters being obtained or the diameter of each cluster being within a certain threshold.
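The agglomerative (bottom-up) strategy can be sketched as follows. This is an illustrative O(n³) version using single-link intercluster similarity (the minimum pairwise distance) as its merge criterion; real implementations differ mainly in that choice of similarity and in efficiency.

```python
def agglomerative(points, k, dist):
    """Bottom-up clustering sketch: start with each object in its own
    (singleton) cluster, then repeatedly merge the two closest clusters,
    where closeness is the single-link (minimum pairwise) distance,
    until the termination condition of k clusters remaining holds."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None                      # (distance, i, j) of closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Replacing the inner `min` with `max` would give complete-link clustering, and with an average would give average-link clustering, illustrating how the methods "differ only in their definition of intercluster similarity".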
Density-based methods
Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes. Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the "neighbourhood" exceeds some threshold; that is, for each data point within a given cluster, the neighbourhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.

Density-based spatial clustering of applications with noise
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. The algorithm grows regions with sufficiently high density into clusters and discovers clusters of arbitrary shape in spatial databases with noise. It defines a cluster as a maximal set of density-connected points.
The basic ideas of density-based clustering involve a number of new definitions.
• The neighbourhood within a radius ε of a given object is called the ε-neighbourhood of the object.
• If the ε-neighbourhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.
• Given a set of objects, D, we say that an object p is directly density-reachable from object q if p is within the ε-neighbourhood of q, and q is a core object.
• An object p is density-reachable from object q with respect to ε and MinPts in a set of objects, D, if there is a chain of objects p1, ..., pn, where p1 = q and pn = p, such that p_{i+1} is directly density-reachable from p_i with respect to ε and MinPts, for 1 ≤ i < n, p_i ∈ D.
• An object p is density-connected to object q with respect to ε and MinPts in a set of objects, D, if there is an object o ∈ D such that both p and q are density-reachable from o with respect to ε and MinPts.
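These definitions translate into a compact clustering procedure. The sketch below is my own simplified rendering of DBSCAN (labels, queue handling, and the sentinel values are illustrative choices): clusters grow outwards from core objects, and objects density-reachable from no core object are labelled noise (−1).

```python
def dbscan(points, eps, min_pts, dist):
    """Grow clusters from core objects (objects whose eps-neighbourhood,
    including themselves, holds at least min_pts objects). Border points
    are absorbed into a cluster; everything else is labelled noise (-1)."""
    UNSEEN, NOISE = -2, -1
    labels = [UNSEEN] * len(points)

    def neighbours(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] != UNSEEN:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # not a core object; may yet become a border point
            continue
        cluster += 1                   # i is a core object: start a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # density-reachable border point
            if labels[j] != UNSEEN:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:   # j is also a core object: keep growing
                queue.extend(nbrs)
    return labels
```

Each resulting cluster is a maximal set of density-connected points, matching the definition above.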
OPTICS: Ordering Points To Identify the Clustering Structure
A cluster analysis method called OPTICS was proposed. Rather than produce a data set clustering explicitly, OPTICS computes an augmented cluster ordering for automatic and interactive cluster analysis. This ordering represents the density-based clustering structure of the data. It contains information that is equivalent to density-based clustering obtained from a wide range of parameter settings. The cluster ordering can be used to extract basic clustering information (such as cluster centres or arbitrary-shaped clusters) as well as provide the intrinsic clustering structure.
Grid-based methods
Grid-based methods quantise the object space into a finite number of cells that form a grid structure. All of the clustering operations are performed on the grid structure (that is, on the quantised space). The main benefit of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension in the quantised space. STING is a typical example of a grid-based method.

STING: Statistical Information Grid
STING is a grid-based multiresolution clustering technique in which the spatial area is divided into rectangular cells. There are usually several levels of such rectangular cells corresponding to different levels of resolution, and these cells form a hierarchical structure: each cell at a high level is partitioned to form a number of cells at the next lower level. Statistical information regarding the attributes in each grid cell (such as the mean, maximum, and minimum values) is precomputed and stored.
[Figure: a pyramid of grid levels, from the 1st layer at the top down through the (i−1)-st layer to the ith layer, with each cell partitioned into finer cells at the next lower level.]

Fig. 3.13 A hierarchical structure for STING clustering
WaveCluster: Clustering using wavelet transformation
The advantages of wavelet transformation for clustering are explained below:
• It provides unsupervised clustering. It uses hat-shaped filters that emphasise regions where the points cluster, while suppressing weaker information outside of the cluster boundaries. Thus, dense regions in the original feature space act as attractors for nearby points and as inhibitors for points that are further away. This means that the clusters in the data automatically stand out and "clear" the regions around them. A further advantage is that wavelet transformation can automatically result in the removal of outliers.
• The multiresolution property of wavelet transformations can help detect clusters at varying levels of accuracy.
• Wavelet-based clustering is very fast, with a computational complexity of O(n), where n is the number of objects in the database. The algorithm implementation can be made parallel.
Model-based methods
Model-based methods hypothesise a model for each of the clusters and find the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points. It also leads to a way of automatically determining the number of clusters based on standard statistics, taking "noise" or outliers into account and thus yielding robust clustering methods. EM is an algorithm that performs expectation-maximisation analysis based on statistical modelling. COBWEB is a conceptual learning algorithm that performs probability analysis and takes concepts as a model for clusters.

Expectation-Maximisation
The EM (Expectation-Maximisation) algorithm is a popular iterative refinement algorithm that can be used for finding the parameter estimates. It can be viewed as an extension of the k-means paradigm, which assigns an object to the cluster with which it is most similar, based on the cluster mean. The algorithm is described as follows:
• Make an initial guess of the parameter vector: this involves randomly selecting k objects to represent the cluster means or centres (as in k-means partitioning), as well as making guesses for the additional parameters.
Fig. 3.14 EM algorithm

Each cluster can be represented by a probability distribution, centred at a mean and with a standard deviation. Here, we have two clusters, corresponding to the Gaussian distributions g(m1, σ1) and g(m2, σ2), respectively, where the dashed circles represent the first standard deviation of the distributions.
• Iteratively refine the parameters (or clusters) based on the following two steps:
(a) Expectation step: assign each object x_i to cluster C_k with the probability

P(x_i ∈ C_k) = p(C_k | x_i) = p(C_k) p(x_i | C_k) / p(x_i)

where p(x_i | C_k) = N(m_k, E_k(x_i)) follows the normal (that is, Gaussian) distribution around mean m_k, with expectation E_k. In other words, this step calculates the probability of cluster membership of object x_i, for each of the clusters. These probabilities are the "expected" cluster memberships for object x_i.
(b) Maximisation step: use the probability estimates from above to re-estimate (or refine) the model parameters. This step is the "maximisation" of the likelihood of the distributions given the data.
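The two steps above can be sketched for a one-dimensional mixture of two Gaussians. This is a minimal illustration, not a production implementation; the variable names and the synthetic data are our own.

```python
import numpy as np

# Synthetic data: two Gaussian clusters centred at 0 and 5.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.0, 300)])

# Initial guess of the parameter vector.
means = np.array([x.min(), x.max()])
sigmas = np.array([1.0, 1.0])
weights = np.array([0.5, 0.5])

def gaussian(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(50):
    # Expectation step: probability of cluster membership for each x_i.
    dens = np.stack([w * gaussian(x, m, s)
                     for w, m, s in zip(weights, means, sigmas)])
    resp = dens / dens.sum(axis=0)          # P(x_i in C_k)

    # Maximisation step: re-estimate the parameters from those probabilities.
    nk = resp.sum(axis=1)
    means = (resp * x).sum(axis=1) / nk
    sigmas = np.sqrt((resp * (x - means[:, None]) ** 2).sum(axis=1) / nk)
    weights = nk / len(x)

print("estimated means:", np.round(sorted(means), 1))
```

After a few dozen iterations the estimated means settle close to the true cluster centres, which is the "iterative refinement" the text describes.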
3.8.3 Association Rules
The goal of the techniques described in this topic is to detect relationships or associations between specific values of categorical variables in large data sets. This is a common task in many data mining projects, as well as in the data mining subcategory of text mining. These powerful exploratory techniques have a wide range of applications in many areas of business practice and research.
Working of association rules
The usefulness of this technique to address unique data mining problems is best illustrated in a simple example.
Suppose we are collecting data at the check-out cash registers at a large book store. Each customer transaction is
logged in a database, and consists of the titles of the books purchased by the respective customer, perhaps additional
magazine titles and other gift items that were purchased, and so on. Hence, each record in the database will represent
one customer (transaction), and may consist of a single book purchased by that customer, or it may consist of many
(perhaps hundreds of) different items that were purchased, arranged in an arbitrary order depending on the order
in which the different items (books, magazines, and so on) came down the conveyor belt at the cash register. The
purpose of the analysis is to find associations between the items that were purchased, that is, to derive association rules, which identify the items and co-occurrences of various items that appear with the greatest (co-)frequencies.
The rules of association are discussed below:
• Sequence analysis: Sequence analysis is concerned with a subsequent purchase of a product or products given a previous buy. For instance, buying an extended warranty is more likely to follow (in that specific sequential order) the purchase of a TV or other electric appliance. Sequence rules, however, are not always that obvious, and sequence analysis helps you to extract such rules no matter how hidden they may be in your market basket data. There is a wide range of applications for sequence analysis in many areas of industry, including customer shopping patterns, phone call patterns, fluctuations of the stock market, DNA sequences, and Web log streams.
• Link analysis: Once extracted, rules about associations or the sequences of items as they occur in a transaction database can be extremely useful for numerous applications. Obviously, in retailing or marketing, knowledge of purchase "patterns" can help with the direct marketing of special offers to the "right" or "ready" customers (that is, those who, according to the rules, are most likely to purchase specific items given their observed past consumption patterns). However, transaction databases occur in many areas of business, such as banking. In fact, the term "link analysis" is often used when these techniques for extracting sequential or non-sequential association rules are applied to organise complex "evidence". It is easy to see how the "transactions" or "shopping basket" metaphor can be applied to situations where individuals engage in certain actions, open accounts, contact other specific individuals, and so on. Applying the technologies described here to such databases may quickly extract patterns and associations between individuals and actions and, hence, for example, reveal the patterns and structure of some clandestine illegal network.
• Unique data analysis requirements: Cross-tabulation tables, and in particular multiple response tables, can be used to analyse data of this kind. However, when the number of different items (categories) in the data is very large (and not known ahead of time), and when the "factorial degree" of important association rules is not known ahead of time, these tabulation facilities may be too cumbersome to use, or simply not applicable. Consider once more the simple bookstore example discussed earlier. First, the number of book titles is practically unlimited. In other words, if we were to make a table where each book title represented one dimension, and the purchase of that book (yes/no) were the classes or categories for each dimension, then the complete cross-tabulation table would be huge and sparse (consisting mostly of empty cells). Alternatively, we could construct all possible two-way tables from all items available in the store; this would allow us to detect two-way associations (association rules) between items. However, the number of tables that would have to be constructed would again be huge, most of the two-way tables would be sparse and, worse, if there were any three-way association rules "hiding" in the data, we would miss them completely. The a-priori algorithm implemented in Association Rules will not only automatically detect the relationships ("cross-tabulation tables") that are important (that is, cross-tabulation tables that are not sparse, not containing mostly zeros), but also determine the factorial degree of the tables that contain the important association rules.
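The a-priori idea mentioned above can be sketched compactly: only itemsets whose every subset is frequent can themselves be frequent, so each pass extends the survivors of the previous level. The toy transactions below are invented for illustration.

```python
from itertools import combinations

# Invented bookstore transactions.
transactions = [
    {"novel", "magazine", "bookmark"},
    {"novel", "bookmark"},
    {"novel", "magazine"},
    {"magazine", "giftcard"},
    {"novel", "magazine", "bookmark"},
]
min_support = 3   # absolute count of transactions containing the itemset

items = sorted({i for t in transactions for i in t})

def count(itemset):
    return sum(itemset <= t for t in transactions)

# Level 1: frequent single items.
frequent = [frozenset([i]) for i in items if count(frozenset([i])) >= min_support]
all_frequent = list(frequent)

k = 2
while frequent:
    # Candidate k-itemsets from unions of frequent (k-1)-itemsets,
    # keeping only those with every (k-1)-subset frequent (the pruning step).
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in set(frequent)
                         for s in combinations(c, k - 1))}
    frequent = [c for c in candidates if count(c) >= min_support]
    all_frequent += frequent
    k += 1

print(sorted(sorted(s) for s in all_frequent))
```

On this data, {magazine, bookmark} is infrequent, so the candidate triple {novel, magazine, bookmark} is pruned before it is ever counted; that pruning is what keeps the search away from the huge space of sparse tables discussed above.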
Tabular representation of associations
Association rules are generated in the general form if Body then Head, where Body and Head stand for single codes or text values (items) or conjunctions of codes or text values (items); for example, if (Car=Porsche and Age<20) then (Risk=High and Insurance=High). The major statistics computed for association rules are Support (relative frequency of the Body or Head of the rule), Confidence (conditional probability of the Head given the Body of the rule), and Correlation (support for Body and Head, divided by the square root of the product of the support for the Body and the support for the Head). These statistics can be summarised in a spreadsheet, as shown below.
Fig. 3.15 Tabular representation of association
(Source: http://www.statsoft.com/textbook/association-rules/)
This results spreadsheet shows an example of how association rules can be applied to text mining tasks. This analysis was performed on the paragraphs (dialogue spoken by the characters in the play) in the first scene of Shakespeare's "All's Well That Ends Well", after removing a few very frequent words such as is, of, and so on. The values for support, confidence, and correlation are expressed in percent.
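The three statistics defined above can be computed directly from their definitions. The rule and the transactions below are invented for illustration: if {novel} then {bookmark}.

```python
import math

# Invented transactions.
transactions = [
    {"novel", "bookmark"}, {"novel", "bookmark"}, {"novel"},
    {"magazine"}, {"magazine", "bookmark"},
]
body, head = {"novel"}, {"bookmark"}
n = len(transactions)

def support(itemset):
    # Relative frequency of transactions containing the itemset.
    return sum(itemset <= t for t in transactions) / n

supp_body = support(body)
supp_head = support(head)
supp_both = support(body | head)

# Confidence: conditional probability of the Head given the Body.
confidence = supp_both / supp_body
# Correlation: support for Body and Head over the geometric mean
# of the individual supports.
correlation = supp_both / math.sqrt(supp_body * supp_head)

print(f"support={supp_both:.2f} confidence={confidence:.2f} "
      f"correlation={correlation:.2f}")
```

Here the rule holds in 2 of 5 transactions (support 0.40) and in 2 of the 3 transactions containing the Body (confidence about 0.67), matching the definitions given above.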
Graphical representation of association
As a result of applying association rules data mining techniques to large datasets, rules of the form if "Body" then "Head" will be derived, where Body and Head stand for simple codes or text values (items), or conjunctions of codes and text values (items); for example, if (Car=Porsche and Age<20) then (Risk=High and Insurance=High). These rules can be reviewed in textual format, in tables, or in graphical format.
Association Rules Networks, 3D: Association rules can be graphically summarised in 2D Association Networks as well as 3D Association Networks. Shown below are some (very clear) results from an analysis. Respondents in a survey were asked to list their (up to) 3 favourite fast foods. The association rules derived from those data are summarised in a 3D Association Network display.
Fig. 3.16 Association Rules Networks, 3D
(Source: http://www.statsoft.com/textbook/association-rules/)
Summary
• Data mining refers to the process of finding interesting patterns in data that are not explicitly part of the data.
• Data mining techniques can be applied to a wide range of data repositories, including databases, data warehouses, spatial data, multimedia data, Internet or Web-based data, and complex objects.
• The Knowledge Discovery in Databases process consists of a few steps leading from raw data collections to some form of new knowledge.
• Data selection and data transformation can also be combined, where the consolidation of the data is the result of the selection or, as in the case of data warehouses, the selection is done on transformed data.
• Data mining derives its name from the similarities between searching for valuable information in a large database and mining rocks for a vein of valuable ore.
• The concept of bagging applies to the area of predictive data mining, to combine the predicted classifications from multiple models, or from the same type of model for different learning data.
• The Cross-Industry Standard Process for Data Mining (CRISP-DM) was developed in 1996 by analysts representing DaimlerChrysler, SPSS, and NCR. CRISP provides a non-proprietary and freely available standard process for fitting data mining into the general problem-solving strategy of a business or research unit.
• With the increasing demand for the analysis of large amounts of structured data, graph mining has become an active and important theme in data mining.
• The graph is typically very large, with nodes corresponding to objects and edges corresponding to links representing relationships or interactions between objects. Both nodes and links have attributes.
• Social networks are rarely static. Their graph representations evolve as nodes and edges are added or deleted over time.
• Multirelational data mining (MRDM) methods search for patterns that involve multiple tables (relations) from a relational database.
• The data mining algorithm is the mechanism that creates mining models.
• Classification is the task of generalising known structure to apply to new data.
• A cluster of data objects can be treated collectively as one group and so may be considered a form of data compression.
• The goal of association rule techniques is to detect relationships or associations between specific values of categorical variables in large data sets. This is a common task in many data mining projects as well as in the data mining subcategory of text mining.
References
• Seifert, J. W., 2004. Data Mining: An Overview [Online PDF] Available at: <http://www.fas.org/irp/crs/RL31798.pdf> [Accessed 9 September 2011].
• Alexander, D., Data Mining [Online] Available at: <http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/> [Accessed 9 September 2011].
• Han, J., Kamber, M. and Pei, J., 2011. Data Mining: Concepts and Techniques, 3rd ed., Elsevier.
• Adriaans, P., 1996. Data Mining, Pearson Education India.
• StatSoft, 2010. Data Mining, Cluster Techniques - Session 28 [Video Online] Available at: <http://www.youtube.com/watch?v=WvR_0Vs1U8w> [Accessed 12 September 2011].
• Swallacebithead, 2010. Using Data Mining Techniques to Improve Forecasting [Video Online] Available at: <http://www.youtube.com/watch?v=UYkf3i6LT3Q> [Accessed 12 September 2011].
Recommended Reading
• Chattamvelli, R., 2011. Data Mining Algorithms, Alpha Science International Ltd.
• Thuraisingham, B. M., 1999. Data Mining: Technologies, Techniques, Tools, and Trends, CRC Press.
• Witten, I. H. and Frank, E., 2005. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann.
Self Assessment
1. Which of the following refers to the process of finding interesting patterns in data that are not explicitly part of the data?
a. Data mining
b. Data warehousing
c. Data extraction
d. Metadata
2. What is the full form of KDD?
a. Knowledge Data Defining
b. Knowledge Defining Database
c. Knowledge Discovery in Database
d. Knowledge Database Discovery
3. Which of the following is used to address the inherent instability of results while applying complex models to relatively small data sets?
a. Boosting
b. Bagging
c. Data reduction
d. Data preparation
4. The concept of ________ in predictive data mining refers to the application of a model for prediction or classification to new data.
a. bagging
b. boosting
c. drill-down analysis
d. deployment
5. What is referred to as Stacking (Stacked Generalisation)?
a. Meta-learning
b. Metadata
c. Deployment
d. Boosting
6. Which statement is false?
a. The concept of meta-learning applies to the area of predictive data mining, to combine the predictions from multiple models.
b. In the business environment, complex data mining projects may require the coordinated efforts of various experts, stakeholders, or departments throughout an entire organisation.
c. The concept of drill-down analysis applies to the area of data mining, to denote the interactive exploration of data, in particular of large databases.
d. Text mining is usually applied to identify data mining projects with the goal to identify a statistical or neural network model or set of models that can be used to predict some response of interest.
7. The Cross-Industry Standard Process for Data Mining (CRISP-DM) was developed in _______.
a. 1995
b. 1996
c. 1997
d. 1998
8. Match the columns.
1. Densification power law — A. It is a heterogeneous and multirelational data set represented by a graph.
2. Shrinking diameter — B. This phenomenon is represented in the preferential attachment model.
3. Heavy-tailed out-degree and in-degree distributions — C. This contradicts an earlier belief that the diameter slowly increases as a function of network size.
4. Social network analysis — D. This was known as the constant average degree assumption.
a. 1-A, 2-B, 3-C, 4-D
b. 1-D, 2-C, 3-B, 4-A
c. 1-B, 2-A, 3-D, 4-C
d. 1-C, 2-D, 3-A, 4-B
9. The __________ algorithm is a popular iterative refinement algorithm that can be used for finding the parameter estimates.
a. Model-based methods
b. WaveCluster
c. Expectation-Maximisation
d. Grid-based method
10. _________ is a grid-based multiresolution clustering technique in which the spatial area is divided into rectangular cells.
a. STING
b. KDD
c. DBSCAN
d. OPTICS
Chapter IV
Web Application of Data Mining
Aim
The aim of this chapter is to:
• introduce the concept of knowledge discovery in databases
• analyse different goals of data mining and knowledge discovery
• explore the knowledge discovery process
Objectives
The objectives of this chapter are to:
• highlight web mining
• describe what graph mining is
• elucidate web content mining
Learning outcome
At the end of this chapter, you will be able to:
• enlist types of knowledge discovered during data mining
• comprehend benefits of web mining
• understand web structure mining and web usage mining
4.1 Introduction
Knowledge Discovery in Databases, frequently abbreviated as KDD, typically encompasses more than data mining. The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.
As an example, consider a transaction database maintained by a speciality consumer goods retailer. Suppose the client data include a customer name, zip code, phone number, date of purchase, item code, price, quantity, and total amount. A variety of new knowledge can be discovered by KDD processing on this client database. During data selection, data about specific items or categories of items, or from stores in a specific region or area of the country, may be selected. The data cleansing process corrects invalid zip codes or eliminates records with incorrect phone prefixes. Enrichment typically enhances the data with additional sources of information. For example, given the client names and phone numbers, the store may purchase other data about age, income, and credit rating and append them to each record.
Data transformation and encoding may be done to reduce the amount of data. For instance, item codes may be grouped in terms of product categories into audio, video, supplies, electronic gadgets, camera, accessories, and so on. Zip codes may be aggregated into geographic regions, incomes may be divided into ranges, and so on. If data mining is based on an existing warehouse for this retail store chain, we would expect that the cleaning has already been applied. It is only after such preprocessing that data mining techniques are used to mine different rules and patterns.
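The transformation and encoding step described above can be sketched in a few lines. The code table and income cut-offs below are invented purely for illustration:

```python
# Hypothetical mapping from raw item codes to product categories.
category_of = {
    "A10": "audio", "A11": "audio",
    "V20": "video", "C30": "camera",
}

def income_band(income):
    # Binning incomes into coarse ranges shrinks the domain
    # that the mining step has to explore.
    if income < 25_000:
        return "low"
    if income < 75_000:
        return "middle"
    return "high"

records = [
    {"item": "A10", "income": 18_000},
    {"item": "V20", "income": 64_000},
    {"item": "C30", "income": 90_000},
]
encoded = [{"category": category_of[r["item"]],
            "income_band": income_band(r["income"])} for r in records]
print(encoded)
```

The same pattern applies to aggregating zip codes into regions: a lookup table replaces many distinct raw values with a handful of encoded ones before mining begins.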
We can see that many possibilities exist for discovering new knowledge about buying patterns, relating factors such as age, income group, and place of residence to what and how much the customers purchase. This information can then be utilised to plan additional store locations based on demographics, to run store promotions, to combine items in advertisements, or to plan seasonal marketing strategies. As this retail store example shows, data mining must be preceded by significant data preparation before it can yield useful information that can directly influence business decisions. The results of data mining may be reported in a variety of formats, such as listings, graphic outputs, summary tables, or visualisations.
Fig. 4.1 Knowledge base
[Diagram: a cycle linking Business Case Definition, Data Preparation, Data Mining, and Evaluation around a central Knowledge Base.]
4.2 Goals of Data Mining and Knowledge Discovery
Data mining is typically carried out with some end goals or applications. Broadly speaking, these goals fall into the following classes: prediction, identification, classification, and optimisation.
• Prediction: Data mining can show how certain attributes within the data will behave in the future. Examples of predictive data mining include the analysis of buying transactions to predict what consumers will buy under certain discounts, how much sales volume a store would generate in a given period, and whether deleting a product line would yield more profits. In such applications, business logic is coupled with data mining. In a scientific context, certain seismic wave patterns may predict an earthquake with high probability.
• Identification: Data patterns can be used to identify the existence of an item, an event, or an activity. For instance, intruders trying to break into a system may be identified by the programs executed, files accessed, and CPU time per session. In biological applications, the existence of a gene may be identified by certain sequences of nucleotide symbols in the DNA sequence. The area known as authentication is a form of identification. It confirms whether a user is indeed a specific user or one from an authorised class, and involves a comparison of parameters, images, or signals against a database.
• Classification: Data mining can partition the data so that different classes or categories can be identified based on combinations of parameters. For example, customers in a supermarket can be categorised into discount-seeking shoppers, shoppers in a rush, loyal regular shoppers, shoppers attached to name brands, and infrequent shoppers. This classification may be used in different analyses of customer buying transactions as a post-mining activity. Sometimes, classification based on common domain knowledge is used as an input to decompose the mining problem and make it simpler. For instance, health foods, party foods, and school lunch foods are distinct categories in the supermarket business. It makes sense to analyse relationships within and across categories as separate problems. Such categorisation may be used to encode the data appropriately before subjecting it to further data mining.
• Optimisation: One eventual goal of data mining may be to optimise the use of limited resources such as time, space, money, or materials, and to maximise output variables such as sales or profits under a given set of constraints. As such, this goal of data mining resembles the objective function used in operations research problems that deal with optimisation under constraints.
The term data mining is popularly used in a very broad sense. In some situations it includes statistical analysis and constrained optimisation as well as machine learning. There is no sharp line separating data mining from these disciplines, and it is beyond our scope to discuss in detail the entire range of applications that make up this vast body of work. For a detailed understanding of the area, readers are referred to specialised books devoted to data mining.
4.3 Types of Knowledge Discovered during Data Mining
The term "knowledge" is very broadly interpreted as involving some degree of intelligence. There is a progression from raw data to information to knowledge as we go through additional processing.
Knowledge is often classified as inductive versus deductive. Deductive knowledge deduces new information by applying pre-specified logical rules of deduction to the given data. Data mining addresses inductive knowledge, which discovers new rules and patterns from the supplied data. Knowledge can be represented in many forms: in an unstructured sense, it can be represented by rules or propositional logic; in a structured form, it may be represented in decision trees, semantic networks, neural networks, or hierarchies of classes or frames. It is common to describe the knowledge discovered during data mining in five ways, as follows:
• Association rules: These rules correlate the presence of a set of items with another range of values for another set of variables. Examples: (1) When a female retail shopper buys a handbag, she is likely to buy shoes. (2) An X-ray image containing characteristics a and b is likely to also exhibit characteristic c.
• Classification hierarchies: The goal is to work from an existing set of events or transactions to create a hierarchy of classes. Examples: (1) A population may be divided into five ranges of creditworthiness based on a history of previous credit transactions. (2) A model may be developed for the factors that determine the desirability of the location of a store on a 1-10 scale. (3) Mutual funds may be classified based on performance data using characteristics such as growth, income, and stability.
• Sequential patterns: A sequence of actions or events is sought. Example: if a patient underwent cardiac bypass surgery for blocked arteries and an aneurysm, and later developed high blood urea within a year of surgery, he or she is likely to suffer from kidney failure within the next 18 months. Detection of sequential patterns is equivalent to detecting associations among events with certain temporal relationships.
• Patterns within time series: Similarities can be detected within positions of a time series of data, which is a sequence of data taken at regular intervals, such as daily sales or daily closing stock prices. Examples: (1) Stocks of a utility company, ABC Power, and a financial company, XYZ Securities, showed the same pattern during 2002 in terms of closing stock price. (2) Two products show the same selling pattern in summer but a different pattern in winter. (3) A pattern in solar magnetic wind may be used to predict changes in earth atmospheric conditions.
• Clustering: A given population of events or items can be partitioned (segmented) into sets of "similar" elements. Examples: (1) An entire population of treatment data on a disease may be divided into groups based on the similarity of side effects produced. (2) The adult population in the United States may be categorised into five groups, from "most likely to buy" to "least likely to buy" a new product. (3) The web accesses made by a collection of users against a set of documents (say, in a digital library) may be analysed in terms of the keywords of documents to reveal clusters or categories of users.
For most applications, the desired knowledge is a combination of the above types.
4.4 Knowledge Discovery Process
It is essential to understand the overall approach before one attempts to extract useful knowledge from data. Merely knowing the many algorithms used for data analysis is not sufficient for a successful data mining (DM) project. The process defines a sequence of steps (with eventual feedback loops) that should be followed to discover knowledge (for example, patterns) in data. Each step is usually realised with the help of available commercial or open-source software tools. To formalise the knowledge discovery processes (KDPs) within a common framework, we introduce the concept of a process model. The model helps organisations to better understand the KDP and provides a roadmap to follow while planning and executing the project. This in turn results in cost and time savings, better understanding, and acceptance of the results of such projects. We need to understand that such processes are nontrivial and involve multiple steps, reviews of partial results, possibly several iterations, and interactions with the data owners. There are several reasons to structure a KDP as a standardised process model:
• The end product must be helpful for the user/owner of the data. A blind, unstructured application of DM techniques to input data, called data dredging, normally produces meaningless results: knowledge that, while interesting, does not contribute to solving the user's problem. This ultimately leads to the failure of the project. Only through the application of well-defined KDP models will the end product be valid, novel, useful, and understandable.
• A well-defined KDP model should have a logical, cohesive, well-thought-out structure and approach that can be presented to decision-makers who may have difficulty understanding the need, value, and mechanics behind a KDP. Humans often fail to grasp the potential knowledge available in large amounts of untapped and possibly valuable data. They often do not want to devote significant time and resources to the pursuit of formal methods of knowledge extraction from the data, but rather prefer to rely heavily on the skills and experience of others (domain experts) as their source of information. However, because they are typically ultimately responsible for the decision(s) based on that information, they frequently want to understand (be comfortable with) the technology applied to those solutions. A process model that is well structured and logical will do much to alleviate any misgivings they may have.
• Knowledge discovery projects require a significant project management effort that needs to be grounded in a solid framework. Most knowledge discovery projects involve teamwork and thus need careful planning and scheduling. For most project management specialists, KDP and DM are not familiar terms. Therefore, these specialists need a definition of what such projects involve and how to carry them out in order to develop a sound project schedule.
• Knowledge discovery should follow the example of other engineering disciplines that already have established models. A good example is the software engineering field, a relatively new and dynamic discipline that exhibits many characteristics pertinent to knowledge discovery. Software engineering has adopted several development models, including the waterfall and spiral models, that have become well-known standards in this area.
• There is a widely recognised need for standardisation of the KDP. The challenge for modern data miners is to come up with widely accepted standards that will stimulate major industry growth. Standardisation of the KDP model would allow the development of standardised methods and procedures, thereby enabling end users to deploy their projects more easily. It would lead directly to project performance that is faster, cheaper, more reliable, and more manageable. The standards would promote the development and delivery of solutions that use business terminology rather than the traditional language of algorithms, matrices, criteria, complexities, and the like, resulting in greater exposure and acceptability for the knowledge discovery field.
As there is some confusion about the terms data mining, knowledge discovery, and knowledge discovery in databases, we first need to define them. Note, however, that many researchers and practitioners use DM as a synonym for knowledge discovery; DM is also just one step of the KDP.
The knowledge discovery process (KDP), also called knowledge discovery in databases, seeks new knowledge in some application domain. It is defined as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
The process generalises to non-database sources of data, although it emphasises databases as a primary source of data. It consists of many steps (one of them is DM), each attempting to complete a particular discovery task and each accomplished by the application of a discovery method. Knowledge discovery concerns the entire knowledge extraction process, including how data are stored and accessed, how to use efficient and scalable algorithms to analyse massive datasets, how to interpret and visualise the results, and how to model and support the interaction between human and machine. It also concerns support for learning and analysing the application domain.
[Diagram: input data (databases, images, video, semi-structured data, etc.) passes through STEP 1, STEP 2, ..., STEP n-1, STEP n to yield knowledge (patterns, rules, clusters, classification, association, etc.).]
Fig. 4.2 Sequential structure of KDP model
(Source: www.springer.com/cda/content/)
4.4.1 Overview of Knowledge Discovery Process
The KDP model consists of a set of processing steps to be followed by practitioners when executing a knowledge discovery project. The model describes the procedures that are performed in each of its steps. It is primarily used to plan, work through, and reduce the cost of any given project.
Since the 1990s, various KDPs have been developed. The initial efforts were led by academic research but were quickly followed by industry. The first basic structure of the model was proposed by Fayyad et al. and later improved/modified by others. The process consists of multiple steps executed in sequence. Each subsequent step is initiated upon successful completion of the previous step, and requires the result generated by the previous step as its input. Another common feature of the proposed models is the range of activities covered, which stretches from the task of understanding the project domain and data, through data preparation and analysis, to evaluation, understanding, and application of the generated results.
All the proposed models also emphasise the iterative nature of the model, in terms of many feedback loops that are triggered by a revision process. A schematic diagram is shown in the figure above. The main differences between the models described here are found in the number and scope of their specific steps. A common feature of all models is the definition of inputs and outputs. Typical inputs include data in various formats, such as numerical and nominal data stored in databases or flat files; images; video; semi-structured data, such as XML or HTML; and so on. The output is the generated new knowledge, usually described in terms of rules, patterns, classification models, associations, trends, statistical analysis, and so on.

[Figure: bar chart of relative effort [%] (scale 0 to 70) spent on each of the KDDM steps (understanding of domain, understanding of data, preparation of data, data mining, evaluation of results, deployment of results), comparing the estimates of Cabena et al., Shearer, and Cios and Kurgan]
Fig. 4.3 Relative effort spent on specific steps of the KD process
(Source: www.springer.com/cda/content/)
Most models follow a similar sequence of steps; the steps common to the five models are domain understanding,
data mining, and evaluation of the discovered knowledge. The nine-step model carries out the steps concerning
the choice of DM tasks and algorithms late in the process. The other models do so before preprocessing of the
data, in order to obtain data that are correctly prepared for the DM step without having to repeat some of the earlier
steps. In the case of Fayyad's model, the prepared data may not be suitable for the tool of choice, and thus a loop
back to the second, third, or fourth step may be needed. The five-step model is very similar to the six-step models,
except that it omits the data understanding step. The eight-step model gives a very detailed breakdown of steps in
the initial phases of the KDP, but it does not allow for a step concerned with applying the discovered knowledge.
At the same time, it recognises the important issue of human resource identification.
A very important aspect of the KDP is the relative time spent on completing each of the steps. Evaluation of this
effort allows precise scheduling. Various estimates have been proposed by researchers and practitioners alike. Figure
4.3 shows a comparison of these different estimates. Note that the numbers given are only estimates, used to
quantify relative effort, and their sum may not equal 100%. The specific estimated values depend on many
factors, such as existing knowledge about the project domain, the skill level of human resources, and the
complexity of the problem at hand, to name just a few. The common theme of all estimates is an acknowledgement
that the data preparation step is by far the most time-consuming part of the KDP.
The following summarises the steps of the knowledge discovery process:
• Define the problem: The first step involves understanding the problem and working out what the goals and expectations of the project are.
• Collect, clean, and prepare the data: This requires working out what data are needed and which data are most important, and integrating the information. This step needs considerable effort, as much as 70% of the total data mining effort.
• Data mining: This model-building step includes selecting data mining tools, transforming the data if the tool needs it, generating samples for training and testing the model, and finally using the tools to build and select a model.
• Validate the models: Test the model to make sure that it is producing accurate and adequate results.
• Monitor the model: With the passage of time, it will be necessary to revalidate the model to make sure that it still meets requirements. A model that works today may not work tomorrow; therefore, it is necessary to monitor the behaviour of the model to ensure it is meeting performance standards.
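The steps above can be sketched end to end in code. The toy records, the mean-threshold "model", and the accuracy check below are all illustrative assumptions, not a prescribed implementation:

```python
# Minimal sketch of the KDP steps: prepare data, build a model,
# validate it, and (over time) monitor it on newer data.

def prepare(raw):
    # Step 2: clean -- drop records with missing values.
    return [r for r in raw if r["x"] is not None]

def mine(train):
    # Step 3: "model building" -- here just a mean threshold.
    threshold = sum(r["x"] for r in train) / len(train)
    return lambda r: 1 if r["x"] > threshold else 0

def validate(model, test):
    # Step 4: check predictive accuracy on a test set.
    hits = sum(model(r) == r["label"] for r in test)
    return hits / len(test)

raw = [{"x": 1.0, "label": 0}, {"x": None, "label": 0},
       {"x": 4.0, "label": 1}, {"x": 5.0, "label": 1},
       {"x": 2.0, "label": 0}]
data = prepare(raw)
model = mine(data)
accuracy = validate(model, data)   # Step 5 would repeat this over time
print(round(accuracy, 2))          # -> 1.0
```

In practice the validation set is held out from training, and monitoring means re-running the validation step periodically on fresh data.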
4.5 Web Mining
With the explosive growth of information sources available on the World Wide Web, it has become increasingly
necessary for users to use automated tools to find the desired information resources, and to track and
analyse their usage patterns. These factors give rise to the need for server-side and client-side intelligent
systems that can effectively mine for knowledge. Web mining can be broadly defined as the discovery and analysis
of useful information from the World Wide Web. This covers the automatic search of information resources
available online, that is, Web content mining, and the discovery of user access patterns from Web servers, that is,
Web usage mining.
Web mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or
activity related to the World Wide Web. There are roughly three knowledge discovery domains that pertain to web
mining: Web content mining, Web structure mining, and Web usage mining. Web content mining is the process
of extracting knowledge from the content of documents or their descriptions. Web document text mining, and resource
discovery based on concept indexing or agent-based technology, may also fall into this category. Web structure
mining is the process of inferring knowledge from the organisation of the World Wide Web and the links between references
and referents on the Web. Finally, web usage mining, also known as Web log mining, is the process of extracting
interesting patterns from web access logs.
[Figure: two-stage web mining architecture. Stage 1 covers preprocessing: server log data and registration data (for example name, address, age and occupation, plus document and usage attributes) pass through data cleaning, transaction identification, and data integration and transformation to yield a clean log, transaction data, formatted data, and integrated data. Stage 2 covers pattern discovery and pattern analysis: path analysis, association rules, sequential patterns, and cluster and classification rules, examined with OLAP/visualisation tools, a knowledge query mechanism, an intelligent agent, and a database query language.]
Fig. 4.4 Web mining architecture
(Source: http://www.galeas.de/webmining.html)
4.5.1 Web Analysis
All visitors to a web site leave digital trails, which servers automatically store in log files. Web analysis tools
analyse and process these web server log files to produce meaningful information. Essentially, a complete profile
of site traffic is created: for example, how many visitors came to the site, which sites they came from, and which of
the site's pages are most popular.
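A minimal sketch of the first thing such a tool does, assuming server logs in the common log format (the sample log lines here are invented):

```python
import re
from collections import Counter

# Toy server log in Common Log Format (host, timestamp, request, status, bytes).
LOG = """\
10.0.0.1 - - [09/Sep/2011:10:00:01 +0000] "GET /index.html HTTP/1.0" 200 1043
10.0.0.2 - - [09/Sep/2011:10:00:05 +0000] "GET /products.html HTTP/1.0" 200 2301
10.0.0.1 - - [09/Sep/2011:10:00:09 +0000] "GET /products.html HTTP/1.0" 200 2301
"""

# Capture the host, the timestamp, and the requested path.
LINE = re.compile(r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)')

def page_counts(log_text):
    # Count how many times each page was requested.
    hits = Counter()
    for line in log_text.splitlines():
        m = LINE.match(line)
        if m:
            hits[m.group("path")] += 1
    return hits

print(page_counts(LOG).most_common(1))  # -> [('/products.html', 2)]
```

Real analysis tools extend the same idea to referrers, status codes, and per-visitor statistics.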
Web analysis tools provide companies with previously unknown statistics and helpful insights into the behaviour of
their on-line customers. While the usage and popularity of such tools continues to increase, many e-tailers are
now demanding more useful information about their customers from the vast amounts of data generated by their web
sites.
The result of the changing paradigm of commerce, from traditional brick and mortar shop fronts to electronic
transactions over the Internet, has been the dramatic shift in the relationship between e-tailers and their customers.
There is no longer any personal contact between retailers and customers. Customers are now highly mobile and
are demonstrating loyalty only to value, often irrespective of brand or geography. A major challenge for e-tailers
is to identify and understand their new customer base. E-tailers need to learn as much as possible regarding the
behaviour, the individual tastes and the preferences of the visitors to their sites in order to remain competitive in
this new era of commerce.
4.5.2 Benefits of Web Mining
Web mining allows e-tailers to leverage their on-line customer data by understanding and predicting the behaviour
of their customers. For the first time, e-tailers now have access to detailed marketing intelligence on the visitors to
their web sites. The business benefits that web mining affords to digital service providers include enhanced customer
support, personalisation, collaborative filtering, product and service strategy definition, particle marketing and fraud
detection; in short, the ability to understand customers' needs and to deliver the best and most appropriate
service to those individual customers at any given moment.
4.6 Web Content Mining
Web content mining, also known as text mining, is generally the second step in Web data mining. Content mining is
the scanning and mining of the text, pictures and graphs of a Web page to determine the relevance of the content to the
search query. This scanning is completed after the clustering of web pages through structure mining, and provides
the results based upon the level of relevance to the suggested query. With the massive amount of information that
is available on the World Wide Web, content mining provides the result lists to search engines in order of highest
relevance to the keywords in the query.
Text mining is directed toward the specific information requested by the customer in search engines.
This allows for the scanning of the entire Web to retrieve the cluster content, triggering the scanning of specific Web
pages within those clusters. The results are pages relayed to the search engines, ordered from the highest level of relevance
to the lowest. Though search engines can provide links to Web pages by the thousands in relation
to the search content, this type of web mining enables the reduction of irrelevant information.
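One way such a relevance ordering can be computed is TF-IDF, a standard term-weighting scheme; the documents and query below are invented, and real search engines combine many more signals than this:

```python
import math
from collections import Counter

# Toy "crawled pages" with hypothetical content.
docs = {
    "page1": "data mining of text",
    "page2": "recipes for baking bread",
    "page3": "web mining and structure mining of web links",
}

def rank_by_tfidf(query, docs):
    # Score each document by the summed TF-IDF weight of the query terms:
    # term frequency in the document times log(N / document frequency).
    n = len(docs)
    tokenised = {name: Counter(text.split()) for name, text in docs.items()}
    df = Counter(w for counts in tokenised.values() for w in counts)
    scores = {
        name: sum(counts[w] * math.log(n / df[w])
                  for w in query.split() if df[w])
        for name, counts in tokenised.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

print(rank_by_tfidf("web mining", docs))  # -> ['page3', 'page1', 'page2']
```

Pages that mention the query terms often, and terms that are rare across the collection, both push a page up the result list.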
Web text mining is very useful when used with a content database dealing with specific topics. For example,
online universities use a library system to recall articles related to their general areas of study. Such a specific content
database makes it possible to pull only the information within those subjects, providing the most specific results for search
queries in search engines. Supplying only the most relevant information gives a higher quality
of results. This productivity increase is due directly to the use of content mining of text and visuals.
The main uses of this type of data mining are to gather, categorise, organise and provide the best possible information
available on the WWW to the user requesting it. This tool is essential for scanning the many HTML
documents, images, and text provided on Web pages. The resulting information is supplied to the search engines
in order of relevance, giving more productive results for each search.
Web content categorisation with a content database is the most significant tool for the efficient use of search engines.
A customer requesting information on a particular subject or item would otherwise have to search through thousands
of results to find the information most relevant to his query. Mining text reduces those thousands of results at this
step, which eliminates frustration and improves the navigation of information on the Web.
Business uses of content mining allow the information provided on their sites to be structured in a relevance-ordered
site map. This enables a customer of the Web site to access specific information without having to search the
entire site. With the use of this type of mining, data remains available in order of relevance to the query, thus
supporting productive marketing. Used as a marketing tool, this brings additional traffic to the Web pages of a
company's site based on the amount of keyword relevance the pages offer to general searches. As the second section
of data mining, text mining is useful for enhancing the productive uses of mining for businesses, Web designers, and
search engine operations. Organisation, categorisation, and gathering of the information provided by the WWW
become easier and produce more productive results through the use of this type of mining.
In short, the ability to conduct Web content mining enables search engines to channel the flow of customer
clicks to a Web site, or to particular Web pages of the site, according to their relevance to search
queries. The clustering and organisation of Web content in a content database enables effective navigation of the
pages by the customer and by search engines. Images, content, formats and Web structure are examined to produce
higher-quality information for the user based upon the requests made. Businesses can use this text
mining to enhance the marketing of their sites as well as of the products they offer.
4.7 Web Structure Mining
Web structure mining, one of the three categories of web mining for data, is a tool used to identify the relationship between
Web pages linked by shared information or by direct link connection. This structure data is discoverable through the provision
of a web structure schema via database techniques for Web pages. This connection allows a search engine to
pull data relating to a search query directly to the linking Web page from the Web site the content rests upon. This
is accomplished through the use of spiders scanning the Web sites, retrieving the home page, and then linking the
information through reference links to bring forth the specific page containing the desired information.
Structure mining addresses two main problems of the World Wide Web that arise from its vast amount of information. The first
of these problems is irrelevant search results: the relevance of search information becomes misconstrued because
search engines often only allow for low-precision criteria. The second problem is the inability to index
the vast amount of information provided on the Web, which causes low recall with content mining. Structure
mining helps minimise both problems, in part through its function of discovering the model underlying the Web
hyperlink structure.
The main purpose of structure mining is to extract previously unknown relationships between Web pages. This
structure data mining offers a way for a business to link the information of its own Web site so as to enable navigation and
to cluster information into site maps. This gives its users the ability to access the desired information through keyword
association and content mining. Hyperlink hierarchy is also determined, to relate the information within the
sites to competitor links and connections through search engines and third-party co-links. This
enables clustering of connected Web pages to establish the relationship between these pages.
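One widely used model of this hyperlink structure is PageRank-style link analysis; the sketch below runs power iteration over a four-page link graph in which the page names, the link structure, and the damping factor are all invented for illustration:

```python
# Toy link graph: page -> pages it links to (hypothetical site).
links = {
    "home":     ["products", "about"],
    "products": ["home", "detail"],
    "about":    ["home"],
    "detail":   ["products"],
}

def pagerank(links, damping=0.85, iters=50):
    # Power iteration: each page repeatedly shares a damped fraction
    # of its rank equally among the pages it links to.
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            share = damping * rank[p] / len(outs)
            for q in outs:
                new[q] += share
        rank = new
    return rank

rank = pagerank(links)
print(max(rank, key=rank.get))  # the most "central" page in the graph
```

Heavily linked-to pages accumulate rank, which is one concrete sense in which the link structure alone reveals which pages a site treats as important.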
On the WWW, the use of structure mining enables the determination of similar structures of Web pages by clustering
through the identification of the underlying structure. This information can be used to project the similarities of web
content. The known similarities then provide the ability to maintain or improve the information of a site so that web
spiders access it at a higher rate. The larger the number of Web crawlers visiting, the more beneficial it is to the site,
because its content is related to more searches.
In the business world, structure mining can be quite useful in determining the connection between two or more
business Web sites. The determined connection provides an effective tool for mapping competing companies
through third-party links such as resellers and customers. This cluster map helps place the content of the business
pages in search engine results through the connection of keywords and co-links across the related Web pages.
The information so determined provides the proper path, through structure mining, to enhance
navigation of these pages via their relationships and the link hierarchy of the Web sites.
With improved navigation of Web pages on business Web sites, connecting the requested information to a search
engine becomes more effective. This stronger connection helps generate traffic to a business site and provides results
that are more productive. The more links provided within the relationship of the web pages, the more the navigation
can exploit the link hierarchy, allowing ease of navigation. This improved navigation attracts the spiders to the correct
locations providing the requested information, proving more beneficial in clicks to a particular site.
Therefore, Web mining, and the use of structure mining in particular, can provide strategic results for the marketing of a Web site
for the production of sales. The more traffic directed to the Web pages of a particular site, the higher the level of return
visits to the site and of recall by search engines relating to the information or product provided by the company.
This also enables marketing strategies to produce results that are more productive through navigation of the pages
linking to the homepage of the site itself. Structure mining is a must in order to truly utilise a website as a
business tool.
4.8 Web Usage Mining
Web usage mining is the third category of web mining. This type of web mining allows for the collection of Web
access information for Web pages. This usage data provides the paths leading to accessed Web pages. This information
is often gathered automatically into access logs via the Web server. CGI scripts offer other useful information, such
as referrer logs, user subscription information and survey logs. This category is important to the overall use of data
mining for companies and their internet/intranet based applications and information access.
Usage mining allows companies to produce productive information pertaining to the future of their business operations.
Some of this information can be derived from the collective information of lifetime user value, product
cross-marketing strategies and promotional campaign effectiveness. The usage data that is gathered provides
companies with the ability to produce results more effective for their businesses and for increasing sales. Usage data
can also be effective for developing marketing approaches that will out-sell competitors and promote the company's
services or product at a higher level.
Usage mining is valuable not only to businesses using online marketing, but also to e-businesses whose business
is based solely on the traffic provided through search engines. The use of this type of web mining helps to gather
the important information from customers visiting the site. This enables an in-depth log for a complete analysis of a
company's productivity flow. E-businesses depend on this information to direct the company to the most effective
Web server for the promotion of their product or service.
This web mining also enables Web-based businesses to provide the best access routes to services or other
advertisements. When a company advertises services provided by other companies, the usage mining data allows
for the most effective access paths to these portals. In addition, there are typically three main uses for mining in
this fashion.
The first is usage processing, used to complete pattern discovery. This use is also the most difficult, because only
bits of information such as IP addresses, user information, and site clicks are available. With this minimal amount of
information available, it is harder to track the user through a site, since the data do not follow the user throughout
the pages of the site.
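A common first move in usage processing is grouping raw clicks into per-visitor sessions using the IP address and an inactivity timeout; the click data and the 30-minute cutoff below are invented for illustration:

```python
# Group (ip, timestamp_seconds, page) clicks into sessions:
# a gap of more than 30 minutes starts a new session for that IP.
clicks = [
    ("10.0.0.1", 0,    "/home"),
    ("10.0.0.1", 120,  "/products"),
    ("10.0.0.2", 200,  "/home"),
    ("10.0.0.1", 4000, "/home"),      # > 30 min after previous click
]

def sessionise(clicks, timeout=1800):
    sessions = []
    last_seen = {}   # ip -> (session index, last timestamp)
    for ip, t, page in sorted(clicks, key=lambda c: c[1]):
        if ip in last_seen and t - last_seen[ip][1] <= timeout:
            idx = last_seen[ip][0]
            sessions[idx].append(page)
        else:
            sessions.append([page])
            idx = len(sessions) - 1
        last_seen[ip] = (idx, t)
    return sessions

print(sessionise(clicks))  # -> [['/home', '/products'], ['/home'], ['/home']]
```

Each resulting session is a click path through the site, the raw material for the path analysis and sequential patterns mentioned earlier.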
The second use is content processing, consisting of the conversion of Web information like text, images, scripts
and others into useful forms. This helps with the clustering and categorisation of Web page information based on
the titles, specific content and images available.
Finally, the third use is structure processing. This consists of analysis of the structure of each page contained in a
Web site. This structure processing can prove difficult if a new structure analysis has to be performed for
each page.
Analysis of this usage data will provide companies with the information needed to maintain an effective presence
for their customers. The information collected may include user registration data, access logs and information leading
to better Web site structure, proving to be most valuable to a company's online marketing. These represent some of the
advantages for external marketing of the company's products, services and overall management.
Internally, usage mining effectively provides information that improves communication over the intranet.
Developing strategies through this type of mining will allow intranet-based company databases
to be more effective through the provision of easier access paths. The projection of these paths helps to log user
registration information, bringing commonly used paths to the forefront for access.
Therefore, it is easily seen that usage mining has valuable uses in the marketing of businesses, with a direct
impact on the success of their promotional strategies and internet traffic. This information is gathered regularly
and analysed consistently. Analysis of this pertinent information helps companies develop more useful
promotions, better internet accessibility, improved inter-company communication and structure, and more productive
marketing through web usage mining.
Summary
• The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.
• Data mining is typically carried out with some end goals or applications. Broadly speaking, these goals fall into the following classes: prediction, identification, classification, and optimisation.
• The term "knowledge" is very broadly interpreted as involving some degree of intelligence.
• Deductive knowledge deduces new information by applying pre-specified logical rules of deduction to the given data.
• The knowledge discovery process defines a sequence of steps (with eventual feedback loops) that should be followed to discover knowledge in data.
• The KDP model consists of a set of processing steps to be followed by practitioners when executing a knowledge discovery project.
• Web mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web.
• Web analysis tools analyse and process web server log files to produce meaningful information.
• Web analysis tools provide companies with previously unknown statistics and useful insights into the behaviour of their on-line customers.
• Content mining is the scanning and mining of the text, pictures and graphs of a Web page to determine the relevance of the content to the search query.
• Web structure mining, one of three categories of web mining for data, is a tool used to identify the relationship between Web pages linked by information or by direct link connection.
• Web usage mining allows for the collection of Web access information for Web pages.
References
• Zaptron, 1999. Introduction to Knowledge-based Knowledge Discovery [Online] Available at: <http://www.zaptron.com/knowledge/>. [Accessed 9 September 2011].
• Maimom, O. and Rokach, L., Introduction to Knowledge Discovery in Database [Online PDF] Available at: <http://www.ise.bgu.ac.il/faculty/liorr/hbchap1.pdf>. [Accessed 9 September 2011].
• Maimom, O. and Rokach, L., 2005. Data Mining and Knowledge Discovery Handbook, Springer Science and Business.
• Liu, B., 2007. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer.
• http://nptel.iitm.ac.in, 2008. Lecture 34: Data Mining and Knowledge Discovery [Video Online] Available at: <http://www.youtube.com/watch?v=m5c27rQtD2E>. [Accessed 12 September 2011].
• http://nptel.iitm.ac.in, 2008. Lecture 35: Data Mining and Knowledge Discovery Part II [Video Online] Available at: <http://www.youtube.com/watch?v=0hnqxIsXcy4&feature=relmfu>. [Accessed 12 September 2011].
Recommended Reading
• Scime, A., 2005. Web Mining: Applications and Techniques, Idea Group Inc (IGI).
• Han, J. and Kamber, M., 2006. Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann.
• Chang, G., 2001. Mining the World Wide Web: An Information Search Approach, Springer.
Self Assessment
1. During __________, data about specific items or categories of items, or from stores in a specific region or area of the country, may be selected.
a. data selection
b. data extraction
c. data mining
d. data warehousing
2. _________ and encoding may be done to reduce the amount of data.
a. Data selection
b. Data extraction
c. Data transformation
d. Data mining
3. Which of the following is defined as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data?
a. Knowledge discovery process
b. Data mining
c. Data warehousing
d. Web mining
4. Match the columns.
1. Association rules        A. correlate the presence of a set of items with a range of values for another set of variables
2. Classification hierarchies   B. work from an existing set of events or transactions to create a hierarchy of classes
3. Sequential patterns      C. a sequence of actions or events is sought
4. Clustering               D. a given population of events or items can be partitioned (segmented) into sets of "similar" elements
a. 1-A, 2-B, 3-C, 4-D
b. 1-D, 2-A, 3-C, 4-B
c. 1-B, 2-A, 3-C, 4-D
d. 1-C, 2-B, 3-A, 4-D
5. Which of the following requires a significant project management effort that needs to be grounded in a solid framework?
a. Knowledge discovery process
b. Knowledge discovery in database
c. Knowledge discovery project
d. Knowledge discovery
6. __________ is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web.
a. Data extraction
b. Data mining
c. Web mining
d. Knowledge discovery process
7. ____________ tools provide companies with previously unknown statistics and useful insights into the behaviour of their on-line customers.
a. Web analysis
b. Web mining
c. Data mining
d. Web usage mining
8. What enables e-tailers to leverage their on-line customer data by understanding and predicting the behaviour of their customers?
a. Web analysis
b. Web mining
c. Data mining
d. Web usage mining
9. Which of the following is known as text mining?
a. Web mining
b. Web usage mining
c. Web content mining
d. Web structure mining
10. What is used to identify the relationship between Web pages linked by information or direct link connection?
a. Web mining
b. Web usage mining
c. Web content mining
d. Web structure mining
Chapter V
Advanced Topics of Data Mining
Aim
The aim of this chapter is to:
• introduce the concept of spatial data mining and knowledge discovery
• analyse different techniques of spatial data mining and knowledge discovery
• explore sequence mining
Objectives
The objectives of this chapter are to:
• explicate the cloud model
• describe temporal mining
• elucidate database mediators
Learning outcome
At the end of this chapter, you will be able to:
• enlist types of temporal data
• comprehend temporal data processing
• understand the classification techniques
5.1 Introduction
The technical progress in computerised data acquisition and storage, results in the growth of vast databases. With the
continuous increase and accumulation, the huge amounts of the computerised data have far exceeded human ability
to completely interpret and use. These phenomena may be more serious in geo-spatial science. In order to understand
and make full use of these data repositories, a few techniques have been tried, for instance, expert system, database
management system, spatial data analysis, machine learning, and artifcial intelligence. In 1989, knowledge discovery
in databases was further proposed. Later, in 1995, data mining also appeared. As both data mining and knowledge
discovery in databases virtually point to the same techniques, people would like to call them together, that is data
mining and knowledge discovery (DMKD). As 80% data are geo-referenced, the necessity forces people to consider
spatial characteristics in DMKD and to further develop a branch in geo-spatial science, that is SDMKD.
Spatial data are more complex, more changeable and larger than common transactional datasets. The spatial dimension means that each
item of data has a spatial reference, where each entity occurs on a continuous surface, or where a spatially referenced
relationship exists between two neighbouring entities. Spatial data contain not only positional data and attribute data,
but also spatial relationships among spatial entities. Moreover, spatial data structures are more complex than the tables
in an ordinary relational database. Besides tabular data, there are vector and raster graphic data in a spatial database,
and the features of graphic data are not explicitly stored in the database. At the same time, contemporary
GIS have only basic analysis functionalities, the results of which are explicit. And it is under the assumption of
dependency, and on the basis of sampled data, that geostatistics estimates values at unsampled locations or makes a map
of the attribute. Because the discovered spatial knowledge can support and improve spatial data-referenced decision-
making, growing attention has been paid to the study, development and application of SDMKD.
5.2 Concepts
Spatial data mining and knowledge discovery (SDMKD) is the efficient extraction of hidden, implicit, interesting,
previously unknown, potentially useful, ultimately understandable, spatial or non-spatial knowledge (rules,
regularities, patterns, constraints) from incomplete, noisy, fuzzy, random and practical data in large spatial databases. It
is a confluence of database technology, artificial intelligence, machine learning, probabilistic statistics, visualisation,
information science, pattern recognition and other disciplines. Understood from different viewpoints (Table 5.1),
SDMKD shows many new interdisciplinary characteristics.
• Discipline: An interdisciplinary subject whose theories and techniques are linked with databases, computing, statistics, cognitive science, artificial intelligence, mathematics, machine learning, networks, data mining, knowledge discovery in databases, data analysis, pattern recognition, and so on.
• Analysis: Discover unknown and useful rules from huge amounts of data via sets of interactive, repetitive, associative, and data-oriented manipulations.
• Logic: An advanced technique of deductive spatial reasoning. It is discovery, not proof; the knowledge is conditionally generic on the mined data.
• Cognitive science: An inductive process that runs from concrete data to abstract patterns, from special phenomena to general rules.
• Objective data: Data forms: vector, raster, and vector-raster. Data structures: hierarchical, relational, network, and object-oriented. Spatial and non-spatial data contents: positions, attributes, texts, images, graphics, databases, file systems, log files, voices, web and multimedia.
• Systematic information: Original data in databases, cleaned data in data warehouses, senior commands from users, background knowledge from the fields of application.
• Methodology: Matches the multidisciplinary philosophy of human thinking that suitably deals with complexity, uncertainty, and variety when briefing data and representing rules.
• Application: All spatial data-referenced fields and decision-making processes, for example, GIS, remote sensing, GPS (global positioning system), transportation, police work, medicine, navigation, robotics, and so on.
Table 5.1 Spatial data mining and knowledge discovery in various viewpoints
5.2.1 Mechanism
SDMKD is a process of discovering rules, along with exceptions, at hierarchical view-angles with various
thresholds, for instance, drilling, dicing and pivoting on multidimensional databases; spatial data warehousing;
generalising, characterising and classifying entities; summarising and contrasting data characteristics; describing
rules; predicting future trends; and so on. It is also a process that supports spatial decision-making. There are two
mining granularities: spatial object granularity and pixel granularity.
The process may be briefly partitioned into three big steps: data preparation (positioning the mining objective, collecting
background knowledge, cleaning spatial data), data mining (decreasing data dimensions, selecting mining techniques,
discovering knowledge), and knowledge application (interpretation, evaluation and application of the discovered
knowledge). In order to discover trustworthy knowledge, it is common to use more than one technique to mine
the data sets simultaneously. Moreover, it is also advisable to select the mining techniques on the basis of the given
mining task and the knowledge to be discovered.
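As an illustration of the data mining step on spatial data, the sketch below clusters 2-D point coordinates with plain k-means; the points, the number of clusters and the iteration count are invented for the example:

```python
import random

def kmeans(points, k=2, iters=20, seed=0):
    # Plain k-means over 2-D (x, y) coordinates:
    # assign each point to its nearest centre, then recompute centres.
    random.seed(seed)
    centres = random.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centres]
            groups[d.index(min(d))].append(p)
        centres = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
            if g else centres[i]
            for i, g in enumerate(groups)
        ]
    return centres

# Two obvious spatial clusters of incident locations (toy data).
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
print(sorted(kmeans(points)))  # prints the two cluster centres
```

The same idea scales from these toy coordinates to spatial object granularity or pixel granularity, with a distance measure appropriate to the data.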
5.2.2 Knowledge to be Discovered
The knowledge is more generalised, condensed and understandable than data. Common knowledge is summarised and generalised from huge amounts of spatial data. The amount of spatial data is huge, while the volume of spatial rules is very small; the more generalised the knowledge, the greater this contrast. There are many kinds of knowledge that can be mined from large spatial data sets. The kinds of rules in Table 5.2 are not isolated, and they often benefit from each other. Various forms, such as linguistic concepts, characteristic tables, predicate logic, semantic networks, object orientation, and visualisation, can represent the discovered knowledge. Very complex nonlinear knowledge may be depicted with a group of rules.
Association rule: a logical association among different sets of spatial entities that associates one or more spatial objects with other spatial objects; studies the frequency of items occurring together in transactional databases.
Example: Rain (x, pour) => Landslide (x, happen), where interestingness is 98%, support is 76%, and confidence is 51%.

Characteristics rule: a common character of one kind of spatial entity, or of several kinds of spatial entities; a kind of tested knowledge for summarising similar features of objects in a target class.
Example: characterise similar ground objects in a large set of remote sensing images.

Discriminate rule: a special rule that tells one spatial entity from another; differentiates spatial characteristics rules; a comparison of the general features of objects between a target class and a contrasting class.
Example: compare land price at the urban boundary with land price in the urban centre.

Clustering rule: a segmentation rule that groups a set of objects together by virtue of their similarity or proximity to each other, in contexts where it is unknown what groups, and how many groups, will be clustered; organises data in unsupervised clusters based on attribute values.
Example: group crime locations to find distribution patterns.

Classification rule: a rule that defines whether a spatial entity belongs to a particular class or set, in contexts where it is known what classes, and how many classes, will be classified; organises data in given/supervised classes based on attribute values.
Example: classify remotely sensed images based on spectrum and GIS data.

Serial rules: spatiotemporally constrained rules that relate spatial entities continuously in time, or the functional dependency among parameters; analyse trends, deviations, regression, sequential patterns, and similar sequences.
Example: in summer, landslide disasters often happen; land price is a function of influential factors and time.

Predictive rule: an inner trend that forecasts future values of some spatial variables when the temporal or spatial centre is moved to another one; predicts unknown or missing attribute values based on other seasonal or periodical information.
Example: forecast the movement trend of a landslide based on available monitoring data.

Exceptions: outliers that are isolated from the common rules or deviate very much from the other data observations.
Example: a monitoring point with much bigger movement.

Table 5.2 Main spatial knowledge to be discovered
Knowledge is a combination of rule and exception. A spatial rule is a pattern showing the intersection of two or more spatial objects or space-depending attributes according to a particular spacing or set of arrangements (Ester, 2000). Besides the rules, during the discovery process of description or prediction, there may be some exceptions (also named outliers) that deviate very much from the other data observations. They identify and explain surprises. For example, spatial trend predictive modelling first discovers the centres that are local maxima of some non-spatial attribute, then determines the (theoretical) trend of that attribute when moving away from the centres. Finally, a few deviations are found, in which some data lie away from the theoretical trend. These deviations may arouse the suspicion that they are noise, or that they were generated by a different mechanism. How should these outliers be explained? Traditionally, outlier detection has been studied via statistics, and a number of discordance tests have been developed. Most of them treat outliers as "noise" and try to eliminate their effects by removing the outliers or by developing outlier-resistant methods. In fact, these outliers prove the rules. In the context of data mining, they are meaningful input signals rather than noise. In some cases, outliers represent unique characteristics of the objects that are important to an organisation. Therefore, a piece of generic knowledge is virtually in the form of rule plus exception.
5.3 Techniques of SDMKD
As SDMKD is an interdisciplinary subject, there are various techniques associated with the abovementioned kinds of knowledge. They include probability theory, evidence theory, spatial statistics, fuzzy sets, the cloud model, rough sets, neural networks, genetic algorithms, decision trees, exploratory learning, inductive learning, visualisation, spatial online analytical mining (SOLAM), outlier detection, and so on. The main techniques are briefed in Table 5.3.
Probability theory: mines spatial data with randomness on the basis of stochastic probabilities. The knowledge is represented as a conditional probability in the context of given conditions and a certain hypothesis being true. Also named probability theory and mathematical statistics.

Spatial statistics: discovers sequential geometric rules from disordered data via the covariance structure and variation function, in the context of adequate samples and background knowledge. Clustering analysis is a branch.

Evidence theory: mines spatial data via the belief function and plausibility function. It is an extension of probability theory, and suitable for stochastic-uncertainty-based SDMKD.

Fuzzy sets: mines spatial data with fuzziness on the basis of a fuzzy membership function that depicts an inexact degree of membership, using fuzzy comprehensive evaluation, fuzzy clustering analysis, fuzzy control, fuzzy pattern recognition and so on.

Rough sets: mines spatial data with incomplete uncertainty via a pair of lower and upper approximations. Rough-sets-based SDMKD is also a process of intelligent decision-making under the umbrella of spatial data.

Neural network: mines spatial data via a nonlinear, self-learning, self-adaptive, parallel, dynamic system composed of many linked neurons in a network. The set of neurons collectively finds rules by continuously learning from training samples in the network.

Genetic algorithms: search for the optimised rules in spatial data via three operators simulating the replication, crossover, and mutation of biological evolution.

Decision tree: reasons out the rules by rolling down and drilling up a tree-structured map, in which the root node is the mining task, the branch nodes are the mining process, and the leaf nodes are the exact data sets. After pruning, the hierarchical patterns are uncovered.

Exploratory learning: focuses on data characteristics by analysing topological relationships, overlaying map layers, matching images, buffering features (points, lines, polygons) and optimising routes.

Spatial inductive learning: comes from machine learning. Summarises and generalises spatial data in the context of a given background that comes from users or from a task of SDMKD. The algorithms require the training data to be composed of tuples with various attributes, one of which is the class label.

Visualisation: visually mines spatial data by computerised visualisation techniques that turn abstract data and complicated algorithms into concrete graphics, images, animation and so on, which the user can perceive directly.

SOLAM: mines data via online analytical processing and a spatial data warehouse, based on a multidimensional view and the web. It is a tested form of mining that emphasises execution efficiency and timely response to commands.

Outlier detection: extracts the interesting exceptions from spatial data via statistics, clustering, classification, and regression, besides the common rules.

Table 5.3 Techniques to be used in SDMKD
5.3.1 SDMKD-based Image Classification
This section presents an approach that combines spatial inductive learning with Bayesian image classification in a loose manner. It takes learning tuples at two mining granularities, pixel granularity and polygon granularity, as the units for learning the knowledge that subdivides classes into subclasses, and selects the class probability values of Bayesian classification, shape features, locations and elevations as the learning attributes. GIS data are used in training area selection for Bayesian classification, in generating learning data of the two granularities, and in test area selection for classification accuracy evaluation. The ground control points for image rectification are also chosen from GIS data. Inductive learning in spatial data mining is implemented via the C5.0 algorithm on the basis of the learning granularities. Figure 5.1 shows the principle of the method.
[Figure: flow diagram. Remote sensing images, with a training area selected from the GIS database, undergo Bayesian classification to give an initial classification result; pixel granularity and polygon granularity data then feed inductive learning, whose knowledge base drives deductive reasoning to produce the final classification result, which is evaluated for accuracy against a test area.]

Fig. 5.1 Flow diagram of remote sensing image classification with inductive learning
In Figure 5.1, the remote sensing images are classified initially by the Bayesian method before the knowledge is used, and the probabilities of each pixel for every class are retained. Secondly, inductive learning is conducted with the learning attributes. Learning with probability simultaneously makes use of the spectral information of a pixel and the statistical information of a class, since the probability values are derived from both of them. Thirdly, knowledge on the attributes of general geometric features, spatial distribution patterns and spatial relationships is further discovered from the GIS database, for instance, from the polygons of different classes. For example, the water areas in the classification image are converted from pixels to polygons by raster-to-vector conversion, and then the location and shape features of these polygons are calculated. Finally, the polygons are subdivided into subclasses by deductive reasoning based on the knowledge; for example, class water is subdivided into subclasses such as river, lake, reservoir and pond.
The final classification results are obtained by post-processing the initial classification results with deductive reasoning. Except for the class label attribute, the attributes for deductive reasoning are the same as those in inductive learning. The knowledge discovered by the C5.0 algorithm is a group of classification rules and a default class, and each rule comes with a confidence value between 0 and 1. According to how a rule is activated, that is, whether the attribute values match the conditions of the rule, the deductive reasoning adopts four strategies:
• If only one rule is activated, then let the final class be the same as this rule.
• If several rules are activated, then let the final class be the same as the rule with the maximum confidence.
• If several rules are activated and the confidence values are the same, then let the final class be the same as the rule with the maximum coverage of learning samples.
• If no rule is activated, then let the final class be the default class.
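The four strategies can be sketched as a small selection routine. This is an illustrative sketch, not the C5.0 interface: the rule representation (a condition predicate plus class label, confidence and coverage) and all names below are our own assumptions.

```python
# Illustrative sketch of the four deductive-reasoning strategies.
# A rule is (condition, class_label, confidence, coverage); all names are hypothetical.

def final_class(attributes, rules, default_class):
    activated = [r for r in rules if r[0](attributes)]
    if not activated:                       # strategy 4: no rule activated
        return default_class
    if len(activated) == 1:                 # strategy 1: exactly one rule activated
        return activated[0][1]
    best_conf = max(r[2] for r in activated)
    top = [r for r in activated if r[2] == best_conf]
    if len(top) == 1:                       # strategy 2: maximum confidence
        return top[0][1]
    return max(top, key=lambda r: r[3])[1]  # strategy 3: maximum sample coverage

# Hypothetical rules for subdividing the class "water":
rules = [
    (lambda a: a["area"] > 50, "lake", 0.9, 120),
    (lambda a: a["elongation"] > 3, "river", 0.9, 200),
    (lambda a: a["area"] <= 5, "pond", 0.8, 40),
]
print(final_class({"area": 80, "elongation": 4}, rules, "water"))  # confidence tie -> "river"
```

Here two rules fire with equal confidence, so strategy 3 breaks the tie by coverage; with no rule firing, the default class "water" would be returned.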
5.3.2 Cloud Model
The cloud model is a model of the uncertainty transition between qualitative and quantitative analysis, that is, a mathematical model of the uncertainty transition between a linguistic term of a qualitative concept and its numerical representation data. A cloud is made up of many cloud drops, visible in shape as a whole but fuzzy in detail, similar to a natural cloud in the sky; hence the terminology "cloud" is used to name the uncertainty transition model proposed here. Any one of the cloud drops is a stochastic mapping in the discourse universe from a qualitative concept, that is, a specified realisation with uncertain factors. With the cloud model, the mapping from the discourse universe to the interval [0, 1] is a one-point to multi-point transition, producing a piece of cloud rather than a membership curve. As well, the degree to which any cloud drop represents the qualitative concept can be specified. The cloud model can mine spatial data with both fuzzy and stochastic uncertainties, and the discovered knowledge is close to human thinking. Recently, in geo-spatial science, the cloud model has been further explored for spatial intelligent query, image interpretation, land price discovery, factor selection, the mechanism of spatial data mining, and landslide monitoring.
The cloud model integrates fuzziness and randomness in a unified way via three numerical characteristics: Expected value (Ex), Entropy (En), and Hyper-Entropy (He). In the discourse universe, Ex is the position corresponding to the centre of gravity of the cloud, whose elements are fully compatible with the spatial linguistic concept; En is a measure of the concept coverage, that is, a measure of the spatial fuzziness, which indicates how many elements can be accepted into the spatial linguistic concept; and He is a measure of the dispersion of the cloud drops, which can also be considered the entropy of En. In the extreme case {Ex, 0, 0}, the model denotes the concept of a deterministic datum, where both the entropy and the hyper-entropy equal zero. The greater the number of cloud drops, the more deterministic the concept. Figure 5.2 shows the three numerical characteristics of the linguistic term "displacement is 9 millimetres (mm) around". Given the three numerical characteristics Ex, En and He, the cloud generator can produce as many cloud drops as desired.
[Figure: bell-shaped certainty curve μ(x) over displacement x, peaking at 1 above Ex = 9 mm, with a span of about 3En around Ex and the dispersion of the drops indicated by He.]

Fig. 5.2 Three numerical characteristics
The above visualisation is implemented with the forward cloud generator in the context of the given {Ex, En, He}. Despite the uncertainty in the algorithm, the positions of the cloud drops produced each time are deterministic: each cloud drop produced by the cloud generator is plotted deterministically according to its position. On the other hand, it is an elementary issue in spatial data mining that a spatial concept is always constructed from the given spatial data, and spatial data mining aims to discover spatial knowledge, represented by a cloud, from the database. That is, the backward cloud generator is also essential. It can be used to perform the transition from data to linguistic terms, and may mine the integral {Ex, En, He} of cloud drops specified by many precise data points. Under the umbrella of mathematics, the normal cloud model is the common one, and the functional cloud model is more interesting. Because it is common and useful for representing spatial linguistic atoms, the normal compatibility cloud will be taken as the example for studying the forward and backward cloud generators in the following.
The input of the forward normal cloud generator is the three numerical characteristics of a linguistic term, (Ex, En, He), and the number of cloud drops to be generated, N, while the output is the quantitative positions of the N cloud drops in the data space and the certainty degree with which each cloud drop can represent the linguistic term. The algorithm in detail is:
• Produce a normally distributed random number En' with mean En and standard deviation He.
• Produce a normally distributed random number x with mean Ex and standard deviation En'.
• Calculate y = exp(−(x − Ex)² / (2(En')²)).
• Drop (x, y) is a cloud drop in the discourse universe.
• Repeat the above steps until N cloud drops are generated.
Correspondingly, the input of the backward normal cloud generator is the quantitative positions of N cloud drops, xi (i = 1, …, N), and the certainty degree with which each cloud drop can represent a linguistic term, yi (i = 1, …, N), while the output is the three numerical characteristics, Ex, En, He, of the linguistic term represented by the N cloud drops. The algorithm in detail is:
• Calculate the mean value of xi (i = 1, …, N): Ex = (1/N) Σ xi.
• For each pair (xi, yi), calculate Eni = |xi − Ex| / √(−2 ln yi).
• Calculate the mean value of Eni (i = 1, …, N): En = (1/N) Σ Eni.
• Calculate the standard deviation of Eni: He = √((1/N) Σ (Eni − En)²).
With the given algorithms of the forward and backward cloud generators, it is easy to build an inseparable and interdependent mapping relationship between qualitative concepts and quantitative data. The cloud model overcomes the weakness of rigid specification and excessive certainty found in commonly used transition models, which conflicts with the human recognition process. Moreover, it performs the interchangeable transition between qualitative concepts and quantitative data through strict mathematical functions, and the preservation of uncertainty in the transition makes the cloud model meet the needs of real-life situations. Obviously, the cloud model is not a simple combination of probability methods and fuzzy methods.
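Both generators can be sketched directly from the steps above. This is a minimal illustration assuming the normal-cloud certainty degree y = exp(−(x − Ex)² / (2(En')²)); the function names are our own.

```python
import math
import random

def forward_cloud(Ex, En, He, N):
    """Generate N cloud drops (x, y) for the linguistic term {Ex, En, He}."""
    drops = []
    for _ in range(N):
        En_p = random.gauss(En, He)                      # entropy sample En'
        x = random.gauss(Ex, abs(En_p))                  # drop position
        y = math.exp(-(x - Ex) ** 2 / (2 * En_p ** 2))   # certainty degree
        drops.append((x, y))
    return drops

def backward_cloud(drops):
    """Recover {Ex, En, He} from cloud drops (x, y)."""
    Ex = sum(x for x, _ in drops) / len(drops)
    # En_i = |x - Ex| / sqrt(-2 ln y); drops with y = 1 carry no entropy information
    Eni = [abs(x - Ex) / math.sqrt(-2 * math.log(y)) for x, y in drops if 0 < y < 1]
    En = sum(Eni) / len(Eni)
    He = math.sqrt(sum((e - En) ** 2 for e in Eni) / len(Eni))
    return Ex, En, He
```

With enough drops, feeding the output of forward_cloud into backward_cloud approximately recovers the original three characteristics, illustrating the interchangeable transition described above.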
5.3.3 Data Fields
The obtained spatial data are comparatively incomplete. Each datum in the concept space makes its own contribution to forming the concept and the concept hierarchy. Thus, it is necessary for the observed data to radiate their data energies from the sample space to their parent space. In order to describe this data radiation, the data field is proposed.
Spatial data radiate energies into the data field. The power of the data field may be measured by its potential, with a field function. This is similar to the way electric charges contribute to forming an electric field: every electric charge has an effect on the electric potential everywhere in the field. Therefore, the function of the data field can be derived from the physical fields. The potential at a point in the number universe is the sum of all the data potentials:

potential = Σ (i = 1, …, N) k · ρi / ri

where k is a constant of the radiation gene, ri is the distance from the point to the position of the ith observed datum, ρi is the certainty of the ith datum, and N is the number of data. With a higher certainty, a datum may make a greater contribution to the potential in the concept space. Besides these, the spacing between neighbouring isopotentials, the computerised grid density of the Cartesian coordinates, and so on may also contribute to the data field.
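As a small sketch of this idea, assuming the electric-potential analogy above, each datum contributes in proportion to its certainty and inversely to its distance; the constant k and the function names here are our own assumptions.

```python
import math

def potential(point, data, k=1.0, eps=1e-9):
    """Data-field potential at `point` from observations [(position, certainty), ...]."""
    total = 0.0
    for pos, rho in data:
        r = math.dist(point, pos)       # distance r_i to the ith observed datum
        total += k * rho / max(r, eps)  # certainty-weighted contribution; eps avoids r = 0
    return total

data = [((0.0, 0.0), 1.0), ((2.0, 0.0), 0.5)]
print(potential((1.0, 0.0), data))  # 1.0/1 + 0.5/1 = 1.5
```

As the text notes, a datum with higher certainty contributes more, and the potential falls off as the query point moves away from the observed data.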
5.4 Design- and Model-based Approaches to Spatial Sampling
The design-based and model-based approaches to spatial sampling are explained below:
5.4.1 Design-based Approach to Sampling
The design-based approach, or classical sampling theory approach, to spatial sampling views the population of values in the region as a set of unknown values which, apart from any measurement error, are fixed in value. Randomness enters through the process for selecting the locations to sample. In the case of a discrete population, the target of inference is some global property such as:

(1/N) Σ (k = 1, …, N) z(k)    (eq. 1)

where N is the number of members of the population, so the above equation is the population mean. If z(k) is binary, depending on whether the kth member of the population is of a certain category or not, then the equation is the population proportion of some specified attribute. In the case of a continuous population in region A of area |A|, the equation would be replaced by the integral:

(1/|A|) ∫A z(s) ds    (eq. 2)

The design-based approach is principally used for tackling 'how much' questions, such as estimating the above equations. In principle, individual z(k) could be targets of inference but, because design-based estimators disregard most of the information that is available on where the samples are located in the study area, in practice this is either not possible or gives rise to estimators with poor properties.
[Figure: a grid of strata indexed by rows (k − 1, k, k + 1) and columns (s − 1, s, s + 1), with one design-based sample observation per stratum; a point X lies in stratum (k, s). In the absence of spatial information, the point X in stratum (k, s) would have to be estimated using the other point in that stratum, even though the samples in two other strata are closer and may well provide better estimates.]

Fig. 5.3 Using spatial information for estimation from a sample
5.4.2 Model-based Approach to Sampling
The model-based approach, or superpopulation approach, to spatial sampling views the population of values in the study region as but one realisation of some stochastic model. The source of randomness that is present in a sample derives from the stochastic model. Again, the target of inference could be (eq. 1) or (eq. 2).
Under the superpopulation approach, (eq. 1) for example now represents the mean of just one realisation. Were other realisations to be generated, (eq. 1) would differ across realisations. Under this strategy, since (eq. 1) is a sum of random variables, (eq. 1) is itself a random variable, and it is usual to speak of predicting its value. A model-based sampling strategy provides predictors that depend on model properties and are optimal with respect to the selected model. Results may be dismissed if the model is subsequently rejected or disputed.
In the model-based approach, it is the mean (μ) of the stochastic model assumed to have generated the realised population that is the usual target of inference, rather than a quantity such as (eq. 1). This model mean can be considered the underlying signal of which (eq. 1) is a 'noisy' reflection. Since μ is a (fixed) parameter of the underlying stochastic model, when it is the target of inference it is usual to speak of estimating its value. In spatial epidemiology, for example, it is the true underlying relative risk for an area, rather than the observed or realised relative risk revealed by the specific data set, that is of interest. Another important target of inference within the model-based strategy is often z(i), the value of Z at location i. Since Z is a random variable, it is usual to speak of predicting the value z(i).
5.5 Temporal Mining
A database system consists of three layers: physical, logical, and external. The physical layer deals with the storage
of the data, while the logical layer deals with the modelling of the data. The external layer is the layer that the
database user interacts with by submitting database queries. A database model depicts the way that the database
management system stores the data and manages their relations. The most prevalent models are the relational and
the object-oriented. For the relational model, the basic construct at the logical layer is the table, while for the object-
oriented model it is the object. Because of its popularity, we will use the relational model in this book. Data are
retrieved and manipulated in a relational database, using SQL. A relational database is a collection of tables, also
known as relations. The columns of the table correspond to attributes of the relational variable, while the rows, also
known as tuples, correspond to the different values of the relational variable. Other frequently used database terms
are the following:
• Constraint: a rule imposed on a table or a column.
• Trigger: the specification of a condition whose occurrence in the database causes the appearance of an external event, such as the appearance of a popup.
• View: a stored database query that hides rows and/or columns of a table.
Temporal databases are databases that contain time-stamping information. Time-stamping can be done as follows:
• With a valid time: the time that the element information is true in the real world.
• With a transaction time: the time that the element information is entered into the database.
• Bi-temporally: with both a valid time and a transaction time.

Time-stamping is usually applied to each tuple; however, it can be applied to each attribute as well. Databases that support time can be divided into four categories:
• Snapshot databases: they keep the most recent version of the data. Conventional databases fall into this category.
• Rollback databases: they support only the concept of transaction time.
• Historical databases: they support only valid time.
• Temporal databases: they support both valid and transaction times.
The two types of temporal entities that can be stored in a database are:
• Interval: a temporal entity with a beginning time and an ending time.
• Event: a temporal entity with an occurrence time.
In addition to interval and event, another type of temporal entity that can be stored in a database is a time series. A time series consists of a series of real-valued measurements at regular intervals. Other frequently used terms related to temporal data are the following:
Granularity: Describes the duration of the time sample/measurement.
Anchored data: Used to describe either the time of an occurrence of an event or the beginning and ending times of an interval.
Unanchored data: Used to represent the duration of an interval.
Data coalescing: The replacement of two tuples A and B with a single tuple C, where A and B have identical non-temporal attributes and adjacent or overlapping temporal intervals. C has the same non-temporal attributes as A and B, while its temporal interval is the union of A's and B's temporal intervals.
Table 5.4 Terms related to temporal data
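Data coalescing as defined above can be sketched as follows; the tuple layout (attrs, start, end) with half-open intervals is our own illustrative choice, not a standard database API.

```python
# Sketch of data coalescing: merge tuples that have identical non-temporal
# attributes and adjacent or overlapping [start, end) intervals.

def coalesce(rows):
    rows = sorted(rows, key=lambda r: (r[0], r[1]))   # group by attributes, then by start
    out = []
    for attrs, start, end in rows:
        if out and out[-1][0] == attrs and start <= out[-1][2]:
            prev_attrs, prev_start, prev_end = out[-1]
            out[-1] = (attrs, prev_start, max(prev_end, end))  # union of the two intervals
        else:
            out.append((attrs, start, end))
    return out

rows = [("ward A", 1, 5), ("ward A", 5, 9), ("ward B", 2, 4)]
print(coalesce(rows))  # [('ward A', 1, 9), ('ward B', 2, 4)]
```

The two "ward A" tuples are adjacent (the first ends exactly where the second begins), so they collapse into a single tuple whose interval is the union of both.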
5.5.1 Time in Data Warehouses
A data warehouse (DW) is a repository of data that can be used in support of business decisions. Many data warehouses
have a time dimension and thus they support the idea of valid time. In addition, data warehouses contain snapshots of
historical data and inherently support the idea of transaction time. Therefore, a DW can be considered as a temporal
database, because it inherently contains bi-temporal time-stamping. Time affects the structure of the warehouse too.
This is done by gradually increasing the granularity coarseness as we move further back in the time dimension of
the data. Data warehouses, therefore, inherently support coalescing. Despite the fact that data warehouses inherently
support the notion of time, they are not equipped to deal with temporal changes in master data.
5.5.2 Temporal Constraints and Temporal Relations
Temporal constraints can be either qualitative or quantitative. Regarding quantitative temporal constraints, variables
take their values over the set of temporal entities and the constraints are imposed on a variable by restricting its set
of possible values. In qualitative temporal constraints, variables take their value from a set of temporal relations.
5.5.3 Requirements for a Temporal Knowledge-Based Management System
Following is a list of requirements that must be fulfilled by a temporal knowledge-based management system:
• To be able to answer real-world queries, it must be able to handle large amounts of temporal data.
• It must be able to represent and answer queries about both quantitative and qualitative temporal relationships.
• It must be able to represent causality between temporal events.
• It must be able to distinguish between the history of an event and the time that the system learns about the event. In other words, it must be able to distinguish between valid and transaction times.
• It must offer an expressive query language that also allows updates.
• It must be able to express persistence in a parsimonious way. In other words, when an event happens, it should change only the parts of the system that are affected by the event.
5.6 Database Mediators
A temporal database mediator is used to discover temporal relations, implement temporal granularity conversion, and also discover semantic relationships. This mediator is a computational layer placed between the user interface and the database. The following figure shows the different layers of processing for the user query: "Find all patients who were admitted to the hospital in February." The query is submitted in natural language at the user interface and then, in the Temporal Mediator (TM) layer, it is converted to an SQL query. It is also the job of the Temporal Mediator to perform temporal reasoning to find the correct beginning and end dates for the SQL query.
[Figure: the user submits "Find all patients who were admitted in February" at the user interface (UI); the Temporal Mediator (TM) performs temporal reasoning and converts the request to an SQL query, which is run against the patient records database.]

Fig. 5.4 Different layers of user query processing
5.6.1 Temporal Relation Discovery
The discovery of temporal relations has applications in temporal queries, and in constraint and trigger implementation. The following are some temporal relations:
• A column constraint that implements an after relation between events.
• A query about a before relation between events.
• A query about an equal relationship between intervals.
• A database trigger about a meets relationship between an interval and an event, or between two intervals.
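These relations can be written as simple predicates. The sketch below uses numeric time points, with an event encoded as a single point and an interval as a (start, end) pair; this encoding is our own illustrative choice.

```python
# Illustrative predicates for the temporal relations above.

def before(e1, e2):    # event e1 occurs before event e2
    return e1 < e2

def after(e1, e2):     # event e1 occurs after event e2
    return e1 > e2

def equal(i1, i2):     # two intervals share both endpoints
    return i1 == i2

def meets(i1, i2):     # interval i1 ends exactly when interval i2 (or an event) begins
    return i1[1] == (i2[0] if isinstance(i2, tuple) else i2)

print(meets((1, 3), (3, 7)))  # True
```

A temporal mediator could evaluate such predicates while translating a natural-language condition into the date bounds of an SQL query.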
5.6.2 Semantic Queries on Temporal Data
Traditional temporal data mining focuses heavily on the harvesting of one specific type of temporal information: cause/effect relationships, through the discovery of association rules, classification rules, and so on. However, in this book we take a broader view of temporal data mining, one that encompasses the discovery of structural and semantic relationships, where the latter is done within the context of temporal ontologies. The discovery of structural relationships will be discussed in the next chapter; the discovery of semantic relationships, however, is discussed in this section, because it is very closely intertwined with the representation of time inside the database. Regarding terminology, an ontology is a model of real-world entities and their relationships, while the term semantic relationship denotes a relationship that has a meaning within an ontology. Ontologies are being developed today for a large variety of fields, ranging from medicine to geographical systems to business process management. As a result, there is a growing need to extract information from database systems that can be classified according to the ontology of the field. It is also desirable that this information extraction be done using natural language processing (NLP).
5.7 Temporal Data Types
Temporal data can be distinguished into three types:
• Time series: ordered, real-valued measurements at regular temporal intervals. A time series X = {x1, x2, …, xn} for t = t1, t2, …, tn is a discrete function with value x1 at time t1, value x2 at time t2, and so on. Time series data come in varying frequencies, and the presence of noise is a common phenomenon. A time series can be multivariate or univariate: a multivariate time series is created by more than one variable, while a univariate time series has one underlying variable. Another differentiation of time series is stationary versus non-stationary: a stationary time series has a mean and a variance that do not change over time, while a non-stationary time series has no salient mean and can decrease or increase over time.
• Temporal sequences: these can be timestamped at regular or irregular time intervals.
• Semantic temporal data: these are defined within the context of an ontology.
Finally, an event can be considered a special case of a temporal sequence with one time-stamped element. Similarly, a series of events is another way to denote a temporal sequence, where the elements of the sequence are semantically of the same type, such as earthquakes, alarms, and so on.
5.8 Temporal Data Processing
The temporal data mining process is explained in detail below.
5.8.1 Data Cleaning
In order for the data mining process to yield meaningful results, the data must be appropriately preprocessed to make them "clean", which can have different meanings depending on the application. Here, we will deal with two aspects of data "cleanliness": handling missing data and noise removal.
Missing data
A problem that quite often complicates time series analysis is missing data. There are several reasons for this, such as malfunctioning equipment, human error, and environmental conditions. The specific handling of missing data depends on the specific application and the amount and type of missing data. There are two approaches to dealing with missing data:
• Not filling in the missing values: For example, in the similarity computation section of this chapter, we discuss a method that computes the similarity of two time series by comparing their local slopes. If a segment of data is missing in one of the two time series, we simply ignore that piece in our similarity computation.
• Filling in the missing value with an estimate: For example, in the case of a time series and for small numbers of contiguous missing values, we can use data interpolation to create an estimate of the missing values, using adjacent values as a form of imputation. The greater the distance between the adjacent values used to estimate the intermediate missing values, the greater the interpolation error. The allowable interpolation error and, therefore, the interpolation distance vary from application to application. The simplest type of interpolation that gives reasonable results is linear interpolation.
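The linear interpolation approach described above can be sketched as follows. This is a minimal illustration, assuming missing values are marked with None; leading and trailing gaps, which have no bounding values, are left unfilled:

```python
def interpolate_missing(series):
    """Fill interior runs of missing values (None) by linear
    interpolation between the nearest known neighbours."""
    filled = list(series)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is None:
            left = i - 1                     # index of the last known value
            right = i
            while right < n and filled[right] is None:
                right += 1                   # index of the next known value
            if left >= 0 and right < n:      # interior gap: interpolate
                lo, hi = filled[left], filled[right]
                span = right - left
                for k in range(i, right):
                    filled[k] = lo + (hi - lo) * (k - left) / span
            i = right
        else:
            i += 1
    return filled
```

Note that the wider the gap, the larger the interpolation error, matching the observation in the text.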
Noise removal
Noise is defined as random error that occurs in the data mining process. It can be due to several factors, such as faulty measurement equipment and environmental factors. Two common methods of dealing with noise in data mining are binning and moving-average smoothing. In binning, the data are divided into buckets or bins of equal size; the data in each bin are then smoothed using the bin mean, the bin median, or the bin boundaries. In moving-average smoothing, each value is replaced by the average of the values in a sliding window around it.
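Both smoothing techniques can be sketched in a few lines. This is a minimal illustration assuming the data are already ordered (classical binning often sorts the data first) and that the series length is a multiple of the bin size is not required:

```python
def bin_smooth_means(data, bin_size):
    """Equal-size binning: replace every value in a bin by the bin mean."""
    smoothed = []
    for start in range(0, len(data), bin_size):
        bin_vals = data[start:start + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

def moving_average(data, window):
    """Smooth a series by averaging over a sliding window."""
    return [sum(data[i:i + window]) / window
            for i in range(len(data) - window + 1)]
```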
5.8.2 Data Normalisation
In data normalisation, the data are scaled so that they fall within a prespecified range, such as [0, 1]. Normalisation transforms data to the same "scale" and, therefore, allows direct comparisons among their values. Two common types of normalisation are:
• Min-max normalisation: To do this type of normalisation, we need to know the minimum (x_min) and the maximum (x_max) of the data:
  x' = (x - x_min) / (x_max - x_min)
• Z-score normalisation: Here, the mean (μ) and the standard deviation (σ) of the data are used to normalise them:
  x' = (x - μ) / σ
Z-score normalisation is useful in cases of outliers in the data, that is, data points with extremely low or high values that are not representative of the data and could be due to measurement error.
As we can see, both types of normalisation preserve the shape of the original time series, but z-score normalisation follows the shape more closely.
5.9 Temporal Event Representation
Temporal event representation is explained below.
5.9.1 Event Representation Using Markov Models
In many areas of research, a question that often arises is: what is the probability that a specific event B will happen after an event A? For example, in earthquake research, one would like to determine the probability that there will be a major earthquake after certain types of smaller earthquakes. In speech recognition, one would like to know, for example, the probability that the vowel e will appear after the letter h. Markov models offer a way to answer such questions.
[Figure: a three-state Markov diagram showing enrolment in electrical engineering, business, and mechanical engineering, with transition probabilities labelled on the arcs]
Fig. 5.5 A Markov diagram that describes the probability of program enrolment changes
Here, we will use the fact that a Markov model can be represented using a graph that consists of vertices and arcs.
The vertices show states, while the arcs show transitions between the states.
A more specific example is shown in Figure 5.5, which shows the probability of changing majors in college. Thus, we see that the probability of staying enrolled in electrical engineering is 0.6, while the probability of changing major from electrical engineering to business studies is 0.2 and the probability of switching from electrical engineering to mechanical engineering is also 0.2. As we can see, the sum of the probabilities leaving the "electrical engineering enrolment" state is 1.
5.9.2 A Formalism for Temporal Objects and Repetitions
Allen's temporal algebra is the most widely accepted way to characterise relationships between intervals. Some of these relationships are before, meets, starts, during, overlaps, finishes, and equals, and they are respectively written as .b., .m., .s., .d., .o., .f., and .e. Their inverse relationships can be expressed with a power of -1. For example, .e^-1 denotes the inverse of the equals relationship, which is not equal.
5.10 Classification Techniques
In classification, we assume we have some domain knowledge about the problem we are trying to solve. This domain knowledge can come from domain experts or from sample data from the domain, which constitute the training data set. In this section, we will examine classification techniques that were developed for nontemporal data. However, they can be easily extended, as the examples will demonstrate, to temporal data.
5.10.1 Distance-Based Classifier
As the name implies, the main idea in this family of classifiers is to classify a new sample to the nearest class. Each sample is represented by N features. There are different implementations of this type of classifier, based on (1) how each class is represented and (2) how the distance is computed from each class. There are two variations on how to represent each class. First, in the K-nearest neighbours approach, all samples of a class are used to represent the class. Second, in the exemplar-based approach, each class is represented by a representative sample, commonly referred to as the exemplar of this class. The most widely used metric to compute the distance of the unknown-class sample to the existing classes is the Euclidean distance. However, for some classification techniques other distance measures are more suitable.
As an example, assume we have a new sample x of unknown class with N features, where the i-th feature is denoted as x_i. Then the Euclidean distance from a class sample, whose i-th feature is denoted as y_i, is defined as
  D(x, y) = sqrt( (x_1 - y_1)^2 + (x_2 - y_2)^2 + … + (x_N - y_N)^2 )
K-nearest neighbours
In this type of classifier, the domain knowledge of each class is represented by all of its samples. The new sample X, whose class is unknown, is classified to the class that has the K nearest neighbours to it. K can be 1, 2, 3, and so on. Because all the training samples are stored for each class, this is a computationally expensive method.
Several important considerations about the nearest neighbour algorithm are described below:
• The algorithm's performance is affected by the choice of K. If K is small, then the algorithm can be affected by noise points. If K is too large, then the nearest neighbours can belong to many different classes.
• The choice of the distance measure can also affect the performance of the algorithm. Some distance measures are influenced by the dimensionality of the data. For example, the Euclidean distance's classifying power is reduced as the number of attributes increases.
• The error of the K-NN algorithm asymptotically approaches the Bayes error.
• K-NN is particularly applicable to classification problems with multimodal classes.
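The K-nearest neighbours classifier described above can be sketched as follows, using the Euclidean distance from the previous section and a simple majority vote among the K nearest training samples:

```python
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two N-feature samples."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def knn_classify(training, sample, k):
    """Classify `sample` by majority vote among its k nearest training
    samples; `training` is a list of (features, label) pairs."""
    neighbours = sorted(training, key=lambda fl: euclidean(fl[0], sample))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

Note the cost: every query scans all stored training samples, which is why the text calls the method computationally expensive.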
5.10.2 Bayes Classifier
The Bayes classifier is a statistical classifier, because classification is based on the computation of probabilities, and domain knowledge is also expressed in a probabilistic way.
5.10.3 Decision Tree
Decision trees are widely used in classification because they are easy to construct and use. The first step in decision tree classification is to build the tree. A popular tree construction algorithm is ID3, which uses information theory as its premise. An improvement to the ID3 algorithm that can handle non-categorical data is C4.5. A critical step in the construction of the decision tree is how to order the splitting of the features in the tree. In the ID3 algorithm, the guide for the ordering of the features is entropy. Given a data set, where the values have probabilities p_1, p_2, …, p_n, its entropy is defined as
  H = -(p_1 log2 p_1 + p_2 log2 p_2 + … + p_n log2 p_n)
113/JNU OLE
Entropy shows the amount of randomness in a data set and, for a two-valued data set, varies from 0 to 1. If there is no uncertainty in the data, then the entropy is 0. For example, this can happen if one value has probability 1 and the others have probability 0. If all values in the data set are equally probable, then the amount of randomness in the data is maximised, and the entropy becomes 1. In this case, the amount of information in the data set is maximised. The main idea of the ID3 algorithm is this: a feature is chosen for the next level of the tree if splitting on it produces the most information gain.
5.10.4 Neural Networks in Classification
An artificial neural network (ANN) is a computational model that mimics the human brain in the sense that it consists of a set of connected nodes, similar to neurons. The nodes are connected with weighted arcs. The system is adaptive because it can change its structure based on information that flows through it. In addition to nodes and arcs, neural networks consist of an input layer, hidden layers, and an output layer. The simplest form of ANN is the feed-forward ANN, where information flows in one direction and the output of each neuron is calculated by summing the weighted signals from incoming neurons and passing the sum through an activation function. A commonly used activation function is the sigmoid function:
  f(x) = 1 / (1 + e^(-x))
A common training process for feed-forward neural networks is the back-propagation process, where we go back through the layers and modify the weights. The weight of each neuron is adjusted so that its error is reduced, where a neuron's error is the difference between its expected and actual outputs. The most well-known feed-forward ANN is the perceptron, which consists of only two layers (no hidden layers) and works as a binary classifier. If three or more layers exist in the ANN (at least one hidden layer), then the network is known as a multilayer perceptron.
Another type of widely used feed-forward ANN is the radial-basis function ANN, which consists of three layers and whose activation function is a radial-basis function (RBF). This type of function, as the name implies, has radial symmetry, such as a Gaussian function, and allows a neuron to respond to a local region of the feature space. In other words, the activation of a neuron depends on its distance from a centre vector. In the training phase, the RBF centres are chosen to match the training samples. Neural network classification is becoming very popular, and one of its advantages is that it is resistant to noise. The input layer consists of the attributes used in the classification, and the output nodes correspond to the classes. Regarding hidden nodes, too many nodes lead to over-fitting, while too few nodes can lead to reduced classification accuracy. Initially, each arc is assigned a random weight, which is then modified in the learning process.
5.11 Sequence Mining
Sequence mining is explained in the following sections.
5.11.1 Apriori Algorithm and Its Extension to Sequence Mining
A sequence is a time-ordered list of objects, in which each object consists of an item set, with an item set consisting of all items that appear (or were bought) together in a transaction (or session). Note that the order of items in an item set does not matter. A sequence database consists of tuples, where each tuple consists of a customer id, a transaction id, and an item set. The purpose of sequence mining is to find frequent sequences that exceed a user-specified support threshold. Note that the support of a sequence is the percentage of tuples in the database that contain the sequence. An example of a sequence is {(ABD), (CDA)}, where ABD indicates the items that were purchased in the first transaction of a customer and CDA indicates the items that were purchased in a later transaction of the same customer.
The Apriori algorithm is the most extensively used algorithm for the discovery of frequent item sets and association rules. The main concepts of the Apriori algorithm are as follows:
• Any subset of a frequent item set is a frequent item set.
• The set of candidate item sets of size k is called C_k.
• The set of frequent item sets that also satisfy the minimum support constraint is known as L_k. This is the seed set used for the next pass over the data.
• C_{k+1} is generated by joining L_k with itself. The item sets of each pass have one more element than the item sets of the previous pass.
• Similarly, L_{k+1} is then generated by eliminating from C_{k+1} those elements that do not satisfy the minimum support rule.
As the candidate sequences are generated by starting with the smaller sequences and progressively increasing the sequence size, Apriori is called a breadth-first approach. An example is shown below, where we require the minimum support to be two. The left column shows the transaction ID and the right column shows the items that were purchased.
5.11.2 The GSP Algorithm
GSP was proposed by the same team as the Apriori algorithm, and it can be up to 20 times faster than the Apriori algorithm, because it has a more intelligent selection of candidates for each step and introduces time constraints into the search space. GSP is formulated to address three cases:
• The presence of time constraints that specify a maximum time between adjacent elements in a pattern. The time window can apply to items that are bought in the same transaction or in different transactions.
• The relaxation of the restriction that the items in an element of a sequential pattern must come from the same transaction.
• Allowing elements in a sequence to consist of items that belong to a taxonomy.
The algorithm has very good scalability as the size of the data increases. The GSP algorithm is similar to the Apriori algorithm. It makes multiple passes over the data, as shown below:
• In the first pass, it finds the frequent sequences, in other words, the sequences that have minimum support. These become the seed set for the next iteration.
• At each subsequent iteration, each candidate sequence has one more item than the seed sequences.
Two key innovations in the GSP algorithm are how candidates are generated and how candidates are counted.
Candidate Generation
The main idea here is the definition of a contiguous subsequence. Assume we are given a sequence S = {S_1, S_2, …, S_N}; then a subsequence C is defined as a contiguous subsequence if any of the following constraints is satisfied:
• C is extracted from S by dropping an item from either the beginning or the end of the sequence (S_1 or S_N).
• C is extracted from S by dropping an item from a sequence element that has at least two items.
• C is a contiguous subsequence of C′, and C′ is a contiguous subsequence of S.
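The first two (generative) rules above can be sketched as follows; the third rule is the transitive closure of applying them repeatedly. Sequences are represented here as lists of item sets, an encoding assumed for this sketch:

```python
def contiguous_subsequences(seq):
    """One-step contiguous subsequences of a sequence of item sets:
    drop an item from the first or last element, or from any interior
    element that has at least two items."""
    results = []
    last = len(seq) - 1
    for pos, element in enumerate(seq):
        edge = pos in (0, last)
        if not edge and len(element) < 2:
            continue  # interior single-item elements cannot lose an item
        for item in sorted(element):
            reduced = element - {item}
            new_seq = list(seq)
            if reduced:
                new_seq[pos] = reduced
            else:
                del new_seq[pos]  # element emptied; drop it from the sequence
            results.append(new_seq)
    return results
```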
Summary
• The technical progress in computerised data acquisition and storage results in the growth of vast databases. With the continuous increase and accumulation, the huge amounts of computerised data have far exceeded the human ability to completely interpret and use them.
• Spatial data mining and knowledge discovery (SDMKD) is the efficient extraction of hidden, implicit, interesting, previously unknown, potentially useful, ultimately understandable, spatial or non-spatial knowledge (rules, regularities, patterns, constraints) from incomplete, noisy, fuzzy, random and practical data in large spatial databases.
• The cloud model is a model of the uncertainty transition between qualitative and quantitative analysis, that is, a mathematical model of the uncertainty transition between a linguistic term of a qualitative concept and its numerical representation data.
• The design-based approach, or classical sampling theory approach, to spatial sampling views the population of values in the region as a set of unknown values which are, apart from any measurement error, fixed in value.
• The model-based approach, or superpopulation approach, to spatial sampling views the population of values in the study region as but one realisation of some stochastic model.
• A database model depicts the way that the database management system stores the data and manages their relations.
• A data warehouse (DW) is a repository of data that can be used in support of business decisions. Many data warehouses have a time dimension and, therefore, support the idea of valid time.
• Temporal constraints can be either qualitative or quantitative.
• A temporal database mediator is used to discover temporal relations, implement temporal granularity conversion, and also discover semantic relationships.
• In classification, we assume we have some domain knowledge about the problem we are trying to solve.
• A sequence is a time-ordered list of objects, in which each object consists of an item set, with an item set consisting of all items that appear together in a transaction.
References
• Han, J., Kamber, M. and Pei, J., 2011. Data Mining: Concepts and Techniques, 3rd ed., Elsevier.
• Mitsa, T., 2009. Temporal Data Mining, Chapman & Hall/CRC.
• Kriegel, H. P., Spatial Data Mining [Online] Available at: <http://www.dbs.informatik.uni-muenchen.de/Forschung/KDD/SpatialKDD/>. [Accessed 9 September 2011].
• Lin, W., Orgun, M. A. and Williams, G. J., An Overview of Temporal Data Mining [Online PDF] Available at: <http://togaware.redirectme.net/papers/adm02.pdf>. [Accessed 9 September 2011].
• University of Magdeburg, 2007. 3D Spatial Data Mining on Document Sets [Video Online] Available at: <http://www.youtube.com/watch?v=jJWl4Jm-yqI>. [Accessed 12 September 2011].
• Berlingerio, M., 2009. Temporal mining for interactive workflow data analysis [Video Online] Available at: <http://videolectures.net/kdd09_berlingerio_tmiwda/>. [Accessed 12 September 2011].
Recommended Reading
• Pujari, A. K., 2001. Data mining techniques, 4th ed., Universities Press.
• Stein, A., Shi, W. and Bijker, W., 2008. Quality aspects in spatial data mining, CRC Press.
• Roddick, J. F. and Hornsby, K., 2001. Temporal, spatial, and spatio-temporal data mining, Springer.
Self Assessment
1. Which of the following statements is true?
a. The technical progress in computerised data acquisition and storage results in the growth of vast web mining.
b. In 1998, knowledge discovery in databases was further proposed.
c. Metadata are more complex, more changeable and bigger than common affair datasets.
d. Spatial data includes not only positional data and attribute data, but also spatial relationships among spatial entities.

2. Besides tabular data, there are vector and raster graphic data in __________.
a. metadata
b. database
c. spatial database
d. knowledge discovery

3. Which of the following is a process of discovering a form of rules plus exceptions at hierarchical view-angles with various thresholds?
a. SDMKD
b. DMKD
c. KDD
d. EM

4. The ____________ is more generalised, condensed and understandable than data.
a. knowledge
b. database
c. mining
d. extraction

5. Match the columns:
1. Probability theory    A. Mine spatial data with randomness on the basis of stochastic probabilities.
2. Spatial statistics    B. Clustering analysis is a branch.
3. Evidence theory       C. Mine spatial data via belief function and possibility function.
4. Fuzzy sets            D. Mine spatial data with fuzziness on the basis of a fuzzy membership function that depicts an inaccurate probability.
a. 1-A, 2-B, 3-C, 4-D
b. 1-D, 2-A, 3-C, 4-B
c. 1-B, 2-A, 3-C, 4-D
d. 1-C, 2-B, 3-A, 4-D
6. Design-based approach is also called ____________.
a. classical sampling theory approach
b. model-based approach
c. DMKD
d. SDMKD

7. Which of the following is a model of the uncertainty transition between qualitative and quantitative analysis?
a. Dimensional modelling
b. Cloud model
c. Web mining
d. SDMKD

8. Superpopulation approach is also known as _______________.
a. classical sampling theory approach
b. model-based approach
c. DMKD
d. SDMKD

9. A database system consists of three layers: physical, logical, and ____________.
a. internal
b. inner
c. central
d. external

10. Match the columns:
1. Snapshot database     A. They support only valid time.
2. Rollback database     B. They support both valid and transaction times.
3. Historical database   C. They keep the most recent version of the data.
4. Temporal database     D. They support only the concept of transaction time.
a. 1-C, 2-D, 3-A, 4-B
b. 1-A, 2-B, 3-C, 4-D
c. 1-D, 2-C, 3-B, 4-A
d. 1-B, 2-A, 3-D, 4-C
Chapter VI
Application and Trends of Data Mining

Aim
The aim of this chapter is to:
• introduce applications of data mining
• analyse spatial data mining
• explore multimedia data mining

Objectives
The objectives of this chapter are to:
• explicate text mining
• describe query processing techniques
• elucidate trends in data mining

Learning outcome
At the end of this chapter, you will be able to:
• comprehend system products and research prototypes
• enlist additional themes on data mining
• understand the use of data mining in different sectors, such as education, telecommunication, finance and so on
6.1 Introduction
Data mining is the process of extracting interesting (nontrivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data. It is the set of activities used to find new, hidden or unexpected patterns in data. Using information contained within a data warehouse, data mining can often provide answers to questions about an organisation that a decision maker has previously not thought to ask, such as:
• Which products should be promoted to a particular customer?
• What is the probability that a certain customer will respond to a planned promotion?
• Which securities will be most profitable to buy or sell during the next trading session?
• What is the likelihood that a certain customer will default on or pay back a scheduled loan?
• What is the appropriate medical diagnosis for this patient?
These types of questions can be answered easily if the information hidden among the petabytes of data in your databases can be located and utilised. In the following sections, we will discuss the applications of and trends in the field of data mining.
6.2 Applications of Data Mining
An important feature of object-relational and object-oriented databases is their capability of storing, accessing and modelling complex structure-valued data, such as set- and list-valued data and data with nested structures. A set-valued attribute may be of homogeneous or heterogeneous type. Typically, set-valued data can be generalised by:
• generalisation of each value in the set to its corresponding higher-level concept
• derivation of the general behaviour of the set, such as the number of elements in the set, the types or value ranges in the set, the weighted average for numerical data, or the major clusters formed by the set.
Example: generalisation of a set-valued attribute
Suppose that the expertise of a person is a set-valued attribute containing the set of values {tennis, hockey, NFS, violin, prince of Persia}. This set can be generalised to a set of high-level concepts, such as {sports, music, computer games}, or into the number 5 (that is, the number of activities in the set). Moreover, a count can be associated with a generalised value to indicate how many elements are generalised to that value, as in {sports (3), music (1), computer games (1)}, where sports (3) indicates three kinds of sports, and so on.
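The count-preserving generalisation in the example above can be sketched as a lookup through a concept hierarchy. The mapping below is assumed for illustration (it follows the book's counts, which place NFS among the sports):

```python
from collections import Counter

# Hypothetical concept hierarchy: low-level value -> higher-level concept.
HIERARCHY = {
    "tennis": "sports", "hockey": "sports", "NFS": "sports",
    "violin": "music",
    "prince of Persia": "computer games",
}

def generalise(values, hierarchy):
    """Generalise a set-valued attribute to higher-level concepts,
    keeping a count of how many elements map to each concept."""
    return dict(Counter(hierarchy[v] for v in values))
```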
6.2.1 Aggregation and Approximation in Spatial and Multimedia Data Generalisation
Aggregation and approximation are two further important means of generalisation. They are especially useful for generalising attributes with large sets of values, complex structures, and spatial or multimedia data.
Example: spatial aggregation and approximation
Suppose that we have different pieces of land used for several agricultural purposes, such as the planting of vegetables, grains, and fruits. These pieces can be merged or aggregated into one large piece of agricultural land by a spatial merge. However, such a piece of agricultural land may contain highways, houses, and small stores. If the majority of the land is used for agriculture, the scattered regions for other purposes can be ignored, and the whole region can be claimed as an agricultural area by approximation.
6.2.2 Generalisation of Object Identifiers and Class/Subclass Hierarchies
An object identifier can be generalised as follows:
• First, the object identifier is generalised to the identifier of the lowest subclass to which the object belongs.
• The identifier of this subclass can then, in turn, be generalised to a higher-level class/subclass identifier by climbing up the class/subclass hierarchy.
• Similarly, a class or a subclass can be generalised to its corresponding superclass(es) by climbing up its associated class/subclass hierarchy.
6.2.3 Generalisation of Class Composition Hierarchies
An attribute of an object may be composed of or described by another object, some of whose attributes may in turn be composed of or described by other objects, thus forming a class composition hierarchy. Generalisation on a class composition hierarchy can be viewed as generalisation on a set of nested structured data (which are possibly infinite, if the nesting is recursive).
6.2.4 Construction and Mining of Object Cubes
In an object database, data generalisation and multidimensional analysis are not applied to individual objects, but to
classes of objects. Since a set of objects in a class may share many attributes and methods, and the generalisation of
each attribute and method may apply a sequence of generalisation operators, the major issue becomes how to make
the generalisation processes cooperate among different attributes and methods in the class(es).
6.2.5 Generalisation-Based Mining of Plan Databases by Divide-and-Conquer
A plan consists of a variable sequence of actions. A plan database, or simply a planbase, is a large collection of plans. Plan mining is the task of mining significant patterns or knowledge from a planbase.
6.3 Spatial Data Mining
A spatial database stores a large amount of space-related data, such as maps, preprocessed remote sensing or
medical imaging data, and VLSI chip layout data. Spatial data mining refers to the extraction of knowledge, spatial
relationships, or other interesting patterns not explicitly stored in spatial databases.
[Figure: spatial mining architecture combining the ODM engine with a spatial engine, drawing on original data, materialised data (spatial binning, proximity, collocation materialisation), and spatial thematic data layers to produce mining results]
Fig. 6.1 Spatial mining
(Source: http://dataminingtools.net/wiki/applications_of_data_mining.php)
6.3.1 Spatial Data Cube Construction and Spatial OLAP
As with relational data, we can integrate spatial data to construct a data warehouse that facilitates spatial data mining.
A spatial data warehouse is a subject-oriented, integrated, time variant, and non-volatile collection of both spatial
and non-spatial data in support of spatial data mining and spatial-data-related decision-making processes.
There are three types of dimensions in a spatial data cube:
A non-spatial dimension •
A spatial-to-non-spatial dimension •
A spatial-to-spatial dimension •
We can distinguish two types of measures in a spatial data cube:
A numerical measure contains only numerical data •
A spatial measure contains a collection of pointers to spatial objects •
6.3.2 Mining Spatial Association and Co-location Patterns
For mining spatial associations related to the spatial predicate close to, we can first collect the candidates that pass the minimum support threshold by:
• applying certain rough spatial evaluation algorithms, for example, using an MBR structure (which registers only two spatial points rather than a set of complex polygons)
• evaluating the relaxed spatial predicate g close to, which is a generalised close to covering a broader context that includes close to, touch, and intersect.
Spatial clustering methods: Spatial data clustering identifies clusters, or densely populated regions, according to some distance measurement in a large, multidimensional data set.
Spatial classification and spatial trend analysis: Spatial classification analyses spatial objects to derive classification schemes in relevance to certain spatial properties, such as the neighbourhood of a district, highway, or river.
Example: spatial classification
Suppose that you would like to classify regions in a province into rich versus poor according to the average family income. In doing so, you would like to identify the important spatial-related factors that determine a region's classification. Many properties are associated with spatial objects, such as hosting a university, containing interstate highways, being near a lake or ocean, and so on. These properties can be used for relevance analysis and to find interesting classification schemes. Such classification schemes may be represented in the form of decision trees or rules.
6.3.3 Mining Raster Databases
Spatial database systems usually handle vector data that consist of points, lines, polygons (regions), and their
compositions, such as networks or partitions. Typical examples of such data include maps, design graphs, and 3-D
representations of the arrangement of the chains of protein molecules.
6.4 Multimedia Data Mining
A multimedia database system stores and manages a large collection of multimedia data, such as audio, video, image, graphics, speech, text, document, and hypertext data, which contain text, text markups, and linkages. When searching for similarities in multimedia data, we can search on either the data description or the data content. Approaches include:
• colour histogram-based signature
• multifeature composed signature
• wavelet-based signature
• wavelet-based signature with region-based granularity
6.4.1 Multidimensional Analysis of Multimedia Data
To facilitate the multidimensional analysis of large multimedia databases, multimedia data cubes can be designed
and constructed in a manner similar to that for traditional data cubes from relational data.
A multimedia data cube can contain additional dimensions and measures for multimedia information, such as colour,
texture, and shape.
6.4.2 Classification and Prediction Analysis of Multimedia Data
Classification and predictive modelling can be used for mining multimedia data, especially in scientific research, such as astronomy, seismology, and geoscientific research.
Example: classification and prediction analysis of astronomy data
Taking sky images that have been carefully classified by astronomers as the training set, we can construct models for the recognition of galaxies, stars, and other stellar objects, based on properties like magnitudes, areas, intensity, image moments, and orientation. A large number of sky images taken by telescopes or space probes can then be tested against the constructed models in order to identify new celestial bodies. Similar studies have successfully been performed to identify volcanoes on Venus.
6.4.3 Mining Associations in Multimedia Data
• associations between image content and non-image content features
• associations among image contents that are not related to spatial relationships
• associations among image contents related to spatial relationships
6.4.4 Audio and Video Data Mining
A vast amount of audiovisual information is now available in digital form, in digital archives, on the World Wide Web, in broadcast data streams, and in personal and professional databases, and hence there is a need to mine it.
6.5 Text Mining
Text data analysis and information retrieval (IR) is a field that has been developing in parallel with database systems for many years. Basic measures for text retrieval are precision and recall.
Fig. 6.2 Text mining (the figure depicts a pipeline: collect data → parse → repository → apply text mining
algorithms → optimise → view results, with unrestricted exploratory freedom for the analyst)
(Source: http://dataminingtools.net/wiki/applications_of_data_mining.php)
Precision: This is the percentage of retrieved documents that are in fact relevant to the query (that is, “correct”
responses). It is formally defined as precision = |{relevant} ∩ {retrieved}| / |{retrieved}|.
Recall: This is the percentage of documents that are relevant to the query and were, in fact, retrieved. It is
formally defined as recall = |{relevant} ∩ {retrieved}| / |{relevant}|.
Text Retrieval Methods
• Document selection methods
• Document ranking methods
Text indexing techniques
• Inverted indices
• Signature files
6.6 Query Processing Techniques
Once an inverted index is created for a document collection, a retrieval system can answer a keyword query quickly
by looking up which documents contain the query keywords.
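As a sketch of these ideas, the following builds a toy inverted index, answers a conjunctive keyword query by intersecting posting sets, and scores the result with the precision and recall measures defined above. The documents and the relevance judgements are invented for illustration.

```python
# Toy inverted index: map each term to the set of documents containing it.
docs = {
    1: "data mining extracts patterns from data",
    2: "text mining applies mining to documents",
    3: "cooking recipes and kitchen tips",
}

index = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        index.setdefault(term, set()).add(doc_id)

def query(*terms):
    """Answer a conjunctive keyword query by intersecting posting sets."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

retrieved = query("data")   # documents that contain the keyword "data"
relevant = {1, 2}           # hand-labelled relevance judgements (invented)

precision = len(retrieved & relevant) / len(retrieved)
recall = len(retrieved & relevant) / len(relevant)
print(retrieved, precision, recall)   # {1} 1.0 0.5
```

Document 2 is relevant but does not contain the literal keyword, so recall drops to 0.5 while precision stays at 1.0, which is exactly the trade-off the two measures capture.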
6.6.1 Ways of Dimensionality Reduction for Text
• Latent Semantic Indexing
• Locality Preserving Indexing
• Probabilistic Latent Semantic Indexing
6.6.2 Probabilistic Latent Semantic Indexing Schemas
• Keyword-Based Association Analysis
• Document Classification Analysis
• Document Clustering Analysis
6.6.3 Mining the World Wide Web
The World Wide Web serves as a huge, widely distributed, global information service center for news, advertisements,
consumer information, financial management, education, government, e-commerce, and many other information
services. The Web also contains a rich and dynamic collection of hyperlink information and Web page access and
usage information, providing rich sources for data mining.
6.6.4 Challenges
• The Web seems to be too huge for effective data warehousing and data mining
• The complexity of Web pages is far greater than that of any traditional text document collection
• The Web is a highly dynamic information source
• The Web serves a broad diversity of user communities
• Only a small portion of the information on the Web is truly relevant or useful
Authoritative Web pages:
Suppose you would like to search for Web pages relating to a given topic, such as financial investing. In addition
to retrieving pages that are relevant, you also hope that the pages retrieved will be of high quality, or authoritative
on the topic.
6.7 Data Mining for Healthcare Industry
The past decade has seen an explosive growth in biomedical research, ranging from the development of new
pharmaceuticals and advances in cancer therapies to the identification and study of the human genome by discovering
large-scale sequencing patterns and gene functions. Recent research in DNA analysis has led to the discovery of genetic
causes for many diseases and disabilities, as well as approaches for disease diagnosis, prevention and treatment.
6.8 Data Mining for Finance
Most banks and financial institutions offer a wide variety of banking services (such as checking, saving, and business
and individual customer transactions), credit (such as business, mortgage, and automobile loans), and investment
services (such as mutual funds). Some also offer insurance services and stock services. Financial data collected in
the banking and financial industry is often relatively complete, reliable and high quality, which facilitates systematic
data analysis and data mining. Data mining can also help in fraud detection, for example, by detecting a group of
people who stage accidents to collect insurance money.
6.9 Data Mining for Retail Industry
The retail industry collects huge amounts of data on sales, customer shopping history, goods transportation and
consumption, service records, and so on. The quantity of data collected continues to expand rapidly, especially
due to the increasing ease, availability and popularity of business conducted on the web, or e-commerce. The retail
industry thus provides a rich source for data mining. Retail data mining can help identify customer behaviour, discover
customer shopping patterns and trends, enhance the quality of customer service, achieve better customer retention
and satisfaction, improve goods consumption ratios, design more effective goods transportation and distribution
policies, and reduce the cost of business.
6.10 Data Mining for Telecommunication
The telecommunication industry has rapidly evolved from offering local and long distance telephone services to
offering many other comprehensive communication services, including voice, fax, pager, cellular phone, image, e-mail,
computer and web data transmission and other data traffic. The integration of telecommunication, computer networks,
the Internet and numerous other means of communication and computing is underway. Moreover, with the deregulation
of the telecommunication industry in many countries and the development of new computer and communication
technologies, the telecommunication market is rapidly expanding and highly competitive. This creates a great
demand for data mining in order to help understand the business involved, identify telecommunication patterns, catch
fraudulent activities, make better use of resources, and enhance the quality of service.
6.11 Data Mining for Higher Education
An important challenge that higher education faces today is predicting the paths of students and alumni. Which students
will enrol in particular course programmes? Who will need additional assistance in order to graduate? Meanwhile,
additional issues, such as enrolment management and time-to-degree, continue to exert pressure on colleges to
search for new and faster solutions. Institutions can better address these students and alumni through the analysis
and presentation of data. Data mining has quickly emerged as a highly desirable tool for using current reporting
capabilities to uncover and understand hidden patterns in vast databases.
6.12 Trends in Data Mining
Different types of data are available for data mining tasks, and thus data mining approaches pose many challenging
research issues. The design of standard data mining languages, the development of effective and
efficient data mining methods and systems, the construction of interactive and integrated data mining environments,
and the application of data mining to solve large application problems are important tasks for data mining researchers
and data mining system and application developers. Here, we will discuss some of the trends in data mining that
reflect the pursuit of these challenges:
6.12.1 Application Exploration
Earlier, data mining was mainly used for business purposes, to overcome competitors. But as data mining is
becoming more popular, it is gaining wide acceptance in other fields too, such as biomedicine, the stock market, fraud
detection, telecommunication and many more, and many new explorations are being done for this purpose. In
addition, data mining for business continues to expand as e-commerce and marketing become mainstream elements
of the retail industry.
6.12.2 Scalable Data Mining Methods
The current data mining methods are capable of handling only a specific type of data and a limited amount of data, but
as data is expanding at a massive rate, there is a need to develop new data mining methods which are scalable and
can handle different types of data and large volumes of data.
The data mining methods should also be more interactive and user friendly. One important direction towards improving
the overall efficiency of the mining process while increasing user interaction is constraint-based mining. This provides
the user with more control by allowing the specification and use of constraints to guide data mining systems in their
search for interesting patterns.
6.12.3 Combination of Data Mining with Database Systems, Data Warehouse Systems, and Web Database
Systems
Database systems, data warehouse systems, and the WWW are loaded with huge amounts of data and have thus become
the major information processing systems. It is important to ensure that data mining serves as an essential data analysis
component that can be easily integrated into such an information-processing environment. The desired architecture
for a data mining system is tight coupling with database and data warehouse systems. Transaction management,
query processing, online analytical processing and online analytical mining should be integrated into one unified
framework.
6.12.4 Standardisation of Data Mining Language
Today, a few data mining systems are commercially available in the market, such as Microsoft’s SQL Server 2005, IBM
Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, DBMiner and many more, but a standard data
mining language or other standardisation efforts will support the orderly development of data mining solutions and
improved interoperability among multiple data mining systems and functions.
6.12.5 Visual Data Mining
It is rightly said that a picture is worth a thousand words. So, if the result of the mined data can be shown in visual
form, it will further enhance the worth of the mined data. Visual data mining is an effective way to discover knowledge
from huge amounts of data. The systematic study and development of visual data mining techniques will promote
its use in data mining analysis.
6.12.6 New Methods for Mining Complex Types of Data
Complex types of data, like geospatial, multimedia, time series, sequence and text data, pose important
research areas in the field of data mining. There is still a huge gap between the needs of these applications and the
available technology.
6.12.7 Web Mining
The World Wide Web is a huge, globally distributed collection of news, advertisements, consumer records,
financial, educational, governmental, e-commerce and many other services. The WWW also contains a vast and dynamic
collection of hyperlinked information, providing a huge source for data mining. At the same time, the web
poses great challenges for efficient resource and knowledge discovery.
Since data mining is a young discipline with wide and diverse applications, there is still a nontrivial gap between
general principles of data mining and domain-specific, effective data mining tools for particular applications. We have
discussed a few application domains of data mining (such as finance, the retail industry and telecommunication) and trends
in data mining, which include further efforts towards the exploration of new application areas, new methods for handling
complex data types, algorithm scalability, constraint-based mining and visualisation methods, the integration of
data mining with data warehousing and database systems, the standardisation of data mining languages, and data
privacy protection and security.
6.13 System Products and Research Prototypes
Although data mining is a relatively young field with many issues that still need to be researched in depth, many
off-the-shelf data mining system products and domain-specific data mining application software are available.
As a discipline, data mining has a relatively short history and is constantly evolving—new data mining systems
appear on the market every year; new functions, features, and visualisation tools are added to existing systems
on a constant basis; and efforts toward the standardisation of data mining language are still underway. Therefore,
it is not our intention in this book to provide a detailed description of commercial data mining systems. Instead,
we describe the features to consider when selecting a data mining product and offer a quick introduction to a few
typical data mining systems. Reference articles, websites, and recent surveys of data mining systems are listed in
the bibliographic notes.
6.13.1 Choosing a Data Mining System
With many data mining system products available in the market, you may ask, “What kind of system should I
choose?” Some people may be under the impression that data mining systems, like many commercial relational
database systems, share the same well-defined operations and a standard query language, and behave similarly on
common functionalities. If such were the case, the choice would depend more on the systems’ hardware platform,
compatibility, robustness, scalability, price, and service. Unfortunately, this is far from reality. Many commercial
data mining systems have little in common with respect to data mining functionality or methodology and may even
work with completely different kinds of data sets. To choose a data mining system that is appropriate for your task,
it is essential to have a multidimensional view of data mining systems. In general, data mining systems should be
assessed based on the following multiple features:
• Data types: Most data mining systems that are available in the market handle formatted, record-based, relational-like
data with numerical, categorical, and symbolic attributes. The data could be in the form of ASCII text,
relational database data, or data warehouse data. It is important to check what exact format(s) each system you
are considering can handle. Some kinds of data or applications may require specialised algorithms to search for
patterns, and so their requirements may not be handled by off-the-shelf, generic data mining systems. Instead,
specialised data mining systems may be used, which mine either text documents, geospatial data, multimedia
data, stream data, time-series data, biological data, or Web data, or are dedicated to specific applications (such
as finance, the retail industry, or telecommunications). Moreover, many data mining companies offer customised
data mining solutions that incorporate essential data mining functions or methodologies.
• System issues: A given data mining system may run on only one operating system or on several. The most
popular operating systems that host data mining software are UNIX/Linux and Microsoft Windows. There are
also data mining systems that run on Macintosh, OS/2, and others. Large industry-oriented data mining systems
often adopt a client/server architecture, where the client could be a personal computer, and the server could be
a set of powerful parallel computers. A recent trend has data mining systems providing Web-based interfaces
and allowing XML data as input and/or output.
• Data sources: This refers to the specific data formats on which the data mining system will operate. Some
systems work only on ASCII text files, whereas many others work on relational data or data warehouse data,
accessing multiple relational data sources. It is essential that a data mining system supports ODBC connections
or OLE DB for ODBC connections. These ensure open database connections, that is, the ability to access
any relational data (including those in IBM/DB2, Microsoft SQL Server, Microsoft Access, Oracle, Sybase,
and so on), as well as formatted ASCII text data.
• Data mining functions and methodologies: Data mining functions form the core of a data mining system. Some
data mining systems provide only one data mining function, such as classifcation. Others may support multiple
data mining functions, such as concept description, discovery-driven OLAP analysis, association mining, linkage
analysis, statistical analysis, classification, prediction, clustering, outlier analysis, similarity search, sequential
pattern analysis, and visual data mining. For a given data mining function (such as classification), some systems
may support only one method, whereas others may support a wide variety of methods (such as decision tree
analysis, Bayesian networks, neural networks, support vector machines, rule-based classification, k-nearest-
neighbour methods, genetic algorithms, and case-based reasoning). Data mining systems that support multiple
data mining functions and multiple methods per function provide the user with greater flexibility and analysis
power. Many problems may require users to try a few different mining functions or incorporate several together,
and different methods can be more effective than others for different kinds of data. In order to take advantage of
the added flexibility, however, users may require further training and experience. Thus, such systems should also
provide novice users with convenient access to the most popular function and method, or to default settings.
• Coupling data mining with database and/or data warehouse systems: A data mining system should be coupled
with a database and/or data warehouse system, where the coupled components are seamlessly integrated into a
uniform information processing environment. In general, there are four forms of such coupling: no coupling,
loose coupling, semi tight coupling, and tight coupling. Some data mining systems work only with ASCII data
files and are not coupled with database or data warehouse systems at all. Such systems have difficulties using
the data stored in database systems and handling large data sets efficiently. In data mining systems that are
loosely coupled with database and data warehouse systems, the data are retrieved into a buffer or main memory
by database or warehouse operations, and then mining functions are applied to analyse the retrieved data. These
systems may not be equipped with scalable algorithms to handle large data sets when processing data mining
queries. The coupling of a data mining system with a database or data warehouse system may be semi tight,
providing the efficient implementation of a few essential data mining primitives (such as sorting, indexing,
aggregation, histogram analysis, multiway join, and the precomputation of some statistical measures). Ideally, a
data mining system should be tightly coupled with a database system in the sense that the data mining and data
retrieval processes are integrated by optimising data mining queries deep into the iterative mining and retrieval
process. Tight coupling of data mining with OLAP-based data warehouse systems is also desirable so that data
mining and OLAP operations can be integrated to provide OLAP-mining features.
• Scalability: Data mining has two kinds of scalability issues: row (or database size) scalability and column (or
dimension) scalability. A data mining system is considered row scalable if, when the number of rows is enlarged
10 times, it takes no more than 10 times to execute the same data mining queries. A data mining system is
considered column scalable if the mining query execution time increases linearly with the number of columns
(or attributes or dimensions). Due to the curse of dimensionality, it is much more challenging to make a system
column scalable than row scalable.
• Visualisation tools: “A picture is worth a thousand words”—this is very true in data mining. Visualisation in
data mining can be categorised into data visualisation, mining result visualisation, mining process visualisation,
and visual data mining. The variety, quality, and flexibility of visualisation tools may strongly influence the
usability, interpretability, and attractiveness of a data mining system.
• Data mining query language and graphical user interface: Data mining is an exploratory process. An easy-to-use
and high-quality graphical user interface is necessary in order to promote user-guided, highly interactive
data mining. Most data mining systems provide user-friendly interfaces for mining. However, unlike relational
database systems, where most graphical user interfaces are constructed on top of SQL (which serves as a standard,
well-designed database query language), most data mining systems do not share any underlying data mining
query language. Lack of a standard data mining language makes it difficult to standardise data mining products
and to ensure the interoperability of data mining systems. Recent efforts at defining and standardising data
mining query languages include Microsoft’s OLE DB for Data Mining.
6.14 Additional Themes on Data Mining
Due to the broad scope of data mining and the large variety of data mining methodologies, not all the themes of
data mining can be thoroughly covered. However, some of the important themes of data mining are mentioned
below.
6.14.1 Theoretical Foundations of Data Mining
Research on the theoretical foundations of data mining has yet to mature. A solid and systematic theoretical foundation
is important because it can help to provide a coherent framework for the development, evaluation, and practice of
data mining technology. Several theories for the basis of data mining include the following:
• Data reduction: In this theory, the basis of data mining is to reduce the data representation. Data reduction
trades accuracy for speed in response to the need to obtain quick approximate answers to queries on very
large databases. Data reduction techniques include singular value decomposition (the driving element behind
principal components analysis), wavelets, regression, log-linear models, histograms, clustering, sampling, and
the construction of index trees.
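Two of the reduction techniques named above, histograms and sampling, can be sketched in a few lines; the numeric attribute values below are invented for illustration.

```python
import random

values = [2, 3, 5, 7, 8, 11, 13, 14, 17, 19]  # an invented numeric attribute

# Equal-width histogram: replace the raw values with per-bucket counts.
def equal_width_histogram(data, n_buckets):
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in data:
        # clamp the maximum value into the last bucket
        b = min(int((v - lo) / width), n_buckets - 1)
        counts[b] += 1
    return counts

hist = equal_width_histogram(values, 4)
print(hist)          # [3, 2, 3, 2]

# Simple random sampling: keep only a fraction of the rows.
random.seed(0)
sample = random.sample(values, 4)
print(len(sample))   # 4
```

Both representations trade accuracy for speed: the histogram answers approximate range queries from four counts instead of ten values, and the sample supports quick approximate aggregates.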
• Data compression: According to this theory, the basis of data mining is to compress the given data by encoding in
terms of bits, association rules, decision trees, clusters, and so on. Encoding based on the minimum description
length principle states that the “best” theory to infer from a set of data is the one that minimises the length of
the theory and the length of the data when encoded, using the theory as a predictor for the data. This encoding
is typically in bits.
• Pattern discovery: In this theory, the basis of data mining is to discover patterns occurring in the database, such
as associations, classification models, sequential patterns, and so on. Areas such as machine learning, neural
network, association mining, sequential pattern mining, clustering, and several other subfields contribute to
this theory.
• Probability theory: This is based on statistical theory. In this theory, the basis of data mining is to discover joint
probability distributions of random variables, for example, Bayesian belief networks or hierarchical Bayesian
models.
• Microeconomic view: The microeconomic view considers data mining as the task of finding patterns that are
interesting only to the extent that they can be used in the decision-making process of some enterprise (for
example, regarding marketing strategies and production plans). This view is one of utility, in which patterns
are considered interesting if they can be acted on. Enterprises are regarded as facing optimisation problems,
where the objective is to maximise the utility or value of a decision. In this theory, data mining becomes a nonlinear
optimisation problem.
• Inductive databases: According to this theory, a database schema consists of data and patterns that are stored in
the database. Data mining is therefore the problem of performing induction on databases, where the task is to
query the data and the theory (that is patterns) of the database. This view is popular among many researchers
in database systems.
These theories are not mutually exclusive. For example, pattern discovery can also be seen as a form of data reduction
or data compression. Ideally, a theoretical framework should be able to model typical data mining tasks (such as
association, classifcation, and clustering), have a probabilistic nature, be able to handle different forms of data, and
consider the iterative and interactive essence of data mining. Further efforts are required toward the establishment
of a well-defined framework for data mining, which satisfies these requirements.
6.14.2 Statistical Data Mining
The data mining techniques described in this book are primarily database-oriented, that is, designed for the efficient
handling of huge amounts of data that are typically multidimensional and possibly of various complex types. There
are, however, many well-established statistical techniques for data analysis, particularly for numeric data. These
techniques have been applied extensively to some types of scientific data (for example, data from experiments in
physics, engineering, manufacturing, psychology, and medicine), as well as to data from economics and the social
sciences.
• Regression: In general, these methods are used to predict the value of a response (dependent) variable from one
or more predictor (independent) variables where the variables are numeric. There are various forms of regression,
such as linear, multiple, weighted, polynomial, nonparametric, and robust (robust methods are useful when
errors fail to satisfy normalcy conditions or when the data contain significant outliers).
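As a minimal illustration of the simplest of these forms, here is simple linear regression fitted by ordinary least squares; the x and y values are invented.

```python
# Ordinary least squares for simple linear regression y = a + b*x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.0, 6.2, 7.9, 10.1]   # roughly y = 2x (invented data)

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# slope b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
    / sum((xi - mean_x) ** 2 for xi in x)
# intercept a = mean_y - b * mean_x
a = mean_y - b * mean_x

print(round(b, 2), round(a, 2))   # 1.99 0.09
```

The fitted slope of about 2 recovers the relationship the invented data were built around; multiple, weighted, and robust regression generalise this same least-squares idea.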
• Generalised linear models: These models, and their generalisation (generalised additive models), allow a
categorical response variable (or some transformation of it) to be related to a set of predictor variables in a
manner similar to the modelling of a numeric response variable using linear regression. Generalised linear
models include logistic regression and Poisson regression.
• Analysis of variance: These techniques analyse experimental data for two or more populations described by a
numeric response variable and one or more categorical variables (factors). In general, an ANOVA (single-factor
analysis of variance) problem involves a comparison of k population or treatment means to determine if at least
two of the means are different. More complex ANOVA problems also exist.
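A single-factor ANOVA comparison can be computed from scratch as below; the three treatment groups are invented, and only the F statistic is derived (a p-value would additionally require the F distribution).

```python
# One-way ANOVA: F = (between-group mean square) / (within-group mean square).
groups = [
    [5.0, 6.0, 7.0],      # treatment 1 (invented data)
    [8.0, 9.0, 10.0],     # treatment 2
    [5.0, 5.5, 6.5],      # treatment 3
]

k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / n

# Sum of squares between groups and within groups.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

ms_between = ss_between / (k - 1)   # degrees of freedom: k - 1
ms_within = ss_within / (n - k)     # degrees of freedom: n - k
f_stat = ms_between / ms_within
print(round(f_stat, 2))             # 11.74
```

A large F statistic like this suggests that at least two of the treatment means differ, which is exactly the hypothesis ANOVA tests.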
• Mixed-effect models: These models are for analysing grouped data—data that can be classified according to
one or more grouping variables. They typically describe relationships between a response variable and some
covariates in data grouped according to one or more factors. Common areas of application include multilevel
data, repeated measures data, block designs, and longitudinal data.
• Factor analysis: This method is used to determine which variables are combined to generate a given factor. For
example, for many psychiatric data, it is not possible to measure a certain factor of interest directly (such as
intelligence); however, it is often possible to measure other quantities (such as student test scores) that reflect
the factor of interest. Here, none of the variables are designated as dependent.
• Discriminant analysis: This technique is used to predict a categorical response variable. Unlike generalised
linear models, it assumes that the independent variables follow a multivariate normal distribution. The procedure
attempts to determine several discriminant functions (linear combinations of the independent variables) that
discriminate among the groups defined by the response variable. Discriminant analysis is commonly used in
social sciences.
• Time series analysis: There are many statistical techniques for analysing time-series data, such as autoregression
methods, univariate ARIMA (autoregressive integrated moving average) modelling, and long-memory time-
series modelling.
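The simplest member of this family, a first-order autoregression x_t ≈ phi * x_{t-1}, has a closed-form least-squares estimate; the series below is invented and decays by roughly 0.9 per step.

```python
# Fit AR(1) without intercept: phi minimises sum((x[t] - phi * x[t-1])^2),
# giving phi = sum(x[t] * x[t-1]) / sum(x[t-1]^2).
series = [1.0, 0.9, 0.82, 0.73, 0.66, 0.60, 0.54]  # invented, decays by ~0.9

num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
den = sum(series[t - 1] ** 2 for t in range(1, len(series)))
phi = num / den
print(round(phi, 2))   # 0.9

# One-step-ahead forecast from the last observation:
forecast = phi * series[-1]
```

Full ARIMA modelling adds differencing and moving-average terms to this autoregressive core, but the estimation idea is the same.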
• Survival analysis: Several well-established statistical techniques exist for survival analysis. These techniques
originally were designed to predict the probability that a patient undergoing a medical treatment would survive
at least to time t. Methods for survival analysis, however, are also commonly applied to manufacturing settings
to estimate the life span of industrial equipment. Popular methods include Kaplan-Meier estimates of survival,
Cox proportional hazards regression models, and their extensions.
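A bare-bones Kaplan-Meier estimate over invented (time, event) pairs, where event = 0 marks a censored observation, might look like this:

```python
# Kaplan-Meier: S(t) is the product over event times ti <= t of (1 - d_i / n_i),
# where d_i = deaths at ti and n_i = subjects still at risk just before ti.
observations = [(2, 1), (3, 1), (3, 0), (5, 1), (8, 0), (9, 1)]  # invented

def kaplan_meier(obs):
    obs = sorted(obs)
    n_at_risk = len(obs)
    survival, curve = 1.0, []
    i = 0
    while i < len(obs):
        t = obs[i][0]
        deaths = sum(1 for time, e in obs if time == t and e == 1)
        at_this_time = sum(1 for time, e in obs if time == t)
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= at_this_time   # both deaths and censored leave the risk set
        i += at_this_time
    return curve

print(kaplan_meier(observations))
```

Censored subjects lower the number at risk without producing a step in the curve, which is what distinguishes this estimator from a naive survival fraction.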
• Quality control: Various statistics can be used to prepare charts for quality control, such as Shewhart charts
and cusum charts (both of which display group summary statistics). These statistics include the mean, standard
deviation, range, count, moving average, moving standard deviation, and moving range.
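The centre line and three-sigma control limits of a Shewhart-style chart can be computed directly; the measurements are invented, and the population standard deviation is used for simplicity.

```python
# Shewhart chart limits: centre line at the mean, limits at mean +/- 3 sigma.
measurements = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 9.7, 10.3]  # invented

n = len(measurements)
mean = sum(measurements) / n
variance = sum((x - mean) ** 2 for x in measurements) / n   # population variance
sigma = variance ** 0.5

ucl = mean + 3 * sigma   # upper control limit
lcl = mean - 3 * sigma   # lower control limit

out_of_control = [x for x in measurements if not lcl <= x <= ucl]
print(round(mean, 2), round(ucl, 2), round(lcl, 2), out_of_control)
```

Points falling outside the limits would signal that the process is out of statistical control; here all measurements lie inside them.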
6.14.3 Visual and Audio Data Mining
Visual data mining discovers implicit and useful knowledge from large data sets using data and/or knowledge
visualisation techniques. The human visual system is controlled by the eyes and brain, the latter of which can be
thought of as a powerful, highly parallel processing and reasoning engine containing a large knowledge base. Visual
data mining essentially combines the power of these components, making it a highly attractive and effective tool
for the comprehension of data distributions, patterns, clusters, and outliers in data.
Visual data mining can be viewed as an integration of two disciplines: data visualisation and data mining. It is also
closely related to computer graphics, multimedia systems, human computer interaction, pattern recognition, and
high-performance computing. In general, data visualisation and data mining can be integrated in the following
ways:
• Data visualisation: Data in a database or data warehouse can be viewed at different levels of granularity or
abstraction, or as different combinations of attributes or dimensions. Data can be presented in various visual
forms, such as boxplots, 3-D cubes, data distribution charts, curves, surfaces, link graphs, and so on. Visual
display can help give users a clear impression and overview of the data characteristics in a database.
• Data mining result visualisation: Visualisation of data mining results is the presentation of the results or knowledge
obtained from data mining in visual forms. Such forms may include scatter plots and boxplots (obtained from
descriptive data mining), as well as decision trees, association rules, clusters, outliers, generalised rules, and
so on.
• Data mining process visualisation: This type of visualisation presents the various processes of data mining in
visual forms so that users can see how the data are extracted and from which database or data warehouse they
are extracted, as well as how the selected data are cleaned, integrated, preprocessed, and mined. Moreover, it
may also show which method is selected for data mining, where the results are stored, and how they may be
viewed.
• Interactive visual data mining: In (interactive) visual data mining, visualisation tools can be used in the data
mining process to help users make smart data mining decisions. For example, the data distribution in a set
of attributes can be displayed using coloured sectors (where the whole space is represented by a circle). This
display helps users to determine which sector should first be selected for classification and where a good split
point for this sector may be.
Audio data mining uses audio signals to indicate the patterns of data or the features of data mining results. Although
visual data mining may disclose interesting patterns using graphical displays, it requires users to concentrate on
watching patterns and identifying interesting or novel features within them. This can sometimes be quite tiresome.
If patterns can be transformed into sound and music, then instead of watching pictures, we can listen to pitches,
rhythms, tune, and melody in order to identify anything interesting or unusual. This may relieve some of the burden
of visual concentration and be more relaxing than visual mining. Therefore, audio data mining is an interesting
complement to visual mining.
6.14.4 Data Mining and Collaborative Filtering
Today’s consumers are faced with millions of goods and services when shopping on-line. Recommender systems
help consumers by making product recommendations during live customer transactions. A collaborative filtering
approach is commonly used, in which products are recommended based on the opinions of other customers.
Collaborative recommender systems may employ data mining or statistical techniques to search for similarities
among customer preferences.
A collaborative recommender system works by finding a set of customers, referred to as neighbours, that have a
history of agreeing with the target customer (such as, they tend to buy similar sets of products, or give similar ratings
for certain products). Collaborative recommender systems face two major challenges: scalability and ensuring quality
recommendations to the consumer. Scalability is important, because e-commerce systems must be able to search
through millions of potential neighbours in real time. If the site is using browsing patterns as indications of product
preference, it may have thousands of data points for some of its customers. Ensuring quality recommendations
is essential in order to gain consumers’ trust. If consumers follow a system recommendation but then do not end
up liking the product, they are less likely to use the recommender system again. As with classification systems,
recommender systems can make two types of errors: false negatives and false positives. Here, false negatives are
products that the system fails to recommend, although the consumer would like them. False positives are products
that are recommended, but which the consumer does not like. False positives are less desirable because they can
annoy or anger consumers.
An advantage of recommender systems is that they provide personalisation for customers of e-commerce, promoting
one-to-one marketing. Dimension reduction, association mining, clustering, and Bayesian learning are some of the
techniques that have been adapted for collaborative recommender systems. While collaborative filtering explores
the ratings of items provided by similar users, some recommender systems explore a content-based method that
provides recommendations based on the similarity of the contents contained in an item. Moreover, some systems
integrate both content-based and user-based methods to achieve further improved recommendations. Collaborative
recommender systems are a form of intelligent query answering, which consists of analysing the intent of a query
and providing generalised, neighbourhood, or associated information relevant to the query.
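The neighbour-finding step described above can be sketched in a few lines of Python. This is an illustrative toy example, not a production recommender: the users, products, and ratings are invented for the demonstration, and cosine similarity stands in for whichever similarity measure a real system would use.

```python
import math

# Hypothetical ratings: user -> {product: rating}. All names are illustrative.
ratings = {
    "alice": {"book": 5, "dvd": 3, "lamp": 1},
    "bob":   {"book": 4, "dvd": 3, "lamp": 2},
    "carol": {"book": 1, "dvd": 5, "lamp": 5},
}

def cosine_similarity(u, v):
    """Similarity of two users over the products both have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[p] * v[p] for p in common)
    norm_u = math.sqrt(sum(u[p] ** 2 for p in common))
    norm_v = math.sqrt(sum(v[p] ** 2 for p in common))
    return dot / (norm_u * norm_v)

def neighbours(target, k=1):
    """The k users whose rating history agrees most with the target's."""
    scores = [(cosine_similarity(ratings[target], ratings[other]), other)
              for other in ratings if other != target]
    return [name for _, name in sorted(scores, reverse=True)[:k]]
```

Here `neighbours("alice")` returns `["bob"]`, because bob's ratings track alice's far more closely than carol's do; a real system would then recommend products the neighbours liked that the target has not yet bought.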
Summary
• Data mining is the process of extraction of interesting patterns or knowledge from huge amounts of data. It is the set of activities used to find new, hidden, unexpected, or unusual patterns in data.
• An important feature of object-relational and object-oriented databases is their capability of storing, accessing, and modelling complex structure-valued data, such as set- and list-valued data and data with nested structures.
• Generalisation on a class composition hierarchy can be viewed as generalisation on a set of nested structured data.
• A spatial database stores a large amount of space-related data, such as maps, preprocessed remote sensing or medical imaging data, and VLSI chip layout data. Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases.
• Spatial database systems usually handle vector data that consist of points, lines, polygons (regions), and their compositions, such as networks or partitions.
• Information retrieval (IR), the field underlying text data analysis, has been developing in parallel with database systems for many years.
• Once an inverted index is created for a document collection, a retrieval system can answer a keyword query quickly by looking up which documents contain the query keywords.
• The World Wide Web serves as a huge, widely distributed, global information service centre for news, advertisements, consumer information, financial management, education, government, e-commerce, and many other information services.
• Recent research in DNA analysis has led to the discovery of genetic causes for many diseases and disabilities, as well as approaches for disease diagnosis, prevention, and treatment.
References
• Han, J., Kamber, M. and Pei, J., 2011. Data Mining: Concepts and Techniques, 3rd ed., Elsevier.
• Alexander, D., Data Mining [Online] Available at: <http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/>. [Accessed 9 September 2011].
• Galeas, Web mining [Online] Available at: <http://www.galeas.de/webmining.html>. [Accessed 12 September 2011].
• Springerlink, 2006. Data Mining System Products and Research Prototypes [Online PDF] Available at: <http://www.springerlink.com/content/2432076500506017/>. [Accessed 12 September 2011].
• Dr. Kuonen, D., 2009. Data Mining Applications in Pharma/BioPharma Product Development [Video Online] Available at: <http://www.youtube.com/watch?v=kkRPW5wSwNc>. [Accessed 12 September 2011].
• SalientMgmtCompany, 2011. Salient Visual Data Mining [Video Online] Available at: <http://www.youtube.com/watch?v=fosnA_vTU0g>. [Accessed 12 September 2011].
Recommended Reading
• Liu, B., 2011. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, 2nd ed., Springer.
• Scime, A., 2005. Web mining: applications and techniques, Idea Group Inc. (IGI).
• Markov, Z. and Larose, D. T., 2007. Data mining the Web: uncovering patterns in Web content, structure, and usage, Wiley-Interscience.
Self Assessment
1. Using information contained within __________, data mining can often provide answers to questions about an organisation that a decision maker has previously not thought to ask.
a. metadata
b. web mining
c. data warehouse
d. data extraction
2. Aggregation and approximation are another important means of ________.
a. generalisation
b. visualisation
c. standardisation
d. organisation
3. Which of the following stores a large amount of space-related data, such as maps, preprocessed remote sensing or medical imaging data, and VLSI chip layout data?
a. Knowledge database
b. Spatial database
c. Data mining
d. Data warehouse
4. ___________ systems usually handle vector data that consist of points, lines, polygons and their compositions, such as networks or partitions.
a. Knowledge database
b. Data mining
c. Data warehouse
d. Spatial database
5. ____________ system stores and manages a large collection of multimedia data.
a. Knowledge database
b. Multimedia data mining
c. Multimedia database
d. Spatial database
6. Which of the following can contain additional dimensions and measures for multimedia information, such as colour, texture, and shape?
a. Multimedia data cube
b. Multimedia data mining
c. Multimedia database
d. Spatial database
7. What is used for mining multimedia data, especially in scientific research, such as astronomy, seismology, and geoscientific research?
a. Data warehousing
b. Classification and predictive modelling
c. Data extraction
d. Dimensional modelling
8. _________ is the percentage of retrieved documents that are in fact relevant to the query.
a. Recall
b. Precision
c. Text mining
d. Information retrieval
9. Match the columns
Column 1:
1. Application Exploration
2. Scalable data mining methods
3. Visual data mining
4. Web mining
Column 2:
A. The WWW contains a huge and dynamic collection of hyperlinked information, providing a huge source for data mining.
B. It is an effective way to discover knowledge from huge amounts of data.
C. One important direction towards improving the efficiency of the mining process while increasing user interaction is constraint-based mining.
D. Data mining for business continues to expand as e-commerce and marketing become mainstream elements of the retail industry.
a. 1-A, 2-B, 3-C, 4-D
b. 1-D, 2-C, 3-B, 4-A
c. 1-B, 2-A, 3-C, 4-D
d. 1-C, 2-B, 3-A, 4-D
10. In which of the following theories is the basis of data mining to reduce the data representation?
a. Data reduction
b. Data compression
c. Pattern discovery
d. Probability theory
Chapter VII
Implementation and Maintenance
Aim
The aim of this chapter is to:
• introduce the physical design steps
• analyse physical storage of data
• explore indexing the data warehouse
Objectives
The objectives of this chapter are to:
• explicate different performance enhancement techniques
• describe data warehouse deployment
• elucidate how to manage the data warehouse
Learning outcome
At the end of this chapter, you will be able to:
• comprehend B-Tree, clustered and bitmapped indexes
• enlist different models of data mining
• understand growth and maintenance of database
7.1 Introduction
As you know, in an OLTP system, you have to perform a number of tasks to complete the physical model. The
logical model forms the primary basis for the physical model, but, in addition, a number of factors must be considered
before you get to the physical model. You must determine where to place the database objects in physical storage. What
is the storage medium and what are its features? This information helps you to define the storage parameters. Later,
you need to plan for indexing, which raises an important consideration: on which columns in each table must the indexes
be built? You need to look into other methods for improving performance. You have to examine the initialisation
parameters in the DBMS and decide how to set them. Similarly, in the data warehouse environment, you need to
consider many different factors to complete the physical model.
7.2 Physical Design Steps
The following figure is a pictorial representation of the steps in the physical design process for a data warehouse. Note the
steps indicated in the figure. In the following subsections, we will broadly describe the activities within these steps.
You will understand how, at the end of the process, you arrive at the completed physical model.
[Figure: the physical design steps in sequence — Develop Standards → Create Aggregates Plan → Determine Data Partitioning → Establish Clustering Options → Prepare Indexing Strategy → Assign Storage Structures → Complete Physical Model]
Fig. 7.1 Physical design process
7.2.1 Develop Standards
Many companies invest a lot of time and money to prescribe standards for information systems. The standards range
from how to name the fields in the database to how to conduct interviews with the user departments for requirements
definition. A group in IT is designated to keep the standards up-to-date. In some companies, every revision must be
updated and authorised by the CIO. Through the standards group, the CIO makes sure that the standards are followed
correctly and strictly. Now the practice is to publish the standards on the company’s intranet. If your IT department
is one of the progressive ones giving due attention to standards, then be happy to embrace and adapt the standards
for the data warehouse. In the data warehouse environment, the scope of the standards expands to include additional
areas. Standards ensure consistency across the various areas. If you have the same way of indicating names of the
database objects, then you are leaving less room for ambiguity. Standards take on greater importance in the data
warehouse environment. This is because the usage of the object names is not confined to the IT department. The
users will also be referring to the objects by names when they formulate and run their own queries.
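As a small illustration, a naming standard can even be enforced mechanically. The prefix convention below (dim_, fact_, agg_ for dimension, fact, and aggregate tables) is an invented example of such a standard, not a prescription from any particular shop:

```python
import re

# One possible standard (an assumption for illustration): object names are
# lowercase words separated by underscores, with a table-type prefix.
NAME_PATTERN = re.compile(r"^(dim|fact|agg)_[a-z][a-z0-9_]*$")

def conforms(name):
    """True if a database object name follows the published standard."""
    return bool(NAME_PATTERN.match(name))
```

A check like this can run as part of the DDL review, so non-conforming names such as `ProductDim` are caught before users start writing queries against them.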
137/JNU OLE
7.2.2 Create Aggregates Plan
If your data warehouse stores data only at the lowest level of granularity, every such query has to read through all
the detailed records and sum them up. Consider a query looking for total sales for the year, by product, for all the
stores. If you have detailed records keeping sales by individual calendar dates, by product, and by store, then this
query needs to read a large number of detailed records. So what is the best method to improve performance in such
cases? If you have higher levels of summary tables of products by store, the query could run faster. But how many
such summary tables must you create? What is the limit?
In this step, review the possibilities for building aggregate tables. You get clues from the requirements definition.
Look at each dimension table and examine the hierarchical levels. Which of these levels are more important for
aggregation? Clearly assess the tradeoff. What you need is a comprehensive plan for aggregation. The plan must
spell out the exact types of aggregates you must build for each level of summarisation. It is possible that many of
the aggregates will be present in the OLAP system. If OLAP instances are not for universal use by all users, then
the necessary aggregates must be present in the main warehouse. The aggregate database tables must be laid out
and included in the physical model.
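The effect of an aggregate table can be sketched as follows. The detail rows are invented; the point is that a pre-built (product, store) summary answers product-by-store questions without rescanning every daily record:

```python
from collections import defaultdict

# Illustrative detailed fact rows: (date, product, store, sales_units).
detail = [
    ("2013-01-05", "widget", "store1", 10),
    ("2013-01-06", "widget", "store1", 4),
    ("2013-02-11", "gadget", "store1", 7),
    ("2013-01-09", "widget", "store2", 3),
]

def aggregate_by_product_store(rows):
    """Pre-summarise detail rows to the (product, store) level, so that
    higher-level queries need not read the daily detail records."""
    summary = defaultdict(int)
    for _date, product, store, units in rows:
        summary[(product, store)] += units
    return dict(summary)
```

A query for total widget sales in store1 now reads one summary row instead of every daily record, which is exactly the trade-off the aggregates plan must weigh against the storage cost of each summary table.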
7.2.3 Determine the Data Partitioning Scheme
Consider the data volumes in the warehouse. What about the number of rows in a fact table? Let us make some rough
calculations. Assume there are four dimension tables with 50 rows each on average. Even with this limited number
of dimension table rows, the potential number of fact table rows exceeds six million. Fact tables are generally very
large. Large tables are not easy to manage. During the load process, the entire table must be closed to the users.
Again, backup and recovery of large tables pose difficulties because of their sheer sizes.
Partitioning divides large database tables into manageable parts. Always consider partitioning options for fact
tables. It is not just the decision to partition that counts. Based on your environment, the real decision is about
how exactly to partition the fact tables. Your data warehouse may be a conglomerate of conformed data marts. You
must consider partitioning options for each fact table. You may find that some of your dimension tables are also
candidates for partitioning. Product dimension tables are especially large. Examine each of your dimension tables
and determine which of these must be partitioned. In this step, come up with a definite partitioning scheme. The
scheme must include:
• The fact tables and the dimension tables selected for partitioning
• The type of partitioning for each table: horizontal or vertical
• The number of partitions for each table
• The criteria for dividing each table (for example, by product groups)
• Description of how to make queries aware of partitions
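A horizontal partitioning scheme of the kind listed above, dividing a fact table by product group, might be sketched like this (the product groups and rows are illustrative, and plain lists stand in for physical partitions):

```python
def partition_horizontally(rows, key_func, partitions):
    """Assign each fact row to a named partition; rows whose key is not
    listed in the scheme go to an 'other' partition."""
    out = {name: [] for name in partitions}
    out["other"] = []
    for row in rows:
        for name, keys in partitions.items():
            if key_func(row) in keys:
                out[name].append(row)
                break
        else:
            out["other"].append(row)
    return out

# Hypothetical criterion: divide the fact table by product group.
scheme = {"electronics": {"tv", "radio"}, "grocery": {"milk", "bread"}}
facts = [("tv", 2), ("milk", 5), ("chair", 1)]
```

With such a scheme, a load or backup can close only the affected partition, and a query that filters on product group can skip the other partitions entirely, which is the "queries aware of partitions" point in the list above.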
7.2.4 Establish Clustering Options
In the data warehouse, many of the data access patterns rely on sequential access of large quantities of data. Whenever
you have this type of access and processing, you will realise much performance improvement from clustering. This
technique involves placing and managing related units of data in the same physical block of storage. This arrangement
causes the related units of data to be retrieved together in a single input operation. You need to establish the proper
clustering options before completing the physical model. Examine the tables, table by table, and find pairs that are
related. This means that rows from the related tables are usually accessed together for processing in many cases.
Then make plans to store the related tables close together in the same file on the medium.
For two related tables, you may want to store the records from both tables interleaved: a record from one table is
followed by all the related records in the other table when stored in the same file.
Data Mining
138/JNU OLE
7.2.5 Prepare an Indexing Strategy
This is a crucial step in the physical design. Unlike OLTP systems, the data warehouse is query-centric. As you know,
indexing is perhaps the most effective mechanism for improving performance. A solid indexing strategy results
in enormous benefits. The strategy must lay down the index plan for each table, indicating the columns selected
for indexing. The sequence of the attributes in each index also plays a critical role in performance. Scrutinise the
attributes in each table to determine which attributes qualify for bitmapped indexes. Prepare a comprehensive
indexing plan. The plan must indicate the indexes for each table. Further, for each table, present the sequence in
which the indexes will be created. Describe the indexes that are expected to be built in the very first instance of the
database. Many indexes can wait until you have monitored the data warehouse for some time. Spend enough time
on the indexing plan.
7.2.6 Assign Storage Structures
Where do you want to place the data on the physical storage medium? What are the physical files? What is the
plan for assigning each table to specific files? How do you want to divide each physical file into blocks of data?
Answers to questions like these go into the data storage plan. In an OLTP system, all data resides in the operational
database. When you assign the storage structures in an OLTP system, your effort is confined to the operational
tables accessed by the user applications. In a data warehouse, you are not just concerned with the physical files for
the data warehouse tables. Your storage assignment plan must include other types of storage, such as the temporary
data extract files, the staging area, and any storage needed for front-end applications. Let the plan include all the
types of storage structures in the various storage areas.
7.2.7 Complete Physical Model
This final step reviews and confirms the completion of the prior activities and tasks. By the time you reach this step,
you have the standards for naming the database objects. You have determined which aggregate tables are necessary
and how you are going to partition the large tables. You have completed the indexing strategy and have planned
for other performance options. You also know where to put the physical fles. All the information from the prior
steps allows you to complete the physical model. The result is the creation of the physical schema. You can code
the data definition language (DDL) statements in the chosen RDBMS and create the physical structure in the data
dictionary.
7.3 Physical Storage
Consider the processing of a query. After the query is verified for syntax and checked against the data dictionary
for authorisation, the DBMS translates the query statements to determine what data is requested. From the data
entries about the tables, rows, and columns desired, the DBMS maps the requests to the physical storage where
the data access takes place. The query gets filtered down to physical storage, and this is where the input operations
begin. The efficiency of the data retrieval is closely tied to where the data is stored in physical storage and how it
is stored there.
What are the various physical data structures in the storage area? What is the storage medium and what are its
characteristics? Do the features of the medium support any efficient storage or retrieval techniques? We will explore
answers to questions such as these. From the answers you will derive methods for improving performance.
7.3.1 Storage Area Data Structures
Take an overall look at all the data related to the data warehouse. First, you have the data in the staging area. Though
you may look for efficiency in storage and loading, arrangement of the data in the staging area does not contribute
to the performance of the data warehouse from the point of view of the users. Looking further, the other sets of data
relate to the data content of the warehouse. These are the data and index tables in the data warehouse. How you
arrange and store these tables definitely has an impact on the performance. Next, you have the multidimensional
data in the OLAP system. In most cases, the supporting proprietary software dictates the storage and the retrieval
of data in the OLAP system.
The following figure shows the physical data structures in the data warehouse. Observe the different levels
of data. Notice the detail and summary data structures. Think further about how the data structures are implemented in
physical storage as files, blocks, and records.
[Figure: data flows from the data staging area (data extract flat files, load image flat files) into the data warehouse repository, which holds relational database data files for transformed data and for warehouse data — detailed data and light summaries in partitioned physical files — together with relational database index files, and on to the OLAP system, whose physical files store multidimensional cubes of data in a proprietary matrix format]
Fig. 7.2 Data structures in the warehouse
7.3.2 Optimising Storage
You have reviewed the physical storage structures. When you break each data structure down to the physical storage
level, you find that the structure is stored as files in the physical storage medium. Take the example of the customer
dimension and the salesperson dimension tables. You have basically two choices for storing the data of these two
dimension tables: store records from each table in one physical file, or, if the records from these tables are retrieved
together most of the time, store records from both the tables in a single physical file. In either case, records
are stored in a file. A collection of records in a file forms a block. In other words, a file comprises blocks and each
block contains records.
Remember, any optimising at the physical level is tied to the features and functions available in the DBMS. You
have to relate the techniques discussed here with the workings of your DBMS. Please study the following optimising
techniques.
Set the correct block size
A data block in a file is the fundamental unit of input/output transfer from the database to memory, where the data
gets manipulated. Each block contains a block header that holds control information; the space occupied by the
block header is not available for storing data, so too many block headers means too much wasted space.
With a larger block size, more records or rows fit into a single block. Because more records may be fetched in one
read, larger block sizes decrease the number of reads. Another advantage relates to space utilisation by the block
headers: as a percentage of the space in a block, the block header occupies less space in a larger block. Therefore,
overall, all the block headers put together occupy less space. But here is the downside of larger block sizes: even
when a smaller number of records is needed, the operating system reads too much extra information into memory,
thereby impacting memory management.
However, because most data warehouse queries request large numbers of rows, memory management as indicated
rarely poses a problem. There is another aspect of data warehouse tables that could cause some concern. Data
warehouse tables are denormalised and, therefore, the records tend to be large. Sometimes, a record may be too
large to fit in a single block. Then the record has to be split across more than one block, and the broken parts have to be
connected with pointers or physical addresses. Such pointer chains affect performance to a large extent. Consider
all the factors and set the block size appropriately. Generally, an increased block size gives better performance,
but you have to find the proper size.
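The arithmetic behind this trade-off can be sketched as below. The 90-byte header is an assumed, illustrative figure; real header sizes and block layouts depend entirely on the DBMS:

```python
def reads_for_query(rows_needed, row_bytes, block_bytes, header_bytes=90):
    """Approximate number of block reads to fetch a run of rows, assuming
    fully packed blocks and ignoring free space. Header size is illustrative."""
    usable = block_bytes - header_bytes       # header space holds no data
    rows_per_block = max(usable // row_bytes, 1)
    return -(-rows_needed // rows_per_block)  # ceiling division
```

For 1,000 rows of 200 bytes, an 8 KB block needs roughly 25 reads while a 2 KB block needs over a hundred, which is why large-scan warehouse workloads generally favour the bigger block.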
Set the proper block usage parameters
Most of the leading DBMSs allow you to set block usage parameters at appropriate values and derive performance
improvement. You will fnd that these usage parameters themselves and the methods for setting them are dependent
on the database software. Generally, two parameters govern the usage of the blocks, and proper usage of the blocks
improves performance.
Manage data migration
When a record in a block is updated and there is not enough space in the same block for storing the expanded record,
then most DBMSs move the entire updated record to another block and create a pointer to the migrated record.
Such migration affects the performance, requiring multiple blocks to be read. This problem may be resolved by
adjusting the block percent free parameter. However, migration is not a major problem in data warehouses because
of the negligible number of updates.
Manage block utilisation
Performance degenerates when data blocks contain excessive amounts of free space. Whenever a query calls for a
full table scan, performance suffers because of the need to read too many blocks. Manage block underutilisation by
adjusting the block percent free parameter downward and the block percent used parameter upward.
Resolve dynamic extension
When the current extent on disk storage for a file is full, the DBMS finds a new extent and allows an insert of a
new record. This task of finding a new extent on the fly is referred to as dynamic extension. However, dynamic
extension comes with significant overhead. Reduce dynamic extension by allocating large initial extents.
Employ fle striping techniques
You perform file striping when you split the data into multiple physical parts and store these individual
parts on separate physical devices. File striping allows concurrent input/output operations and improves
file access performance substantially.
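File striping can be illustrated with a round-robin split, using plain Python lists to stand in for the separate physical devices:

```python
def stripe(records, n_devices):
    """Spread records round-robin across n_devices separate files (here,
    plain lists stand in for physical devices)."""
    devices = [[] for _ in range(n_devices)]
    for i, rec in enumerate(records):
        devices[i % n_devices].append(rec)
    return devices
```

Because consecutive records land on different devices, a sequential scan can issue reads against all devices concurrently instead of queuing them on one.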
7.3.3 Using RAID Technology
Redundant array of inexpensive disks (RAID) technology has become so common that almost all of today’s
data warehouses make good use of it. These disk arrays are found on large servers. The arrays enable the
server to continue operation even while recovering from the failure of any single disk. The underlying
technique that gives RAID its primary benefit breaks the data into parts and writes the parts to multiple disks in
a striping fashion. The technology can recover and reconstruct data when a disk fails; RAID is very fault-tolerant.
Here are the basic features of the technology:
• Disk mirroring: writing the same data to two disk drives connected to the same controller
• Disk duplexing: similar to mirroring, except here each drive has its own distinct controller
• Parity checking: addition of a parity bit to the data to ensure correct data transmission
• Disk striping: data spread across multiple disks by sectors or bytes
RAID is implemented at six different levels: RAID 0 through RAID 5. Please turn to Fig. 7.3, which gives you
a brief description of RAID; note the advantages and disadvantages. The lowest-level configuration, RAID 0,
provides data striping. At the other end of the range, RAID 5 is a very valuable arrangement.
• RAID 0: data records striped across multiple disks without redundancy. High performance, less expensive; the entire array is out even with a single disk failure.
• RAID 1: disk mirroring, with data written redundantly to pairs of drives. High read performance and availability; expensive because of data duplication.
• RAID 2: data interleaved across disks by bit or block, with extra drives storing correction code. High performance, corrects 1-bit errors on the fly, detects 2-bit errors; costly.
• RAID 3: data interleaved across disks by bit or block, with one drive storing parity data. High performance for large blocks of data; on-the-fly recovery not guaranteed.
• RAID 4: data records interleaved across disks by sectors, with one drive storing parity data. Can handle multiple I/Os from a sophisticated OS; used with only two drives.
• RAID 5: data records sector-interleaved across groups of drives; the most popular. Dedicated parity drive unnecessary, works with two or more drives; poor write performance.
Fig. 7.3 RAID technology
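The parity checking that underlies RAID levels 3 through 5 is essentially an XOR across the data blocks in a stripe, as this sketch shows. It is a simplified model of the recovery idea, not an actual RAID implementation:

```python
def parity(blocks):
    """XOR parity over equal-length data blocks: the recovery trick
    behind RAID 3-5."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def reconstruct(surviving_blocks, parity_block):
    """Rebuild the single lost block from the survivors plus parity:
    XOR-ing everything that remains yields what is missing."""
    return parity(surviving_blocks + [parity_block])
```

Because XOR is its own inverse, losing any one drive in the stripe is recoverable; losing two is not, which is why the array must rebuild promptly after a failure.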
Estimating storage sizes
No discussion of physical storage is complete without a reference to estimation of storage sizes. Every action in
the physical model takes place in physical storage. You need to know how much storage space must be made
available initially and as the data warehouse expands. Here are a few tips on estimating storage sizes.
For each database table, determine:
• Initial estimate of the number of rows
• Average length of the row
• Anticipated monthly increase in the number of rows
• Initial size of the table in megabytes (MB)
• Calculated table sizes in 6 months and in 12 months
For all tables, determine:
• The total number of indexes
• Space needed for indexes, initially, in 6 months, and in 12 months
Estimate:
• Temporary work space for sorting and merging
• Temporary files in the staging area
• Permanent files in the staging area
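The table-size tips above reduce to simple arithmetic. Here is a sketch of the projection for one table; all figures in the example are hypothetical:

```python
def table_size_mb(initial_rows, avg_row_bytes, monthly_growth_rows, months):
    """Projected size of one table after a number of months, in MB."""
    rows = initial_rows + monthly_growth_rows * months
    return rows * avg_row_bytes / (1024 * 1024)

# Hypothetical fact table: 1M rows of 100 bytes, growing 50,000 rows/month.
now = table_size_mb(1_000_000, 100, 50_000, 0)
in_a_year = table_size_mb(1_000_000, 100, 50_000, 12)
```

Running the same projection for every table and index, at 0, 6, and 12 months, gives the initial allocation and the growth headroom the storage plan calls for.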
7.4 Indexing the Data Warehouse
In a query-centric system like the data warehouse environment, the need to process queries faster dominates. There
is no surer way of turning your users away from the data warehouse than unreasonably slow queries. For the
user in an analysis session going through a rapid succession of complex queries, you have to match the pace of the
query results with the speed of thought. Among the various methods to improve performance, indexing ranks
very high.
What types of indexes must you build in your data warehouse? The DBMS vendors offer a variety of choices. The
choice is no longer confined to sequential index files. All vendors support B-Tree indexes for efficient data retrieval.
Another option is the bitmapped index. As we will see later in this section, this indexing technique is very appropriate
for the data warehouse environment. Some vendors are extending the power of indexing to specific requirements.
These include indexes on partitioned tables and index-organised tables.
7.4.1 B-Tree Index
Most database management systems have the B-Tree index technique as the default indexing method. When you
code statements using the data definition language of the database software to create an index, the system creates a
B-Tree index. RDBMSs also create B-Tree indexes automatically on primary key values. The B-Tree index technique
supersedes other techniques because of its data retrieval speed, ease of maintenance, and simplicity. Refer to the
following figure showing an example of a B-Tree index. Notice the tree structure with the root at the top. The index
consists of a B-Tree (a balanced tree) structure based on the values of the indexed column. In the example,
the indexed column is Name. This B-Tree is created using all the existing names that are the values of the indexed
column. Observe the upper blocks that contain index data pointing to the next lower block. Think of a B-Tree index
as containing hierarchical levels of blocks. The lowest-level blocks, or the leaf blocks, point to the rows in the data
table. Note the data addresses in the leaf blocks.
If a column in a table has many unique values, then the selectivity of the column is said to be high. In a territory
dimension table, the column for City contains many unique values; this column is therefore highly selective. B-Tree
indexes are most suitable for highly selective columns. Because the values at the leaf nodes will be unique, they will
lead to distinct data rows and not to a chain of rows. What if a single column is not highly selective?
[Figure: a B-Tree index on the Name column. The root block splits the key range into A-K and L-Z; the next level divides these into A-D, E-G, H-K, L-O, P-R, and S-Z; the leaf blocks hold the indexed names (ALLEN, BUSH, CLYNE, DUNNE, ENGEL, FARIS, GORE, HAIG, IGNAR, JONES, KUMAR, LOEWE, MAHER, NIXON, OTTO, PAINE, QUINN, RAJ, SEGEL, TOTO, VETRI, WILLS), each paired with an address that points to its data row]
Fig. 7.4 B-Tree index example
Indexes grow in direct proportion to the growth of the indexed data table. Wherever indexes contain concatenations of
multiple columns, they tend to increase sharply in size. As the data warehouse deals with large volumes of data, the
size of the index files can be a cause for concern. What can we say about the selectivity of the data in the warehouse?
Are most of the columns highly selective? Not really. If you inspect the columns in the dimension tables, you will
notice a number of columns that contain low-selectivity data. B-Tree indexes do not work well with data whose
selectivity is low. What is the alternative? That leads us to another type of indexing technique.
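Selectivity itself is easy to quantify: the ratio of distinct values to total rows. A quick sketch:

```python
def selectivity(values):
    """Fraction of distinct values in a column: near 1.0 means highly
    selective (B-Tree friendly); near 0.0 suggests a bitmapped index."""
    return len(set(values)) / len(values)
```

A unique City column scores 1.0, while a colour column with three values repeated over millions of rows scores close to zero, signalling the bitmapped alternative described next.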
7.4.2 Bitmapped Index
Bitmapped indexes are ideally suitable for low-selectivity data. A bitmap is an ordered series of bits, one for each
distinct value of the indexed column. Assume that the column for colour has three distinct colours, namely, white,
almond, and black. Construct a bitmap using these three distinct values; each entry in the bitmap contains three bits.
Let us say the first bit refers to white, the second to almond, and the third to black. If a product is white in colour,
the bitmap entry for that product consists of three bits where the first bit is set to 1, the second bit is set to 0, and the
third bit is set to 0. If a product is almond in colour, the bitmap entry for that product consists of three bits where
the first bit is set to 0, the second bit is set to 1, and the third bit is set to 0.
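The colour bitmaps just described can be built directly, one bit-vector per distinct value. This sketch uses Python lists of 0s and 1s in place of the packed, compressed bit-vectors a real DBMS would use:

```python
def build_bitmaps(column):
    """One bit-vector per distinct value of a low-selectivity column."""
    bitmaps = {v: [0] * len(column) for v in set(column)}
    for row, value in enumerate(column):
        bitmaps[value][row] = 1
    return bitmaps

def rows_matching(bitmaps, value):
    """Row numbers whose bit is set for the given value."""
    return [i for i, bit in enumerate(bitmaps[value]) if bit]

# Illustrative product colours, one entry per row of the table.
colours = ["white", "almond", "black", "white"]
```

Queries combining several low-selectivity predicates (colour AND size, say) become fast bitwise ANDs over these vectors, which is precisely why the technique suits warehouse dimensions.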
7.4.3 Clustered Indexes
Some RDBMSs offer a new type of indexing technique. In the B-Tree, bitmapped, or any sequential indexing method,
you have a data segment where the values of all columns are stored and an index segment where index entries are
kept. The index segment repeats the column values for the indexed columns and also holds the addresses for the
entries in the data segment. Clustered tables combine the data segment and the index segments; the two segments
are one. Data is the index and index is the data. Clustered tables improve performance considerably because in one
read you get the index and the data segments. Using the traditional indexing techniques, you need one read to get
the index segment and a second read to get the data segment. Queries run faster with clustered tables when you are
looking for exact matches or searching for a range of values. If your RDBMS supports this type of indexing, make
use of this technique wherever you can in your environment.
7.4.4 Indexing the Fact Table
The primary key of the fact table consists of the primary keys of all the connected dimensions. If you have four
dimension tables of store, product, time, and promotion, then the full primary key of the fact table is the concatenation
of the primary keys of the store, product, time, and promotion tables. What are the other columns? The other columns
are metrics such as sale units, sale dollars, cost dollars, and so on. These are the types of columns to be considered
for indexing the fact tables.

Please study the following tips and use them when planning to create indexes for the fact tables:
• If the DBMS does not create an index on the primary key, deliberately create a B-Tree index on the full primary key.
• Carefully design the order of individual key elements in the full concatenated key for indexing. In the high order of the concatenated key, place the keys of the dimension tables frequently referred to while querying.
• Review the individual components of the concatenated key. Create indexes on combinations based on query processing requirements.
• If the DBMS supports intelligent combinations of indexes for access, then you may create indexes on each individual component of the concatenated key.
• Do not overlook the possibilities of indexing the columns containing the metrics. For example, if many queries look for dollar sales within given ranges, then the column “dollar sales” is a candidate for indexing.
• Bitmapped indexing does not apply to fact tables. There are hardly any low-selectivity columns.
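The tip about placing frequently constrained dimension keys in the high order of the concatenated key can be illustrated with a small Python sketch. The store and product keys are invented, and the sorted list stands in for the ordered entries of a B-Tree:

```python
# Sketch of why high-order key placement matters in a concatenated index.
# Fact rows are keyed by (store, product); the keys and measures are invented.
import bisect

rows = {
    ("S1", "P1"): 100, ("S1", "P2"): 150,
    ("S2", "P1"): 200, ("S2", "P2"): 250,
}

# A B-Tree-style index keeps the concatenated keys in sorted order.
index = sorted(rows)

# A query constraining only the leading column ("store = S1") can use the
# index as a range scan over one contiguous slice...
lo = bisect.bisect_left(index, ("S1",))
hi = bisect.bisect_left(index, ("S1", chr(0x10FFFF)))  # just past all S1 keys
s1_keys = index[lo:hi]

# ...whereas a query on a trailing column alone ("product = P1") cannot,
# and must inspect every index entry.
p1_keys = [key for key in index if key[1] == "P1"]
```

The leading-column query touches only the slice it needs; the trailing-column query scans the whole index, which is why the frequently queried dimension keys belong at the front.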
7.4.5 Indexing the Dimension Tables
Columns in the dimension tables are used in the predicates of queries. A query may run like this: How much are
the sales of Product A in the month of March for the Northern Division? Here, the columns product, month, and
division from three different dimension tables are candidates for indexing. Inspect the columns of each dimension
table carefully and plan the indexes for these tables. You may not be able to achieve performance improvement by
indexing the columns in the fact tables but the columns in the dimension tables offer tremendous possibilities to
improve performance through indexing. Here are a few tips on indexing the dimension tables:
• Create a unique B-Tree index on the single-column primary key.
• Examine the columns that are commonly used to constrain the queries. These are candidates for bitmapped indexes.
• Look for columns that are frequently accessed together in large dimension tables. Determine how these columns may be arranged and used to create multicolumn indexes. Remember that the columns that are more frequently accessed or the columns that are at the higher hierarchical levels in the dimension table are placed at the high order of the multicolumn indexes.
• Individually index every column likely to be used frequently in join conditions.
7.5 Performance Enhancement Techniques
Apart from the indexing techniques, a few other methods also improve performance in a data warehouse. For example,
physically compacting the data while writing to storage enables more data to be loaded into a single block. That
also means that more data may be retrieved in one read. Another method for improving performance is the merging
of tables. Again, this method enables more data to be retrieved in one read. If you purge unwanted and unnecessary
data from the warehouse in a regular manner, you can improve the overall performance. In the remainder of this
section, let us review a few other effective performance enhancement techniques. Many techniques are available
through the DBMS, and most of these techniques are especially suitable for the data warehouse environment.
7.5.1 Data Partitioning
Typically, the data warehouse holds some very large database tables. The fact tables run into millions of rows.
Dimension tables like the product and customer tables may also contain a huge number of rows. When you have
tables of such vast sizes, you face certain specific problems. First, loading of large tables takes excessive time. Then,
building indexes for large tables also runs into several hours. What about processing of queries against large tables?
Queries also run longer when attempting to sort through large volumes of data to obtain the result sets. Backing up
and recovery of huge tables takes an inordinately long time. Again, when you want to selectively purge and archive
records from a large table, wading through all the rows takes a long time.
Performing maintenance operations on smaller pieces is easier and faster. Partitioning is a crucial decision and must
be planned up front. Doing this after the data warehouse is deployed and goes into production is time-consuming
and diffcult. Partitioning means deliberate splitting of a table and its index data into manageable parts. The DBMS
supports and provides the mechanism for partitioning. When you defne the table, you can defne the partitions as
well. Each partition of a table is treated as a separate object. As the volume increases in one partition, you can split
that partition further. The partitions are spread across multiple disks to gain optimum performance. Each partition
in a table may have distinct physical attributes, but all partitions of the table have the same logical attributes.
As you observe, partitioning is an effective technique for storage management and improving performance. The
benefits are as follows:
• A query needs to access only the necessary partitions. Applications can be given the choice to have partition transparency or they may explicitly request an individual partition. Queries run faster when accessing smaller amounts of data.
• An entire partition may be taken off-line for maintenance. You can separately schedule maintenance of partitions. Partitions promote concurrent maintenance operations.
• Index building is faster.
• Loading data into the data warehouse is easy and manageable.
• Data corruption affects only a single partition. Backup and recovery on a single partition reduces downtime.
• The input–output load gets balanced by mapping different partitions to the various disk drives.
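A toy Python sketch may help illustrate partitioning by month. The fact rows are invented, and a real DBMS manages partitions at the storage level rather than in application code:

```python
# Toy range-partitioning sketch: fact rows are split by month so that a
# query for one month touches only one partition. The rows are invented.
from collections import defaultdict

rows = [
    {"month": "2013-01", "sale_units": 10},
    {"month": "2013-02", "sale_units": 20},
    {"month": "2013-01", "sale_units": 5},
]

# Each month becomes a separate partition, treated as a separate object.
partitions = defaultdict(list)
for row in rows:
    partitions[row["month"]].append(row)

# A January query reads only the January partition, not the whole table.
jan_total = sum(r["sale_units"] for r in partitions["2013-01"])
print(jan_total)  # → 15
```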
7.5.2 Data Clustering
In the data warehouse, many queries require sequential access of huge volumes of data. The technique of data
clustering facilitates such sequential access. Clustering fosters sequential prefetch of related data. You achieve data
clustering by physically placing related tables close to each other in storage. When you declare a cluster of tables
to the DBMS, the tables are placed in neighbouring areas on disk. How you exercise data clustering depends on the
features of the DBMS. Review the features and take advantage of data clustering.
7.5.3 Parallel Processing
Consider a query that accesses large quantities of data, performs summations, and then makes a selection based on
multiple constraints. It is immediately obvious that you will achieve major performance improvement if you can split
the processing into components and execute the components in parallel. The simultaneous concurrent executions
will produce the result faster. Several DBMS vendors offer parallel processing features that are transparent to the
users. The user who designs a query need not know how that specific query must be broken down for parallel
processing. The DBMS will do that for the user. Parallel processing techniques may be applied to data loading and
data reorganisation. Parallel processing techniques work in conjunction with data partitioning schemes. The parallel
architecture of the server hardware also affects the way parallel processing options may be invoked. Some physical
options are critical for effective parallel processing. You have to assess propositions like placing two partitions on
the same storage device if you need to process them in parallel. Parallel processing and partitioning together provide
great potential for improved performance. However, the designer must decide how to use them effectively.
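As a rough illustration, splitting a summation across partitions and executing the pieces concurrently might be sketched in Python as follows. The partition contents are invented, and a parallel-capable DBMS performs this decomposition transparently:

```python
# Sketch of parallel query processing: a large summation is split into
# per-partition components that execute concurrently. Data is invented.
from concurrent.futures import ThreadPoolExecutor

partitions = [
    list(range(0, 1000)),
    list(range(1000, 2000)),
    list(range(2000, 3000)),
]

def partial_sum(partition):
    """One component of the decomposed query, run against one partition."""
    return sum(partition)

# The components run concurrently; their results are combined at the end.
with ThreadPoolExecutor() as pool:
    total = sum(pool.map(partial_sum, partitions))
```

Note how the sketch depends on the data already being partitioned: parallel processing and partitioning work together, as the text observes.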
7.5.4 Summary Levels
Select the levels of granularity for the purpose of optimising the input–output operations. If the users frequently
request weekly sales information, then consider keeping another summary at the weekly level. On the other hand, if
you only keep weekly and monthly summaries and no daily details, every query for daily details cannot be satisfied
from the data warehouse. Choose your summary and detail levels carefully based on user requirements.
In addition, rolling summary structures are especially useful in a data warehouse. Suppose in your data warehouse
you need to keep hourly data, daily data, weekly data, and monthly summaries, create mechanisms to roll the data
into the next higher levels automatically with the passage of time. Hourly data automatically gets summarised into
the daily data, daily data into the weekly data, and so on.
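A rolling summary of hourly data into daily data might be sketched as follows; the hourly figures are invented:

```python
# Rolling-summary sketch: hourly figures are summarised into daily rows
# as time passes. The measurements are invented.
from collections import defaultdict

hourly = [
    ("2013-03-01", 9, 4),   # (day, hour, units sold)
    ("2013-03-01", 10, 6),
    ("2013-03-02", 9, 3),
]

# Hourly data automatically rolls up into the daily level...
daily = defaultdict(int)
for day, hour, units in hourly:
    daily[day] += units

# ...and daily would in turn roll into weekly, weekly into monthly, and so on.
print(dict(daily))  # → {'2013-03-01': 10, '2013-03-02': 3}
```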
7.5.5 Referential Integrity Checks
Referential integrity constraints ensure the validity of the relationship between two related tables. The referential integrity rules in
the relational model govern the values of the foreign key in the child table and the primary key in the parent table.
Every time a row is added or deleted, the DBMS verifies that the referential integrity is preserved. This verification
ensures that parent rows are not deleted while child rows exist and that child rows are not added without parent
rows. Referential integrity verification is critical in the OLTP systems, but it reduces performance. Now consider
the loading of data into the data warehouse. By the time the load images are created in the staging area, the data
structures have already gone through the phases of extraction, cleansing, and transformation. The data ready to be
loaded has already been verified for correctness as far as parent and child rows are concerned. Therefore, there is no
need for further referential integrity verification while loading the data. Turning off referential integrity verification
produces significant performance gains.
7.5.6 Initialisation Parameters
DBMS installation signals the start of performance improvement. At the start of the installation of the database system,
you can carefully plan how to set the initialisation parameters. Most of the times you will realise that performance
degradation is to a substantial extent the result of inappropriate parameters. The data warehouse administrator has
a special responsibility to choose the right parameters.
7.5.7 Data Arrays
What are data arrays? Suppose in a financial data mart you need to keep monthly balances of individual line accounts.
In a normalised structure, the monthly balances for a year will be found in twelve separate table rows. Assume that in
many queries the users request for the balances for all the months together. How can you improve the performance?
You can create a data array or repeating group with twelve slots, each to contain the balance for one month.
Although creating arrays is a clear violation of normalisation principles, this technique yields tremendous performance
improvement. In the data warehouse, the time element is interwoven into all data. Frequently, users look for data
in a time series. Another example is the request for monthly sales figures for 24 months for each salesperson. If
you analyse the common queries, you will be surprised to see how many need data that can be readily stored in
arrays.
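Collapsing twelve normalised monthly-balance rows into one array row might be sketched as follows; the account and balances are invented:

```python
# Data-array sketch: twelve normalised monthly-balance rows collapse into
# a single row holding a repeating group of twelve slots. Data is invented.

# Normalised form: one row per (account, month, balance).
normalised = [("A-100", month, 100 * month) for month in range(1, 13)]

# Denormalised array form: one slot per month.
balances = [0] * 12
for account, month, balance in normalised:
    balances[month - 1] = balance

# One read of this row now returns the whole year instead of twelve rows.
year_row = ("A-100", balances)
```

The repeating group deliberately violates normalisation, as the text notes, in exchange for retrieving a full time series in a single read.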
7.6 Data Warehouse Deployment
Once the data warehouse has been designed, built, and tested, it needs to be deployed so it is available to the user
community. This process is also known as ‘roll-out’. This can vary in size from a single-server local deployment
(deployed across one country or one location) to a global distributed network involving several time zones and
translating data into many different languages.
It is never enough to simply deploy a solution and then ‘leave’. Ongoing maintenance and future enhancements must be
managed, and a programme of user training is often required, apart from the logistics of the deployment itself. Timing
of a deployment is critical. Allow too much time and you risk missing your deadlines, while allowing too little time
means you run into resourcing problems. As with most IT work, never underestimate the amount of work involved
and the amount of time required.
The data warehouse might need to be customised for deployment to a particular country or location, where they
might use the general design but have their own data needs. It is not uncommon for different parts of the same
organisation to use different computer systems, particularly where mergers and acquisitions are involved, so the
data warehouse must be modified to allow this as part of the deployment.
Roll-out to production
This takes place after user acceptance testing (UAT) and includes: moving the data warehouse to the live servers,
loading all the live data – not just some of it for testing purposes, optimising the databases and implementing security.
All of this must involve minimum disruption to the system users. Needless to say, you need to be very confident
everything is in place and working before going live – or you might find you have to do it all over again.
Scheduling jobs
In a production environment jobs such as data warehouse loading must be automated in scripts and scheduled to
run automatically. A suitable time slot must be found that does not conflict with other tasks happening on the same
servers. Procedures must be in place to deal with unexpected events and failures.
Regression testing
This type of testing is part of the deployment process and searches for errors that were fixed at one point but have
somehow been reintroduced by the change in environment.
7.6.1 Data Warehouse Deployment Lifecycle
Here is the typical lifecycle for a data warehouse deployment project:
• Project Scoping and Planning
‚ Project triangle – scope, time, and resource.
‚ Scope – determine the scope of the project: what would you like to accomplish? This can be defined by the questions to be answered, the number of logical stars, and the number of OLTP sources.
‚ Time – what is the target date for the system to be available to the users?
‚ Resource – what is our budget? What are the role and profile requirements of the resources needed to make this happen?
• Requirements
‚ What are the business questions? How can the answers to these questions change business decisions or trigger actions?
‚ What is the role of the users? How often do they use the system? Do they do any interactive reporting or just view the defined reports in guided navigation?
‚ How do you measure? What are the metrics?
• Front-End Design
‚ The front-end design must support both interactive analysis and the designed analytics workflow.
‚ How does the user interact with the system?
‚ What is their analysis process?
• Warehouse Schema Design
‚ Dimensional modelling – define the dimensions and facts, and define the grain of each star schema.
‚ Define the physical schema, depending on the technology decision. If you use relational technology, design the database tables.
• OLTP to Data Warehouse Mapping
‚ Logical mapping – table-to-table and column-to-column mapping. Also define the transformation rules.
‚ You may need to perform OLTP data profiling. How often does the data change? What is the data distribution?
‚ ETL design – include data staging and the detailed ETL process flow.
• Implementation
‚ Create the warehouse and ETL staging schema.
‚ Develop the ETL programs.
‚ Create the logical-to-physical mapping in the repository.
‚ Build the end-user dashboard and reports.
• Deployment
‚ Install the analytics reporting and ETL tools.
‚ Perform specific setup and configuration for OLTP, ETL, and the data warehouse.
‚ Size the system and database.
‚ Carry out performance tuning and optimisation.
• Management and Maintenance of the System
‚ Provide ongoing support of the end-users, including security, training, and enhancing the system.
‚ Monitor the growth of the data.
7.7 Growth and Maintenance
More data marts and more deployment versions have to follow. The team needs to ensure that it is well poised for
growth. You need to ensure that the monitoring functions are all in place to constantly keep the team informed of
the status. The training and support functions must be consolidated and streamlined. The team must confirm that all
the administrative functions are ready and working. Database tuning must continue at a regular pace.
Immediately following the initial deployment, the project team must conduct review sessions. Here are the major
review tasks:
• Review the testing process and suggest recommendations.
• Review the goals and accomplishments of the pilots.
• Survey the methods used in the initial training sessions.
• Document highlights of the development process.
• Verify the results of the initial deployment, matching these with user expectations.
The review sessions and their outcomes form the basis for improvement in the further releases of the data warehouse.
As you expand and produce further releases, let the business needs, modelling considerations, and infrastructure
factors remain as the guiding factors for growth. Follow each release closely on the previous release. You can make
use of the data modelling done in the earlier release. Build each release as a logical next step. Avoid disconnected
releases. Build on the current infrastructure.
7.7.1 Monitoring the Data Warehouse
When you implement an OLTP system, you do not stop with the deployment. The database administrator continues
to inspect system performance. The project team continues to monitor how the new system matches up with the
requirements and delivers the results. Monitoring the data warehouse is comparable to what happens in an OLTP
system, except for one big difference. Monitoring an OLTP system dwindles in comparison with the monitoring
activity in a data warehouse environment. As you can easily perceive, the scope of the monitoring activity in the
data warehouse extends over many features and functions. Unless data warehouse monitoring takes place in a
formalised manner, desired results cannot be achieved. The results of the monitoring give you the data needed to
plan for growth and to improve performance.
The following figure presents the data warehousing monitoring activity and its usefulness. As you can observe, the
statistics serve as the life-blood of the monitoring activity. That leads into growth planning and fine-tuning of the
data warehouse.
[Figure: monitoring statistics flow from the data warehouse to data warehouse administration and the end-users. The statistics are collected in two ways – sampling (sample data warehouse activity at specific intervals and gather statistics) and event-driven (record statistics whenever specified events take place) – and are reviewed for growth planning and performance tuning.]

Fig. 7.5 Data warehousing monitoring
7.7.2 Collection of Statistics
What we call monitoring statistics are indicators whose values provide information about data warehouse functions.
These indicators provide information on the utilisation of the hardware and software resources. From the indicators,
you determine how the data warehouse performs. The indicators present the growth trends. You understand how
well the servers function. You gain insights into the utility of the end-user tools. How do you collect statistics on
the working of the data warehouse? Two common methods apply to the collection process. Sampling methods and
event-driven methods are generally used. The sampling method measures specific aspects of the system activity
at regular intervals. You can set the duration of the interval. If you set the interval as 10 minutes for monitoring
processor utilisation, then utilisation statistics are recorded every 10 minutes. The sampling method has minimal
impact on the system overhead. The event-driven methods work differently. The recording of the statistics does not
happen at intervals, but only when a specified event takes place.
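The contrast between the two collection methods can be sketched in Python; the activity readings and the event threshold are invented:

```python
# Sketch contrasting the two statistics-collection methods. The stream of
# utilisation readings and the alert threshold are invented.

activity = [12, 48, 7, 95, 33, 88, 15, 60]  # utilisation reading per minute

# Sampling: record a reading at fixed intervals (here, every 2nd minute).
SAMPLING_INTERVAL = 2
sampled = [value for tick, value in enumerate(activity)
           if tick % SAMPLING_INTERVAL == 0]

# Event-driven: record a reading only when a specified event occurs
# (here, utilisation exceeding a threshold).
THRESHOLD = 80
events = [value for value in activity if value > THRESHOLD]

print(sampled)  # → [12, 7, 33, 15]
print(events)   # → [95, 88]
```

The sampler records regardless of what is happening, at a predictable overhead; the event-driven recorder stays silent until the condition of interest fires.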
The tools that come with the database server and the host operating system are generally turned on to collect the
monitoring statistics. Over and above these, many third-party vendors supply tools especially useful in a data
warehouse environment. Most tools gather the values for the indicators and also interpret the results. The data
collector component collects the statistics while the analyser component does the interpretation. Most of the
monitoring of the system occurs in real time.
The following is a random list that includes statistics for different uses. You will find most of these applicable to
your environment.
• Physical disk storage space utilisation
• Number of times the DBMS is looking for space in blocks or causes fragmentation
• Memory buffer activity
• Buffer cache usage
• Input–output performance
• Memory management
• Profile of the warehouse content, giving the number of distinct entity occurrences (example: number of customers, products, and so on)
• Size of each database table
• Accesses to fact table records
• Usage statistics relating to subject areas
• Numbers of completed queries by time slots during the day
• Time each user stays online with the data warehouse
• Total number of distinct users per day
• Maximum number of users during time slots daily
• Duration of daily incremental loads
• Count of valid users
• Query response times
• Number of reports run each day
• Number of active tables in the database
7.7.3 Using Statistics for Growth Planning
As you deploy more versions of the data warehouse, the number of users increases and the complexity of the queries
intensifies; you then need to plan for the obvious growth. But how do you know where the expansion is needed?
Why have the queries slowed down? Why have the response times degraded? Why was the warehouse down for
expanding the table spaces? The monitoring statistics provide you with clues as to what is happening in the data
warehouse and how you can prepare for the growth. Following are the types of action that are prompted by the
monitoring statistics:
• Allocate more disk space to existing database tables
• Plan for new disk space for additional tables
• Modify file block management parameters to minimise fragmentation
• Create more summary tables to handle a large number of queries looking for summary information
• Reorganise the staging area files to handle more data volume
• Add more memory buffers and enhance buffer management
• Upgrade database servers
• Offload report generation to another middle tier
• Smooth out peak usage during the 24-hour cycle
• Partition tables to run loads in parallel and to manage backups
7.7.4 Using Statistics for Fine-Tuning
The next best use of statistics relates to performance. You will find that a large number of monitoring statistics prove
to be useful for fine-tuning of the data warehouse. Following are the data warehouse functions that are normally
improved based on the information derived from the statistics:
• Query performance
• Query formulation
• Incremental loads
• Frequency of OLAP loads
• OLAP system
• Data warehouse content browsing
• Report formatting
• Report generation
7.7.5 Publishing Trends for Users
This is a new concept not usually found in OLTP systems. In a data warehouse, the users must find their way into
the system and retrieve the information by themselves. They must know about the contents. Users must know about
the currency of the data in the warehouse. When was the last incremental load? What are the subject areas? What
is the count of distinct entities? The OLTP systems are quite different. These systems readily present the users with
routine and standardised information. Users of OLTP systems do not need the inside view. Look at the following
figure listing the types of statistics that must be published for the users. If your data warehouse is Web-enabled,
use the company’s intranet to publish the statistics for the users. Otherwise, provide the ability to inquire into the
dataset where the statistics are kept.
[Figure: the monitoring statistics of a Web-enabled data warehouse are published to internal end-users through an intranet web page. The statistics and information published include: warehouse subjects, warehouse tables, summary data, warehouse navigation, warehouse statistics, predefined queries, predefined reports, last full load, last incremental load, scheduled downtime, contacts for support, user tool upgrades, warehouse data, and metadata.]

Fig. 7.6 Statistics for the users
7.8 Managing the Data Warehouse
Data warehouse management is concerned with two principal functions. The first is maintenance management.
The data warehouse administrative team must keep all the functions going in the best possible manner. The second
is change management. As new versions of the warehouse are deployed, as new releases of the tools become
available, as improvements and automation take place in the ETL functions, the administrative team’s focus includes
enhancements and revisions.

Postdeployment administration covers the following areas:
• Performance monitoring and fine-tuning
• Data growth management
• Storage management
• Network management
• ETL management
• Management of future data mart releases
• Enhancements to information delivery
• Security administration
• Backup and recovery management
• Web technology administration
• Platform upgrades
• Ongoing training
• User support
7.8.1 Platform Upgrades
Your data warehouse deployment platform includes the infrastructure, the data transport component, end-user
information delivery, data storage, metadata, the database components, and the OLAP system components. More
often than not, a data warehouse is a comprehensive cross-platform environment. The components follow a path of dependency,
starting with computer hardware at the bottom, followed by the operating systems, communications systems, the
databases, GUIs, and then the application support software. As time goes on, upgrades to these components are
announced by the vendors. After the initial rollout, have a proper plan for applying the new releases of the platform
components. As you have probably experienced with OLTP systems, upgrades cause potentially serious interruption
to the normal work unless they are properly managed. Good planning minimises the disruption. Vendors try to force
you into upgrades on their schedule based on their new releases. If the timing is not convenient for you, resist the
initiatives from the vendors. Schedule the upgrades at your convenience and based on when your users can tolerate
interruptions.
7.8.2 Managing Data Growth
Managing data growth deserves special attention. In a data warehouse, unless you are vigilant about data growth, it
could get out of hand very soon and quite easily. Data warehouses already contain huge volumes of data. When you
start with a large volume of data, even a small percentage increase can result in substantial additional data. In the first
place, a data warehouse may contain too much historical data. Data beyond 10 years may not produce meaningful
results for many companies because of the changed business conditions. End-users tend to opt for keeping detailed
data at the lowest grain. At least in the initial stages, the users continue to match results from the data warehouse
with those from the operational systems. Analysts produce many types of summaries in the course of their analysis
sessions. Quite often, the analysts want to store these intermediary datasets for use in similar analysis in the future.
Unplanned summaries and intermediary datasets add to the growth of data volumes. Here are just a few practical
suggestions to manage data growth:
• Dispense with some detail levels of data and replace them with summary tables.
• Restrict unnecessary drill-down functions and eliminate the corresponding detail-level data.
• Limit the volume of historical data. Archive old data promptly.
• Discourage analysts from holding unplanned summaries.
• Where genuinely needed, create additional summary tables.
7.8.3 Storage Management
As the volume of data increases, so does the utilisation of storage. Because of the huge data volume in a data
warehouse, storage costs rank very high as a percentage of the total cost. Experts estimate that storage costs are almost
four or five times software costs, yet you find that storage management does not receive sufficient attention from
data warehouse developers and managers. Here are a few tips on storage management to be used as guidelines:
• Additional rollouts of the data warehouse versions require more storage capacity. Plan for the increase.
• Make sure that the storage configuration is flexible and scalable. You must be able to add more storage with minimum interruption to the current users.
• Use modular storage systems. If not already in use, consider a switchover.
• If yours is a distributed environment with multiple servers having individual storage pools, consider connecting the servers to a single storage pool that can be intelligently accessed.
• As usage increases, plan to spread data over multiple volumes to minimise access bottlenecks.
• Ensure ability to shift data from bad storage sectors.
• Look for storage systems with diagnostics to prevent outages.
7.8.4 ETL Management
This is a major ongoing administrative function, so attempt to automate most of it. Install an alert system to call
attention to exceptional conditions. The following are useful suggestions on ETL (data extraction, transformation,
loading) management:
• Run daily extraction jobs on schedule. If source systems are not available under extraneous circumstances, reschedule extraction jobs.
• If you employ data replication techniques, make sure that the result of the replication process checks out.
• Ensure that all reconciliation is complete between source system record counts and record counts in extracted files.
• Make sure all defined paths for data transformation and cleansing are traversed correctly.
• Resolve exceptions thrown out by the transformation and cleansing functions.
• Verify load image creation processes, including creation of the appropriate key values for the dimension and fact table rows.
• Check out the proper handling of slowly changing dimensions.
• Ensure completion of daily incremental loads on time.
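The reconciliation tip – comparing source-system record counts with the counts in the extracted files – might be sketched as follows; the table names and counts are invented:

```python
# Reconciliation sketch for ETL management: compare source-system record
# counts with the counts in the extracted files and flag any mismatches.
# The table names and counts are invented.

source_counts = {"customer": 12000, "product": 450, "sales": 98000}
extract_counts = {"customer": 12000, "product": 449, "sales": 98000}

# Any table whose extracted count differs from its source count is flagged.
mismatches = {
    table: (source_counts[table], extract_counts.get(table, 0))
    for table in source_counts
    if source_counts[table] != extract_counts.get(table, 0)
}

# A non-empty result should raise an alert before the load proceeds.
print(mismatches)  # → {'product': (450, 449)}
```

In an automated ETL schedule, a check like this would feed the alert system mentioned at the start of this section rather than print to the console.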
7.8.5 Information Delivery Enhancements
As time goes on, you will notice that your users have outgrown the end-user tools they started out with. In the
course of time, the users become more proficient at locating and using the data. They get ready for more and
more complex queries. New end-user tools appear in the market all the time. Why deny your users the latest and the
best if they can really benefit from them? What are the implications of enhancing the end-user tools and adopting
a different tool set? Unlike a change to ETL, this change relates to the users directly, so plan the change carefully
and proceed with caution.
Please review the following tips:
• Ensure the compatibility of the new tool set with all data warehouse components.
• If the new tool set is installed in addition to the existing one, switch your users over in stages.
• Ensure integration of end-user metadata.
• Schedule training on the new tool set.
• If there are any data stores attached to the original tool set, plan for the migration of the data to the new tool set.
7.8.6 Ongoing Fine-Tuning
Techniques for fine-tuning OLTP systems are applied to the fine-tuning of the data warehouse. The techniques are
very much similar except for one big difference: the data warehouse contains a lot more, in fact, many times more
data than a typical OLTP system. The techniques will have to apply to an environment replete with mountains of
data.

There may not be any point in repeating the indexing and other techniques that you already know from the OLTP
environment. Following are a few practical suggestions:
•	Have a regular schedule to review the usage of indexes. Drop the indexes that are no longer used.
•	Monitor query performance daily. Investigate long-running queries. Work with the user groups that seem to be executing long-running queries. Create indexes if needed.
•	Analyse the execution of all predefined queries on a regular basis; RDBMSs have query analysers for this purpose.
•	Review the load distribution at different times of day. Determine the reasons for large variations.
•	Although you have instituted a regular schedule for ongoing fine-tuning, from time to time you will come across some queries that suddenly cause grief. You will hear complaints from a specific group of users. Be prepared for such ad hoc fine-tuning needs. The data administration team must have staff set apart for dealing with these situations.
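The daily query-monitoring suggestion above can be sketched as follows. The log entries, threshold, and field names are assumptions for illustration; a real warehouse would pull this from the DBMS's query history views.

```python
# Illustrative sketch of daily query monitoring: flag long-running queries
# and group them by user group, so the data administration team knows which
# groups to work with. Threshold and log contents are invented values.

THRESHOLD_SECONDS = 300  # queries running beyond 5 minutes get investigated

query_log = [
    {"group": "marketing", "sql_id": "q1", "elapsed": 620},
    {"group": "finance",   "sql_id": "q2", "elapsed": 45},
    {"group": "marketing", "sql_id": "q3", "elapsed": 410},
]

def long_running_by_group(log, threshold=THRESHOLD_SECONDS):
    """Map each user group to the ids of its queries exceeding the threshold."""
    offenders = {}
    for entry in log:
        if entry["elapsed"] > threshold:
            offenders.setdefault(entry["group"], []).append(entry["sql_id"])
    return offenders

print(long_running_by_group(query_log))  # {'marketing': ['q1', 'q3']}
```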
7.9 Models of Data Mining
Following are the different models used for data mining, explained in detail.
Claims fraud models
The number of challenges facing the Property and Casualty insurance industry seems to have grown geometrically during the past decade. In the past, poor underwriting results and high loss ratios were compensated by excellent returns on investments. However, the performance of financial markets today is not sufficient to deliver the level of profitability that is necessary to support the traditional insurance business model. In order to survive in the bleak economic conditions that dictate the terms of today's merciless and competitive market, insurers must change the way they operate to improve their underwriting results and profitability. An important element in the process of defining the strategies that are essential to ensure the success and profitable results of insurers is the ability to forecast the new directions in which claims management should be developed. This endeavour has become a crucial and challenging undertaking, given the dramatic events of the past years in the insurance industry worldwide. We can check claims as they arrive and score them as to the likelihood that they are fraudulent. This can result in large savings for the insurance companies that use these technologies.
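Scoring an incoming claim can be sketched as applying a trained model's coefficients to the claim's features. The features, weights, and threshold below are invented for illustration; in practice the weights come from a model fitted on historical claims data.

```python
import math

# Toy sketch of claim fraud scoring with a logistic function. The feature
# names and coefficients are hypothetical placeholders, NOT from a real
# trained model; only the scoring mechanics are illustrated.

WEIGHTS = {"claim_amount_k": 0.08,          # larger claims score higher
           "days_since_policy_start": -0.01, # very new policies score higher
           "prior_claims": 0.6}              # claim history raises the score
BIAS = -2.0

def fraud_score(claim):
    """Return a probability-like fraud score in (0, 1)."""
    z = BIAS + sum(WEIGHTS[f] * claim[f] for f in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

claim = {"claim_amount_k": 15, "days_since_policy_start": 20, "prior_claims": 2}
score = fraud_score(claim)
if score > 0.5:
    print(f"flag for manual review (score {score:.2f})")
```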
Customer clone models
The process for selectively targeting prospects for your acquisition efforts often utilises a sophisticated analytical technique called "best customer cloning". These models estimate which prospects are most likely to respond based on characteristics of the company's "best customers". To this end, we build the models or demographic profiles that allow you to select only the best prospects, or "clones", for your acquisition programs. In a retail environment, we can even identify the best prospects that are close in proximity to your stores or distribution channels. Customer clone models are appropriate when insufficient response data is available, providing an effective prospect-ranking mechanism when response models cannot be built.
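One simple way to realise such cloning, sketched below, is to profile the best customers as an attribute centroid and rank prospects by their distance to it. The attributes and values are invented for illustration; real clone models typically use richer demographic profiles.

```python
import math

# Sketch of "best customer cloning": average the best customers into a
# centroid profile, then rank prospects by closeness to that profile.
# Attribute values (age, income in thousands, visits per year) are
# illustrative assumptions.

best_customers = [
    (34, 72, 12),
    (29, 65, 10),
    (41, 80, 14),
]

def centroid(rows):
    """Component-wise mean of the attribute tuples."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def distance(a, b):
    """Euclidean distance between two attribute tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

profile = centroid(best_customers)
prospects = {"p1": (33, 70, 11), "p2": (58, 30, 1)}
ranked = sorted(prospects, key=lambda p: distance(prospects[p], profile))
print(ranked)  # closest "clone" first
```

A production model would normalise the attributes first, so that no single attribute (such as income) dominates the distance.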
Response models
The best method for identifying the customers or prospects to target for a specific product offering is through the use of a model developed specifically to predict response. These models are used to identify the customers most likely to exhibit the behaviour being targeted. Predictive response models allow organisations to find the patterns that separate their customer base so the organisation can contact those customers or prospects most likely to take the desired action. These models contribute to more effective marketing by ranking the best candidates for a specific product offering, thus identifying the "low-hanging fruit".
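The ranking a response model provides can be sketched as follows: score each customer with a predicted response probability, then contact only the top of the list. The scores here are placeholders standing in for real model output.

```python
# Sketch of using response-model scores to pick the "low-hanging fruit":
# customers are sorted by predicted response probability and only the
# top n are selected for the campaign. Scores are illustrative.

scores = {"c1": 0.12, "c2": 0.71, "c3": 0.05, "c4": 0.44, "c5": 0.63}

def top_candidates(scores, n):
    """Return the n customer ids with the highest predicted response."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(top_candidates(scores, 2))  # ['c2', 'c5']
```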
Revenue and proft predictive models
Revenue and profit prediction models combine response/non-response likelihood with a revenue estimate, which is especially useful if order sizes, monthly billings, or margins differ widely. Not all responses have equal value, and a model that maximises responses does not necessarily maximise revenue or profit. Revenue and profit predictive models indicate those respondents who are most likely to add a higher revenue or profit margin with their response than other responders.
These models use a scoring algorithm specifically calibrated to select revenue-producing customers and help identify the key characteristics that best identify better customers. They can be used to fine-tune standard response models or used in acquisition strategies.
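The combination of response likelihood and revenue estimate described above reduces, in sketch form, to ranking by expected profit. The probabilities and margins below are invented for illustration.

```python
# Sketch of revenue/profit prediction: rank by expected profit
# (response probability x estimated margin) instead of raw response
# probability. Numbers are illustrative assumptions.

customers = {
    # name: (response_probability, estimated_margin_per_response)
    "a": (0.30, 20.0),   # likely responder, small order
    "b": (0.10, 400.0),  # unlikely responder, large order
    "c": (0.05, 50.0),
}

def expected_profit(prob, margin):
    return prob * margin

ranked = sorted(customers,
                key=lambda c: expected_profit(*customers[c]),
                reverse=True)
print(ranked)  # 'b' outranks 'a' despite the lower response rate
```

This makes the point in the text concrete: a pure response model would put "a" first, while the profit-weighted ranking puts "b" first.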
Cross-sell and up-sell models
Cross-sell/up-sell models identify customers who are the best prospects for the purchase of additional products and
services and for upgrading their existing products and services. The goal is to increase share of wallet. Revenue can
increase immediately, but loyalty is enhanced as well due to increased customer involvement.
Attrition models
Efficient, effective retention programs are critical in today's competitive environment. While it is true that it is less costly to retain an existing customer than to acquire a new one, the fact is that all customers are not created equal. Attrition models enable you to identify customers who are likely to churn or switch to other providers, thus allowing you to take appropriate pre-emptive action. When planning retention programs, it is essential to be able to identify the best customers, how to optimise existing customers and how to build loyalty through "entanglement". Attrition models are best employed when there are specific actions that the client can take to retard cancellation or cause the customer to become substantially more committed. The modelling technique provides an effective method for companies to identify characteristics of churners for acquisition efforts and also to prevent or forestall cancellation of customers.
Marketing effectiveness creative models
Often the message that is passed on to the customer is one of the most important factors in the success of a campaign. Models can be developed to target each customer or prospect with the most effective message. In direct mail campaigns, this approach can be combined with response modelling to score each prospect with the likelihood that they will respond, assuming they receive the most effective creative message (that is, the one recommended by the model). In email campaigns, this approach can be used to specify a customised creative message for each recipient.
Real time web personalisation with eNuggets
Using our eNuggets real-time data mining system, websites can interact with site visitors in an intelligent manner to achieve desired business goals. This type of application is useful for eCommerce and CRM sites. eNuggets is able to transform websites from static pages to customised landing pages, built on the fly, that match a customer profile so that the promise of true one-to-one marketing can be realised.
eNuggets is a revolutionary new business intelligence tool that can be used for web personalisation or other real-time business intelligence purposes. It can be easily integrated with existing systems such as CRM, outbound telemarketing (that is, intelligent scripting), insurance underwriting, stock forecasting, fraud detection, genetic research and many others.
eNuggets™ uses historical data (either from company transaction data or from outside data) to extract information in the form of English rules understandable by humans. The rules collectively form a model of the patterns in the data that would not be evident to human analysis. When new data comes in, such as a stock transaction from ticker data, eNuggets™ interrogates the model and finds the most appropriate rule to suggest which course of action will provide the best result (that is, buy, sell or hold).
Summary
•	The logical model forms the primary basis for the physical model.
•	Many companies invest a lot of time and money to prescribe standards for information systems. The standards range from how to name the fields in the database to how to conduct interviews with the user departments for requirements definition.
•	Standards take on greater importance in the data warehouse environment.
•	If the data warehouse stores data only at the lowest level of granularity, every such query has to read through all the detailed records and sum them up.
•	If OLAP instances are not for universal use by all users, then the necessary aggregates must be present in the main warehouse. The aggregate database tables must be laid out and included in the physical model.
•	During the load process, the entire table must be closed to the users.
•	In the data warehouse, many of the data access patterns rely on sequential access of large quantities of data.
•	Preparing an indexing strategy is a crucial step in the physical design. Unlike OLTP systems, the data warehouse is query-centric.
•	The efficiency of data retrieval is closely tied to where the data is stored in physical storage and how it is stored there.
•	Most of the leading DBMSs allow you to set block usage parameters at appropriate values and derive performance improvements.
•	Redundant array of inexpensive disks (RAID) technology has become common to the extent that almost all of today's data warehouses make good use of it.
•	In a query-centric system like the data warehouse environment, the need to process queries faster dominates.
•	Bitmapped indexes are ideally suited for low-selectivity data.
•	Once the data warehouse is designed, built and tested, it needs to be deployed so it is available to the user community.
•	Data warehouse management is concerned with two principal functions: maintenance management and change management.
References
•	Ponniah, P., 2001. Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals, Wiley-Interscience.
•	Larose, D. T., 2006. Data Mining Methods and Models, John Wiley and Sons.
•	Wan, D., 2007. Typical Data Warehouse Deployment Lifecycle [Online] Available at: <http://dylanwan.wordpress.com/2007/11/02/typical-data-warehouse-deployment-lifecycle/>. [Accessed 12 September 2011].
•	Statsoft, Data Mining Techniques [Online] Available at: <http://www.statsoft.com/textbook/data-mining-techniques/>. [Accessed 12 September 2011].
•	StatSoft, 2010. Data Mining, Model Deployment and Scoring - Session 30 [Video Online] Available at: <http://www.youtube.com/watch?v=LDoQVbWpgKY>. [Accessed 12 September 2011].
•	OracleVideo, 2010. Data Warehousing Best Practices: Star Schemas [Video Online] Available at: <http://www.youtube.com/watch?v=LfehTEyglrQ>. [Accessed 12 September 2011].
Recommended Reading
•	Kantardzic, M., 2001. Data Mining: Concepts, Models, Methods, and Algorithms, 2nd ed., Wiley-IEEE.
•	Khan, A., 2003. Data Warehousing 101: Concepts and Implementation, iUniverse.
•	Rainardi, V., 2007. Building a Data Warehouse with Examples in SQL Server, Apress.
Self Assessment
1.	The logical model forms the __________ basis for the physical model.
	a.	primary
	b.	secondary
	c.	important
	d.	former
2.	If _________ instances are not for universal use by all users, then the necessary aggregates must be present in the main warehouse.
	a.	KDD
	b.	DMKD
	c.	OLAP
	d.	SDMKD
3.	What divides large database tables into manageable parts?
	a.	Extraction
	b.	Division
	c.	Transformation
	d.	Partitioning
4.	Which of the following is a crucial step in the physical design?
	a.	Preparing an indexing strategy
	b.	Preparing a web mining strategy
	c.	Preparing a data mining strategy
	d.	Preparing an OLAP strategy
5.	Which statement is false?
	a.	In the data warehouse, many of the data access patterns rely on sequential access of large quantities of data.
	b.	Unlike data mining systems, the data warehouse is query-centric.
	c.	The sequence of the attributes in each index plays a critical role in performance.
	d.	Scrutinise the attributes in each table to determine which attributes qualify for bit-mapped indexes.
6.	In most cases, the supporting proprietary software dictates the storage and the retrieval of data in the ________ system.
	a.	KDD
	b.	OLTP
	c.	SDMKD
	d.	OLAP
7.	What is the full form of RAID?
	a.	Redundant Arrangement of Inexpensive Disks
	b.	Redundant Array of Information Disks
	c.	Redundant Array of Inexpensive Disks
	d.	Redundant Array of Inexpensive Database
8.	Match the columns.
	1. Disk mirroring	A. similar to mirroring, except here each drive has its own distinct controller
	2. Disk duplexing	B. writing the same data to two disk drives connected to the same controller
	3. Parity checking	C. addition of a parity bit to the data to ensure correct data transmission
	4. Disk striping	D. data spread across multiple disks by sectors or bytes
	a.	1-A, 2-B, 3-C, 4-D
	b.	1-D, 2-C, 3-B, 4-A
	c.	1-B, 2-A, 3-C, 4-D
	d.	1-C, 2-B, 3-A, 4-D
9.	Every action in the physical model takes place in _________.
	a.	physical storage
	b.	data mining
	c.	data warehousing
	d.	disk mirroring
10.	In a __________ system like the data warehouse environment, the need to process queries faster dominates.
	a.	OLAP
	b.	query-centric
	c.	OLTP
	d.	B-Tree index
Case study I
Logic-ITA student data
We have performed a number of queries on datasets collected by the Logic-ITA to assist teaching and learning.
The Logic-ITA is a web-based tutoring tool used at Sydney University since 2001, in a course taught by the second
author. Its purpose is to help students practice logic formal proofs and to inform the teacher of the class progress.
Context of use
Over the four years, around 860 students attended the course and used the tool, in which an exercise consists of a set of formulas (called premises) and another formula (called the conclusion). The aim is to prove that the conclusion can validly be derived from the premises. For this, the student has to construct new formulas, step by step, using logic rules and formulas previously established in the proof, until the conclusion is derived. There is no unique solution and any valid path is acceptable. Steps are checked on the fly and, if incorrect, an error message and possibly a tip are displayed. Students used the tool at their own discretion. A consequence is that there is neither a fixed number nor a fixed set of exercises done by all students.
Data stored
The tool’s teacher module collates all the student models into a database that the teacher can query and mine. Two
often queried tables of the database are the tables mistake and correct step. The most common variables are shown
in Table 1.1.
login		the student's login id
qid		the question id
mistake		the mistake made
rule		the logic rule involved/used
line		the line number in the proof
startdate	date the exercise was started
finishdate	date the exercise was finished (or 0 if unfinished)

Table 1.1 Common variables in tables mistake and correct step
(Source: Merceron, A. and Yacef, K., Educational Data Mining: a Case Study)
Questions
1.	What is Logic-ITA? What is its purpose?
Answer
The Logic-ITA is a web-based tutoring tool used at Sydney University since 2001, in a course taught by the
second author. Its purpose is to help students practice logic formal proofs and to inform the teacher of the class
progress.
2.	What are 'premises' and 'conclusion'?
Answer
Over the four years, around 860 students attended the course and used the tool, in which an exercise consists
of a set of formulas (called premises) and another formula (called the conclusion).
3.	State the most common variables of the database.
Answer
login		the student's login id
qid		the question id
mistake		the mistake made
rule		the logic rule involved/used
line		the line number in the proof
startdate	date the exercise was started
finishdate	date the exercise was finished (or 0 if unfinished)
Case Study II
A Case Study of Exploiting Data Mining Techniques for an Industrial Recommender System
In this case study, we aim to provide recommendations to the loyal customers of a chain of fashion retail stores based in Spain. In particular, the retail stores would like to be able to generate targeted product recommendations to loyal customers based on customer demographics, customer transaction history, or item properties. A comprehensive description of the available dataset with the above information is provided in the next subsection. The transformation of this dataset into a format that can be exploited by data mining and machine learning techniques is described in the sections below.
Dataset
The dataset used for this case study contained data on customer demographics, transactions performed, and item
properties. The entire dataset covers the period of 01/01/2007 – 31/12/2007.
There were 1,794,664 purchase transactions by both loyal and non-loyal customers. The average value of a purchased
item was €35.69. We removed the transactions performed by non-loyal customers, which reduced the number of
purchase transactions to 387,903 by potentially 357,724 customers. We refer to this dataset as Loyal. The average
price of a purchased item was €37.81.
We then proceeded to remove all purchased items with a value of less than €0 because these represent refunds. This
reduced the number of purchase transactions to 208,481 by potentially 289,027 customers. We refer to this dataset
as Loyal-100.
Dataset Processing
We processed the Loyal dataset to remove incomplete data for the demographic, item, and purchase transaction
attributes.

Demographic Attributes
Table 2.1 shows the four demographic attributes we used for this case study. The average item price attribute was not contained in the database; it was derived from the data.

Attribute		Original format		Processed codification
Date of birth		String			Numeric age
Address			String			Province category
Gender			String			Gender category
Avg. item price		N/A			Derived numeric value

Table 2.1 Demographic attributes
The date of birth attribute was provided in seven different valid formats, alongside several invalid formats. The invalid formats resulted in 17,125 users being removed from the Loyal dataset. The date of birth was further processed to produce the age of the user in years. We considered an age of less than 18 to be invalid because of the requirement for a loyal customer to be 18 years old to join the scheme; we also considered an age of more than 80 to be unusually old based on the life expectancy of a Spanish person. Customers with an age outside the 18-80 range were removed from the dataset. Customers without a gender, or with a 'Not Applicable' gender, were removed from the Loyal-100 dataset. Finally, users who did not perform at least one transaction between 01/01/2007 and 31/12/2007 were removed from the dataset. An overview of the number of customers removed from the Loyal-100 dataset can be seen in Table 2.2.
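The birth-date cleaning step above can be sketched as: parse the date, derive the age, and keep only the 18-80 range. The two date formats and the reference date below are assumptions for illustration; the study mentions seven valid formats, which are not enumerated in the text.

```python
from datetime import date, datetime

# Sketch of the demographic cleaning step: parse the birth date, derive the
# age, and drop customers outside the 18-80 range. The formats and the
# reference date are illustrative assumptions, not the study's actual list.

REFERENCE = date(2007, 12, 31)            # end of the dataset period
FORMATS = ("%d/%m/%Y", "%Y-%m-%d")        # two of several possible formats

def age_from_birth(text):
    """Age in whole years, or None if the date cannot be parsed."""
    for fmt in FORMATS:
        try:
            born = datetime.strptime(text, fmt).date()
        except ValueError:
            continue
        return (REFERENCE - born).days // 365
    return None  # invalid format -> customer removed

def keep(text):
    age = age_from_birth(text)
    return age is not None and 18 <= age <= 80

print(keep("15/06/1975"))   # True
print(keep("15/06/2001"))   # False: younger than 18 in 2007
print(keep("not-a-date"))   # False: invalid format
```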
Customer data issue		Number removed
Invalid birth date		17,125
Too young			3,926
Too old				243
Invalid province		44,215
Invalid gender			3,188
No transactions performed	227,297
Total users removed		295,994
Remaining customers		61,730

Table 2.2 Customer attribute issues in the Loyal dataset
Item Attributes
Table 2.3 presents the four item attributes we used for this case study
Attribute Original Format processed Codifcation
Designer String Designer category
Composition String Composition category
Price Decimal Numeric value
Release season String Release season category

Table 2.3 Item attributes

The item designer, composition, and release season identifiers were translated to nominal categories. The price was kept in the original format and binned using the Weka toolkit. Items lacking complete data on any of the attributes were not included in the final dataset due to the problem of incomplete data.

Attribute issue			No. of items removed
Invalid season			9,344
No designer			10,497
No composition			2,788
Total items removed		22,629
Remaining items			6,414

Table 2.4 Item attribute issues in the Loyal dataset
163/JNU OLE
Purchase Transaction Attributes
Table 2.5 presents the two transaction attributes we used for this case study.

Attribute		Original format		Processed codification
Date			String			Calendar season category
Individual item price	Decimal			Numeric value

Table 2.5 Transaction attributes

The transaction date field was provided in one valid format and presented no parsing problems. The date of a transaction was codified into a binary representation of the calendar season(s) according to the scheme shown in Table 2.6. This codification scheme results in the "distance" between January and April being equivalent to the "distance" between September and December, which is intuitive.
		Spring	Summer	Autumn	Winter
January		0	0	0	1
February	1	0	0	1
March		1	0	0	0
April		1	0	0	0
May		1	1	0	0
June		0	1	0	0
July		0	1	0	0
August		0	1	1	0
September	0	0	1	0
October		0	0	1	0
November	0	0	1	1
December	0	0	0	1

Table 2.6 Codifying transaction date to calendar season
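The codification in Table 2.6 can be written directly as a lookup from month to the four-bit (spring, summer, autumn, winter) vector. The bit values below come straight from the table; the Hamming-style bit-difference is one plausible reading of the "distance" the text refers to.

```python
# Month -> (spring, summer, autumn, winter) bits, copied from Table 2.6.
SEASON_BITS = {
    "January":   (0, 0, 0, 1),
    "February":  (1, 0, 0, 1),
    "March":     (1, 0, 0, 0),
    "April":     (1, 0, 0, 0),
    "May":       (1, 1, 0, 0),
    "June":      (0, 1, 0, 0),
    "July":      (0, 1, 0, 0),
    "August":    (0, 1, 1, 0),
    "September": (0, 0, 1, 0),
    "October":   (0, 0, 1, 0),
    "November":  (0, 0, 1, 1),
    "December":  (0, 0, 0, 1),
}

def bit_distance(a, b):
    """Count of differing bits between two codified dates."""
    return sum(x != y for x, y in zip(a, b))

# January-April and September-December are equidistant, as the text notes:
d1 = bit_distance(SEASON_BITS["January"], SEASON_BITS["April"])
d2 = bit_distance(SEASON_BITS["September"], SEASON_BITS["December"])
print(d1, d2)  # 2 2
```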

The price of each item was kept in the original decimal format and binned using the Weka toolkit. We chose not to remove discounted items from the dataset. Items with no corresponding user were encountered when the user had been removed from the dataset due to an aspect of the user demographic attributes causing a problem. An overview of the number of item transactions removed from the Loyal dataset based on the processing and codification step can be seen in Table 2.7.

Issue				No. of transactions removed
Refund				2,300
Too expensive			6,591
No item record			74,089
No user record			96,442
Total item purchases removed	179,422
Remaining item purchases	208,481

Table 2.7 Transaction attribute issues in the Loyal dataset
As a result of performing these data processing and cleaning steps, we are left with a dataset we refer to as Loyal-Clean. An overview of the All, Loyal, and the processed and codified dataset, Loyal-Clean, is shown in Table 2.8.

			All		Loyal		Loyal-Clean
Item transactions	1,794,664	387,903		208,481
Customers		N/A		357,724		61,730
Total items		29,043		29,043		6,414
Avg. items per customer	N/A		1.08		3.38
Avg. item value		€35.69		€37.81		€36.35

Table 2.8 Database observations
(Source: Cantador, I., Elliott, D. and Jose, J. M. A Case Study of Exploiting Data Mining Techniques for an Industrial Recommender System [PDF] Available at: <http://ir.ii.uam.es/publications/indrec09.pdf>. [Accessed 30 September 2011].)
Questions
1.	Explain the dataset used in this case study.
2.	How were demographic attributes used in the above case study?
3.	Write a note on the purchase transaction attributes used in this case study.
Case Study III
ABSTRACT
Data mining is gaining popularity as an effective tool for increasing profits in a variety of industries. However, the quality of the information resulting from the data mining exercise is only as good as the underlying data. The importance of accurate, accessible data is paramount. A well-designed data warehouse can greatly enhance the effectiveness of the data mining process. This paper will discuss the planning and development of a data warehouse for a credit card bank. While the discussion covers a number of aspects and uses of the data warehouse, a particular focus will be on the critical needs for data access pertaining to targeting model development.
The case study will involve developing a lifetime value model from a variety of data sources including account history, customer transactions, offer history and demographics. The paper will discuss the importance of some aspects of the physical design and maintenance to the data mining process.
INTRODUCTION
One of the most critical steps in any data mining project is obtaining good data. Good data can mean many things: clean, accurate, predictive, timely, accessible and/or actionable. This is especially true in the development of targeting models. Targeting models are only as good as the data on which they are developed. Since the models are used to select names for promotions, they can have a significant financial impact on a company's bottom line.
The overall objectives of the data warehouse are to assist the bank in developing a totally data driven approach to
marketing, risk and customer relationship management. This would provide opportunities for targeted marketing
programs. The analysis capabilities would include:
•	Response Modelling and Analysis
•	Risk or Approval Modelling and Analysis
•	Activation or Usage Modelling and Analysis
•	Lifetime Value or Net Present Value Modelling
•	Segmentation and Profiling
•	Fraud Detection and Analysis
•	List and Data Source Analysis
•	Sales Management
•	Customer Touchpoint Analysis
•	Total Customer Profitability Analysis

The case study objectives focus on the development of a targeting model using information and tools available through the data warehouse. Anyone who has worked with targeting model development knows that data extraction and preparation are often the most time-consuming part of model development. Ask a group of analysts how much of their time is spent preparing data; a majority of them will say over 50%.
WHERE'S THE EFFORT
[Fig. 3.1 is a bar chart (0-60% scale) comparing the share of effort spent on business objectives development, data preparation, data mining, and analysis of results and knowledge accumulation, with data preparation taking the largest share.]

Fig. 3.1 Analysis of time spent preparing data
Over the last 10 years, the bank had amassed huge amounts of information about its customers and prospects. The analysts and modellers knew there was a great amount of untapped value in the data. They just had to figure out a way to gain access to it. The goal was to design a warehouse that could bring together data from disparate sources into one central repository.
THE TABLES
The first challenge was to determine which tables should go into the data warehouse. We had a number of issues:
•	Capturing response information
•	Storing transactions
•	Defining date fields
Response information
Responses begin to arrive about a week after an offer is mailed. Upon arrival, the response is put through a risk screening process. During this time, the prospect is considered 'Pending'. Once the risk screening process is complete, the prospect is either 'Approved' or 'Declined'. The bank considered two different options for storing the information in the data warehouse.
•	The first option was to store the data in one large table. The table would contain information about those approved as well as those declined. Traditionally, across all applications, they saw approval rates hover around 50%. Therefore, whenever analysis was done on either the approved applications (with a risk management focus) or on the declined population (with a marketing as well as risk management focus), every query needed to go through nearly double the necessary number of records.
•	The second option was to store the data in three small tables. This accommodated the daily updates and allowed pending accounts to stay separate as they awaited information from either the applicant or another data source.

With applications coming from e-commerce sources, the importance of the 'Pending' table increased. This table was examined daily to determine which pending accounts could be approved quickly with the least amount of risk. In today's competitive market, quick decisions are becoming a competitive edge. Partitioning the large customer profile table into three separate tables improved the speed of access for each of the three groups of marketing analysts who had responsibility for customer management, reactivation and retention, and activation. The latter group was responsible for both the one-time buyers and the prospect pools.
FILE STRUCTURE ISSUES
Many of the tables presented design challenges. Structural features that provided ease of use for analysts could
complicate the data loading process for the IT staff. This was a particular problem when it came to transaction data.
This data is received on a monthly basis and consists of a string of transactions for each account for the month.
This includes transactions such as balances, purchases, returns and fees. In order to make use of the information at
a customer level it needs to be summarised. The question was how to best organize the monthly performance data
in the data warehouse. Two choices were considered:
Long skinny file: this took the data into the warehouse in much the same form as it arrived. Each month would enter the table as a separate record. Each year has a separate table. The fields represent the following:
Month01 Cust_1 VarA VarB VarC VarD VarE CDate C#
Month02 Cust_1 VarA VarB VarC VarD VarE CDate C#
Month03 Cust_1 VarA VarB VarC VarD VarE CDate C#
Month04 Cust_1 VarA VarB VarC VarD VarE CDate C#
|       |      |    |    |    |
Month12 Cust_1 VarA VarB VarC VarD VarE CDate C#
Month01 Cust_2 VarA VarB VarC VarD VarE CDate C#
Month02 Cust_2 VarA VarB VarC VarD VarE CDate C#
Month03 Cust_2 VarA VarB VarC VarD VarE CDate C#
Month04 Cust_2 VarA VarB VarC VarD VarE CDate C#
|       |      |    |    |    |
Month12 Cust_2 VarA VarB VarC VarD VarE CDate C#
Wide file: this design has a single row per customer. It is much more tedious to update, but in its final form it is much easier to analyse because the data has already been organised into a single customer record. Each year has a separate table. The layout is as follows:
Cust_1 VarA01 VarA02 VarA03 … VarA12 VarB01
VarB02 VarB03 … VarB12 VarC01 VarC02 VarC03 …
VarC12 VarD01 VarD02 VarD03 … VarD12 VarE01
VarE02 VarE03 … VarE12 CDate C#
Cust_2 VarA01 VarA02 VarA03 … VarA12 VarB01
VarB02 VarB03 … VarB12 VarC01 VarC02 VarC03 …
VarC12 VarD01 VarD02 VarD03 … VarD12 VarE01
VarE02 VarE03 … VarE12 CDate C#
The final decision was to go with the wide file, or single-row-per-customer, design. The argument was that the manipulation to the customer-level file could be automated, thus making the best use of the analyst's time.
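The long-to-wide reshaping the bank chose can be sketched as follows: the twelve monthly records per customer collapse into one wide row, with each variable suffixed by its month, as in the layout above. The sample values are invented; the field-naming convention (VarA01, VarA02, ...) follows the wide-file layout in the text.

```python
# Sketch of pivoting the long skinny file into the wide single-row-per-
# customer layout: "VarA" in record "Month02" becomes field "VarA02".
# Sample values are illustrative.

long_rows = [
    ("Month01", "Cust_1", {"VarA": 100, "VarB": 5}),
    ("Month02", "Cust_1", {"VarA": 110, "VarB": 7}),
    ("Month01", "Cust_2", {"VarA": 90,  "VarB": 2}),
]

def to_wide(rows):
    """Collapse monthly records into one dict of suffixed fields per customer."""
    wide = {}
    for month, cust, values in rows:
        suffix = month.replace("Month", "")     # "Month01" -> "01"
        record = wide.setdefault(cust, {})
        for var, val in values.items():
            record[f"{var}{suffix}"] = val      # e.g. "VarA01"
    return wide

wide = to_wide(long_rows)
print(wide["Cust_1"])
```

This is the "automatable manipulation" the final decision relied on: the tedium of building the wide record is done once, in code, rather than by each analyst.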
DATE ISSUES
Many analyses are performed using date values. In our previous situation, we saw how transactions are received and updated on a monthly basis. This is useful when comparing values of the same vintage. However, another analyst might need to compare balances at a certain stage in the customer life cycle. For example, to track customer balance cycles from multiple campaigns, a field that denotes the load date is needed.
The first type of analysis was tracking monthly activity by the vintage acquisition campaign, for example, calculating monthly trends of balances aggregated separately for those accounts booked in May 99 and September 99. This required aggregating the data for each campaign by the 'load date', which corresponded to the month in which the transaction occurred.
The second analysis focused on determining and evaluating trends in the customer life cycle. Typically, customers who took a balance transfer at the time of acquisition showed balance run-off shortly after the introductory teaser APR rate expired and the account was repriced to a higher rate. These are the dreaded 'rate surfers'. Conversely, a significant number of customers who did not take a balance transfer at the time of acquisition demonstrated balance build. Over time, these customers continued to have higher than average monthly balances. Some demonstrated revolving behaviour: paying less than the full balance each month and a willingness to pay interest on the revolving balance. The remainder in this group simply used their credit cards for convenience. Even though they built balances through debit activity each month, they chose to pay their balances in full and avoid finance charges. These are the 'transactors' or convenience users. The second type of analysis needed to use 'months on books', regardless of the source campaign. This analysis required computation of the account age by looking at both the date the account was opened as well as the 'load date' of the transaction data. However, if the data mining task is to also understand this behaviour in the context of the campaign vintage mentioned earlier, there is another consideration. Prospects for the 'May 99' campaign were solicited in May of 1999. However, many new customers did not use their card until June or July of 1999. There were three main reasons:
•	some wanted to compare their offer to other offers;
•	processing is slower during May and June; and
•	some waited until a specific event (e.g., purchase of a large present at the Christmas holidays) to use their card for the first time.

At this point, the data warehouse probably needs to store at least the following date information:
a. Date of the campaign
b. Date the account first opened
c. Date of the first transaction
d. Load date for each month of data

The difference between the load date (d) and either the account open date (b) or the first transaction date (c) can
be used as the measure to index account age, or months on books. No single date field is more important than
another, but multiple date fields are probably necessary if vintage and customer life-cycle analyses are both to be
performed.
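The month-on-books measure described above is simple to sketch. Here is a minimal Python illustration (the function name and field choices are ours, not the source's):

```python
from datetime import date

def months_on_books(open_date: date, load_date: date) -> int:
    """Whole months between the account-open (or first-transaction)
    date and the load date of the transaction data."""
    return (load_date.year - open_date.year) * 12 + (load_date.month - open_date.month)

# An account from the "May 99" campaign first used in May 1999,
# observed at the September 1999 load, is 4 months on books.
print(months_on_books(date(1999, 5, 1), date(1999, 9, 1)))  # 4
```

Computing age this way, rather than from the campaign date, keeps the life-cycle analysis correct for customers who did not use their card until a month or two after solicitation.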
Developing the Model
To develop a Lifetime Value model, we need to extract information from the Customer Information Table for risk
indices as well as the Offer History Table for demographic and previous offer information.
Customer information table
A Customer Information Table is typically designed with one record per customer. The customer table contains the
identifying information that can be linked to other tables such as a transaction table to obtain a current snapshot of
a customer’s performance. The following list details the key elements of the Customer Information Table:
Customer ID – a unique numeric or alphanumeric code that identifies the customer throughout his entire lifecycle.
This element is especially critical in the credit card industry, where the credit card number may change in the event
of a lost or stolen card, but it is essential in any table to effectively link and track the behaviour of and actions taken
on an individual customer.
Household ID – a unique numeric or alphanumeric code that identifies the household of the customer through his
or her entire lifecycle. This identifier is useful in some industries where products or services are shared by more
than one member of a household.
Account number – a unique numeric or alpha-numeric code that relates to a particular product or service. One
customer can have several account numbers.
Customer name – the name of a person or a business. It is usually broken down into multiple fields: last name, first
name, middle name or initial, salutation.
Address – the street address is typically broken into components such as number, street, suite or apartment number,
city, state, zip+4. Some customer tables have a line for a P.O. Box. With population mobility at about 10% per year,
additional fields that contain former addresses are useful for tracking and matching customers to other files.
Phone number – current and former numbers for home and work.
Demographics – characteristics such as gender, age, income, etc. may be stored for profiling and modelling.
Products or services – the list of products and product identification numbers varies by company. An insurance
company may list all the policies along with policy numbers. A bank may list all the products across different
divisions of the bank including checking, savings, credit cards, investments, loans, and more. If the number of
products and product detail is extensive, this information may be stored in a separate table with a customer and
household identifer.
Offer detail – the date, type of offer, creative, source code, pricing, distribution channel (mail, telemarketing, sales
rep, e-mail), and any other details of an offer. Most companies look for opportunities to cross-sell or up-sell their
current customers. There could be numerous "offer detail" fields in a customer record, each representing an offer
for an additional product or service.
Model Scores – response, risk, attrition, profitability scores and/or any other scores that are created or purchased.
Transaction table
The Transaction Table contains records of customer activity. It is the richest and most predictive information but
can be the most difficult to access. Each record represents a single transaction, so there are multiple records for
each customer. In order to use this data for modelling, it must be summarized and aggregated to a customer level.
The following lists key elements of the Transaction Table:
Customer ID – defined above.
Household ID – defined above.
Transaction type – the type of credit card transaction, such as charge, return, or fee (annual, overlimit, late).
Transaction date – the date of the transaction.
Transaction amount – the dollar amount of the transaction.
Offer history table
The Offer History Table contains details about offers made to prospects, customers or both. The most useful format
is a unique record for each customer or prospect. Variables created from this table are often the most predictive in
response and activation targeting models. It seems logical that if you know someone has received your offer every
month for 6 months, they are less likely to respond than someone who is seeing your offer for the first time. As
competition intensifies, this type of information is becoming increasingly important. A Customer Offer History Table
contains all cross-sell, up-sell and retention offers. A Prospect Offer History Table contains all acquisition offers as
well as any predictive information from outside sources. It is also useful to store several addresses on the Prospect
Offer History Table.
With an average amount of solicitation activity, this type of table can become very large. It is important to perform
analysis to establish business rules that control the maintenance of this table. A field like 'date of first offer' is usually
correlated with response behaviour. The following list details some key elements in an Offer History Table:
Prospect ID/Customer ID – as in the Customer Information Table, this is a unique numeric or alphanumeric code
that identifies the prospect for a specific length of time. This element is especially critical in the credit card industry,
where the credit card number may change in the event of a lost or stolen card, but it is essential in any table to
effectively track the behaviour of and actions taken on an individual customer.
Household ID – a unique numeric or alphanumeric code that identifies the household of the customer through his
entire lifecycle. This identifier is useful in some industries where products or services are shared by more than one
member of a household.
Prospect name* – the name of a person or a business. It is usually broken down into multiple fields: last name, first
name, middle name or initial, salutation.
Address* – the street address is typically broken into components such as number, street, suite or apartment number,
city, state, zip+4. As in the Customer Table, some prospect tables have a line for a P.O. Box. Additional fields that
contain former addresses are useful for matching prospects to outside files.
Phone number – current and former numbers for home and work.
Offer Detail – includes the date, type of offer, creative, source code, pricing, distribution channel (mail, telemarketing,
sales rep, email) and any other details of the offer. There could be numerous groups of "offer detail" fields in a
prospect or customer record, each representing an offer for an additional product or service.
Offer summary – date of first offer (for each offer type), best offer (unique to product or service), etc.
Model scores* – response, risk, attrition, profitability scores and/or any other scores that are created or purchased.
Predictive data* – includes any demographic, psychographic or behavioural data.

*These elements appear only on a Prospect Offer History Table. The Customer Table would support the Customer
Offer History Table with additional data.
Defining the objective
The overall objective is to measure the Lifetime Value (LTV) of a customer over a 3-year period. If we can predict
which prospects will be profitable, we can target our solicitations only to those prospects and reduce our mail
expense.
LTV consists of four major components:
• Activation – the probability, calculated by a model, that an individual will respond, be approved by risk, and
incur a balance.
• Risk – the probability of charge-off, derived from a risk model score and converted to an index.
• Expected account profit – expected purchase, fee and balance behaviour over a 3-year period.
• Marketing expense – the cost of the package, mailing and processing (approval, fulfilment).
The data collection
Names from three campaigns over the last 12 months were extracted from the Offer History Table. All predictive
information was included in the extract: demographic and credit variables, risk scores and offer history. The expected
balance behaviour was developed using segmentation analysis. An index of expected performance is displayed in a
matrix of gender by marital status by age group (see Appendix A). The marketing expense, which includes the mail
piece and postage, is $.78.
To predict Lifetime Value, data was pulled from the Offer History Table from three campaigns with a total of 966,856
offers. To reduce the amount of data for analysis while retaining the most powerful information, a sample is created
using all of the 'activations' and 1/25th of the remaining records, which include non-responders and non-activating
responders. We define an ACTIVE as a customer with a balance at three months. The following code creates the
sample dataset:
DATA A B;
  SET LIB.DATA;
  IF 3MON_BAL > 0 THEN OUTPUT A;
  ELSE OUTPUT B;
RUN;

DATA LIB.SAMPDATA;
  SET A B (WHERE=(RANUNI(5555) < .04));
  SAMP_WGT = 25;
RUN;
This code puts into the sample dataset all customers who activated and a 1/25th random sample of the balance of
accounts. It also creates a weight variable called SAMP_WGT with a value of 25.
The following table displays the sample characteristics:

Group                          Campaign   Sample   Weight
Non-resp/non-active resp       929,075    37,163   25
Responders/active              37,781     37,781   1
Total                          966,856    74,944
The non-responders and non-activated responders are grouped together since our target is active responders. This
gives us a manageable sample size of 74,944.
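As a sketch of this sampling scheme (in Python rather than SAS; the record layout is illustrative), keep every activated account and a 1-in-25 random sample of the rest, attaching the weights shown in the table:

```python
import random

def build_sample(accounts, rate=1 / 25, weight=25, seed=5555):
    """Keep all activated accounts (weight 1) and a 1/25th random
    sample of the remainder (weight 25), so that weighted totals
    reconstruct the original population of offers."""
    rng = random.Random(seed)
    sample = []
    for acct in accounts:
        if acct["bal_3mo"] > 0:            # ACTIVE: balance at three months
            sample.append({**acct, "samp_wgt": 1})
        elif rng.random() < rate:          # non-responders / non-activators
            sample.append({**acct, "samp_wgt": weight})
    return sample
```

Summing `samp_wgt` over the sample recovers (in expectation) the original count of 966,856 offers.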
Model development
The first component of the LTV, the probability of activation, is based on a binary outcome, which is easily modelled
using logistic regression. Logistic regression uses continuous values to predict the odds of an event happening. The
log of the odds is a linear function of the predictors. The equation is similar to the one used in linear regression,
with the exception that a log transformation is applied to the odds of the dependent variable. The equation is as
follows:
log(p/(1-p)) = B0 + B1X1 + B2X2 + … + BnXn
Variable preparation – dependent variable
To define the dependent variable, create the variable ACTIVATE as follows:

IF 3MOBAL > 0 THEN ACTIVATE = 1;
ELSE ACTIVATE = 0;
Variable preparation – previous offers
The bank has four product configurations for credit card offers. Each product represents a different intro rate and
intro length combination. From our offer history table, we pull four variables for modelling that represent the number
of times each product was mailed in the last 6 months:
NPROD1, NPROD2, NPROD3, and NPROD4.
Through analysis, the following variables were determined to be the most predictive.
SAM_OFF1 – received the same offer one time in the past 6 months.
DIF_OFF1 – received a different offer one time in the past 6 months.
SAM_OFF2 – received the same offer more than one time in the past 6 months.
DIF_OFF2 – received a different offer more than one time in the past 6 months.
Data Mining
172/JNU OLE
The product being modelled is Product 2. The following code creates the variables for modelling:

SAM_OFF1 = (NPROD2 = 1);
SAM_OFF2 = (NPROD2 > 1);
DIF_OFF1 = (SUM(NPROD1, NPROD3, NPROD4) = 1);
DIF_OFF2 = (SUM(NPROD1, NPROD3, NPROD4) > 1);

In SAS, a comparison in parentheses evaluates to 1 when true and 0 when false, so each variable is a 0/1 indicator.
If the prospect has never received an offer, the values of all four variables will be 0.
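The same indicator logic can be sketched in Python (the function and argument names are ours):

```python
def offer_history_flags(nprod1, nprod2, nprod3, nprod4):
    """0/1 indicators for prior offers of the modelled product
    (Product 2) versus the other three configurations."""
    n_other = nprod1 + nprod3 + nprod4
    return {
        "SAM_OFF1": int(nprod2 == 1),   # same offer exactly once in 6 months
        "SAM_OFF2": int(nprod2 > 1),    # same offer more than once
        "DIF_OFF1": int(n_other == 1),  # a different offer exactly once
        "DIF_OFF2": int(n_other > 1),   # a different offer more than once
    }

# A prospect never solicited before gets all four flags set to 0.
```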
Preparing credit variables
Since logistic regression looks for a linear relationship between the independent variables and the log of the odds
of the dependent variable, transformations can be used to make the independent variables more linear. Examples
of transformations include the square, cube, square root, cube root, and the log. Some complex methods have been
developed to determine the most suitable transformations. However, with increased computer speed, a simpler
method is as follows: create a list of common/favourite transformations; create new variables using every
transformation for each continuous variable; then perform a logistic regression using all forms of each continuous
variable against the dependent variable. This allows the model to select which form or forms fit best. Occasionally,
more than one transformation is significant. After each continuous variable has been processed through this method,
select the one or two most significant forms for the final model. The following code demonstrates this technique
for the variable Total Balance (TOT_BAL):
PROC LOGISTIC DATA=LIB.DATA;
  WEIGHT SMP_WGT;
  MODEL ACTIVATE = TOT_BAL TOT_B_SQ TOT_B_CU
        TOT_B_I TOT_B_LG / SELECTION=STEPWISE;
RUN;
The logistic model output (see Appendix D) shows two forms of TOT_BAL to be significant in combination:
TOT_BAL and TOT_B_SQ. These forms will be introduced into the final model.
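A Python sketch of this brute-force transformation step (the suffix names follow the SAS variables; the square-root form and the log's +1 guard are our assumptions):

```python
import math

def transform_candidates(name, value):
    """Generate the common transformations of one continuous variable;
    a stepwise logistic fit then keeps whichever forms are significant."""
    return {
        name: value,
        name + "_SQ": value ** 2,            # square
        name + "_CU": value ** 3,            # cube
        name + "_SQRT": math.sqrt(value),    # square root
        name + "_LG": math.log(value + 1),   # log (+1 avoids log(0))
    }
```

In practice these columns would be generated for every continuous predictor and fed to the stepwise selection together.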
Partition data
The data are partitioned into two datasets, one for model development, and one for validation. This is accomplished
by randomly splitting the data in half using the following SAS® code:
DATA LIB.MODEL LIB.VALID;
SET LIB.DATA;
IF RANUNI(0) < .5 THEN OUTPUT LIB.MODEL;
ELSE OUTPUT LIB.VALID;
RUN;
If the model performs well on the model data but not as well on the validation data, the model may be over-fitting the
data. This happens when the model memorizes the data and fits itself to unique characteristics of that particular
data. A good, robust model will score with comparable performance on both the model and validation datasets.

As a result of the variable preparation, a set of 'candidate' variables has been selected for the final model. The next
step is to choose the model options. The backward selection process is favoured by some modellers because it
evaluates all of the variables in relation to the dependent variable while considering interactions among the
independent, or predictor, variables. It begins by measuring the significance of all the variables and then removes
them one at a time until only the significant variables remain.
The sample weight must be included in the model code to recreate the original population dynamics. If you eliminate
the weight, the model will still produce correct rank-ordering, but the actual estimates for the probability of a
'paid sale' will be incorrect. Since our LTV model uses actual estimates, we will include the weights. The following
code is used to build the final model:
PROC LOGISTIC DATA=LIB.MODEL;
  WEIGHT SMP_WGT;
  MODEL ACTIVATE = INQL6MO TOT_BAL TOT_B_SQ
        SAM_OFF1 DIF_OFF1 SAM_OFF2 DIF_OFF2 INCOME
        INC_LOG AGE_FILE NO30DAY TOTCRLIM POPDENS
        MAIL_ORD / SELECTION=BACKWARD;
RUN;
The resulting model has 7 predictors. Each parameter estimate is multiplied by the value of its variable to create
the final probability. The strength of the predictive power is distributed like a chi-square, so we look to that
distribution for significance. The higher the chi-square, the lower the probability of the event occurring randomly
(pr > chi-square). The strongest predictor is the variable DIF_OFF2, which demonstrates the power of offer history
on the behaviour of a prospect. Introducing offer history variables into the acquisition modelling process has been
the single most significant improvement in the last three years. The following equation shows how the probability
is calculated, once the parameter estimates have been calculated:
prob = exp(B0 + B1X1 + B2X2 + … + BnXn) / (1 + exp(B0 + B1X1 + B2X2 + … + BnXn))
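Once the parameter estimates are known, scoring a prospect is a direct evaluation of this formula. A minimal Python sketch:

```python
import math

def activation_prob(intercept, coefs, x):
    """prob = exp(B0 + B1X1 + ... + BnXn) / (1 + exp(B0 + B1X1 + ... + BnXn))"""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return math.exp(z) / (1 + math.exp(z))

# A prospect whose linear predictor is exactly 0 scores 0.5.
print(activation_prob(0.0, [1.0, -2.0], [0.0, 0.0]))  # 0.5
```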
This creates the final score, which can be evaluated using a gains table (see Appendix D). Sorting the dataset by
the score and dividing it into 10 groups of equal volume creates the gains table; this is called a decile analysis.
The validation dataset is also scored and evaluated in a gains table or decile analysis. Both of these tables show
strong rank ordering. This can be seen by the gradual decrease in predicted and actual probability of activation
from the top decile to the bottom decile. The validation data shows similar results, which indicates a robust model.
To get a sense of the 'lift' created by the model, a gains chart is a powerful visual tool. The Y-axis represents the
% of activations captured by the model. The X-axis represents the % of the total population mailed. Without the
model, if you mail 50% of the file, you get 50% of the potential activations. If you use the model and mail the
same percentage, you capture over 97% of the activations. This means that at 50% of the file, the model provides
a 'lift' of 94% ((97-50)/50).
Financial assessment
To get the final LTV we use the formula:

LTV = Pr(Activation) * Risk Index Score * Expected Account Profit - Marketing Expense

At this point, we apply the risk matrix score and expected account profit value. The financial assessment shows
the model's ability to select the most profitable customers. Notice how the risk score index is lower for the most
responsive customers. This is common in direct response and demonstrates 'adverse selection': in other words, the
riskier prospects are often the most responsive. At some point in the process, a decision is made to mail a percentage
of the file. In this case, you could consider the fact that in decile 7 the LTV becomes negative and limit your selection
to deciles 1 through 6. Another decision criterion could be that you need to be above a certain 'hurdle rate' to cover
fixed expenses. In this case, you might require the cumulative LTV to be above a certain amount, such as $30.
Decisions are often made considering a combination of criteria.
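The LTV formula itself reduces to one line; the inputs below are illustrative, not taken from the case study's appendix:

```python
def lifetime_value(p_active, risk_index, expected_profit, marketing_expense=0.78):
    """LTV = Pr(activation) * risk index * expected 3-year account
    profit - marketing expense (the $.78 mail cost quoted above)."""
    return p_active * risk_index * expected_profit - marketing_expense

# e.g. a 3% activation probability, neutral risk (1.0), $400 expected profit:
print(round(lifetime_value(0.03, 1.0, 400.0), 2))  # 11.22
```

A negative result, as in decile 7 of the case study, signals that mailing that segment destroys value.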
The final evaluation of your efforts may be measured in a couple of ways. You could set the goal to mail fewer
pieces and capture the same LTV. If we mail the entire file with random selection, we would capture $13,915,946
in LTV at a mail cost of $754,155. By mailing 5 deciles using the model, we would capture $14,042,255 in LTV
with a mail cost of only $377,074. In other words, with the model we could capture slightly more LTV and cut our
marketing cost in half. Or, we can compare similar mail volumes and increase LTV. With random selection at 50%
of the file, we would capture $6,957,973 in LTV. Modelled, the LTV would climb to $14,042,255. This is a lift of
over 100% ((14,042,255 - 6,957,973) / 6,957,973 = 1.018).
Conclusion
Successful data mining and predictive modelling depend on quality data that is easily accessible. A well-constructed
data warehouse allows for the integration of offer history, which is an excellent predictor of Lifetime Value.
(Source: Rud, C, O., Data Warehousing for Data Mining: A Case Study [PDF] Available at: <http://www2.sas.com/
proceedings/sugi25/25/dw/25p119.pdf>. [Accessed 30 September 2011].)
Questions
1. How many datasets is the data partitioned into?
2. Which is the first challenge mentioned in the above case study?
3. What are the analysis capabilities?

Bibliography
References
• Adriaans, P., 1996. Data Mining, Pearson Education India.
• Alexander, D., Data Mining [Online] Available at: <http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/>. [Accessed 9 September 2011].
• Berlingerio, M., 2009. Temporal mining for interactive workflow data analysis [Video Online] Available at: <http://videolectures.net/kdd09_berlingerio_tmiwda/>. [Accessed 12 September 2011].
• Kriegel, H. P., Spatial Data Mining [Online] Available at: <http://www.dbs.informatik.uni-muenchen.de/Forschung/KDD/SpatialKDD/>. [Accessed 9 September 2011].
• Kuonen, D., 2009. Data Mining Applications in Pharma/BioPharma Product Development [Video Online] Available at: <http://www.youtube.com/watch?v=kkRPW5wSwNc>. [Accessed 12 September 2011].
• Galeas, Web mining [Online] Available at: <http://www.galeas.de/webmining.html>. [Accessed 12 September 2011].
• Hadley, L., 2002. Developing a Data Warehouse Architecture [Online] Available at: <http://www.users.qwest.net/~lauramh/resume/thorn.htm>. [Accessed 8 September 2011].
• Han, J. and Kamber, M., 2006. Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann.
• Han, J., Kamber, M. and Pei, J., 2011. Data Mining: Concepts and Techniques, 3rd ed., Elsevier.
• http://nptel.iitm.ac.in, 2008. Lecture 34 - Data Mining and Knowledge Discovery [Video Online] Available at: <http://www.youtube.com/watch?v=m5c27rQtD2E>. [Accessed 12 September 2011].
• http://nptel.iitm.ac.in, 2008. Lecture 35 - Data Mining and Knowledge Discovery Part II [Video Online] Available at: <http://www.youtube.com/watch?v=0hnqxIsXcy4&feature=relmfu>. [Accessed 12 September 2011].
• Humphries, M., Hawkins, M. W. and Dy, M. C., 1999. Data Warehousing: Architecture and Implementation, Prentice Hall Professional.
• Intricity101, 2011. What is OLAP? [Video Online] Available at: <http://www.youtube.com/watch?v=2ryG3Jy6eIY&feature=related>. [Accessed 12 September 2011].
• Kimball, R., 2006. The Data Warehouse Lifecycle Toolkit, Wiley-India.
• Kumar, A., 2008. Data Warehouse Layered Architecture 1 [Video Online] Available at: <http://www.youtube.com/watch?v=epNENgd40T4>. [Accessed 11 September 2011].
• Larose, D. T., 2006. Data Mining Methods and Models, John Wiley and Sons.
• Learndatavault, 2009. Business Data Warehouse (BDW) [Video Online] Available at: <http://www.youtube.com/watch?v=OjIqP9si1LA&feature=related>. [Accessed 12 September 2011].
• Lin, W., Orgun, M. A. and Williams, G. J., An Overview of Temporal Data Mining [Online PDF] Available at: <http://togaware.redirectme.net/papers/adm02.pdf>. [Accessed 9 September 2011].
• Liu, B., 2007. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer.
• Mailvaganam, H., 2007. Data Warehouse Project Management [Online] Available at: <http://www.dwreview.com/Articles/Project_Management.html>. [Accessed 8 September 2011].
• Maimon, O. and Rokach, L., 2005. Data Mining and Knowledge Discovery Handbook, Springer Science and Business.
• Maimon, O. and Rokach, L., Introduction to Knowledge Discovery in Databases [Online PDF] Available at: <http://www.ise.bgu.ac.il/faculty/liorr/hbchap1.pdf>. [Accessed 9 September 2011].
• Mento, B. and Rapple, B., 2003. Data Mining and Warehousing [Online PDF] Available at: <http://www.arl.org/bm~doc/spec274webbook.pdf>. [Accessed 9 September 2011].
• Mitsa, T., 2009. Temporal Data Mining, Chapman & Hall/CRC.
• OracleVideo, 2010. Data Warehousing Best Practices: Star Schemas [Video Online] Available at: <http://www.youtube.com/watch?v=LfehTEyglrQ>. [Accessed 12 September 2011].
• Orli, R. and Santos, F., 1996. Data Extraction, Transformation, and Migration Tools [Online] Available at: <http://www.kismeta.com/extract.html>. [Accessed 9 September 2011].
• Ponniah, P., 2001. Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals, Wiley-Interscience Publication.
• SalientMgmtCompany, 2011. Salient Visual Data Mining [Video Online] Available at: <http://www.youtube.com/watch?v=fosnA_vTU0g>. [Accessed 12 September 2011].
• Seifert, J. W., 2004. Data Mining: An Overview [Online PDF] Available at: <http://www.fas.org/irp/crs/RL31798.pdf>. [Accessed 9 September 2011].
• Springerlink, 2006. Data Mining System Products and Research Prototypes [Online] Available at: <http://www.springerlink.com/content/2432076500506017/>. [Accessed 12 September 2011].
• SQLUSA, 2009. SQLUSA.com Data Warehouse and OLAP [Video Online] Available at: <http://www.youtube.com/watch?v=OJb93PTHsHo>. [Accessed 12 September 2011].
• StatSoft, 2010. Data Mining, Cluster Techniques - Session 28 [Video Online] Available at: <http://www.youtube.com/watch?v=WvR_0Vs1U8w>. [Accessed 12 September 2011].
• StatSoft, 2010. Data Mining, Model Deployment and Scoring - Session 30 [Video Online] Available at: <http://www.youtube.com/watch?v=LDoQVbWpgKY>. [Accessed 12 September 2011].
• StatSoft, Data Mining Techniques [Online] Available at: <http://www.statsoft.com/textbook/data-mining-techniques/>. [Accessed 12 September 2011].
• Swallacebithead, 2010. Using Data Mining Techniques to Improve Forecasting [Video Online] Available at: <http://www.youtube.com/watch?v=UYkf3i6LT3Q>. [Accessed 12 September 2011].
• University of Magdeburg, 2007. 3D Spatial Data Mining on Document Sets [Video Online] Available at: <http://www.youtube.com/watch?v=jJWl4Jm-yqI>. [Accessed 12 September 2011].
• Wan, D., 2007. Typical Data Warehouse Deployment Lifecycle [Online] Available at: <http://dylanwan.wordpress.com/2007/11/02/typical-data-warehouse-deployment-lifecycle/>. [Accessed 12 September 2011].
• Zaptron, 1999. Introduction to Knowledge-based Knowledge Discovery [Online] Available at: <http://www.zaptron.com/knowledge/>. [Accessed 9 September 2011].
Recommended Reading
• Chang, G., 2001. Mining the World Wide Web: An Information Search Approach, Springer.
• Chattamvelli, R., 2011. Data Mining Algorithms, Alpha Science International Ltd.
• Han, J. and Kamber, M., 2006. Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann.
• Jarke, M., 2003. Fundamentals of Data Warehouses, 2nd ed., Springer.
• Kantardzic, M., 2001. Data Mining: Concepts, Models, Methods, and Algorithms, 2nd ed., Wiley-IEEE.
• Khan, A., 2003. Data Warehousing 101: Concepts and Implementation, iUniverse.
• Liu, B., 2011. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, 2nd ed., Springer.
• Markov, Z. and Larose, D. T., 2007. Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley-Interscience.
• Parida, R., 2006. Principles & Implementation of Data Warehousing, Firewall Media.
• Ponniah, P., 2001. Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals, Wiley-Interscience Publication.
• Ponniah, P., 2010. Data Warehousing Fundamentals for IT Professionals, 2nd ed., John Wiley and Sons.
• Prabhu, C. S. R., 2004. Data Warehousing: Concepts, Techniques, Products and Applications, 2nd ed., PHI Learning Pvt. Ltd.
• Pujari, A. K., 2001. Data Mining Techniques, 4th ed., Universities Press.
• Rainardi, V., 2007. Building a Data Warehouse with Examples in SQL Server, Apress.
• Roddick, J. F. and Hornsby, K., 2001. Temporal, Spatial, and Spatio-temporal Data Mining, Springer.
• Scime, A., 2005. Web Mining: Applications and Techniques, Idea Group Inc. (IGI).
• Stein, A., Shi, W. and Bijker, W., 2008. Quality Aspects in Spatial Data Mining, CRC Press.
• Thuraisingham, B. M., 1999. Data Mining: Technologies, Techniques, Tools, and Trends, CRC Press.
• Witten, I. H. and Frank, E., 2005. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann.
Self Assessment Answers
Chapter I
1. a
2. d
3. c
4. d
5. a
6. d
7. c
8. b
9. a
10. d
Chapter II
1. b
2. a
3. d
4. a
5. b
6. b
7. c
8. c
9. a
10. a
Chapter III
1. a
2. c
3. b
4. d
5. a
6. d
7. b
8. b
9. c
10. a
Chapter IV
1. a
2. c
3. a
4. a
5. c
6. c
7. a
8. a
9. c
10. d
11. a
Chapter V
1. d
2. c
3. a
4. a
5. a
6. a
7. b
8. b
9. d
10. a
Chapter VI
1. c
2. a
3. b
4. d
5. c
6. a
7. b
8. b
9. b
10. a
Chapter VII
1. a
2. c
3. d
4. a
5. b
6. d
7. c
8. c
9. a
10. b