DATA WAREHOUSING AND DATA MINING
ANNA UNIVERSITY CHENNAI
UNIT I
INTRODUCTION
OBJECTIVES
Data mining is a promising and flourishing field. It concerns the extraction of models and
patterns from the data stored in the various archives of business organizations. The extracted
knowledge can help organizations discover new products, improve their business processes,
and develop decision support systems. This chapter starts by showing the relationships of
data mining with fields like machine learning, statistics and database technology. It then
provides an overview of data mining tasks, the components of a data mining algorithm, and
data mining architectures, giving a holistic view of data mining from the database, statistical,
algorithmic and application perspectives.
LEARNING OBJECTIVES
- To explore the relationships among data mining, machine learning, statistics
and database technology
- To study the nature of data and data sources
- To study the components of a data mining algorithm
- To explore the functionalities of data mining
- To explore the Knowledge Discovery in Databases (KDD) cycle
- To explore the architecture of a typical data mining system
1.1 NEED FOR DATA MINING
Business organizations use huge amounts of data for their daily activities, so modern
organizations hold large quantities of data and information. It is also much easier now to
capture, process, store, distribute and transmit digital information. This huge amount of
data is growing at a phenomenal rate: it is estimated that the volume of data doubles
approximately every 20 months.
But the full potential of this information and data is not realized, for two reasons:
1. Information is scattered across different archive systems, and most organizations
have not succeeded in integrating these sources fully.
2. There has been a lack of awareness about software tools that can help unearth the
useful information hidden in the data.
Here comes the need for data mining. With declining hardware and software costs,
organizations feel the necessity of organizing their data in effective structures and of analyzing
it using flexible and scalable procedures.
The motivations for organizations to implement data mining solutions arise from the
following factors:
1. Advances in Internet technology make communication between customers and
organizations easy. Moreover, data can now be effectively shared and distributed,
and new data can be added to the existing data.
2. Competition among business organizations is now so intense that it forces them
to discover new products and improve their processes and services.
3. Hardware costs, especially storage costs, are rapidly falling, and there has been
growth in the availability of robust and flexible algorithms.
Therefore data mining is considered an important field of study, and business organizations
nowadays spend considerable effort developing better data mining tools.
1.2 NATURE OF DATA MINING
Data mining uses concepts from Statistics, Computer Science, Machine Learning,
Artificial Intelligence (AI), Database Technology, and Pattern Recognition. Each field has
its own distinct way of solving problems, and data mining is the result of their combined
ideas. The original genesis of data mining is in business. Just as mining the earth yields
precious resources, it is often believed that unearthing the data produces hidden information
that would otherwise have eluded the attention of management.
Defining a field is a difficult task because of many conflicting definitions. In the data
mining area especially, there are many conflicting views because of its diverse nature. Some
view it as an extension of statistics, some view it as part of machine learning, and a large
community of researchers views data mining as the logical advancement of database
technology.
A standard definition of Knowledge Discovery in Databases is given by Fayyad,
Piatetsky-Shapiro, and Smyth (1996): "Knowledge discovery in databases is the nontrivial
process of identifying valid, novel, potentially useful and ultimately understandable patterns
in data."
Hence the crucial points are valid, novel and understandable patterns. By valid, it is
meant that the patterns generally hold in all circumstances. Novel means that the discovered
pattern is new and not encountered before. By understandable, it is meant that the pattern
is interpretable.
For example, looking up the address of a customer in a telephone directory is not
a data mining task, but making a prediction of the sales of a particular item is the domain
of data mining.
The definition that effectively captures the gist of data mining (D. J. Hand) is: "Data
mining consists in the discovery of interesting, unexpected, or valuable structures
in large data sets." This is shown as a diagram in Figure 1.1.
Figure 1.1 Nature of Data Mining Process
What is the difference between KDD and Data mining?
The term KDD means Knowledge Discovery in Databases. Many authors consider
the two terms synonymous, but it is better to remember that KDD is a whole process and
data mining is just the algorithmic component of that process.
All these definitions imply that data mining assumes there are treasures (or nuggets)
hidden under the pile of data, and that the goal of data mining is to discover these treasures
using specific tools. This is not a new idea. Statistics has been using exploratory data analysis
and multivariate exploratory analysis for many years. But data mining is different: the manual
methods used by statisticians to provide summaries and generate reports are feasible only
for small data sets, whereas organizations now hold gigabytes or even terabytes of data.
For such huge piles of data, manual methods fail, and automatic methods like data mining
are of great use.
Data mining methods can be both descriptive and predictive. Descriptive methods
discover interesting patterns or relationships that are inherent in the data, while predictive
methods predict future behavior based on the models. Some of the areas that use data
mining extensively to understand the behavior of the data are banking, finance, medicine,
security, and telecommunications. Data mining places more stress on scalability with respect
to the number of features and instances, on algorithms and architectures, and on the
automation of data handling.
1.3 DATA, INFORMATION, AND KNOWLEDGE
It is better to have a clear idea about the distinctions between data, information and
knowledge. The knowledge pyramid is shown in Figure 1.2.
Figure 1.2 The Knowledge Pyramid.
Data
All facts are data. Data can be numbers or text that can be processed by a computer.
Today, organizations are accumulating vast and growing amounts of data. Billions of records
and gigabytes of data are not unusual in larger business organizations, and data is stored in
different data sources like flat files, databases or data warehouses, and in different storage formats.
The types of data organizations use may be operational or non-operational.
Operational data is the data needed to carry out normal business procedures and processes;
for example, daily sales data is operational data. Non-operational data is strategic data
reserved for decision making; for example, macroeconomic data is non-operational data.
Strategic data are of great importance to organizations for taking decisions, and hence they
are normally not updated, modified or deleted at will by the user.
Data can also be metadata, that is, data that describes other data. The contents of a
data dictionary and a logical database design are examples of metadata.
Information
The processed data is called information. This includes patterns, associations, or
relationships among data. For example, sales data can be analyzed to extract information
such as which product is the fastest selling.
Knowledge
Knowledge is condensed information. For example, the historical patterns and
future trends obtained from the above sales data can be called knowledge.
Knowledge of the selling behavior of a product, and even of the buying behavior of
customers, helps business organizations immensely. These are called knowledge because
they are the essence of the data. Many organizations are data rich but knowledge poor;
unless knowledge is extracted, most data are of no use.
The knowledge encountered in the data mining process is of four types:
1. Shallow knowledge: This type of knowledge can be easily stored and manipulated
using query languages. For example, using a query language like SQL one can extract
information from a table.
2. Multidimensional knowledge: This is similar to shallow knowledge, except that
multidimensional knowledge is associated with multidimensional data and is
typically extracted using OLAP operations on data cubes.
3. Hidden knowledge: This knowledge is present in the form of regular patterns or
regularities in the data. It cannot be extracted using query languages like SQL;
only data mining algorithms can extract this type of knowledge.
4. Deep knowledge: This type of knowledge lies deep within the data; it is a sort of
domain knowledge, and direction must be given to extract it. Sometimes even
data mining algorithms are not successful in extracting this type of knowledge.
Once the knowledge is extracted then it should be represented in some form so that
the users can understand it. The types of knowledge structures one may encounter in data
mining are
1. Decision table.
Decision table is a two dimensional table structure which gives the knowledge in the
form of values. A sample decision table is shown in Table 1.1.
Table 1.1: Sample decision Table
Based on the decision table, one can infer some knowledge of the attributes.
2. Decision Tree
A decision tree is a flowchart-like structure that has nodes and edges. The nodes are
either the root, internal nodes, or terminal nodes. The root is a special node, internal nodes
represent conditions used to test attribute values, and the terminal nodes represent the classes.
3. Rules
Rules represent knowledge in the form of IF-THEN statements. For example, a rule can be
extracted from the above table and represented as
IF Number of hours worked >= 40 THEN Result = Pass
A special type of rule can involve exceptions as well. Such rules are of the form IF
condition THEN result EXCEPTION condition. This structure can capture exceptions
also.
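As a minimal sketch (not from the original text), such a rule with an exception clause can be represented in code; the 40-hour threshold follows the example rule above, while the attendance exception and the attribute names are purely hypothetical:

# A small sketch of an IF-THEN rule with an exception clause.
def rule_result(record):
    if record.get("hours_worked", 0) >= 40:
        # EXCEPTION clause: the rule does not fire when this condition holds (hypothetical).
        if record.get("attendance_percent", 100) < 50:
            return "Fail"
        return "Pass"
    return "Fail"

print(rule_result({"hours_worked": 45}))                             # Pass
print(rule_result({"hours_worked": 45, "attendance_percent": 40}))   # Fail (exception)
print(rule_result({"hours_worked": 30}))                             # Fail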
4. Instance space
Points can be represented in the instance space and grouped to convey knowledge.
The knowledge can then be expressed in the form of dendrograms.
Intelligence and Wisdom
The applied knowledge is called Intelligence. For example, the knowledge extracted
can be put to use by the organizations in the form of intelligent systems or knowledge-
based systems. Finally the wisdom represents the ultimate maturity of mind. Still it is a long
way to go for computer programs to process wisdom.
1.4 DATA MINING IN RELATION TO OTHER FIELDS
Data mining (DM) methodology draws upon a collection of tools from mainly two areas:
Machine Learning (ML) and Statistics. These tools include traditional statistical methods of
multivariate analysis, such as classification, clustering, contingency table analysis, principal
components analysis, correspondence analysis, multidimensional scaling, factor analysis,
and latent structure analysis. DM also uses tools beyond statistics, such as tree building,
support vector machines, link analysis, genetic algorithms, market-basket analysis, and
neural network analysis.
1.4.1 Data Mining and Machine learning
Machine learning is an important branch of AI and is concerned with finding relations
and regularities present in the data. Rosenblatt introduced the first machine learning model,
called the perceptron. Around that time decision trees were also proposed.
Machine learning is the automation of a learning process. This broad field includes not
only learning from examples but also other areas like reinforcement learning, learning with
a teacher, etc. A machine learning algorithm takes a data set and its accompanying
information as input and returns a concept. The concept represents the output of learning,
which is a generalization of the data. This generalization can then be used to test new
cases.
What does learning mean? Learning, like adaptation, occurs as the result of the interaction
of the program with its environment. It can be compared with the interaction between a
teacher and a student. This is the domain of inductive learning.
Learning takes place in two stages. During the first stage, the teacher communicates
to the student the information the student is supposed to master; the student receives the
information and understands it. At this stage the teacher has no knowledge of whether
the information is grasped by the student. This leads to the second stage of learning: the
teacher asks the student a set of questions to find out how much information has been
grasped, the student is tested on these questions, and the teacher informs the student of the
assessment. This kind of learning is typically called supervised learning.
This process identifies the classes to which an instance belongs and is called classification
learning. The model learns from examples: a teacher helps the system construct a
model by defining classes and supplying examples of each class. The model learns the data
and generates concepts, and the concepts are used to predict the class of previously unseen
objects. This is similar to discriminant analysis in statistics.
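As a small sketch of supervised (classification) learning, assuming the scikit-learn library; the tiny labeled data set below is purely illustrative:

# Supervised learning sketch: a teacher supplies labeled examples, the algorithm
# builds a model (here a decision tree), and the model classifies unseen objects.
from sklearn.tree import DecisionTreeClassifier

X_train = [[45, 90], [50, 80], [20, 60], [10, 95], [42, 70], [15, 40]]   # [hours, attendance]
y_train = ["Pass", "Pass", "Fail", "Fail", "Pass", "Fail"]               # class labels

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

print(model.predict([[48, 85], [12, 50]]))   # predict classes of previously unseen instances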
The second kind of learning is by self-instruction, which is the commonest kind of
learning process. Here the student comes into contact with the environment and interacts
with it through a punishment/reward mechanism that enables him to learn correctly. This
process of self-instruction is based on the concept of learning from mistakes: the more
mistakes the student makes and learns from, the better his learning will be. This type of
learning is called unsupervised learning. Here the program is supplied with objects but no
classes are defined; the algorithm itself observes the examples and recognizes patterns
based on grouping principles. This is similar to cluster analysis in statistics. The quality of
the model produced by inductive learning methods is such that the model can be used to
predict the outcome of unseen test cases.
By the 1990s there was considerable overlap between the machine learning and statistics
communities. The term KDD was coined to describe the whole process of extracting
information from data. Data mining is considered to be just one component of KDD,
the one mainly concerned with the learning algorithms.
What is the difference between Machine learning and KDD?
Some subtle differences exist between KDD and Machine learning.
They are
1. KDD is concerned with finding understandable knowledge in a database while
Machine learning is concerned with improving performance of an agent.
2. KDD is concerned with very large, real-world databases, while Machine learning
typically involves smaller data sets.
Machine learning is a broader field compared to KDD. Learning from examples is
just a methodology along with other methods like reinforcement learning, learning with
teacher, etc.
1.4.2 Data Mining and Database Technology
Data mining derives its strength from the development of database technology. Business
organizations typically maintain many databases, but primarily they use two types of
databases:
1. Operational database
2. Strategic database
Operational database is used to conduct the day-to-day operations of the organization.
This includes information about customers, transactions, products etc. But strategic databases
are databases that are used for taking decisions. Larger organizations have many operational
databases.
For decision making one may have to combine different operational databases.
Organizations frequently implement data warehouse technology for this strategic decision
making. A data warehouse provides a centralized database that supplies data for data mining
in a suitable form. A data warehouse is not a requirement for data mining, but using one
solves many problems associated with the data, such as integrity and errors. Data mining
extracts the data from the data warehouse into a database or data mart.
Operational databases are transactional in nature; a typical customer database is an
operational database. Well-defined and repetitive queries like "What is the address of
Mr. X?" or "In which city does Ms. Y live?" are the common questions answered by a
traditional operational database. An operational database has to support a large number of
such queries and updates to the contents of the data. This type of database usage is called
Online Transaction Processing (OLTP).
Decision support, however, requires different types of queries, such as aggregation.
A query like "Find the year-wise sales of this product in the Tamil Nadu region" is of this
kind and is called Online Analytical Processing (OLAP); making a prediction for the next
quarter from such summaries is the domain of data mining. An OLAP query can be used to
summarize the data of the database and to provide data in a form suitable for data mining
to make a prediction. OLAP also gives a graphical presentation of the empirical relationships
between the variables, in the form of a multidimensional data cube, to facilitate decision
making.
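As a minimal sketch of such an OLAP-style aggregation, assuming the pandas library; the column names and sales figures below are made up for illustration:

# "Find the year-wise sales of this product in the Tamil Nadu region"
import pandas as pd

sales = pd.DataFrame({
    "region":  ["Tamil Nadu", "Tamil Nadu", "Tamil Nadu", "Kerala"],
    "year":    [2021, 2022, 2023, 2023],
    "product": ["P1", "P1", "P1", "P1"],
    "amount":  [120000, 150000, 170000, 90000],
})

yearwise = (sales[(sales.region == "Tamil Nadu") & (sales.product == "P1")]
            .groupby("year")["amount"].sum())
print(yearwise)   # a summary that could feed a prediction for the next quarter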
OLAP and OLTP place different kinds of requirements on database administrators.
OLTP requires the data to be up to date; its queries operate on the data and, if necessary,
make modifications to the database. OLAP, on the other hand, is a deductive process and, as an important
tool of business intelligence, does not modify the data but instead tries to answer
questions such as why certain things are true.
OLAP requires the user to formulate a hypothesis as a query. The query is then
validated against the data for confirmation. In other words, the OLAP analyst generates a
series of hypotheses and validates them using data warehouse data.
For a small dataset with few variables, OLAP produces results quickly. But for data
sets involving hundreds or thousands of variables, speed and complexity become crucial
problems, and OLAP queries become complex and time consuming. The requirements of
OLAP and OLTP are so different that organizations sometimes maintain two different sets
of applications.
Before a query is generated, the analyst often needs to explore the financial implications
of the patterns that may be discovered for the organization. OLAP is not a substitute for
data mining, but the two can complement each other: OLAP can help the user in the initial
stages of the KDD process, helping data mining focus on the most important data, and the
final data mining results can easily be represented by an OLAP hypercube.
Hence the sequence would be: query -> data retrieval -> OLAP -> data mining.
A query has the lowest information-processing capacity and data mining the highest.
There is therefore a tradeoff between information capacity and ease of implementation,
and business organizations should know exactly what their requirements are.
1.4.3 Data Mining and Statistics
Statistics is a branch of mathematics that has a solid theoretical foundation for statistical
learning, but it requires knowledge of statistical procedures and the guidance of a good
statistician. Data mining, however, combines the expert's knowledge of the data with
advanced analysis techniques, employing computer systems that effectively replace the
guidance of the statistician.
Statistical methods are developed in relation to the data being analyzed; they are
coherent and rigorous and have strong theoretical foundations. Many statisticians consider
that data mining lacks a solid theoretical model and has too many competing models, the
contention being that there is always some model that fits the given data.
Secondly, it is often believed that analyzing a great amount of data leads to discovering
non-existent relations. These issues are often referred to derogatorily as data fishing, data
dredging or data snooping. While these issues are worth considering, modern data
mining algorithms pay great attention to the generalization of results.
Another aspect that separates data mining is that statistics is generally concerned primarily
with primary data, that is, data stored in primary memory such as RAM, whereas data mining
deals with huge piles of data and hence focuses more on data in secondary storage. Also,
while statistics is content with experimental data, data mining works more with observational data.
But data mining is not a substitute for statistics. Statistics still have a role to play in
interpreting the results of the data mining algorithms. Hence statistics and data mining will
not replace each other but complement each other in data analysis.
1.4.4 Data Mining and Mathematical Programming
With the advent of Support Vector Machines (SVM), the relation between data mining
and mathematical programming is well established. It also provides the additional insight
that the majority of data mining problems can be formulated as mathematical programming
problems, for which efficient solutions can be obtained.
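As a minimal sketch, assuming the scikit-learn library; fitting an SVM internally solves a quadratic (mathematical) programming problem. The points and labels below are purely illustrative:

# A linear SVM classifier: maximize the margin subject to classification constraints.
from sklearn.svm import SVC

X = [[0.0, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]]   # illustrative points
y = [0, 0, 1, 1]                                        # illustrative labels

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
print(clf.predict([[0.1, 0.2], [0.8, 0.9]]))            # -> [0 1]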
1.5 DATA AND ITS STORAGE STRUCTURES
Some of the application domains where huge data stores are used are listed below.
- Digital libraries: these contain text data as well as document images.
- Image archives: many archives contain large image databases along with numeric
and text data.
- Health care: this sector uses extensive databases like patient databases, health
insurance data, doctors' information, and bioinformatics information.
- Scientific domain: this domain has huge collections of experimental data such as
genomic and biological data.
- WWW: a huge amount of data is distributed over the Internet. These data are
heterogeneous in nature.
Data mining algorithms can handle different types of media and data, although the
algorithms may differ depending on the type of media or data they handle. This is a
great challenge to researchers.
What are the types of data?
Data are of different types. But generally the data that is encountered in data mining
process can be of three types. They are
1. Record data
2. Graph data
3. Ordered data.
1. Record Data
A dataset is a collection of measurements taken from a process: it contains a collection
of objects, and each object has a set of measurements. The measurements can be
arranged in the form of a matrix. Each row represents an object and may be called an entity,
case, or record; the columns of the dataset are called attributes, features or fields. The
table is filled with the observed data. It is also useful to note the general jargon associated
with a dataset: "label" is the term used to describe an individual observation, the input
variables whose values are used to make a prediction are called descriptors or predictor
variables, and the variable that is predicted is called the response variable.
Take, for example, a dataset of medical records (refer to Table 1.2).
Table 1.2: Sample Medical Data Set
Here each record is about a patient, and each column is an attribute of the patient record.
A record may have data in different forms like text, numeric or even image data. The
measurement scale of the data is generally categorized as nominal, ordinal, interval or
ratio. Nominal is a measurement scale in which a variable can assume only a limited number
of distinct values; for example, patient location may be (Chennai, Not Chennai), and it is
meaningless to ask which of these two values is larger. Ordinal is a scale in which the
values are ordered, but the differences between values do not indicate the magnitude of the
actual difference; for example, the degree of a disease may be high, medium or low.
As degrees, high is greater than medium and medium is greater than low, but the
differences reveal nothing about magnitude. A variable whose values have meaningful
intervals is measured on an interval scale; an interval scale, however, has no natural zero.
Finally, a ratio scale is one whose values have a natural zero, so that ratios of values
are meaningful.
Special types of record types are
1. Transaction record
2. Data matrix
3. Sparse data matrix.
A transaction record (refer to Table 1.3) is a record of transactions collected on a
regular basis. For example, a general store can keep a record of the items purchased by
each customer.
Table 1.3 Sample Transaction Data set.
A characteristic of this data is that a record may contain many zeroes; in other words,
the data is binary in nature and dominated by zeroes. Sometimes the interest lies in
analyzing only the non-zero components.
Data Matrix
Data matrix is a variation of the record type because it consists of numeric attributes.
The standard matrix operations can be applied on these data. The data is thought of as
points or vectors in the multidimensional space where every attribute is a dimension
describing the object.
Sparse data matrix
This is also a special type where only non-zero values are important.
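As a minimal sketch of a sparse data matrix, assuming the SciPy library; the transaction matrix below is made up for illustration:

# Each row is a transaction, each column an item; only non-zero entries are stored.
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([
    [1, 0, 0, 1, 0],    # transaction 1: items 0 and 3
    [0, 0, 1, 0, 0],    # transaction 2: item 2
    [1, 1, 0, 0, 1],    # transaction 3: items 0, 1 and 4
])

sparse = csr_matrix(dense)    # store only the non-zero components
print(sparse.nnz)             # number of non-zero entries: 6
print(sparse.toarray())       # recover the dense form when needed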
2. Graph data
Graph data captures relationships among objects. For example, one web page can
refer to another web page; this can be modeled as a graph in which the nodes are the web
pages and the hyperlinks are the edges that connect the nodes.
A special type of graph data is one in which a node is itself another graph.
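As a minimal sketch of graph data (web pages as nodes, hyperlinks as directed edges); the page names below are purely illustrative:

# A plain adjacency-dictionary representation of a small web graph.
web_graph = {
    "index.html":    ["about.html", "products.html"],
    "about.html":    ["index.html"],
    "products.html": ["index.html", "order.html"],
    "order.html":    [],
}

for page, links in web_graph.items():      # enumerate the hyperlink edges
    for target in links:
        print(page, "->", target)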
3. Ordered data
Ordered data objects involve attributes that have relationships that involve order in
time or space.
The examples of ordered data are
1. Sequential data are temporal data whose attributes are associated with time. For
example, customer purchasing patterns during festival time are sequential data.
2. Sequence data are similar to sequential data but do not have time stamps; such
data involve sequences of words or letters. For example, DNA data is a sequence
over the four characters A, T, G, C.
3. Time-series data is a special type of sequence data in which the data is a series of
measurements over time.
4. Spatial data
Spatial data has attributes such as positions or areas. For example maps are spatial
data where the points are related by location.
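As a minimal sketch of the ordered (time-series) data described above, assuming the pandas library; the monthly sales figures are made up for illustration:

# A series of measurements over time, indexed by monthly time stamps.
import pandas as pd

sales = pd.Series(
    [210, 225, 240, 260, 255, 280],
    index=pd.date_range("2023-01-01", periods=6, freq="MS"),
    name="monthly_sales",
)

print(sales)
print(sales.rolling(window=3).mean())   # a simple moving-average trend summary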
Data Storage
Once the dataset is assembled, it has to be stored in a structure that is suitable for
data mining. The goal of data storage management is to make the data available to support
processing by the data mining algorithms. There are different approaches to organizing
and managing data in storage files and systems, from flat files to data warehouses, and
each has its own way of capturing and storing the data.
Some are listed below
1. Flat files
2. Databases
3. Data warehouses
4. Special databases like Object-relational databases, object oriented databases,
transactional databases, advanced databases like spatial databases, multimedia
databases, time-series databases and textual databases.
5. Unstructured and semi structured World Wide Web.
Let us review some of these data stores briefly so the input for the data mining can be
conditioned.
1. Flat Files
Flat files are the simplest and most commonly available data source, and also the
cheapest way of organizing data. Flat files are files where the data is stored in
plain ASCII or EBCDIC format.
Simple programs like Notepad or a spreadsheet can be used to create a flat file data
source for data mining at the lowest level. Flat files may be simple text files or files in
binary format, but the data mining algorithm should be aware of the structure of the flat
file in order to process it.
The flat file is called flat because the same fields are located at the same offset in each
record. Minor changes to the data in flat files can affect the results of the data mining algorithms.
Hence flat files are suitable only for storing small data sets and are not desirable when the
dataset becomes larger.
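As a minimal sketch of reading a flat file, using Python's built-in csv module; the file name and its columns are hypothetical:

# Every record in the flat file has the same fields in the same positions.
import csv

with open("patients.csv", newline="") as f:
    reader = csv.DictReader(f)           # first line holds the field names
    records = [row for row in reader]

print(len(records), "records read")
print(records[0])                        # e.g. {'name': '...', 'age': '...'}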
2. Database System
A database system normally consists of database files and a database management
system (DBMS). The database files contain the original data and metadata; the DBMS aims to
manage the data and improve operator performance by providing tools such as database
administration, query processing and transaction management.
A relational database consists of sets of tables. The tables have rows and columns.
The columns represent the attributes and rows represent the tuples. The tuple corresponds
to either an object or relationship between objects.
The user can access and manipulate the data in the database using SQL. Data mining
algorithms can use SQL to take advantage of the structure present in the data,
but they go beyond SQL by performing tasks such as classification,
prediction and deviation analysis.
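As a minimal sketch of pulling structured data out of a relational database with SQL, using Python's built-in sqlite3 module; the table and column names are hypothetical:

import sqlite3

conn = sqlite3.connect(":memory:")     # a small in-memory demo database
conn.execute("CREATE TABLE sales (product TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("P1", "Chennai", 120.0), ("P1", "Madurai", 80.0), ("P2", "Chennai", 60.0)])

# SQL exploits the structure of the data (selection, aggregation) ...
rows = conn.execute("SELECT product, SUM(amount) FROM sales GROUP BY product").fetchall()
print(rows)   # [('P1', 200.0), ('P2', 60.0)]
# ... while data mining algorithms go further, e.g. classification or prediction on such data.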
3. Transactional Databases
A transactional database is a collection of transaction records. Each record is a
transaction, which may have a time stamp, an identifier and a set of items, and which may
have links to other tables. Transactional databases are normally created for performing
association analysis, which indicates the correlations among items.
4. Data Warehouses
A data warehouse is a decision support database that provides historical data for
data mining algorithms.
A data warehouse is a subject-oriented, centralized data source. It is constructed by
integrating multiple heterogeneous data sources like flat files and databases in a standard format.
It provides a subject-oriented, integrated, time-variant and non-volatile collection of data for
decision support systems and data mining.
By subject oriented we mean that the data is organized around subjects of the organization
like customers or sales. By integrated, we mean that the data comes from multiple data
sources and is cleaned and preprocessed to make it reliable for mining. By non-volatile,
we mean that the data provided by the data warehouse is historical in nature and normally
no updates, modifications or deletions are carried out. Thus the data warehouse is like a
container that holds all the data necessary for carrying out business
intelligence operations.
There are two ways in which a data warehouse can be created. The first is as a centralized
archive that collects all the information from all the data sources. The second approach is to
integrate different data marts to create a local data warehouse; in this approach the data
marts are initially not connected, but integration is carried out slowly so that a centralized
data source is created. Thus a data warehouse has centralized data, a metadata
structure and a series of specific data marts that are accessible by the users.
Modern data warehouses for the web are called data webhouses. With the advent of
the web, the data warehouse becomes a web data warehouse, or simply a data webhouse.
The web offers the most heterogeneous and dynamic repository, which includes different
kinds of data like raw data, images and video that can be accessed by a browser. Web
data has three components: web content (which includes the web pages), web structure
(which includes the relationships between documents), and web usage information. The
data webhouse has an integrated browser to access the data, and these data can be fed as
conventional data sources to data mining algorithms. Speed is very crucial for the
success of webhouses.
Another variation of the data warehouse is the data mart. A data mart is a thematic
database, that is, a database completely oriented towards a specific subject such as managing
customers. Many data marts can be created with a specific goal like marketing. These
data marts are then slowly interconnected so as to create a data warehouse.
OLAP is used to produce fast, interactive answers for user queries for data warehouses.
A data cube allows such a multidimensional data to be effectively modeled and viewed in
N-dimension. Some of the typical data mining queries are like summarization of data or
other tasks like classification, prediction and so forth.
5. Object-oriented databases
This is an extension of the relational model that provides facilities for complex objects
using object orientation. The entities in this model are called objects; for example, a patient
record in a medical database is an object.
An object includes a set of variables, which are analogous to attributes, a set of messages
it can exchange with other objects for communication, and a set of methods. A method
receives a message and returns a value. Similar objects are grouped into a class, and all the
objects of a class inherit the variables that belong to its super class. Data mining should have
provisions to handle complex object structures, data types, class hierarchies and object-oriented
features like inheritance, methods and procedures.
6. Temporal Databases
Many data mining applications are content to deal with static databases where the
data has no timing relationships, but in some cases records associated with
time stamps greatly enhance the mined knowledge.
Hence temporal databases store time-related data. The attributes may have time
stamps of different semantics; many databases already use fields like date created, date
modified, etc.
Sequential databases store sequences of ordered events, with or without a notion
of time.
Time-series databases store time-related information such as log files. This data
represents sequences of values or events obtained over repeated measurements of time
(say hourly, weekly or yearly). Continuous observation of the sales of a product may yield
time-series data. Data mining also performs trend analysis on these temporal, sequence
and time-series databases.
7. Spatial Databases and Spatiotemporal Databases
Spatial databases contain spatial information in a raster or vector format. Raster
formats are either bitmaps or pixel maps; for example, images can be stored as raster
data. Vector formats, on the other hand, can be used to store maps, because maps use basic
geometric primitives like points, lines and polygons.
Many geographic databases are used by data mining algorithms to predict the
location of telephone cables, pipes, etc. Applications of data mining to spatial data
include clustering, classification, association and outlier analysis.
Spatiotemporal databases use spatial data along with time information.
8. Text Database
Text is one of the most commonly used multimedia data types and a natural
choice of communication among users. Text is present in documents, emails and Internet
chat, and also in gray domains whose content is available only to selected audiences.
Text databases contain word descriptions of objects, typically long sentences or
paragraphs. Text data can be unstructured (like documents in HTML or XML), semi-
structured (like emails) or structured (like library catalogues). Much of the text data is also
stored in compressed form.
Retrieving text is generally considered to be part of information retrieval.
But text mining is proving to be quite popular with the advent of Internet technology, with
applications like text clustering, text retrieval and text classification systems. The use of text
compression to improve the efficiency of text mining systems is also a great challenge.
9. Multimedia databases
Multimedia databases are specialized databases that contain high dimensionality
multimedia data like images, video, and audio. The data may be stored either in a compressed
format or in uncompressed format. Multimedia databases are typically very large databases
and data mining of these data is quite useful in applications like content based retrieval,
visualization.
10. Heterogeneous and Legacy databases
Many business organizations are truly multinational, spanning many continents. Their
database structure is an interconnection of many heterogeneous databases, because each
database can have different types of objects and different message formats; effective
processing is therefore complicated by the underlying diverse semantics.
Many enterprises also have existing infrastructure in which diverse data storage forms like
flat files, relational databases and other forms of data storage are connected by networks.
Such systems are called legacy systems.
11. Data Stream
Data streams are dynamic data that flow in and out of the observed environment.
The typical characteristics of a data stream are huge volumes of data, dynamic behavior,
a fixed order of movement and real-time constraints. Capturing and storing such dynamic
data is a great challenge.
12. World Wide Web
The WWW provides a diverse, worldwide, online information source. The objective of
mining these data is to find interesting patterns of information.
The WWW is a huge collection of documents comprising semi-structured information
(HTML, XML), hyperlink information, access and usage information, and
dynamically changing content of web pages. Web mining refers to the process of mining the
web for knowledge extraction.
1.6 COMPONENTS OF A DATA MINING ALGORITHM
Data mining algorithm consists of the following components
1. Model or Pattern
2. Preference or scoring function
3. Search algorithm
4. Data strategy
All data mining algorithms produce either a model or a pattern. It is better to have a clear
idea of the difference between a model and a pattern.
1. A model is a high-level, global description of the data. A data mining algorithm fits
a model to the given data, and choosing a suitable model is an important step in the data
mining process. The model may be descriptive (it can summarize the data in a
comprehensible manner) or predictive (it can infer new information). This allows
the user to get a sense of the data and to make statements about the nature of the data
or to predict future data/information.
For example, in regression analysis a basic predictive model can be constructed as
a function of the form
Y = aX + b
Here X is the predictor variable, Y is the response variable, and a and b are the
parameters of the model. This equation may be used to predict the expenditure of a
person given his annual income. Such a model may be appropriate because a person's
spending may increase with income; the response variable is then linearly related to the
predictor variable.
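As a minimal sketch of fitting the model Y = aX + b by least squares, assuming NumPy; the income and expenditure figures below are made up for illustration:

import numpy as np

income      = np.array([20, 30, 40, 50, 60], dtype=float)   # X (predictor)
expenditure = np.array([15, 21, 26, 33, 38], dtype=float)   # Y (response)

a, b = np.polyfit(income, expenditure, deg=1)   # estimated slope a and intercept b
print(f"Y = {a:.2f} * X + {b:.2f}")
print(a * 45 + b)    # predict expenditure for a new income value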
In contrast to models, pattern structures make statements about restricted regions
of the space spanned by the variables; in other words, a pattern describes only a small
part of the data.
For example: if Y > y0, then P(X > x1) = p1.
Here the variables are arbitrary; the statement just places constraints on the variables,
connected by a probabilistic rule. Certainly not all the records obey this rule; only the
records that feature these variables are affected by it. Such restricted
statements are characteristic of patterns.
Both model and pattern structures are associated with parameters, and the data
must be analyzed to estimate their values; this is the job of the data mining algorithm. A good
model or pattern has specific, optimized values for its parameters.
The distinction between models and patterns is therefore very important. Sometimes
it is difficult to recognize whether a structure is a model or a pattern, as the boundary
separating them is thin, but careful analysis can help distinguish a model from a pattern.
IBM has identified two types of model or modes of operation, which may be used to
unearth information of interest to the user.
Verification Model
The verification model takes a hypothesis from the user in the form of a query. Then it
tests the validity of the hypothesis against the data. The responsibility of formulating the
hypothesis and query belongs to the user. Here no new information is created. But only the
outputs of the queries are analyzed to verify or negate the hypothesis. The search is refined
till the user detects the hidden information.
Discovery Model
The discovery model discovers important information hidden in the data. The model
sifts the data for frequently occurring patterns, trends and generalizations. Often the user
role is very limited here. The aim of this model is to reveal the larger amount of hidden
information that is present in the dataset.
Some authors classify the models as
1. Descriptive models
The aim of these models is to describe the dataset in terms of groups. These are
called symmetrical, unsupervised or indirect methods.
2. Predictive models
The aim of these methods is to describe one or more variables in relation to all the
other attributes. These are called supervised, asymmetrical or direct methods. The methods
sift the data for hidden rules that can be used for classification or clustering.
3. Logical models
These models identify the particular characteristics related to the subset interests of
the database. The factor that distinguishes this method from others is that these methods
are local in nature. Association is an example of logical model.
4. Preference function (Score function)
The choice of model depends on the given data. Normally some form of goodness-of-fit
function is used to decide whether a model is good; such preference or score functions
specify the preference criterion. Ideally a good score function should indicate the true
expected benefit of the model, but in practice such a score function is difficult to find.
However, without score functions it is difficult to judge the quality of the model.
Some commonly used score functions are likelihood, sum of squared errors, and
misclassification rate.
For example, for the above model, the sum of squared errors

SSE = Σ_{i=1}^{n} (y(i) − ŷ(i))²

can be used as the score function. Here y(i) is the target (expected) value and ŷ(i) is the
predicted value, for 1 <= i <= n. This score function estimates the error based on the
difference between the actual and expected values.
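As a minimal sketch of this score function, assuming NumPy; the target and predicted values below are made up for illustration:

import numpy as np

y_true = np.array([15.0, 21.0, 26.0, 33.0, 38.0])   # target (expected) values y(i)
y_pred = np.array([14.5, 22.0, 27.0, 31.5, 39.0])   # predicted values

sse = np.sum((y_true - y_pred) ** 2)                 # sum of squared errors: smaller is better
print(sse)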
Sometimes it may be difficult to apply a score function to a very large data set, so the
choice of score function is moderated by the practicality of applying it. The score function
must also be robust: if it is very susceptible to changes in the dataset, a small change in the
data can dramatically change the estimate of the model. Hence a robust score function is a
necessity for a good data mining algorithm.
5. Search Algorithm
The goal of the search algorithm is to determine the structure and parameter values that
achieve the maximum or minimum (depending on the context) value of the score function.
These algorithms are normally presented as optimization or estimation problems: the search
algorithm specifies how to find the particular model or pattern, and its parameters, for a
given data set and score function.
The problem of finding interesting patterns is often posed as a combinatorial problem,
and heuristic search algorithms are frequently used. For example, for the above linear
regression problem, the search problem is to minimize the least squares function. A good
data mining algorithm should have a good search algorithm to achieve its primary objectives.
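As a minimal sketch of a search algorithm, a crude grid search over the parameters (a, b) of Y = aX + b that minimizes the sum-of-squared-errors score; this assumes NumPy and reuses the illustrative data from the regression example above:

import numpy as np

X = np.array([20, 30, 40, 50, 60], dtype=float)
Y = np.array([15, 21, 26, 33, 38], dtype=float)

best = (None, None, np.inf)
for a in np.arange(0.0, 1.01, 0.01):          # candidate slopes
    for b in np.arange(0.0, 10.1, 0.1):       # candidate intercepts
        sse = np.sum((Y - (a * X + b)) ** 2)  # score function to minimize
        if sse < best[2]:
            best = (a, b, sse)

print(f"best a = {best[0]:.2f}, b = {best[1]:.2f}, SSE = {best[2]:.2f}")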
6. Data Management Strategy
The data management strategy deals with indexing and accessing the data. Many algorithms
assume that the data is available in primary memory, but for data mining it is inevitable that
massive datasets reside in secondary storage.
Many data mining algorithms scale very poorly when applied to large datasets that
reside in secondary storage. Hence it is necessary to design data mining algorithms with
an explicit specification of the data management strategy.
1.7 STEPS IN DATA MINING PROCESS
It should be remembered that KDD is a complete process and data mining is an
algorithmic component of the KDD process. The KDD process is outlined below.
1. Understanding the application domain
Defining the objectives for data mining is the preliminary step in the KDD process.
The business organization should have a clear objective for implementing the data mining
process, and these objectives should be translated into a set of clear tasks that can be
implemented. A clear-cut problem statement, with no room for doubts or uncertainties,
is required for implementing the data mining process.
2. Extracting the target dataset
First it is necessary to identify the data sources. The ideal source of the data mining
process is a data warehouse. A sort of exploratory analysis can be carried out to identify
the potential data mining tasks that can be performed.
3. Data Preprocessing
This step is essential for improving the quality of the actual data and is required for
getting high quality information. It involves data cleaning, data transformation, data
integration, and data reduction or compression.
a. Data cleaning: This involves basic operations such as normalization, noise
removal, handling of missing data, and reduction of redundancy. Real-world data
is often erroneous, so this step is vital for data mining.
b. Data integration: This involves integrating multiple, heterogeneous datasets
generated from different sources.
c. Data reduction and projection: This includes identifying the attributes necessary
for data mining and reducing the number of attributes using dimension
reduction, feature extraction and discretization methods.
4. Application of data mining Algorithm
In this step the data mining algorithm is applied to the data to perform tasks like
classification, regression, clustering, summarization, or association. Basically, data mining
operations infer a model from the data, and the models can be classified according to the
aim of the analysis.
5. Interpretation of results
The objective is to extract the hidden, meaningful patterns that are of interest and
hence useful for the organization. Usefulness is always determined by metrics, of which
two types can be used: objective and subjective metrics.
Objective metrics use the structure of the pattern and are quantitative in nature.
Subjective metrics take into account the user's rating of the knowledge obtained. This
includes
1. Unexpectedness: This is potentially new information that is previously unknown
to the user. This amounts to some sort of discovery of the fact.
2. Actionability: This is a factor the user can take advantage to fulfill the goal of the
organization.
6. Using the results.
The business knowledge that is extracted should be integrated into the intelligent
systems employed by the organization so that its full potential is exploited. This should be
done gradually, and the process continued until the intelligent system reaches the desired
level of performance.
The process of integration involves four phases
1. Strategic phase: This phase involves (i) identification of the areas where
data mining could give benefits, (ii) definition of the objectives for a pilot data mining
project, and (iii) evaluation of the pilot project using suitable criteria.
2. Training phase: This phase is used to evaluate the data mining activity carefully. If
the results of the pilot project are positive, a prototype data mining application and an
appropriate data mining technique are formulated.
3. Creation phase: If the results of the pilot project are satisfactory, a plan can be
formulated to reorganize the business procedures to include the data mining activity,
and a project can be initiated to carry out this task by allocating additional personnel
and time.
4. Migration phase: This phase includes training the personnel so that the organization
is prepared for the integration of data mining. These steps are repetitive in nature
and are continued until the objectives are met.
1.8 DATA MINING FUNCTIONALITIES
Data mining software analyzes relationships and patterns in stored data. Generally it
identifies four kinds of relationships present in the data. They are as follows:
Classes: Stored data is used to locate data in predetermined groups; this is
called categorization. For example, gender is a category under which the students
of a class can be grouped.
Clusters: Data items are grouped according to logical relationships or similarities
that are present in the data.
Associations: Data can be mined to identify associations that exist between items.
For example, a person who buys item X often also buys item Y.
Sequential patterns: Patterns that are used to anticipate behavior or trends; this
involves, for instance, prediction of consumer purchase patterns.
Some of the model functions are listed below
1. Exploratory Data analysis
2. Classification
3. Regression
4. Clustering
5. Rule generation
6. Sequence analysis
1.8.1 Exploratory Data Analysis
The main idea is to explore the given data without any clear idea of what we are
looking for. These techniques are mostly interactive in nature and produce output in visual
form. Before analysis it is necessary to understand the data. Data are of several types:
numerical and categorical (e.g. tall/short). Categorical data may further be divided
into ordinal data, which have some order (high/middle/low), and nominal data, which are
unordered. Producing summary reports that include averages, standard deviations and
distributions of the data makes the data understandable for the user.
Graphing and visualization tools are vital for EDA. Ultimately all analyses are
performed for users, so producing output that can be easily understood by humans in
graphical form, such as charts and plots, is a necessary requirement of EDA. For datasets
of small dimensionality graphical output is easy, but for multiple dimensions visualization
becomes a difficult task. Hence higher-dimensional data are displayed at lower resolution
while attempting to retain as much information as possible.
The kinds of exploratory analysis that are useful for data mining are:
1. Data characterization, which involves the summarization of data. For example, an
attribute can be summarized; OLAP is a good example of data characterization. The output
of data characterization is generally in the form of graphs like pie charts, bar
charts, curves or OLAP cubes.
2. Data discrimination, which involves the comparison of the target variable with the
other attributes of the object.
Both data characterization and discrimination can be combined to get a better view
of the data before data mining process.
1.8.2 Classification
A classification model is used to classify test instances. The input to the classification
model is typically a dataset of training records with different attributes. The attributes x1,
x2, ..., xn are the input attributes, called predictor or independent variables. The attributes
being predicted, the class labels, are called response, dependent or target variables.
The classification model predicts the class to which the input records belong, and the
resulting model is then used to classify new or test instances.
Some examples of classification tasks are:
1. Classification of a dataset of disease symptoms by disease
2. Classification of customer signatures as valid or invalid
3. Classification of customers' financial and general background for loan approval.
The classification models can be categorized based on the implementation technology.
1. Decision trees: The classification model can generate decision trees as output.
The decision trees can be converted into classification rules, which can be
incorporated into an intelligent system or used to augment expert systems.
2. Probabilistic methods: These include models, which use statistical properties like
Bayes Theorem.
3. Nearest neighbor classifiers which use distance measures.
4. Regression methods, which can be linear or polynomial.
5. Soft computing approaches: These soft computing approaches include neural
networks, genetic algorithms and rough set theory.
Regression is a special type of classification that uses existing values to forecast new
values. Most problems are not linear projections of previous values; for example,
share market data fluctuates, and most of the values that need to be predicted
are numerical. Hence more complicated techniques are required to forecast future values.
1.8.3 Association Analysis
A good example of an association function is market basket analysis. All transactions
are taken as input, and the output shows the associations that exist between items in the
input data.
Association rule mining algorithms produce results of the form X -> Y, where X is
called the antecedent and Y the consequent.
The frequency with which a particular association appears in the database is called
its support or prevalence.
Confidence measures how often the consequent Y occurs given that the antecedent X
has occurred; it is the ratio of the frequency of X and Y occurring together to the frequency
of X.
Lift is also a measure of the power of an association. It is calculated as the confidence
of X -> Y divided by the frequency (support) of Y. The higher the lift, the greater the influence
that the occurrence of X has on the likelihood that Y will occur.
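As a minimal sketch of these measures for a rule X -> Y; the transactions and item names below are made up for illustration:

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
n = len(transactions)
X, Y = {"bread"}, {"milk"}

support_X  = sum(1 for t in transactions if X <= t) / n
support_Y  = sum(1 for t in transactions if Y <= t) / n
support_XY = sum(1 for t in transactions if (X | Y) <= t) / n

confidence = support_XY / support_X    # how often Y occurs given X
lift       = confidence / support_Y    # influence of X on the likelihood of Y

print(f"support = {support_XY:.2f}, confidence = {confidence:.2f}, lift = {lift:.2f}")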
Graph tools are used to visualize the structure of the links; thicker lines can indicate
stronger associations, and such linkage diagrams support an analysis that is termed
link analysis.
Sequential/temporal pattern functions analyze a collection of records over a period
of time to identify trends. These models take into account the distinct properties of time
and periods (like weeks or calendar years). Business transactions are frequently analyzed
in this way over a collection of related records of the same structure.
The records are related by the identity of the customer who made the repeated purchases.
Such analysis can help business organizations to target specific groups.
1.8.4 Clustering/Segmentation
Clustering and segmentation are processes of creating partitions such that all the data
objects in a partition are similar in some respect and differ significantly from the data objects
in the other partitions. This is also called unsupervised classification, because there
are no predefined classes.
Some of the examples of clustering process are
1. Segmentation of a region of Interest in an image
2. Detection of abnormal growth in a medical image
3. Determining clusters of signatures in a gene database.
The quality of a clustering algorithm depends on the similarity measures it uses and on
its implementation. A good segmentation algorithm should generate clusters with high
intra-class similarity and low inter-class similarity, and the clustering process should detect
the hidden patterns present in the dataset.
Clustering algorithms can be categorized as
1. Partitional: These algorithms create an initial partition of the data and then optimize
an objective function using an iterative control strategy (a k-means sketch follows this list).
2. Hierarchical: These algorithms produce hierarchical relationships that exist in the
dataset.
3. Density based: The algorithms use connectivity and density functions to produce
clusters.
4. Grid-based: The algorithms produce a multiple level granular structure by quantizing
the feature space in terms of finite cells.
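As an illustration of the partitional category, the following is a minimal k-means sketch. It assumes Python with NumPy, and the two-dimensional points and the choice of k = 2 are purely hypothetical.

    # Minimal k-means clustering sketch (hypothetical 2-D data)
    import numpy as np

    X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [8.0, 8.0], [8.2, 7.9], [7.8, 8.3]])
    k = 2
    centroids = X[np.random.choice(len(X), k, replace=False)]   # initial cluster centres

    for _ in range(10):                                          # iterative control strategy
        # assign every point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

    print(labels)      # cluster membership of each point
    print(centroids)   # final cluster centres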
1.8.5 Outlier Analysis
Outliers express deviation from normal behavior or expectations. These deviations
may also be due to noise. Hence data mining algorithms should carefully determine whether
the deviations are real or merely due to noise. Detection of truly unusual behavior in a given
application context is called outlier analysis.
The applications where outliers prove to be very useful are forecasting, fraud detection
and customer abnormal behavior detection.
1.8.6 Evolution Analysis
Evolution analysis models and describes the trends of an object over a period of
time. The distinct features of this analysis include time-series data analysis, sequence or
periodicity pattern matching, and similarity-based analysis.
1.9 DATA MINING PROCESS
The data mining process involves four steps.
1. Assembling the data from various sources
2. Present the data to data mining algorithm
3. Analyze the output and interpret the results
4. Apply the results to new situations.
The emerging process model for data mining solutions for business organizations
is CRISP-DM. This model stands for Cross-Industry Standard Process for Data Mining.
This process involves six steps
The steps are listed below
1. Understanding the business
This step involves understanding the objectives and requirements of the business
organization. Generally, a single data mining algorithm is enough for providing the solution.
This step also involves the formulation of the problem statement for the data mining process.
2. Understanding the data
This step involves the steps like
Data collection
Study of the characteristics of the data
Formulation of hypothesis
Matching of patterns to the selected hypothesis
3. Preparation of data
This step involves producing the final data set by cleaning the raw data and
preparing the data for the data mining process.
4. Modeling
This step involves the application of a data mining algorithm to the data to obtain a
model or pattern.
5. Evaluate
This step involves the evaluation of the data mining results using statistical analysis and
visualization methods.
6. Deployment
This step involves the deployment of the results of the data mining algorithm to improve
the existing process or for a new situation.
1.10 ARCHITECTURE OF A TYPICAL DATA MINING SYSTEM
The architecture of a typical data mining system implements the data mining process.
It has necessary components to implement the data mining process. The typical components
of a data mining system are listed below
1. Database: A typical system may have one or more databases. The data present in
the database is cleaned and integrated.
2. Database server: The database server is responsible for fetching the data as per
the user's query.
3. Knowledgebase: This is the domain knowledge that is used to guide the search or
evaluate the outcomes of the query based on the criterion of interestingness.
4. Data mining engine: This represents the heart of the system. This contains the
necessary functional modules for classification, prediction, association or deviational
analysis.
5. Pattern evaluation module: This module uses the necessary metrics to indicate the
quality of the results. The system may use measures like Interestingness to indicate
the quality of the results.
6. GUI: This provides the necessary interface between the user and the data mining
system. This allows the user to interact with the system, to provide the user
query and to perform the necessary mining tasks. The interface also converts the
results using the visualization methods for the user for interpretation.
One of the major decisions that an organization needs to take is the kind of
interaction the data mining system has with the existing systems. If no interconnection
exists between the data mining system and the database or data warehouse, it is called no
coupling; otherwise, based on the degree of connection, the integration can be classified as
loose coupling, semitight coupling or tight coupling. Let us review these schemes briefly.
No coupling
No coupling means there is no interconnection between the data mining system and
the existing systems. The data mining system can fetch data from any data source for data
mining tasks and produce the results in the same data storage system. No coupling is a
poor design strategy. Some of the major disadvantages of this kind of arrangement are
The data mining algorithm needs to spend a lot of time extracting the data from
multiple sources, cleaning it, transforming it and integrating it so that the results are reliable.
This places a big responsibility on the data mining algorithms.
Scalable and flexible algorithms and structures of the database and data warehouse
are not utilized. This leads to the degradation of the performance of the algorithms.
Loose coupling
This design facilitates the data mining system to use some facilities of the existing
database or data warehouse. Most of the responsibility of supplying good quality data is
undertaken by the existing database or data warehouse thus making data mining algorithms
concentrate on other aspects like scalability, flexibility and efficiency. One of the major
problems associated with loose coupling is that such systems are memory based; handling
larger data sets therefore makes it difficult for them to achieve high scalability and good
performance.
Semitight coupling
For these systems, there is an interconnection between the data mining system and the
existing database or data warehouse. In addition, a few data mining primitives are provided
by the database or data warehouse. These primitives provide some basic functionality like
sorting, aggregation etc. Moreover, some of the frequently used intermediate results of the
data mining system are stored in the database or data warehouse for reuse. This strategy
enhances the performance of the system.
Tight coupling
Here the data mining system and the existing database or data warehouse are completely
integrated. The database and data warehouse optimize the data mining queries using existing
technology. This approach is highly desirable, as an integrated environment provides superior
performance. But not all organizations possess a completely integrated environment, and a
lot of effort is required to design such an environment and the data mining primitives.
Task primitives and queries
A data mining task is some form of data analysis; for example, classification is a data
mining task. A data mining query is defined in terms of data mining task primitives. The user
communicates with the data mining system through such a query. The data mining system
executes the query and produces the result in a user-understandable form, generally as graphics.
The data mining task primitives are the following.
1. Set of task relevant data
This specifies the portions of database or the data in which the user is interested.
2. Kinds of knowledge
This specifies the kind of data mining functions
3. Background knowledge
This is knowledge about the domain. This is used to guide the data mining process
effectively. Concept hierarchies are popular way of representing the background knowledge.
The user can also specify the beliefs in the form of constraints and thresholds.
4. Interesting measures and Thresholds
These measures are used to guide and evaluate the results of data mining. For example
the association results can be evaluated using Interestingness criteria like support and
confidence.
5. Representation
The user can specify the form in which the results of the mining can be represented.
These may include rules, tables, charts or any form that user wishes to see mined patterns.
Data mining queries can incorporate these primitives. Designing a complete data mining
language is quite a challenging task, as every data mining task has its own requirements.
Some efforts have been made in this respect. For example, Microsoft OLE DB for Data
Mining includes DMX (Data Mining Extensions), an SQL-like data mining query language.
Related standards include PMML (Predictive Model Markup Language), an XML-based
standard for representing mined models, and the CRISP-DM (Cross-Industry Standard
Process for Data Mining) process standard.
1.11 CLASSIFICATION OF DATA MINING SYSTEM
The data mining systems can be categorized under various criteria
1. Classification according to the kinds of databases used. The databases themselves are
classified according to data models such as relational, transactional, object-oriented,
or data warehousing systems.
2. Classification according to the special kinds of data handled. Based on this the
data mining system can be categorized as spatial, time-series, text or multimedia
or Web data mining system. There can be also other types like heterogeneous
data mining systems or legacy data mining systems.
3. Classification according to the kinds of knowledge mined: Based on this the data
mining system can be classified based on the functionalities like classification,
prediction, clustering, trend, and deviation analysis. A system can provide a single
or more functionality. The knowledge from such systems can also vary based on
the granularity or levels of abstraction like the general knowledge, primitive level
knowledge or knowledge at multiple levels.
4. The systems can be classified based on the level of user interactions such as
autonomous systems, interactive exploratory systems or query driven systems.
Summary
- Data mining is an automated way of converting the data into knowledge
- Data is about facts. Information is processed data in the form of patterns,
associations or relationships among data. Knowledge is processed information as
historical patterns and trends. Intelligence is application of knowledge.
- Learning, like adaptation, occurs as the result of the interaction of a program with
its environment.
- Data storage plays an important role for designing the data mining algorithms. Typical
data storage systems include flat files, databases, object relational databases,
Transactional databases, Data warehouses, Multimedia databases and spatial
databases.
- Typical data mining has components models/patterns, score or preference function
and search mechanism.
- Models are of two types: the verification model and the discovery model.
- KDD process includes steps like understanding of domains, Extraction of dataset,
data preprocessing, and application of data mining model, Interpretation and
application of results.
- Data mining tasks include exploratory data analysis, classification, prediction, and
association and deviation analysis.
- A data mining system architecture should include a data repository, knowledgebase,
data mining engine, pattern evaluation module, and a graphical user interface.
- Data mining can be classified based on various criteria like based on databases
used, kinds of data handled, kinds of knowledge mined and levels of user
interactions.
DID YOU KNOW?
1. What is the difference between data and Information?
2. What is the difference between knowledge and Intelligence?
3. What is the difference between KDD and data mining?
4. What is the difference between model and pattern?
Short questions
1. What does data mining mean?
2. Distinguish between the terms: Data, Information, Knowledge and
Intelligence.
3. What is the relationship between KDD and data mining?
4. What is the difference between operational and Strategic database?
5. What is the difference between OLAP and OLTP?
6. What is the difference between Statistics and Data mining?
7. What is the role of data storage systems for data mining?
8. What is the specialty of a Data warehouse?
9. What are the types of data mining models?
10. What is the difference between a model and pattern?
11. What are the functionalities of data mining?
12. What is the difference between classification and prediction?
13. What are the components of a data mining architecture?
14. How data mining systems are classified?
Long Questions
1. Explain in detail the KDD process.
2. Explain in detail the functionalities of data mining model with examples.
3. Explain in detail the components of the data mining architecture.
4. Explain the ways in which the data mining system can be integrated in modern
business environment.
5. Write short notes on
- Text mining
- Multimedia mining
- Web mining
UNIT II
DATA PREPROCESSING AND
ASSOCIATION RULES
INTRODUCTION
The raw data are often incomplete, invalid and inaccurate. Such erroneous data is
also referred to as dirty data. It causes problems for many organizations because the information
or knowledge obtained from dirty data will be unreliable. Researchers have found that in
roughly one-third of cases dirty data causes delays or even the scrapping of existing projects,
incurring huge losses to the organization. There is a saying, GIGO: if garbage goes in, then the
output is also garbage. So bad data has to be preprocessed so that the quality of the
results can be improved. This chapter presents the concepts related to data preprocessing
and the methodologies for performing association analysis.
Learning Objectives
- To understand the characteristics of the data
- To explore the methodologies for performing data cleaning, data integration and
data transformation
- To provide an overview of exploratory data analysis.
- To explore associative rule mining for performing market basket analysis.
2.1 UNDERSTANDING THE DATASET
Exploratory Data Analysis (EDA) usually does not start with any a priori notions of the
expected relationships among the variables. This is true especially because data mining
algorithms involve large data sets. Hence the main goals of EDA are to
- Understand the data set
- Examine the relationships that exists among attributes
- Identify the target set
- Have some initial ideas of data set.
The initial requirement of EDA is data understanding and data preparation.
Data Collection
A data set can be assumed to be a collection of data objects. The data objects may be
records, points, vectors, patterns, events, cases, samples or observations. These records
contain many attributes. Attribute can be defined as the property or characteristics of an
object.
For example, consider the following database shown in sample Table 2.1.
Table 2.1: Sample Table
Every attribute should be associated with a value. This process is called the measurement
process. The measurement process associates every attribute with a value. The type of an
attribute determines the kinds of values the attribute can take. This is often referred to as the
measurement scale type.
Attribute Types
The attributes can be classified into two types
- Categorical or qualitative data
- Numerical or quantitative data
The categorical data can be divided into two types. They are nominal type and ordinal
type.
In the above table, patient ID is categorical data. Categorical data are symbols and
cannot be processed like numbers. For example, the average of the patient IDs does
not make any statistical sense.
Nominal data provides only information about identity and contains no order. Only operations
like equal to and not equal to are meaningful for these data. For example, patient IDs
can be checked for equality and nothing else.
Ordinal data provides information about order as well. For example, Fever = {Low,
Medium, High} is ordinal data. Certainly Low is less than Medium and Medium is
less than High, irrespective of the underlying values. Any order-preserving transformation
can be applied to these data to get new values.
Numeric or quantitative data can be divided into two categories. They are interval
type and ratio type.
Interval data is numeric data for which the differences between values are meaningful.
For example, there is a meaningful difference between 30 degrees and 40 degrees. The only
permissible operations are + and -.
For ratio attributes, both differences and ratios are meaningful. The difference between
ratio and interval data is the position of zero in the scale. For example, take the centigrade-
Fahrenheit conversion: zero centigrade is not equal to zero Fahrenheit, so the zeroes of the
two scales do not match. Hence these temperature scales are interval data.
Data types can also be classified as discrete and continuous. Discrete
variables have a finite set of values; the values can be either categorical or numeric. Binary
attributes are special attributes that have only two values, true or false. Binary attributes
where only the non-zero value plays an important role are called asymmetric binary attributes.
A continuous attribute is one whose values are real numbers.
The characteristics of the larger datasets
Data mining involve very large datasets. The general characteristics of the larger
datasets are listed below.
1. Dimensionality
The number of attributes that the data objects possess is called the data dimensionality.
Data objects with a small number of dimensions may not cause many problems, but
higher-dimensional data poses many problems.
2. Sparseness
Sometimes only a few elements of the data are non-zero. In such cases it is sufficient
to store only these non-zero elements and to process only them in data mining algorithms.
3. Resolution
Sometimes, in the case of image and video databases, resolution also plays an
important role. A coarse resolution eliminates certain information. On the other hand, if the
resolution is too high, the algorithms become more computationally intensive and the data may
be buried in noise.
2.2 DATA QUALITY
What are the requirements of the good quality data?
While it is understood that good quality data yields good quality results, it is often very
difficult to pinpoint what constitutes the good quality data. Some of the properties that are
considered desirable are
1. Timeliness
There should be no decay of the data. After a period of time, the business organization's
data may become stale and obsolete.
2. Relevancy
The data should be relevant and ready for the data mining algorithm. All the necessary
information should be available and there should be no bias in the data.
3. Knowledge about the data
The data should be understandable, Interpretable and should be self sufficient for the
required application as desired by the domain knowledge engineer.
Data quality issues involve both measurement errors and data collection problems.
The detection and correction of such problems is called data cleaning. Hence the data must have
the necessary data quality.
Measurement errors refer to the problem of
- Noise
- Artifacts
- Bias
- Precision / Reliability
- Accuracy.
Data collection problems involve
- Outliers
- Missing values
- Inconsistent values
- Duplicate data
Outliers are data that exhibit the characteristics that are different from other data and
whose values are quite unusual.
It is often desirable to distinguish between noise and outlier data. Outliers may be
legitimate data and some times are of interest to the data mining algorithms.
Measurement Error
Some of the attribute values differ from the true value. The numerical difference between
the measured and true value is called error. This error results from the improper measuring
process. On the other hand, errors can arise from omission, duplication of attributes also.
This is known as data collection error.
Errors arise due to various reasons like noise and artifacts.
Noise is a random component and involves the distortion of a value or the introduction
of spurious objects. The term noise is often used when the data has a spatial or temporal component.
Certain deterministic distortions, such as those in the form of a streak, are known as artifacts.
The data quality of numeric attributes is determined by the following factors
- Precision
- Bias
- Accuracy
Precision is defined as the closeness of the repeated measurements. Often standard
deviation is used to measure the precision.
Bias is measured as the difference between the mean of the set of values and the true
value of the quality.
Accuracy refers to the closeness of measurements to the true value of the quantity.
Normally the number of significant digits used to store and manipulate a value indicates the
accuracy of the measurement.
Data collection Problems
These problems are introduced at the data-capturing stage of the data warehouse.
Sometimes the data warehouse may already contain bad data, and the errors become
cumulative because of this existing bad data. Dirty data can also enter at the data-capturing
stage through inconsistencies that arise while combining and integrating data from
different streams.
Types of bad data
The bad or dirty data can be of the following types
- Incomplete data
- Inaccurate data
The real world data may contain errors or outliers. Outliers are values that are different
from the normal or expected data. These may be due to faults in the data-capturing instrument,
errors in transmission or technology problems.
- Inconsistent data: These data are due to the problems in conversions, inconsistent
formats, difference in units.
- Invalid data (Includes illegal naming convention/ Inconsistent field size)
- Contradictory data
- Dummy data
Some of these errors are due to human errors like typographical errors or may be due
to measurement process and structural mistakes like improper data formats.
Stages of Data management
It is also often difficult to track down these dirty data, evaluate them and to remove
them. Hence a suitable data management policy should be adopted to avoid the data
collection errors.
The stages of the data management consist of the following steps. The steps are as
shown below
- Evaluation of quality of the data
- Establish procedures to prevent dirty data
- Fixing bad quality data at the operational level
- Training people to manage the data
- Focusing the critical data
- Establishing the business rule in the organizations for handling dirty data
- Developing a standard for meta data
This is an iterative process and this process is carried out on a permanent basis to
ensure that data is suitable for data mining. Hence data preprocessing routines should be
applied to clean the data so that they will make data mining algorithms to produce correct
and reliable results. Data preprocessing improve the quality of the data mining techniques.
2.3 DATA PREPROCESSING
To remove the measurement and data collection errors, the raw data must be
preprocessed to give accurate results. Some of the important techniques of data
preprocessing are listed below.
Data Cleaning
These are routines that remove the noise and solve inconsistency problems
Data Integration
These routines merge data from multiple sources into a single data source.
Data Transformation
These routines perform operations like normalization to improve the performance of
the data mining algorithm by improving the accuracy of the results and efficiency of the
results.
Data reduction
These routines reduce the data size by removing the redundant data and features.
Data cleaning routines clean up the data by filling in missing data, removing noise
(by smoothing), removing outliers and resolving inconsistencies associated with the data.
This helps data mining algorithms to avoid overfitting of the models.
Data Integration routines integrate data from multiple sources of data. One of the
major problems of data integration is that many data sources may have same data under
different headings. So this leads to redundant data. The main goal of data integration is to
detect and remove redundancies that arise from integration.
Data transformation routines normalize the attribute values to a range, say [-1.0, 1.0].
This is required by some of the data mining algorithms to produce better results. Neural
networks have built-in mechanisms for normalizing the data. For other algorithms
normalization should be carried out separately.
Data reduction reduces the data size but still produces the same results. There are
different ways in which the data reduction can be carried out. Some of them are listed
below
- Data aggregation
- Attribute feature selection
- Dimensionality reduction
- Numerosity reduction
- Generalization
- Data discretization
2.4 DATA SUMMARIZATION
Descriptive statistics is the branch of statistics that deals with summarizing and describing
a dataset. Any other numbers we choose to use in data mining algorithms must also be
summarized and described. Descriptive statistics are just descriptive and do not go beyond
that; in other words, descriptive statistics does not bother too much about generalizing
from the data.
These techniques help us to understand the nature of the data, which in turn helps us to
determine the kinds of data mining tasks that can be applied to the data. The details that help
us to understand the data are measures of central tendency like mean, median, mode and midrange.
Data dispersion measures like quartiles, interquartile range and variance help to
understand the data better.
Data mining processes typically involve larger data sets, and hence an enormous amount
of time is needed to compute and interpret the results. The need of the hour is therefore that
data mining algorithms should be scalable. By scalable, we mean that a data mining algorithm
should be able to partition the data, perform the operations on the partitions and combine
the results of the partitions. This allows mining algorithms to work on larger data sets and
also to compute the results in a faster manner.
Measures of the Central tendencies
The measures of the central tendency include mean, median, mode and midrange.
Measures of the data dispersion include quartiles, Interquartile range and variance.
The three most commonly-used measures of central tendency are the following.
The mean is the average of all the values in the sample (population) and is denoted as mean(X).
Sometimes each data value is associated with a weight wi, for i ranging from 1 to N; this
gives a weighted mean. The problem with the mean is its extreme sensitivity to noise: even
small changes in the input can affect the mean drastically. Hence, for larger data sets, the top
few percent (say 2%) of extreme values are often chopped off before the mean is calculated
(a trimmed mean).
Median is the value where given Xi is divided into two equal halves, with half of the
values being lower than the median and half higher than the median.
The procedure for obtaining the median is to sort the values of the given Xi into
ascending order. If the sequence has an odd number of values, the median is the middle
value; otherwise the median is the arithmetic mean of the two middle values.
Mode is the value that occurs more frequently in the dataset. The procedure for
finding the mode is to calculate the frequencies for all of the values in the data and the mode
is the value (or values) with the highest frequency. Normally, based on the number of modes,
a dataset is classified as unimodal, bimodal or trimodal. Any dataset that has more than
two modes is called multimodal.
Example Find the mean, median and mode for the following data (refer Table 2.2).
The patient age of a set of records is = {5,5,10,10,5,10,10,20,15}.
Table 2.2 : Sample Data
The mean of the patient ages in this table is 90/9 = 10, the median is 10 as it falls in the
middle of the sorted list, and the mode is 10 as it is the most frequent item in the data set.
It is sometimes convenient to subdivide the data set using coordinates. Percentiles indicate
the value below which a given percentage of the data falls. For example, the median is the
50th percentile and can be denoted as Q0.50. The 25th percentile is called the first quartile
and the 75th percentile is called the third quartile. Another measure that is useful for measuring
dispersion is the interquartile range (IQR), defined as Q0.75 - Q0.25.
For example, for the previous patient age list sorted as {5, 5, 5, 10, 10, 10, 10, 15, 20},
the median is in the fifth position, so the median is 10. The first quartile is the median of the
values below the median, that is, of {5, 5, 5, 10}; this is the average of the second and third
values, so Q0.25 = 5. Similarly, the third quartile is the median of the values above the median,
that is, of {10, 10, 15, 20}, which is the average of its two middle values, so Q0.75 = 12.5.
Hence the IQR = Q0.75 - Q0.25 = 12.5 - 5 = 7.5.
Semi-quartile range = 0.5 * IQR
= 0.5 * 7.5 = 3.75.
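A minimal sketch of these computations for the patient age list, assuming Python (standard library only), is given below; the quartiles are computed with the median-of-halves method used in the example above.

    # Central tendency and dispersion for the patient ages
    from statistics import mean, median, mode

    ages = [5, 5, 10, 10, 5, 10, 10, 20, 15]
    data = sorted(ages)                           # 5, 5, 5, 10, 10, 10, 10, 15, 20

    print(mean(data), median(data), mode(data))   # 10, 10, 10

    half = len(data) // 2
    lower, upper = data[:half], data[-half:]      # halves excluding the middle value
    q1, q3 = median(lower), median(upper)         # 5 and 12.5
    iqr = q3 - q1                                 # 7.5
    print(q1, q3, iqr, 0.5 * iqr)                 # semi-quartile range = 3.75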
For unimodal frequency curves that are moderately skewed, we have the empirical relation
Mean - Mode = 3 * (Mean - Median)
The mid-range is also used to assess the central tendency of the
dataset. In a normal distribution, the mean, median, and mode are same. In Symmetrical
distributions, it is possible for the mean and median to be the same even though there may
be several modes. By contrast, in asymmetrical distributions the mean and median are not
the same. These distributions are said to be skewed data where more than half the cases
are either above or below the mean. Often skewed data cause problems for data mining
algorithms.
Standard deviation and Variance
By far the most commonly used measures of dispersion are the variance and the
standard deviation.
The mean does not convey much more than a middle point. For example, the following
data sets {10,20,30} and {10,50,0} both have a mean of 20. The difference between
these two sets is the spread of data.
Standard deviation is the average distance from the mean of the data set to each
point. The formula for the standard deviation is
Standard deviation = sqrt( Sum of (Xi - mean(X))^2 / N ), the sum running over i = 1 to N.
Sometimes the sum is divided by N - 1 instead of N. The reason is that for real-world
samples the division by N - 1 gives an answer closer to the actual value.
For example, for the above sets, N = 3. For data set 1, {10, 20, 30}, the squared deviations
from the mean are 100, 0 and 100, giving a standard deviation of about 8.2. For data set 2,
{10, 50, 0}, the squared deviations are 100, 900 and 400, giving a standard deviation of
about 21.6.
The set with larger deviation is dataset 2 because the data is more spread out.
Variance
Variance is another measure of the spread of the data. It is the square of standard
deviation. While standard deviation is more common measure, variance also indicates the
spread of the data effectively.
Co-variance
The above two measures standard deviation and variance are one dimensional. But
data mining algorithms involve data of higher dimensions. So it becomes necessary to
analyze the relationship between two dimensions.
Co-variance is used to measure variance between two dimensions. The formula for
finding co-variance is
Cov(X, Y) = Sum of (Xi - mean(X)) * (Yi - mean(Y)) / (N - 1), the sum running over i = 1 to N.
The covariance indicates the relationship between dimensions using its sign. The sign
is more important than the actual value.
1. If the value is positive, it indicates that the dimensions increase together
2. If the value is negative, it indicates that while one dimension increases, the other
dimension decreases.
3. If the value is zero, then it indicates that both the dimensions are independent of
each other.
If the dimensions are correlated, then it is better to remove one dimension as it is a
redundant dimension. Also the covariance (X,Y) is same as the covariance (Y,X).
Covariance matrix
Data mining algorithms handle data of multiple dimensions. For example, if the data is
3-dimensional, then the covariance calculations involve cov(x,y), cov(x,z) and cov(y,z). In fact,
n! / ((n - 2)! * 2!) = n(n - 1)/2 different covariance values need to be calculated for n
dimensions.
It is better to arrange the values in the form of a matrix, the covariance matrix. For three
dimensions x, y and z it is
C = | cov(x,x)  cov(x,y)  cov(x,z) |
    | cov(y,x)  cov(y,y)  cov(y,z) |
    | cov(z,x)  cov(z,y)  cov(z,z) |
We can also note that the matrix is symmetrical about the main diagonal, since cov(X,Y) is
the same as cov(Y,X), and the diagonal entries are the variances of the individual dimensions.
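A minimal sketch of computing such a covariance matrix, assuming Python with NumPy and three hypothetical numeric attributes, is shown below.

    # Covariance matrix of three attributes (hypothetical data)
    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 8.0])
    y = np.array([1.0, 3.0, 5.0, 9.0])
    z = np.array([9.0, 7.0, 4.0, 2.0])

    data = np.vstack([x, y, z])        # one row per dimension
    cov_matrix = np.cov(data)          # uses the N - 1 denominator by default

    print(cov_matrix)                  # 3 x 3, symmetric about the main diagonal
    print(np.diag(cov_matrix))         # diagonal entries are the variances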
Shapes of Distributions
Skew
The given dataset may have an even distribution of data, or it may contain some extremely
high or extremely low values. If the dataset has a tail of unusually high values, then the
dataset is said to be skewed to the right. On the other hand, if the dataset has a tail of
unusually low values, then the dataset is said to be skewed to the left.
The implication of this is that if the data is skewed, then there is a greater chance of
outliers in the dataset. This affects the mean and median. Hence this may affect the
performance of the data mining algorithm.
Generally, for a highly skewed distribution, the mean differs from the median considerably.
The relationship between skew and the relative size of the mean and median can be
summarized by a convenient numerical skew index:
Skew index = 3 * (Mean - Median) / Standard deviation
Also, the following measure is more commonly used to measure skewness:
Skewness = Sum of (Xi - mean(X))^3 / (N * sigma^3)
Kurtosis
Kurtosis measures how fat or thin the tails of a distribution are. It is measured using the formula
Kurtosis = Sum of (Xi - mean(X))^4 / (N * sigma^4) - 3
Distributions with long tails are called leptokurtic and distributions with short tails are
called platykurtic. Normal distributions have zero kurtosis by this measure.
The most common measures of the dispersion of data are
- Range
- 5 Number summary
- Interquartile range
- Standard deviation
Range is the difference between the maximum and the minimum of the values.
The kth percentile is the value Xi with the property that k% of the data lie at or below Xi.
The median is the 50th percentile. Q1, the first quartile, is the 25th percentile, and Q3, the
third quartile, is the 75th percentile. The IQR is the difference between Q3 and Q1:
Interquartile range (IQR) = Q3 - Q1
Outliers are normally taken to be values falling at least 1.5 * IQR above the third quartile
or below the first quartile.
The median, the quartiles Q1 and Q3, and the minimum and maximum, written in the order
<Minimum, Q1, Median, Q3, Maximum>, are known as the five-number summary.
Box plots are a popular way of plotting the five-number summary.
2.5 VISUALIZATION METHODS
Visualization is an important aspect of data mining. Visualizing the data summaries as
a graph helps the user to recognize and interpret the results quickly. Showing the important
information required by the data mining community graphically reduces the time required
for interpretation.
Stem and Leaf Plot
A stem and leaf plot is a display that helps us to know the shape and distribution of
the data. In this method, each value is split into a stem and a leaf. The last digit is
usually the leaf, and the digits to the left of the leaf form the stem. The stem and leaf plot for
the patient age data of the earlier example is shown below (refer Figure 2.1).
Stems/leaves
2|0
1|5
1|0000
0|555
Figure 2.1 Stem and Leaf Plot.
Box Plot
Box plot is also known as Box and whisker plot. It summarizes the following statistical
measures
- Median
- Upper and lower quartiles
- Maximum and Minimum data values
The box contains the bulk of the data; these data lie between the first and third quartiles.
The line inside the box indicates the location, usually the median, of the data. If the median
is not equidistant from the ends of the box, then the data is skewed. The whiskers projecting
from the ends of the box indicate the spread of the tails and the maximum and minimum
data values. The plot
of a Lung cancer data is given to illustrate the concepts. This data set is used for other types
of graphs also. This data is available as open data set and the data set was published in
along with the paper (Hong, Z.Q. and Yang, J.Y. Optimal Discriminant Plane for a Small
Number of Samples and Design Method of Classifier on the Plane, Pattern Recognition,
Vol. 24, No. 4, pp. 317-324, 1991). This data set has 1 class attribute and 56 predictive
attributes.
A sample box plot of lung cancer data is shown in Figure 2.2.
Figure 2.2 Box plot of the attributes of Lung cancer data
The data outside the whiskers are called outliers. Sometimes a diamond inside the box
also shows a confidence interval. Thus the advantages of the box plot are the data summary
it provides and its ability to show skewness and outliers in the data. Box and whisker plots
can also be used to compare data sets. The negative side of this plot is its overemphasis
on the tails of the distribution.
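A minimal box plot sketch, assuming Python with Matplotlib and NumPy and using randomly generated (hypothetical) right-skewed data, is shown below.

    # Minimal box plot sketch (hypothetical data)
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    data = rng.exponential(scale=10.0, size=200)   # right-skewed sample

    plt.boxplot(data)                  # box = Q1..Q3, line = median, whiskers and outliers
    plt.ylabel('value')
    plt.title('Box plot of a skewed sample')
    plt.show()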
Histogram
The histogram is a plot of classes and its frequency. The group of data is called
classes and its frequency count is the number of times the data appear in the dataset (refer
Figure 2.3). Histogram conveys useful information about the general shape of the distribution
and symmetry of the distribution. It can also convey information about the nature of data
(like unimodal, bimodal or multimodal).
Perhaps the biggest advantage of histogram is its ability to detect the skewness present
in the data. It can also provide clue to the process and measurement problems associated
with the data set.
The shape of the histogram is also dependent on the number of bins. If the bin is too
wide, then the important details may get omitted. On the other hand if the bins are too
narrow then the spurious data may appear to be genuine information. Hence the user must
try different number of bins to get appropriate bin width. In general, they range from 5 to
20 groups of data.
Histograms and box plots play an important role in the data mining process by showing
the shape of the distribution (as indicated by the histogram) and other statistical properties
(as indicated by the box plot).
Figure 2.3 Sample histogram of variable V1 Vs the
count for Lung cancer data set.
Scatter plot
Scatter plot is a plot of explanatory variable and response variable. It is a 2D graph
showing the relationship between two variables. The scatter plot (Refer Figure 2.4) indicates
- Strength
- Shape
- Direction
- Outliers presence
It is useful in exploratory data analysis before actually calculating a correlation coefficient
or fitting a regression curve.
Figure 2.4: Scatter plot of variable V2 Vs V1 of Lung cancer data
Figure 2.5 Sample Matrix Plot
Normal Quantile Plots
A normal quantile plot is a 2-D scatter plot of the percentiles of the data versus the percentiles
of a normal population. It shows whether the data come from a normal distribution or not. This
is considered important because many data mining algorithms assume that data follows
normal distributions. Hence the plot can be used to verify whether the data follows normal
distribution.
Quantile-Quantile Plot
A Q-Q plot is a 2-D scatter plot of the quantiles of a first dataset against the quantiles of a
second dataset. The two datasets need not be of the same size, which is a great advantage of
the Q-Q plot.
These plots check whether the datasets can be fit with the same distribution. If two datasets
really come from the same distribution, then the points fall along the 45-degree reference line.
If the deviation is larger, then there is greater evidence that the datasets follow different
distribution. (Refer Figure 2.6 and Figure 2.7 which shows a Q-Q plot where deviations
are more)
Figure 2.6 Normal Q-Q Plot.
Figure 2.7 Normal Q-Q Plot with more Deviations.
In short, this is a useful tool to compare and check the distributions of the dataset if
- They follow same distribution
- They have common location and scale
- They have similar distributional shape
- They have similar tail behavior.
2.6 DATA CLEANING
Data cleaning routines attempt to fill up the missing values, smoothen the noise while
identifying the outliers and correct the inconsistencies of the data. The procedures can do
the following steps to solve the problem of missing data.
1. Ignore the tuple
A tuple with missing data, especially a missing class label, is ignored. This method is not
effective when the percentage of missing values is high.
2. Filling in the values manually
Here the domain expert can analyze the data tables, carry out the analysis and fill in the
values manually. But this is a time-consuming procedure and may not be feasible for
larger data sets.
3. Use a global constant to fill in the missing value. The missing values may be filled
with a label such as "Unknown" or "Infinity". But some data mining algorithms may
give spurious results by analyzing these labels as if they were real values.
4. Use the attribute mean to fill in the missing value. For example, the average income
can replace a missing income value.
5. Use the attribute mean of all samples belonging to the same class. Here the class-wise
average value replaces the missing values of all tuples that fall in that class.
6. Use the most probable value to fill in the missing value. The most probable value
can be obtained from other methods like classification or decision tree prediction.
Some of these methods introduce bias into the data. The filled value may not be the correct
value; it is just an estimated value, and the difference between the estimated and the original
value is called error or bias. A small imputation sketch follows.
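The following is a minimal sketch of two of these strategies (overall mean and class-wise mean), assuming Python with pandas; the column names and values are purely hypothetical.

    # Filling missing values with the attribute mean (hypothetical data)
    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        'class':  ['A', 'A', 'B', 'B', 'B'],
        'income': [30000, np.nan, 52000, np.nan, 48000],
    })

    # strategy 4: replace missing income by the overall mean income
    overall = df['income'].fillna(df['income'].mean())

    # strategy 5: replace missing income by the mean income of the same class
    classwise = df['income'].fillna(df.groupby('class')['income'].transform('mean'))

    print(overall.tolist())    # missing values become the overall mean (about 43333)
    print(classwise.tolist())  # missing class A value becomes 30000, class B becomes 50000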
Noisy data
Noise is a random error or variance in a measured value. The noise may be removed
by the following methods.
- Binning method
- Regression
- Clustering
Binning is a method where the values are sorted and distributed into the bins. The
bins are also called buckets. The binning method then uses the neighboring values to
smooth the noisy data.
Some of the techniques that are commonly used are
- Smoothing by means : Means replaces the values of the bins
- Smoothing by bin medians: The bin value is replaced by bin median.
- Smoothing by bin boundaries
Each bin value is replaced by the closest bin boundary. The maximum and minimum
values are called bin boundaries. Binning methods may be used as a discretization technique.
Example
By the equal-frequency binning method, the data values are distributed evenly across the
bins. Let us assume bins of size 3 and consider the following set.
S = {12, 14, 19, 22, 24, 26, 28, 31, 34}
Bin 1 : 12, 14, 19
Bin 2 : 22, 24, 26
Bin 3 : 28, 31, 34
By the smoothing-by-bin-means method, the values in each bin are replaced by the bin
mean. This results in
Bin 1 : 15, 15, 15
Bin 2 : 24, 24, 24
Bin 3 : 31, 31, 31
Using the smoothing-by-bin-boundaries method, the bin values become
Bin 1 : 12, 12, 19
Bin 2 : 22, 22, 26
Bin 3 : 28, 28, 34
As per this method, the minimum and maximum values of each bin are taken as the bin
boundaries, and each value is then replaced by the nearest boundary (a value equidistant
from both boundaries, like 24 in bin 2 and 31 in bin 3, is assigned here to the lower boundary).
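A minimal sketch of equal-frequency binning with the two smoothing methods, assuming Python (standard library only) and the set S above, is given below.

    # Equal-frequency binning and smoothing (data set S from the example)
    S = sorted([12, 14, 19, 22, 24, 26, 28, 31, 34])
    size = 3
    bins = [S[i:i + size] for i in range(0, len(S), size)]

    means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

    def nearest_boundary(v, b):
        lo, hi = b[0], b[-1]
        return lo if (v - lo) <= (hi - v) else hi   # ties go to the lower boundary

    boundaries = [[nearest_boundary(v, b) for v in b] for b in bins]

    print(bins)        # [[12, 14, 19], [22, 24, 26], [28, 31, 34]]
    print(means)       # [[15, 15, 15], [24, 24, 24], [31, 31, 31]]
    print(boundaries)  # [[12, 12, 19], [22, 22, 26], [28, 28, 34]]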
Regression
Data can be smoothed by fitting the data to a function. Linear regression finds the
best line to fit two attributes. Multiple regression uses more than two variables to fit the
data to a multidimensional surface.
Clusters
Cluster analysis is used to detect the outliers. In clustering process all similar values
form a cluster. Technically all data that fall apart from normal data behavior is an outlier.
Data cleaning as a Process
The data cleaning process involves two steps. The first step is the phase of discrepancy
detection and in the second phase the errors are removed.
The discrepancies in the data are due to various reasons. They are due to reasons
like poorly designed forms, human errors and typographical errors. Some times the data
may not be present because either the data may not be available or the user may not desire
to enter the data for personal reasons. The errors are also due to deliberate errors, data
decay, data representation and inconsistent use, errors in the measuring instruments and
system errors. These can be detected with knowledge about the specific attributes called
meta data. Some of the errors can be found by superimposing a known structure on the data
and checking for violations; for example, the format of a date field may be required to be
mm/dd/yyyy. All these data are then analyzed against the business rules.
The business rules may be designed by the business organizations for data cleaning.
They may include
- Unique rule: each value of a given attribute must be different from all other values of that attribute.
- Consecutive rule: there can be no missing values between the lowest and highest values of the attribute.
- Null rule: specifies how blanks, question marks and special characters should be handled.
The second phase involves removing the errors. Tools can be designed to remove the
errors. Some of the errors can be removed by data scrubber tools. The data scrubbing
tools use domain knowledge to remove the errors. Data auditing tools are used to find
discrepancies in rules and to find data that violate the rules.
2.7 DATA INTEGRATION
Data mining requires merging of data from multiple sources. One of the major problems
is known as entity identification problem. Much of the data may be present under various
headers in multiple data sources. For example the data present in a table under the attribute
patient name may be same as the data that is present in some other table under the title as
customer name. The merging of these sources creates redundancy and duplication. This
problem is called entity identification problem.
Identifying and removing the redundancy is a challenging problem. Such redundancies
can be easily found by correlation analysis. Correlation analysis takes two attributes
and computes a correlation coefficient, which measures how strongly the attributes are
related. This is applicable to numerical attributes (using the Pearson coefficient) and to
categorical data (using the chi-square test).
Pearson Correlation coefficient
The Pearson correlation coefficient is the most common test for determining if there is
an association between two phenomena. It measures the strength and direction of a linear
relationship between the X and Y variables. If the given attributes are X = (x1, x2, ..., xn)
and Y = (y1, y2, ..., yn), the sample correlation coefficient, denoted by r, is given by
r = Sxy / (Sx * Sy) = Sxy / sqrt(Sxx * Syy)
where Sxx = Sum of (xi - mean(X))^2, Syy = Sum of (yi - mean(Y))^2 and
Sxy = Sum of (xi - mean(X)) * (yi - mean(Y)), the sums running over i = 1 to n.
1. If r > 0, then X and Y are positively correlated. The higher the value of r, the stronger
the correlation. By positive correlation, we mean that if X increases then
Y also increases.
2. If r = 0, then it implies that attributes X and Y are independent and there exists no
correlation between X and Y.
3. If r < 0, then X and Y are negatively correlated. By negative correlation, we mean
that if X increases then Y decreases.
Strength of Correlation
Generally the strength of the correlation can be assessed as follows: the correlation
is strong if |r| > 0.8, moderate if 0.5 < |r| < 0.8 and weak if |r| < 0.5. The usefulness of this
measure is that if there is a strong correlation between two attributes, then one of the attributes
can be removed. The removal of the attribute results in data reduction.
Also the value of r does not depend upon the units of measurement. This means the
units like centimeter, meter are all irrelevant. Also the value of r does not depend upon
which variable is labeled X and which variable is labeled Y.
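A minimal sketch of computing r, assuming Python with NumPy and two short hypothetical attribute vectors, is shown below.

    # Pearson correlation coefficient (hypothetical data)
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

    r = np.corrcoef(x, y)[0, 1]    # off-diagonal entry of the 2 x 2 correlation matrix
    print(r)                       # close to +1: strong positive correlation

    # the same value from the definition
    sx, sy = x - x.mean(), y - y.mean()
    print(np.sum(sx * sy) / np.sqrt(np.sum(sx**2) * np.sum(sy**2)))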
Correlation and Causation
Correlation does not mean that one factor causes other. For example the correlation
between economical background and marks scored does not imply that economic
background caused high scoring of marks. In short, correlation is different from causation.
Chi-Square test
Chi-Square test is a non-parametric test. This is used to compare the observed
frequency of some observation (Such as frequency of buying different brands of TV) with
the expected frequency of the observation (of buying of each TV by customers). This
comparison is used to calculate the value of the chi-square statistic. This is then compared
with the distribution of chi-square to make an inference about the problem.
X
2
=
Here E is the expected frequency, O is the observed frequency and the degree of
freedom is C 1 where C is number of categories.
The uses of the Chi-square tests are
- To check whether the distribution of measures are same
- To check goodness-of-fit.
Consider the following problem. Let us assume that in a college, both boys and girls
register for a data mining course. The survey details of 210 students are tabulated in Table
2.3. The condition to be tested is whether any relationship exists between the gender
attribute and registration for the data mining course.
Table 2.3 Contingency Table
The expected values can be calculated as per the formula
E11 = Count (Male) X Count (Data Mining) / N
= 120 X 80 / 210 = 45.71
E12 = Count (Female) X Count (Data Mining) / N
= 90 X 60 / 210 = 25.71
E21 = Count (Male) X Count (Not Data Mining) / N
= 120 X 40 / 210 = 22.86
E22 = Count (Female) X Count (Not Data Mining) / N
= 90 X 30 / 210 = 12.86
Then
X^2 = (80 - 45.71)^2 / 45.71 + (60 - 25.71)^2 / 25.71
+ (40 - 22.86)^2 / 22.86 + (30 - 12.86)^2 / 12.86
= 107.14
Degrees of freedom = (Rows - 1) X (Columns - 1) = 1
Referring to the statistical tables for the 0.001 significance level and 1 degree of freedom,
the critical value is 10.828.
Since the computed value 107.14 is greater than the critical value 10.828, the hypothesis
that the attributes are independent is rejected: there is a relationship between gender and
registering for the data mining course.
Now the reasoning process can be stated like this
- State the hypotheses
H0: O = E (gender and course registration are independent)
H1: O is not equal to E (they are related)
- Set the significance level
Alpha = 0.001
- Calculate the value of the appropriate statistic
X^2 = 107.14, degrees of freedom = 1
- Write the decision rule for rejecting the null hypothesis
Reject H0 if X^2 >= 10.828 (that is, for p < 0.001)
Since the computed value is greater than 10.828, we reject the null hypothesis and
accept the alternate hypothesis.
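A minimal sketch of the chi-square computation, assuming Python with NumPy and using a small hypothetical 2 x 2 table of observed counts (not the table above), is given below; the expected counts are derived from the row and column totals.

    # Chi-square statistic for a hypothetical 2 x 2 contingency table
    import numpy as np

    observed = np.array([[250.0, 200.0],
                         [ 50.0, 1000.0]])

    row = observed.sum(axis=1, keepdims=True)     # row totals
    col = observed.sum(axis=0, keepdims=True)     # column totals
    n = observed.sum()

    expected = row @ col / n                      # E = row total x column total / N
    chi2 = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

    print(expected)        # [[ 90. 360.] [210. 840.]]
    print(chi2, dof)       # about 507.9 with 1 degree of freedom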
The chi-square test thus allows us to detect related (redundant) categorical attributes
and helps us to remove the redundancy.
2.8 DATA TRANSFORMATION
Many data mining algorithms assume that variables are normally distributed. But the
real world data are not usually normally distributed. There may be valid reasons for the
data to exhibit non-normality. It is also due to things such as mistakes in data entry, and
missing data values. Significant violation of the assumption of normality can seriously
affect data mining algorithms. Hence such non-normal data must be conditioned so that the
data mining algorithms function properly.
There are methods to deal with non-normal data. One such method is data
transformation which is the application of a mathematical function to modify the data.
These modifications to the values of a variable include operations like adding constants,
multiplying by constants, squaring or raising to a power. Some data transformations convert
the data to logarithmic scales, invert or reflect the data, and sometimes apply trigonometric
transformations to make it suitable for data mining.
Some of the data transformations are given below
Smoothing: These algorithms remove noise. This is achieved by methods like binning,
regression and clustering.
Aggregation: This involves applying the aggregation operators to summarize the data.
Generalization: This technique involves replacing the low level primitive by a higher
level concept.
Normalization: The attribute values are scaled to fit in a range (say 0-1) to improve
the performance of the data mining algorithm.
Attribute construction: In this method, new attributes are constructed from the given
attributes to help the mining process. In attribute construction, new attributes are constructed
from given attributes which may give better performance. Say the customer height and
weight may give a new attribute called height-weight ratio which may provide further details
and stability as its variance is less compared to the original attributes.
Some of the procedures used are
1. Min-Max normalization
2. Z-Score normalization.
Min-Max Procedure
Min-Max procedure is a normalization technique where each variable X is normalized
by its difference with the minimum value divided by the range.
Consider the set S = {5,5,6,6,6,6,88,90}
X* = (X - min(X)) / Range(X)
= (X - min(X)) / (max(X) - min(X))
For example, for the above set, the value 88 is normalized as (88 - 5)/85 = 0.976. The
Min-Max normalized values thus lie between 0 and 1.
Z-Score normalization
This procedure works by taking the difference between the field value and mean
value and by scaling this difference by standard deviation of the attribute.
X* = (X - mean(X)) / Standard Deviation (X)
For example, for the patient age 88, the Z-score is
X* = (88 - mean(X)) / Standard Deviation (X)
Z-scores typically range between about -4 and +4, with values equal to the mean having a
Z-score of zero.
Z-scores are also used for outlier detection. If the Z-score of a data value is either less
than -3 or greater than +3, then it is possibly an outlier. The major disadvantage of the
Z-score function is that it is very sensitive to outliers, as it depends on the mean.
Decimal Scaling
This is a procedure where the decimal point is moved while still preserving most of the
original digit information. Here each data point is converted to a value in the range -1 to +1.
X* = X(i) / 10^k
For example, the patient age 88 is normalized to 0.88 by choosing k = 2. Some data mining
techniques, especially neural networks, carry out such data transformations inherently; for
other data mining algorithms, the data transformations should be carried out separately.
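A minimal sketch of the three scaling procedures applied to the value 88 of the set S above, assuming Python with NumPy, is shown below.

    # Min-Max, Z-score and decimal scaling for the value 88 (set S from the example)
    import numpy as np

    S = np.array([5, 5, 6, 6, 6, 6, 88, 90], dtype=float)
    x = 88.0

    min_max = (x - S.min()) / (S.max() - S.min())       # (88 - 5) / 85 = 0.976
    z_score = (x - S.mean()) / S.std()                  # population standard deviation
    decimal = x / 10 ** 2                               # k = 2 moves the decimal point

    print(round(min_max, 3), round(z_score, 2), decimal)   # 0.976, about 1.7, 0.88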
2.9 DATA REDUCTION
Sampling is a statistical procedure for data reduction. A statistical study starts by identifying
the group about which conclusions are to be drawn; this entire group is called the population.
For example, a company may want to know whether its products are popular, but it is
difficult to survey the entire population living in an area. Hence the strategy is to select a
smaller group of representative data, called a sample, from the population
(refer Figure 2.8).
Figure 2.8: Sampling Process
Samples are representatives of the population. If the samples are not representative,
then it will lead to bias and can misrepresent the facts by wrong inference.
How samples are selected?
It is better to avoid volunteer samples, because volunteer samples often lead to
bias. A convenience sample is a sample that the researcher chooses because it is available
at the right time; say, the company selects the people who live near the company. In this
case the sample may still be representative.
What should be the ideal size of the sample?
The greater emphasis is on the representative nature of the samples. If the sample is
too small, the results will be biased. But if the sample is too large, it defeats the
very purpose of sampling. Hence the sample should be of optimal size, and the size often
affects the quality of the results.
Data reduction techniques reduce the dataset while maintaining the integrity of the
original dataset. The results are same as the original.
Data cube reduction
Data cubes store multidimensional, aggregated information. The base cuboid
corresponds to the individual entities of interest. The cube at the highest level of abstraction
is called the apex cuboid. Data cubes created at different levels of abstraction are often
referred to as cuboids. Every level reduces the dataset size. This helps in reducing the data
size for mining problems.
Attribute subset selection reduces the dataset size by removing irrelevant features
and constructing a minimum set of attributes for data mining. For n attributes, there are 2^n
possible subsets, so choosing optimal attributes becomes a search problem. Typically, feature
subset selection uses a greedy approach, making the locally optimal choice at each step
while hoping that it will lead to a globally optimal solution.
Statistical tests are conducted at various levels to determine statistical significance for
attribute selection. Some basic techniques are given below
Stepwise forward selection
This procedure starts with an empty set of attributes. At each step, the remaining attributes are tested
for statistical significance, the best one is added to the reduced set, and this process is
continued until a good reduced set of attributes is obtained.
Stepwise backward elimination
This procedure starts with the complete set of attributes. At every stage, the procedure
removes the worst attribute from the set, leading to the reduced set.
Combined approach: Both forward and reverse methods can be combined so that
the procedure can add the best attribute and remove the worst attribute.
Decision Tree induction
The feature reduction algorithm may use decision tree algorithms for determining
a good attribute set. The decision tree is constructed using a measure such as information gain to
partition the data into individual classes. The attributes that do not appear in the
tree are assumed to be irrelevant and are excluded from the reduced set.
Dimensionality Reduction
The data can be reduced using some standard techniques like principal component
analysis, factor analysis and multidimensional scaling.
The goal of PCA is to reduce the set of attributes to a newer smaller set that captures
the variability of the data. The variability is captured by a fewer components which would
give the same result compared to the original result with all the attributes.
The advantages of PCA are
1. It identifies the strongest patterns in the data.
2. It reduces the complexity by reducing the attributes
3. PCA can also remove the noise present in the data set.
The principal components produced by PCA satisfy the following properties:
1. Each pair of distinct components has zero covariance.
2. PCA orders the components based on the variance of the data they capture.
3. The first component captures the most variance of the data, the second component
much less, and so on, subject to the orthogonality requirement.
The PCA algorithm is stated below
1. Get the target data set.
2. Subtract the mean. The mean of each attribute is subtracted from that attribute;
that is, the mean of X is subtracted from the X values and the mean of Y from the Y values. This produces a
data set with zero mean.
3. The covariance matrix is calculated.
4. The eigenvalues and eigenvectors of the covariance matrix are calculated.
5. The eigenvector with the highest eigenvalue is the principal component of the data set.
The eigenvalues are ordered in descending order, and the feature vector is
formed with the corresponding eigenvectors as columns:
Feature Vector = {Eigenvector 1, Eigenvector 2, ..., Eigenvector n}
6. The components that represent most of the variability of the data are preserved.
The final data is: Final Data = Row Feature Vector X Row Data Adjust
where Row Feature Vector is the matrix with the eigenvectors in the columns, transposed, and Row Data
Adjust is the mean-adjusted data, transposed.
At any time, if the original data is desired, it can be obtained using the formula
Original Data = (Row Feature Vector^T X Final Data) + Mean
The new data is a dimensionally reduced matrix that represents the original data.
Therefore PCA is effective in removing the attributes that do not contribute much. Also,
if the original data is desired, it can be recovered, so no information is lost.
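The following is a minimal NumPy sketch of the PCA steps listed above, assuming a small two-attribute data matrix with made-up values; it is not tied to any data set mentioned in the text.

import numpy as np

# Step 1: target data set (rows = records, columns = attributes).
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

# Step 2: subtract the mean of each attribute (zero-mean data).
X_adj = X - X.mean(axis=0)

# Step 3: covariance matrix of the mean-adjusted data.
cov = np.cov(X_adj, rowvar=False)

# Step 4: eigenvalues and eigenvectors of the covariance matrix.
eig_values, eig_vectors = np.linalg.eigh(cov)

# Step 5: order the eigenvectors by descending eigenvalue (columns of the feature vector).
order = np.argsort(eig_values)[::-1]
feature_vector = eig_vectors[:, order]

# Step 6: project the data; keeping only the leading rows gives the reduced data.
final_data = feature_vector.T @ X_adj.T

# The original data can be recovered (exactly, if all components are kept).
original = (feature_vector @ final_data).T + X.mean(axis=0)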
A Scree plot is a technique for visualizing the principal components. The Scree plot of the
lung cancer data is given in Figure 2.9.
Figure 2.9: Scree Plot of Lung Cancer Data
It can be seen that the majority of the variance in the data is represented by the first three
components. For example, the contribution of the sixth component is small compared to the first three
(Refer Figure 2.10).
Figure 2.10: Few Components of the Attribute.
The Scree plot illustrates how the variance of the data is distributed over the first few components.
Factor analysis
Unlike PCA, factor analysis expresses the original attributes as linear combinations of
a smaller number of latent, or hidden, factors. Factor analysis assumes that the observed
correlation among attributes is due to some underlying factors. The variability of each attribute is split
into two components: the common variance shared with the underlying factors (whose sum is called the
communality) and a unique component, including noise. The unique variability is excluded
from the analysis; only the common variability is studied in factor analysis.
Let f1, f2, ..., fp be the latent factors. Let the original data matrix D be an M x N matrix and the latent
factor matrix F be an M x P matrix.
The standard factor analysis model assumes the relation
D_i*^T = A F_i*^T + e
where D_i* is the ith row of the original matrix D and F_i* is the corresponding
row of the new data matrix F. A is an N x P matrix of factor loadings, which
expresses the relationship between the original attributes and the latent factors, and e is the error term
that accounts for the variability not explained by the common factors.
Multidimensional Scaling
The goal of Multidimensional Scaling (MDS) is to reduce the dimensions like PCA
and FA. It is done by a projection technique where data is mapped to a lower dimensional
space that preserves the distance between the objects.
The MDS approach accepts a dissimilarity matrix as input and projects it to a p-
dimensional space. The input matrix entry dij is the distance between the ith object and the jth
object. MDS projects the data to a new space such that the stress is minimized, where
Stress = sqrt( sum over i,j of (d'ij - dij)^2 / sum over i,j of dij^2 )
and d'ij is the distance between the objects in the projected space.
The Euclidean distance based MDS is similar to PCA.
Numerosity reduction
Parametric models
These models fit the data to a model and store only the model parameters instead of the actual data.
The log-linear model is a good example of this type.
Non-parametric models store reduced representations of the data, such as
histograms, clusters and samples.
Regression and log-linear models
These models can approximate the data. In linear regression, the data is modeled to fit
a straight line
Y = aX + b, where Y is the response variable and X is the predictor variable.
The parameters a and b are the slope and the Y-intercept respectively. These are called
regression coefficients. The values of the coefficients can be obtained by the method
of least squares, which minimizes the error between the actual and predicted values. Multiple
regression involves two or more predictor variables. Regression methods are relatively inexpensive
and can handle skewed data very well.
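The following is a small sketch of fitting Y = aX + b by least squares, with made-up values for the predictor and response; only the two coefficients need to be stored in place of the data.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # predictor variable
y = np.array([2.1, 4.2, 6.1, 8.3, 9.9])      # response variable

# Least-squares estimates of the slope a and intercept b.
a = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - a * x.mean()

y_predicted = a * x + b                       # approximation of the original data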
Log-linear models construct a higher dimensional space from lower dimensions. A
tuple can be considered as a point in N-dimensional space. Log-linear models estimate the
probability of each point for a set of points based on a smaller subset of dimensions.
Hence these are used for the dimension reduction.
Histograms
Histograms use bins to represent a data distribution. A histogram partitions the attribute
A into disjoint buckets. If a bucket represents a single value, it is called a singleton bucket. More
often, buckets hold continuous ranges of the attribute.
The buckets and the attribute values are partitioned based on criteria such as the following.
Equal width: the width of each bucket range is uniform.
Equal frequency: each bucket holds roughly the same number of values.
V-optimal: the histogram with the least variance is constructed. The histogram
variance is a weighted sum of the values in each bucket, where the bucket weight is equal to the number
of values in the bucket.
MaxDiff: bucket boundaries are placed between the pairs of adjacent values having the largest differences.
Multidimensional histograms can capture dependencies between attributes.
Clustering
They partition the objects into groups or clusters. All objects within the cluster are
similar and objects in one cluster are dissimilar to the instances of the other clusters. The
cluster quality is determined by its diameter, centroid and distance. In data reduction, the
clusters tend to replace the actual data. The nature of data determines the effectiveness of
this method.
Sampling
Sampling can be used as a technique for reducing the data. It takes random samples
from the total data. The sample chosen should be representative in nature.
A simple random sample without replacement
This is done by drawing s of the N tuples from D (s < N), where every tuple is equally likely
to be chosen.
A simple random sample with replacement
This is the same as the previous technique, except that each drawn tuple is recorded and then
replaced, so it may be drawn again.
Cluster sample
Here the tuples are grouped into disjoint clusters. Then samples are drawn from
the clusters.
Stratified sample
Here the population D is divided into disjoint strata; say, the customer database is
divided into different strata based on age. Then a stratified sample is drawn from each
stratum. This helps to ensure that a representative sample is chosen. This technique is ideally
suited for skewed data.
The advantages of sampling are
1. Reduced processing cost and
2. Reduced complexity.
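The sampling schemes described above can be sketched with Python's random module as follows; the population of 100 numbered tuples and the two strata are hypothetical.

import random

data = list(range(1, 101))                    # 100 hypothetical tuples

# Simple random sample without replacement: each tuple can be drawn at most once.
srswor = random.sample(data, 10)

# Simple random sample with replacement: a drawn tuple may be drawn again.
srswr = [random.choice(data) for _ in range(10)]

# Stratified sample: divide the population into disjoint strata and draw from each stratum.
strata = {"low": [d for d in data if d <= 50], "high": [d for d in data if d > 50]}
stratified = [v for group in strata.values() for v in random.sample(group, 5)]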
DATA DISCRETIZATION AND CONCEPT HIERARCHY
GENERATION TECHNIQUES
The process of converting the continuous attribute to discrete attribute is called
discretization. This leads to a concise and easy to use knowledge representation.
Discretization can be categorized based on
1. Direction as Top down Vs Bottom up
2. Class Information as Supervised Vs Unsupervised
Top-down methods start by finding one or a few points, called split points or cut
points, which are used to split the continuous range into discrete intervals;
this procedure is applied recursively on the resulting intervals. Bottom-up methods, on the
other hand, consider all of the continuous values as potential split points and merge
neighboring values to form intervals; this procedure is applied recursively
until stable intervals are obtained.
Concept hierarchy is then applied for the discretized value. Concept hierarchy gives
a concept for discrete values. This approach leads to loss of information, but the generalized
data is more meaningful and easy for interpretation.
Manual Methods:
Here the prior knowledge of the feature is used to determine
1. Cut off points
2. To select the representatives of the intervals.
For example, grades can be used to split continuous marks into intervals (Refer Table 2.4).
Table 2.4: Sample Discretization Measure
Without prior knowledge, it is difficult to discretize manually. Discretization facilitates a reduction in
computational complexity, and many data mining algorithms work only with discretized
values. However, for large databases and for applications where there is no prior
knowledge, manual methods are not feasible.
Binning
Binning does not use class information, so it is an unsupervised technique. Here the
concept of equal-width or equal-frequency binning is used, and the values in each bin are replaced by the
bin mean or bin median. These techniques can be applied recursively to partition the bins, leading
to a concept hierarchy (Refer Figure 2.11).
Figure 2.11: Binning Process
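The following is a minimal sketch of equal-frequency and equal-width binning with smoothing by bin means, assuming a small made-up attribute; for simplicity it ignores leftover values when the data does not divide evenly into bins.

values = sorted([15, 20, 25, 60, 70, 75, 90, 95, 100])
n_bins = 3

# Equal-frequency binning: each bin receives the same number of sorted values.
size = len(values) // n_bins
bins = [values[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
smoothed = [sum(b) / len(b) for b in bins for _ in b]

# Equal-width binning: assign each value to one of n_bins intervals of equal width.
low, high = min(values), max(values)
width = (high - low) / n_bins
equal_width = [min(int((v - low) / width), n_bins - 1) for v in values]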
Entropy based discretization
This method is a supervised, top-down splitting technique. It uses class information to
determine the split point. The criterion used for selecting the split point is minimum
entropy.
Each value of A can be considered as a potential split point. A chosen split point partitions
the tuples into those satisfying A <= split point and those satisfying A > split point. This process
is called binary discretization.
The method of splitting based on entropy is given below.
Let us assume that there are two classes C1 and C2. Ideally, the tuples from C1 would form one
partition and the tuples from C2 the other. This is unlikely, however, as the partitions usually
contain a mixture of classes. So the split point is chosen based on the expected
information requirement.
For the above case it is given by
InfoA(D) = (|D1| / |D|) * Entropy(D1) + (|D2| / |D|) * Entropy(D2)
D1 and D2 are the sets of tuples that satisfy the conditions A <= split point and A > split
point respectively. The entropy can be calculated as
Entropy(D) = - Σ (over i) pi log2(pi)
where pi is the probability of class Ci, determined by the number of tuples of Ci divided by
the total number of tuples.
The split point chosen is the one that minimizes InfoA(D).
This process is applied recursively to each partition until the information requirement of the best
candidate split falls below a small threshold value, or the number of intervals exceeds a threshold value.
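A small sketch of selecting a single binary split point by minimum expected information is given below, assuming a made-up list of (value, class) pairs; the helper names are illustrative only.

import math

def entropy(labels):
    # Entropy of a class distribution: -sum(pi * log2(pi)).
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_split(points):
    # points: list of (value, class_label); every value is tried as a candidate split point.
    best = None
    for candidate, _ in points:
        left = [c for v, c in points if v <= candidate]
        right = [c for v, c in points if v > candidate]
        if not left or not right:
            continue
        info = (len(left) / len(points)) * entropy(left) + \
               (len(right) / len(points)) * entropy(right)
        if best is None or info < best[1]:
            best = (candidate, info)
    return best

data = [(25, "low"), (32, "low"), (41, "high"), (55, "high"), (60, "high")]
print(best_split(data))       # (32, 0.0): the split A <= 32 separates the classes perfectly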
Interval merging with Chi-square analysis
This algorithm analyzes the quality of multiple intervals for a given feature using the chi-
square method. The algorithm determines the similarity of the data in adjacent intervals.
If the intervals are independent, that is, if the difference between the intervals is
statistically significant, then the intervals are not merged.
The basic chi-merge algorithm is given below
- Sort the data of the given feature in ascending order
- Define the initial intervals so that every value of the feature is in a separate interval
- Repeatedly merge the pair of adjacent intervals with the lowest chi-square value, until no
chi-square test result of adjacent intervals is less than the threshold value
Cluster analysis
Cluster analysis can be applied to group the values of A into clusters. Then the clustering
algorithm can generate a concept hierarchy by following either top-down approach or
bottom-up approach.
Intuitive partitioning
This algorithm is based on a rule called the 3-4-5 rule. It is used to partition numeric
ranges into intervals that are suitable for concept hierarchy generation.
- If the interval covers 3, 6, 7, or 9 distinct values at the most significant digit, then the
range is partitioned into three intervals.
- If the most significant digit covers 2, 4, or 8 distinct values, then the range is partitioned
into four equal-width intervals.
- If the most significant digit covers 1, 5, or 10 distinct values, then the range is
partitioned into five equal-width intervals.
This rule is recursively applied to each interval, creating the concept hierarchy. The
algorithm takes the 5th percentile as low and the 95th percentile as high. The values of low
and high are rounded up or down based on the most significant digit. Then the interval
ranges are checked for distinct values and the 3-4-5 rule is applied recursively.
Concept hierarchy for categorical data
Categorical data takes value from a finite number of distinct values. For example
location and job category are examples of categorical data. The concept hierarchy can be
generated by the following technique.
- Specification of values explicitly at the schema level by the domain experts. Say,
a location hierarchy for a revenue system can be specified as
Taluk -> District -> State -> Region -> Country.
- Specification of a portion of a hierarchy by explicit data groups.
For a large dataset, it is often difficult to define the complete concept hierarchy,
so the hierarchy can be specified explicitly for a small portion of the
intermediate-level data.
- Specification of a set of attributes, but not their partial ordering.
The hierarchy is then decided based on the number of distinct values of each attribute: for example,
the country level is more general than the street level, since the distinct values of country are few
compared to those of street.
In the absence of any specification from the user, the system may drag in an
entire hierarchy from the database schema.
2.10 MINING ASSOCIATION RULES IN LARGE DATABASES
The task of association rule mining is to find interesting association rules or relationships
that exist among a set of items present in the transactional data base or data bases. The
transactional database has a set of attributes having transactions ID and a set of items.
This task of discovering association rules was first introduced in 1993 and is usually
called market basket analysis. Shop owners often speculate about
customer buying behaviour, and business organizations are interested in knowing facts
such as groups of customers and their buying behaviour. For example, a customer who buys milk
is likely to buy bread also; similar associations between diapers and beer have been noted.
This helps supermarkets to organize the shop layout so that such items can be kept together
to increase sales.
This analysis is quite valuable for many cross-marketing applications. It also helps
shops to organize catalogues, design store layouts and perform customer segmentation. It is
quite useful in other areas, say in medical diagnosis, where crucial associations between
attributes, like the effect of drugs on a patient's cure, can be discovered.
If we mine the transactional database, we get rules of the form A -> B. Such a rule
indicates that B occurs as a consequent for occurrences of A. Patterns that occur
frequently are called frequent patterns, and a set of items that appear together frequently is called a
frequent itemset. A frequently occurring subsequence, say a customer buying item1 after item2, is called a
sequential pattern. An underlying structure that is often present in the transactional database,
in the form of subgraphs or subtrees, is called a structured pattern. Association rule mining (ARM)
thus uncovers the hidden associations, correlations, and other important
relationships present in the data.
The association relationships are expressed in the form of IF-THEN rules. Each
rule is associated with two measurements: support and confidence. Confidence is a
measure of the strength of the rule, and support indicates its statistical significance.
Major problems of association mining are
- Mining a very large database is a computationally intensive process
- The discovered association rules may be spurious
Binary representations
Market basket data is stored in a binary form where each row corresponds to a
transaction and each column corresponds to an item. The item is treated as a binary value.
Its values are one if the item is present and zero if the item is absent. This is a simplistic view
of the market basket data.
Formally the problem can be stated as below.
Let I = {I1, I2, ..., In} be a set of literals called items. Let D be a database which is a
collection of transactions T = {t1, t2, ..., tn}, where each transaction is a set of items such
that T is a subset of I. Associated with each transaction is a transaction ID.
An association rule is of the form X -> Y, where X and Y are subsets of I and X and Y have no
items in common. X is called the antecedent and Y is called the consequent.
Every association rule is associated with two user-specified parameters, support and
confidence. Support indicates the frequency of the pattern, and the strength of the rule is indicated
by confidence.
The support of an association rule X -> Y is the percentage of the transactions that
contain X union Y in the database.
The confidence of an association rule is the ratio of the number of transactions that
contain X union Y to the number of transactions that contain X. This is also known as the strength of the rule.
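As an illustration, the following sketch computes the support and confidence of a single rule from a made-up list of transactions; the items and numbers are hypothetical.

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "beer"},
    {"milk", "diapers", "beer"},
]
X, Y = {"milk"}, {"bread"}
n = len(transactions)

# Support of X -> Y: fraction of transactions containing X union Y.
support = sum(1 for t in transactions if X | Y <= t) / n

# Confidence of X -> Y: transactions containing X union Y divided by those containing X.
confidence = sum(1 for t in transactions if X | Y <= t) / sum(1 for t in transactions if X <= t)

print(support, confidence)    # 0.5 and about 0.67 for this toy data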
The common approach of the association rules problem consists of two phases
1. Find large Itemsets
2. Generate the rules from the frequent itemsets selected in phase one.
But the main problem here is the combinatorial explosion in the number of itemsets and rules. For
example, a set of five items has 2^5 = 32 possible subsets. In general, if
there are d unique items, the number of possible rules R is given by
R = 3^d - 2^(d+1) + 1
The complexity is reduced by reducing the number of candidate itemsets and by reducing the
number of comparisons. If X is a proper subset of Y, then every item of X is contained in Y but there is at least
one item of Y that is not present in X.
An itemset X is called closed in a dataset D if there is no proper super-itemset Y such
that Y has the same support count as X in D. An itemset X is called a closed frequent itemset
in D if X is both closed and frequent in D. An itemset X is a maximal frequent
itemset if there is no proper super-itemset Y of X that is frequent in D. In other words, a set is closed if its
support is different from the support of all of its proper supersets.
Frequent itemset mining algorithms can be classified based on many criteria.
2.10.1 Apriori Algorithm
Phase one: Enumerate all itemsets in the transaction database whose support is greater
than a minimum support. These itemsets are called frequent itemsets, and the minimum support
is a user-defined percentage.
Phase two: The algorithm generates all rules from the frequent itemsets that meet the minimum
confidence threshold the user has specified.
The Apriori principle states that if an itemset is frequent, then all of its subsets are also frequent.
The strategy of pruning the exponential search space based on support is called
support-based pruning. The support of an itemset never exceeds the support of its subsets; this is
called the anti-monotone property of the support measure.
k = 1; C1 = { {i} | i ∈ I }
While Ck ≠ ∅ do begin
    Count the support of each candidate in Ck over D
    Fk = { S ∈ Ck | Support(S) >= minimum support }
    Ck+1 = Candidate generation (Fk)
    k = k + 1
End
F = F1 ∪ F2 ∪ ... ∪ Fk-1
Ck = { possible frequent itemsets of size k }
where D is the database and Fk is the set of itemsets of size k that reach the minimum support.
Rules generation Phase
Once the frequent itemsets are generated, they are used to generate rules. The objective
is to create, for every frequent itemset Z and each of its subsets X, a rule X -> Z - X and include it
in the result if Support(Z) / Support(X) > minimum confidence.
For all Z ∈ F do begin
    R = { }                              // resulting rules
    C1 = { {i} | i ∈ Z }                 // candidate 1-item subsets of Z
    k = 1
    While Ck ≠ ∅ do begin
        Fk = { X ∈ Ck | Support(Z) / Support(X) > minimum confidence }
        R = R ∪ { X -> Z - X | X ∈ Fk }
        Ck+1 = Candidate generation (Fk)
        k = k + 1
    End
End
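The following is a compact sketch of phase one (frequent itemset generation) in the spirit of the pseudocode above; the transaction database is made up and the code is only an illustration, not an optimized implementation.

from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]          # C1
    frequent = {}
    k = 1
    while candidates:
        # Count the support of each candidate in the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Candidate generation: join frequent k-itemsets, prune by the Apriori principle.
        new_candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(frozenset(s) in level for s in combinations(union, k)):
                new_candidates.add(union)
        candidates = list(new_candidates)
        k += 1
    return frequent

db = [{"A", "B", "E"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "C"}, {"A", "C"}]
print(apriori_frequent_itemsets(db, min_support=0.4))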
Consider a transaction table (Refer Table 2.5)
Table 2.5: Transaction Item Table
C1 = {A}, {B}, {C}, {D}, {E}
Their support values are 3/5, 4/5, 4/5, 1/5 and 4/5 respectively.
If the minimum support is 2/5, then the resultant frequent 1-itemsets are
F1 = { {A}, {B}, {C}, {E} }. The possible sets of two items are listed in Table 2.6.
Table 2.6: Two Item sets
Again based on the minimum support, the emerging itemsets are tabulated in Table 2.7.
Table 2.7: Possible Three Item sets
{A,B,E} is rejected because its support is only 1/5. {A,B,C,E} can be frequent only if
all its subsets are frequent, but {A,B,E} is not frequent, so the iteration stops.
Computational complexity of the algorithm can be affected by
Support threshold
Number of Items
Number of transactions
Average transaction width
2.10.2 Modifications of Apriori methods
FP-Tree algorithm is a modification of the Apriori algorithm. The main idea of the FP-
Tree algorithm is to store all the transactions in a trie data structure. In this way every
transaction is stored, and for each item a linked list links all of its occurrences in the tree.
Hence a prefix that is shared by many transactions is stored only once. This
advantage is called the single prefix path. That is, even if the prefix is removed, all subsets of
that prefix can still be found and added.
The algorithm for constructing the FP-Tree is given below
1. Create a root labeled Null
2. Scan the database
Process the items in each transaction in order
From the root add nodes from the transactions
Link the nodes representing the items along different branches (Refer Figure 2.12)
Figure 2.12: FP-Tree
The algorithm can be written as follows
Step 1 : Form the conditional pattern base
Step 2 : Form a conditional FP-Tree
Step 3 : Recursively mine the FP-Tree
The above procedure is carried out as follows. Start from the last item in the header table. Find
all the paths containing the item by traversing its links. The patterns in these paths with the required
frequency are called conditional patterns. Based on these, the conditional FP-Tree is
constructed, and the item is appended to the patterns mined from it, generating the frequent patterns.
Thus the FP-Tree can be mined recursively, and the processed items are removed from the
table and tree.
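A simplified sketch of the construction step only is shown below; it assumes the items in each transaction are already ordered by descending global frequency, and it omits the header table and item links that the full FP-growth algorithm maintains.

class FPNode:
    def __init__(self, item, parent):
        self.item = item           # item label (None for the null root)
        self.count = 1             # number of transactions sharing this prefix
        self.parent = parent
        self.children = {}         # item -> child FPNode

def build_fp_tree(transactions):
    # Create a root labeled null, then add one path per transaction,
    # incrementing the count of any prefix that already exists.
    root = FPNode(None, None)
    root.count = 0
    for transaction in transactions:
        node = root
        for item in transaction:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root

tree = build_fp_tree([["B", "C", "E"], ["B", "C", "A"], ["B", "E"]])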
Compact representation of frequent itemsets
1. The FP-Tree is a compressed representation of the input data.
2. Each transaction is read and mapped onto a path in the FP-Tree.
3. The more the paths overlap, the more compressed the representation becomes.
If the FP-Tree can be fitted in the main memory, then the frequent itemsets can be
derived from the structure itself rather than by scanning the database.
The association analysis can be extended to
1. Correlation Analysis
2. Causality Analysis
3. Max patterns and frequent closed itemset
4. Constraint based mining
5. Sequential patterns
6. Periodic patterns
7. Structural patterns.
Association rules may also involve multiple attributes, for example
(Age = x1) and (Education = x2) -> Buys(X, Y)
2.10.3 Evaluation Measures
The measures to evaluate the association rules can be classified into two types
1. Objective measure
2. Subjective measure
Subjective measures are difficult to incorporate into an algorithm. Normally, visualization,
template-based approaches and subjective interestingness measures (for example, those based on
concept hierarchies) are used.
Objective measures use mathematical formulae to quantify interestingness. Support and
confidence are such measures. The limitation of this approach is that a support threshold that is too high
may eliminate many interesting patterns and relationships, and confidence
ignores the support of the itemset in the rule consequent.
Contingency table can be made for finding interestingness. A sample contingency
table is shown here (Refer Table 2.8)
Table 2.8: Contingency Table
Then from the contingency table the correlation can be obtained as
phi = (F11 * F00 - F10 * F01) / sqrt(F1+ * F+1 * F0+ * F+0)
This is called the correlation relationship. It is -1 for perfect negative correlation, +1 for perfect positive
correlation and 0 if the variables are independent.
For binary variables
Interestingness (A,B) = S(A,B) / sqrt(S(A) * S(B)) = |A . B| / (|A| * |B|) = cosine(A,B)
It is also the geometric mean of the confidences of the rules in both directions:
Interestingness (A,B) = sqrt( (S(A,B)/S(A)) * (S(A,B)/S(B)) ) = sqrt( C(A -> B) * C(B -> A) )
Interest Factor
Lift = C(A -> B) / S(B)
     = rule confidence / support of the itemset in the rule consequent
For binary variables, this is called the interest factor
I(A,B) = S(A,B) / (S(A) * S(B)) = N * F11 / (F1+ * F+1)
I(A,B) = 1 if A and B are independent, > 1 if A and B are positively correlated,
and < 1 if A and B are negatively correlated.
The advantage of objective measures is that they have a strong mathematical foundation
and hence are suitable for data mining. The disadvantage is that the value can be very large
even for uncorrelated and negatively correlated patterns.
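For illustration, the interest factor (lift) can be computed directly from support counts, as in the sketch below; the counts are made up.

N = 100                                   # hypothetical number of transactions
count_A, count_B, count_AB = 60, 50, 40   # hypothetical support counts

s_A, s_B, s_AB = count_A / N, count_B / N, count_AB / N

confidence = s_AB / s_A                   # C(A -> B)
lift = confidence / s_B                   # I(A, B) = S(A,B) / (S(A) * S(B))

print(lift)                               # greater than 1 here: positive correlation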
Summary
- Data Quality is crucial for the quality of results for data mining algorithms
- The types of dirty data are Incomplete data, Inaccurate data, duplicate data and
outliers
- Data preprocessing involves four crucial stages Data cleaning, Data Integration,
Data transformation and Data reduction.
- The measures of the central tendency include mean, median, mode and midrange.
- The most commonly used central tendencies are mean, median and mode.
- The most common dispersion measures are range, 5-number summary, interquartile
range and standard deviation.
- Visualization is an important aspect of data mining which helps the user to recognize
and interpret the results quickly.
- Noise is a random error or variance in a measured value and can be removed by
binning method, clustering and regression models.
- Redundant attributes can be removed by observing the correlation coefficient.
- Chi-square technique is a non-parametric test used to test the null hypothesis.
- Some of the important data transformations are smoothing, aggregation,
generalization, normalization and attribute construction.
- Data reduction techniques reduce the dataset while maintaining the integrity of the
original dataset.
- The continuous attributes can be converted to discrete attributes using discretization
- Association rule mining finds interesting association rules or relationships that exist
among the set of items present in the transactional database or databases.
- Support of an association rule X -> Y is the percentage of transactions that contain
X U Y in the database. The confidence of an association rule is the ratio of the
number of transactions that contain X U Y to the number of transactions that
contain X.
- The main idea of FP-Tree algorithm is to use trie data structure to reduce the
repeated database scan.
- Association rules can be evaluated using objective measures like Interestingness,
Interest factor or by subjective measures like utility.
DID YOU KNOW?
1. What are the data types?
2. What is the difference between noise and outlier?
3. What is the difference between variance and covariance?
4. What is the relationship between mining and data base scan?
Short Questions
1. What are the types of dirty data?
2. What are the stages of data management?
3. What are the types of data preprocessing?
4. What is the need for data preprocessing?
5. What is the need for data summarization?
6. What are the measures of central tendencies?
7. Why are central tendency and dispersion measures important for data miners?
8. What are the measures of skewness and kurtosis?
9. How Interquartile range is useful in eliminating outliers?
10. List out the visualization aids available for exploratory data analysis?
11. What is the use of correlation coefficient for data mining?
12. What is meant by curse of dimensionality?
13. What are the methods available for data reduction?
14. What is meant by numerosity reduction?
15. What is meant by sampling? What are the types of sampling?
16. What is meant by discretization?
17. What is meant by concept hierarchy?
18. What is meant by support and confidence in association rule mining?
19. How can frequent itemset algorithms be classified?
20. State Apriori principle?
21. What are the factors affecting the computational complexity of apriori algorithm?
22. How does one evaluate the association rules?
23. What is the difference between objective and subjective measures?
24. What is meant by Interestingness?
25. What is meant by lift factor?
Long Questions
1. Explain in detail the various stages of data management.
2. Explain in detail the different categories of data preprocessing.
3. For a given set
S = {5, 10, 15, 20, 25, 30} of marks. Find mean, median, mode, standard deviation
and variance.
4. Consider the above set. Can we say that the number 90 is an outlier or not? How does
one prove that?
5. Consider an attribute S = {15,20,25,60,70,75,90,95,100}. Apply binning
techniques and remove the noise present in the data.
6. Consider two attributes whose values are S1 = {5,10,20,40} and S2 = {1,4,6,7}. It
is decided to merge the attributes. Can this be justified using the correlation coefficient?
7. Consider the following contingency Table 2.9 of test taking behavior
Table 2.9: Test Taking Behaviour.
Is there any relationship between Gender and test taking behavior?
8. Explain in detail the principal component analysis method for data reduction.
9. Explain in detail the discretization methods.
10. Explain in detail apriori algorithm.
11. Consider the following transaction database Table 2.10.
Table 2.10: Sample Transaction table
Assume the minimum confidence and support levels are 50% and generate the rules using the
Apriori algorithm.
12. Explain in detail the FP-Tree algorithm.
UNIT III
PREDICTIVE MODELING
INTRODUCTION
Classification is a supervised learning method. It attempts to discover the relationship
between the input attributes and the target attribute. Clustering is a method of grouping the
data to form clusters. This chapter focuses on the issues of design, implementation and
analyzing classification and clustering process.
Learning Objectives
- To Explore the types of classifiers
- To Explore classifier types like decision trees, Bayesian classifiers and other classifier
types
- To study about the evaluation of classifiers
- To explore the clustering algorithms
- To analyze and perform evaluation of clustering algorithms
3.1 CLASSIFICATION AND PREDICTION MODEL
Classification is a supervised learning method. The input attributes of the classification
algorithms are called independent variables. The target attribute is called dependent variable.
The relationship between the input and target variable is represented in the form of a
structure which is called a classification model. The classification problem can be stated
formally as
Formal Definition
Given a database D which consists of tuples { } 1 2 n t , t , , t . . The tuple has input
attributes { } 1 2 n a , a , , a . and a nominal target attribute Y from an unknown fixed
distribution D over a labeled space. The classification problem is to define a mapping
function which maps the D to C where each ti is assigned to a class with minimum
generalization errors.
The classes must be
- Predefined
- Non-overlapping
- Partition the entire database
For example, an email classifier may classify incoming mails into valid and invalid mail
(spam). It can classify an incoming email as spam if
- It comes from an unknown address
- It possibly involves a marketing source from which the user does not want to receive mail
- It possibly contains viruses or Trojans that the user does not want to receive
- It possibly involves content that is not suitable for the user
An email classification program may classify mails as valid or spam based on these
attributes.
A classification model is both descriptive and predictive. The model can explain the
given dataset, and as a predictive model it can also be used to classify records whose class is
unknown.
Classification models can be constructed using regression models. Regression
models map the input space into a real-valued domain; for example, regression models
can be used to predict not only class labels but also other variables that are real
valued. Estimation and prediction can be viewed as types of classification. For example, guessing
the grade of a student is a classification problem. Prediction is the task of classifying an
instance into a set of possible classes. While prediction generally involves continuous variables,
classification is generally restricted to discrete variables.
The classification model is implemented in two steps.
1. Training phase
2. Testing phase
In phase 1, a suitable model is created with a large set of training data. The training
data has all possible combinations of data. Based on the data, a data-driven classification
model is created. Once a model is developed, the model is applied to classify the tuples
from the target database.
3.2 ISSUES REGARDING CLASSIFICATION AND PREDICTION
The major issues that are associated with the implementations of this model are
1. Over fitting of the model
The quality of the model depends on the amount of good quality data. However, if the
model fits the training data too exactly, then it may not be applicable to a broader population of data.
For example, a classifier should not be fitted to the training data more closely than necessary;
otherwise it leads to generalization error.
2. Missing data
The missing values may cause problems during both training and testing phases. Missing
data forces classifiers to produce inaccurate results. This is a perennial problem for the
classification models. Hence suitable strategies should be adopted to handle missing data.
3. Performance
The performance of the classifier is determined by evaluating the accuracy of the
classifier. The process of classification is a fuzzy issue. For example, classification of emails
requires extensive domain knowledge and requires domain experts. Hence performance
of the classifier is very crucial.
3.3 TYPES OF CLASSIFIERS
There are different methodologies for constructing classifiers. Some of the methods
are given below
- Statistical methods
These methods use statistical information for classification. Typical examples are
Bayesian classifiers and linear regression methods.
- Distance based methods use distance measures or similarity measures for
classification.
- Decision tree methods are another popular category for classification using decision
trees.
- Rule based methods use IF-THEN rules for classification.
- Soft computing techniques like neural networks, genetic algorithms and rough set
theory are also suitable for classification purposes.
In this chapter, the statistical methods and decision tree methods are explored in a
detailed manner and an overview of other methods are briefly explained.
3.3.1 Decision Tree Induction
A Decision Tree is a classifier that recursively partitions the tuples. For example, for
determining whether a mail is valid or spam, a set of questions is repeatedly asked, like
Is it from a legal source?
Does it contain any illegal program content?
Each question results in a partition. For example, Figure 3.1 shows a split.
Figure 3.1 Example of a Split
The set of possible questions and answers is organized in the form of a DT, which is
a hierarchical structure of nodes and edges. Figure 3.2 shows the structure of a decision
tree. It can be noted that the tree has three types of nodes
Figure 3.2 Sample Decision Tree
Root
The root node is a special node of the tree. It has no incoming edges. It can have zero
or more outgoing edges.
Internal nodes:
These nodes, like root node are non-terminal nodes. It contains the attribute test
conditions to separate records.
Terminal nodes
Terminal nodes are also known as leaf nodes. These represent classes and have one
incoming edge and no outgoing edges.
The basic questions involved in inducing a DT are
1. How should the records be split?
A suitable attribute test condition, based on some objective measure of quality, should
be selected to divide the records into smaller subsets.
2. What is the stopping criterion?
One possible answer is to repeat the procedure until all the records in each leaf belong to the same class.
Sometimes the procedure can be terminated earlier to gain some advantage.
Some of the attribute conditions and its corresponding outcomes for different attribute
types are given below
Binary attributes
The test condition for a binary attribute generates two outcomes. Here, checking the
gender is an attribute test. The outcomes of a binary attribute always result in two
children. Figure 3.3 shows an example of a binary attribute.
Figure 3.3 Binary Attribute
Nominal attributes
Nominal attributes have many values. These can be shown as a multi-way split, where
the number of outcomes depends on the number of distinct values of the attribute. They
can also be shown as binary splits, which can be formed in 2^(k-1) - 1 ways for an attribute
with k distinct values. Some examples of the multi-way split are
shown in Figure 3.4 and Figure 3.5.
Figure 3.4 A Multi-way split
Or as a binary split as
Figure 3.5 Multiway Split as Binary Split
Ordinal types
Ordinal attributes produce binary or multi-way splits. Ordinal values can be grouped
as long as the basic order is not violated.
For example the values of the attribute {low, medium, high}. The different ways of
grouping can be any one of this type as shown in Figure 3.6.
Figure 3.6 Ordinal Type
A grouping like {low, high}, which violates the order, is invalid.
Continuous attributes
The continuous attributes can be expressed as a comparison with binary outcomes or
multi-way splits. For example a student test marks can be expressed as a binary split and
shown in Figure 3.7a and 3.7b.
Figure 3.7a Split of Continuous Attribute.
Figure 3.7b Split of Continuous Attribute.
Splitting criteria
The measures developed for selecting the best split are based on the degree of impurity
of the child nodes. If the degree of impurity is smaller, then the distribution is more skewed.
If the attribute conditions split the records or instances evenly, then the attribute is considered
to exhibit highest impurity.
The degree of impurity is measured by many factors like Entropy, Gini and Information
gain.
1. Information Entropy is measured as
Entropy(t) = - Σ (i = 1 to c) Pi log2(Pi)
Here, c is the number of classes and Pi is the fraction of records at the node that belong to class i.
Entropy values ranges from zero to one. If the value is zero, then all the instances
belong to only one class and if the instances are equally distributed in all classes then the
value of entropy is close to one.
2. Gini index is expressed as
Gini(t) = 1 - Σ (over i) [p(i | t)]^2
3. Information Gain
If a node t is partitioned into r subsets, then the gain is measured as
Gain(t1, t2, ..., tr) = I(t) - Σ (j = 1 to r) (|tj| / |t|) * I(tj)
Where |t| is the cardinality of t and I(t) can be either entropy or Gini.
3.3.2 ID3 Algorithm
The ID3 algorithm is a top-down approach and recursively develops a tree. It uses the
information-theoretic measure Information Gain to construct the decision tree. At every node the procedure
examines the attribute that provides the greatest gain, or greatest decrease in entropy.
The algorithm can be stated as follows
1. Calculate the initial value of the entropy, Entropy(t) = - Σ (i = 1 to c) Pi log2(Pi).
2. Select the attribute which results in the maximum decrease of entropy, or gain in
information. This serves as the root of the decision tree.
3. The next level of the decision tree is constructed using the same criterion.
4. The steps 2 to 3 are repeated till a decision tree is constructed where all the instances
are assigned to a class or the entropy of the system is zero.
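The following sketch computes entropy and information gain for a candidate split, given only the per-class counts; the function names are illustrative. With the class counts of the contingency table for A1 used in the example below, it reproduces the values 0.9911, 0.7616 and 0.2294.

import math

def entropy(counts):
    # counts: number of tuples in each class at a node.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, partitions):
    # partitions: per-class counts of each child produced by the split.
    total = sum(parent_counts)
    weighted = sum((sum(p) / total) * entropy(p) for p in partitions)
    return entropy(parent_counts) - weighted

print(entropy([4, 5]))                              # about 0.9911
print(information_gain([4, 5], [[3, 1], [1, 4]]))   # about 0.2294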
Example: consider the following data set shown in Table 3.1.
Table 3.1 Sample Data Set.
The decision tree can be constructed as
There are four C1 tuples and five C2 tuples. Hence the probabilities of the classes are
P(C1) = 4/9 and P(C2) = 5/9.
The entropy of the training sample is given as
Entropy = -4/9 log2(4/9) - 5/9 log2(5/9) = 0.9911
The information gain of the attributes A1 and A2 can be calculated as follows.
For the attribute A1, the corresponding contingency table is shown in Table 3.2.
Table 3.2 Contingency Table of A1
The entropy of the attribute A1 is
= 4/9 [ -(3/4) log2(3/4) - (1/4) log2(1/4) ] + 5/9 [ -(1/5) log2(1/5) - (4/5) log2(4/5) ]
= 0.7616
Therefore the information gain for the attribute A1 is
= 0.9911 - 0.7616 = 0.2294
The same procedure can be repeated for the attribute A2.
For the attribute A2, the corresponding contingency table is shown in Table 3.3.
Table 3.3 Contingency Table of A2
The entropy of the attribute A2 is
= 5/9 [ -(2/5) log2(2/5) - (3/5) log2(3/5) ] + 4/9 [ -(2/4) log2(2/4) - (2/4) log2(2/4) ]
= 0.9839
Therefore the information gain for the attribute A2 is
= 0.9911 - 0.9839 = 0.0072
Hence the best attribute to split is A1
Gini Index Calculation
The Gini index for the attribute A1 is
= 4/9 [ 1 - (3/4)^2 - (1/4)^2 ] + 5/9 [ 1 - (1/5)^2 - (4/5)^2 ] = 0.3444
The Gini index for the attribute A2 is
= 5/9 [ 1 - (2/5)^2 - (3/5)^2 ] + 4/9 [ 1 - (2/4)^2 - (2/4)^2 ] = 0.4889
Since Gini Index for the attribute A1 is smaller, it is chosen first. The resultant decision
tree is shown in Figure 3.8.
Figure 3.8 Decision Tree
One of the inherent benefits of a decision tree model is that it can be understood by
many users, as it resembles a flow chart. The advantages of the decision tree are its graphical
format and the interactivity it provides, allowing the user to explore the decision tree.
Decision tree output is often presented as a set of rules. The rules are of the form If-Then
and provide a more concise representation of the knowledge, especially when the tree is too large
and understanding it becomes difficult.
Decision trees can be both predictive and descriptive models. They can help us to predict on a
case-by-case basis by navigating the tree. More often, prediction is done in an automatic
manner without much interference from the user: multiple new cases are passed through the
tree or rule set automatically, generating an output file with the predicted values. In an
exploratory mode, the decision tree shows insight about the relationships between
independent and dependent variables, thus helping the data investigation.
Overfitting
A decision tree which performs well for the given training set but fails for test tuples is
said to lack the necessary generalization. This is called the overfitting problem. Overfitting is
due to the presence of noise in the data, the presence of irrelevant data in the
training set, or a small number of training records. Overfitting reduces the
classifier accuracy drastically.
After a decision tree is formed, it must be explored. Sometimes the
exploration of the decision tree may reveal nodes or subtrees that are undesirable. This
happens because of overfitting. Pruning is a common technique that is used to remove the
splits and subtrees created by overfitting. Pruning is controlled by user-defined
parameters that cause splits to be pruned. By controlling these parameters, users can
experiment with the tree induction to get an optimal tree, ideal for effective prediction.
The methods that are used commonly to avoid over fitting are
1. Pre-pruning: In this strategy the growth of the tree beyond a certain
point is stopped, based on measures like chi-square or information gain. The decision
tree is then assessed for goodness of fit.
2. Post-pruning: In this technique, the tree is first allowed to grow. Then post-pruning
techniques are applied to remove the unnecessary branches. This procedure leads
to a compact and reliable tree. The post-pruning measures involve
cross validation of data
Minimum Description Length (MDL)
computation of statistical bounds.
In practice both pre-pruning and post-pruning are combined to achieve
the desired result. Post-pruning requires more effort than pre-pruning.
Although the pruned trees are more compact than the original, still they may be large
and complex. Typically, two problems that are encountered are repetition and replication.
Repetition is a repeated testing of an attribute (like if age < 30 and then if age < 15 and so
on). In replication duplicate sub trees are present in the tree. A good use of multivariate
splits based on combined attributes would solve these problems.
Testing a Tree
Decision tree should be tested and validated prior to integrating with the decision
support systems. It is also desirable to test the tree periodically over a larger period to
ensure that the decision tree maintains the desired accuracy as classifiers are known to be
unstable.
3.4 BAYES CLASSIFICATION
A Bayesian classifier is a probabilistic model which estimates the class of new data.
Bayes classification is both descriptive and predictive.
Let us assume that the training data has d attributes. The attributes can be numeric or
categorical. Let each point xi be the d-dimensional vector of attribute values
xi = {xi1, xi2, ..., xid}.
The Bayes classifier maps xi to ci, where ci is one of the k classes
c = {c1, c2, ..., ck}
Let p(ci | xi) be the conditional probability of assigning xi to class ci; the class with the highest
such probability is chosen.
The Bayes theorem can be given as
p(cj | xi) = p(xi | cj) * p(cj) / p(xi)
From the training set, the values that can be determined are p(xi), p(xi | cj) and p(cj).
From these, Bayes theorem helps us to determine the posterior probability p(cj | xi).
The procedure is given as
- First determine the prior probability p(cj) for each class by counting how often that
class appears in the data set.
- Similarly, the number of occurrences of xi can be counted to determine p(xi).
- Similarly, p(xi | cj) can be estimated by counting how often xi occurs within class cj.
From these derived probabilities, a new tuple is classified using the Bayesian method
of combining the effects of the different attribute values.
Suppose the tuple ti has p independent attribute values {xi1, xi2, ..., xip}. Then we can
estimate p(ti | cj) by
p(ti | cj) = product (k = 1 to p) of p(xik | cj)
In other words, ti is assigned to class cj
iff p(cj | x) > p(ci | x) for every class ci not equal to cj.
Advantages
1. It is easy to use.
2. Only one scan of the database is used.
3. Missing values won't cause much of a problem, as they are ignored.
4. In a dataset with simple relationships, this technique would produce good
results.
Disadvantages
1. The Bayes classifier assumes that the attributes are always independent. This assumption
does not hold for many real-world datasets.
2. Basic Bayes classification will not work directly with continuous data.
Example
Let us assume the simple data set shown in Table 3.4 and apply the Bayesian
classifier to predict the class of X = (1, 1).
Table 3.4 Sample Data set
Here the number of tuples in class c1 is 2 and in class c2 is 3.
p(c2) = 1/3 ; p(c1) = 1/2. The conditional probabilities are estimated as
p(a1 = 1 | c1) = 1/2 ; p(a1 = 1 | c2) = 1/2
p(a2 = 1 | c1) = 1/2 ; p(a2 = 1 | c2) = 2/3
p(x | c1) = p(a1 = 1 | c1) * p(a2 = 1 | c1) = 1/2 * 1/2 = 1/4
p(x | c2) = p(a1 = 1 | c2) * p(a2 = 1 | c2) = 1/2 * 2/3 = 1/3
This is used to evaluate
p(c1 | x) = 1/4 * 1/2 = 1/8
p(c2 | x) = 1/3 * 1/3 = 1/9
p(c1 | x) > p(c2 | x)
Hence the sample is predicted to be in class c1.
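The calculation above can be sketched in code as below. The training records here are hypothetical (they are not Table 3.4), and no smoothing is applied, so a zero count makes the whole product zero.

from collections import defaultdict

def train_naive_bayes(records):
    # records: list of (attribute_tuple, class_label).
    class_counts = defaultdict(int)
    cond_counts = defaultdict(int)            # (attribute index, value, class) -> count
    for attrs, label in records:
        class_counts[label] += 1
        for i, v in enumerate(attrs):
            cond_counts[(i, v, label)] += 1
    return class_counts, cond_counts

def classify(x, class_counts, cond_counts):
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total                 # prior p(cj)
        for i, v in enumerate(x):
            score *= cond_counts[(i, v, label)] / count   # p(ai = v | cj)
        if score > best_score:
            best, best_score = label, score
    return best

data = [((1, 1), "c1"), ((1, 0), "c1"), ((0, 1), "c2"), ((1, 1), "c2"), ((0, 0), "c2")]
model = train_naive_bayes(data)
print(classify((1, 1), *model))               # "c1" for this toy data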
Bayesian Network
A Bayesian network is a graphical model. This model shows the probabilistic
relationships among the variables of a system. A Bayesian network consists of vertices and
edges. The vertices represent the variables; the edges represent their interrelationships and have
associated probability values. By using the conditional probabilities, we can reason about and
calculate the unknown probabilities. Figure 3.9 shows a simple Bayesian network
Figure 3.9: Simple Bayesian Network.
Here the nodes or vertices are variables. The edge connects two nodes. If causal
relationship exists, that is A is the cause of B, then the edge would be directed. Otherwise
the relationship is just correlation, and then the edge would be undirected. In the above
example, there exists a correlation relationship between A and B, and B and C. There is no
direct relationship between A and C, but still there may be some impact since both are
connected to B.
In addition, each node has an associated conditional probability table, which stores the
probabilities that are used to reason or make inferences within the system.
The probability here is a Bayesian probability, a subjective measure that indicates the
degree of belief in an event. The difference between a physical probability and a Bayesian
probability is that for a Bayesian probability, repeated trials are not necessary. For example,
consider a game whose outcome cannot be determined from earlier trials, and whose
trials cannot be repeated to make probability calculations. Here the probability is a degree of belief.
Apart from the Bayesian probability, an edge can represent a causal relationship between
variables. If manipulating a node X by some action sometimes changes the value of Y,
then X is said to be a cause of Y.
The causal relationships also indicate strength. This is done by associating a number
with the edge.
The conditional probability for this model is P(Xi | Pai), where Pai is the set of
parents that render Xi independent of its other non-descendants. Then the joint probability
distribution can be given as
P(x1, ..., xn) = product (over i) of P(Xi | Pai)
Using this, by passing evidence up and down the Bayesian network, known as
belief propagation, one can easily calculate the probability of events.
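As a small illustration of the factored joint probability, the sketch below assumes a directed chain A -> B -> C with made-up conditional probability tables; it is not the network of Figure 3.9.

# Hypothetical conditional probability tables for a chain A -> B -> C.
p_A = {True: 0.3, False: 0.7}
p_B_given_A = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}
p_C_given_B = {True: {True: 0.5, False: 0.5}, False: {True: 0.2, False: 0.8}}

def joint(a, b, c):
    # P(A, B, C) = P(A) * P(B | A) * P(C | B): the product of P(Xi | parents(Xi)).
    return p_A[a] * p_B_given_A[a][b] * p_C_given_B[b][c]

# Probability of an event such as C = True, by summing the joint over the other variables.
p_c_true = sum(joint(a, b, True) for a in (True, False) for b in (True, False))
print(p_c_true)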
Advantages
1. The bidirectional message-passing architecture is inherent in the Bayesian network,
and learning from evidence is a form of unsupervised learning.
2. It can handle incomplete or missing data, because the Bayesian network models
only the dependencies among the variables.
3. It can combine domain knowledge and data; encoding causal prior knowledge
is straightforward in a Bayesian network.
4. By using the graphical structure, Bayesian networks ease many of the theoretical
and computational difficulties of rule-based systems.
5. It simulates human-like reasoning.
3.5 OTHER CLASSIFICATION METHODS
3.5.1 Rule Based Classification
One straight forward way to perform classification is to generate rules. Rules are of
the form IF condition THEN conclusion.
The IF-part contains the rule antecedent or precondition; the THEN-part is the rule consequent.
If the antecedent holds true for a tuple of the database, the rule antecedent is satisfied and the
rule covers the tuple.
Decision Rules are generated using a technique called Covering algorithm. The
algorithm chooses best attribute to perform classification based on the training data. The
algorithm chooses an attribute that minimizes the error and picks that attribute and generate
a rule. For the remaining attribute, the probability is recalculated, and the process is repeated.
Find the class that occurs most in the training data.
For each value of the attribute
Count how often the class appears
Find the frequent class
Form a rule like IF attribute = value THEN class
Calculate the error rate. Calculate the errors that occur in the training data, that is the
number of instances that do not have the majority class.
Choose the rules with the smallest error rate.
The quality of the rule is based on the measures Coverage and accuracy.
Let |D| be the total number of tuples in the database. If a rule R covers n_covers tuples, of which
n_correct are correctly classified (as certified by the domain expert), then
Coverage(R) = n_covers / |D|
Accuracy(R) = n_correct / n_covers
For example, if a rule covers 14 tuples and 7 of them are correct as per the domain expert,
then the accuracy is 7/14 = 50%.
What is the difference between Decision Tree and Decision rule?
- Rules have no order while all decision trees have some implicit order
- Only one class is examined at a time for decision rules while all the classes are
examined for the decision tree
3.5.2 K-Nearest Neighbor Technique
The similarity measures can be used to determine the alikeness of different tuples in the
database.
A representative of every class is selected, and classification is performed by assigning
each tuple to the class to which it is most similar.
Let us assume that the classes are {c1, c2, ..., cn} and the database D has tuples {t1, t2, ..., tn}.
The K-nearest neighbor problem is to assign ti to the class cj such that Similarity(ti, cj) is greater
than or equal to Similarity(ti, ci) for every class ci not equal to cj.
The algorithm can be stated as
1. Choose the representative of the class. Normally the center or centroid of the
class is chosen as the representative of the class.
2. Compare the test tuple and the center of the class
3. Classify the tuple to the appropriate class
One of the most common schemes is the K-nearest neighbor technique, where K is the number
of nearest neighbors considered. This technique assumes that the entire training data, along with the desired
classification, is available; thus, the input data itself serves as the model.
When a classification is required, only the K nearest neighbors are considered. The
test tuple is placed in the class that contains the most items from the set of K closest items.
Thus the KNN technique is extremely sensitive to the value of K. Normally it is chosen
such that K <= sqrt(number of training items). The default value is normally 10.
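A minimal sketch of the K-nearest neighbor classification described above is given below, assuming numeric attribute tuples, Euclidean distance and a simple majority vote; the training points are made up.

import math
from collections import Counter

def knn_classify(training, query, k=3):
    # training: list of (point, class_label); point is a tuple of numeric attribute values.
    distances = sorted((math.dist(point, query), label) for point, label in training)
    # Majority vote among the K closest training items.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "c1"), ((1.2, 0.8), "c1"), ((3.0, 3.5), "c2"), ((3.2, 3.0), "c2")]
print(knn_classify(train, (1.1, 1.0), k=3))   # "c1" for this toy data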
3.5.3 Neural Networks
Neural networks are also used to classify the data. It is a valuable tool for classification.
The classification problem using the neural network involves these steps
1. Determine the input nodes, output nodes and hidden nodes. This is determined by
the domain expert and the problem.
2. Determine the weights and functions
3. For each training tuple, propagate it through the network and obtain the result.
Evaluate the result by comparing it with the actual result.
If the prediction is accurate, reinforce the weights so that this prediction receives a higher
output weight next time.
If the prediction is not accurate, then adjust the weights to reduce the error.
4. Continue this process till the network makes accurate classifications.
Hence the important parameters of the neural network are
- Number of Nodes
- Number of Hidden nodes
- Weighted functions
- Learning technique for adjusting weights
The advantages and disadvantages of neural network based classification are tabulated in Table 3.5.
Table 3.5 Pros and Cons of Neural Network based Classification
Advantages: (1) more robust in handling noise; (2) improves performance by learning; (3) low error rate; (4) high accuracy.
Disadvantages: (1) difficult to understand the rule generation process; (2) input values should be numeric; (3) generating rules is not a straightforward process.
3.5.4 Support vector Machine
Support vector machine is a popular classification method for classifying both linear
and nonlinear data. A support vector machine uses a nonlinear mapping to transform training
data into a higher dimension. Then it searches for a linear separating hyperplane (or a
decision boundary) to separate the data. It searches for the hyperplane using support
vectors (essential training tuples and margins).
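One common way to experiment with a support vector machine is through the scikit-learn library; the snippet below is a minimal sketch under the assumption that scikit-learn is installed (the library and the toy data are not part of this text).

```python
# Minimal sketch using scikit-learn's SVC. The RBF kernel performs the nonlinear
# mapping implicitly, and the fitted model keeps only the support vectors.
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [1, 0], [0, 1]]   # toy training tuples (XOR-like, not linearly separable)
y = [0, 1, 1, 0]                        # class labels

clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
print(clf.predict([[0.9, 0.9]]))        # predicted class for a new tuple
print(len(clf.support_vectors_))        # number of support vectors retained
```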
3.5.5 Genetic Algorithms
Genetic algorithms model natural evolution. An initial population is created, consisting of
randomly generated rules. A rule such as IF A AND NOT B THEN C can be encoded as the bit
string 101. The group of strings forms a population.
Based on a fitness function, the best rules of the current population are selected. The
genetic algorithm then applies genetic operators such as crossover and mutation to form
new rules: crossover swaps substrings between a pair of rules to form a new pair of rules,
while mutation simply inverts bits in a rule string. This results in a new population, and the
process of generating new populations is continued until a population satisfies a predefined
fitness threshold.
3.5.6 Rough Set
Rough set theory is based on equivalence classes. It tries to establish equivalence classes
within the given training data and to approximate classes that cannot be distinguished in
terms of the available attributes. Each such class is approximated by two sets: a lower
approximation and an upper approximation.
The lower approximation consists of tuples that certainly belong to class C, while the
upper approximation consists of tuples that cannot be described as not belonging to class C.
Decision rules can then be generated for each class.
3.5.7 Fuzzy approach
Traditional systems are based on binary logic. But many terms like low, medium and
high are fuzzy: depending on the context, these terms take different values. Fuzzy logic
allows fuzzy thresholds or boundaries to be defined for each category and, unlike a traditional
set, an item can belong to more than one class.
Fuzzy logic is useful in dealing with vague or imprecise measurements and can be used by
data mining systems to classify tuples. Sometimes more than one fuzzy rule is applicable in
a given context. Each rule contributes a vote for membership in the categories; the votes are
summed and combined, and the fuzzy output is then defuzzified to get a crisp output.
3.6 PREDICTION MODELS
3.6.1 Linear Regression
Numeric prediction of a continuous variable for a given input is called the prediction
problem. One of the commonest methods of prediction is linear regression.
Regression analysis is used to model the relationship between one or more independent
variables and a dependent variable. The data can be displayed in a scatter plot to have a
feel of the data. The X-axis of the scatter plot is independent or input or predictor variables
and Y-axis of the scatter plot is output or dependent or predicted variable.
The scattered data could be fitted by a straight line. In the simplest form, the model
can be created as
Y = W0 + W1 · X
Here W0 and W1 are weights, called regression coefficients, which specify the Y-intercept
and the slope of the line. The coefficients can be solved for by the method of least squares,
which estimates the best fitting line (because many lines are possible!) by minimizing the
error between the actual data and the estimate.
If D is the training set consisting of all the (X, Y) values, then

W1 = Σ_{i=1..|D|} (xi − X̄)(yi − Ȳ) / Σ_{i=1..|D|} (xi − X̄)²

W0 = Ȳ − W1 · X̄

where X̄ and Ȳ are the mean values of the X and Y data. The coefficients provide
approximations.
Let us consider an example (refer to Table 3.6).
Table 3.6 Sample Data for Regression
The data can be shown as a scatter plot in Figure 3.10.
Figure 3.10 Scatter Plot
Let us model the relationship as Y = W0 + W1 X. Here X̄ = 34/4 and Ȳ = 324/4.
Substituting these values into the equations above, we get
Y = 0.925 X + 0.194
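The least-squares formulas above translate directly into a few lines of code. The sketch below uses illustrative x and y values, not the (unavailable) data of Table 3.6.

```python
# Minimal least-squares sketch for Y = w0 + w1 * X, following the formulas above.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.9, 3.1, 3.9]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
w0 = y_bar - w1 * x_bar

print(w0, w1)             # fitted intercept and slope
print(w0 + w1 * 5.0)      # prediction for a new input X = 5
```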
Multiple Linear Regression
This is an extension of the linear regression problem. If there are n predictor variables or
attributes describing a tuple, then the output is a combination of the predictor variables.
Hypothetically, if the attributes are A1, A2 and A3, then the linear regression model would be
Y = w0 + w1 x1 + w2 x2 + w3 x3
Using the same method as above, the equations can be solved to obtain the output
variable value.
3.6.2 Non linear regression
In real world data, often there may not exist any linear dependence. In such cases, one
can apply transformations to convert nonlinear models into linear models.
For example, consider the cubic polynomial
y = w0 + w1 x + w2 x² + w3 x³
By defining new variables x1 = x, x2 = x² and x3 = x³, this becomes a linear model in
x1, x2 and x3.
Other regression based methods are logistic regression models where probability of
an event occurring is a linear function of a set of predictor variables. Log-linear models
approximate multidimensional probability distributions.
3.7 TESTING AND EVALUATION OF CLASSIFIER AND PREDICTION
MODELS
The ability of the classification models to correctly determine the class of a randomly
selected data is known as accuracy. It can be said as the probability of correctly classifying
test data. Classification accuracy estimation is a difficult task as different training sets often
lead to somewhat different models and therefore testing is very important.
Accuracy can be determined using several metrics like sensitivity, specificity, precision
and accuracy. The methods for estimating errors include hold-out, random sub-sampling,
K-fold cross validation and leave one-out.
Let us assume that the test data has a total of T objects. If C objects are classified
correctly, then the error rate is (T − C) / T.
A confusion matrix is often used to represent the results of the tests.
Cross validation is a model evaluation method that gives an indication of how well the
learner will perform new predictions for test data. One way to accomplish cross validation
is to remove some data during the training phase. Then the data that was removed during
the training phase is used to test the performance of the classifier on the unseen data. This
is the basic idea of cross validation.
The holdout method is the simplest method of cross validation. Here the data set is
separated into two sets. One set is called the training set and another set is called testing
set. The function approximator fits a function using the training set. This function is then
asked to predict the output values for the unseen data in the test set. The errors it makes are accumulated
to give the mean absolute test set error. This error is used to evaluate the model.
The advantage of this method is that it is easy to compute, but the evaluation can have a
high variance, and the result depends on how the division into training and test sets is made.
K-fold cross validation is an improvement over the previous method. Here the data
set is divided into k subsets. The holdout method is repeated k times. Each time, k-1 sets
are considered as the training set and the remaining one set is treated as test data set. Then
the process is repeated across all the data set for k trails. The overall performance of the
classifier is the average error across all k trials.
The advantage of this method is that the performance does not depend on how the partition
is made, since every data point appears k−1 times in a training set and exactly once in a test
set. The variance of the resulting estimate is reduced as the number of trials k increases.
The value of k is normally set to 10; hence this method is also called 10-fold cross validation.
The disadvantage of this method is that it takes k times as much computation to make an
evaluation.
Leave-one-out cross validation is K-fold cross validation taken to its logical extreme:
K is equal to the total number of points in the data set, N. In each trial the function
approximator is trained on all the data except one point, and a prediction is made for that
point. The average error over all N trials is then computed and used to evaluate the model.
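A minimal sketch of K-fold cross validation is given below; the train and evaluate callables are placeholders for whatever classifier and error measure are being evaluated, and are assumptions for the example.

```python
# Minimal K-fold cross validation sketch. `train` and `evaluate` are placeholder
# functions standing in for any classifier and error measure.
def k_fold_error(data, k, train, evaluate):
    fold_size = len(data) // k
    errors = []
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]                 # one fold held out
        training = data[:i * fold_size] + data[(i + 1) * fold_size:]   # remaining k-1 folds
        model = train(training)
        errors.append(evaluate(model, test))
    return sum(errors) / k      # average error across the k trials
    # (leftover points beyond k * fold_size are ignored in this simple sketch)

# usage sketch: 10-fold cross validation
# avg_err = k_fold_error(my_data, 10, my_train_fn, my_eval_fn)
```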
3.7.1 Evaluation Metrics
There are several metrics that can be used to describe the quality and usefulness of a
classifier. Accuracy is one characteristic. Accuracy can be expressed through a matrix
called a confusion matrix, which summarizes how the classification is carried out over the
data. For a two-class problem the matrix reports four counts:
TP = the number of positive specimens that are classified correctly. FP = the number of
negative specimens that are classified as positive; these are false alarms. FN = the number of
positive specimens that are classified as negative; these are misclassified (missed) samples.
TN = the number of true negative specimens that the classifier is able to diagnose correctly.
The results can be better understood by relating them to a hypothetical classifier for
cancer data. If the classifier result matches the actual result for cancer patients, it is called
a true positive. If the results match for the non-cancer patients, it is called a true negative.
A false positive is a false alarm: the classifier says that the patient has cancer when, in
reality, the patient does not. A false negative is one where the classifier reports that a patient
is normal when the patient is indeed a cancer patient. A false negative has serious legal and
sociological impacts and is an error that the classifier should seek to avoid.
Sensitivity
The sensitivity of a test is the probability that it will produce a true positive result when
used on the positive cases of a test data set. It can be calculated as
Sensitivity = TP / (TP + FN)
Specificity
The specificity of a test is the probability that a test will produce a true negative result
when used on test data set.
Specificity = TN / (TN + FP)
Positive Predictive Value
The positive predictive value of a test is the probability that an object is truly positive
when a positive test result is observed.
Positive predictive value = TP / (TP + FP)
Negative Predictive Value
The negative predictive value of a test is the probability that an object is truly negative
when a negative test result is observed.
Negative predictive value = TN / (TN + FN)
Precision
Precision is defined as Precision = TP / (TP + FP), which is the same as the positive
predictive value.
The accuracy of the classifier can be expressed in terms of sensitivity and specificity:
Accuracy = Sensitivity · (P / (P + N)) + Specificity · (N / (P + N))
where P = TP + FN is the number of positive samples and N = FP + TN is the number of
negative samples.
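Putting the confusion-matrix counts together, a small helper such as the one below can compute all of these metrics at once; the numbers in the example call are invented.

```python
# Minimal sketch of the metrics above from the four confusion-matrix counts.
def classifier_metrics(tp, fp, fn, tn):
    p, n = tp + fn, fp + tn                      # actual positives / negatives
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision   = tp / (tp + fp)                 # = positive predictive value
    npv         = tn / (tn + fn)
    accuracy    = sensitivity * (p / (p + n)) + specificity * (n / (p + n))
    return sensitivity, specificity, precision, npv, accuracy

# e.g. 90 true positives, 10 false alarms, 5 missed, 95 true negatives
print(classifier_metrics(90, 10, 5, 95))
```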
Like classifier model, the predictor model accuracy is determined using loss functions.
The most common loss functions are
Absolute error: |Yi − Yi'|
Squared error: (Yi − Yi')²
When the error is computed over the entire test set of d predictions, we obtain the mean errors:
Mean absolute error = Σ_{i=1..d} |Yi − Yi'| / d
Mean squared error = Σ_{i=1..d} (Yi − Yi')² / d
The mean squared error has the property of exaggerating the presence of outliers, while the
mean absolute error does not. The errors can also be normalized by dividing the total loss
by the corresponding loss obtained when simply predicting the mean of the actual values.
Relative absolute error = Σ_{i=1..d} |Yi − Yi'| / Σ_{i=1..d} |Yi − Ȳ|
Relative squared error = Σ_{i=1..d} (Yi − Yi')² / Σ_{i=1..d} (Yi − Ȳ)²
The root relative squared error can be obtained by taking the square root of the relative
squared error.
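The predictor loss functions above can be computed as in the following sketch; the actual and predicted values in the example are invented.

```python
# Minimal sketch of the predictor loss functions above.
def prediction_errors(actual, predicted):
    d = len(actual)
    y_bar = sum(actual) / d
    mae = sum(abs(y - yp) for y, yp in zip(actual, predicted)) / d
    mse = sum((y - yp) ** 2 for y, yp in zip(actual, predicted)) / d
    rae = sum(abs(y - yp) for y, yp in zip(actual, predicted)) \
          / sum(abs(y - y_bar) for y in actual)
    rse = sum((y - yp) ** 2 for y, yp in zip(actual, predicted)) \
          / sum((y - y_bar) ** 2 for y in actual)
    return mae, mse, rae, rse

print(prediction_errors([3.0, 5.0, 7.0], [2.5, 5.5, 8.0]))
```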
The other criteria for evaluating the classifier models apart from accuracy factor are
1. Speed
The time to construct the model and the time to use the model are often referred to as
speed. This time should be minimized.
2. Robustness
Classifiers can be unstable, and poor quality data often results in poor classification.
The ability of the classifier to provide good results in spite of errors or missing data is an
important evaluation criterion.
3. Scalability
The algorithm should be able to handle large data sets; it is required to handle very
large data efficiently.
4. Goodness of fit
The models should fit the problem nicely. For example, a decision tree should have
right size and compactness to have high accuracy.
3.7.2 Model Selection
ROC curves are a useful tool for comparing classification models. ROC is an acronym for
Receiver Operating Characteristic. An ROC curve is a plot of the true positive rate (sensitivity)
against the false positive rate for a given model.
The Y-axis is the true positive rate and the X-axis is the false positive rate. We start from
the bottom left-hand corner. For each true positive case we move up and plot a point; for
each false positive case we move right and plot a point. This process is repeated until the
complete curve is drawn.
If the ROC curve is close to the diagonal line, the classifier is less accurate. The area under
the curve indicates the accuracy of the model; a model is perfect if the area under its ROC
curve is one. A sample ROC curve is shown in Figure 3.11.
Figure 3.11 Sample ROC curve.
3.8 CLUSTERING MODELS
Clustering is a technique of partitioning the objects that have many attributes into
meaningful disjoint subgroups. The objects in the subgroups are similar to each other while
differ from the objects in other clusters significantly. Clustering is a process also called
segmentation (or dissection). Hence the objective of clustering process is
- To discover nature of the sample
- To see whether the data falls into classes.
Cluster analysis is different from classification. In classification, the classes are predefined:
the user knows the class types, and the training data contains samples along with their class
labels. Cluster analysis, in contrast, is unsupervised learning, where the classes or clusters
are not known to the user. Thus the salient points of clustering versus classification are:
- The best number of clusters is not known in advance in the clustering process
- There is no a priori knowledge in clustering
- Cluster results are dynamic
A good example of the clustering process is grouping flowers into a meaningful hierarchy
of groups. In business, for example, the customer database can be grouped based on
many questions like
a. What customers buy?
b. How they buy?
c. How much money they spend?
d. Customer Lifestyle
e. Purchasing behavior
f. Demographic characteristics (similarly, in medicine, clustering is used to discover subgroups of diseases).
Problems of clustering
- Outlier handling is difficult, as outliers may themselves form a solitary cluster. When
forced, clustering algorithms make an outlier part of some cluster, which causes problems.
- Dynamic data handling is difficult.
- Interpreting the cluster results require semantic meanings and evaluating clustering
results is a bit difficult task.
The aim of the clustering process is exploratory: it groups data so that in-group variation
is small and between-group variation is large. Thus the clustering process often finds some
interesting clusters.
The issues of the clustering process are
- Desired features of clustering algorithm
- Measurement of similarity and dissimilarity
- Categorization of clustering algorithms
- Clustering of very large data
- Evaluation of clustering algorithms
The desirable characteristics of clustering algorithms are
- Require no more than one scan of the database
- Online ability to say the present status and best answer so far.
- Suspendable | stoppable | resumable
- Incremental addition | deletion of instances
- work with limited memory
- performing different kinds of scanning the databases
- process each tuple only once
3.8.1 Types of Data and Clustering Measures
Clustering algorithms are based on similarity measures between objects. Similarity
measures can be obtained directly from the user or can be determined indirectly from
vectors or characteristics describing each object.
Therefore, it becomes necessary to have a clear notion of similarity. Often it is possible
to derive dissimilarity from similarity measure. For example, if s(i,k) denotes the similarity,
then the dissimilarity d(i,k) can be obtained from similarity using some transformations.
Some transformations are like
s_ik = 1 / (1 + d_ik)
d_ik = √( 2 (1 − s_ik) )
where 1 ≥ s_ik ≥ 0 and s_ik = 1 denotes the highest similarity; the resulting d_ik has the
properties of a distance.
The terms similarity and dissimilarity are often denoted together by the term proximity.
The similarity measures are referred by the term distance informally. Often the term
metric is used in this context. A metric is a dissimilarity measure that satisfies the
following conditions:
1. d(i,j) ≥ 0 for all i and j
2. d(i,j) = 0 if i = j
3. d(i,j) = d(j,i) for all i and j
4. d(i,j) ≤ d(i,k) + d(k,j) for all i, j and k (the triangle inequality)
Hence not all distance measures are metrics, but all metrics are distances.
Distance measures
The distance measures vary depending on the type of data. Chapter two can be referred to
for a detailed explanation. For convenience, the data types that are usually encountered are
given in Table 3.7.
Table 3.7 Sample Data Types.
Based on the data type, the distance measures vary. Some of the distance measures
are shown in the Table 3.8.
Table 3.8 Sample Distance Measures.
Hence the distance measures are used to distinguish one object from other and also
useful for grouping the objects.
Quantitative variables
The Euclidean distance is one of the most important distance measures. The Euclidean
distance between objects Oi and Oj is calculated as
Distance(Oi, Oj) = √( Σ_k (O_ik − O_jk)² )
Suppose the coordinates of the objects O1 and O2 are (5,6) and (8,9); then the Euclidean
distance is
D(O1, O2) = √( (5 − 8)² + (6 − 9)² ) = √(9 + 9) = √18
Advantages of the Euclidean distance: distances do not change with the addition of new
objects. However, if the units change, the resulting Euclidean or squared Euclidean distances
change drastically.
City-block (Manhattan) distance
The Manhattan distance measures the average distance across dimensions and dampens
the effect of outliers:
Manhattan Distance(Oi, Oj) = (1/n) Σ_k |O_ik − O_jk|
For the same two objects: (1/2) ( |5 − 8| + |6 − 9| ) = (1/2)(6) = 3
Chebyshev distance
Distance(Oi, Oj) = max_k |O_ik − O_jk|
Distance(O1, O2) = max( |5 − 8|, |6 − 9| ) = max(3, 3) = 3
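For concreteness, the three distance measures can be coded as below, reproducing the (5,6) / (8,9) example.

```python
# Minimal sketches of the three distance measures above.
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):                   # averaged across dimensions, as above
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

o1, o2 = (5, 6), (8, 9)
print(euclidean(o1, o2))    # sqrt(18) ~ 4.24
print(manhattan(o1, o2))    # 3.0
print(chebyshev(o1, o2))    # 3
```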
Binary attributes
Binary attributes will have only two values. Distance measures are different if binary
attributes are used.
For example, quantitative attributes such as height and weight can be converted into
binary attributes using thresholds:
x1 = 1 if height ≥ 165 cm, otherwise x1 = 0
x2 = 1 if weight ≥ 70 kg, otherwise x2 = 0
Then, in the table of matches and mismatches between two individuals:
number of attributes where both individuals score 1: a = 1
Individual 1 = 1 and Individual 2 = 0: b = 2
Individual 1 = 0 and Individual 2 = 1: c = 3
Individual 1 = 0 and Individual 2 = 0: d = 1
The following distance measures can then be applied, depending on the context. Many of
the measures treat the matches and mismatches between 0 and 1 differently when calculating
the distance:
1. Equal weights for 1-1 matches and 0-0 matches
2. Double weight for 1-1 and 0-0 matches
3. Double weight for unmatched pairs
4. No 0-0 matches in the numerator
5. 0-0 matches treated as irrelevant
6. No weight for 0-0 but double weight for 1-1
7. No 0-0 matches in the numerator or denominator, with double weight for unmatched pairs
8. Ratio of matches to mismatches, with 0-0 matches excluded
Categorical Data
In many cases, we use categorical values. It serves as a label for attribute values. It is
just a code or symbol to represent the values.
For example, for the attribute Gender, a code 1 can be Female and 0 can be male.
Similarly, the job description can be 1 for representing the manager, 2 for junior manager
and 3 for clerk (Refer Figure 3.12 for hierarchy of categorical data).
To calculate the distance between two objects represented by categorical variables, we
need to consider the number of categories. If there are just two categories, techniques like
simple matching, the Jaccard coefficient or the Hamming distance can be used. If there are
more than two categories, the categories can be transformed into a set of binary dummy
variables.
For example
Figure 3.12 Categorical Data
Then John, who is working as a manager, can be coded as John = <1, <1,0,0>> and
Peter, who is working as a clerk, can be coded as Peter = <1, <0,0,1>>.
The distance between John and Peter is then (0, 2/3), where the distance between two
objects is the ratio of the number of unmatched dummy variables to the total number of
dummy variables.
Alternatively, each category can be encoded with a smaller number of binary dummy
variables. If the number of categories is C, then V variables can be assigned to each
category, where V satisfies
V = Ceiling(Log C / Log 2)
For example, for the category Job Designation = Manager / Junior Manager / Clerk,
the number of variables that can be assigned is
V = Ceiling(Log 3 / Log 2) = Ceiling(1.58) = 2
Therefore the representation of job designation would be
Manager = [1 1]
Junior Manager = [1 0]
Clerk = [0 1]
So, for the previous example, John would be (1, [1,1]) and Peter would be (1, [0,1]).
Hence the distance is the ratio of unmatched to total dummy values, and the distance
between John and Peter is (0, 1/2).
Another measure that is used for finding the distance between categorical variables is the
percent disagreement. It can be defined as
Distance(Oi, Oj) = 100 × [number of attributes on which O_ik ≠ O_jk] / [total number of attributes]
If Object 1 and Object 2 disagree on 2 of 4 attributes, then the distance = 100 × 2/4 = 50%.
Distance measure for Ordinal variables
Ordinal variables are like categorical variables, but they have an inherent order. For
example, if the job designation code is 1, 2 or 3, then code 1 is higher than code 2 and
code 2 is higher than code 3; the data is ranked as 1 >> 2 >> 3.
To compute similarity or dissimilarity, many distance measures such as the Chebyshev
(maximum) distance or the Minkowski distance can be used. There are also specialized
measures like the Kendall distance and the Spearman distance.
For ranked data, the distance can be seen as the spatial disorder between two vectors.
One vector, the pattern vector, carries the reference order and serves as a guide; the other
is the disorder vector. The distance is the number of operations it takes to turn the disorder
vector into the pattern vector.
For example, consider the preferences of three persons:
Person 1 = <Coffee, Tea, Milk>
Person 2 = <Tea, Coffee, Milk>
Person 3 = <Coffee, Tea, Milk>
The distance between person 1 and person 3 is zero, as all the preferences are the same.
For the distance between person 1 and person 2, either person 1 or person 2 can be taken
as the pattern vector.
If person 1 is considered as the pattern vector, then person 2 is the disorder vector. The
distance between the two is then calculated as the number of positions that differ, or using
a traditional distance measure such as the Chebyshev distance. The choice of pattern or
disorder vector does not matter much, because the distance between A and B is the same
as the distance between B and A (symmetry).
Mixed Types
Normally a database has attributes of many different data types. A preferable approach is
to process all variables together, performing a single cluster analysis.
If the data set contains p variables of mixed type, then the dissimilarity d(i,j) between
object i and object j can be given as
d(i,j) = Σ_{f=1..p} δij(f) dij(f) / Σ_{f=1..p} δij(f)
where δij(f) = 0 if there is no measurement of variable f for object i or object j, or if
xif = xjf = 0 and variable f is asymmetric binary; otherwise δij(f) = 1.
For a quantitative variable f:
dij(f) = |xif − xjf| / (max_h xhf − min_h xhf), where h runs over all nonmissing objects for
variable f.
For a categorical or binary variable f:
dij(f) = 0 if xif = xjf, and 1 otherwise.
For an ordinal variable f, the procedure is to compute the rank rif and then
zif = (rif − 1) / (Mf − 1), where Mf is the number of ordered states; zif is then treated as a
quantitative variable.
Vector Objects
For text classification, vectors are normally used. The similarity function for vector
objects can be defined as
S(X, Y) = Xᵗ Y / ( ||X|| ||Y|| )
where Xᵗ is the transpose of the vector X and ||X|| is the Euclidean norm (length) of the
vector X. S(X, Y) is the cosine of the angle between the vectors X and Y.
For example, if the vectors are <1,1,0> and <0,1,1>, then
S(X, Y) = (0 + 1 + 0) / ( √2 · √2 ) = 0.5
Positions where both bits are 1 contribute 1 to the numerator; otherwise they contribute zero.
A simple variation can also be used:
S(X, Y) = Xᵗ Y / ( Xᵗ X + Yᵗ Y − Xᵗ Y )
This is known as Tanimoto coefficient or Tanimoto distance often used in Information
retrieval.
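Both similarity measures follow directly from the dot product, as in this sketch, which reproduces the <1,1,0> / <0,1,1> example.

```python
# Minimal sketch of the cosine and Tanimoto similarities above.
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    return dot(x, y) / ((dot(x, x) ** 0.5) * (dot(y, y) ** 0.5))

def tanimoto(x, y):
    return dot(x, y) / (dot(x, x) + dot(y, y) - dot(x, y))

x, y = [1, 1, 0], [0, 1, 1]
print(cosine(x, y))     # 0.5
print(tanimoto(x, y))   # 1/3
```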
3.8.2 Categories of Clustering Algorithms
Figure 3.13 shows the classification of clustering algorithms. Broadly speaking, clustering
algorithms can be classified into the following categories: hierarchical methods, partitional
methods, density-based methods, grid-based methods and model-based methods.
Figure 3.13 Categories of Clustering Algorithm
Partitional methods use greedy procedures iteratively to obtain a single level of partitions.
Being based on a greedy approach, these methods often produce locally optimal solutions.
Formally speaking, there are n objects and the objective is to find k clusters such that each
object can be assigned to a cluster. The number of clusters is obtained from the user
beforehand and will not change while the algorithm runs. An object is assigned to only one
cluster, but objects can be relocated many times before the final assignment is made.
Hierarchical methods produce a nested partition of objects. Often the results are
shown as dendrograms. These methods are divided into two categories: agglomerative
methods, where each individual object is initially considered as a cluster and the closest
clusters are then merged repeatedly until a single cluster is obtained, and divisive methods,
which follow the opposite philosophy: a single cluster containing all objects is chosen and
repeatedly partitioned until the clusters can no longer be split into smaller clusters.
Density based methods use the philosophy that at least some minimum data points
should be present within the acceptable radius for each point of the cluster.
Grid based methods partitions the data space rather than the data point based on the
characteristics of the data. This method can handle continuous data and one great advantage
of this method is input data order has no role to play in the final clustering process.
Model based methods tries to cluster data that has similar probability distribution.
3.8.3 Partitional and Hierarchical Methods
Classical clustering algorithms are straightforward partitional algorithms. They search the
space of possible assignments of points to k clusters to find the one that minimizes (or
maximizes, depending on the chosen score function) the score. This is a typical combinatorial
optimization problem. The method is suitable for smaller data sets and generally not
applicable to larger data sets.
These are iterative improvement algorithms that employ a greedy approach. The general
procedure is
- Start with randomly chosen cluster centers; reassign points so as to improve the score
function.
- Recalculate the updated cluster centers and reassign points until there is no change in the
score function or in the cluster memberships.
Advantages of classical algorithms
- Simplicity
- At least guarantee local optimal maximum / minimum of the scoring function.
K-means
The K-means algorithm is a straightforward partitional algorithm. It obtains the value of K
(the number of clusters) from the user beforehand. It then chooses K cluster centers at
random and assigns each point to the cluster center that is nearest to it. The mean vectors
of the points assigned to each cluster are then recomputed and used as the new centers,
and this iterative process is continued until no instance changes its cluster.
The procedure of this algorithm is summarized as follows:
1. Determine the number of clusters, K, before the algorithm is started.
2. Choose K instances randomly. These are the initial cluster centers.
3. Assign each of the remaining instances to the closest cluster based on the Euclidean distance.
4. Recompute the new means of the clusters.
5. If the new means are the same as the old means, stop; otherwise take the new means as
the cluster centers and go to step 3.
The complexity of the K-means algorithm is O(K n I), where I is the number of iterations;
the complexity of computing a new cluster center is O(n).
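A compact sketch of the K-means procedure described above is given below (random initial centers, Euclidean distance, stop when the means no longer change); the toy points are invented.

```python
# Minimal K-means sketch following the steps above.
import random

def kmeans(points, k, iterations=100):
    centers = random.sample(points, k)                 # step 2: random initial centers
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                               # step 3: assign to closest center
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        new_centers = [                                # step 4: recompute the means
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:                     # step 5: stop when means are stable
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1, 2), (8, 8), (9, 8)]
print(kmeans(pts, 2))
```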
K Medoid
The K-medoid algorithm is a partitional algorithm whose goal is, given k, to find k
representatives (medoids) in the data set so that, when each object is assigned to the
closest representative, the sum of the distances between the representatives and the objects
is minimal. The algorithm is similar to K-means, but only actual data points can become
medoids, whereas in K-means any point in the space can become a mean point. Based on
the cost calculations, medoids are swapped or retained until there is no change for any
point with respect to the chosen medoids.
The procedure of the K-medoid algorithm is given below:
1. Arbitrarily choose K objects as the initial medoids (representatives).
2. Repeat until there is no change:
a. Assign each remaining object to the cluster with the nearest medoid.
b. Randomly select a non-medoid object O_random.
c. Compute the cost S of swapping a medoid Oj with O_random; if S < 0, then swap Oj
with O_random to form a new set of K medoids.
3. Done.
Hierarchical Clustering Algorithms
Hierarchical methods produce nested clustering tree and are classified into two
categories
- Agglomerative (merge)
- Divisive (divide)
A convenient graphical display of the clusters is a tree-like structure called a dendrogram.
Agglomerative
Agglomerative methods are often referred to as AGNES (AGglomerative NESting). They
are based on measures of distance between clusters: clusters are merged to reduce the
number of clusters, each time merging the two closest clusters. This is repeated until a
single cluster is obtained. Usually the initial point is that every cluster consists of a single
object.
The procedure is given as follows
1. Place each data instance into a separate cluster
2. Till a single cluster is obtained
a. Determine two most similar clusters
b. Merge the two clusters into a single cluster
3. Choose the clustering formed by one of the step 2 iterations as the final result.
Advantage of Hierarchical methods
No vector representation for each object
Easy to understand and interpret
Simplicity
Divisive methods
These algorithms are referred to as DIANA (DIvisive ANAlysis). They start with all objects
in a single cluster and repeatedly split, each time removing the elements whose removal
causes the least deterioration in the model, until every cluster contains a single point. The
splitting can be monothetic, where the split is done using one variable at a time, or polythetic,
where all of the variables together form the basis for splitting.
Disadvantages of divisive methods
- Computationally intensive.
- Less widely used.
Hierarchical Methods use distance measures for clustering purposes. Some of the
common algorithms and the distance measures are given the following Table 3.9.
Table 3.9 Distance Measures.
Single linkage
Consider the following array of distances D = [d_jk]:

      O1   O2   O3   O4
O1    0
O2    1    0
O3    8    2    0
O4    6    3    4    0

The minimum is 1, so the new cluster O1,2 is formed. The distances from O1,2 to the
remaining objects are

Distance(O1,2, O3) = min {Distance(O1, O3), Distance(O2, O3)} = min {8, 2} = 2
Distance(O1,2, O4) = min {Distance(O1, O4), Distance(O2, O4)} = min {6, 3} = 3

       O1,2   O3   O4
O1,2    0
O3      2     0
O4      3     4    0

The minimum is now 2, so O3 is merged with O1,2 to form O1,2,3, and

Distance(O1,2,3, O4) = min {Distance(O1,2, O4), Distance(O3, O4)} = min {3, 4} = 3

         O1,2,3   O4
O1,2,3    0
O4        3       0

The corresponding dendrogram is shown in Figure 3.14.
Figure 3.14 Dendrogram
Example using complete linkage:

Dist(O1,2, O3) = max {Dist(O1, O3), Dist(O2, O3)} = max {8, 2} = 8
Dist(O1,2, O4) = max {Dist(O1, O4), Dist(O2, O4)} = max {6, 3} = 6

       O1,2   O3   O4
O1,2    0
O3      8     0
O4      6     4    0

The smallest entry is now 4, so O3 and O4 are merged into O3,4, and

Dist(O1,2, O3,4) = max {Dist(O1,2, O3), Dist(O1,2, O4)} = max {8, 6} = 8

       O1,2   O3,4
O1,2    0
O3,4    8     0

The final dendrogram is shown in Figure 3.15.
Figure 3.15 Dendrogram for Complete Linkage
Average linkage:
This process is similar to the earlier cases, but the average of the distances is used.

Dist(O1,2, O3) = {Dist(O1, O3) + Dist(O2, O3)} / 2 = (8 + 2) / 2 = 5
Dist(O1,2, O4) = {Dist(O1, O4) + Dist(O2, O4)} / 2 = (6 + 3) / 2 = 4.5

       O1,2   O3   O4
O1,2    0
O3      5     0
O4      4.5   4    0

The smallest entry is 4, so O3 and O4 are merged into O3,4, and

Dist(O1,2, O3,4) = {Dist(O1,2, O3) + Dist(O1,2, O4)} / 2 = (5 + 4.5) / 2 = 4.75

       O1,2   O3,4
O1,2    0
O3,4    4.75  0

The final dendrogram is shown in Figure 3.16.
Figure 3.16 Dendrogram for Average Linkage
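The single and complete linkage examples above can be reproduced with a small agglomerative routine in which only the linkage function changes (min for single linkage, max for complete linkage); the sketch below uses the same 4-object distance matrix and is an illustrative assumption, not part of the text.

```python
# Minimal agglomerative sketch over the 4-object distance matrix used above.
from itertools import combinations

def agglomerate(dist, linkage=min):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        pair = min(combinations(clusters, 2),
                   key=lambda p: linkage(dist[i][j] for i in p[0] for j in p[1]))
        d = linkage(dist[i][j] for i in pair[0] for j in pair[1])
        merges.append((sorted(pair[0]), sorted(pair[1]), d))
        clusters = [c for c in clusters if c not in pair] + [pair[0] | pair[1]]
    return merges

# distance matrix for the four objects O1..O4 (0-indexed here)
D = [[0, 1, 8, 6],
     [1, 0, 2, 3],
     [8, 2, 0, 4],
     [6, 3, 4, 0]]

print(agglomerate(D, linkage=min))   # single linkage: merges at distances 1, 2, 3
print(agglomerate(D, linkage=max))   # complete linkage: merges at distances 1, 4, 8
```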
Advantages of hierarchical clustering:
1. The number of clusters need not be fixed in advance; it can be chosen from the dendrogram.
2. Easy to understand.
3. Easy to detect outliers.
3.8.4 Clustering in Larger Databases
The problems with the earlier algorithms are that they operate only within limited memory
and that they assume all the data is available at once.
But data mining involves processing of huge dataset. Hence the data mining algorithms
should be scalable. Hence a scalable clustering algorithm operates in the following manner
1. Read a subset of Database into main memory
2. Apply cluster techniques to data in main memory
3. Combining results with those from prior samples
4. The in-memory data is divided into
a. items that will always be needed, even if the next sample is brought in
b. items that can be discarded, with appropriate updates to the data being kept in order
to answer the problem
c. items that can be stored in a compressed format
5. If the termination criteria are not met, then repeat from step 1
Some of the scalable algorithms are BIRCH, ROCK and CHAMELEON. The
following sections deal with the highly scalable algorithms.
BIRCH
BIRCH is an acronym of Balanced Iterative Reducing and Clustering using Hierarchies. It
is an incremental algorithm that adjusts its memory requirements to the size of memory that
is available. The algorithm uses the concept of a clustering feature (CF), a triplet summarizing
information about sub-clusters.
CF = (N, LS, SS), where
LS = the linear sum of the N data points = Σ_{i=1..N} xi
SS = the squared sum of the N data points = Σ_{i=1..N} xi²

Example: for the points (1, 1), (2, 2), (3, 3):
N = 3
LS = (6, 6)
SS = (14, 14)
CF = (3, (6, 6), (14, 14))
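The CF triplet for this example can be computed as in the sketch below; the additivity of CF triplets is what allows BIRCH to merge sub-clusters cheaply.

```python
# Minimal sketch of a clustering feature CF = (N, LS, SS) for the points above.
def clustering_feature(points):
    n = len(points)
    ls = tuple(sum(dim) for dim in zip(*points))                  # linear sum per dimension
    ss = tuple(sum(x * x for x in dim) for dim in zip(*points))   # squared sum per dimension
    return n, ls, ss

print(clustering_feature([(1, 1), (2, 2), (3, 3)]))   # (3, (6, 6), (14, 14))

# CF triples are additive: CF(A u B) = (N_A + N_B, LS_A + LS_B, SS_A + SS_B)
```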
A CF tree is a balanced tree with two parameters:
1. B, the branching factor, which specifies the maximum number of children, and
2. T, the threshold, which specifies the maximum diameter of the sub-clusters stored at the
leaf nodes.
The characteristics of the CF tree are:
- it is built dynamically
- building the CF tree involves inserting each point into the correct leaf node
- the threshold condition is checked at each insertion
- if necessary, the leaf is split and the tree is rebuilt
- if memory is limited, merging with the nearest cluster takes place
The BIRCH algorithm can be written as:

Input:
D = {x1, x2, ..., xn} // set of elements
B // branching factor: maximum number of children
T // threshold: maximum diameter
Output:
K // set of clusters

For each xi in D do
  Determine the correct leaf node for xi for insertion
  If the threshold condition is not violated then
    Add xi to that cluster and update the CF triplets
  Else
    If there is room to insert xi then
      Insert xi as a single cluster and update the CF triplets
    Else
      Split the leaf node and redistribute the CF features
ROCK
ROCK is an acronym of RObust Clustering using linKs. It is a hierarchical clustering
method using the concept of links.
Traditional algorithms use distance measures for clustering. Sometimes, due to noise or
outliers, the clustering process yields very poor results. The approach of the ROCK algorithm
is that, instead of adopting a purely local (distance-based) approach, it uses the concept of
neighborhoods of individual pairs of points.
If two points share similar neighborhoods, then the two points belong to the same cluster
and can be merged. That is, two points Pi and Pj are neighbors if Similarity(Pi, Pj) ≥ theta,
where Similarity is a similarity function and theta is a user-specified threshold. The number
of common neighbors of Pi and Pj is called the number of links between them; if the number
of links is large, they belong to the same neighborhood. For transaction data this is
approximately equivalent to the Jaccard coefficient.
For a transaction database, the similarity function between transactions Ti and Tj can be
given as
Similarity(Ti, Tj) = |Ti ∩ Tj| / |Ti ∪ Tj|
The steps of ROCK can be summarized as
1. Using the idea of data similarity and neighborhood concept, a sparse graph is
constructed
2. Perform agglomerative hierarchical clustering on the sparse graph
3. Evaluate the model.
ROCK is suitable for very large databases.
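A rough sketch of the neighborhood-and-links idea, using the Jaccard similarity above, is shown below; the transactions and the threshold value are illustrative assumptions.

```python
# Minimal sketch of the ROCK neighbor / link idea for transaction data.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def neighbours(transactions, theta):
    # for each transaction, the set of other transactions whose similarity >= theta
    return {i: {j for j, t in enumerate(transactions)
                if j != i and jaccard(transactions[i], t) >= theta}
            for i in range(len(transactions))}

def links(nbrs, i, j):
    return len(nbrs[i] & nbrs[j])      # number of common neighbors of points i and j

txns = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "e"}, {"x", "y"}]
nbrs = neighbours(txns, theta=0.4)
print(links(nbrs, 0, 1))   # pairs sharing many neighbors are merged first
```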
CHAMELEON
Chameleon is a hierarchical clustering algorithm. This algorithm uses the dynamic
modeling concept to determine the similarity between the pairs of clusters.
The algorithm uses two measures: relative interconnectivity and relative closeness.
The relative interconnectivity RI(Ci, Cj) is the absolute interconnectivity between Ci and Cj
normalized with respect to the internal interconnectivity of the two clusters:
RI(Ci, Cj) = EdgeCut(Ci, Cj) / ( (EdgeCut(Ci) + EdgeCut(Cj)) / 2 )
where EdgeCut(Ci, Cj) is the edge cut of a cluster containing both Ci and Cj, and EdgeCut(Ci)
(or EdgeCut(Cj)) is the minimum sum of cut edges that partition that cluster into roughly
equal parts.
The relative closeness RC(Ci, Cj) is the absolute closeness between Ci and Cj normalized
with respect to the internal closeness of the two clusters:
RC(Ci, Cj) = (average weight of the edges that connect Ci and Cj) /
( (|Ci| / (|Ci| + |Cj|)) × (average weight of the edges within Ci) +
(|Cj| / (|Ci| + |Cj|)) × (average weight of the edges within Cj) )
The algorithm is implemented in three steps
1. Use K-Nearest approach to construct a sparse graph. Each vertex is a data object.
The edges represent the similarity of the objects. The graph represents the dynamic
concept. Neighborhood radius is determined by the density of the region. In a
sparse region, the neighborhood is defined more widely. Also the density of the
region is the weight of the graph.
2. The algorithm then partitions the sparse graph into a large number of small clusters. The
partitioning uses a min-cut algorithm: a cluster C is partitioned into sub-clusters Ci and Cj
so as to minimize the weight of the edges that are cut. The edge cut indicates the absolute
interconnectivity between the clusters Ci and Cj.
3. Chameleon then uses a hierarchical agglomerative algorithm repeatedly to merge
the subclusters into a larger cluster based on the similarity based on the relative
interconnectivity and relative closeness.
3.8.5 Cluster Evaluation
Evaluation of clustering is difficult, as no test data is available as in classification, and even
for meaningless data some clusters are obtained. There are no fully satisfactory methods for
evaluating clustering results, but the general guidelines for good clustering are
1. Efficiency
2. Ability to handle missing data
3. Ability to handle noisy data
4. Ability to handle different attribute types
5. Ability to handle different magnitude
The essential conditions to be satisfied by a good clustering also include properties like
scale-invariance, richness (the ability to obtain good clusters for all attribute values and
methods) and consistency. Consistency in clustering means that shrinking within-cluster
distances and expanding between-cluster distances should not change the clustering result.
It is, however, still difficult to find a clustering algorithm that fulfils all three criteria.
Summary
- Classification is a supervised method that attempts to classify an instance to a
class
- Regression is a prediction model that can predict a continuous variable
- The major issues of classification are Overfitting, missing data and performance
- Types of classifiers range from statistical, distance based, decision tree, rule based
and soft computing techniques
- The decision tree can be constructed using the measures like entropy, Information
gain and Gini Index
- The techniques of pruning can be used to avoid Overfitting
- Bayesian classifier is a probabilistic model which estimates the class for a new
data
- Bayesian network shows the relationships among variables of a system
- Nearest neighboring techniques used to determine alikeness of different tuples in
the database
- Regression analysis is used to model the relationship between one or more
independent variables and a dependent variable whereas in multiple regression
problems, the output is a combination of predictor variables.
- The testing techniques are hold-out method, K-fold cross validation
- The evaluation metrics of the classifier are sensitivity, specificity, positive predictive
value, negative predictive value, precision and accuracy
- The predictor model use absolute error, mean absolute error, mean squared error,
relative absolute error, and relative squared error.
- The other criteria to evaluate the classifier models are speed, robustness, scalability
and goodness of fit.
- Clustering is a technique of partitioning the objects.
- Problems of clustering include outlier handling, dynamic data handling and
interpreting the results
- Clustering uses similarity measures for clustering
- K-means and K-medoid algorithm are traditional Partitional algorithms.
- Dendrograms are used to display the hierarchical clustering results.
- BIRCH, ROCK and CHAMELEON clustering algorithms are used to cluster
very large data set.
- The measures that are used to cluster are efficiency, ability to handle missing data/
noisy data, ability to handle different attribute types/ magnitude.
DID YOU KNOW
- What is the difference between classification and prediction?
- What is the difference between distance measure and a metric?
- What is the difference between similarity and dissimilarity? Is it possible to obtain
dissimilarity if the similarity measures are available?
- Traditional clustering algorithms are not suitable for very large data set. Justify.
Short Questions
1. What is meant by a classification model?
2. Distinguish between the terms: Classification, regression and estimation.
3. What are the issues of classifiers?
4. What are the types of classifiers?
5. What are the measures of measuring the degree of similarity?
6. Distinguish between the terms: Entropy, Information gain and Gini Index.
7. What is meant by pruning?
8. State Bayesian rule.
9. What are the advantages and disadvantages of Bayesian classification?
10. What is meant by Bayesian network?
11. What are the advantages of Bayesian network?
12. List out the techniques of classification methods?
13. What are the important parameters of a neural network?
14. What are the advantages and disadvantages of neural network based classification?
15. Distinguish between the terms: regression, non-linear regression and multiple
regression.
16. Enumerate the evaluation method of classifier and predictor?
17. What is meant by clustering?
18. What is meant by clustering?
19. What are the advantages and disadvantages of clustering?
20. Enumerate the distance measures?
21. Enumerate the different clustering methods?
22. What are the advantages and disadvantages of K-means and K-medoid algorithms?
23. What are the problems associated with clustering of a large data?
24. What is the methodology of evaluating a cluster?
25. What is meant by edge cut and relative closeness?
Long Questions
1. Explain in detail the method of constructing a decision tree.
2. Explain in detail the ID3 algorithm.
3. Explain in detail Bayesian classifier.
4. Explain in detail the Bayesian network.
5. Explain in detail the regression models.
6. Explain in detail the soft computing methods for classification?
7. List out the different similarity measures.
8. Explain in detail the methodology of K-means and K-medoid algorithm.
9. Explain in detail the problems associated with clustering of very large data.
10. Compare and contrast the algorithms BIRCH, ROCK and CHAMELEON for
clustering very large data.
UNIT IV
DATA WAREHOUSING
INTRODUCTION
Data mining will continue to take place in environments with or without Data warehouse.
Data warehouse is not prerequisite for data mining. But there is a need for organized,
efficient data storage and retrieval structure. Therefore Data Warehouse is important
technology and is a direct resultant of information age. This chapter presents the concepts
of data warehouse and its relevance for data mining.
Learning Objectives
- To differentiate Data warehouse from traditional database, data mart and operational
data store
- To study the architecture of a typical Data warehouse
- To study the fundamentals of multidimensional models
- To study the OLAP concepts, OLAP systems
- To study the issues, problems and trends of data warehouse and its relevance for
data mining.
4.1 NEED FOR DATA WAREHOUSE
Large amount of data is stored by the organizations for making decisions for their day
to day activity. For this reasons, the organizations run many different applications for their
need. Some of the applications include human resources, Inventory, sales, marketing and
so on. These systems are called Online Transaction Processing Systems (OLTP) systems.
These applications are mostly relational databases but include many legacy applications
written in many programming languages and in different environments.
Over the years, organizations have developed thousands of such applications. This
proliferation can lead to a variety of problems. For example, the generation of reports for
managers becomes difficult as organizations spread across different geographical locations.
Hence there is a need for a single version of enterprise information, containing data of high
accuracy, together with a user-friendly interface that can answer the queries needed for
making effective decisions.
In order to meet the requirements of the decision makers, it makes sense to create a
separate database that stores information that is of interest to the decision makers. The
new system can help the decision makers in analyzing the patterns and trends.
Two solutions have been proposed to tackle this problem: one is dimensional modeling,
and the other is either an operational data store or a data warehouse.
4.2 OPERATIONAL DATA STORE (ODS)
Operational Data Store or simply ODS is one solution that is designed to provide a
consolidated view of the business organization current operational information.
Inmon and Imhoff defined ODS as a subject-oriented, Integrated, volatile,
current valued data store, containing only corporate data.
In contrast, a Data warehouse does not contain any operational data. So it is better to
review the ODS first and then use that concept before developing a data warehouse system.
In short, an ODS can provide assistance for a grand data warehouse project.
ODS is subject-oriented. OLTP systems contain application-oriented data; for example,
a payroll system contains all the data that is relevant for the payroll application. ODS data,
on the other hand, is generic: it is designed to contain data related to the major data subjects
of the business organization. ODS contains integrated data, as it derives data from multiple
data sources, and it provides a consolidated view of the data of the organization.
ODS is current-valued, which means that the ODS contains up-to-date information. ODS
data is volatile, as the ODS constantly refreshes itself with new information periodically.
ODS data is also detailed data, because it is detailed enough to enable the decision makers
to take effective decisions.
Thus the ODS may be viewed as the organization's short-term memory and can also be
considered a miniature data warehouse.
The advantages of the ODS system are
1. It provides a unified view of the operational data. It can provide a reliable store of
information.
2. It can assist in better understanding of the business and customers of the organization.
3. ODS can help to generate better reports without having to resort to OLTP systems
and other legacy systems.
4. Finally, ODS can help to populate data warehouse. This reduces the time for the
development of final data warehouse.
Design and Implementation of ODS
The Typical Architecture of ODS is given Figure 4.1
Figure 4.1 ODS System
The extraction of information from different data sources needs to be efficient. Also
the data of the ODS is constantly refreshed and this process is carried out regularly and
frequently. Quality of the data is checked very often for various constraints.
Populating ODS is the acquisition process of extraction, transformation and loading
data from various data sources. This is called ETL.
ETL tasks look simple, but in reality they require great skill in terms of management,
business analysis and technology. The ETL process involves many off-the-shelf tools and
deals with different issues like
1. Handling different data sources, which is a complex issue as the data sources include
flat files, RDBMSs and various legacy systems
2. Compatibility between source and target systems
3. Technological constraints
4. Refreshing constraints
5. Quality and Integrity of data
6. Issues like backup, recovery etc.
Transformation is an important area where problems like multi-source data are tackled.
The problems include instance identity problems, data errors, data integrity problems. A
sound theoretical basis of data cleaning is given below
1. Parsing : All data components are parsed for errors.
2. Correcting: The errors are rectified. For example, a missing data may be corrected
using the relevant information of the organization
3. Standardizing: The organizations can evolve a rule to standardize the data. For
example, a format dd/mm/yyyy or mm/dd/yyyy can be fixed for all data of DATE
types. Then all the DATES are forced to comply with this business rule.
4. Matching : Since most of the data are related, the data must be matched to check
its integrity.
5. Consolidation: The corrected data should be consolidated for building a single
version of the enterprise data.
Once the data is cleaned, it is loaded into the ODS. Suitable ETL tools can be used to
automate the ETL process.
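As a small illustration of the standardizing step described above, the sketch below forces date values into a single dd/mm/yyyy business format; the set of accepted input formats is an assumption for the example.

```python
# Minimal sketch of the standardizing step: forcing every DATE value into one
# agreed business format (here dd/mm/yyyy). The input formats are assumptions.
from datetime import datetime

ACCEPTED = ("%d/%m/%Y", "%m-%d-%Y", "%Y.%m.%d")   # formats seen in the source systems

def standardize_date(value):
    for fmt in ACCEPTED:
        try:
            return datetime.strptime(value, fmt).strftime("%d/%m/%Y")
        except ValueError:
            continue
    raise ValueError(f"unparseable date: {value}")   # flag for the correcting step

print(standardize_date("2007.03.15"))   # -> 15/03/2007
```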
4.3 INTRODUCTION TO DATA WAREHOUSE
Data Warehouse is defined as "a subject oriented, integrated, time variant, and non volatile
collection of data in support of management's decision making process" (W.H. Inmon).
Thus the Data Warehouse is different from an operational DB in four aspects.
Subject oriented: The data that is stored in Data Warehouse is related to subjects like
sales, product, supplier and customer. The aim is to store data related to a particular
subject so that decisions can be made.
Integrated: Data Warehouse is an integrated data source. It derives data from various
data sources ranging from flat file to databases. The data is cleaned and integrated before
stored in Data Warehouse.
Non volatile: The operations on the Data Warehouse are limited to the addition and updating
of data. The data is a permanent store and is normally not deleted; over a period of time this
data becomes historical data.
Time variant: To make correct decisions there is a need for an explicit or implicit time
element in the Data Warehouse structure to facilitate trend analysis.
Another definition of the Data Warehouse (Sean Kelly)
- Separate
- Available
- Integrated
- Time stamped
- Subject oriented
- Non volatile
- Accessible
Subject oriented data
Normally, all business organizations run variety of applications like Payroll, Sales and
Inventory. They organize data sets around individual applications to support their operational
systems. In short, the data is required to suit the functionalities of the application software.
In contrast, Data Warehouse organizes the data in the form of subjects and not by
applications. The subjects vary from organizations to organizations.
For example, an organization may choose to organize data using subjects like Sales,
Inventory and Transport etc. The insurance company may have subjects like claims etc.
In short, the Data Warehouse data is meant for assisting the business organizations in taking
decisions. Hence there is no application flavour.
Integrated
The data in the Data Warehouse comes from different data sources. The data stores
may have different databases, files and data segments. Data Warehouse data is constructed
by extracting the data from various disparate sources and its inconsistency are removed,
then it is standardized, transformed, consolidated and finally integrated.
The data is then preserved in a clean form so that it is devoid of any errors.
Time variant data
As the time passes, the data that is used by the organizations becomes stale and
obsolete. Hence the organizations always maintain current values. On the other hand, the
Data Warehouse data is meant for analysis and decision making. Hence the past data is
necessary for analyzing and performing trend analysis.
Hence the Data Warehouse has to contain historical data along with the current data.
Data is stored in snapshots. Hence the data in the Data Warehouse is time-variant in nature,
which allows analysis of the past, relates information to the present and effectively supports
prediction of the future.
Non-Volatile data
While the data in the operational systems is intended to run day-to-day business,
Data Warehouse data is meant for assisting the organizations in decision making. Hence
the Data Warehouse data is not updated in real time. While we change, add and delete data
in an operational DB at will, the Data Warehouse is not updated as frequently; that is, its
data is not as volatile as the data in an operational database.
Data warehouse sounds similar to ODS. But it differs from ODS in some aspects.
They are tabulated in the following Table 4.1.
Table 4.1 ODS Vs Data Warehouse
A Data Warehouse also sounds similar to a database. So how does a data warehouse
differ from a database?
Even though they share main concepts like tables, schemas and so on, a data warehouse
is different from a database in a number of aspects, and these differences are so fundamental
that it is often better to treat a data warehouse differently from a database.
1. The biggest difference between the two is that most databases place more emphasis
on applications. More often the applications cater to a single domain like payroll
processing or finance and do not contain data of multiple domains. In contrast, data
warehouses always deal with multiple domains. This allows the data warehouse to
show the company as a single whole entity rather than as individual pieces.
2. Data warehouses are designed to support analysis. The two types of data that a
company normally possesses are operational data and decision support data. The
purpose, format, and structure of these two data types are different. In most cases,
the operational data will be placed in a relational database while the data warehouse
will have all the decision support data.
3. In a relational database, tables are frequently used, and normalization rules are applied
to remove redundancy. While this mechanism is highly effective in an operational
database, it is not conducive to decision making, because changes are not maintained
and monitored. For example, a student database may hold only the current data and
often will not show the evolution of that data; in short, historical information is absent
in the database. In this situation, decision support is useful, and the data warehouse
takes the responsibility of maintaining the historical information.
4. Data warehouse data often differs from relational database data in aspects like time span. Generally database data are atomic or current and deal with a short time frame, whereas data warehouse data deal with much longer time frames. Another difference is the granularity of data. Decision support data is both detailed and summarized data with many different levels of aggregation.
In short, Data warehouses are more elaborate than a mere database.
Advantages and Disadvantages of a Data Warehouse
There are number of advantages the business organizations can derive by using a
data warehouse. Some of the advantages are listed below.
1. The organization can analyze the data warehouse to find historical patterns and
trends that can allow them to make important business decisions.
2. Data warehouse enables the users to access a large amount of information. This information can be used to answer a number of queries of the users as well as the decision makers. The data warehouse can present data from different locations in a combined form. Hence the decisions made by the decision makers will assist in increasing profits and reducing costs.
3. When data is taken from multiple sources and placed in a centralized location, an
integrated view enables the decision makers to make a better decision than they
would if they looked at the data separately.
4. Data mining is connected to data warehouses. Hence data warehouse and data
mining can complement each other.
5. Data warehouses create a structure which will allow changes to the data. The
changed data is then transferred back to operational systems.
Disadvantages
1. Data must be cleaned, loaded, or extracted to get quality results. This process can
take a longer period of time and many compatibility issues are involved.
2. Users who will be working with the data warehouse require training.
3. The data warehouse is accessed by higher level management. Hence confidentiality
and privacy are important issues. Accessing the warehouse through the Internet
involves large number of security problems.
4. It is difficult to maintain data warehouses. Any organization that is considering
using a data warehouse must decide the benefits of warehouse versus the cost of
the project to decide upon the worthiness of data warehouse project.
Strategic Use of a Data Warehouses
It is not enough for a company to simply acquire a data warehouse if the business
organizations are not able to utilize it properly. It is crucial to use data warehouse effectively
in order to make important decisions.
Three categories of decision support can be defined
1. Reporting data: Reporting data is considered to be the lowest level of decision support. But it is necessary for the business organizations to generate informative reports to carry out successful business.
2. Analyzing Data: The data warehouse has plenty of tools for multidimensional analysis. Hence the business organizations should use these tools properly to analyze data in a way that can assist them. It is important to understand the past, present, and future. A company can analyze the information to learn about the mistakes made in the past and find ways to avoid repeating those mistakes in the future. Companies also want to place an emphasis on learning. This process is important for a company to maneuver quickly and find a competitive edge among its competitors.
3. Knowledge discovery: Knowledge mining takes place through data mining. Here a company studies patterns, connections, and changes over a given period of time in order to make important decisions. A data warehouse is also a tool that allows companies to measure their successes and failures in terms of the decisions made.
Data Warehouse Issues
There are certain issues involved in developing data warehouses that companies need to look at. A failure to prepare for these issues may result in a poor data warehouse.
1. The first issue is the quality of company data. One issue that confronts the organization is the time it needs to spend on loading and cleaning data. Some experts believe that a typical data warehouse project spends 80% of its time on this process.
2. It is difficult to estimate the time for a data warehouse project. More often it takes longer than the initial estimate.
3. Another issue that companies will have to face is security. Often the problem is to decide what kind of data or information can be placed in the warehouse. Security and other issues often crop up suddenly, thereby delaying the project.
4. Balancing the existing OLTP systems and applications versus data warehouse
may cause serious problems. Often the business organizations need to take decisions
whether or not the problem can be fixed via the transaction processing system or
a data warehouse.
5. There are many problems associated with data validation.
6. Estimating budgets for developing, maintaining and managing the data warehouse
is a big issue.
Data Warehouse and Data mart
Data mart is a database. It contains a subset of Data Warehouse data. The Data
Warehouse data is divided into many data marts so that data can be accessed faster.
There is a considerable confusion among the terms data warehouse and data marts
also. It is essential to understand the differences that exist between these two. Data
warehouses and data marts are not the same thing. There are some differences between
the Data warehouse and Data mart.
1. A data warehouse has a structure which is separate from a data mart. Data mart is
specific. Say it can store groups of subjects related to a marketing department. In
contrast, a data warehouse is designed around the entire organization as a whole.
Data warehouse is owned by the organization and not by specific departments.
2. Data contained in data warehouses is highly granular, while the information in data marts is not very granular.
3. Information stored in a data warehouse is very large compared to a data mart.
4. Much of the information that is held in data warehouses is historical and not biased towards any single department. It takes an overall view of the organization rather than a specific view.
Role of Meta data in Data warehouse
Metadata is data about the data in the Data Warehouse and is stored in a repository. The metadata is a sort of summarized data of the warehouse. It covers current and old detailed data, and lightly or highly summarized data. Metadata serves as an index and helps to access the data present in the Data Warehouse. The metadata has
i. Structures to maintain the schema, data definitions and hierarchies
ii. History of the migrated data
iii. Summarization algorithms
iv. Mapping constructs to help the mapping of the operational data to the Data Warehouse
v. Data related to profiles and schedules to improve the performance of the system
4.4 DATA WAREHOUSE ARCHITECTURE
The arrangement of components is called architecture. The components are arranged
in a form to extract maximum benefits. The basic components of the Data Warehouse are
as follows,
1. Source data component
2. Data staging component
3. Data storage component
4. Information delivery component
The basic components are generic in nature but the arrangement may vary as per the organization's needs. The variation is due to the fact that some of the components are stronger than others in a given architecture.
The basic components are described below.
1. Source data component
The source data that is coming to the Data Warehouse are in four broad categories.
This component concerns about the data that serves as input for the Data Warehouse.
This consists of four categories of data.
Operational Systems data
Operational systems provide the major chunk of data for the Data Warehouse. Based on the requirements of the Data Warehouse, different segments of the data are chosen. But the problem associated with this data is that it lacks conformance. Hence the great challenge for the warehouse designers is to standardize, transform and integrate the data to provide useful data for the warehouse.
Internal data
These data belong to individual users and may be available in the form of spreadsheets or databases. These data may be useful for the warehouse and a suitable strategy is required to utilize them.
Archived data
Operational systems periodically back up their data. The frequency of the backup operation may vary. The Data Warehouse uses historical data to take decisions, so the warehouse uses archived data as well.
External data
Business organizations use external data for their day-to-day operation. Information like market trends in stock exchanges is vital and is typical external data. Usually external data vary in format. These data should be organized so that the warehouse can make use of them.
2. Data staging component
The data staging component is a workbench for performing three major functions:
1. Extraction
2. Transformation
3. Loading
Extraction function plays a very important role as Data Warehouse pulls data from
various operational systems. The data differ in formats and sources. These range from flat
files to different relational databases. This step involves various techniques and tools to
prepare data suitable for the Data Warehouse requirement.
Data transformation performs the crucial role of data conversion. First the extracted data is cleaned. The cleaning may range from correction of spellings and missing value analysis to advanced standardization of data. Finally, after the application of different techniques, this step provides integrated data that is devoid of data collection errors.
Data loading concerns both the initial loading of data into the Data Warehouse and the incremental revisions performed on a regular basis.
Collectively these three functions are called the ETL operation.
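To make the three ETL functions concrete, here is a minimal, hedged sketch in Python. The source file name, table name and column names (operational_sales.csv, sales_fact, sale_id, region, amount) are assumptions invented for the example and are not taken from the text.

```python
import csv
import sqlite3

def extract(path):
    """Extraction: pull raw rows from a flat-file operational source."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transformation: clean, standardize and consolidate the extracted rows."""
    cleaned = []
    for row in rows:
        region = row["region"].strip().title()   # standardize spelling / case
        amount = float(row["amount"] or 0)       # simple missing-value handling
        cleaned.append((row["sale_id"], region, amount))
    return cleaned

def load(rows, db="warehouse.db"):
    """Loading: initial load or incremental revision of the warehouse store."""
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS sales_fact "
                "(sale_id TEXT, region TEXT, amount REAL)")
    con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

load(transform(extract("operational_sales.csv")))
```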
3. Data storage component
After the data is loaded, it is stored in a suitable data structure. Generally this is kept separate from the storage of data related to operational systems. Operational data normally varies very often, whereas the analysts need the stable data of the warehouse. That is the reason why Data Warehouses are largely read-only data repositories: no additions or deletions are done in real time, and updates are restricted to a few people.
4. Information delivery component
This component is intended to serve users who range from novice users to advanced analysts who wish to perform complex analysis. Hence this component provides different methods of information delivery, including ad-hoc reports and statistical analysis, usually online, with periodical delivery over the Internet as well.
As noted earlier, the basic components are generic in nature but the arrangement may vary as per the organization's needs. The typical three-tier Data Warehouse architecture is shown below.
The three-tier architecture is shown in Figure 4.2.
Figure 4.2 Three Tier Architecture
The bottom layer consists of different data sources, usually relational DB systems. The data is extracted from the operational database through an interface called a gateway; ODBC and JDBC are examples of gateways.
The next layer is the Data warehouse or Data mart. A data mart contains a subset of the corporate data that is of value to a specific group of users. Normally a data mart is implemented using low-cost machines. Dependent data marts derive data directly from the enterprise warehouse, while independent data marts derive data directly from the operational database. Metadata are data about data, which include the structure of the Data warehouse, operational metadata, algorithms for summarization, mapping, performance related data and business data such as business terms and definitions.
The next layer is OLAP server. It can be either ROLAP model or MOLAP model.
The front end tools and utilities perform functions like data extraction, data cleaning,
transformation, data load and refreshing.
4.5 IMPLEMENTATION OF DATA WAREHOUSE
OLAM (On-Line Analytical Mining) integrates data mining with OLAP and performs knowledge extraction on multidimensional data. OLAM is important because of the high-quality data in the warehouse, the availability of OLAP-based exploratory analysis, and the on-line selection of data mining algorithms.
integration of Data mining and OLAP is shown in Figure 4.3.
Figure 4.3 Integrating OLAP and DM.
The OLAM engine performs data mining tasks like concept description, association, classification, prediction, clustering and so on.
4.6 MAPPING THE DATA WAREHOUSE TO MULTIPROCESSOR
ARCHITECTURE
Data warehouse environments are generally large and complex, so they require good server management policies to monitor processes and statistics. Having multiple CPUs allows the data warehouse to perform more than one job at a time. Hence parallelization plays an important role in the effective implementation of a data warehouse server.
Server Hardware
This is the most crucial part. There are some architectural options. They are
Symmetric Multi-Processing (SMP)
Massively Parallel Processing (MPP)
Figure 4.4 shows an SMP machine, which is a set of tightly coupled CPUs that share memory and disk.
Figure 4.4 Tightly Coupled SMP Machine.
A cluster is a set of loosely coupled SMP machines connected by an interconnect. Each machine is called a node. Every node has its own CPU and memory, but all share access to a disk. Together they mimic a larger machine. Software manages the shared disk in a distributed fashion; this kind of software is called a distributed lock manager.
Massively Parallel Processing (MPP)
An MPP machine is made up of many nodes. All the nodes are loosely coupled and linked together by a high-speed interconnect. This is shown in Figure 4.5.
Figure 4.5 MPP Machine
MPP machines use a distributed lock manager to maintain integrity.
4.7 DESIGN OF DATA WAREHOUSE
The first step in building a data warehouse is data modeling. A data model documents the structure of the data independent of data usage. A common technique is the entity relationship diagram (ERD). An ERD shows the structure of the data in terms of entities and relationships.
An entity is a concept that represents a class of persons, places or things. The characteristics of an entity are called attributes. An attribute that is unique is called a key. Entities are connected by various relationships. The relationships can be one-to-one, one-to-many or many-to-many.
Figure 4.6 Sample ERD
(The sample ERD shows a Subject entity, with attributes Subject ID and Subject Name, related through an "Offered" relationship to a Faculty entity with attributes Faculty ID, Faculty Name and Designation.)
Once the ERD is completed, the model is analyzed. The analysis involves the application of normalization rules. The objective of normalization is to remove the redundancies that are present in the relational database. The first normal form (1NF) requires all attributes to have a single (atomic) value. The entities are in 2NF if all the non-key attributes are fully dependent on the whole primary key, with no partial dependencies. 3NF additionally requires that non-key attributes are not transitively dependent on the primary key, that is, they do not depend on other non-key attributes.
Data warehousing design is different from the database design. The design involves
three steps.
1. Conceptual model
2. Logical design model
3. Physical design model.
Conceptual Data Model
Conceptual data model include identification of important entities and the relationships
among them. At this level, the objective is to identify the relationships among the different
entities.
Logical Data Model
The steps of the logical data model include identification of all entities and relationships
among them. All attributes for each entity are identified and then the primary key and
foreign key is identified. Normally normalization occurs at this level.
In data warehousing, it is common to combine the conceptual data model and the
logical data model to a single step.
The steps for logical data model are indicated below:
1. Identify all entities.
2. Identify primary keys for all entities.
3. Find the relationships between different entities.
4. Find all attributes for each entity.
5. Resolve all entity relationships that is many-to-many relationships.
6. Normalization if required.
Physical Data Model
Features of physical data model include:
- Specification all tables and columns.
- Specification of Foreign keys.
- Denormalization may be performed if necessary.
At this level, specification of logical data model is realized in the database.
The steps for physical data model design involve Conversion of entities into tables,
conversion of relationships into foreign keys, conversion of attributes into columns and
changes to the physical data model based on the physical constraints.
Multidimensional Model
Dimensional data model is used in data warehousing systems. This is different from
the traditional 3NF design of traditional OLTP system. Multidimensional model is a way to
view and integrate data in a database. The data can be stored using different data structures
for effective retrieval. It may require storing data using multiple dimensions.
Dimension is a category of information. For example, a sale data involves three
dimensions like region, time and product type. The user query may be to summarize data
region wise or for a period or for a particular product type. Hence it makes sense to store
data in a form so that the retrieval time is faster. This type of modeling is called dimensional
modeling.
An attribute is a unique level within a dimension. For example, Year is an attribute in the Time dimension. Often the attributes have relationships among them. This kind of relationship is called a hierarchy. A hierarchy represents the relationship between the different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.
The specific object of interest is stored in a table called the fact table. The fact table contains the measures of interest. For example, a sales fact table contains the attributes related to sales, and these are often numeric data. A fact consists of measures and context data: the measures are the values over which queries aggregate, and the dimensions provide the context and facilitate the retrieval of data.
Each dimension can be stored as a table, called a lookup table (dimension table). This table contains the detailed information about the attributes of the dimension. For example, the lookup table for the Year attribute in the Time dimension contains a list of all of the years available in the data warehouse.
The data is seen as a cube, and each dimension is seen as an axis of the cube, as shown in Figure 4.7.
Figure 4.7 Data Cube
The dimension levels support a partial order or a total order, and the symbol < can be used to represent the order relationship. For example, the dimensions of the sales cube may have the concept hierarchies (lattices) Product < Product type < Product category < Company; Day < Month < Year; and Town < District < State < Country.
Operations like aggregation can be used to aggregate data, such as the sum of sales of a particular product type in a particular region. But dimensions are not always additive.
Multidimensional schema
Schema is used to represent multidimensional data. Some of the popular schemas
include,
- Star schema
- Snow flake schema
- Constellation schema
In the star schema design, a single fact table is in the middle and is connected to other
dimensional data radially like a star. A star schema can be simple or complex. A simple star
consists of one fact table while a complex star can have more than one fact table as part of
schema.
The snowflake schema is an extension of the star schema. Here the point of the star
explodes into more points. The main advantage of the snowflake schema is the improvement
in the query performance. But the main disadvantage of the snowflake schema is the additional
maintenance efforts needed due to the availability of more number of lookup dimension
tables.
Star schema is a graphical schema which shows data as a collection of facts and
dimensions and is shown in Figure 4.8.
Figure 4.8 Star Schema.
The centre of the star contains a fact table. The simplest star schema has one fact
table with multiple dimension tables.
The fact table contains the data that is required for queries and can be very large. Each tuple of the fact table contains links (foreign keys) to the dimension tables. The dimension tables contain descriptive information about the dimensions. The entire star schema can be implemented using a relational system.
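A hedged sketch of how such a star schema might be declared relationally is shown below (executed here through Python's sqlite3 module). The fact and dimension table names and their columns are illustrative assumptions, not a schema taken from the text.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Dimension (lookup) tables: descriptive information about each dimension.
CREATE TABLE time_dim    (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, product_name TEXT, product_type TEXT);
CREATE TABLE region_dim  (region_id INTEGER PRIMARY KEY, town TEXT, district TEXT, state TEXT);

-- Fact table at the centre of the star: numeric measures plus foreign keys
-- linking each tuple to the dimension tables.
CREATE TABLE sales_fact (
    time_id    INTEGER REFERENCES time_dim(time_id),
    product_id INTEGER REFERENCES product_dim(product_id),
    region_id  INTEGER REFERENCES region_dim(region_id),
    units_sold INTEGER,
    amount     REAL
);
""")
print(con.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())
```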
For such multidimensional tables, it is better to develop bitmap indexes so that the overhead of scanning very large tables is reduced. In a bitmap index, the first bit represents the first tuple, the second bit the second tuple, and so on; that is, each tuple is associated with a particular bit position, and a bit is set when the corresponding tuple contains the indexed value. This facilitates operations like aggregation and joins.
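The bit-per-tuple idea can be sketched in a few lines of Python. The column values used here are made-up assumptions; the point is only to show how one bitmap per dimension value supports fast Boolean selection.

```python
# Toy bitmap index: one bit per tuple, one bitmap per distinct dimension value.
rows = ["North", "South", "North", "East", "South"]   # region column of a fact table

bitmaps = {}
for position, value in enumerate(rows):
    bitmaps.setdefault(value, 0)
    bitmaps[value] |= 1 << position                   # set the bit for this tuple

# Tuples where region = "North": bits 0 and 2 are set.
print(bin(bitmaps["North"]))                          # 0b101

# ORing/ANDing bitmaps supports fast selection before joins or aggregation,
# e.g. region = "North" OR region = "South":
print(bin(bitmaps["North"] | bitmaps["South"]))       # 0b10111
```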
One of the common problems of the data warehouse is the Slowly Changing Dimension problem. This problem is due to attributes that vary over time.
For example, consider a customer record with two attributes, customer name and location; say John, Delhi is a valid customer record. If John later moves to another location, say Mumbai, the problem is how to record this new fact.
If the new record replaces the original record, then the old record disappears and no trace of it exists. On the other hand, if a new record is added into the customer dimension table, then this causes duplication, as the same customer is treated as two people. One more way of tackling this issue is to modify the original record to reflect the change.
These scenarios have to be evaluated and the star schema should be designed to reflect the chosen way of recording changes. This is required because the data warehouse is supposed to carry historical information that enables the managers to take effective decisions.
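One widely used way of reflecting such changes while keeping history, often called a "Type 2" slowly changing dimension, is sketched below. This is only an illustration; the field names (surrogate_key, valid_from, valid_to, current) and the dates are assumptions made for the example.

```python
from datetime import date

customer_dim = [
    {"surrogate_key": 1, "customer": "John", "location": "Delhi",
     "valid_from": date(2005, 1, 1), "valid_to": None, "current": True},
]

def move_customer(dim, customer, new_location, change_date):
    """Close the current row and insert a new row for the changed attribute."""
    for row in dim:
        if row["customer"] == customer and row["current"]:
            row["valid_to"] = change_date
            row["current"] = False
    dim.append({"surrogate_key": len(dim) + 1, "customer": customer,
                "location": new_location, "valid_from": change_date,
                "valid_to": None, "current": True})

move_customer(customer_dim, "John", "Mumbai", date(2007, 6, 1))
# Both the old (Delhi) and the new (Mumbai) rows are retained, so trend
# analysis over the historical data remains possible.
```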
4.8 OLAP
OLAP stands for On-Line Analytical Processing. It is designed to answer more complex analytical queries than traditional SQL. OLAP performs analysis of data and presents the results to the user as a report. This aspect differentiates OLAP from traditional SQL.
OLAP
There has been considerable confusion between the terms OLAP, data warehouse and data mining. OLAP is different from the other two, as it is a technology concerned with the fast analysis of information; basically it is a front-end tool to process information that is present in a warehouse. Sometimes the term business intelligence is used to refer to both OLAP and data warehousing.
OLAP provides a conceptual view of the data warehouse multidimensional data.
Data cubes are generalization of spread sheets that essentially provide a multidimensional
view.
The original definition of an OLAP system given by E.F. Codd is:
"OLAP is the dynamic enterprise analysis required to create, manipulate, animate and synthesize information from exegetical, contemplative and formulaic data analysis models."
Exegetical means the analysis is from the manager's point of view, contemplative means it is from the view of the person who conceived it and thought about it, and formulaic means it is according to some formula. Alternatively, one can view OLAP as advanced analysis on shared multidimensional information.
Characteristics of OLAP system
The difference between OLAP and OLTP is obvious. While the users of OLTP are
mainly middle and low level management, the users of OLAP systems are decision makers.
OLAP systems are designed to be subject oriented while OLTP systems are application
oriented.
OLTP data are mostly read and changed regularly. But OLAP data are not updated in
real time.
OLTP mostly deals with the current information. But OLAP systems are designed to
support decision makers. Hence OLAP system require historical data over a large period
of time.
OLTP systems support day to day activity of the business organization. Mostly they
are performance and availability driven. But OLAP are management critical and are useful
to the management.
Queries of the OLTP are relatively simple. But OLAP queries are complex and often
deal with many records. The records are both current and historical data. Table 4.2 tabulates
some of the differences.
Table 4.2 OLTP Vs OLAP

OLTP                                          | OLAP
Users: mostly middle and low level management | Users: higher level management for strategic decision making
Daily operations                              | Decision support operations
Simple queries                                | Complex queries
Application oriented                          | Subject oriented
Current data                                  | Historical, summarized multidimensional data
Read/write/update at any time                 | No real-time updation
What are the characteristics of OLAP systems?
OLAP characteristics are often collectively called FASMI characteristics based on
the first character of the characteristics. They are described as
1. Fast: OLAP systems should be able to furnish information quickly, which means queries should execute fast. Achieving this speed is a difficult task: it requires suitable data structures and sufficient hardware facilities. Sometimes, precomputation of aggregates is done to reduce the execution time.
2. Analytic: OLAP systems should provide rich functionalities. The system should be
user friendly and OLAP system should cope with multiple applications and users.
3. Shared: An OLAP system is likely to be accessed by higher level decision makers. Hence the system should provide sufficient security and confidentiality.
4. Multidimensional: OLAP should provide a multidimensional view of the data. This is often referred to as a cube. The multidimensional structure should allow hierarchies that show the relationships among the members of a dimension.
5. Information: The OLAP system should be able to handle a large amount of input data and information. It should integrate effortlessly with the warehouse, as OLAP systems normally obtain data from the warehouse.
Codd's OLAP characteristics
E.F. Codd has identified some of the important characteristics of the OLAP systems.
Some of the important characteristics are
1. Multidimensional conceptual view
2. Accessibility
3. Batch extraction Vs Interpretative
4. Multi user support
5. Storing OLAP results
6. Extraction of missing values
7. Treatment of missing values
8. Uniform reporting performance
9. Generic dimensionality
10. Unlimited dimensions and aggregation levels
Data Cube implementations
The number of possible data cubes ranges from thousands to lakhs depending on the business organization. Decision makers normally want results quickly, so there should be some strategy for handling data cube computation. Some of the strategies are mentioned below.
Precompute and Store all
Often there is a need to reduce response time drastically, so it may seem best to precompute all the data cubes and store them. But this solution is practically not implementable, as the storage of so many cubes is a difficult task. Creating indexes for all these data cubes also poses a great problem.
Precompute and store none
Here the data cubes are not precomputed or stored at all; there is no space requirement, and the data cubes are computed on the fly from the raw data. However, the response time can be very poor.
Precompute and store some
In this strategy frequently accessed cubes are precomputed and stored. Sometimes, a data cube aggregate may be derived from other stored data cubes. This often leads to better overall performance.
The types of operations OLAP provides include:
Simple query
This includes obtaining value from a single cell of a cube.
Slice
This operation looks at a sub cube by selecting one dimension.
Dice: This operation looks at a sub cube by selecting two or more dimensions. This
operation then includes slicing in one dimension and then rotating the cube to select on a
second dimension.
Roll up
This operation allows user to move up in the aggregation hierarchy like looking at the
overall sales of the company.
Drill down
This operation allows the user to get more detailed information by navigating down in
the aggregation hierarchy looking for detailed fact information.
Visualization
This operation allows the user to see results in a visual form for better understanding.
Slice and dice can be seen together as subdividing the cube along its dimensions. Roll up (drill up) and drill down can be speeded up by precomputing and storing the frequently accessed aggregations.
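The following hedged sketch illustrates roll-up, drill-down, slice and dice on a tiny sales table using pandas group-by and selection. The sample data and column names are invented for the example, and the operations are only simple approximations of what a full OLAP engine provides.

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North", "South"],
    "year":    [2006, 2007, 2006, 2007, 2007, 2007],
    "product": ["Bike", "Bike", "Car", "Bike", "Car", "Car"],
    "amount":  [100, 120, 300, 150, 280, 310],
})

# Roll-up: move up the aggregation hierarchy (overall sales per region).
rollup = sales.groupby("region")["amount"].sum()

# Drill-down: navigate to more detailed facts (region and year).
drilldown = sales.groupby(["region", "year"])["amount"].sum()

# Slice: fix one dimension (year = 2007).
slice_2007 = sales[sales["year"] == 2007]

# Dice: select on two or more dimensions (year = 2007 and product = "Car").
dice = sales[(sales["year"] == 2007) & (sales["product"] == "Car")]

print(rollup, drilldown, dice, sep="\n\n")
```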
4.9 OLAP MODELS
The models that are available are
ROLAP Relational On-Line Analytical Processing
MOLAP Multidimensional On-Line Analytical Processing
DOLAP Desktop analytical processing and is a variation of ROLAP
In MOLAP, On-Line Analytical Processing is implemented by storing the data multidimensionally. Usually the MDDB is a vendor-proprietary system that stores data in the form of multidimensional hypercubes. In ROLAP, the OLAP engine resides in the desktop; in this model, no prefabricated cubes are created beforehand, rather the relational data is presented as virtual multidimensional data cubes.
MOLAP model
This is the traditional way of OLAP analysis. In MOLAP the data is stored in the form of a multidimensional cube, not in the form of relational tables. Hence MOLAP has to rely on the various proprietary standards that are available in the market.
The kind of processing in MOLAP is shown in the Figure 4.9.
Figure 4.9 MOLAP Model
The OLAP engine resides in a special server that stores the proprietary multidimensional cubes.
Advantages:
* MOLAP cubes are built for fast data retrieval, so the system provides excellent performance when retrieving information. These cubes also prove to be optimal for slicing and dicing operations.
* MOLAP systems can perform complex calculations: all calculations have been pre-generated when the cube is created. Hence, complex calculations are not only feasible but also return results quickly.
Disadvantages
MOLAP is limited in the amount of data it can handle. It is not possible to include a very large amount of data in the cube itself, because normally only summarized data is included in the cubes.
MOLAP also requires additional investment, as cube technology is often proprietary. So the business organizations should be in a position to make additional investments in human and capital resources.
ROLAP
In the ROLAP model, the OLAP engine resides in the desktop. Here prefabricated cubes are not created; instead the relational data itself is presented as virtual data cubes. The system is shown in Figure 4.10.
Figure 4.10 ROLAP Model
In the ROLAP model, the data is stored in the form of a relational database and is then presented to the users in a dimensional form. The underlying relational structure is hidden from the user by a thin metadata layer.
The user presents a query to the middle layer. The analytical server converts the user request into a set of complex queries and accesses the data from the data warehouse. Then the middle layer constructs the cube on the fly and presents it to the user. Here, unlike MOLAP, static structures are not created. The architecture of the ROLAP model is shown in Figure 4.11.
Figure 4.11 Architecture of ROLAP Model.
The major advantages of ROLAP are:
1. Supports all basic OLAP features and functionalities
2. Mainly stores the data in the relational form
3. Supports some form of aggregation
The differences between MOLAP and ROLAP are shown in Table 4.3.
Table 4.3 ROLAP Vs MOLAP Models

ROLAP                                                     | MOLAP
Data stored in relational form                            | Summary data stored in proprietary multidimensional databases
Supports very large volumes                               | Supports moderate volumes
Uses complex SQL queries to fetch data from the warehouse | Prefabricated data cubes built by the MOLAP engine
Known environment because of its relational nature        | Somewhat unfamiliar environment complicated by proprietary standards
Lower speed because cubes are generated on the fly        | Faster access
Which is the better choice? The business organizations need to decide based on user requirements, budgetary considerations, query performance requirements and the complexity of the queries.
Summary
- ODS is a subject-oriented, Integrated, volatile, current valued data store, containing
only corporate data
- ODS provide a unified view enabling the management to understand business models
better.
- ETL is a process of populating ODS with data
- A Data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision making process.
- Data warehouse helps in decision support like reporting data, analyzing data and
knowledge discovery through data mining.
- The types of data mart are base models and hybrid models.
- Data warehouse architecture has four components: the source data component, data staging component, data storage component and information delivery component.
- A cluster is a set of loosely coupled SMP machine connected by an interconnect.
- Data warehouse design includes conceptual model, logical model, normally
combined into one, and physical model.
- The popular schema includes star schema, snowflake and constellation schema.
- OLAP characteristics are called FASMI characteristics
- Data cube implementations are precompute and store all, precompute and store
none and Precompute and store some.
- OLAP models include MOLAP and ROLAP.
DID YOU KNOW?
1. What is the difference between ODS, Data mart and Data warehousing?
2. What is the difference between database, data warehouse and OLAP?
3. What is the link between Data warehouse and Business Intelligence?
Short Questions
1. What is the need for data warehouse?
2. What is the difference between data warehouse and database?
3. What is the standard definition of Data warehouse?
4. What is the difference between Data warehouse and ODS?
5. What are the advantages and disadvantages of Data warehouse?
6. How to use data warehouse strategically?
7. Enumerate some of the data warehouse issues.
8. What is the difference between Data warehouse and data mart?
9. What is meant by meta data?
10. What are the basic components of Data warehouse?
11. What is ETL?
12. What is meant by dimensional modeling?
13. What is a fact table and dimension lookup table?
14. What is meant by OLAP?
15. What are the characteristics of a OLAP system?
16. What is the difference between OLAP and OLTP system?
17. Enumerate some of the OLAP cube operations.
18. What are the different types of OLAP models?
19. What are the advantages and disadvantages of ROLAP and MOLAP?
20. What is the difference between ROLAP and MOLAP models?
Long Questions
1. Explain in detail the ODS design.
2. Explain in detail the Data warehouse architecture.
3. Explain how data mining can be integrated to OLAP model?
4. Explain in detail the Data warehouse design.
5. Explain the problem of Slowly changing dimension problem. Suggest how data
warehouse schema solves this problem.
6. Explain in detail the OLAP operations.
7. Explain in detail the OLAP models and its implementation.
UNIT V
APPLICATIONS OF DATA MINING
INTRODUCTION
Data mining is used extensively in a variety of fields. This chapter presents some of the domain-specific data mining applications. The social implications of data mining technologies are discussed, and the tools required to implement them are presented. Finally, some of the latest developments, like data mining in the domains of text mining, spatial mining and web mining, are mentioned briefly.
Learning objectives
- To study some of the sample data mining applications
- To study the social implications of data mining applications
- To explore some of the latest trends of data mining in the areas of text, spatial data
and web mining.
- To discuss the tools that are available for data mining
5.1 SURVEY OF DATA MINING APPLICATIONS
Data mining and warehousing technologies are used widely now in different domains.
Some of the domain areas are identified and some of the sample applications are mentioned
below.
Business
Predicting the future is a dominant theme in business. Many applications are reported
in the literature. Some of them are listed here
- Predicting the bankruptcy of a business firm
- Prediction of bank loan defaulters
- Prediction of interest rates for corporate funds and treasury bills
- Identification of groups of insurance policy holders with average claim cost
Data visualization is also used extensively along with data mining applications whenever
a huge volume of data is processed. Detecting credit card frauds is one of the major
applications deployed by credit card companies that exclusively use data mining technology.
Telecommunication
Telecommunication is an attractive domain for data mining applications because telecom industries have a huge pile of data. Typical data mining applications are:
- Trend analysis and Identification of patterns to diagnose chronic faults
- To detect frequently occurring alarm episodes and its prediction
- To detect bogus calls, fraudulent calls and identification of its callers
- To predict cellular cloning fraud.
Marketing
Data mining applications traditionally enjoyed great prestige in marketing domain.
Some of the applications of data mining in this area include
- Retail sales analysis
- Market basket analysis
- Product performance analysis
- Market segmentation analysis
- Analysis of mail depth to identify customers who respond to mail campaigns.
- Study of travel patterns of customers.
Web analysis
Web provides an enormous scope for data mining. Some of the important applications
that are frequently mentioned in the data mining literature are
- Identification of access patterns
- Summary reports of user sessions, distribution of web pages, frequently
used/visited pages/paths.
- Detection of location of user home pages
- Identification of page classes and relationships among web pages
- Promotion of user websites
- Finding affinity of the users after subsequent layout modification.
Medicine
The field of medicine has always been a focus area for the data mining community. Many data mining applications have been developed in medical informatics. Some of the applications in this category include
- Prediction of diseases given disease symptoms
- Prediction of effectiveness of the treatment using patient history
Applications in pharmaceutical companies are always of interest to data mining researchers. Here the projects are mostly discovery-oriented, like the discovery of new drugs.
Security
This is another domain that traditionally enjoys much attention from the data mining community. Some of the applications mentioned in this category are
- Face recognition/Identification
- Biometric projects like identification of a person from a large image or
video database.
- Applications involving multimedia retrieval are also very popular.
Scientific Domain
Applications in this domain include
- Discovery of new galaxies
- Identification of groups of houses based on house type/geographical location
- Identification of earthquake epicenters
- Identification of similar land use
5.2 SOCIAL IMPACTS OF DATA MINING
Data mining has plenty of applications. Many of them are ever-present (ubiquitous) data mining applications which affect us in our daily life. Some examples are web search engines, web services like recommender systems, intelligent databases, and email agents, all of which have a strong influence on our lives. Web tracking can help an organization to develop a profile of its users. Applications like CRM (Customer Relationship Management) help the organization to cater to the needs of customers in a personalized manner, and help it to organize its products and catalogues and to identify, market and organize its facilities.
One of the recent issues that has cropped up is the question of privacy of the data. When organizations collect millions of customer records, one of the major concerns is how the business organizations use them. These questions have created much debate on a code of conduct for data mining.
Some of these issues are looked at in the context of fair information practice principles, which govern the quality, purpose, usage, security and accountability of private data.
The report says that the customers should have a say in how their private data should
be used. The levels are
1. Do not allow any analytics or data mining
2. Internal use of the organization
3. Allow data mining for all uses.
These issues are just the beginning. The sheer amount of data and the purpose of data mining algorithms to explore hidden knowledge will generate great concerns and legal challenges.
Some of the fair information practice principles are:
1. A clear purpose and usage should be disclosed at the data collection stage itself.
2. Openness with regard to developments, practices, and policies with respect to private data.
3. Security safeguards to ensure that private data is secured, taking care of loss of data, unauthorized data access, modification or disclosure.
4. Participation of the people concerned.
Privacy-preserving data mining is a new area of data mining concerned with protecting privacy during the data mining process. The aim is to avoid the misuse of data while getting all the benefits that data mining research can bring to humanity.
5.3 DATA MINING CHALLENGES
New data mining algorithms are expected to encounter more diverse data sources/
types of data that involve additional complexities that need to be tackled. Some of the
potential data mining challenges are listed below
Massive datasets and high dimensionality
Huge databases provide a combinatorially explosive search space for model induction, and this may produce patterns that are not always valid. Hence data mining algorithms need
1. Robustness and efficiency
2. Good approximation methods
3. Scaling up of existing algorithms
4. Parallel processing in data mining
Mining methodologies and User Interaction issues
Mining different levels of knowledge is a great challenge. There are different types of knowledge, and different kinds of knowledge may be required at different stages. This requires that the database be used from different perspectives, and the development of data mining algorithms for all of them is a great challenge.
User Interaction problems
Data mining algorithms are usually interactive in nature, as users are expected to interact with the KDD process at different points of time. The quality of the data mining algorithms can be rapidly improved by incorporating domain information. This helps to focus and speed up the algorithms.
This requires the development of high-level data mining query languages to allow
users to describe ad-hoc data mining tasks by facilitating the necessary data. This must be
integrated with the existing database or data warehouse query language and must be
optimized for efficient and flexible data mining.
The discovered knowledge should also be expressed in such a manner that the user can understand it. This involves the development of high-level languages, visual representations, or similar forms, and requires the data mining system to adopt knowledge representation techniques like tables, trees, etc.
Data handling problems
Managing the data is always a quite challenge for data mining algorithms. Data mining
algorithms are supposed to handle
1. Non standard data
2. Incomplete data
3. Mixed data involving numeric, symbolic, image and text.
Rapidly changing data pose great problems for the data mining algorithms. Changing
data make previously discovered patterns invalid. Hence the development of algorithms
with the incremental capability is required.
Also, the presence of spurious data in the dataset leads to over-fitting of the models. Suitable regularization and re-sampling methodologies need to be developed to avoid overfitting.
Assessment of patterns is a great challenge. The algorithms can uncover thousands of patterns, many of which are useless to the user and lack novelty. Hence the development of suitable metrics that assess the interestingness of the discovered patterns is a great challenge.
Also most of the data mining algorithms deal with multimedia data, which are stored
in a compressed form. Handling a compressed data is a great challenge for data mining
algorithms.
Performance challenges
Development of data mining algorithms that are efficient and scalable is a great
challenge. Algorithms of exponential complexity are of no use. Hence from database
perspective, efficiency and scalability are key issues.
Modern data mining algorithms are expected to handle interconnected data sources of complex data objects like multimedia data, spatial data, temporal data, or hypertext data. Hence the development of parallel and distributed algorithms to handle this huge and diverse data is a great challenge.
5.4 TEXT MINING
This section focuses on the role of data mining in text mining. A major amount of information is available in text databases, which consist of large collections of documents from various sources like books, research papers and so on. This data is semi-structured.
The data may contain a few structured fields like author and title, while components like the abstract and contents are unstructured.
Two major areas that are often associated with text are information retrieval and text mining.
An on-line library catalog system is an example of an information retrieval system, where a relevant document is retrieved based on a user query.
Thus information retrieval is a field concerned with the organization and retrieval of information from a large collection of text-related data. Unlike databases, where the issues are concurrency control, recovery and transaction management, text retrieval has its own problems, like unstructured data and approximate search. Information can be pulled through queries, or pushed by the system; in the latter case the systems are called filtering systems or recommender systems.
Basic measures of text retrieval
The measures of the accuracy of information retrieval are precision and recall.
Precision
This is the measure which indicates the percentage of retrieved documents that are in fact relevant to the query.
Recall
Recall is a metric which indicates the percentage of documents relevant to the query that were in fact retrieved. The two measures are defined as

Precision = |{relevant} ∩ {retrieved}| / |{retrieved}|

Recall = |{relevant} ∩ {retrieved}| / |{relevant}|
Based on precision and recall, a commonly used trade-off measure called the F-score can be computed as

F-score = (recall × precision) / ((recall + precision) / 2)
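A small Python sketch of these three measures is given below; the document identifiers used in the example call are made up.

```python
def retrieval_measures(relevant, retrieved):
    """Precision, recall and F-score for one retrieval run."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved                       # relevant AND retrieved
    precision = len(hits) / len(retrieved)
    recall = len(hits) / len(relevant)
    f_score = (recall * precision) / ((recall + precision) / 2) if hits else 0.0
    return precision, recall, f_score

# 3 of the 4 retrieved documents are relevant; 3 of the 5 relevant ones were found.
print(retrieval_measures(relevant={"d1", "d2", "d3", "d4", "d5"},
                         retrieved={"d1", "d2", "d3", "d9"}))
# (0.75, 0.6, 0.666...)
```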
Information retrieval
Information retrieval means that, based on a query, the relevant information can be retrieved. A good example is a web search engine: based on the query, the search engine matches keywords against a bulk of texts to retrieve the information the user requested. The retrieval problem can be visualized as
1. a document selection problem
2. a document ranking problem
In the document selection problem, the query is considered as specifying constraints for selecting documents. One typical system is the Boolean retrieval system, where the user can give a query like "bike or car", and the system returns the documents that fulfill the requirement.
In the document ranking problem, the documents are ranked based on a relevance factor. Most systems present a ranked list based on the user's keyword query. The goal is to approximate the degree of relevance of a document with a score computed from the frequency of words in the document and the collection.
One popular method is the vector-space model. In this method both the document and the query are represented as vectors in the high-dimensional space of all possible keywords, and a similarity measure between the document vector and the query vector is used to rank the documents.
The steps of the vector-space model are given as
1. The first step is called tokenization. This is a preprocessing step whose purpose is to identify keywords. A stop-list is used to avoid indexing irrelevant words like "the", "a", etc.
2. Word stemming is then used to group together words that share a common stem.
3. Term frequency is a measure of the number of occurrences of a term in a document. A term-frequency matrix TF(d, t) associates each term t with each document d; its value is zero if the document does not contain the term and non-zero otherwise. One common definition is

TF(d,t) = 0 if freq(d,t) = 0, and 1 + log(1 + log(freq(d,t))) otherwise.
The importance of the term t is captured by a measure called the Inverse Document Frequency (IDF):

IDF(t) = log( (1 + |d|) / |d_t| )

where d is the document collection and d_t is the set of documents that contain the term t.
4. TF and IDF are then combined to form the resultant measure

TF-IDF(d,t) = TF(d,t) × IDF(t)
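The TF, IDF and TF-IDF measures defined above can be sketched directly in Python. The three sample documents are invented assumptions, and the tokenization is deliberately naive (whitespace splitting, no stop-list or stemming).

```python
import math

docs = {
    "d1": "data mining extracts knowledge from data",
    "d2": "text mining works on text documents",
    "d3": "spatial data has a location component",
}

def tf(d, t):
    """TF(d,t) = 0 if freq is 0, else 1 + log(1 + log(freq))."""
    freq = docs[d].split().count(t)
    return 0.0 if freq == 0 else 1 + math.log(1 + math.log(freq))

def idf(t):
    """IDF(t) = log((1 + |d|) / |d_t|)."""
    d_t = [d for d in docs if t in docs[d].split()]   # documents containing t
    return math.log((1 + len(docs)) / len(d_t))

def tf_idf(d, t):
    return tf(d, t) * idf(t)

print(tf_idf("d1", "data"), tf_idf("d1", "mining"))
```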
Text Indexing
The popular text-indexing techniques are
1. Inverted index and
2. Signature file
An inverted index is an index structure that maintains two hash-indexed or B+-tree index tables: a document table and a term table. The document table lists, for each document identifier, the terms that occur in that document; the term table lists, for each term, the identifiers of the documents in which it occurs, sorted by some relevance factor.
A signature file is another method, which stores a signature record for each document. A signature is a fixed-size bit vector: a bit is set to 1 if the corresponding term occurs in the document, otherwise it is set to 0.
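A toy sketch of the term-to-document side of an inverted index is shown below; the documents are the same made-up examples used earlier, and a real system would also store positions, frequencies or relevance scores.

```python
from collections import defaultdict

docs = {
    "d1": "data mining extracts knowledge from data",
    "d2": "text mining works on text documents",
    "d3": "spatial data has a location component",
}

# Build the term table: each term maps to the set of documents containing it.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

# Answering a keyword query is a lookup plus a set intersection.
query = ["data", "mining"]
result = set.intersection(*(inverted_index[t] for t in query))
print(result)          # {'d1'}
```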
Query processing
Once the indexing is done, the retrieval system can answer a keyword query by looking up the documents that contain the query keywords. A counter is maintained for each document and updated for each query term; the documents that match the terms accumulate scores and are then fetched. Relevance feedback can be used to improve the performance.
One major limitation of these methods is that they are based on exact matching. The problems associated with exact matching are the synonym problem, where the vocabulary differs, and the polysemy problem, where the same word means different things in different contexts.
Dimensionality reduction
The numbers of terms and documents are huge, which leads to inefficient computation. Mathematical dimensionality reduction techniques are used to reduce the vectors so that the application can be implemented effectively. Some of the techniques used are latent semantic indexing, probabilistic latent semantic analysis, and locality preserving indexing.
The major approaches of text mining are
1) Keyword-based approach
2) Tagging approach
3) Info-extraction approach
The keyword-based approach discovers relationships at a shallow level and finds co-occurring patterns.
Tagging can be a manual process or an automatic categorization of documents.
Information extraction approach is more advanced and may lead to the discovery of
deep knowledge. But it requires semantic analysis of the text using NLP or machine learning
approaches.
Text-mining Tasks
1) Keyword based association analysis
2) Document Classification analysis
3) Document Clustering analysis
Keyword Based analysis
This analysis collects a set of keywords or terms based on their frequency and extracts association or correlation relationships among them.
Association analysis first extracts the terms and preprocesses them using a stop-word list. Only the essential keywords are taken and stored in the database in the form
< ID, List of Keywords >
Then association analysis is performed on them.
Words that frequently appear together form a term or a phrase. Association analysis can discover compound associations, which are domain-dependent terms or phrases, as well as non-compound associations. Hence association analysis helps to tag terms and phrases automatically and also helps in reducing meaningless results.
Document Classification Analysis
Classification helps to classify documents into classes so that document retrieval is
faster.
Classification of text is different from classification of relational data, because relational data is well structured while text databases are not: the keywords associated with a document are not organized into any fixed set of attributes. Therefore traditional classifiers like decision trees are not as effective for text mining.
Normally the Classifications that are used for text classifications are
1) K-nearest neighbor Classifier
2) Bayesian Classifier
3) Support vector machine
The k-nearest neighbour classifier uses the similarity measure of the vector-space model for classification. All the documents are indexed, and each index entry is associated with a class label. When a test document is submitted, it is treated as a query, and the documents most similar to the query document are returned by the classifier; the class distribution among these neighbours determines the label. The classifier can be refined by tuning the query to obtain good accuracy.
Bayesian Classifier is another technique that can be used for effective document
Classification.
Support vector machine can be used to perform classification because they work
very well in the higher dimensional space.
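A hedged sketch of k-nearest-neighbour document classification over vector-space (TF-IDF) representations, using scikit-learn, is given below. The training texts, labels and parameter choices (k = 3, cosine distance) are assumptions made only for the illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_texts = [
    "bank loan interest rate treasury",
    "credit card fraud detection bank",
    "patient disease symptom treatment",
    "drug discovery clinical treatment",
]
train_labels = ["business", "business", "medicine", "medicine"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)          # documents as TF-IDF vectors

knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X, train_labels)

# A test document is treated like a query vector and labelled by its neighbours.
test = vectorizer.transform(["bank loan interest rate prediction"])
print(knn.predict(test))                           # ['business']
```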
Association-based Classification is effective for text mining. It extracts a set of
associated frequently occurring text patterns.
- It extracts keywords and terms, and association analysis is applied to them.
- Concept hierarchies of the Keywords/ terms are obtained using WordNet or expert
knowledge, and then class hierarchies are formed.
- Association mining can then be used to discover a set of associative terms that can maximally distinguish one class of documents from others. The association rules associated with each document class are derived.
- Such rules can be ordered based on the discriminative power and occurrence
frequency.
- These rules are then used to classify the new documents.
Document Clustering Systems
- Document clustering is one of the most important topics in text mining.
- Often, spectral clustering is first used to reduce the dimensionality of the document representation. The mixture model clustering method then models the text data with a mixture model.
It performs clustering in two steps:
1) Estimate the model parameters based on the text data and prior knowledge.
2) Infer the clusters based on the estimated model parameters.
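As a rough illustration of this two-step idea, the sketch below fits a Gaussian mixture model over TF-IDF vectors with scikit-learn. It omits the spectral dimensionality-reduction step mentioned above, and the documents, the number of components and the diagonal covariance choice are assumptions for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture

docs = [
    "stock market interest rates and bank loans",
    "bank credit and loan interest analysis",
    "disease symptoms and patient treatment",
    "clinical treatment of patient symptoms",
]

X = TfidfVectorizer().fit_transform(docs).toarray()

gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
gmm.fit(X)                      # step 1: estimate the model parameters
clusters = gmm.predict(X)       # step 2: infer the clusters from the parameters
print(clusters)                 # e.g. [0 0 1 1], two topical groups (labels may swap)
```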
5.5 SPATIAL DATA MINING
Spatial data mining is the process of discovering hidden but potentially useful patterns
from a large set of spatial data. Spatial data are data that have a location component; the
location refers to a physical space, such as an address or a longitude/latitude pair.
Spatial data can be stored in a spatial database. Spatial databases are constructed
using special data structures or Indices using distance or topological information. Some of
the spatial data characteristics are mentioned below
1) Rich data types
2) Spatial relationship among variables.
If a house at one point is affected by an earthquake, most probably the neighboring
houses would also be affected. This property is called spatial autocorrelation.
3) Spatial autocorrelation among features.
4) Observations that are not independent.
Traditional statistics can be applied only when the observations are independent of their
neighbors, which is rarely true for spatial data. Hence the term geostatistics is normally used
for spatial data over continuous space, while spatial statistics is often associated with discrete space.
Data Input
The inputs for spatial data mining algorithms are more complex than those of classical
algorithms, as they include extended objects such as points, lines and polygons.
The data input also includes spatial attributes and non-spatial attributes; the non-spatial
attributes include information like name, population, disease type, etc.
Spatial queries
Typical spatial queries are:
Find all houses near the lake.
Find the regions affected by fire.
Spatial queries, unlike traditional queries, do not rely only on arithmetic comparison operators
like < and >. Instead, they use operators like near, contained in, overlap, etc. Thus the
relationships involved are spatial in nature.
More often, spatial queries are categorized as region (range) queries asking for the objects
within a region, nearest-neighbor queries to identify the closest objects, and distance scans
to identify objects within a certain distance.
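To make these query types concrete, the fragment below is a naive in-memory sketch (a real spatial database would use an R-tree or similar index); it answers a distance scan and a nearest-neighbour query over a handful of invented points carrying a non-spatial attribute.

```java
import java.util.*;

public class SpatialQuerySketch {
    // A spatial object: a location plus a non-spatial attribute (its name).
    static class Place {
        final String name; final double x, y;
        Place(String name, double x, double y) { this.name = name; this.x = x; this.y = y; }
        double distanceTo(double px, double py) { return Math.hypot(x - px, y - py); }
    }

    public static void main(String[] args) {
        List<Place> houses = Arrays.asList(
            new Place("house A", 1.0, 1.0),
            new Place("house B", 2.5, 0.5),
            new Place("house C", 6.0, 7.0));
        double lakeX = 2.0, lakeY = 1.0;

        // Distance scan: all houses within 2 units of the lake ("near the lake").
        for (Place h : houses)
            if (h.distanceTo(lakeX, lakeY) <= 2.0)
                System.out.println(h.name + " is near the lake");

        // Nearest-neighbour query: the single closest house to the lake.
        Place nearest = Collections.min(houses,
                Comparator.comparingDouble(h -> h.distanceTo(lakeX, lakeY)));
        System.out.println("Nearest house: " + nearest.name);
    }
}
```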
The relationships that are often covered by the spatial queries are shown in Table 5.1.
Table 5.1 Spatial Operations

Disjoint: Region A is disjoint from region B if they have no common points.
Overlaps or Intersects: At least one common point exists between region A and region B.
Equals: Regions A and B consist of exactly the same points.
Covered by / Inside / Contained: Region A is covered by (inside) region B if all the points of A are covered by region B.
Covers / Contains: Region B covers (contains) region A iff A is covered by (inside) B.
Data Mining Tasks for spatial Data
- Spatial OLAP
- Association Rules.
- Classification / Trend analysis
- Spatial Clustering methods.
Spatial OLAP
The kinds of dimensions present in spatial data include:
Non-Spatial dimension
This represents non-spatial data.
Spatial-to-non-Spatial dimension
This tackles data at two levels. The base data is spatial but the generalizations are not
spatial. They are non-spatial in nature.
Spatial-to-Spatial dimension
This dimension includes data which are spatial both at the primitive level and at the higher
levels of generalization.
The measures of the spatial data cube can be of two types:
Numeric: these contain only numeric data.
Spatial: these may be a set of pointers pointing to the spatial objects.
Once the cube is constructed, the queries can be answered just as in the non-spatial case.
Mining Association rules
Association rules can be applied to spatial data. The extracted association rules are
of the form

A => B (s%, c%)

where A and B are sets of spatial or non-spatial predicates, s% is the support of the rule
and c% is its confidence.
Spatial associations occur at different levels. For example, the following is an association
rule:
is_a(x, school) ^ close_to(x, big playground) => close_to(x, city) (50%, 50%)
Association analysis can also identify groups of particular features that frequently appear
close to each other; this is called mining of spatial co-locations.
Spatial Clustering methods
Spatial clustering is the process of grouping spatial objects into clusters. This process
ensures that the spatial objects clustered together are highly similar and are dissimilar to the
objects in the other clusters. All the classical clustering algorithms can be used to cluster
spatial data.
Spatial Classification algorithm
Spatial classification analyzes spatial objects to derive classification schemes in terms of
properties such as neighborhood.
This requires the identification of spatially related factors; by performing relevance analysis,
the best attributes can be selected. Then traditional classification algorithms like decision
trees can be used to classify the spatial data.
Spatial trend analysis can also be applied to detect changes and trends along a spatial
dimension; it extracts trends of spatial or non-spatial data changing with space.
Sometimes both time and space change, traffic flow being an example. A spatio-temporal
classification scheme can be built for these sorts of data.
5.6 WWW MINING
Currently the Web is the largest data source. The Web has many characteristics that make
mining it a challenging task. Some of these characteristics are listed below:
1) The amount of data present on the Web is huge.
2) The Web is a repository where all kinds of data are present, ranging from structured
tables to semi-structured web pages and unstructured text and multimedia files.
3) The data are heterogeneous in nature, and a significant amount of the information
is linked. A page that is referred to by many other pages is called an authoritative
page.
4) Web data contain noise, because much of the data is unauthenticated; anyone can
put any information on the net.
5) The Web is dynamic and its content changes constantly. Changing content and
management of dynamic data are big concerns.
Web mining aims to discover hidden information or knowledge of web data. The web
data includes web hyperlink structure, web page content and web usage data. Based on
these, the web mining tasks can be categorized into three categories
Web Structure Mining
Web pages are connected by links (hyperlinks). These hyperlinks can be mined to get
useful information, such as which web pages are important. Traditional data mining
algorithms are not directly applicable here, as a relational table normally has no link structure.
Web Content Mining
Web content mining tasks mine web page contents to get useful information or
knowledge. It is useful to cluster similar web pages and is useful in the classification of web
pages. The typical applications include customer profiling etc.
Web Usage Mining
Web usage mining refers to the process of mining user logs to discover user access
patterns.
Web mining is similar to data mining. However, in traditional data mining the data is usually
already collected, whereas in web mining data collection can itself be a substantial task. Once
the data is collected, it can be preprocessed, and then mining algorithms can be applied to it.
Web Structure Mining
This represents the analysis of the link structure of the web. A website X containing
document A may have a hyperlink to another document B on website Y; such a link suggests
that document B is useful from the point of view of document A.
HITS (Hyperlink Induced Topic Search) is a common algorithm for finding documents
related to a given topic; in particular, it aims to find the authoritative pages for the topic.
The algorithm accepts a set of web page references as input, called the seed set; typically
it contains around 200 links. The HITS algorithm adds more references to this set to expand
it into a set T, called the target set. The web links are measured as weights: if a page
contains references to authority sites, or if authority sites have links to the page, then the
page is weighted more. The outgoing links determine the weight of a hub.
1) Accept the set S. Let p denote a page of the set S.
2) Initialize the hub weight to 1 for each page of the set S.
3) Initialize the authority weight to 1 for each page of the set S.
4) Let the expression p -> q represent that page p has a hyperlink to web page q.
5) Update the authority weight and the hub weight for each page p of the set S:
   authority-weight(p) = sum of hub-weight(q) over all pages q such that q -> p
   hub-weight(p) = sum of authority-weight(q) over all pages q such that p -> q
6) Exit.
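A compact sketch of these update steps is shown below. It assumes a small hand-coded link graph, runs a fixed number of iterations and normalises the weights after each pass, a common practical addition that the step list above leaves implicit.

```java
import java.util.*;

public class HitsSketch {
    public static void main(String[] args) {
        // p -> q means page p has a hyperlink to page q (a toy seed/target set).
        Map<String, List<String>> links = new HashMap<>();
        links.put("A", Arrays.asList("B", "C"));
        links.put("B", Arrays.asList("C"));
        links.put("C", Arrays.asList("A"));
        links.put("D", Arrays.asList("C"));

        // Steps 2 and 3: initialise hub and authority weights to 1.
        Map<String, Double> hub = new HashMap<>(), auth = new HashMap<>();
        for (String p : links.keySet()) { hub.put(p, 1.0); auth.put(p, 1.0); }

        for (int iter = 0; iter < 20; iter++) {
            // Step 5a: authority-weight(p) = sum of hub-weight(q) over all q -> p.
            Map<String, Double> newAuth = new HashMap<>();
            for (String p : links.keySet()) newAuth.put(p, 0.0);
            for (Map.Entry<String, List<String>> e : links.entrySet())
                for (String q : e.getValue())
                    newAuth.merge(q, hub.get(e.getKey()), Double::sum);

            // Step 5b: hub-weight(p) = sum of authority-weight(q) over all p -> q.
            Map<String, Double> newHub = new HashMap<>();
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                double sum = 0;
                for (String q : e.getValue()) sum += newAuth.getOrDefault(q, 0.0);
                newHub.put(e.getKey(), sum);
            }

            // Normalise so the weights do not grow without bound.
            normalize(newAuth);
            normalize(newHub);
            auth = newAuth;
            hub = newHub;
        }
        System.out.println("Authority weights: " + auth);
        System.out.println("Hub weights: " + hub);
    }

    static void normalize(Map<String, Double> w) {
        double sum = 0;
        for (double v : w.values()) sum += v * v;
        final double norm = Math.sqrt(sum);
        if (norm > 0) w.replaceAll((k, v) -> v / norm);
    }
}
```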
Some of the problems associated with this algorithm are that it does not take
automatically generated hyperlinks into account and it does not exclude irrelevant or less
relevant documents. The algorithm also suffers from topic hijacking, where many hyperlinks
point to the same web page, and from drifting, where the algorithm fails to concentrate on
the specific topic mentioned by the query.
Page-rank: Web graph
The semantic structure of the web can also be constructed based on page-to-block and
block-to-page relationships.
The block-to-page relationship captures the links from the several semantic blocks present
in a web page to other pages. If Z represents the block-to-page matrix, then
Z[i][j] = 1/s_i if there is a link from block i to page j, and 0 otherwise, where s_i is the
number of pages to which block i links.

The page-to-block relationship is defined as

X[i][j] = f_{p_i}(b_j) if block b_j is contained in page p_i, and 0 otherwise.

f_p is a function that assigns an importance value to every block b in the page p. The
function is empirically defined as the ratio between the size of the block b and the distance
between the center of b and the center of the screen, multiplied by a normalization factor;
the factor is chosen to make the sum of f_p(b) over the blocks of the page equal to 1.
Based on the values of X and Z, the web page graph can be constructed as

Web page graph = XZ
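The construction of the web page graph from the two matrices is just a matrix product. The tiny illustration below uses made-up values of X and Z (two pages, three blocks) chosen only to satisfy the definitions above.

```java
public class WebGraphSketch {
    public static void main(String[] args) {
        // X[i][j] = importance f_{p_i}(b_j) of block j within page i (rows sum to 1).
        double[][] X = {
            {0.7, 0.3, 0.0},   // page 0 contains blocks 0 and 1
            {0.0, 0.0, 1.0}    // page 1 contains block 2
        };
        // Z[i][j] = 1/s_i if block i links to page j, 0 otherwise.
        double[][] Z = {
            {0.0, 1.0},        // block 0 links only to page 1
            {0.5, 0.5},        // block 1 links to both pages
            {1.0, 0.0}         // block 2 links only to page 0
        };
        double[][] W = multiply(X, Z);   // web page graph = XZ
        for (double[] row : W) System.out.println(java.util.Arrays.toString(row));
    }

    static double[][] multiply(double[][] a, double[][] b) {
        double[][] c = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++)
            for (int k = 0; k < b.length; k++)
                for (int j = 0; j < b[0].length; j++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }
}
```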
Web Content Mining
Here patterns are extracted from online sources such as HTML files, text documents,
e-mail messages etc. For example, summarization of a web page is a data-mining task.
There are two approaches being explored here.
Local Knowledge-base model
In this model, web pages are collected and then categorized; for example, the categories
may be games or education, and references to many web pages are collected under each
category. Given a query, a category is first selected and then the search is performed over
the web pages of that category.
Agent based model
Agents capture the requirements of the user and then use artificial intelligence techniques
to discover and organize the documents. These techniques range from user profiling to
customized or personalized web agents.
Web Usage Mining
Web usage mining analyzes the behavior of users and customers. Basically, web log
analysis is carried out for web usage mining. The discovered access patterns are used by
organizations to devise strategies for various marketing applications.
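As a small illustration of web log analysis, the sketch below parses a hypothetical, already-cleaned log (one "user page" pair per line) and counts page visits per user; real server logs would need far more preprocessing, such as session identification and robot filtering.

```java
import java.util.*;

public class WebUsageMiningSketch {
    public static void main(String[] args) {
        // Hypothetical, already-cleaned log entries: "userId pageVisited".
        String[] log = {
            "u1 /home", "u1 /products", "u1 /cart",
            "u2 /home", "u2 /products", "u2 /products",
            "u3 /home", "u3 /support"
        };

        // Count how often each user visited each page.
        Map<String, Map<String, Integer>> visits = new TreeMap<>();
        for (String entry : log) {
            String[] parts = entry.split("\\s+");
            visits.computeIfAbsent(parts[0], u -> new TreeMap<>())
                  .merge(parts[1], 1, Integer::sum);
        }

        // These per-user access patterns are the input to further mining,
        // e.g. frequent page sets or navigation-path analysis.
        visits.forEach((user, pages) -> System.out.println(user + " -> " + pages));
    }
}
```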
5.7 DISTRIBUTED DATA MINING
Distributed Computing plays an important role in the data mining process. This is due
to the advances in computing and communication over wired and wireless networks. Data
mining and knowledge discovery in large amounts of data can benefit from the use of
parallel and distributed computational environments. This includes different distributed
sources of voluminous data, multiple compute nodes, and distributed user community. The
field of distributed data mining (DDM) deals with this problem: mining distributed data
while paying careful attention to the distributed resources.
The need for distributed data mining arises from the fact that data mining requires huge
amounts of data; hence there is a need to distribute the data and the computation to provide
a scalable system. Moreover, the data is often inherently distributed across different databases.
Distributed data mining uses many technologies for distributed association mining,
rules mining and clustering. Some of the very popular implementations are for distributed
rule mining. This distributed rule mining involves some of the approaches that are mentioned
here
Synchronous Tree Construction (data parallelization)
In this model there is no need to move the data, but the approach incurs a high
communication cost as the tree becomes bushy.
Partitioned Tree Construction (task parallelization)
In this model the processors work independently once the partitioning is complete,
but the model suffers from load imbalance and a high cost of data movement.
Hybrid Algorithm
This method combines the good features of the two approaches, adapting dynamically
according to the size and shape of the trees.
Often distributed data mining uses a host of technologies to mine the data. There is no
standard architecture for distributed data mining. For simplicity's sake, an open source
architecture called JAM (Java Agents for Meta-learning) is mentioned here for better
understanding. This architecture uses agents to support DDM. The agents are mobile: they
are used to generate and transport the trained classifiers, while meta-learning combines these
classifiers. This improves efficiency, as the classifiers are generated at the different data sites
in parallel, and each classifier is computed over the stationary data set at its site.
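The flavour of this approach can be sketched in a single process as shown below: independent "site" classifiers are trained on their local partitions, and a simple majority vote stands in for the meta-learning combiner (real meta-learning trains a second-level model on the base classifiers' predictions). The 1-nearest-neighbour site classifier and the data are invented for the example.

```java
import java.util.*;

public class DistributedVotingSketch {
    // Labelled example: a single numeric feature and a class label.
    static class Example {
        final double x; final String label;
        Example(double x, String label) { this.x = x; this.label = label; }
    }

    // A very simple local "classifier": 1-nearest-neighbour over the site's own data.
    static class SiteClassifier {
        final List<Example> localData;
        SiteClassifier(List<Example> localData) { this.localData = localData; }
        String predict(double x) {
            return Collections.min(localData,
                    Comparator.comparingDouble(e -> Math.abs(e.x - x))).label;
        }
    }

    public static void main(String[] args) {
        // Three sites, each holding its own stationary data partition.
        List<SiteClassifier> sites = Arrays.asList(
            new SiteClassifier(Arrays.asList(new Example(1.0, "low"), new Example(9.0, "high"))),
            new SiteClassifier(Arrays.asList(new Example(2.0, "low"), new Example(8.0, "high"))),
            new SiteClassifier(Arrays.asList(new Example(1.5, "low"), new Example(7.5, "high"))));

        // Combine the transported classifiers by majority vote on a new observation.
        double query = 6.8;
        Map<String, Integer> votes = new HashMap<>();
        for (SiteClassifier site : sites) votes.merge(site.predict(query), 1, Integer::sum);
        String combined = Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
        System.out.println("Combined prediction for " + query + ": " + combined);
    }
}
```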
5.8 TOOLS FOR DATA MINING AND CASE STUDIES
5.8.1 Data Mining OLE DB Miner
A data mining component is included in SQL Server 2000/2005. OLE DB for Data Mining
is also becoming a standard for data mining applications.
SQL Server 2000 introduced data mining features for the first time. Initially two algorithms
were introduced: Microsoft Decision Trees and Microsoft Clustering. The data mining
components became part of Microsoft Analysis Services.
Microsoft Analysis has two components
1. OLAP Services
2. Data mining
OLAP and data mining are complementary to each other.
What is the need for OLE DB?
Existing packages have many problems:
1. Each package has its own proprietary way of representing models, so the packages do
not communicate with each other; package A cannot easily communicate or interact with
package B.
2. Most packages are horizontal packages; hence there is a problem of integration
with user applications.
3. Another problem is that most data mining products extract data and store it in an
intermediate store. Data porting and transformation is a difficult and expensive operation;
instead, data mining could be applied directly to the data where it is stored.
Microsoft aims to remove the above problems and to provide an industry standard, so that
data mining algorithms can be easily plugged in, providing a common interface between the
data mining consumer and the data mining provider.
The basic architecture of the OLE DB miner is shown in Figure 5.2; Figure 5.1 below shows
the distributed data mining framework discussed in the previous section.
Figure 5.1 Distributed Data Mining Framework
(Components shown: for each of three sites, a DDM interface, an engine, a client/server layer
and a repository of classifiers; a configuration manager; and the distributed data mining
network connecting the sites.)
Microsoft has implemented an OLE DB provider based on the OLE DB for DM specification.
The data mining provider is part of Analysis Services. There is no sophisticated GUI, but
many wizards are available as part of Analysis Services to help with data mining (refer to
Figure 5.2).
Figure 5.2 Architecture of OLE DB Miner.
Microsoft has implemented two algorithms as part of SQL Server 2000: Microsoft Decision
Trees and Microsoft Clustering. Both the classification and the clustering algorithms are
highly scalable and suitable for large data sets.
One of the biggest advantages is that it is based on OLE DB for Data Mining. Suppose a
college wants to mine student data to gain insight into the user requirements; all it has to do
is include the data mining algorithms in its Student Information System (Figure 5.3).
Figure 5.3 OLE DB Analysis
(Figure 5.2 shows the data mining consumer, the OLE DB API, the data mining provider,
OLE DB and the data source; Figure 5.3 shows the DM Wizard, DM Editor and DM Browser
tools of the Analysis Server.)
The developer can create mining models using VB or C++, or using the wizards of the
Analysis Manager. The wizards generate data mining models and then issue text queries
similar to database queries.
Mining models are like containers: they do not store data but instead use data directly from
the RDBMS created using SQL Server. A mining model looks like this:
Figure 5.4: Student Information System
CREATE MINING MODEL Student
(
    Id   LONG KEY,
    Name TEXT DISCRETE,
    Age  LONG CONTINUOUS PREDICT
)
USING [Microsoft Decision Trees]
Once a model is created, the algorithm analyzes the input data. Any tabular data source can
be used as input for the model, provided there is an OLE DB driver for it. To be consistent,
SQL Server provides syntax similar to SQL, including commands like OPENROWSET to
access data from a remote site. Data does not have to be loaded ahead of time; this capability
is called in-place mining. After training is over, prediction queries can be issued, and the
user can browse the mining model to look at the discovered patterns.
5.8.2 WEKA Introduction With Case Studies
There is no single suite which provides all the data mining tasks; users must sometimes use
different suites for their requirements.
The WEKA workbench is a collection of machine learning algorithms for data mining. It is
designed in such a manner that the user can quickly explore data mining algorithms on their
datasets.
The major advantages of WEKA system are
1. It is open source software; hence it is free yet rich in features, and it is
maintainable and modifiable.
2. It provides many algorithms so that the user can easily explore, experiment and
compare classifiers.
3. It is developed in Java, so it is truly portable and can run on any machine.
4. It is easy to use.
The main limitation of WEKA is that all algorithms are main-memory based. This prevents
the use of WEKA for larger datasets; for larger datasets, some form of sub-sampling should
be used.
Exploring WEKA:
1. Preprocessing:
WEKA requires input in CSV format or in the native ARFF file format. Database access is
provided using JDBC, so data can be accessed from any database using SQL. WEKA
provides many filters to preprocess the data.
2. Classify
3. Cluster
4. Associate
5. Selection of attributes
6. visualize
WEKA provides a knowledge flow interface for specifying a data stream by graphically
connecting components representing data sources, preprocessing tools, learning algorithms,
evaluation and visualization tools.
WEKA provides an Experimenter component to run and compare different classification
and regression algorithms with different parameter values. It also makes it possible to
distribute the load across many machines using Java RMI.
Methods, Algorithms/Architecture:
WEKA provides a comprehensive set of useful algorithms covering a wide range of tasks:
filtering, clustering, classification, association rule learning and regression. To keep operations
as flexible as possible, WEKA is designed with a modular, object-oriented architecture, so
new algorithms can be included easily.
The WEKA implementation has a top-level package called core, which provides the global
data structures such as instances and attributes. Each data mining task is available in a
sub-package, such as classifiers, clusterers and associations.
This whole set of features makes WEKA an attractive, open source means to explore the
benefits of data mining.
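Because WEKA is a Java library, the same algorithms available in the GUI can be called programmatically. The fragment below is a minimal sketch of such usage; the class and method names are those of recent WEKA 3.x releases, and it assumes weka.jar is on the classpath and that the weather data of the following case study is available as weather.nominal.arff.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaJ48Sketch {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file and mark the last attribute (play) as the class.
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Build a J4.8 decision tree (WEKA's C4.5 implementation).
        J48 tree = new J48();
        tree.setOptions(new String[] { "-C", "0.25", "-M", "2" });
        tree.buildClassifier(data);
        System.out.println(tree);

        // 10-fold stratified cross-validation, as in the Explorer output.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}
```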
Case Study 1
Weka is effective in solving data mining problems. This case study aims to demonstrate
the use of Weka. The data set chosen for demonstration is the weather data.
The weather data is nominal and has 14 records. The attributes of the table are given below.
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
The aim is to decide whether the kid can play or not.
The first step is the collection of data. Weka requires that the data be given in a format
called ARFF, which stands for Attribute-Relation File Format.
The ARFF file of this database is given below.
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
The file first specifies the attribute list followed by the actual data.
Weka provides rich data mining functionalities. It provides a GUI using which the
user can select the required tasks.
For demonstration's sake, the ID3 algorithm is selected. When applied, Weka produces
the classification model given below.
=== Classifier model (full training set) ===
Id3
outlook = sunny
| humidity = high: no
| humidity = normal: yes
outlook = overcast: yes
outlook = rainy
| windy = TRUE: no
| windy = FALSE: yes
This is the decision tree produced by Weka. The terminal nodes are classes. The root
and internal nodes test the attributes.
It can be seen that the time taken to build the model is only 0.01 seconds. The quality of
the classification model is given by the confusion matrix, which can be analyzed using the
metrics discussed in the earlier sections.
=== Confusion Matrix ===
a b   <-- classified as
8 1 | a = yes
1 4 | b = no
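The figures reported by Weka can be recomputed directly from this confusion matrix. The short check below is plain arithmetic, not Weka code; it reproduces the accuracy and the precision, recall and F-measure of the 'yes' class.

```java
public class ConfusionMatrixCheck {
    public static void main(String[] args) {
        // Rows are actual classes, columns are predicted classes (a = yes, b = no).
        int tp = 8, fn = 1;   // actual yes: 8 predicted yes, 1 predicted no
        int fp = 1, tn = 4;   // actual no : 1 predicted yes, 4 predicted no

        double accuracy  = (double) (tp + tn) / (tp + tn + fp + fn);  // 12/14 = 0.857
        double precision = (double) tp / (tp + fp);                   // 8/9  = 0.889
        double recall    = (double) tp / (tp + fn);                   // 8/9  = 0.889
        double fMeasure  = 2 * precision * recall / (precision + recall);

        System.out.printf("accuracy=%.3f precision=%.3f recall=%.3f F=%.3f%n",
                accuracy, precision, recall, fMeasure);
    }
}
```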
The confusion matrix yields the following conclusions
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 12 85.7143 %
Incorrectly Classified Instances 2 14.2857 %
Kappa statistic 0.6889
Mean absolute error 0.1429
Root mean squared error 0.378
Relative absolute error 30 %
Root relative squared error 76.6097%
Total Number of Instances 14
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.889 0.2 0.889 0.889 0.889 0.844 yes
0.8 0.111 0.8 0.8 0.8 0.844 no
We can choose another algorithm, J4.8 of Weka (a Java implementation of C4.5), which
uses information gain (refer to Chapter 3).
The trace of the algorithm is shown below; the same kind of analysis can be made for this
result also.
J48 pruned tree
------------------
outlook = sunny
| humidity = high: no (3.0)
| humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = TRUE: no (2.0)
| windy = FALSE: yes (3.0)
Number of Leaves : 5
Size of the tree : 8
Time taken to build model: 0.06 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 7 50 %
Incorrectly Classified Instances 7 50 %
Kappa statistic -0.0426
Mean absolute error 0.4167
Root mean squared error 0.5984
Relative absolute error 87.5 %
Root relative squared error 121.2987 %
Total Number of Instances 14
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.556 0.6 0.625 0.556 0.588 0.633 yes
0.4 0.444 0.333 0.4 0.364 0.633 no
=== Confusion Matrix ===
a b   <-- classified as
5 4 | a = yes
3 2 | b = no
The following is the result of association rule mining. An association rule mining algorithm
tries to associate attributes to generate rules.
The Apriori algorithm (refer to Chapter 2) is used to generate the following association rules.
Apriori
=======
Minimum support: 0.15 (2 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17
Generated sets of large itemsets:
Size of set of large itemsets L(1): 12
Size of set of large itemsets L(2): 47
Size of set of large itemsets L(3): 39
Size of set of large itemsets L(4): 6
Best rules found:
1. outlook=overcast 4 ==> play=yes 4 conf:(1)
2. temperature=cool 4 ==> humidity=normal 4 conf:(1)
3. humidity=normal windy=FALSE 4 ==> play=yes 4 conf:(1)
4. outlook=sunny play=no 3 ==> humidity=high 3 conf:(1)
5. outlook=sunny humidity=high 3 ==> play=no 3 conf:(1)
6. outlook=rainy play=yes 3 ==> windy=FALSE 3 conf:(1)
7. outlook=rainy windy=FALSE 3 ==> play=yes 3 conf:(1)
8. temperature=cool play=yes 3 ==> humidity=normal 3 conf:(1)
9. outlook=sunny temperature=hot 2 ==> humidity=high 2 conf:(1)
10. temperature=hot play=no 2 ==> outlook=sunny 2 conf:(1)
We can notice that every rule is reported with its confidence factor.
Getting a good quality data set is a difficult task, so for testing algorithms it is better to
generate random (synthetic) datasets.
%
% Commandline
%
% weka.datagenerators.classifiers.classification.Agrawal -r
weka.datagenerators.classifiers.classification.Agrawal-S_1_-n_100_-F_1_-P_0.05 -S
1 -n 100 -F 1 -P 0.05
%
@relation weka.datagenerators.classifiers.classification.Agrawal-S_1_-n_100_-F_1_-
P_0.05
@attribute salary numeric
@attribute commission numeric
@attribute age numeric
@attribute elevel {0,1,2,3,4}
@attribute car {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20}
@attribute zipcode {0,1,2,3,4,5,6,7,8}
@attribute hvalue numeric
@attribute hyears numeric
@attribute loan numeric
@attribute group {0,1}
@data
110499.735409,0,54,3,15,4,135000,30,354724.18253,1
140893.779095,0,44,4,20,7,135000,2,395015.33902,1
119159.651677,0,49,2,1,3,135000,22,122025.085242,1
20000,52593.636537,56,0,9,1,135000,30,99629.621457,1
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: weka.datagenerators.classifiers.classification.Agrawal-S_1_-n_100_-
F_1_-P_0.05
Instances: 100
Attributes: 10
salary
commission
age
elevel
car
zipcode
hvalue
hyears
loan
group
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
J48 pruned tree
------------------
age <= 37: 0 (31.0)
age > 37
| age <= 62: 1 (39.0/4.0)
| age > 62: 0 (30.0)
Number of Leaves : 3
Size of the tree : 5
Time taken to build model: 0.14 seconds
=== Stratified cross-validation ===
=== Summary ===
Relative absolute error 21.4022 %
Root relative squared error 53.9683 %
Total Number of Instances 100
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.938 0.086 0.953 0.938 0.946 0.912 0
0.914 0.062 0.889 0.914 0.901 0.912 1
=== Confusion Matrix ===
a b   <-- classified as
61 4 | a = 0
3 32 | b = 1
The results of the 1R (OneR) algorithm (refer to Chapter 3) for a random data set are shown below.
=== Run information ===
Scheme: weka.classifiers.rules.OneR -B 6
Relation: weka.datagenerators.classifiers.classification.RDG1-S_1_-n_100_-a_10_-
c_2_-N_0_-I_0_-M_1_-R_10
Instances: 100
Attributes: 11
a0
a1
a2
a3
a4
a5
a6
a7
a8
a9
class
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
a5:
false -> c0
true -> c1
(74/100 instances correct)
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 74 74 %
Incorrectly Classified Instances 26 26 %
Kappa statistic 0.4992
Mean absolute error 0.26
Root mean squared error 0.5099
Relative absolute error 57.722 %
Root relative squared error 107.5058 %
Total Number of Instances 100
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.636 0.059 0.955 0.636 0.764 0.789 c0
0.941 0.364 0.571 0.941 0.711 0.789 c1
=== Confusion Matrix ===
a b   <-- classified as
42 24 | a = c0
2 32 | b = c1
=== Run information ===
Scheme: weka.classifiers.trees.Id3
Relation: weka.datagenerators.classifiers.classification.RDG1-S_1_-n_100_-a_10_-
c_2_-N_0_-I_0_-M_1_-R_10
Instances: 100
Attributes: 11
a0
a1
a2
a3
a4
a5
a6
a7
a8
a9
class
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Id3
a5 = false
| a1 = false: c0
| a1 = true
| | a8 = false: c0
| | a8 = true
| | | a0 = false: c0
| | | a0 = true
| | | | a2 = false: c1
| | | | a2 = true
| | | | | a4 = false: c1
| | | | | a4 = true: c0
a5 = true
| a8 = false
| | a9 = false
| | | a2 = false
| | | | a3 = false: c0
| | | | a3 = true
| | | | | a1 = false: c0
| | | | | a1 = true: c1
| | | a2 = true
| | | | a0 = false
| | | | | a4 = false: c1
| | | | | a4 = true: c0
| | | | a0 = true: c1
| | a9 = true
| | | a3 = false
| | | | a0 = false: c1
| | | | a0 = true
| | | | | a7 = false: c0
| | | | | a7 = true: c1
| | | a3 = true: c1
| a8 = true
| | a1 = false
| | | a2 = false
| | | | a0 = false
| | | | | a3 = false: c0
| | | | | a3 = true: c1
| | | | a0 = true: c0
| | | a2 = true
| | | | a4 = false: c1
| | | | a4 = true: c0
| | a1 = true
| | | a7 = false: c0
| | | a7 = true
| | | | a0 = false
| | | | | a2 = false: c1
| | | | | a2 = true: c0
| | | | a0 = true: c0
Time taken to build model: 0.06 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 78 78 %
Incorrectly Classified Instances 22 22 %
Kappa statistic 0.5234
Mean absolute error 0.22
Root mean squared error 0.469
Relative absolute error 48.8417 %
Root relative squared error 98.891 %
Total Number of Instances 100
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.803 0.265 0.855 0.803 0.828 0.769 c0
0.735 0.197 0.658 0.735 0.694 0.769 c1
=== Confusion Matrix ===
a b   <-- classified as
53 13 | a = c0
9 25 | b = c1
The results of the Apriori algorithm (refer to Chapter 3) on the above data set using Weka
are shown below.
=== Run information ===
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: weka.datagenerators.classifiers.classification.RDG1-S_1_-n_100_-a_10_-
c_2_-N_0_-I_0_-M_1_-R_10
Instances: 100
Attributes: 11
a0
a1
a2
a3
a4
a5
a6
a7
a8
a9
class
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.2 (20 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 16
Generated sets of large itemsets:
Size of set of large itemsets L(1): 22
Size of set of large itemsets L(2): 182
Size of set of large itemsets L(3): 56
Best rules found:
1. a1=false a5=false 24 ==> class=c0 24 conf:(1)
2. a5=false a8=false 24 ==> class=c0 24 conf:(1)
3. a5=false a6=false 23 ==> class=c0 23 conf:(1)
4. a8=false class=c1 22 ==> a5=true 22 conf:(1)
5. a5=false a7=true 21 ==> class=c0 21 conf:(1)
6. a5=false a9=false 21 ==> class=c0 21 conf:(1)
7. a3=false a5=false 20 ==> class=c0 20 conf:(1)
8. a6=false class=c1 20 ==> a5=true 20 conf:(1)
9. a2=false a5=false 27 ==> class=c0 26 conf:(0.96)
10. a4=false a5=false 23 ==> class=c0 22 conf:(0.96)
5.8.3 Selection of Data Mining Tool
One of the major decisions a business organization has to make is the selection of a suitable
data mining tool for its requirements. The best tool need not be the most advanced one.
Some of the factors that business organizations should look into when selecting a tool are
listed below.
1. Data types: The data types can range from record-based relational data to specialized
data like spatial data, stream data, time series or web data. The company should
have a clear idea about the kind of data it will be dealing with. No single suite
will support all the data types.
2. System issues: Issues like the operating system, machine requirements and interfaces
like XML play an important role in selection.
3. Data sources
4. Ease of use: Many data mining tasks can be performed by a plain programmer
instead of an expert statistician. Modern data mining tools should have a suitable
GUI to ease the use of the tool and shorten the learning curve. A good GUI is
required to perform user-guided, high-quality interactive data mining.
Lack of standards is a primary issue; the requirements of a business organization may
force it to make use of the functionalities of several data mining suites.
5. Visualization tools
It is better to have visualization capability. Business organizations deal with terabytes of
data, and exploration of this amount of data is plainly impossible without it. So visualization
is required for visualizing the data, the results and the process. The quality and flexibility of
the visualization tools should be evaluated by the business organization to select a suitable
suite.
6. Accuracy
The accuracy of the data mining tool is important. A good tool with an acceptable level
of accuracy normally influences an organization's selection of tools.
7. Common tasks
The capability of the data mining suite to perform many data mining tasks should be evaluated.
The requirements of organizations vary: some are interested in OLAP analysis and
association mining, while others may be interested in prediction and trend analysis.
It is impossible for any data mining suite to provide all facilities. Sometimes business
organizations need to perform multiple tasks and may wish to integrate those tasks; this
provides flexibility to the organization.
Some data mining tasks require a data warehouse and some may not. Hence, it is up to the
user to explore the data mining functionalities to decide upon the suitability of the suite for
the organization.
8. Scalability
Data mining has scalability issues. Some tools provide only in-memory algorithms, which
makes the application of such suites questionable, especially when organizations have very
large datasets. Hence this should be a major criterion of tool selection for the organization.
These are some of the features that need to be considered, along with the requirements of
the organization, to judge the suitability of a suite. Some of the commercial suites that are
commonly used by organizations are Microsoft SQL Server, IBM Intelligent Miner, MineSet,
Oracle Data Mining, Clementine and Enterprise Miner. Some of the tools, like WEKA, are
open source projects.
Summary
- Data mining is used in various domains like business, telecommunication, marketing,
web analysis, medicine, security and scientific domain.
- One of the major issues of data mining is the privacy and confidentiality of private
data
- Text mining mines the text present in the large collection of documents
- The basic measures of text retrieval are precision, recall and F-score.
- Text mining includes keyword based association analysis, document classification
and document clustering
- Text indexing techniques include the inverted index and the signature file
- Spatial mining is the process of discovering hidden but potentially useful patterns
from a large set of spatial data
- Spatial mining includes spatial OLAP, association rules, classification, trend analysis
and spatial clustering methods.
- Web mining includes web structure mining, web content mining and web usage
mining.
- There is no single data mining suite which provides all the data mining tasks. Some
of the popular tool suites are weka and Microsoft OLE DB.
- Some of the criteria for choosing data mining suites are the ability to handle many
data types, ease of use, visualization capability, accuracy and data mining
functionalities.
DID YOU KNOW?
1. What is the difference between ubiquitous data mining and traditional data mining
applications?
2. What is meant by privacy data?
3. Guarding the privacy of the data is a difficult task for data mining algorithms. Justify.
Short Questions
1. Enumerate some of the applications of data mining.
2. What is the role of Fair Information Report?
3. What are the measures of text retrieval?
4. What is meant by vector-space model?
5. Enumerate some of the spatial queries.
6. Enumerate some of the text mining tasks.
7. What are the kinds of the web mining?
8. What are the salient features of a data mining suite?
9. What are the social implications of data mining?
10. Enumerate some of the issues of data mining.
LONG QUESTIONS
1. What are the major issues that confront data mining? Explain in detail.
2. Explain in detail spatial data mining.
3. Explain the differences between text mining and text retrieval.
4. Explain in detail text mining tasks.
5. Explain in detail the HITS algorithm of page ranking.
6. Explain the criteria for selecting a data mining suite.