
18CSE487T-DATA WAREHOUSING AND ITS APPLICATIONS

UNIT-IV

Data Mining - Introduction

Data mining is a process used by organizations to turn raw data into useful information. By using software to find patterns in large data sets, organizations can learn more about their customers, develop more efficient business strategies, boost sales, and reduce costs. Effective data collection, storage, and processing of the data are essential to data mining. Data mining methods are also used to develop machine learning models. In essence, data mining is the extraction of vital information and knowledge from a large set of data. Think of data as a large rocky surface: we do not know what lies inside it, or whether anything useful is buried beneath the rocks, until we dig.

Steps involved in Data Mining:

• Business Understanding,
• Data Understanding,
• Data Preparation,
• Data Modeling,
• Evaluation and
• Deployment

Techniques used in Data Mining

(i) Cluster Analysis: This enables the identification of groups of users or records that share common features in a database. These features could include age, location, or education level.
(ii) Anomaly Detection: This is used to determine when something is noticeably different from the regular pattern. It is used to eliminate any database inconsistencies or anomalies at the source.
(iii) Regression Analysis: This technique is used to make predictions based on relationships within the data set. For example, one can predict the stock level of a particular product by analyzing its past levels and taking into account the different factors that determine it.
(iv) Classification: This deals with data that carry labels. Note that in cluster detection the records do not have labels and data mining is used to label them and form clusters, whereas in classification label information already exists and the data can be classified using an algorithm.

Data Mining Algorithms:

• The k-means Algorithm: This is a simple method of partitioning a given data set into a user-specified number of clusters, k. The algorithm works on d-dimensional vectors, D = {xi | i = 1, ..., N}, where xi is the i-th data point. The initial cluster seeds are obtained by sampling the data at random; the algorithm then repeatedly assigns each point to the nearest of the k cluster means and recomputes those means until the assignment stabilizes. k-means can be paired with another algorithm to describe non-convex clusters. It creates k groups from the given set of objects and explores the entire data set through its cluster analysis. It is simple and faster than many other algorithms, which is why it is often used together with them. k-means is normally classified as an unsupervised algorithm, although seeded variants can be used in a semi-supervised setting.
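A minimal NumPy sketch of the k-means iteration described above (the toy data, the iteration cap, and the random seed are illustrative assumptions, not part of the course material):

import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Lloyd's iteration: random seeds, then alternate assignment and mean update."""
    rng = np.random.default_rng(seed)
    # Initial seeds: k points sampled at random from the data set.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of the points assigned to it.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy 2-D data: two loose groups.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.1], [7.9, 8.3], [8.2, 7.9]])
labels, centers = k_means(X, k=2)
print(labels, centers)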

• Naive Bayes Algorithm: This algorithm is based on Bayes' theorem and is mainly used when the dimensionality of the inputs is high. The classifier can easily calculate the most probable output for new data, and new raw data can be added at runtime, giving a better probabilistic classifier. Each class has a known set of vectors, and the aim is to create a rule that allows future objects to be assigned to a class; the objects are described by vectors of variable values. This is one of the easiest algorithms to apply, as it is easy to construct and does not have any complicated parameter estimation schemes.
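As an illustration, a Gaussian Naive Bayes classifier can be built and then updated with new data at runtime. The sketch below assumes scikit-learn is available and uses made-up feature vectors:

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical training vectors (each class has a known set of vectors).
X = np.array([[1.0, 2.1], [1.2, 1.9], [7.8, 8.2], [8.1, 7.9]])
y = np.array([0, 0, 1, 1])

clf = GaussianNB()
clf.partial_fit(X, y, classes=[0, 1])     # initial fit

# New raw data can be folded in at runtime without retraining from scratch.
clf.partial_fit(np.array([[1.1, 2.0]]), np.array([0]))

print(clf.predict([[7.9, 8.0]]))          # predicted class for a new vector
print(clf.predict_proba([[7.9, 8.0]]))    # probabilistic output for each class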

• Support Vector Machines Algorithm: SVM is founded on structural risk minimization and statistical learning theory. A decision boundary, known as a hyperplane, must be identified; it provides the optimal separation of the classes. The main job of SVM is to identify the hyperplane that maximizes the margin between the two classes, where the margin is defined as the amount of space between the two classes. A hyperplane function is like the equation for a line, y = mx + b. SVM can also be extended to perform numerical (regression) predictions.
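A short sketch of a linear SVM classifier and its regression counterpart, assuming scikit-learn is available (the two-class data is made up):

import numpy as np
from sklearn.svm import SVC, SVR

# Hypothetical two-class data separable by a hyperplane.
X = np.array([[1, 1], [2, 1], [1, 2], [7, 7], [8, 6], [7, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)    # maximizes the margin between the two classes
clf.fit(X, y)
print(clf.coef_, clf.intercept_)     # hyperplane parameters (w, b)
print(clf.predict([[2, 2], [7, 7]]))

# "Extended to perform numerical calculations": support vector regression.
reg = SVR(kernel="linear").fit(X, X.sum(axis=1))
print(reg.predict([[3, 3]]))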

• The Apriori Algorithm

Join: The whole database is scanned to find the frequent 1-itemsets.

Prune: An itemset must satisfy the minimum support (and confidence) thresholds to move on to the next round, in which the 2-itemsets are generated.

Repeat: The join and prune steps are repeated at each itemset level until no new frequent itemsets are found or a pre-defined size is reached.
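A usage sketch of this join/prune/repeat cycle using the third-party mlxtend library (assumed to be installed; the basket contents are made up):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["bread", "milk"], ["bread", "butter", "milk"],
                ["milk", "eggs"], ["bread", "eggs", "milk"]]

# One-hot encode the baskets into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# The join/prune cycle is repeated level by level inside apriori() until no
# larger itemset satisfies the minimum support.
frequent = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence"]])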

Data: Data can be defined as a systematic record of a particular quantity. It is the different values of that quantity represented together in a set. It is a collection of facts and figures to be used for a specific purpose such as a survey or analysis. When arranged in an organized form, it can be called information.

Types of Data

Data may be qualitative or quantitative. Once you know the difference between them, you can
know how to use them.

I. Qualitative Data: Qualitative data, also known as the categorical data, describes the data
that fits into the categories. Qualitative data are not numerical. The categorical
information involves categorical variables that describe the features such as a person’s
gender, home town, etc. Categorical measures are defined in terms of natural language specifications, but not in terms of numbers. Sometimes categorical data can hold
numerical values (quantitative value), but those values do not have a mathematical sense.
They represent some characteristics or attributes. They depict descriptions that may be
observed but cannot be computed or calculated. They are more exploratory than
conclusive in nature.

A. Nominal Data: Nominal data is one of the types of qualitative information which
helps to label the variables without providing the numerical value. The name
“nominal” comes from the Latin name “nomen,” which means “name.” Nominal
data is also called the nominal scale. It cannot be ordered and measured. But
sometimes, the data can be qualitative and quantitative. Examples of nominal data
are letters, symbols, words, gender etc. The nominal data are examined using the
grouping method. In this method, the data are grouped into categories, and then
the frequency or the percentage of the data can be calculated. These data are
visually represented using the pie charts.
B. Ordinal Data: Ordinal data is a type of data that follows a natural order. The significant feature of ordinal data is that, although the values are ordered, the difference between the data values is not determined. This variable is mostly found in surveys, finance, economics, questionnaires, and so on. Ordinal data is commonly represented using a bar chart. These data are investigated and interpreted through many visualisation tools. The information may be expressed using tables in which each row of the table shows a distinct category.

II. Quantitative Data: Quantitative data is also known as numerical data which represents
the numerical value (i.e., how much, how often, how many). Numerical data gives
information about the quantities of a specific thing. Some examples of numerical data are
height, length, size, weight, and so on. The quantitative data can be classified into two
different types based on the data sets. The two different classifications of numerical data
are discrete data and continuous data. These can be measured and not simply observed.
They can be numerically represented and calculations can be performed on them.

A. Discrete Data: Discrete data can take only discrete values. Discrete information
contains only a finite number of possible values. Those values cannot be
subdivided meaningfully. Here, things can be counted in whole numbers.
Example: Number of students in the class


B. Continuous Data: Continuous data take fractional (real-number) values and can be measured rather than counted. A continuous attribute has an infinite number of possible values that can be selected within a given specific range. Example: a temperature reading within a range.

Data Mining Functionalities

We have observed various types of databases and information repositories on which data mining
can be performed. Let us now examine the kinds of data patterns that can be mined. Data mining
functionalities are used to specify the kind of patterns to be found in data mining tasks. In
general, data mining tasks can be classified into two categories: descriptive and predictive.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data to make predictions. In some cases,
users may have no idea regarding what kinds of patterns in their data may be interesting, and
hence may like to search for several different kinds of patterns in parallel. Thus it is important to
have a data mining system that can mine multiple kinds of patterns to accommodate different
user expectations or applications. Furthermore, data mining systems should be able to discover
patterns at various granularities (i.e., different levels of abstraction).

Data mining systems should also allow users to specify hints to guide or focus the search for
interesting patterns. Because some patterns may not hold for all of the data in the database, a
measure of certainty or “trustworthiness” is usually associated with each discovered pattern.
Data mining functionalities, and the kinds of patterns they can discover, are described below.

Mining Frequent Patterns, Associations, and Correlations

Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are many kinds of frequent patterns, including item
sets, subsequences, and substructures. A frequent itemset typically refers to a set of items that
frequently appear together in a transactional data set, such as milk and bread. A frequently
occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed
by a digital camera, and then a memory card, is a (frequent) sequential pattern. A substructure
can refer to different structural forms, such as graphs, trees, or lattices, which may be combined
with item sets or subsequences. If a substructure occurs frequently, it is called a (frequent)
structured pattern. Mining frequent patterns leads to the discovery of interesting associations and
correlations within data.

Data cleansing: This is the very first stage of data mining, where the cleaning of the data becomes an essential component for obtaining a sound final data analysis. It involves identifying and removing inaccurate and misleading data from the tables, databases, and record sets involved. Some techniques include ignoring the tuple (mainly used when the class label is missing), filling in the missing values manually, and replacing missing or incorrect values with global constants or with predicted or mean values.

Data integration: This technique involves merging a new set of information with the existing set. The sources may include multiple data sets, databases, or flat files. The customary implementation of data integration is the creation of an EDW (enterprise data warehouse), which involves two approaches to integration, tight coupling and loose coupling.

Data transformation: This requires the transformation of data within formats generally from the
source system to the required destination system. Some strategies include Smoothing,
Aggregation, Normalization, Generalization and attribute construction.

Data discretization: Techniques that split the domain of a continuous attribute into intervals are called data discretization; the data are then handled as a small number of interval labels, which makes the analysis more efficient. The two strategies are top-down discretization and bottom-up discretization.

Concept hierarchies: They reduce the data by replacing low-level concepts with higher-level concepts. Concept hierarchies define multi-dimensional data with multiple levels of abstraction. The methods include binning, histogram analysis, cluster analysis, etc.

Pattern evaluation and data presentation: If the data is presented in an efficient manner, the client as well as the customers can make use of it in the best possible way. After going through the above stages, the data is presented in the form of graphs and diagrams so that it can be understood with minimal statistical knowledge.

Integrating Data Mining with Data Warehouse

If a data mining system is not integrated with a database or a data warehouse system, then there
will be no system to communicate with. This scheme is known as the non-coupling scheme. In
this scheme, the main focus is on data mining design and on developing efficient and effective
algorithms for mining the available data sets. The list of Integration Schemes is as follows –

• No Coupling − In this scheme, the data mining system does not utilize any of the database or
data warehouse functions. It fetches the data from a particular source and processes that data
using some data mining algorithms. The data mining result is stored in another file.
• Loose Coupling − In this scheme, the data mining system may use some of the functions of
database and data warehouse system. It fetches the data from the data repository managed by
these systems and performs data mining on that data. It then stores the mining result either in a
file or in a designated place in a database or in a data warehouse.

• Semi−tight Coupling − In this scheme, the data mining system is linked with a database or a
data warehouse system and in addition to that, efficient implementations of a few data mining
primitives can be provided in the database.

• Tight coupling − In this coupling scheme, the data mining system is smoothly integrated into
the database or data warehouse system. The data mining subsystem is treated as one functional
component of an information system.

Data Mining Task Primitives

A data mining task can be specified in the form of a data mining query, which is input to the data
mining system. A data mining query is defined in terms of data mining task primitives. These
primitives allow the user to interactively communicate with the data mining system during
discovery to direct the mining process or examine the findings from different angles or depths.
The data mining primitives specify the following,

1. The set of task-relevant data to be mined.
2. The kind of knowledge to be mined.
3. The background knowledge to be used in the discovery process.
4. The interestingness measures and thresholds for pattern evaluation.
5. The representation for visualizing the discovered patterns.

A data mining query language can be designed to incorporate these primitives, allowing users to
interact with data mining systems flexibly. Having a data mining query language provides a
foundation on which user-friendly graphical interfaces can be built. Designing a comprehensive
data mining language is challenging because data mining covers a wide spectrum of tasks, from
data characterization to evolution analysis. Each task has different requirements. The design of
an effective data mining query language requires a deep understanding of the power, limitation,
and underlying mechanisms of the various kinds of data mining tasks. This facilitates a data
mining system's communication with other information systems and integrates with the overall
information processing environment. A data mining query is defined in terms of the following
primitives, such as:

1. The set of task-relevant data to be mined

This specifies the portions of the database or the set of data in which the user is interested. This
includes the database attributes or data warehouse dimensions of interest (the relevant attributes
or dimensions). In a relational database, the set of task-relevant data can be collected via a
relational query involving operations like selection, projection, join, and aggregation. The data
collection process results in a new data relation called the initial data relation. The initial data relation can be ordered or grouped according to the conditions specified in the query. This data retrieval can be thought of as a subtask of the data mining task. This initial relation may or may not correspond to a physical relation in the database. Since virtual relations are called Views in the
field of databases, the set of task-relevant data for data mining is called a minable view.

2. The kind of knowledge to be mined

This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.


3. The background knowledge to be used in the discovery process

This knowledge about the domain to be mined is useful for guiding the knowledge discovery
process and evaluating the patterns found. Concept hierarchies are a popular form of background
knowledge, which allows data to be mined at multiple levels of abstraction.

Concept hierarchy defines a sequence of mappings from low-level concepts to higher-level, more
general concepts.

Rolling Up - Generalization of data: Allows data to be viewed at more meaningful and explicit levels of abstraction and makes it easier to understand. It also compresses the data, so fewer input/output operations are required.

Drilling Down - Specialization of data: Concept values are replaced by lower-level concepts. Based on different user viewpoints, there may be more than one concept hierarchy for a given attribute or dimension.

An example of a concept hierarchy for the attribute (or dimension) age is shown below. User
beliefs regarding relationships in the data are another form of background knowledge.

4. The interestingness measures and thresholds for pattern evaluation

Different kinds of knowledge may have different interesting measures. They may be used to
guide the mining process or, after discovery, to evaluate the discovered patterns. For example,
interesting measures for association rules include support and confidence. Rules whose support
and confidence values are below user-specified thresholds are considered uninteresting.

Simplicity: A factor contributing to the interestingness of a pattern is the pattern's overall simplicity for human comprehension. For example, the more complex the structure of a rule is, the more difficult it is to interpret, and hence, the less interesting it is likely to be. Objective measures of pattern simplicity can be viewed as functions of the pattern structure, defined in terms of the pattern size in bits or the number of attributes or operators appearing in the pattern.

Certainty (Confidence): Each discovered pattern should have a measure of certainty associated
with it that assesses the validity or "trustworthiness" of the pattern. A certainty measure for
association rules of the form "A =>B" where A and B are sets of items is confidence. Confidence
is a certainty measure. Given a set of task-relevant data tuples, the confidence of "A => B" is defined as:

Confidence(A => B) = (# tuples containing both A and B) / (# tuples containing A)

Utility (Support): The potential usefulness of a pattern is a factor defining its interestingness. It can be estimated by a utility function, such as support. The support of an association pattern refers to the percentage of task-relevant data tuples (or transactions) for which the pattern is true:

Support(A => B) = (# tuples containing both A and B) / (total # of tuples)



Novelty: Novel patterns are those that contribute new information or increased performance to the given pattern set, for example, a data exception. Another strategy for detecting novelty is to remove redundant patterns.

5. The expected representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed, which may include
rules, tables, cross tabs, charts, graphs, decision trees, cubes, or other visual representations.

Users must be able to specify the forms of presentation to be used for displaying the discovered
patterns. Some representation forms may be better suited than others for particular kinds of
knowledge. For example, generalized relations and their corresponding cross tabs or pie/bar
charts are good for presenting characteristic descriptions, whereas decision trees are common for
classification.

Example of Data Mining Task Primitives

Suppose, as a marketing manager of AllElectronics, you would like to classify customers based
on their buying patterns. You are especially interested in those customers whose salary is no less
than $40,000 and who have bought more than $1,000 worth of items, each of which is priced at
no less than $100. In particular, you are interested in the customer's age, income, the types of
items purchased, the purchase location, and where the items were made. You would like to view
the resulting classification in the form of rules. This data mining query is expressed in DMQL as follows, where each line of the query has been enumerated to aid in our discussion.

1. use database AllElectronics_db
2. use hierarchy location_hierarchy for T.branch, age_hierarchy for C.age
3. mine classification as promising_customers
4. in relevance to C.age, C.income, I.type, I.place_made, T.branch
5. from customer C, item I, transaction T
6. where I.item_ID = T.item_ID and C.cust_ID = T.cust_ID and C.income ≥ 40,000 and I.price ≥ 100
7. group by T.cust_ID

Data Preprocessing

Data preprocessing is the process of transforming raw data into an understandable format. It
is also an important step in data mining as we cannot work with raw data. The quality of the data
should be checked before applying machine learning or data mining algorithms. The data
preprocessing phase is perhaps the most crucial one in the data mining process. Yet, it is rarely
explored to the extent that it deserves because most of the focus is on the analytical aspects of
data mining. This phase begins after the collection of the data, and it consists of the following
steps:


Data cleaning: The extracted data may have erroneous or missing entries. Therefore,
some records may need to be dropped, or missing entries may need to be estimated.
Inconsistencies may need to be removed.

Feature selection and transformation: When the data are very high dimensional, many
data mining algorithms do not work effectively. Furthermore, many of the high-
dimensional features are noisy and may add errors to the data mining process. There-
fore, a variety of methods are used to either remove irrelevant features or transform
the current set of features to a new data space that is more amenable for analysis.
Another related aspect is data transformation, where a data set with a particular set
of attributes may be transformed into a data set with another set of attributes of the
same or a different type. For example, an attribute, such as age, may be partitioned
into ranges to create discrete values for analytical convenience. Preprocessing of data is mainly
to check the data quality. The quality can be checked by the following:

• Accuracy: whether the data entered is correct.
• Completeness: whether all required data is recorded and available.
• Consistency: whether the same data is kept consistently in all the places where it appears.
• Timeliness: whether the data is updated promptly and correctly.
• Believability: whether the data can be trusted.
• Interpretability: whether the data can be easily understood.

Steps Involved in Data Preprocessing:

1. Data Cleaning: The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.

• (a) Missing Data: This situation arises when some values are missing from the data. It can be handled in various ways, such as:
1. Ignore the tuples: This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.
2. Fill in the missing values: There are various ways to do this. The missing values can be filled in manually, by the attribute mean, or by the most probable value.

• (b) Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways (a small pandas sketch follows this list):

• Binning Method: This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and various methods are then applied to each segment. Each segment is handled separately; for example, all values in a segment can be replaced by the segment mean, or boundary values can be used.


• Regression: Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
• Clustering: This approach groups similar data into clusters. Outliers either go undetected or fall outside the clusters.
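The sketch below illustrates two of the cleaning strategies above, mean imputation for missing values and smoothing by bin means, using pandas on a made-up table (column names and values are assumptions for illustration):

import pandas as pd

df = pd.DataFrame({"age": [23, None, 45, 31, None, 52],
                   "income": [40000, 52000, None, 61000, 58000, 700000]})

# Missing data: fill each attribute with its mean.
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].mean())

# Noisy data: smoothing by bin means - cut into equal-frequency bins,
# then replace each value by the mean of its bin.
bin_ids = pd.qcut(df["income"], q=3, labels=False)
df["income_smoothed"] = df.groupby(bin_ids)["income"].transform("mean")
print(df)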

2. Data Transformation: This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following ways (a short sketch follows this list):

• Normalization: This is done in order to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
• Attribute Selection: In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
• Discretization: This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
• Concept Hierarchy Generation: Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be converted to “country”.
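A small pandas sketch of min-max normalization, discretization, and a concept-hierarchy mapping (the attribute values, bin edges, and city-to-country table are illustrative assumptions):

import pandas as pd

ages = pd.Series([15, 22, 37, 45, 58, 63, 71])

# Normalization: min-max scaling into the range 0.0 to 1.0.
ages_scaled = (ages - ages.min()) / (ages.max() - ages.min())

# Discretization: replace raw numeric values by interval/conceptual levels.
age_levels = pd.cut(ages, bins=[0, 18, 40, 60, 120],
                    labels=["minor", "young adult", "middle-aged", "senior"])

# Concept hierarchy generation: map a lower-level attribute to a higher level.
city_to_country = {"Chennai": "India", "Delhi": "India", "Paris": "France"}
cities = pd.Series(["Chennai", "Paris", "Delhi"])
countries = cities.map(city_to_country)

print(ages_scaled.round(2).tolist())
print(age_levels.tolist())
print(countries.tolist())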

3. Data Reduction: Data mining is used to handle huge volumes of data, and analysis becomes harder when working with such volumes. To deal with this, data reduction techniques are used. They aim to increase storage efficiency and reduce data storage and analysis costs. The main approaches to data reduction are listed below (a dimensionality reduction sketch follows the list):

• Data Cube Aggregation: Aggregation operations are applied to the data in the construction of a data cube.
• Attribute Subset Selection: Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute: attributes whose p-value is greater than the significance level can be discarded.

• Numerosity Reduction: This enables storing a model of the data instead of the whole data, for example regression models.
• Dimensionality Reduction: This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
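A minimal PCA-based dimensionality reduction sketch, assuming scikit-learn is available (the random 10-attribute data set is a stand-in for real records):

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical high-dimensional data: 100 records with 10 attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Project onto the first 3 principal components (a lossy reduction).
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
print("variance retained:", pca.explained_variance_ratio_.sum())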

4. Data integration: This is the process of combining multiple sources into a single dataset. The data integration process is one of the main components of data management. Some problems to be considered during data integration are listed below (a merge sketch follows the list).

• Schema integration: Integrating metadata (data that describes other data) from different sources.


• Entity identification problem: Identifying entities across multiple databases. For example, the system or the user should know that student_id in one database and student_name in another database belong to the same entity.
• Detecting and resolving data value conflicts: The data taken from different databases may differ when merged. For instance, the attribute values in one database may differ from those in another database; the date format may be “MM/DD/YYYY” in one and “DD/MM/YYYY” in the other.
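A small pandas sketch of these integration issues, merging two hypothetical sources that use different key names and different date formats (all table contents are made up):

import pandas as pd

students_a = pd.DataFrame({"student_id": [1, 2], "name": ["Asha", "Ravi"],
                           "joined": ["01/31/2020", "02/15/2020"]})   # MM/DD/YYYY
students_b = pd.DataFrame({"stud_id": [1, 2], "dept": ["CSE", "ECE"],
                           "joined": ["31/01/2020", "15/02/2020"]})   # DD/MM/YYYY

# Entity identification: student_id and stud_id refer to the same entity.
merged = students_a.merge(students_b, left_on="student_id", right_on="stud_id",
                          suffixes=("_a", "_b"))

# Resolve the data value conflict by parsing both date formats explicitly.
merged["joined_a"] = pd.to_datetime(merged["joined_a"], format="%m/%d/%Y")
merged["joined_b"] = pd.to_datetime(merged["joined_b"], format="%d/%m/%Y")
print(merged)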

5 . Feature extraction: An analyst may be confronted with vast volumes of raw documents,
system logs, or commercial transactions with little guidance on how these raw data
should be transformed into meaningful database features for processing. This phase
is highly dependent on the analyst to be able to abstract out the features that are
most relevant to a particular application. For example, in a credit-card fraud detection
application, the amount of a charge, the repeat frequency, and the location are often
good indicators of fraud. However, many other features may be poorer indicators
of fraud. Therefore, extracting the right features is often a skill that requires an
understanding of the specific application domain at hand.

The data cleaning process requires statistical methods that are commonly used for missing-data estimation. In addition, erroneous data entries are often removed to ensure more accurate mining results. The topic of data cleaning is addressed in Chap. 2 on data preprocessing. Feature selection and transformation should not be considered a part of data preprocessing, because the feature selection phase is often highly dependent on the specific analytical problem being solved. In some cases, the feature selection process can even be tightly integrated with the specific algorithm or methodology being used, in the form of a wrapper model or embedded model. Nevertheless, the feature selection phase is usually performed before applying the specific algorithm at hand.

Association rule mining and classification

A common and powerful method employed in data mining is to discover rules about how variables associate among themselves. For example, if customers in a supermarket tend to buy cheese along with bread and milk, an association rule may be of the form {bread, milk} => cheese. Such association rules, once they are found through data mining, can be very useful for the supermarket management. The management can use the rule about bread, milk, and cheese to do promotional pricing of these products properly, and can make use of the rule when placing these products in close proximity on the shelves.

Association rules are really no different from classification rules except that they can predict any attribute, not just the class, which gives them the freedom to predict combinations of attributes too. Also, association rules are not intended to be used together as a set, as classification rules are. Different association rules express different regularities that underlie the dataset, and they generally predict different things. Because so many different association rules can be derived from even a tiny dataset, interest is restricted to those that apply to a reasonably large number of instances and have a reasonably high accuracy on the instances to which they apply. The coverage of an association rule is the number of instances for which it predicts correctly; this is often called its support. Its accuracy, often called confidence, is the number of instances that it predicts correctly, expressed as a proportion of all instances to which it applies. For example, with the rule "if temperature = cool then humidity = normal", the coverage is the number of days that are both cool and have normal humidity (4 days in the data of Table 1.2), and the accuracy is the proportion of cool days that have normal humidity (100% in this case). It is usual to specify minimum coverage and accuracy values and to seek only those rules whose coverage and accuracy are both at least these specified minima. In the weather data, for example, there are 58 rules whose coverage and accuracy are at least 2 and 95%, respectively. (It may also be convenient to specify coverage as a percentage of the total number of instances instead.)
Association rules that predict multiple consequences must be interpreted rather carefully. For
example, with the weather data in Table 1.2 we saw this rule:

If windy = false and play = no then outlook = sunny and humidity = high

This is not just a shorthand expression for the two separate rules:

If windy = false and play = no then outlook = sunny

If windy = false and play = no then humidity = high

It does indeed imply that both of these rules exceed the minimum coverage and accuracy figures, but it also implies more. The original rule means that the number of examples that are nonwindy, nonplaying, with sunny outlook and high humidity, is at least as great as the specified minimum coverage figure. It also means that the number of such days, expressed as a proportion of nonwindy, nonplaying days, is at least the specified minimum accuracy figure. This implies that the rule "if humidity = high and windy = false and play = no then outlook = sunny" also holds, because it has the same coverage as the original rule, and its accuracy must be at least as high as the original rule's because the number of high-humidity, nonwindy, nonplaying days is necessarily less than that of nonwindy, nonplaying days, which makes the accuracy greater. As we have seen, there are relationships between particular association rules: some rules imply others. To reduce the number of rules that are produced, in cases where several rules are related it makes sense to present only the strongest one to the user. In the preceding example, only the first rule should be printed.
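A small sketch that computes the coverage (support) and accuracy (confidence) of a rule such as "if temperature = cool then humidity = normal". The records below are made up for illustration; they are not the Table 1.2 weather data referred to above:

weather = [
    {"temperature": "cool", "humidity": "normal", "windy": False, "play": "yes"},
    {"temperature": "cool", "humidity": "normal", "windy": True,  "play": "no"},
    {"temperature": "hot",  "humidity": "high",   "windy": False, "play": "no"},
    {"temperature": "cool", "humidity": "high",   "windy": True,  "play": "no"},
    {"temperature": "mild", "humidity": "normal", "windy": False, "play": "yes"},
]

def rule_stats(records, antecedent, consequent):
    """Coverage (support) = number of records matching antecedent and consequent.
    Accuracy (confidence) = that count / number of records matching the antecedent."""
    matches_ante = [r for r in records
                    if all(r[k] == v for k, v in antecedent.items())]
    matches_both = [r for r in matches_ante
                    if all(r[k] == v for k, v in consequent.items())]
    coverage = len(matches_both)
    accuracy = coverage / len(matches_ante) if matches_ante else 0.0
    return coverage, accuracy

print(rule_stats(weather, {"temperature": "cool"}, {"humidity": "normal"}))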

Frequent pattern Mining

The problem of association pattern mining is naturally defined on unordered set-wise data.
It is assumed that the database T contains a set of n transactions, denoted by T1 . . . Tn.
Each transaction Ti is drawn on the universe of items U and can also be represented as a multidimensional record of dimensionality d = |U|, containing only binary attributes. Each binary attribute in this record represents a particular item. The value of an attribute
in this record is 1 if that item is present in the transaction, and 0 otherwise. In practical
settings, the universe of items U is very large compared to the typical number of items in
each transaction Ti. For example, a supermarket database may have tens of thousands of
items, and a single transaction will typically contain less than 50 items. This property is
often leveraged in the design of frequent pattern mining algorithms.

An itemset is a set of items. A k-itemset is an itemset that contains exactly k items. In other words, a k-itemset is a set of items of cardinality k. The fraction of transactions in T1 . . . Tn in which an itemset occurs as a subset provides a crisp quantification of its frequency. This frequency is also known as the support.

Definition 1 (Support): The support of an itemset I is defined as the fraction of the transactions in the database T = {T1 . . . Tn} that contain I as a subset. The support of an itemset I is denoted by sup(I). Clearly, items that are correlated will frequently occur together in transactions. Such itemsets will have high support. Therefore, the frequent pattern mining problem is that of determining itemsets that have the requisite level of minimum support.
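A direct implementation of this definition of support over a small, made-up transaction database (the baskets below are illustrative, not the Table 4.1 data):

def support(itemset, transactions):
    """sup(I): fraction of transactions that contain I as a subset."""
    itemset = frozenset(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

# Hypothetical transaction database over a universe of items U.
T = [{"bread", "milk"}, {"bread", "butter"}, {"milk", "eggs", "bread"},
     {"milk", "yogurt"}, {"cheese", "yogurt", "milk", "eggs"}]
print(support({"bread", "milk"}, T))   # 0.4
print(support({"milk"}, T))            # 0.8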

Definition 2 (Frequent Itemset Mining): Given a set of transactions T ={T1….Tn}, where each
Ti is a subset of items from U , determine all item-sets I that occur as a subset of at least a
predefined fraction minsup of the transactions in T . The predefined fraction minsup is referred
to as the minimum support. While the default convention here is to assume that minsup refers to a fractional relative value, it is also sometimes specified as an absolute integer value in terms of the raw number of transactions. We will always assume the convention of a relative value, unless specified otherwise. Frequent patterns are also referred to as frequent itemsets, or large itemsets, and these terms are used interchangeably. The unique identifier
of a transaction is referred to as a transaction identifier, or tid for short. The frequent itemset
mining problem may also be stated more generally in set-wise form.


Definition 3 (Frequent Itemset Mining: Set-wise Definition): Given a set of sets T = {T1 . . . Tn}, where each element of the set Ti is drawn on the universe of elements U, determine all sets I that occur as a subset of at least a predefined fraction minsup of the sets in T.

As discussed in Chap. 1, binary multidimensional data and set data are equivalent. This
equivalence is because each multidimensional attribute can represent a set element (or
item). A value of 1 for a multidimensional attribute corresponds to inclusion in the set (or
transaction). Therefore, a transaction data set (or set of sets) can also be represented as a
multidimensional binary database whose dimensionality is equal to the number of items.
Consider the transactions illustrated in Table 4.1. Each transaction is associated with a unique transaction identifier in the leftmost column and contains a basket of items that were bought together at the same time. The right column in Table 4.1 contains the binary multidimensional representation of the corresponding basket. The attributes of this binary representation are arranged in the order {Bread, Butter, Cheese, Eggs, Milk, Yogurt}. In this database of 5 transactions, the support of {Bread, Milk} is 2/5 = 0.4 because both items in this basket occur in 2 out of a total of 5 transactions. Similarly, the support of {Cheese, Yogurt} is 0.2 because it appears in only the last transaction. Therefore, if the minimum support is set to 0.3, then the itemset {Bread, Milk} will be reported but not the itemset {Cheese, Yogurt}.

The number of frequent itemsets is generally very sensitive to the minimum support level. Consider the case where a minimum support level of 0.3 is used. Each of the items Bread, Milk, Eggs, Cheese, and Yogurt occurs in more than 2 transactions and can therefore be considered a frequent item at a minimum support level of 0.3. These items are frequent 1-itemsets. In fact, the only item that is not frequent at a support level of 0.3 is Butter. Furthermore, the frequent 2-itemsets at a minimum support level of 0.3 are {Bread, Milk}, {Eggs, Milk}, {Cheese, Milk}, {Eggs, Yogurt}, and {Milk, Yogurt}. The only 3-itemset reported at a support level of 0.3 is {Eggs, Milk, Yogurt}. On the other hand, if the minimum support level is set to 0.2, it corresponds to an absolute support value of only 1. In such a case, every subset of every transaction will be reported. Therefore, the use of lower minimum support levels yields a larger number of frequent patterns. On the other hand, if the support level is too high, then no frequent patterns will be found. Therefore, an appropriate choice of the support level is crucial for discovering a set of frequent patterns with meaningful size. When an itemset I is contained in a transaction, all its subsets will also be contained in the transaction. Therefore, the support of any subset J of I will always be at least equal to that of I. This property is referred to as the support monotonicity property.

Definition 4 (Maximal Frequent Itemsets): A frequent itemset is maximal at a given minimum support level minsup if it is frequent and no superset of it is frequent.

In the example of Table 4.1, the itemset {Eggs, Milk, Yogurt} is a maximal frequent itemset at a minimum support level of 0.3. However, the itemset {Eggs, Milk} is not maximal because it has a superset that is also frequent. Furthermore, the set of maximal frequent patterns at a minimum support level of 0.3 is {Bread, Milk}, {Cheese, Milk}, and {Eggs, Milk, Yogurt}. Thus, there are only 3 maximal frequent itemsets, whereas the number of frequent itemsets in the entire transaction database is 11. All frequent itemsets can be derived from the maximal patterns by enumerating the subsets of the maximal frequent patterns.
Therefore, the maximal patterns can be considered condensed representations of the frequent patterns. However, this condensed representation does not retain information about the support values of the subsets. For example, the support of {Eggs, Milk, Yogurt} is 0.4, but it does not provide any information about the support of {Eggs, Milk}, which is 0.6. A different condensed representation, referred to as closed frequent itemsets, is able to retain support information as well. An interesting property of itemsets is that they can be conceptually arranged in the form of a lattice of itemsets. This lattice contains one node for each of the 2^|U| sets drawn from the universe of items U. An edge exists between a pair of nodes if the corresponding sets differ by exactly one item. For a universe of 5 items, for example, the itemset lattice has 2^5 = 32 nodes. The lattice represents the search space of frequent patterns. All frequent pattern mining algorithms, implicitly or explicitly, traverse this search space to determine the frequent patterns. The lattice is separated into frequent and infrequent itemsets by a border, which is illustrated by a dashed line in Fig. 4.1. All itemsets above this border are frequent, whereas those below the border are infrequent. Note that all maximal frequent itemsets are adjacent to this border of itemsets. Furthermore, any valid border representing a true division between frequent and infrequent itemsets will always respect the downward closure property.
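A brute-force sketch of this search space: enumerate every itemset in the lattice of a small, made-up transaction database, keep the frequent ones, and then pick out the maximal ones (exponential in the number of items, so only for tiny examples; the data is not the Table 4.1 database):

from itertools import combinations

T = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "eggs"},
     {"milk", "eggs", "yogurt"}, {"milk", "eggs", "yogurt", "cheese"}]
minsup = 0.4
items = sorted(set().union(*T))

def support(itemset):
    return sum(set(itemset) <= t for t in T) / len(T)

# Walk the whole itemset lattice: every non-empty subset of the universe.
frequent = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(c) >= minsup]

# A frequent itemset is maximal if no frequent proper superset exists.
maximal = [I for I in frequent if not any(I < J for J in frequent)]
print("frequent:", [sorted(I) for I in frequent])
print("maximal :", [sorted(I) for I in maximal])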

Apriori algorithm

The Apriori algorithm uses the downward closure property in order to prune the candidate
search space. The downward closure property imposes a clear structure on the set of frequent
patterns. In particular, information about the infrequency of itemsets can be leveraged to
generate the superset candidates more carefully. Thus, if an itemset is infrequent, there is
little point in counting the support of its superset candidates. This is useful for avoiding
wasteful counting of support levels of itemsets that are known not to be frequent. The
Apriori algorithm generates candidates with smaller length k first and counts their supports
before generating candidates of length (k + 1). The resulting frequent k-itemsets are used to
restrict the number of (k + 1)-candidates with the downward closure property. Candidate
generation and support counting of patterns with increasing length is interleaved in Apriori.
Because the counting of candidate supports is the most expensive part of the frequent
pattern generation process, it is extremely important to keep the number of candidates low.
For ease in description of the algorithm, it will be assumed that the items in U have a
lexicographic ordering, and therefore an itemset {a, b, c, d} can be treated as a (lexicograph-
ically ordered) string abcd of items. This can be used to impose an ordering among itemsets
(patterns), which is the same as the order in which the corresponding strings would appear
in a dictionary.


The Apriori algorithm starts by counting the supports of the individual items to generate
the frequent 1-itemsets. The 1-itemsets are combined to create candidate 2-itemsets, whose
support is counted. The frequent 2-itemsets are retained. In general, the frequent itemsets
of length k are used to generate the candidates of length (k + 1) for increasing values
of k. Algorithms that count the support of candidates with increasing length are referred
to as level-wise algorithms. Let Fk denote the set of frequent k-itemsets, and Ck denote
the set of candidate k-itemsets. The core of the approach is to iteratively generate the
(k + 1)-candidates Ck+1 from frequent k-itemsets in Fk already found by the algorithm.
The frequencies of these (k+1)-candidates are counted with respect to the transaction
database. While generating the (k+1)-candidates, the search space may be pruned by
checking whether all k-subsets of Ck+1 are included in Fk. So, how does one generate the
relevant (k+1)-candidates in Ck+1 from frequent k-patterns in Fk?
If a pair of itemsets X and Y in Fk have (k−1) items in common, then a join between
them using the (k−1) common items will create a candidate itemset of size (k+1). For
example, the two 3-itemsets {a, b, c} (or abc for short) and {a, b, d} (or abd for short), when
joined together on the two common items a and b, will yield the candidate 4-itemset abcd.
Of course, it is possible to join other frequent patterns to create the same candidate. One
might also join abc and bcd to achieve the same result. Suppose that all four of the 3-subsets
of abcd are present in the set of frequent 3-itemsets. One can create the candidate 4-itemset
in 6 different ways. To avoid redundancy in candidate generation, the convention is to
impose a lexicographic ordering on the items and use the first (k − 1) items of the itemset
for the join. Thus, in this case, the only way to generate abcd would be to join using the first
two items a and b. Therefore, the itemsets abc and abd would need to be joined to create
abcd. Note that, if either of abc and abd are not frequent, then abcd will not be generated as
a candidate using this join approach. Furthermore, in such a case, it is assured that abcd will
not be frequent because of the downward closure property of frequent itemsets. Thus, the
downward closure property ensures that the candidate set generated using this approach
does not miss any itemset that is truly frequent. As we will see later, this non-repetitive and
exhaustive way of generating candidates can be interpreted in the context of a conceptual
hierarchy of the patterns known as the enumeration tree. Another point to note is that the
joins can usually be performed very efficiently. This efficiency is because, if the set Fk is
sorted in lexicographic (dictionary) order, all itemsets with a common set of items in the
first k − 1 positions will appear contiguously, allowing them to be located easily.
A level-wise pruning trick can be used to further reduce the size of the (k + 1)-candidate
set. All the k-subsets (i.e., subsets of cardinality k) of an itemset I ∈ Ck+1 need to be
present in Fk because of the downward closure property. Otherwise, it is guaranteed that
the itemset I is not frequent. Therefore, it is checked whether all k-subsets of each itemset
I ∈ Ck+1 are present in Fk. If this is not the case, then such itemsets I are removed from
Ck+1. After the candidate itemsets Ck+1 of size (k + 1) have been generated, their support can
be determined by counting the number of occurrences of each candidate in the transaction database T. Only the candidate itemsets that have the required minimum support are
retained to create the set of (k + 1)-frequent itemsets Fk+1 ⊆ Ck+1. In the event that
the set Fk+1 is empty, the algorithm terminates. At termination, the union F1 ∪ F2 ∪ . . . ∪ Fk of the frequent patterns of different sizes is reported as the final output of the algorithm. The heart of
the algorithm is an iterative loop that generates (k + 1)-candidates from frequent k-patterns for
successively higher values of k and counts them. The three main operations of the algorithm are
candidate generation, pruning, and support counting. Of these, the support counting process is
the most expensive one because it depends on the size of the transaction database T . The level-
wise approach ensures that the algorithm is relatively efficient at least from a disk-access
cost perspective. This is because each set of candidates in Ck+1 can be counted in a single
pass over the data without the need for random disk accesses. The number of passes over
the data is, therefore, equal to the cardinality of the longest frequent itemset in the data.
Nevertheless, the counting procedure is still quite expensive especially if one were to use
the naive approach of checking whether each itemset is a subset of a transaction. Therefore,
efficient support counting procedures are necessary.
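A compact, self-contained sketch of the level-wise Apriori loop described above, with the lexicographic join and the downward-closure prune (the transaction data and the 0.4 threshold are illustrative assumptions):

from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise Apriori: join frequent k-itemsets into (k+1)-candidates,
    prune with the downward closure property, then count supports."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    Fk = [frozenset([i]) for i in items if sup(frozenset([i])) >= minsup]
    frequent = {I: sup(I) for I in Fk}
    k = 1
    while Fk:
        # Join step: combine pairs of frequent k-itemsets that share their
        # first k-1 items (lexicographic convention) into (k+1)-candidates.
        sorted_Fk = sorted(tuple(sorted(I)) for I in Fk)
        candidates = set()
        for a, b in combinations(sorted_Fk, 2):
            if a[:k - 1] == b[:k - 1]:
                candidates.add(frozenset(a) | frozenset(b))
        # Prune step: every k-subset of a candidate must itself be frequent.
        Fk_set = set(Fk)
        candidates = {c for c in candidates
                      if all(frozenset(s) in Fk_set for s in combinations(c, k))}
        # Support counting over the transaction database.
        Fk = [c for c in candidates if sup(c) >= minsup]
        frequent.update({c: sup(c) for c in Fk})
        k += 1
    return frequent

T = [["bread", "milk"], ["bread", "butter"], ["bread", "milk", "eggs"],
     ["milk", "eggs", "yogurt"], ["milk", "eggs", "yogurt", "cheese"]]
for itemset, s in sorted(apriori(T, 0.4).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))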

Efficient Support Counting

To perform support counting, Apriori needs to efficiently examine whether each candidate
itemset is present in a transaction. This is achieved with the use of a data structure known
as the hash tree. The hash tree is used to carefully organize the candidate patterns in Ck+1
for more efficient counting. Assume that the items in the transactions and the candidate
itemsets are sorted lexicographically. A hash tree is a tree with a fixed degree of the internal
nodes. Each internal node is associated with a random hash function that maps to the index
of the different children of that node in the tree. A leaf node of the hash tree contains a list
of lexicographically sorted itemsets, whereas an interior node contains a hash table. Every
itemset in Ck+1 is contained in exactly one leaf node of the hash tree. The hash functions
in the interior nodes are used to decide which candidate itemset belongs to which leaf node
with the use of a methodology described below.

It may be assumed that all interior nodes use the same hash function f(·) that maps to [0 . . . h−1]. The value of h is also the branching degree of the hash tree. A candidate itemset
in Ck+1 is mapped to a leaf node of the tree by defining a path from the root to the leaf node
with the use of these hash functions at the internal nodes. Assume that the root of the hash
tree is level 1, and all successive levels below it increase by 1. As before, assume that the
items in the candidates and transactions are arranged in lexicographically sorted order. At
an interior node in level i, a hash function is applied to the ith item of a candidate itemset
I ∈ Ck+1 to decide which branch of the hash tree to follow for the candidate itemset. The
tree is constructed recursively in top-down fashion, and a minimum threshold is imposed
on the number of candidates in the leaf node to decide where to terminate the hash tree
extension. The candidate itemsets in the leaf node are stored in sorted order.

To perform the counting, all possible candidate k-itemsets in Ck+1 that are subsets of
a transaction Tj ∈ T are discovered in a single exploration of the hash tree. To achieve
this goal, all possible paths in the hash tree, whose leaves might contain subset itemsets of
the transaction Tj , are discovered using a recursive traversal. The selection of the relevant
leaf nodes is performed by recursive traversal as follows. At the root node, all branches are
followed such that any of the items in the transaction Tj hash to one of the branches. At a
given interior node, if the ith item of the transaction Tj was last hashed (at the parent node),
then all items following it in the transaction are hashed to determine the possible children to
follow. Thus, by following all these paths, the relevant leaf nodes in the tree are determined.
The candidates in the leaf node are stored in sorted order and can be compared efficiently
to the transaction Tj to determine whether they are relevant. This process is repeated for
each transaction to determine the final support count of each itemset in Ck+1.
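The sketch below is not a hash tree; it is a simpler hash-based alternative (storing the candidates in a hash set and probing every k-subset of each transaction) that shows what the counting step has to compute. A real hash tree avoids enumerating all k-subsets of long transactions:

from itertools import combinations
from collections import Counter

def count_supports(candidates, transactions, k):
    """Count, for each candidate k-itemset, the number of transactions
    that contain it, using hash-set lookups of the transaction's k-subsets."""
    candidate_set = {frozenset(c) for c in candidates}
    counts = Counter()
    for t in transactions:
        t = sorted(t)                        # items in lexicographic order
        for subset in combinations(t, k):    # all k-item subsets of the transaction
            fs = frozenset(subset)
            if fs in candidate_set:
                counts[fs] += 1
    return counts

C2 = [{"bread", "milk"}, {"milk", "eggs"}, {"bread", "eggs"}]
T = [{"bread", "milk"}, {"bread", "milk", "eggs"}, {"milk", "eggs", "yogurt"}]
print(count_supports(C2, T, k=2))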

Frequent Pattern Mining without Candidate Generation

A frequent pattern is a pattern (a set of items, a subsequence, etc.) that occurs frequently in a database [AIS93]. Frequent pattern mining plays an essential role in mining associations, correlations, causality, sequential patterns, episodes, multi-dimensional patterns, max-patterns, partial periodicity, emerging patterns, and many other important data mining tasks such as associative classification, cluster analysis, iceberg cubes, and fascicles (semantic data compression). The Apriori-like approach is based on the anti-monotone Apriori heuristic:

If any length-k pattern is not frequent in the database, its length-(k + 1) super-pattern can never be frequent. The essential idea is to iteratively generate the set of
candidate patterns of length (k + 1) from the set of frequent patterns of length k (for k ≥ 1),
and check their corresponding occurrence frequencies in the database. The Apriori heuristic
achieves good performance gained by (possibly significantly) reducing the size of candidate sets.
However, in situations with many frequent patterns, long patterns, or quite low minimum support
thresholds, an Apriori-like algorithm may suffer from the following two nontrivial costs:

– It is costly to handle a huge number of candidate sets. For example, if there are 10^4 frequent 1-itemsets, the Apriori algorithm will need to generate more than 10^7 length-2 candidates and accumulate and test their occurrence frequencies. Moreover, to discover a frequent pattern of size 100, such as {a1, . . ., a100}, it must generate 2^100 − 2 ≈ 10^30 candidates in total. This is the inherent cost of candidate generation, no matter what implementation technique is applied.

– It is tedious to repeatedly scan the database and check a large set of candidates by pattern matching, which is especially true for mining long patterns. To avoid candidate generation, a frequent-pattern tree, or FP-tree for short, is constructed: an extended prefix-tree structure storing crucial, quantitative information about frequent patterns.
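A minimal construction-only sketch of such an FP-tree (the FP-growth mining step is not shown, and the transactions and threshold are made up). Items are inserted in frequency-descending order so that frequent prefixes are shared:

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fp_tree(transactions, minsup_count):
    # Count item frequencies and keep only the frequent items.
    freq = Counter(i for t in transactions for i in set(t))
    freq = {i: c for i, c in freq.items() if c >= minsup_count}
    order = sorted(freq, key=lambda i: (-freq[i], i))
    root = FPNode(None, None)
    header = defaultdict(list)    # item -> list of tree nodes, for later traversal
    for t in transactions:
        node = root
        for item in [i for i in order if i in set(t)]:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header

T = [["bread", "milk"], ["bread", "butter"], ["bread", "milk", "eggs"],
     ["milk", "eggs", "yogurt"], ["milk", "eggs", "yogurt", "cheese"]]
root, header = build_fp_tree(T, minsup_count=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})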

Mining Multilevel Association Rules

For many applications, strong associations discovered at high abstraction levels, though with
high support, could be commonsense knowledge. We may want to drill down to find novel
patterns at more detailed levels. On the other hand, there could be too many scattered patterns at
low or primitive abstraction levels, some of which are just trivial specializations of patterns at
higher levels. Therefore, it is interesting to examine how to develop effective methods for mining
patterns at multiple abstraction levels, with sufficient flexibility for easy traversal among
different abstraction spaces.

Suppose we are given the task-relevant set of transactional data in Table 7.1 for sales in an
AllElectronics store, showing the items purchased for each transaction. The concept hierarchy
for the items is shown in Figure 7.2. A concept hierarchy defines a sequence of mappings from a
set of low-level concepts to a higher-level, more general concept set. Data can be generalized by
replacing low-level concepts within the data by their corresponding higher-level concepts, or
ancestors, from a concept hierarchy. Figure 7.2’s concept hierarchy has five levels, respectively
referred to as levels 0 through 4, starting with level 0 at the root node for all (the most general
abstraction level). Here, level 1 includes computer, software, printer and camera, and computer
accessory; level 2 includes laptop computer, desktop computer, office software, antivirus
software, etc.; and level 3 includes Dell desktop computer, . . . , Microsoft office software, etc.
Level 4 is the most specific abstraction level of this hierarchy. It consists of the raw data values.

Concept hierarchies for nominal attributes are often implicit within the database schema, in
which case they may be automatically generated using methods. Concept hierarchies for numeric
attributes can be generated using discretization techniques, many of which were introduced in
Chapter 3. Alternatively, concept hierarchies may be specified by users familiar with the data
such as store managers in the case of our example. It is difficult to find interesting purchase
patterns in such raw or primitive-level data. For instance, if “Dell Studio XPS 16 Notebook” or
“Logitech VX Nano Cordless Laser Mouse” occurs in a very small fraction of the transactions,
then it can be difficult to find strong associations involving these specific items. Few people may
buy these items together, making it unlikely that the itemset will satisfy minimum support.
However, we would expect that it is easier to find strong associations between generalized
abstractions of these items, such as between “Dell Notebook” and “Cordless Mouse.”
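As a sketch of this generalization step, raw items can be replaced by their higher-level ancestors before mining. The one-level mapping and the extra product names below are hypothetical simplifications of the Figure 7.2 hierarchy:

# Hypothetical concept hierarchy: raw item -> higher-level concept.
hierarchy = {
    "Dell Studio XPS 16 Notebook": "laptop computer",
    "Logitech VX Nano Cordless Laser Mouse": "computer accessory",
    "HP Pavilion Desktop": "desktop computer",
    "Canon EOS Camera": "camera",
}

transactions = [
    ["Dell Studio XPS 16 Notebook", "Logitech VX Nano Cordless Laser Mouse"],
    ["HP Pavilion Desktop", "Canon EOS Camera"],
    ["Dell Studio XPS 16 Notebook", "Canon EOS Camera"],
]

# Roll up: replace each raw item by its ancestor, then any frequent itemset
# algorithm (e.g. Apriori) can be run on the generalized transactions.
generalized = [sorted({hierarchy.get(item, item) for item in t}) for t in transactions]
print(generalized)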

Association rules generated from mining data at multiple abstraction levels are called multiple-
level or multilevel association rules.

Multilevel association rules can be mined efficiently using concept hierarchies under a support-
confidence framework. In general, a top-down strategy is employed, where counts are
accumulated for the calculation of frequent itemsets at each concept level, starting at concept
level 1 and working downward in the hierarchy toward the more specific concept levels, until no
more frequent itemsets can be found. For each level, any algorithm for discovering frequent
itemsets may be used, such as Apriori or its variations. A number of variations to this approach
are described next, where each variation involves “playing” with the support threshold in a
slightly different way. The variations are illustrated in Figures 7.3 and 7.4, where nodes indicate
an item or itemset that has been examined, and nodes with thick borders indicate that an
examined item or itemset is frequent.

Using uniform minimum support for all levels (referred to as uniform support): The same
minimum support threshold is used when mining at each abstraction level. For example, in
Figure 7.3, a minimum support threshold of 5% is used throughout (e.g., for mining from
“computer” downward to “laptop computer”). Both “computer” and “laptop computer” are found
to be frequent, whereas “desktop computer” is not. When a uniform minimum support threshold
is used, the search procedure is simplified. The method is also simple in that users are required to
specify only one minimum support threshold. An Apriori-like optimization technique can be
adopted, based on the knowledge that an ancestor is a superset of its descendants: The search
avoids examining itemsets containing any item of which the ancestors do not have minimum
support. The uniform support approach, however, has some drawbacks. It is unlikely that items
at lower abstraction levels will occur as frequently as those at higher abstraction levels. If the
minimum support threshold is set too high, it could miss some meaningful associations occurring
at low abstraction levels. If the threshold is set too low, it may generate many uninteresting
associations occurring at high abstraction levels. This provides the motivation for the next
approach.

Using reduced minimum support at lower levels (referred to as reduced support): Each
abstraction level has its own minimum support threshold. The deeper the abstraction level, the
smaller the corresponding threshold. For example, in Figure 7.4, the minimum support thresholds
for levels 1 and 2 are 5% and 3%, respectively. In this way, “computer,” “laptop computer,” and
“desktop computer” are all considered frequent.
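
A minimal sketch (with hypothetical helper names, not from the text) of this top-down strategy with reduced support: transactions are generalized to the current concept level, a per-level minimum support count is applied, and the search stops once a level produces nothing frequent. For brevity it counts only single generalized items; a full implementation would run Apriori or one of its variations at each level.

def multilevel_frequent_items(transactions, level_of, max_level, min_sup_per_level):
    """transactions: list of sets of leaf-level items.
    level_of(item, level): returns the ancestor of `item` at the given concept level.
    min_sup_per_level: dict mapping level -> minimum support count (smaller at deeper levels)."""
    results = {}
    for level in range(1, max_level + 1):
        counts = {}
        for t in transactions:
            # generalize the transaction to the current abstraction level
            for g in {level_of(item, level) for item in t}:
                counts[g] = counts.get(g, 0) + 1
        frequent = {g: c for g, c in counts.items() if c >= min_sup_per_level[level]}
        if not frequent:
            break                      # no frequent itemsets at this level: stop descending
        results[level] = frequent
    return results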

Using item or group-based minimum support (referred to as group-based support): Because users or experts often have insight as to which groups are more important than others, it is
sometimes more desirable to set up user-specific, item, or group-based minimal support
thresholds when mining multilevel rules. For example, a user could set up the minimum support
thresholds based on product price or on items of interest, such as by setting particularly low
support thresholds for “camera with price over $1000” or “Tablet PC,” to pay particular attention
to the association patterns containing items in these categories. For mining patterns with mixed
items from groups with different support thresholds, usually the lowest support threshold among
all the participating groups is taken as the support threshold in mining. This will avoid filtering
out valuable patterns containing items from the group with the lowest support threshold. In the
meantime, the minimal support threshold for each individual group should be kept to avoid
generating uninteresting itemsets from each group. Other interestingness measures can be used
after the itemset mining to extract truly interesting rules.
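
A tiny sketch of the threshold choice described above for mixed-group patterns, assuming a user-supplied group_of mapping and per-group minimum supports (both hypothetical names):

def mining_threshold(itemset, group_of, group_min_sup):
    # use the lowest minimum support among the groups represented in the itemset,
    # so that valuable mixed-group patterns are not filtered out
    return min(group_min_sup[group_of(item)] for item in itemset)

# e.g. mining_threshold({"camera_over_1000", "tablet_pc"}, group_of,
#                       {"cameras": 0.01, "tablets": 0.05})  -> 0.01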

Mining Multidimensional Association Rule

So far, we have studied association rules that imply a single predicate, that is, the predicate buys.
For instance, in mining our AllElectronics database, we may discover the Boolean association
rule

buys(X, “digital camera”) ⇒ buys(X, “HP printer”).

Following the terminology used in multidimensional databases, we refer to each distinct predicate in a rule as a dimension. Hence, we can refer to the rule above as a single-dimensional or intradimensional association rule because it contains a single distinct predicate (e.g., buys) with multiple occurrences (i.e., the predicate occurs more than once within the rule). Such rules are
commonly mined from transactional data. Instead of considering transactional data only, sales
and related information are often linked with relational data or integrated into a data warehouse.
Such data stores are multidimensional in nature. For instance, in addition to keeping track of the
items purchased in sales transactions, a relational database may record other attributes associated
with the items and/or transactions such as the item description or the branch location of the sale.
Additional relational information regarding the customers who purchased the items (e.g.,
customer age, occupation, credit rating, income, and address) may also be stored. Considering
each database attribute or warehouse dimension as a predicate, we can therefore mine association
rules containing multiple predicates such as

age(X, “20...29”) ∧ occupation(X, “student”)⇒buys(X, “laptop”).

Association rules that involve two or more dimensions or predicates can be referred to as
multidimensional association rules. The rule above contains three predicates (age, occupation, and
buys), each of which occurs only once in the rule. Hence, we say that it has no repeated
predicates. Multidimensional association rules with no repeated predicates are called
interdimensional association rules. We can also mine multidimensional association rules with
repeated predicates, which contain multiple occurrences of some predicates. These rules are
called hybrid-dimensional association rules. An example of such a rule is the following, where
the predicate buys is repeated:

age(X, “20...29”) ∧ buys(X, “laptop”)⇒buys(X, “HP printer”).
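
To make the predicate notation concrete, the following minimal sketch (hypothetical data and helper names, not from the text) evaluates the support and confidence of such a rule over relational records represented as dictionaries:

records = [
    {"age": "20...29", "occupation": "student", "buys": {"laptop", "HP printer"}},
    {"age": "20...29", "occupation": "student", "buys": {"laptop"}},
    {"age": "30...39", "occupation": "teacher", "buys": {"HP printer"}},
]

def holds(record, predicate):
    attribute, value = predicate
    if attribute == "buys":                      # buys can hold several items per record
        return value in record["buys"]
    return record.get(attribute) == value

def rule_stats(records, antecedent, consequent):
    """antecedent and consequent are lists of (attribute, value) predicates."""
    n_ante = sum(all(holds(r, p) for p in antecedent) for r in records)
    n_both = sum(all(holds(r, p) for p in antecedent + consequent) for r in records)
    support = n_both / len(records)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence

# age(X, "20...29") AND occupation(X, "student") => buys(X, "laptop")
print(rule_stats(records, [("age", "20...29"), ("occupation", "student")], [("buys", "laptop")]))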

Database attributes can be nominal or quantitative. The values of nominal (or categorical)
attributes are “names of things.” Nominal attributes have a finite number of possible values, with
no ordering among the values (e.g., occupation, brand, color). Quantitative attributes are numeric
and have an implicit ordering among values (e.g., age, income, price). Techniques for mining
multidimensional association rules can be categorized into two basic approaches regarding the
treatment of quantitative attributes. In the first approach, quantitative attributes are discretized
using predefined concept hierarchies. This discretization occurs before mining. For instance, a
concept hierarchy for income may be used to replace the original numeric values of this attribute
by interval labels such as “0..20K,” “21K..30K,” “31K..40K,” and so on. Here, discretization is
static and predetermined. Chapter 3 on data preprocessing gave several techniques for
discretizing numeric attributes. The discretized numeric attributes, with their interval labels, can
then be treated as nominal attributes (where each interval is considered a category). We refer to
this as mining multidimensional association rules using static discretization of quantitative
attributes. In the second approach, quantitative attributes are discretized or clustered into “bins”
based on the data distribution. These bins may be further combined during the mining process.
The discretization process is dynamic and established so as to satisfy some mining criteria such
as maximizing the confidence of the rules mined. Because this strategy treats the numeric
attribute values as quantities rather than as predefined ranges or categories, association rules
mined from this approach are also referred to as (dynamic) quantitative association rules. Let’s
study each of these approaches for mining multidimensional association rules. For simplicity, we
confine our discussion to interdimensional association rules. Note that rather than searching for
frequent itemsets (as is done for single-dimensional association rule mining), in multidimensional association rule mining we search for frequent predicate sets.
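
A minimal sketch of the static-discretization step, using pandas (an assumed library choice, not mentioned in the text) to replace numeric income values with interval labels that can then be treated as nominal values during mining:

import pandas as pd

income = pd.Series([12000, 25000, 33000, 18000, 41000])
bins = [0, 20000, 30000, 40000, float("inf")]
labels = ["0..20K", "21K..30K", "31K..40K", "over 40K"]

# each numeric value is replaced by the label of the interval it falls into
income_binned = pd.cut(income, bins=bins, labels=labels)
print(income_binned.tolist())   # ['0..20K', '21K..30K', '31K..40K', '0..20K', 'over 40K']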

Correlation Analysis Rule

Correlation analysis is a statistical method used to measure the strength of the linear relationship
between two variables and compute their association. Correlation analysis calculates the level of
change in one variable due to the change in the other. A high correlation points to a strong
relationship between the two variables, while a low correlation means that the variables are
weakly related.

Researchers use correlation analysis to analyze quantitative data collected through research
methods like surveys and live polls for market research. They try to identify relationships,
patterns, significant connections, and trends between two variables or datasets. There is a
positive correlation between two variables when an increase in one variable leads to an increase
in the other. On the other hand, a negative correlation means that when one variable increases,
the other decreases and vice-versa.

Correlation is a bivariate analysis that measures the strength of association between two
variables and the direction of the relationship. In terms of the strength of the relationship, the
correlation coefficient's value varies between +1 and -1. A value of ± 1 indicates a perfect degree
of association between the two variables.

As the correlation coefficient value goes towards 0, the relationship between the two variables
will be weaker. The coefficient sign indicates the direction of the relationship; a + sign indicates
a positive relationship, and a - sign indicates a negative relationship.

Why Correlation Analysis is Important

Correlation analysis can reveal meaningful relationships between different metrics or groups of
metrics. Information about those connections can provide new insights and reveal
interdependencies, even if the metrics come from different parts of the business. Suppose there is
a strong correlation between two variables or metrics, and one of them is being observed acting
in a particular way. In that case, you can conclude that the other one is also being affected
similarly. This helps group related metrics together to reduce the need for individual data
processing.

Types of Correlation Analysis in Data Mining

Usually, in statistics, we measure four types of correlations: Pearson correlation, Kendall rank
correlation, Spearman correlation, and the Point-Biserial correlation.

1. Pearson r correlation

Pearson r correlation is the most widely used correlation statistic to measure the degree of the
relationship between linearly related variables. For example, in the stock market, if we want to
measure how two stocks are related to each other, Pearson r correlation is used to measure the
degree of relationship between the two. The point-biserial correlation is conducted with the
Pearson correlation formula, except that one of the variables is dichotomous. The following
formula is used to calculate the Pearson r correlation:

rxy= Pearson r correlation coefficient between x and y

n= number of observations

xi = value of x (for ith observation)

yi= value of y (for ith observation)
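
The formula image itself did not survive in these notes, so here is a small sketch that computes the standard Pearson r from the quantities defined above (plain Python; numpy's corrcoef or scipy.stats.pearsonr would give the same result):

from math import sqrt

def pearson_r(x, y):
    # r = sum((xi - x_mean)(yi - y_mean)) / sqrt(sum((xi - x_mean)^2) * sum((yi - y_mean)^2))
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / sqrt(var_x * var_y)

# two "stocks" that tend to move together
print(round(pearson_r([10, 12, 13, 15, 18], [20, 24, 27, 29, 35]), 3))   # about 0.993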

2. Kendall rank correlation

Kendall rank correlation is a non-parametric test that measures the strength of dependence
between two variables. Considering two samples, a and b, where each sample size is n, we know that the total number of pairings between a and b is n(n-1)/2. The following formula is used to calculate the value of Kendall rank correlation:

Nc = number of concordant pairs

Nd = number of discordant pairs
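
The Kendall formula image is likewise missing; using the concordant/discordant counts defined above, tau = (Nc - Nd) / (n(n - 1)/2). A minimal sketch follows (scipy.stats.kendalltau gives an equivalent, tie-corrected result):

def kendall_tau(a, b):
    n = len(a)
    nc = nd = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                nc += 1          # concordant pair
            elif s < 0:
                nd += 1          # discordant pair
    return (nc - nd) / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))   # 0.6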

3. Spearman rank correlation

Spearman rank correlation is a non-parametric test that is used to measure the degree of
association between two variables. The Spearman rank correlation test does not carry any
assumptions about the data distribution. It is the appropriate correlation analysis when the variables are measured on at least an ordinal scale.

This coefficient requires a table of data that displays the raw data, its ranks, and the difference
between the two ranks. This squared difference between the two ranks will be shown on a scatter
graph, which will indicate whether there is a positive, negative, or no correlation between the
two variables. The constraint that this coefficient works under is -1 ≤ r ≤ +1, where a result of 0
would mean that there was no relation between the data whatsoever. The following formula is
used to calculate the Spearman rank correlation:

ρ= Spearman rank correlation


di= the difference between the ranks of corresponding variables
n= number of observations
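
Again the formula image is missing; the rank-difference form is rho = 1 - 6 * sum(di^2) / (n(n^2 - 1)). A minimal sketch, assuming no tied values (ties would require averaged ranks):

def spearman_rho(x, y):
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

print(spearman_rho([35, 23, 47, 17, 10], [30, 33, 45, 23, 8]))   # 0.9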

The two methods outlined above will be used according to whether there are parameters
associated with the data gathered. The two terms to watch out for are:

• Parametric (Pearson's Coefficient): The data must be handled with the parameters of populations or probability distributions. Typically used with quantitative data already set out within said parameters.
• Non-parametric (Spearman's Rank): Where no assumptions can be made about the probability distribution. Typically used with qualitative data, but can be used with quantitative data if Spearman's Rank proves inadequate.

In cases when both are applicable, statisticians recommend using the parametric methods such as
Pearson's Coefficient because they tend to be more precise. But that doesn't mean discounting the
non-parametric methods if there isn't enough data or a more specified accurate result is needed.

Interpreting Results

1. Positive Correlation: Any score from +0.5 to +1 indicates a very strong positive correlation, which means that both variables increase simultaneously. On a scatter graph the data points trend upwards, and the line of best fit, or trend line, is placed to best represent the graph's data.
2. Negative Correlation: Any score from -0.5 to -1 indicates a strong negative correlation, which means that as one variable increases, the other decreases proportionally. The line of best fit again indicates the negative correlation; in these cases it slopes downwards from the point of origin.
3. No Correlation: Very simply, a score of 0 indicates no correlation, or relationship,
between the two variables. This fact will stand true for all, no matter which formula is
used. The more data inputted into the formula, the more accurate the result will be. The
larger the sample size, the more accurate the result.

Outliers or anomalies must be accounted for in both correlation coefficients. Using a scatter
graph is the easiest way of identifying any anomalies that may have occurred. Running the
correlation analysis twice (with and without anomalies) is a great way to assess the strength of
the influence of the anomalies on the analysis. Spearman's Rank coefficient may be used if
anomalies are present instead of Pearson's Coefficient, as this formula is extremely robust
against anomalies due to the ranking system used.
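
A short illustration of this point, using scipy (an assumed library, with hypothetical data): a single anomaly pulls Pearson's coefficient down noticeably, while Spearman's rank coefficient is unaffected because only the ranks matter.

from scipy.stats import pearsonr, spearmanr

x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 120]        # the last point is an anomaly

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)
r_clean, _ = pearsonr(x[:-1], y[:-1])

print(round(r_pearson, 2))    # about 0.70 -- dragged down by the outlier
print(round(r_spearman, 2))   # 1.0 -- the ranks are still perfectly monotone
print(round(r_clean, 2))      # 1.0 -- Pearson without the anomaly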

Benefits of Correlation Analysis

1. Reduce Time to Detection: In anomaly detection, working with many metrics and surfacing
correlated anomalous metrics helps draw relationships that reduce time to detection (TTD) and
support shortened time to remediation (TTR). As data-driven decision-making has become the
norm, early and robust detection of anomalies is critical in every industry domain, as delayed
detection adversely impacts customer experience and revenue.

2. Reduce Alert Fatigue: Another important benefit of correlation analysis in anomaly detection is reducing alert fatigue by filtering irrelevant anomalies (based on the correlation) and grouping
correlated anomalies into a single alert. Alert storms and false positives are significant challenges
organizations face - getting hundreds, even thousands of separate alerts from multiple systems
when many of them stem from the same incident.

3. Reduce Costs: Correlation analysis helps significantly reduce the costs associated with the
time spent investigating meaningless or duplicative alerts. In addition, the time saved can be
spent on more strategic initiatives that add value to the organization.

Example Use Cases for Correlation Analysis

Marketing professionals use correlation analysis to evaluate the efficiency of a campaign by monitoring and testing customers' reactions to different marketing tactics. In this way, they can better understand and serve their customers. Financial planners assess the correlation of an individual stock to an index such as the S&P 500 to determine whether adding the stock to an investment portfolio might increase the portfolio's systematic risk. For data scientists and those tasked with monitoring data, correlation analysis is incredibly valuable for root cause analysis and reduces time to detection (TTD) and time to remediation (TTR). Two unusual events or anomalies happening simultaneously, or changing at the same rate, can help pinpoint an underlying cause of a problem. The organization will incur a lower cost if the problem can be understood and fixed sooner. Technical support teams can reduce the number of alerts they must respond to by filtering irrelevant anomalies and grouping correlated anomalies into a single alert. Tools such as Security Information and Event Management (SIEM) systems automatically facilitate incident response.

Classification

Classification rules are a popular alternative to decision trees, and we have already seen
examples for the weather (page 10), contact lens (page 13), iris (page 15), and soybean (page 18)
datasets. The antecedent, or precondition, of a rule is a series of tests just like the tests at nodes in
decision trees, and the consequent, or conclusion, gives the class or classes that apply to
instances covered by that rule, or perhaps gives a probability distribution over the classes.
Generally, the preconditions are logically ANDed together, and all the tests must succeed if the
rule is to fire. However, in some rule formulations the preconditions are general logical
expressions rather than simple conjunctions. We often think of the individual rules as being
effectively logically ORed together: if any one applies, the class (or probability distribution)
given in its conclusion is applied to the instance. However, conflicts arise when several rules
with different conclusions apply; we will return to this shortly. It is easy to read a set of rules
directly off a decision tree. One rule is generated for each leaf. The antecedent of the rule
includes a condition for every node on the path from the root to that leaf, and the consequent of
the rule is the class assigned by the leaf. This procedure produces rules that are unambiguous in
that the order in which they are executed is irrelevant. However, in general, rules that are read
directly off a decision tree are far more complex than necessary, and rules derived from trees are
usually pruned to remove redundant tests. Because decision trees cannot easily express the
disjunction implied among the different rules in a set, transforming a general set of rules into a
tree is not quite so straightforward. A good illustration of this occurs when the rules have the
same structure but different attributes, like:

If a and b then x

If c and d then x

Then it is necessary to break the symmetry and choose a single test for the root node. If, for
example, a is chosen, the second rule must, in effect, be repeated twice in the tree, as shown in
Figure 3.2. This is known as the replicated subtree problem. The replicated subtree problem is
sufficiently important that it is worth looking at a couple more examples. The diagram on the left
of Figure 3.3 shows an exclusive-or function for which the output is a if x = 1 or y = 1 but not
both. To make this into a tree, you have to split on one attribute first, leading to a structure like
the one shown in the center. In contrast, rules can faithfully reflect the true symmetry of the
problem with respect to the attributes, as shown on the right.

In this example the rules are not notably more compact than the tree. In fact, they are just what
you would get by reading rules off the tree in the obvious way. But in other situations, rules are
much more compact than trees, particularly if it is possible to have a “default” rule that covers
cases not specified by the other rules. For example, to capture the effect of the rules in Figure
3.4—in which there are four attributes, x, y, z, and w, that can each be 1, 2, or 3—requires the
tree shown on the right. Each of the three small gray triangles to the upper right should actually
contain the whole three-level subtree that is displayed in gray, a rather extreme example of the
replicated subtree problem. This is a distressingly complex description of a rather simple
concept. One reason why rules are popular is that each rule seems to represent an independent
“nugget” of knowledge. New rules can be added to an existing rule set without disturbing ones
already there, whereas to add to a tree structure may require reshaping the whole tree. However,
this independence is something of an illusion, because it ignores the question of how the rule set
is executed.

On the other hand, if the order of interpretation is supposed to be immaterial, then it is not clear
what to do when different rules lead to different conclusions for the same instance. This situation
cannot arise for rules that are read directly off a decision tree because the redundancy included in
the structure of the rules prevents any ambiguity in interpretation. But it does arise when rules
are generated in other ways. If a rule set gives multiple classifications for a particular example,
one solution is to give no conclusion at all. Another is to count how often each rule fires on the
training data and go with the most popular one. These strategies can lead to radically different
results. A different problem occurs when an instance is encountered that the rules fail to classify
at all. Again, this cannot occur with decision trees, or with rules read directly off them, but it can
easily happen with general rule sets. One way of dealing with this situation is to fail to classify
such an example; another is to choose the most frequently occurring class as a default. Again,
radically different results may be obtained for these strategies. Individual rules are simple, and
sets of rules seem deceptively simple—but given just a set of rules with no additional
information, it is not clear how it should be interpreted. A particularly straightforward situation
occurs when rules lead to a class that is Boolean (say, yes and no) and when only rules leading to
one outcome (say, yes) are expressed. The assumption is that if a particular instance is not in
class yes, then it must be in class no—a form of closed world assumption. If this is the case, then
rules cannot conflict and there is no ambiguity in rule interpretation: any interpretation strategy
will give the same result. Such a set of rules can be written as a logic expression in what is called
disjunctive normal form: that is, as a disjunction (OR) of conjunctive (AND) conditions. It is this
simple special case that seduces people into assuming rules are very easy to deal with, because
here each rule really does operate as a new, independent piece of information that contributes in
a straightforward way to the disjunction. Unfortunately, it only applies to Boolean outcomes and
requires the closed world assumption, and both these constraints are unrealistic in most practical
situations. Machine learning algorithms that generate rules invariably produce ordered rule sets
in multiclass situations, and this sacrifices any possibility of modularity because the order of
execution is critical.
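
A minimal sketch (not from the text) of the interpretation strategies just discussed: each rule is a conjunction of attribute tests, the rule set is treated as a disjunction, conflicts are resolved by the rule that fired most often on the training data, and uncovered instances fall back to a default class.

# each rule: (list of (attribute, value) conditions, predicted class)
rules = [
    ([("a", 1), ("b", 1)], "x"),
    ([("c", 1), ("d", 1)], "x"),
    ([("a", 0), ("c", 0)], "y"),
]

def fires(rule, instance):
    conditions, _ = rule
    return all(instance.get(attr) == value for attr, value in conditions)

def classify_with_rules(instance, rules, training_fire_counts, default="y"):
    """training_fire_counts[i] = how often rule i fired on the training data."""
    fired = [i for i, rule in enumerate(rules) if fires(rule, instance)]
    if not fired:
        return default                               # no rule covers the instance
    best = max(fired, key=lambda i: training_fire_counts[i])
    return rules[best][1]                            # the most popular firing rule wins

print(classify_with_rules({"a": 1, "b": 1, "c": 0, "d": 0}, rules, [5, 3, 2]))   # 'x'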

Decision Tree

A “divide-and-conquer” approach to the problem of learning from a set of independent instances leads naturally to a style of representation called a decision tree. Nodes in a decision tree
involve testing a particular attribute. Usually, the test at a node compares an attribute value with
a constant. However, some trees compare two attributes with each other, or use some function of
one or more attributes. Leaf nodes give a classification that applies to all instances that reach the
leaf, or a set of classifications, or a probability distribution over all possible classifications. To
classify an unknown instance, it is routed down the tree according to the values of the attributes
tested in successive nodes, and when a leaf is reached the instance is classified according to the
class assigned to the leaf. If the attribute that is tested at a node is a nominal one, the number of
children is usually the number of possible values of the attribute. In this case, because there is
one branch for each possible value, the same attribute will not be retested further down the tree.
Sometimes the attribute values are divided into two subsets, and the tree branches just two ways depending on which subset the value lies in; in that case, the attribute might be tested more than once in a path. If the attribute is numeric, the test at a node usually determines whether its value is greater or less than a predetermined constant, giving a two-way split. Alternatively, a
three-way split may be used, in which case there are several different possibilities. If missing
value is treated as an attribute value in its own right, that will create a third branch. An
alternative for an integer-valued attribute would be a three-way split into less than, equal to, and
greater than. An alternative for a real-valued attribute, for which equal to is not such a
meaningful option, would be to test against an interval rather than a single constant, again giving
a three-way split: below, within, and above. A numeric attribute is often tested several times in
any given path down the tree from root to leaf, each test involving a different constant. We return
to this when describing the handling of numeric attributes in Section 6.1. Missing values pose an
obvious problem. It is not clear which branch should be taken when a node tests an attribute
whose value is missing. Sometimes, as described in Section 2.4, missing value is treated as an
attribute value in its own right. If this is not the case, missing values should be treated in a
special way rather than being considered as just another possible value that the attribute might
take. A simple solution is to record the number of elements in the training set that go down each
branch and to use the most popular branch if the value for a test instance is missing. A more
sophisticated solution is to notionally split the instance into pieces and send part of it down each
branch and from there right on down to the leaves of the subtrees involved. The split is
accomplished using a numeric weight between zero and one, and the weight for a branch is
chosen to be proportional to the number of training instances going down that branch, all weights
summing to one. A weighted instance may be further split at a lower node. Eventually, the
various parts of the instance will each reach a leaf node, and the decisions at these leaf nodes
must be recombined using the weights that have percolated down to the leaves. It is instructive
and can even be entertaining to build a decision tree for a dataset manually. To do so effectively,
you need a good way of visualizing the data so that you can decide which are likely to be the best
attributes to test and what an appropriate test might be. The Weka Explorer, described in Part II,
has a User Classifier facility that allows users to construct a decision tree interactively. It
presents you with a scatter plot of the data against two selected attributes, which you choose.
When you find a pair of attributes that discriminates the classes well, you can create a two-way
split by drawing a polygon around the appropriate data points on the scatter plot. A rectangle has
been drawn, manually, to separate out one of the classes (Iris versicolor). Then the user switches to the decision tree view. The left-hand leaf node contains predominantly irises of one type
(Iris versicolor, contaminated by only two virginicas); the right-hand one contains predominantly
two types (Iris setosa and virginica, contaminated by only two versicolors). The user will
probably select the right-hand leaf and work on it next, splitting it further with another
rectangle, perhaps based on a different pair of attributes. Most people enjoy making the first few
decisions but rapidly lose interest thereafter, and one very useful option is to select a machine
learning method and let it take over at any point in the decision tree. Manual construction of
decision trees is a good way to get a feel for the tedious business of evaluating different
combinations of attributes to split on.
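
A minimal sketch of routing an instance down a tree of nominal-attribute tests, with the tree written as nested dictionaries (a hypothetical representation, not Weka's), using the familiar weather data:

# internal node: {"attribute": ..., "branches": {value: subtree}}; leaf: {"class": label}
weather_tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": {"class": "no"}, "normal": {"class": "yes"}}},
        "overcast": {"class": "yes"},
        "rainy": {"attribute": "windy",
                  "branches": {True: {"class": "no"}, False: {"class": "yes"}}},
    },
}

def classify_instance(instance, node):
    while "class" not in node:                 # descend until a leaf is reached
        node = node["branches"][instance[node["attribute"]]]
    return node["class"]

print(classify_instance({"outlook": "sunny", "humidity": "normal", "windy": False}, weather_tree))   # 'yes'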

This technique applies to classification and prediction. The major attraction of decision trees is
their simplicity. By following the tree, you can decipher the rules and understand why a record is
classified in a certain way. Decision trees represent rules. You can use these rules to retrieve
records falling into a certain category. In some data mining processes, you really do not care how
the algorithm selected a certain record. For example, when you are selecting prospects to be
targeted in a marketing campaign, you do not need the reasons for targeting them. You only need
the ability to predict which members are likely to respond to the mailing. But in some other
cases, the reasons for the prediction are important. If your company is a mortgage company and
wants to evaluate an application, you need to know why an application must be rejected. Your
company must be able to protect itself from any lawsuits of discrimination. Wherever the reasons
are necessary and you must be able to trace the decision paths, decision trees are suitable. As you
have seen from Figure 17-12, a decision tree represents a series of questions. Each question
determines what follow-up question is best to be asked next. Good questions produce a short
series. Trees are drawn with the root at the top and the leaves at the bottom, an unnatural
convention. The question at the root must be the one that best differentiates among the target
classes. A database record enters the tree at the root node. The record works its way down until it
reaches a leaf. The leaf node determines the classification of the record. How can you measure
the effectiveness of a tree? In the example of the profiles of buyers of notebook computers, you
can pass the records whose classifications are already known. Then you can calculate the
percentage of correctness for the known records. A tree showing a high level of correctness is
more effective. Also, you must pay attention to the branches. Some paths are better than others
because the rules are better. By pruning the incompetent branches, you can enhance the
predictive effectiveness of the whole tree. How do the decision tree algorithms build the trees?
First, the algorithm attempts to find the test that will split the records in the best possible manner
among the wanted classifications. At each lower level node from the root, whatever rule works
best to split the subsets is applied. This process of finding each additional level of the tree
continues. The tree is allowed to grow until you cannot find better ways to split the input records.
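
To "pass the records whose classifications are already known" and compute the percentage of correctness, a library such as scikit-learn can be used (an assumption; the text itself refers to the Weka Explorer). A minimal sketch:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# limiting the depth is a crude stand-in for pruning incompetent branches
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# percentage of correctness on records whose classes are already known
print(accuracy_score(y_test, model.predict(X_test)))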

Bayesian Classification

In numerous applications, the connection between the attribute set and the class variable is non-deterministic. In other words, the class label of a test record cannot be predicted with certainty even though its attribute set is the same as that of some training examples. These circumstances may arise due to noisy data or the presence of confounding factors that influence classification but are not included in the analysis. For example, consider the task of predicting whether an individual is at risk of liver illness based on the individual's eating habits and working efficiency. Although most people who eat healthily and exercise consistently have a lower probability of developing liver disease, some may still do so due to other factors, such as consumption of high-calorie street food or alcohol abuse. Determining whether an individual's eating routine is healthy or the workout efficiency is sufficient is also subject to interpretation, which in turn may introduce uncertainty into the learning problem. Bayesian classification uses Bayes theorem to predict the occurrence of an event. Bayesian classifiers are statistical classifiers built on Bayesian probability. The theory expresses how a level of belief, expressed as a probability, should be revised in the light of evidence. Bayes theorem is named after Thomas Bayes, who first utilized conditional probability to provide an algorithm that uses evidence to calculate limits on an unknown parameter. Bayes's theorem is expressed mathematically by the following equation:

P(X | Y) = P(Y | X) P(X) / P(Y)

where X and Y are events and P(Y) ≠ 0.

P(X/Y) is a conditional probability that describes the occurrence of event X is given that Y is
true.

P(Y/X) is a conditional probability that describes the occurrence of event Y is given that X is
true.

P(X) and P(Y) are the probabilities of observing X and Y independently of each other. This is
known as the marginal probability.

In the Bayesian interpretation, probability determines a "degree of belief." Bayes theorem connects the degree of belief in a hypothesis before and after accounting for evidence. For example, let us consider tossing a coin: we get either heads or tails, and each outcome occurs 50% of the time. If the coin is flipped a number of times and the outcomes are observed, the degree of belief may rise, fall, or remain the same depending on the outcomes.

For proposition X and evidence Y,

• P(X), the prior, is the primary degree of belief in X


• P(X/Y), the posterior is the degree of belief having accounted for Y.

• The quotient P(Y | X) / P(Y) represents the support Y provides for X.

Bayes theorem can be derived from the definition of conditional probability:

P(X | Y) = P(X ⋂ Y) / P(Y) and P(Y | X) = P(X ⋂ Y) / P(X),

where P(X ⋂ Y) is the joint probability of both X and Y being true. Because both expressions equal P(X ⋂ Y), equating them gives Bayes theorem.

Bayesian network:

A Bayesian Network falls under the category of Probabilistic Graphical Models (PGMs), which are used to compute uncertainties by utilizing the concept of probability. Generally known as Belief Networks, Bayesian Networks represent uncertainties using Directed Acyclic Graphs (DAGs).

A Directed Acyclic Graph is used to show a Bayesian Network, and like some other statistical
graph, a DAG consists of a set of nodes and links, where the links signify the connection
between the nodes.

The nodes here represent random variables, and the edges define the relationship between these
variables.

A DAG models the uncertainty of an event taking place based on the Conditional Probability Distribution (CPD) of each random variable. A Conditional Probability Table (CPT) is used to represent the CPD of each variable in the network.
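
A tiny sketch of the idea with two binary variables, Rain -> WetGrass, where the CPT of WetGrass is conditioned on its parent Rain and the joint probability factorizes along the DAG (the numbers are hypothetical):

# P(Rain) and the CPT P(WetGrass | Rain) for the two-node network Rain -> WetGrass
p_rain = {True: 0.2, False: 0.8}
p_wet_given_rain = {True: {True: 0.9, False: 0.1},
                    False: {True: 0.2, False: 0.8}}

def joint(rain, wet):
    # P(Rain, WetGrass) = P(Rain) * P(WetGrass | Rain)
    return p_rain[rain] * p_wet_given_rain[rain][wet]

print(joint(True, True))                                                 # about 0.18
print(sum(joint(r, w) for r in (True, False) for w in (True, False)))    # 1.0, up to rounding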

Naïve Bayes Classification

The Naive Bayes classification algorithm is a probabilistic classifier. It is based on probability models that incorporate strong independence assumptions. The independence assumptions often do not hold in reality, which is why they are considered naive. You can derive
probability models by using Bayes' theorem (credited to Thomas Bayes). Depending on the
nature of the probability model, you can train the Naive Bayes algorithm in a supervised learning
setting. Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems. It is mainly used in text classification that
includes a high-dimensional training dataset. Naïve Bayes Classifier is one of the simple and
most effective Classification algorithms which helps in building the fast machine learning
models that can make quick predictions. It is a probabilistic classifier, which means it predicts
on the basis of the probability of an object. Some popular examples of Naïve Bayes Algorithm
are spam filtration, Sentimental analysis, and classifying articles. Data mining in
InfoSphere™ Warehouse is based on the maximum likelihood for parameter estimation for
Naive Bayes models. The generated Naive Bayes model conforms to the Predictive Model
Markup Language (PMML) standard. A Naive Bayes model consists of a large cube that
includes the following dimensions:

• Input field name


• Input field value for discrete fields, or input field value range for continuous fields.

Continuous fields are divided into discrete bins by the Naive Bayes algorithm

• Target field value

This means that a Naive Bayes model records how often a target field value appears together
with a value of an input field. You can activate the Naive Bayes classification algorithm by using
the following command:

DM_ClasSettings()..DM_setAlgorithm('NaiveBayes')

The Naive Bayes classification algorithm includes the probability-threshold parameter


ZeroProba. The value of the probability-threshold parameter is used if one of the above
mentioned dimensions of the cube is empty. A dimension is empty, if a training-data record with
the combination of input-field value and target value does not exist. The default value of the
probability-threshold parameter is 0.001. Optionally, you can modify the probability threshold.
For example, you can set the value to 0.0002 by using the following command:

DM_ClasSettings()..DM_setAlgorithm('NaiveBayes','<ZeroProba>0.0002</ZeroProba>')

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It
is not a single algorithm but a family of algorithms where all of them share a common principle,
i.e. every pair of features being classified is independent of each other.

Consider a fictional dataset that describes the weather conditions for playing a game of golf.
Given the weather conditions, each tuple classifies the conditions as fit(“Yes”) or unfit(“No”) for
playing golf.

Outlook Temperature Humidity Windy Play Golf


0 Rainy Hot High False No
1 Rainy Hot High True No
2 Overcast Hot High False Yes
3 Sunny Mild High False Yes
4 Sunny Cool Normal False Yes
5 Sunny Cool Normal True No
6 Overcast Cool Normal True Yes
7 Rainy Mild High False No
8 Rainy Cool Normal False Yes
9 Sunny Mild Normal False Yes
10 Rainy Mild Normal True Yes
11 Overcast Mild High True Yes
12 Overcast Hot Normal False Yes
13 Sunny Mild High True No

The dataset is divided into two parts, namely, the feature matrix and the response vector. The feature matrix contains all the vectors (rows) of the dataset, in which each vector consists of the values of the dependent features. In the above dataset, the features are ‘Outlook’, ‘Temperature’, ‘Humidity’ and ‘Windy’. The response vector contains the value of the class variable (prediction or output) for each row of the feature matrix. In the above dataset, the class variable name is ‘Play golf’. The fundamental Naive Bayes assumption is that each feature makes a contribution to the outcome that is:

• independent
• equal

We assume that no pair of features are dependent. For example, the temperature being ‘Hot’ has
nothing to do with the humidity or the outlook being ‘Rainy’ has no effect on the winds. Hence,
the features are assumed to be independent. Secondly, each feature is given the same weight (or
importance). For example, knowing only temperature and humidity alone can’t predict the
outcome accurately. None of the attributes is irrelevant and assumed to be contributing equally
to the outcome.

Bayes’ Theorem

Bayes’ Theorem finds the probability of an event occurring given the probability of another
event that has already occurred. Bayes' theorem is stated mathematically as the following equation:

P(A | B) = P(B | A) P(A) / P(B)

where A and B are events and P(B) ≠ 0.

• Basically, we are trying to find the probability of event A, given that event B is true. Event B is also termed the evidence.
• P(A) is the prior probability of A (i.e., the probability of the event before the evidence is seen). The evidence is an attribute value of an unknown instance (here, it is event B).
• P(A|B) is the posterior probability of A, i.e., the probability of the event after the evidence is seen.

Now, with regard to our dataset, we can apply Bayes' theorem in the following way:

P(y | X) = P(X | y) P(y) / P(X)

where y is the class variable and X is a dependent feature vector (of size n). Just to be clear, an
example of a feature vector and corresponding class variable can be: (refer 1st row of dataset)

X = (Rainy, Hot, High, False)


y = No

So basically, P(y|X) here means the probability of “Not playing golf” given that the weather conditions are “Rainy outlook”, “Temperature is hot”, “high humidity” and “no wind”. Now, it is time to apply the naive assumption to Bayes' theorem, which is independence among the features. So now, we split the evidence into independent parts. If any two events A and B are independent, then:

P(A,B) = P(A)P(B)

Hence, we reach the result:

P(y | x1, ..., xn) = P(x1 | y) P(x2 | y) ... P(xn | y) P(y) / (P(x1) P(x2) ... P(xn))

which can be expressed as:

P(y | x1, ..., xn) = P(y) ∏ P(xi | y) / (P(x1) P(x2) ... P(xn))

Now, as the denominator remains constant for a given input, we can remove that term:

P(y | x1, ..., xn) ∝ P(y) ∏ P(xi | y)

Now, we need to create a classifier model. For this, we find the probability of given set of inputs
for all possible values of the class variable y and pick up the output with maximum probability.

This can be expressed mathematically as:

y = argmax_y P(y) ∏ P(xi | y)

So, finally, we are left with the task of calculating P(y) and P(xi | y). Please note that P(y) is also called the class probability and P(xi | y) is called the conditional probability. The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(xi | y). Let us try to apply the above formula manually on our weather dataset. For this, we need to do some precomputations on our dataset. We need to find P(xi | yj) for each xi in X and yj in y. All these calculations have been demonstrated in the tables below:

So, in the figure above, we have calculated P(xi | yj) for each xi in X and yj in y manually in tables 1-4. For example, the probability of playing golf given that the temperature is cool, i.e., P(temp. = cool | play golf = Yes) = 3/9. Also, we need to find the class probabilities P(y), which have been calculated in table 5. For example, P(play golf = Yes) = 9/14. So now, we are done with our pre-computations and the classifier is ready!
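
A minimal sketch that performs these pre-computations directly on the 14-row golf table above and classifies a new day; internally it uses values such as P(temp = Cool | Yes) = 3/9 and P(Yes) = 9/14.

golf_data = [  # (Outlook, Temperature, Humidity, Windy, PlayGolf)
    ("Rainy", "Hot", "High", False, "No"),       ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),   ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"),   ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"), ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),   ("Sunny", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", True, "Yes"),    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"), ("Sunny", "Mild", "High", True, "No"),
]

def naive_bayes_predict(x):
    scores = {}
    for label in {row[-1] for row in golf_data}:
        rows = [row for row in golf_data if row[-1] == label]
        score = len(rows) / len(golf_data)               # class probability P(y)
        for i, value in enumerate(x):                    # conditional probabilities P(xi | y)
            score *= sum(row[i] == value for row in rows) / len(rows)
        scores[label] = score
    return max(scores, key=scores.get), scores

print(naive_bayes_predict(("Sunny", "Hot", "Normal", False)))   # predicts 'Yes'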

Other popular Naive Bayes classifiers are:

• Multinomial Naive Bayes: Feature vectors represent the frequencies with which certain
events have been generated by a multinomial distribution. This is the event model
typically used for document classification.
• Bernoulli Naive Bayes: In the multivariate Bernoulli event model, features are
independent booleans (binary variables) describing inputs. Like the multinomial model,
this model is popular for document classification tasks, where binary term occurrence features (i.e., whether a word occurs in a document or not) are used rather than term frequencies (i.e., the frequency of a word in the document).

As we reach to the end of this article, here are some important points to ponder upon:

• In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and
spam filtering. They require a small amount of training data to estimate the necessary
parameters.
• Naive Bayes learners and classifiers can be extremely fast compared to more
sophisticated methods. The decoupling of the class conditional feature distributions
means that each distribution can be independently estimated as a one dimensional
distribution. This in turn helps to alleviate problems stemming from the curse of
dimensionality.

Advantages of Naïve Bayes Classifier:

• Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
• It can be used for Binary as well as Multi-class Classifications.
• It performs well in Multi-class predictions as compared to the other Algorithms.
• It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:

• Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.

Applications of Naïve Bayes Classifier:

• It is used for Credit Scoring.


• It is used in medical data classification.
• It can be used in real-time predictions since Naïve Bayes Classifier is an eager learner.
• It is used in Text classification such as Spam filtering and Sentiment analysis.

Types of Naïve Bayes Model:

• Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes
that these values are sampled from the Gaussian distribution.

• Multinomial: The Multinomial Naïve Bayes classifier is used when the data is
multinomial distributed. It is primarily used for document classification problems, it
means a particular document belongs to which category such as Sports, Politics,
education, etc. The classifier uses the frequency of words for the predictors.
• Bernoulli: The Bernoulli classifier works similar to the Multinomial classifier, but the
predictor variables are the independent Booleans variables. Such as if a particular word is
present or not in a document. This model is also famous for document classification
tasks.

SVM Linear

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning. The goal of the SVM algorithm is to create the
best line or decision boundary that can segregate n-dimensional space into classes so that we can
easily put the new data point in the correct category in the future. This best decision boundary is
called a hyperplane. SVM chooses the extreme points/vectors that help in creating the
hyperplane. These extreme cases are called as support vectors, and hence algorithm is termed as
Support Vector Machine. Consider the below diagram in which there are two different categories
that are classified using a decision boundary or hyperplane:

SVM can be of two types:

• Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset
can be classified into two classes by using a single straight line, then such data is termed
as linearly separable data, and classifier is used called as Linear SVM classifier.
• Non-linear SVM: Non-Linear SVM is used for non-linearly separated data, which means
if a dataset cannot be classified by using a straight line, then such data is termed as non-
linear data and classifier used is called as Non-linear SVM classifier.
Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the below image:

As it is a 2-d space, just by using a straight line we can easily separate these two classes. But
there can be multiple lines that can separate these classes. Hence, the SVM algorithm helps to
find the best line or decision boundary; this best boundary or region is called as a hyperplane.
SVM algorithm finds the closest point of the lines from both the classes. These points are called
support vectors. The distance between the vectors and the hyperplane is called as margin. And
the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called
the optimal hyperplane.

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. So to separate these data points, we need to add one
more dimension. For linear data, we have used two dimensions x and y, so for non-linear data,
we will add a third dimension z. It can be calculated as:

z = x^2 + y^2

By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below
image:

Since we are in 3-d space, it looks like a plane parallel to the x-axis. If we convert it back to 2-d space with z = 1, we get a circumference of radius 1 for the non-linear data.
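
A minimal scikit-learn sketch (the library is an assumption, not named in the text): a linear kernel struggles on data that is separable only by a circle, while an RBF kernel plays the role of the z = x^2 + y^2 mapping by implicitly working in a higher-dimensional space.

import numpy as np
from sklearn.svm import SVC

# toy data: class 1 inside a circle of radius 1, class 0 outside
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)        # implicit non-linear mapping via the kernel

print("linear kernel accuracy:", linear_svm.score(X, y))   # stuck near the majority-class rate
print("rbf kernel accuracy:", rbf_svm.score(X, y))         # close to 1.0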

Advantages of SVM:

• Effective in high-dimensional cases
• It is memory efficient, as it uses a subset of training points in the decision function, called support vectors
• Different kernel functions can be specified for the decision function, and it is possible to specify custom kernels

Text Mining

Data mining is about looking for patterns in data. Likewise, text mining is about looking for
patterns in text: it is the process of analyzing text to extract information that is useful for
particular purposes. Compared with the kind of data we have been talking about in this book, text
is unstructured, amorphous, and difficult to deal with. Nevertheless, in modern Western culture,
text is the most common vehicle for the formal exchange of information. The motivation for
trying to extract information from it is compelling—even if success is only partial.

Another general class of text mining problems is metadata extraction. Metadata was mentioned previously as data about data: in the realm of text the term generally refers to salient features of a work, such as its author, title, subject classification, subject headings, and keywords. Metadata is a kind of highly structured (and therefore actionable) document summary. The idea of
metadata is often expanded to encompass words or phrases that stand for objects or “entities” in
the world, leading to the notion of entity extraction. Ordinary documents are full of such terms:
phone numbers, fax numbers, street addresses, email addresses, email signatures, abstracts,
tables of contents, lists of references, tables, figures, captions, meeting announcements, Web
addresses, and more. In addition, there are countless domain-specific entities, such as
international standard book numbers (ISBNs), stock symbols, chemical structures, and
mathematical equations. These terms act as single vocabulary items, and many
document processing tasks can be significantly improved if they are identified as such. They can
aid searching, interlinking, and cross-referencing between documents. How can textual entities
be identified? Rote learning, that is, dictionary lookup, is one idea, particularly when coupled
with existing resources—lists of personal names and organizations, information about locations
from gazetteers, or abbreviation and acronym dictionaries. Another is to use capitalization and
punctuation patterns for names and acronyms; titles (Ms.), suffixes (Jr.), and baronial prefixes
(von); or unusual language statistics for foreign names. Regular expressions suffice for artificial
constructs such as uniform resource locators (URLs); explicit grammars can be written to
recognize dates and sums of money. Even the simplest task opens up opportunities for learning
to cope with the huge variation that real-life documents present. Text mining, including Web
mining, is a burgeoning technology that is still, because of its newness and intrinsic difficulty, in
a fluid state—akin, perhaps, to the state of machine learning in the mid-1980s. There is no real
consensus about what it covers: broadly interpreted, all natural language processing comes under
the ambit of text mining. It is usually difficult to provide general and meaningful evaluations
because the mining task is highly sensitive to the particular text under consideration. Automatic
text mining techniques have a long way to go before they rival the ability of people, even without
any special domain knowledge, to glean information from large document collections. But they
will go a long way, because the demand is immense.
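
A minimal sketch of the pattern-based entity extraction just described: regular expressions for URLs and email addresses plus a dictionary lookup (rote learning) for known organization names. The patterns and data are simplified illustrations, not production-grade extractors.

import re

text = "Visit https://www.allelectronics.com or email support@allelectronics.com today. AllElectronics ships worldwide."

url_pattern = re.compile(r"https?://[^\s]+")
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[A-Za-z]{2,}")
known_orgs = {"AllElectronics"}                     # dictionary lookup of known entities

entities = {
    "urls": url_pattern.findall(text),
    "emails": email_pattern.findall(text),
    "organizations": [w for w in re.findall(r"\w+", text) if w in known_orgs],
}
print(entities)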

Text data are copiously found in many domains, such as the Web, social networks, newswire
services, and libraries. With the increasing ease in archival of human speech and expression,
the volume of text data will only increase over time. This trend is reinforced by the increasing
digitization of libraries and the ubiquity of the Web and social networks. Some examples of
relevant domains are as follows:

1. Digital libraries: A recent trend in article and book production is to rely on digitized versions,
rather than hard copies. This has led to the proliferation of digital libraries in which effective
document management becomes crucial. Furthermore mining tools are also used in some
domains, such as biomedical literature, to glean useful insights.

2. Web and Web-enabled applications: The Web is a vast repository of documents that is further
enriched with links and other types of side information. Web documents are also referred to as
hypertext. The additional side information available with hypertext can be useful in the
knowledge discovery process. In addition, many web-enabled applications, such as social
networks, chat boards, and bulletin boards, are a significant source of text for analysis.

3. Newswire services: An increasing trend in recent years has been the de-emphasis of printed
newspapers and a move toward electronic news dissemination. This trend creates a massive
stream of news documents that can be analyzed for important events and insights.

The set of features (or dimensions) of text is also referred to as its lexicon. A collection of documents is
referred to as a corpus. A document can be viewed as either a sequence, or a multidimensional
record. A text document is, after all, a discrete sequence of words, also referred to as a string.
However, such sequence mining methods are rarely used in the text domain. This is partially
because sequence mining methods are most effective when the length of the sequences and the
number of possible tokens are both relatively modest. On the other hand, documents can often be
long sequences drawn on a lexicon of several hundred thousand words.

In practice, text is usually represented as multidimensional data in the form of a frequency-annotated bag-of-words. Words are also referred to as terms. Although such a representation
loses the ordering information among the words, it also enables the use of much larger classes of
multidimensional techniques. Typically, a preprocessing approach is applied in which the very
common words are removed, and the variations of the same word are consolidated. The
processed documents are then represented as an unordered set of words, where normalized
frequencies are associated with the individual words. The resulting representation is also referred
to as the vector space representation of text. The vector space representation of a document is a
multidimensional vector that contains a frequency associated with each word (dimension) in the
document. The overall dimensionality of this data set is equal to the number of distinct words in
the lexicon. The words from the lexicon that are not present in the document are assigned a
frequency of 0. Therefore, text is not very different from the multidimensional data type that has
been studied in the preceding chapters. Due to the multidimensional nature of the text, the
techniques studied in the aforementioned chapters can also be applied to the text domain with a
modest number of modifications. What are these modifications, and why are they needed? To
understand these modifications, one needs to understand a number of specific characteristics that
are unique to text data:

1. Number of “zero” attributes: Although the base dimensionality of text data may be of the
order of several hundred thousand words, a single document may contain only a few hundred
words. If each word in the lexicon is viewed as an attribute, and the document word frequency is
viewed as the attribute value, most attribute values are 0. This phenomenon is referred to as high-
dimensional sparsity. There may also be a wide variation in the number of nonzero values across
different documents. This has numerous implications for many fundamental aspects of text
mining, such as distance computation. For example, while it is possible, in theory, to use the
Euclidean function for measuring distances, the results are usually not very effective from a
practical perspective. This is because Euclidean distances are extremely sensitive to the varying
document lengths (the number of nonzero attributes). The Euclidean distance function cannot
compute the distance between two short documents in a comparable way to that between two
long documents because the latter will usually be larger.

2. Nonnegativity: The frequencies of words take on nonnegative values. When combined with
high-dimensional sparsity, the nonnegativity property enables the use of specialized methods for
document analysis. In general, all data mining algorithms must be cognizant of the fact that the
presence of a word in a document is statistically more significant than its absence. Unlike
traditional multidimensional techniques, incorporating the global statistical characteristics of the
data set in pairwise distance computation is crucial for good distance function design.

3. Side information: In some domains, such as the Web, additional side information is available.
Examples include hyperlinks or other metadata associated with the document. These additional
attributes can be leveraged to enhance the mining process further.
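
As a concrete illustration of the vector space representation discussed above, the following Python sketch (a minimal example; the tiny stop-word list, sample documents, and function names are assumptions) stores only the nonzero, normalized term frequencies of each document and compares documents with cosine similarity, which, unlike the Euclidean distance, is not dominated by document length:

import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is", "to"}   # tiny illustrative stop-word list

def to_vector(text):
    """Tokenize, drop very common words, and return normalized term frequencies."""
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: freq / total for term, freq in counts.items()}   # sparse: only nonzero dimensions

def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(v1[t] * v2.get(t, 0.0) for t in v1)
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

doc1 = to_vector("the cat sat on the mat")
doc2 = to_vector("a cat and a dog sat on a mat")
print(cosine_similarity(doc1, doc2))

Because only the nonzero entries are kept, the representation also reflects the high-dimensional sparsity described in point 1 above.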

Temporal Data Mining

Temporal data mining is the process of extracting non-trivial, implicit, and potentially essential information from large sets of temporal data. Temporal data are a series of primary data types, generally numerical values, and temporal data mining deals with gathering beneficial knowledge from such data. The objective of temporal data mining is to find temporal patterns, unexpected trends, or hidden relations in large sequential data, which may consist of a sequence of nominal symbols from an alphabet (referred to as a temporal sequence) or a sequence of continuous real-valued components (called a time series), by utilizing a set of approaches from machine learning, statistics, and database technologies. Temporal data mining is composed of three major tasks: the description of temporal data, the representation of similarity measures, and the mining operations themselves. Temporal data mining includes processing time series, generally sequences of data that record values of the same attribute at a sequence of multiple time points. Pattern matching over such data, where one searches for specific patterns of interest, has attracted considerable interest in recent years. Temporal data mining can also exploit efficient techniques of data storage, quick processing, and quick retrieval that have been developed for temporal databases. Temporal data mining is an individual phase in the process of knowledge discovery in temporal databases: an algorithm that enumerates temporal patterns from, or fits models to, temporal data is a temporal data mining algorithm. Temporal data mining is concerned with the analysis of temporal data and with discovering temporal patterns and regularities in sets of temporal information. It also allows the possibility of computer-driven, automatic exploration of the data. The various tasks in temporal mining are:

• Data characterization and comparison


• Clustering analysis
• Classification
• Association rules
• Pattern analysis
• Prediction and trend analysis

Temporal data mining has led to a new way of interacting with a temporal database and specifying queries at a much more abstract level than, say, a temporal structured query language permits. It also facilitates data exploration for problems involving multiple dimensions. The basic goal of temporal classification is to predict temporally related fields in a temporal database based on other fields. The problem, in general, is cast as estimating the value of the temporal variable being predicted, given the other fields, the training data in which the target variable is given for each observation, and a set of assumptions representing one’s prior knowledge of the problem. Temporal classification techniques are associated with the complex problem of density estimation.
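
As a small, self-contained illustration of trend analysis on temporal data (the series values and the 3-point window below are made up for this sketch and are not from the text):

# Minimal sketch: simple moving average and a crude trend flag for a time series.
def moving_average(series, window=3):
    """Return the simple moving average of a numeric series."""
    return [
        sum(series[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(series))
    ]

sales = [12, 13, 12, 14, 16, 18, 21, 25]   # hypothetical monthly values
smoothed = moving_average(sales, window=3)

# Crude trend rule: compare the first and last smoothed values.
if smoothed[-1] > smoothed[0]:
    print("Upward trend detected:", smoothed)
else:
    print("No clear upward trend:", smoothed)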

Spatial Data mining

It refers to the process of extracting knowledge, spatial relationships, and interesting patterns that are not explicitly stored in a spatial database; spatial means relating to space. The emergence of spatial data and the extensive usage of spatial databases have led to spatial knowledge discovery. Spatial data mining can be understood as a process that determines some exciting and potentially valuable patterns from spatial databases. Several tools assist in extracting information from geospatial data. These tools play a vital role for organizations like NASA, the National Imagery and Mapping Agency (NIMA), the National Cancer Institute (NCI), and the United States Department of Transportation (USDOT), which tend to make big decisions based on large spatial datasets. Earlier, general-purpose data mining tools such as Clementine, See5/C5.0, and Enterprise Miner were used. These tools were utilized to analyze large commercial databases and were mainly designed for understanding the buying patterns of customers. The general-purpose tools were also used to analyze scientific and engineering data, astronomical data, multimedia data, genomic data, and web data. The following specific features of geographical data prevent the use of general-purpose data mining algorithms:

1. spatial relationships among the variables,


2. spatial structure of errors
3. observations that are not independent
4. spatial autocorrelation among the features
5. non-linear interaction in feature space.

Spatial data must have latitude or longitude, UTM easting or northing, or some other coordinates
denoting a point's location in space. Beyond that, spatial data can contain any number of
attributes pertaining to a place. You can choose the types of attributes you want to describe a
place. Government websites provide a resource by offering spatial data, but you need not be
limited to what they have produced. You can produce your own. Say, for example, you wanted to
log information about every location you've visited in the past week. This might be useful to
provide insight into your daily habits. You could capture your destination's coordinates and list a
number of attributes such as place name, the purpose of visit, duration of visit, and more. You
can then create a shapefile in Quantum GIS or similar software with this information and use the
software to query and visualize the data. For example, you could generate a heatmap of the most
visited places or select all places you've visited within a radius of 8 miles from home. Any data can be made spatial if it can be linked to a location, and one can even have spatiotemporal data linked to locations in both space and time.
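
As a minimal sketch of the location-log example above (the coordinates, attributes, and the 8-mile radius query are hypothetical; no particular GIS package is assumed), the haversine formula can be used to select visits within a given distance of home:

import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

home = (40.7128, -74.0060)  # hypothetical home coordinates
visits = [
    {"name": "grocery store", "lat": 40.7306, "lon": -73.9866, "purpose": "shopping"},
    {"name": "beach house",   "lat": 40.5795, "lon": -73.8365, "purpose": "leisure"},
]

nearby = [v for v in visits
          if haversine_miles(home[0], home[1], v["lat"], v["lon"]) <= 8]
print([v["name"] for v in nearby])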

Classification: Classification determines a set of rules that find the class of the specified object according to its attributes.

Association rules: Association rules are determined from the data sets, and they describe patterns that frequently occur in the database.

Characteristic rules: Characteristic rules describe some parts of the data set.

Discriminant rules: As the name suggests, discriminant rules describe the differences between two parts of the database, such as comparing the employment rates of two cities.

SNO.  Spatial data mining | Temporal data mining

1. It requires space. | It requires time.
2. Spatial mining is the extraction of knowledge/spatial relationships and interesting measures that are not explicitly stored in the spatial database. | Temporal mining is the extraction of knowledge about the occurrence of an event, i.e., whether it follows cyclic, random, or seasonal variations, etc.
3. It deals with spatial (location, geo-referenced) data. | It deals with implicit or explicit temporal content from large quantities of data.
4. Spatial mining works with spatial databases that store spatial objects derived from spatial data and the spatial associations among such objects. | Temporal data mining covers the temporal subject matter as well as its utilization in the modification of fields.
5. It includes finding characteristic rules, discriminant rules, association rules, evaluation rules, etc. | It aims at mining new and unknown knowledge that takes into account the temporal aspects of the data.
6. It is the method of identifying unusual and unexplored but useful models from spatial databases. | It deals with extracting useful knowledge from temporal data.
7. Examples – Determining hotspots, unusual locations. | Examples – An association rule such as “Any person who buys a car also buys a steering lock”; with the temporal aspect, this rule becomes “Any person who buys a car also buys a steering lock after that.”


Cluster Analysis

Clustering means identifying and forming groups. Take the very ordinary example of how you
do your laundry. You group the clothes into whites, dark-colored clothes, light-colored clothes,
permanent press, and the ones to be dry-cleaned. You have five distinct clusters. Each cluster has
a meaning and you can use the meaning to get that cluster cleaned properly. The clustering helps
you take specific and proper action for the individual pieces that make up the cluster. Now think
of a specialty store owner in a resort community who wants to cater to the neighborhood by
stocking the right type of products. If he has data about the age group and income level of each
of the people who frequent the store, using these two variables the store owner can probably put
the customers into four clusters. These clusters may be formed as follows: wealthy retirees
staying in resorts, middle-aged weekend golfers, wealthy young people with club memberships,
and low-income clients who happen to stay in the community. The information about the clusters
helps the store owner in his marketing. Clustering or cluster detection is one of the earliest data
mining techniques. This technique is designated as undirected knowledge discovery or
unsupervised learning. What do we mean by this statement? In the cluster detection technique,
you do not search preclassified data. No distinction is made between independent and dependent
variables. For example, in the case of the store’s customers, there are two variables: age group
and income level. Both variables participate equally in the functioning of the data mining
algorithm. The cluster detection algorithm searches for groups or clusters of data elements that
are similar to one another. What is the purpose of this? You expect similar customers or similar
products to behave in the same way. Then you can take a cluster and do something useful with it.
Again, in the example of the specialty store, the store owner can take the members of the cluster
of wealthy retirees and target products especially interesting to them. Notice one important
aspect of clustering. When the mining algorithm produces a cluster, you must understand what
that cluster means exactly. Only then you will be able to do something useful with that cluster.
The store owner has to understand that one of the clusters represents wealthy retirees residing in
resorts. Only then can the store owner do something useful with that cluster. It is not always easy
to discern the meaning of every cluster the data mining algorithm forms. A bank may get as
many as 20 clusters but may be able to interpret the meanings of only two. However, the return
for the bank from the use of just these two clusters may be enormous enough so that they may
simply ignore the other 18 clusters. If there are only two or three variables or dimensions, it is
fairly easy to spot the clusters, even when dealing with many records. But if you are dealing with
500 variables from 100,000 records, you need a special tool. How does the data mining tool
perform the clustering function? Without getting bogged down in too much technical detail, let
us study the process. First, some basics. If you have two variables, then points on a two-
dimensional graph represent the values of sets of these two variables. Figure 17-10 shows the
distribution of these points. Let us consider an example. Suppose you want the data mining
algorithm to form clusters of your customers, but you want the algorithm to use 50 different
variables for each customer, not just two. Now we are discussing a 50-dimensional space.
Imagine each customer record with different values for the 50 dimensions. Each record is then a
vector defining a “point” in the 50-dimensional space. Let us say you want to market to the
customers and you are prepared to run marketing campaigns for 15 different groups. So you set
the number of clusters as 15. This number is K in the K-means clustering algorithm, a very
effective one for cluster detection. Fifteen initial records (called “seeds”) are chosen as the first
set of centroids based on best guesses. One seed represents one set of values for the 50 variables
chosen from the customer record. In the next step, the algorithm assigns each customer record in
the database to a cluster based on the seed to which it is closest. Closeness is based on the
nearness of the values of the set of 50 variables in a record to the values in the seed record. The
first set of 15 clusters is now formed. Then the algorithm calculates the centroid or mean for each
of the first set of 15 clusters. The values of the 50 variables in each centroid are taken to
represent that cluster. The next iteration then starts. Each customer record is rematched with the
new set of centroids and cluster boundaries are redrawn. After a few iterations the final clusters
emerge. Each implementation of the cluster detection algorithm adopts a method of comparing
the values of the variables in individual records with those in the centroids. The algorithm uses
these comparisons to calculate the distances of individual customer records from the centroids.
After calculating the distances, the algorithm redraws the cluster boundaries.

K-means

In the k-means algorithm, the sum of the squares of the Euclidean distances of data points to
their closest representatives is used to quantify the objective function of the clustering.
Therefore, we have:

Dist(Xi, Yj) = || Xi − Yj ||2^2

Here, || · ||p represents the Lp-norm. The expression Dist(Xi, Yj) can be viewed as the squared error of approximating a data point with its closest representative. Thus, the overall objective minimizes the sum of squared errors over different data points. This is also sometimes referred to as SSE. In such a case, it can be shown that the optimal representative Yj for each of the “optimize” iterative steps is the mean of the data points in cluster Cj. Thus, the only difference between the generic pseudocode and a k-means pseudocode is the specific instantiation of the distance function Dist(·, ·), and the choice of the representative as the local mean of its cluster.
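
The k-means procedure described above can be sketched in a few lines of NumPy. This is an illustrative toy implementation (random seed selection, a fixed number of iterations, and the sample points are assumptions), not production code:

import numpy as np

def k_means(X, k, n_iter=20, seed=0):
    """Minimal k-means: pick k random seeds, then alternate assignment and mean update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initial "seed" records
    for _ in range(n_iter):
        # Assign each point to its closest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
labels, centroids = k_means(X, k=2)
print(labels, centroids)
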
An interesting variation of the k-means algorithm is to use the local Mahalanobis distance for assignment of data points to clusters. Each cluster Cj has its own d×d covariance matrix Σj, which can be computed using the data points assigned to that cluster in the previous iteration. The squared Mahalanobis distance between data point Xi and representative Yj with a covariance matrix Σj is defined as follows:

Dist(Xi, Yj) = (Xi − Yj) Σj^(−1) (Xi − Yj)^T

The use of the Mahalanobis distance is generally helpful when the clusters are elliptically elongated along certain directions, as in the case of Fig. 6.3. The factor Σj^(−1) also provides
local density normalization, which is helpful in data sets with varying local density. The resulting
algorithm is referred to as the Mahalanobis k-means algorithm. The k-means algorithm does not
work well when the clusters are of arbitrary shape. An example is illustrated in Fig. 6.4a, in
which cluster A has a nonconvex shape. The k-means algorithm breaks it up into two parts, and
also merges one of these parts with cluster B. Such situations are common in k-means, because it
is biased toward finding spherical clusters. Even the Mahalanobis k-means algorithm does not
work well in this scenario in spite of its ability to adjust for the elongation of clusters. On the
other hand, the Mahalanobis k-means algorithm can adjust well to varying cluster density, as
illustrated in Fig. 6.4b. This is because the Mahalanobis method normalizes local distances with the use of a cluster-specific covariance matrix. Such a data set cannot be effectively clustered by many density-based algorithms, which are designed to discover arbitrarily shaped clusters. Therefore, different algorithms are suitable in different application settings.
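
The Mahalanobis assignment step can be illustrated as follows (a simplified sketch; the sample cluster and the small regularization term added to keep the covariance matrix invertible are assumptions not discussed in the text):

import numpy as np

def squared_mahalanobis(x, y, cov):
    """Squared Mahalanobis distance between point x and representative y."""
    diff = x - y
    inv_cov = np.linalg.inv(cov)
    return float(diff @ inv_cov @ diff)

# Hypothetical cluster: its points, mean, and covariance from the previous iteration.
cluster_points = np.array([[1.0, 2.0], [1.5, 2.2], [0.8, 1.9], [1.2, 2.4]])
y_j = cluster_points.mean(axis=0)
cov_j = np.cov(cluster_points, rowvar=False) + 1e-6 * np.eye(2)   # regularized covariance

x_i = np.array([1.1, 2.1])
print(squared_mahalanobis(x_i, y_j, cov_j))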

Partitioning Methods

Hierarchical Methods

Hierarchical algorithms typically cluster the data with distances. However, the use of dis-
tance functions is not compulsory. Many hierarchical algorithms use other clustering meth-
ods, such as density- or graph-based methods, as a subroutine for constructing the hierarchy.
So why are hierarchical clustering methods useful from an application-centric point of
view? One major reason is that different levels of clustering granularity provide different
application-specific insights. This provides a taxonomy of clusters, which may be browsed
for semantic insights. As a specific example, consider the taxonomy of Web pages created
by the well-known Open Directory Project (ODP). In this case, the clustering has been created by a
manual volunteer effort, but it nevertheless provides a good understanding of
the multigranularity insights that may be obtained with such an approach. A small portion
of the hierarchical organization is illustrated in Fig. 6.6. At the highest level, the Web pages
are organized into topics such as arts, science, health, and so on. At the next level, the topic
of science is organized into subtopics, such as biology and physics, whereas the topic of health
is divided into topics such as fitness and medicine. This organization makes manual browsing
very convenient for a user, especially when the content of the clusters can be described in a
semantically comprehensible way. In other cases, such hierarchical organizations can be used
by indexing algorithms. Furthermore, such methods can sometimes also be used for creating
better “flat” clusters. Some agglomerative hierarchical methods and divisive methods, such
as bisecting k-means, can provide better quality clusters than partitioning methods such as
k-means, albeit at a higher computational cost. There are two types of hierarchical algorithms, depending
on how the hierarchical tree of clusters is constructed:

1. Bottom-up (agglomerative) methods: The individual data points are successively


agglomerated into higher-level clusters. The main variation among the different meth-
ods is in the choice of objective function used to decide the merging of the clusters.

2. Top-down (divisive) methods: A top-down approach is used to successively partition


the data points into a tree-like structure. A flat clustering algorithm may be used
for the partitioning in each step. Such an approach provides tremendous flexibility
in terms of choosing the trade-off between the balance in the tree structure and the
balance in the number of data points in each node. For example, a tree-growth strategy
that splits the heaviest node will result in leaf nodes with a similar number of data
points in them. On the other hand, a tree-growth strategy that constructs a balanced
tree structure with the same number of children at each node will lead to leaf nodes
with varying numbers of data points.

In the following sections, both types of hierarchical methods will be discussed.


6.4.1 Bottom-Up Agglomerative Methods

In bottom-up methods, the data points are successively agglomerated into higher-level clusters. The algorithm starts with individual data points in their own clusters and successively agglomerates them into higher-level clusters. In each iteration, two clusters are selected that are deemed to be as close as possible. These clusters are merged and replaced with a newly created merged cluster.


Algorithm AgglomerativeMerge(Data: D)
begin
Initialize n × n distance matrix M using D;
repeat
Pick closest pair of clusters i and j using M ;
Merge clusters i and j;
Delete rows/columns i and j from M and create
a new row and column for newly merged cluster;
Update the entries of new row and column of M ;
until termination criterion;
return current merged cluster set;
end
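
The pseudocode above can be translated into a small illustrative Python version. The sketch below (a toy implementation under assumed choices: single linkage and a target number of clusters as the termination criterion) maintains and shrinks the distance matrix M exactly as described:

import numpy as np

def agglomerative_single_linkage(X, num_clusters):
    """Toy agglomerative clustering with single (best) linkage."""
    clusters = [[i] for i in range(len(X))]                      # each point starts in its own cluster
    M = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # initial n x n distance matrix
    np.fill_diagonal(M, np.inf)
    while len(clusters) > num_clusters:
        i, j = np.unravel_index(np.argmin(M), M.shape)           # closest pair of clusters
        if i > j:
            i, j = j, i
        clusters[i] = clusters[i] + clusters[j]                  # merge cluster j into cluster i
        del clusters[j]
        new_row = np.minimum(M[i], M[j])                         # single-linkage row/column update
        M[i, :] = new_row
        M[:, i] = new_row
        M[i, i] = np.inf
        M = np.delete(np.delete(M, j, axis=0), j, axis=1)        # drop row/column of merged cluster
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]])
print(agglomerative_single_linkage(X, num_clusters=3))

This quadratic-memory version is only meant to mirror the pseudocode; for real data sets a library implementation (such as SciPy's, sketched later) is preferable.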

Thus, each merging step reduces the number of clusters by 1. Therefore, a method needs to be designed
for measuring proximity between clusters containing multiple data points, so that they may be merged. It
is in this choice of computing the distances between clusters, that most of the variations among different
methods arise. Let n be the number of data points in the d-dimensional database D, and nt = n − t be
the number of clusters after t agglomerations. At any given point, the method maintains an
nt ×nt distance matrix M between the current clusters in the data. The precise methodology
for computing and maintaining this distance matrix will be described later. In any given
iteration of the algorithm, the (nondiagonal) entry in the distance matrix with the least
distance is selected, and the corresponding clusters are merged. This merging will require
the distance matrix to be updated to a smaller (nt −1)×(nt −1) matrix. The dimensionality
reduces by 1 because the rows and columns for the two merged clusters need to be deleted,
and a new row and column of distances, corresponding to the newly created cluster, needs
to be added to the matrix. This corresponds to the newly created cluster in the data. The
algorithm for determining the values of this newly created row and column depends on
the cluster-to-cluster distance computation in the merging procedure and will be described
later. The incremental update process of the distance matrix is a more efficient option
than that of computing all distances from scratch. It is, of course, assumed that sufficient
memory is available to maintain the distance matrix. If this is not the case, then the distance
matrix will need to be fully recomputed in each iteration, and such agglomerative methods
become less attractive. For termination, either a maximum threshold can be used on the
distances between two merged clusters or a minimum threshold can be used on the number
of clusters at termination. The former criterion is designed to automatically determine the
natural number of clusters in the data but has the disadvantage of requiring the specification
of a quality threshold that is hard to guess intuitively. The latter criterion has the advantage
of being intuitively interpretable in terms of the number of clusters in the data. The order
of merging naturally creates a hierarchical tree-like structure illustrating the relationship
between different clusters, which is referred to as a dendrogram.

The generic agglomerative procedure with an unspecified merging criterion is illustrated


in Fig. 6.7. The distances are encoded in the nt × nt distance matrix M . This matrix
provides the pairwise cluster distances computed with the use of the merging criterion. The
different choices for the merging criteria will be described later. The merging of two clusters
corresponding to rows (columns) i and j in the matrix M requires the computation of some
measure of distances between their constituent objects. For two clusters containing mi and
mj objects, respectively, there are mi · mj pairs of distances between constituent objects.
For example, in Fig. 6.8b, there are 2 × 4 = 8 pairs of distances between the constituent
objects, which are illustrated by the corresponding edges. The overall distance between the
two clusters needs to be computed as a function of these mi · mj pairs. In the following,


different ways of computing the distances will be discussed.

Group-Based Statistics

The following discussion assumes that the indices of the two clusters to be merged are
denoted by i and j, respectively. In group-based criteria, the distance between two groups
of objects is computed as a function of the mi · mj pairs of distances among the constituent
objects. The different ways of computing distances between two groups of objects are as
follows:
1. Best (single) linkage: In this case, the distance is equal to the minimum distance
between all mi · mj pairs of objects. This corresponds to the closest pair of objects
between the two groups. After performing the merge, the matrix M of pairwise dis-
tances needs to be updated. The ith and jth rows and columns are deleted and replaced
with a single row and column representing the merged cluster. The new row (column)
can be computed using the minimum of the values in the previously deleted pair of
rows (columns) in M . This is because the distance of the other clusters to the merged
cluster is the minimum of their distances to the individual clusters in the best-linkage
scenario. For any other cluster k ≠ i, j, this is equal to min{Mik, Mjk} (for rows) and
min{Mki, Mkj } (for columns). The indices of the rows and columns are then updated
to account for the deletion of the two clusters and their replacement with a new one.
The best linkage approach is one of the instantiations of agglomerative methods that
is very good at discovering clusters of arbitrary shape. This is because the data points
in clusters of arbitrary shape can be successively merged with chains of data point
pairs at small pairwise distances to each other. On the other hand, such chaining may
also inappropriately merge distinct clusters when it results from noisy points.

2. Worst (complete) linkage: In this case, the distance between two groups of objects is
equal to the maximum distance between all mi · mj pairs of objects in the two groups.
This corresponds to the farthest pair in the two groups. Correspondingly, the matrix
M is updated using the maximum values of the rows (columns) in this case. For any
value of k ≠ i, j, this is equal to max{Mik, Mjk} (for rows), and max{Mki, Mkj} (for
columns). The worst-linkage criterion implicitly attempts to minimize the maximum
diameter of a cluster, as defined by the largest distance between any pair of points in
the cluster. This method is also referred to as the complete linkage method.
3. Group-average linkage: In this case, the distance between two groups of objects is
equal to the average distance between all mi · mj pairs of objects in the groups. To
compute the row (column) for the merged cluster in M , a weighted average of the ith
and jth rows (columns) in the matrix M is used. For any value of k ≠ i, j, this is equal to (mi·Mik + mj·Mjk)/(mi + mj) (for rows), and (mi·Mki + mj·Mkj)/(mi + mj) (for columns).

4. Closest centroid: In this case, the closest centroids are merged in each iteration. This
approach is not desirable, however, because the centroids lose information about the
relative spreads of the different clusters. For example, such a method will not discrim-
inate between merging pairs of clusters of varying sizes, as long as their centroid pairs
are at the same distance. Typically, there is a bias toward merging pairs of larger
clusters because centroids of larger clusters are statistically more likely to be closer
to each other.

5. Variance-based criterion: This criterion minimizes the change in the objective function
(such as cluster variance) as a result of the merging. Merging always results in a
worsening of the clustering objective function value because of the loss of granularity.
It is desired to merge clusters where the change (degradation) in the objective function
as a result of merging is as little as possible. To achieve this goal, the zeroth, first,
and second order moment statistics are maintained with each cluster. The average
squared error SEi of the ith cluster can be computed as a function of the number mi
of points in the cluster (zeroth-order moment), the sum Fir of the data points in the
cluster i along each dimension r (first-order moment), and the squared sum Sir of the
data points in the cluster i across each dimension r (second-order moment) according
to the following relationship:

SEi = Σ(r=1 to d) [ Sir/mi − (Fir/mi)^2 ]

This relationship can be shown using the basic definition of variance and is used by
many clustering algorithms such as BIRCH (cf. Chap. 7). Therefore, for each cluster,
one only needs to maintain these cluster-specific statistics. Such statistics are easy to
maintain across merges because the moment statistics of a merge of the two clusters i
and j can be computed easily as the sum of their moment statistics. Let SEi∪j denote
the variance of a potential merge between the two clusters i and j. Therefore, the
change in variance on executing a merge of clusters i and j is as follows:

This change can be shown to always be a positive quantity. The cluster pair with the
smallest increase in variance because of the merge is selected as the relevant pair to be merged. (Figure 6.9 illustrates good and bad cases for single-linkage clustering, discussed below.) As before, a matrix M of pairwise values of ΔSEi∪j is maintained along
with moment statistics. After each merge of the ith and jth clusters, the ith and
jth rows and columns of M are deleted and a new column for the merged cluster
is added. The kth row (column) entry (k ≠ i, j) in M of this new column is equal
to SEi∪j∪k − SEi∪j − SEk. These values are computed using the cluster moment
statistics. After computing the new row and column, the indices of the matrix M are
updated to account for its reduction in size.
6. Ward’s method: Instead of using the change in variance, one might also use the
(unscaled) sum of squared error as the merging criterion. This is equivalent to setting
the RHS of Eq. 6.8 to Σ(r=1 to d) (mi·Sir − Fir^2). Surprisingly, this approach is a variant of
the centroid method. The objective function for merging is obtained by multiplying
the (squared) Euclidean distance between centroids with the harmonic mean of the
number of points in each of the pair. Because larger clusters are penalized by this
additional factor, the approach performs more effectively than the centroid method.
The various criteria have different advantages and disadvantages. For example, the single
linkage method is able to successively merge chains of closely related points to discover
clusters of arbitrary shape. However, this property can also (inappropriately) merge two
unrelated clusters, when the chaining is caused by noisy points between two clusters. Exam-
ples of good and bad cases for single-linkage clustering are illustrated in Figs. 6.9a and b,
respectively. Therefore, the behavior of single-linkage methods depends on the impact and
relative presence of noisy data points. Interestingly, the well-known DBSCAN algorithm (cf.
Sect. 6.6.2) can be viewed as a robust variant of single-linkage methods, and it can therefore
find arbitrarily shaped clusters. The DBSCAN algorithm excludes the noisy points between
clusters from the merging process to avoid undesirable chaining effects.
The complete (worst-case) linkage method attempts to minimize the maximum distance
between any pair of points in a cluster. This quantification can implicitly be viewed as an
approximation of the diameter of a cluster. Because of its focus on minimizing the diameter,
it will try to create clusters so that all of them have a similar diameter. However, if some of
the natural clusters in the data are larger than others, then the approach will break up the
larger clusters. It will also be biased toward creating clusters of spherical shape irrespective
of the underlying data distribution. Another problem with the complete linkage method is
that it gives too much importance to data points at the noisy fringes of a cluster because
of its focus on the maximum distance between any pair of points in the cluster. The group-
average, variance, and Ward’s methods are more robust to noise due to the use of multiple
linkages in the distance computation.
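
Where SciPy is available, the linkage criteria discussed above map onto scipy.cluster.hierarchy. The sketch below (illustrative; the synthetic data and the cut into two flat clusters are assumptions) compares single, complete, average, and Ward linkage on the same data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
# Two elongated blobs of points (hypothetical data).
X = np.vstack([rng.normal([0, 0], [1.0, 0.2], size=(50, 2)),
               rng.normal([5, 3], [1.0, 0.2], size=(50, 2))])

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                      # agglomerative merge tree (dendrogram encoding)
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 flat clusters
    print(method, np.bincount(labels)[1:])             # cluster sizes under each criterion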

The agglomerative method requires the maintenance of a heap of sorted distances to


efficiently determine the minimum distance value in the matrix. The initial distance matrix
computation requires O(n2 · d) time, and the maintenance of a sorted heap data structure
requires O(n2 · log(n)) time over the course of the algorithm because there will be a total of
O(n2) additions and deletions into the heap. Therefore, the overall running time is O(n2 ·
d + n2 · log(n)). The required space for the distance matrix is O(n2). The space-requirement
is particularly problematic for large data sets. In such cases, a similarity matrix M cannot
be incrementally maintained, and the time complexity of many hierarchical methods will
increase dramatically to O(n3 · d). This increase occurs because the similarity computations
between clusters need to be performed explicitly at the time of the merging. Nevertheless, it
is possible to speed up the algorithm in such cases by approximating the merging criterion.
The CURE method, discussed in Sect. 7.3.3 of Chap. 7, provides a scalable single-linkage
implementation of hierarchical methods and can discover clusters of arbitrary shape. This
improvement is achieved by using carefully chosen representative points from clusters to
approximately compute the single-linkage criterion.

Practical Considerations

Agglomerative hierarchical methods naturally lead to a binary tree of clusters. It
is generally difficult to control the structure of the hierarchical tree with bottom-up methods as
compared to the top-down methods. Therefore, in cases where a taxonomy of a specific
structure is desired, bottom-up methods are less desirable. A problem with hierarchical methods is that
they are sensitive to a small number of mistakes made during the merging process. For example, if an
incorrect merging decision is made at some stage because of the presence of noise in the data set, then
there is no way to undo it, and the mistake may further propagate in successive merges. In fact, some
variants of hierarchical clustering, such as single-linkage methods, are notorious for successively
merging neighboring clusters because of the presence of a small number of noisy points.
Nevertheless, there are numerous ways to reduce these effects by treating noisy data points
specially. Agglomerative methods can become impractical from a space- and time-efficiency per-
spective for larger data sets. Therefore, these methods are often combined with sampling
and other partitioning methods to efficiently provide solutions of high quality.

6.4.2 Top-Down Divisive Methods

Although bottom-up agglomerative methods are typically distance-based methods, top-


down hierarchical methods can be viewed as general-purpose meta-algorithms that can use
almost any clustering algorithm as a subroutine. Because of the top-down approach, greater
control is achieved on the global structure of the tree in terms of its degree and balance
between different branches. The overall approach for top-down clustering uses a general-purpose flat-
clustering algorithm A as a subroutine. The algorithm initializes the tree at the root node containing all
the data points. In each iteration, the data set at a particular node of the current tree is
split into multiple nodes (clusters). By changing the criterion for node selection, one can
create trees balanced by height or trees balanced by the number of clusters. If the algorithm
A is randomized, such as the k-means algorithm (with random seeds), it is possible to use
multiple trials of the same algorithm at a particular node and select the best one. A wide variety of
algorithms can be designed with different instantiations of the algorithm A
and growth strategy. Note that the algorithm A can be any arbitrary clustering algorithm,
and not just a distance-based algorithm.

6.4.2.1 Bisecting k-Means

The bisecting k-means algorithm is a top-down hierarchical clustering algorithm in which


each node is split into exactly two children with a 2-means algorithm. To split a node into
two children, several randomized trial runs of the split are used, and the split that has the
best impact on the overall clustering objective is used. Several variants of this approach use
different growth strategies for selecting the node to be split. For example, the heaviest node
may be split first, or the node with the smallest distance from the root may be split first.
These different choices lead to balancing either the cluster weights or the tree height.
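
A compact sketch of bisecting k-means, using scikit-learn's KMeans as the 2-means subroutine, is shown below (illustrative only; splitting the heaviest node, five randomized trials per split, and the synthetic data are assumptions consistent with the discussion above):

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, num_clusters, n_trials=5, seed=0):
    """Repeatedly split the largest cluster with 2-means until num_clusters is reached."""
    clusters = [np.arange(len(X))]                       # start with all points in the root node
    while len(clusters) < num_clusters:
        idx = max(range(len(clusters)), key=lambda c: len(clusters[c]))  # heaviest node
        members = clusters.pop(idx)
        best = None
        for t in range(n_trials):                        # several randomized 2-means trial runs
            km = KMeans(n_clusters=2, n_init=1, random_state=seed + t).fit(X[members])
            if best is None or km.inertia_ < best.inertia_:
                best = km
        clusters.append(members[best.labels_ == 0])
        clusters.append(members[best.labels_ == 1])
    return clusters

X = np.random.default_rng(0).normal(size=(60, 2))
for c in bisecting_kmeans(X, num_clusters=4):
    print(len(c))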

Data Mining Applications

Customer Segmentation. This is one of the most widespread applications. Businesses use data
mining to understand their customers. Cluster detection algorithms discover clusters of
customers sharing the same characteristics.

Market Basket Analysis. This is a very useful application for retail. Link analysis algorithms
uncover affinities between products that are bought together. Other businesses such as upscale
auction houses use these algorithms to find customers to whom they can sell higher-value items.
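
To illustrate the kind of affinity that market basket analysis uncovers, here is a minimal sketch (the transactions and the support threshold are hypothetical) that counts how often pairs of products are bought together and reports support and confidence for each frequent pair:

from itertools import combinations
from collections import Counter

transactions = [                       # hypothetical shopping baskets
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"beer", "chips", "bread"},
]

pair_counts = Counter()
item_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(transactions)
for (a, b), count in pair_counts.items():
    support = count / n
    confidence = count / item_counts[a]          # confidence of the rule a -> b
    if support >= 0.4:                           # assumed support threshold
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")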

Risk Management. Insurance companies and mortgage businesses use data mining to uncover
risks associated with potential customers.

Fraud Detection. Credit card companies use data mining to discover abnormal spending patterns
of customers. Such patterns can expose fraudulent use of the cards.

Delinquency Tracking. Loan companies use the technology to track customers who are likely to
default on repayments.

Demand Prediction. Retail and other businesses use data mining to match demand and supply
trends to forecast demand for specific products.

Applications in CRM (Customer Relationship Management)

Customers interact with an enterprise in many ways. CRM is an umbrella term to include the
management of all customer interactions so as to improve the profitability derivable from such
interactions. CRM applications making use of data mining are known as analytic CRM. Analytic
CRM is not confined to one industry. As it is applicable to all industries across the board, the
data mining applications of analytic CRM have very broad appeal. In general, the interactions with
customers in an organization happen throughout the three phases of the customer life cycle:

† Acquisition of a customer
† Value enhancement of a customer
† Retention of a customer

Data mining applications of analytic CRM relate to all three phases of the customer life cycle.

Acquisition of a Customer

In this first phase, you need to identify prospects and convert them into customers. A time-
honored proven method for acquiring new customers has been the direct mail campaign. In fact,
businesses conduct several direct mail campaigns in a year. When mailings are sent to
prospective customers, only a small fraction of the prospects show interest and respond. The rate
of return from the mailings may be increased if you are able to identify good prospects to whom
you can target your mailings. Data mining is effective in identifying such good prospects and
helps focus the marketing efforts much more cost-effectively.

Value Enhancement of a Customer

The value of a customer to an enterprise is based on the purchases of goods and services by that
customer. How can you increase the value of a customer? By selling more to the customer. You
may try to increase the volume of the same goods and services the customer normally purchases
from your enterprise. Also, if you are able to identify the additional goods and services the
customer is likely to buy based on his usual purchases, then you may offer these additional items
to the customer. This is known as cross-selling. In both cases, you may run appropriate
marketing promotions. Data mining is effective in identifying customers and products for such
promotions. Another way in which data mining can help in promotions is to personalize your
marketing effort. When a customer goes to your Web site to order a product, with the use of data
mining that customer can receive a personal greeting and be presented with the specials and
other related products he or she is likely to be interested in.

Retention of a Customer

For most companies, the cost of acquiring a new customer exceeds the cost of retaining a good
customer. If the attrition rate in your company is high, say, 10%, then 100 of your 1000
customers leave each month. At a minimum, you need to replace these 100 customers every
month. The customer acquisition costs could be quite high. This situation calls for a good
customer attrition management program. For your attrition management program, you would
need to identify in advance every month those 100 customers who are likely to leave. Next you
need to know which of these 100 likely candidates are “good” customers providing value to your
company. Then you may target these “good” customers with special promotions to entice them
to stay. Data mining can be effective in customer attrition management programs.

Applications in the Retail Industry

Let us discuss very briefly how the retail industry makes use of data mining and benefits from it.
Fierce competition and narrow profit margins have plagued the retail industry. Forced by these
factors, the retail industry adopted data warehousing earlier than most other industries. Over the
years, these data warehouses have accumulated huge volumes of data. The data warehouses in
many retail businesses are mature and ripe. Also, through the use of scanners and cash registers,
the retail industry has been able to capture detailed point-of-sale data. The combination of the
two features—huge volumes of data and low-granularity data— is ideal for data mining. The
retail industry was able to begin using data mining while others were just making plans. All
types of businesses in the retail industry, including grocery chains, consumer retail chains, and
catalog sales companies, use direct marketing campaigns and promotions extensively. Direct
marketing happens to be quite critical in the industry. All companies depend heavily on direct
marketing. Direct marketing involves targeting campaigns and promotions to specific customer
segments. Cluster detection and other predictive data mining algorithms provide customer
segmentation. As this is a crucial area for the retail industry, many vendors offer data mining
tools for customer segmentation. These tools can be integrated with the data warehouse at the
back end for data selection and extraction. At the front end these tools work well with standard
presentation software. Customer segmentation tools discover clusters and predict success rates
for direct marketing campaigns. Retail industry promotions necessarily require knowledge of
which products to promote and in what combinations. Retailers use link analysis algorithms to
find affinities among products that usually sell together. As you already know, this is market
basket analysis. Based on the affinity grouping, retailers can plan their special sale items and also
the arrangement of products on the shelves. Apart from customer segmentation and market
basket analysis, retailers use data mining for inventory management. Inventory for a retailer
encompasses thousands of products. Inventory turnover and management are significant
concerns for these businesses. Another area of use for data mining in the retail industry relates to
sales forecasting. Retail sales are subject to strong seasonal fluctuations. Holidays and weekends
also make a difference. Therefore, sales forecasting is critical for the industry. The retailers turn
to the predictive algorithms of data mining technology for sales forecasting. What are the other
types of data mining uses in the retail industry? What are the questions and concerns the industry
is interested in? Here is a short list:

† Customer long-term spending patterns


† Customer purchasing frequency
† Best types of promotions
† Store plan and arrangement of promotional displays
† Planning mailers with coupons

† Types of customers buying special offerings


† Sales trends, seasonal and regular
† Manpower planning based on busy times
† Most profitable segments in the customer base

Applications in the Telecommunications Industry

The next industry we want to look at for data mining applications is telecommunications. This
industry was deregulated in the 1990s. In the United States, the cellular alternative changed the
landscape dramatically, although the wave had already hit Europe and a few pockets in Asia
earlier. Against the background of an extremely competitive marketplace, the companies
scrambled to find methods to understand their customers. Customer retention and customer
acquisition have become top priorities in their marketing. Telecommunications companies
compete with one another to design the best offerings and entice customers. No wonder this
climate of competitive pressures has driven telecommunications companies to data mining. All
the leading companies have already adopted the technology and are reaping many benefits.
Several data mining vendors and consulting companies specialize in the problems of this
industry. Customer churn is of serious concern. How many times a week do you get cold calls
from telemarketing representatives in this industry? Many data mining vendors offer products to
contain customer churn. The newer cellular phone market experiences the highest churn rate.
Some experts estimate the total cost of acquiring a single new customer is as high as $500.
Problem areas in the communications network are potential disasters. In today’s competitive
market, customers are tempted to switch at the slightest problem. Customer retention under such
circumstances becomes very fragile. A few data mining vendors specialize in data visualization
products for the industry. These products flash alert signs on the network maps to indicate
potential problem areas, enabling the employees responsible to take preventive action.
Below is a general list of questions and concerns of the industry where data mining applications
are helping:

† Retention of customers in the face of enticing competition


† Customer behavior indicating increased line usage in the future
† Discovery of profitable service packages
† Customers most likely to churn
† Prediction of cellular fraud
† Promotion of additional products and services to existing customers
† Factors that increase the customer’s propensity to use the phone
† Product evaluation compared to the competition

Applications in Biotechnology


In the last decade or so, biotech companies have risen and progressed to the leading edge. They
have been busy accumulating enormous volumes of data. It has become increasingly difficult to
rely on older techniques to make sense of the mountains of data and derive useful results. No
wonder the biotech industry is leaning towards the use of data mining to find patterns and
relationships from the tons of available data. Data mining techniques have become indispensable
components in today’s biological research. We cannot touch upon all the numerous data mining
applications in biotechnology. Several textbooks and journal articles deal with such applications
in detail. For our purposes here, we will briefly observe a few biotechnology applications
supported by data mining.

Data Mining in the Biopharmaceutical Industry

This industry collects huge amounts of biological data of various types. A sample of these types
of data would include clinical trial results, annotated databases of disease profiles, chemical
structures of combinatorial libraries of compounds, molecular pathways to sequences, structure –
activity relationships, and so on. Data mining has become the centerpiece to deal with the
information overload. The biopharmaceutical industry is generating much more biological and
chemical data than the industry knows what to do with. Accordingly, deciding which target and
lead compound to develop further is long, tedious, and expensive. Data mining is brought in to
address this situation and enables the users to make better use of the collected data and improve
the bottom line for the company. Several vendors offer data mining tools and services
specifically for the biopharmaceutical industry. Data mining applications for the
biopharmaceutical industry generally fall into the following major approaches based on the
category of biological data analysis desired.

Influence-Based Mining

In this case, complex data in large databases are scanned for influences between specific sets of
data along several dimensions. Usually, this type of data mining is applied where there are
significant cause-and-effect relationships between the data sets. An example of this would be
large and multivariant gene expression studies.

Affinity-Based Mining

This case is similar to influence-based mining in that data in large and complex data sets is
analyzed across several dimensions. However, in this case the mining technique identifies data
points or sets that have affinity for one another and tend to be grouped together. This approach is
useful in biological motif analysis for distinguishing accidental motifs from motifs with
biological significance.

Time-Delay Mining


In this case, the subject data set is not available immediately in complete form. The set is
collected over time and the mining technique identifies patterns that are confirmed or rejected as
the data set increases and becomes more robust over time. This approach is useful for analysis of
long-term clinical trials.

Trends-Based Mining.

In this case, the mining technique analyzes large data sets for changes or trends over time in
specific data sets. Changes are expected to occur because of cause-and-effect considerations in
experiments on responses to particular drugs or other stimuli over time. The responses are
collected and analyzed.

Comparative Mining.

This approach focuses on overlaying large, complex, and similar data sets for comparison. As an
example, this is useful for clinical trial meta-analysis where data might have been collected at
different sites, at different times, under similar but not necessarily exactly the same conditions.
The objective is to find dissimilarities, not similarities.

Data Mining in Drug Design and Production

Data mining is becoming more and more useful for pharmaceutical companies in design and
production of prescription and generic drugs. By and large, drug manufacturing processes may
be categorized into two methods: synthetic and fermentation. Data mining is used by
manufacturers in both methods. Let us briefly note how data mining assists in these two methods
of production.

Synthetic Method

Generally the production is guided by a production flowchart usually consisting of many steps.
In the initial step of the production process, you begin with the raw material. At the subsequent
steps the raw material is converted into a series of intermediary products through a number of
chemical reactions with other compounds. The yield of the process is the quantity of product per
unit of raw material used. The goal of the optimal production process is to increase the yield. In
this synthetic method of production, the production process is usually very long with numerous
steps. At each step, you obtain an intermediate yield, and at the end you get the overall yield.
Even if you are able to increase the intermediate yield at each step by a reasonable percentage,
your final yield will increase dramatically. So, it is important to find ways to increase the
intermediate yields at each step. Data mining is used to work on the chemical synthesis data in
each step to find the best conditions for yield enhancement at that step. Data mining has helped
to optimize chemical processes involving organic chemical reactions resulting in great economic
benefits to manufacturers.


Fermentation Method

Processes using this method produce drugs such as antibiotics in a fermentation tank. Generally
fermentation processes are very sensitive to numerous affecting factors. As such, fermentation processes are highly complicated, and it is extremely difficult to find an optimization model to improve
the overall production yield. Data mining assists the production process by determining the most
optimal operation parameters to increase the overall yield significantly.

Data Mining in Genomics and Proteomics

In today’s biotech environment, postgenomic science and its numerous studies are producing
mountains of high-dimensional data. All this data will remain as mere data unless patterns and
relationships—in fact, knowledge—are discovered from the data and used effectively. An
encouraging phenomenon is the rapidly increasing use of data mining at all levels of genomics
and proteomics. Genomics is the branch of genetics that studies organisms in terms of their
genomes. Proteomics, on the other hand, is the branch of genetics that studies the full set of
proteins encoded by a genome. As you know, in recent years, both branches of genetics have
gained enormous importance and are areas of intensive research. Already there are several
applications of data mining to studies in genomics and proteomics. There is a concerted effort in
the scientific community pushing for more sophisticated data mining approaches to genomics
and proteomics.

Applications in Banking and Finance

This is another industry where you will find heavy usage of data mining. Banking has been
reshaped by regulations in the past few years. Mergers and acquisitions are more pronounced in
banking and banks have been expanding the scope of their services. Finance is an area of
fluctuation and uncertainty. The banking and finance industry is fertile ground for data mining.
Banks and financial institutions generate large volumes of detailed transactions data. Such data is
suitable for data mining. Data mining applications at banks are quite varied. Fraud detection, risk
assessment of potential customers, trend analysis, and direct marketing are the primary data
mining applications at banks. In the financial area, requirements for forecasting dominate.
Forecasting of stock prices and commodity prices with a high level of approximation can mean
large profits. Forecasting of potential financial disaster can prove to be very valuable. Neural
network algorithms are used in forecasting, options and bond trading, portfolio management, and
in mergers and acquisitions.
