Data Mining: Machine Learning Tutorial

Head Office: 2nd Floor Shah Complex Maharaj Bazar Opp Jan Bakery Srinagar, 190001
Branch Office: 1st Floor Khan Complex near HDFC Bank Nishat Srinagar, 191121
Cell: +91-7006968379, +91-9419058379 Email: training@ematrixtech.in
Data Mining
Overview & Techniques
SEMESTER: 5TH STREAM: BCA
Machine Learning Tutorial

Machine learning is a growing technology that enables computers to
learn automatically from past data. It uses various
algorithms to build mathematical models and make predictions
from historical data. Currently, it is used for
tasks such as image recognition, speech recognition, email
filtering, Facebook auto-tagging, recommender systems, and many
more.
What is Machine Learning?

In the real world, we are surrounded by humans who can learn
from their experiences, and we
have computers or machines which work on our instructions. But can a
machine also learn from experiences or past data like a human does? This
is where Machine Learning comes in.
Prepared By: Er. Rathore Suhail (Scientist-C)

Website: www.ematrixtech.in
Machine Learning is a subset of artificial intelligence that is
mainly concerned with the development of algorithms which allow a
computer to learn from data and past experiences on its own. The
term machine learning was first introduced by Arthur Samuel in 1959.
We can define it in a summarized way as:
Machine learning enables a machine to automatically learn from data,
improve performance from experiences, and predict things without being
explicitly programmed.
With the help of sample historical data, which is known as training
data, machine learning algorithms build a mathematical model that
helps in making predictions or decisions without being explicitly
programmed. Machine learning brings computer science and statistics
together for creating predictive models. Machine learning constructs or
uses algorithms that learn from historical data. In general, the more
relevant data we provide, the better the performance tends to be.

A machine has the ability to learn if it can improve its performance
by gaining more data.
HOW DOES MACHINE LEARNING WORK

A Machine Learning system learns from historical data, builds
prediction models, and whenever it receives new data, predicts the
output for it. The accuracy of the predicted output depends in part on the
amount of data, as a larger amount of data generally helps to build a better model
which predicts the output more accurately.
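The learn-then-predict cycle described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a straight-line model fitted by ordinary least squares; the data values and the names fit_line and predict are made up for the example and do not come from any library.

```python
# Hypothetical historical data: hours studied -> exam score (illustrative values).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]

def fit_line(xs, ys):
    """Learn slope and intercept of y = slope * x + intercept by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply the learned model to a new, unseen input."""
    slope, intercept = model
    return slope * x + intercept

model = fit_line(hours, scores)    # training on historical data
prediction = predict(model, 6.0)   # prediction for new data
print(prediction)
```

Here the "model" is just a pair of numbers learned from the data; no rule for computing the score was written by hand.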
Suppose we have a complex problem where we need to make some
predictions. Instead of writing code for it, we just need to feed the
data to generic algorithms, and with the help of these algorithms, the
machine builds the logic as per the data and predicts the output. Machine
learning has changed our way of thinking about the problem. The below
block diagram explains the working of a Machine Learning algorithm:

Features of Machine Learning:

o Machine learning uses data to detect various patterns in a given
dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is very similar to data mining, as it also deals
with huge amounts of data.
Need for Machine Learning

The need for machine learning is increasing day by day. The reason
is that it is capable of doing tasks
that are too complex for a person to implement directly. As humans, we
have some limitations, as we cannot process huge amounts of data
manually. For this we need computer systems, and this is where
machine learning makes things easier for us.
We can train machine learning algorithms by providing them with a large
amount of data and letting them explore the data, construct the models, and
predict the required output automatically. The performance of a
machine learning algorithm depends on the amount of data, and it can be
measured by a cost function. With the help of machine learning, we
can save both time and money.
The importance of machine learning can be easily understood from its use
cases. Currently, machine learning is used in self-driving cars, cyber
fraud detection, face recognition, friend suggestion by Facebook,
and more. Various top companies such as Netflix and Amazon have built
machine learning models that use a vast amount of data to analyze
user interest and recommend products accordingly.
Following are some key points which show the importance of
Machine Learning:
o Rapid increase in the production of data
o Solving complex problems which are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from
data.
Classification of Machine Learning

At a broad level, machine learning can be classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

1) Supervised Learning
Supervised learning is a type of machine learning method in which we
provide sample labeled data to the machine learning system in order to
train it, and on that basis, it predicts the output.
The system creates a model using labeled data to understand the dataset
and learn about each example. Once the training and processing are done,
we test the model by providing sample data to check whether it
predicts the correct output.
The goal of supervised learning is to map input data to the output
data. Supervised learning is based on supervision, in the same way
as a student learns things under the supervision of a teacher. An
example of supervised learning is spam filtering.
Supervised learning can be grouped further into two categories of
algorithms:
o Classification
o Regression
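As a rough illustration of classification with labeled data, the sketch below trains nothing more than a 1-nearest-neighbour lookup. The two features (number of links, number of exclamation marks) and all data values are hypothetical and chosen only to mirror the spam filtering example; a real spam filter would use far richer features.

```python
# Hypothetical labeled training data: (links, exclamation marks) -> label.
training_data = [
    ((0, 0), "ham"),
    ((1, 0), "ham"),
    ((5, 4), "spam"),
    ((7, 6), "spam"),
]

def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(features):
    """Return the label of the closest training example (1-nearest neighbour)."""
    nearest = min(training_data,
                  key=lambda item: squared_distance(item[0], features))
    return nearest[1]

label_a = classify((6, 5))   # close to the spam examples
label_b = classify((0, 1))   # close to the ham examples
print(label_a, label_b)
```

The labels supplied with the training data play the role of the "teacher": the system only maps new inputs to outputs it has already seen attached to similar inputs.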
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns
without any supervision.
The training is provided to the machine with the set of data that has not
been labeled, classified, or categorized, and the algorithm needs to act
on that data without any supervision. The goal of unsupervised learning
is to restructure the input data into new features or a group of objects
with similar patterns.
In unsupervised learning, we don't have a predetermined result. The
machine tries to find useful insights from the huge amount of data. It can
be further classified into two categories of algorithms:
o Clustering
o Association
3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method in which a
learning agent gets a reward for each right action and a penalty for
each wrong action. The agent learns automatically from this feedback
and improves its performance. In reinforcement learning, the agent
interacts with the environment and explores it. The goal of an agent is to
collect the most reward points, and hence it improves its performance.
A robotic dog which automatically learns the movement of its limbs
is an example of reinforcement learning.
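The reward-and-penalty loop can be sketched with a very small agent. This is a simplified illustration (an epsilon-greedy agent with two actions and a fixed reward rule), not a full reinforcement learning algorithm; the action names, the reward values and the train function are assumptions made for the example.

```python
import random

ACTIONS = ["left", "right"]   # hypothetical actions
RIGHT_ACTION = "right"        # the environment rewards this action

def reward_for(action):
    """Environment feedback: +1 reward for the right action, -1 penalty otherwise."""
    return 1.0 if action == RIGHT_ACTION else -1.0

def train(steps=200, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    value = {a: 0.0 for a in ACTIONS}   # estimated reward per action
    count = {a: 0 for a in ACTIONS}
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)           # explore
        else:
            action = max(ACTIONS, key=value.get)   # exploit the best estimate
        reward = reward_for(action)
        count[action] += 1
        # Incremental average of the rewards seen so far for this action.
        value[action] += (reward - value[action]) / count[action]
    return value

values = train()
best_action = max(values, key=values.get)
print(values, best_action)
```

After training, the agent's estimates favour the rewarded action, which is the sense in which it "improves its performance" from feedback alone.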
Descriptive and Predictive Data Mining

DESCRIPTIVE MINING:
Descriptive mining is used to produce correlations, cross-tabulations,
frequencies, etc. These techniques are used to determine the similarities
in the data and to find existing patterns. One more application of
descriptive analysis is to identify interesting subgroups in the major
part of the data available.
This kind of analytics emphasizes the summarization and transformation of
data into meaningful information for reporting and monitoring.
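Frequency counts and a cross-tabulation, two of the descriptive summaries named above, can be sketched with Python's standard collections.Counter. The records are invented for illustration only.

```python
from collections import Counter

# Hypothetical transaction records: (city, product) pairs.
records = [
    ("Srinagar", "tea"),
    ("Srinagar", "bread"),
    ("Jammu", "tea"),
    ("Srinagar", "tea"),
]

# Frequency: how often each product occurs.
product_frequency = Counter(product for _, product in records)

# Cross-tabulation: counts for every (city, product) combination.
cross_tab = Counter(records)

print(product_frequency["tea"])
print(cross_tab[("Srinagar", "tea")])
```

Both summaries describe what is already in the stored data; neither makes a statement about future records.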
PREDICTIVE DATA MINING:
The main goal of this mining is to say something about future results rather
than current behavior. It uses supervised learning functions to
predict the target value. The methods that come under this
mining category are classification, time-series analysis and
regression. Modelling of data is a necessity for predictive analysis,
and it works by utilizing a few variables of the present to predict
unknown future values of other variables.
Difference between Descriptive and Predictive Data Mining:

S.NO. | COMPARISON | DESCRIPTIVE DATA MINING | PREDICTIVE DATA MINING
1. | Basic | It determines what happened in the past by analyzing stored data. | It determines what can happen in the future with the help of past data analysis.
2. | Preciseness | It provides accurate data. | It produces results but does not ensure accuracy.
3. | Practical analysis methods | Standard reporting, query/drill down and ad-hoc reporting. | Predictive modelling, forecasting, simulation and alerts.
4. | Require | It requires data aggregation and data mining. | It requires statistics and forecasting methods.
5. | Type of approach | Reactive approach | Proactive approach
6. | Describe | Describes the characteristics of the data in a target data set. | Carries out induction over the current and past data so that predictions can be made.
7. | Methods (in general) | What happened? Where exactly is the problem? What is the frequency of the problem? | What will happen next? What is the outcome if these trends continue? What actions are required to be taken?
KDD- Knowledge Discovery in Databases

The term KDD stands for Knowledge Discovery in Databases. It refers
to the broad procedure of discovering knowledge in data and emphasizes
the high-level applications of specific Data Mining techniques. It is a
field of interest to researchers in various fields, including artificial
intelligence, machine learning, pattern recognition, databases, statistics,
knowledge acquisition for expert systems, and data visualization.
The main objective of the KDD process is to extract information from
data in the context of large databases. It does this by using Data Mining
algorithms to identify what is deemed knowledge.

Knowledge Discovery in Databases is considered an automated,
exploratory analysis and modeling of vast data repositories. KDD is the
organized procedure of recognizing valid, useful, and understandable
patterns from huge and complex data sets. Data Mining is the core of the
KDD procedure, involving the application of algorithms that investigate
the data, develop the model, and find previously unknown patterns. The
model is used for extracting knowledge from the data, analyzing the
data, and making predictions.
The availability and abundance of data today make knowledge discovery
and Data Mining a matter of considerable significance and need. Given the
recent development of the field, it isn't surprising that a wide variety of
techniques is presently available to specialists and experts.
The KDD Process

The knowledge discovery process (illustrated in the given figure) is
iterative and interactive, and comprises nine steps. The process is iterative
at each stage, implying that moving back to previous steps might
be required. The process has many creative aspects in the sense that
one cannot present a single formula or make a complete scientific
categorization of the correct decisions for each step and application
type. Thus, it is necessary to understand the process and the different
requirements and possibilities in each stage.
The process begins with determining the KDD objectives and ends with
the implementation of the discovered knowledge. At that point, the loop
is closed, and Active Data Mining starts. Subsequently, changes
would need to be made in the application domain, for example, offering
various features to cell phone users in order to reduce churn. This closes
the loop, and the impacts are then measured on the new data repositories
and the KDD process is launched again. Following is a concise description of the
nine-step KDD process, beginning with a managerial step:
1. Building up an understanding of the application domain
This is the initial preliminary step. It sets the scene for
understanding what should be done with the various decisions such as
transformation, algorithms, representation, etc. The individuals who are
in charge of a KDD project need to understand and define the
objectives of the end-user and the environment in which the knowledge
discovery process will occur (including relevant prior knowledge).
2. Choosing and creating a data set on which discovery will be performed
Once the objectives are defined, the data that will be used for the
knowledge discovery process should be determined. This includes
finding out what data is available, obtaining additional necessary data, and
then integrating all the data for knowledge discovery into one set
that includes the attributes to be considered for the process. This
step is important because Data Mining learns and discovers from
the available data. This is the evidence base for building the models. If
some significant attributes are missing, the entire
study may be unsuccessful. In this respect, the more attributes are
considered, the better. On the other hand, organizing, collecting, and operating
advanced data repositories is expensive, so there is a trade-off
with the opportunity for best understanding the phenomena. This
trade-off is one place where the interactive and iterative aspect
of KDD takes place: one begins with the best available data sets
and later expands them and observes the impact in terms of knowledge
discovery and modeling.
3. Preprocessing and cleansing
In this step, data reliability is improved. It includes data cleaning, for
example, handling missing values and removing noise or
outliers. It might involve complex statistical techniques or the use of a Data
Mining algorithm in this context. For example, when one suspects that a
specific attribute lacks reliability or has many missing values,
this attribute could become the target of a supervised Data Mining
algorithm. A prediction model for this attribute will be
created, and after that, the missing data can be predicted. The extent to
which one pays attention to this step depends on numerous factors.
Regardless, studying these aspects is significant and often revealing in
itself with regard to enterprise data systems.
4. Data Transformation
In this stage, appropriate data for Data Mining is prepared
and developed. Techniques here include dimension reduction (for
example, feature selection and extraction, and record sampling) and
attribute transformation (for example, discretization of numerical
attributes and functional transformation). This step can be essential for
the success of the entire KDD project, and it is typically very
project-specific. For example, in medical assessments, the quotient of attributes
may often be the most significant factor, and not each one by itself. In
business, we may need to consider effects beyond our control as well
as efforts and temporal issues, for example, studying the impact of
advertising accumulation. However, if we do not use the right
transformation at the start, we may obtain a surprising effect
that hints at the transformation required in the next iteration.
Thus, the KDD process reflects upon itself and leads to an
understanding of the transformation required.
5. Prediction and description
We are now prepared to decide which kind of Data Mining to use, for
example, classification, regression, clustering, etc. This mainly depends on
the KDD objectives, and also on the previous steps. There are two
major objectives in Data Mining: the first is prediction, and
the second is description. Prediction is usually referred to as
supervised Data Mining, while descriptive Data Mining includes the
unsupervised and visualization aspects of Data Mining. Most Data
Mining techniques depend on inductive learning, where a model is built
explicitly or implicitly by generalizing from an adequate number of
training examples. The fundamental assumption of the inductive
approach is that the trained model applies to future cases. The
technique also takes into account the level of meta-learning for the
specific set of available data.
6. Selecting the Data Mining algorithm
Having chosen the technique, we now decide on the tactics. This stage
includes selecting a particular method to be used for searching
patterns, which may involve multiple inducers. For example, considering
precision versus understandability, the former is better with neural
networks, while the latter is better with decision trees. For each strategy
of meta-learning, there are several possibilities for how it can be
achieved. Meta-learning focuses on explaining what causes a Data
Mining algorithm to be successful or not on a specific problem. Thus, this
methodology attempts to understand the conditions under which a Data
Mining algorithm is most suitable. Each algorithm has parameters and
strategies of learning, such as ten-fold cross-validation or another
division for training and testing.
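To make the idea of ten-fold cross-validation concrete, the sketch below only shows how the sample indices can be divided into training and testing parts; it does not train any model. The function name k_fold_indices is hypothetical, and the split is done in order without shuffling, which a practical setup would usually add.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Distribute the samples as evenly as possible over the k folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(20, k=10))
print(len(folds), folds[0][1])
```

Each sample is used for testing exactly once and for training in the remaining folds, so every observation contributes to both roles.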
7. Utilizing the Data Mining algorithm
At last, the implementation of the Data Mining algorithm is reached. In
this stage, we may need to run the algorithm several times until a
satisfying outcome is obtained, for example, by tuning the algorithm's
control parameters, such as the minimum number of instances in a single
leaf of a decision tree.
8. Evaluation
In this step, we assess and interpret the mined patterns and rules, and their
reliability, with respect to the objectives defined in the first step. Here we
consider the preprocessing steps with respect to their impact on the Data Mining
algorithm results, for example, including a feature in step 4 and repeating
from there. This step focuses on the comprehensibility and utility of the
induced model. In this step, the identified knowledge is also recorded for
further use. The last step is the use of, and overall feedback on, the discovery
results acquired by Data Mining.
9. Using the discovered knowledge

Now, we are prepared to incorporate the knowledge into another system for
further action. The knowledge becomes effective in the sense that we
may make changes to the system and measure the impacts. The
success of this step determines the effectiveness of the whole KDD
process. There are numerous challenges in this step, such as losing the
"laboratory conditions" under which we have worked. For example, the
knowledge was discovered from a certain static snapshot, usually a
set of data, but now the data becomes dynamic. Data structures may
change, certain quantities may become unavailable, and the data domain
might be modified, such as an attribute that may have a value that was
not expected previously.
Data Preprocessing in Data Mining

Preprocessing in Data Mining:
Data preprocessing is a data mining technique which is used to
transform raw data into a useful and efficient format.

Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle
this, data cleaning is done. It involves handling of missing
data, noisy data, etc.
• (a) Missing Data:
This situation arises when some values are missing in the data. It can be
handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite
large and multiple values are missing within a tuple.
2. Fill the missing values:
There are various ways to do this task. You can choose to fill
the missing values manually, by the attribute mean, or by the most
probable value.
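Filling missing values with the attribute mean can be sketched as follows. The list of ages is hypothetical, and None is used here as an assumed marker for a missing value.

```python
# Hypothetical attribute values; None marks a missing entry.
ages = [20.0, None, 30.0, 40.0, None]

def fill_with_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

filled = fill_with_mean(ages)
print(filled)
```

The mean is computed from the observed values only, so the filled-in entries do not change the attribute's average.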
• (b) Noisy Data:
Noisy data is meaningless data that can’t be interpreted by
machines. It can be generated by faulty data collection, data
entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The
whole data is divided into segments of equal size and then
various methods are performed to complete the task. Each
segment is handled separately. One can replace all data in a
segment by its mean, or boundary values can be used to
complete the task.
2. Regression:
Here data can be made smooth by fitting it to a regression
function. The regression used may be linear (having one
independent variable) or multiple (having multiple
independent variables).
3. Clustering:
This approach groups similar data in a cluster. Outliers
may go undetected, or they will fall outside the clusters.
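As an illustration of the binning method, the sketch below smooths sorted data by replacing every value with the mean of its segment. The price values and the function name smooth_by_bin_means are assumptions for the example; smoothing by boundary values would follow the same pattern.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the data, split it into equal-size bins, and replace each
    value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # hypothetical data
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
print(smoothed_prices)
```

With three values per bin, the nine prices collapse to three distinct levels (9.0, 22.0 and 29.0), which removes small fluctuations inside each segment.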
2. Data Transformation:
This step is taken in order to transform the data into
forms suitable for the mining process. This involves the following ways:
1. Normalization:
It is done in order to scale the data values to a specified range (-1.0
to 1.0 or 0.0 to 1.0).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by
interval levels or conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy.
For example, the attribute “city” can be converted to “country”.
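Normalization to a specified range can be sketched with min-max scaling, one common way of doing it (other schemes such as z-score scaling exist). The income values and the function name min_max_normalize are made up for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

incomes = [200.0, 400.0, 600.0, 1000.0]                # hypothetical data
unit_range = min_max_normalize(incomes)                # 0.0 to 1.0
signed_range = min_max_normalize(incomes, -1.0, 1.0)   # -1.0 to 1.0
print(unit_range)
print(signed_range)
```

The smallest value always maps to the lower bound and the largest to the upper bound, so attributes measured on very different scales become comparable.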
3. Data Reduction:
Data mining is a technique that is used to handle huge
amounts of data. When working with a huge volume of data, analysis
becomes harder. To deal with this, we use data
reduction techniques. They aim to increase storage efficiency and
reduce data storage and analysis costs.
The various steps to data reduction are:

1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the
data cube.
2. Attribute Subset Selection:
The highly relevant attributes should be used; the rest can be
discarded. For performing attribute selection, one can use the level of
significance and the p-value of the attribute. An attribute having a
p-value greater than the significance level can be discarded.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for
example, regression models.
4. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be
lossy or lossless. If the original data can be retrieved after
reconstruction from the compressed data, the reduction is called lossless;
otherwise it is called lossy. Two effective methods
of dimensionality reduction are wavelet transforms and PCA
(Principal Component Analysis).
DATA MANAGEMENT ISSUES IN DATA MINING ALGORITHMS
Data mining is not an easy task, as the algorithms used can get very
complex and data is not always available in one place. It needs to be
integrated from various heterogeneous data sources. These factors also
create some issues. Here in this tutorial, we will discuss the major
issues regarding −
• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues
The following diagram describes the major issues.

Mining Methodology and User Interaction Issues

It refers to the following kinds of issues −
• Mining different kinds of knowledge in databases − Different
users may be interested in different kinds of knowledge. Therefore
it is necessary for data mining to cover a broad range of
knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of
abstraction − The data mining process needs to be interactive
because this allows users to focus the search for patterns, providing
and refining data mining requests based on the returned results.
• Incorporation of background knowledge − Background
knowledge can be used to guide the discovery process and to
express the discovered patterns, not only in concise terms but at
multiple levels of abstraction.
• Data mining query languages and ad hoc data mining − A Data
Mining Query language that allows the user to describe ad hoc
mining tasks should be integrated with a data warehouse query
language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results − Once
the patterns are discovered, they need to be expressed in high-level
languages and visual representations. These representations
should be easily understandable.
• Handling noisy or incomplete data − Data cleaning methods
are required to handle noise and incomplete objects while
mining the data regularities. Without data cleaning methods,
the accuracy of the discovered patterns will be poor.

• Pattern evaluation − The patterns discovered should be
interesting; patterns may be uninteresting because either they
represent common knowledge or lack novelty.
PERFORMANCE ISSUES
There can be performance-related issues such as the following −
• Efficiency and scalability of data mining algorithms − In order
to effectively extract information from the huge amount of data in
databases, data mining algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms −
Factors such as the huge size of databases, the wide distribution of data,
and the complexity of data mining methods motivate the development
of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions which are then
processed in parallel. Then the results from the partitions
are merged. Incremental algorithms incorporate database updates without
mining the data again from scratch.
DIVERSE DATA TYPES ISSUES
• Handling of relational and complex types of data − The
database may contain complex data objects, multimedia data
objects, spatial data, temporal data, etc. It is not possible for one
system to mine all these kinds of data.
• Mining information from heterogeneous databases and global
information systems − The data is available at different data
sources on a LAN or WAN. These data sources may be structured,
semi-structured or unstructured. Therefore mining knowledge
from them adds challenges to data mining.
CLUSTERING
A cluster is a group of objects that belong to the same class. In other
words, similar objects are grouped in one cluster and dissimilar objects
are grouped in another cluster.
What is Clustering?
Clustering is the process of grouping a set of abstract objects into
classes of similar objects.
Points to Remember
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into
groups based on data similarity and then assign labels to the
groups.
• The main advantage of clustering over classification is that it is
adaptable to changes and helps single out useful features that
distinguish different groups.

Applications of Cluster Analysis
• Cluster analysis is broadly used in many applications such as
market research, pattern recognition, data analysis, and image
processing.
• Clustering can also help marketers discover distinct groups in their
customer base and characterize these customer groups
based on their purchasing patterns.
• In the field of biology, it can be used to derive plant and animal
taxonomies, categorize genes with similar functionalities and gain
insight into structures inherent to populations.
• Clustering also helps in the identification of areas of similar land use
in an earth observation database. It also helps in the identification
of groups of houses in a city according to house type, value, and
geographic location.
• Clustering also helps in classifying documents on the web for
information discovery.
• Clustering is also used in outlier detection applications such as
detection of credit card fraud.
• As a data mining function, cluster analysis serves as a tool to gain
insight into the distribution of data and to observe the characteristics of
each cluster.
Requirements of Clustering in Data Mining

The following points describe the requirements that clustering should
meet in data mining −
• Scalability − We need highly scalable clustering algorithms to
deal with large databases.
• Ability to deal with different kinds of attributes − Algorithms
should be capable of being applied to any kind of data such as
interval-based (numerical) data, categorical, and binary data.
• Discovery of clusters with arbitrary shape − The clustering
algorithm should be capable of detecting clusters of arbitrary
shape. It should not be restricted to distance measures that
tend to find spherical clusters of small size.
• High dimensionality − The clustering algorithm should be
able to handle not only low-dimensional data but also
high-dimensional space.
• Ability to deal with noisy data − Databases contain noisy,
missing or erroneous data. Some algorithms are sensitive to such
data and may produce poor quality clusters.
• Interpretability − The clustering results should be interpretable,
comprehensible, and usable.
CLUSTERING METHODS
Clustering methods can be classified into the following categories −
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
PARTITIONING METHOD
Suppose we are given a database of ‘n’ objects and the partitioning
method constructs ‘k’ partitions of the data. Each partition will represent a
cluster and k ≤ n. It means that it will classify the data into k groups,
which satisfy the following requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.
Points to remember −
• For a given number of partitions (say k), the partitioning method
will create an initial partitioning.
• Then it uses the iterative relocation technique to improve the
partitioning by moving objects from one group to another.
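The text does not name a specific partitioning algorithm; as one well-known example of iterative relocation, the sketch below uses a k-means style procedure on one-dimensional points. The naive choice of initial centers, the data values and the function name k_means_1d are assumptions made for illustration.

```python
def k_means_1d(points, k, iterations=10):
    """Partition 1-D points into k groups by iterative relocation."""
    centers = sorted(points)[:k]   # naive initial partitioning (assumption)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each object goes to exactly one group,
        # the one with the nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its old center).
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break                  # no object changed group
        centers = new_centers
    return clusters, centers

data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]   # hypothetical data
clusters, centers = k_means_1d(data, k=2)
print(clusters, centers)
```

Starting from a poor initial partitioning, the objects 2.0 and 3.0 are relocated to the first group in the second pass, after which the partitioning no longer changes.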
HIERARCHICAL METHODS
This method creates a hierarchical decomposition of the given set of
data objects. We can classify hierarchical methods on the basis of how
the hierarchical decomposition is formed. There are two approaches
here −
• Agglomerative Approach
• Divisive Approach
AGGLOMERATIVE APPROACH
This approach is also known as the bottom-up approach. In this, we
start with each object forming a separate group. It keeps on merging the
objects or groups that are close to one another. It keeps on doing so
until all of the groups are merged into one or until the termination
condition holds.
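The bottom-up merging can be sketched for one-dimensional points as follows. This sketch assumes single linkage (the distance between two groups is the distance between their closest members) and uses a target number of clusters as the termination condition; both are choices made for the example, not prescribed by the text.

```python
def agglomerative_1d(points, target_clusters):
    """Bottom-up clustering: start with singletons and repeatedly merge
    the two closest groups."""
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > target_clusters:   # termination condition
        best_pair, best_dist = None, float("inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                dist = min(abs(a - b)
                           for a in clusters[i] for b in clusters[j])
                if dist < best_dist:
                    best_pair, best_dist = (i, j), dist
        i, j = best_pair
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters

groups = agglomerative_1d([1.0, 1.5, 5.0, 5.2, 9.0], target_clusters=3)
print(groups)
```

The two closest objects (5.0 and 5.2) are merged first, then 1.0 and 1.5; the isolated value 9.0 stays in a group of its own.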
DIVISIVE APPROACH
This approach is also known as the top-down approach. In this, we start
with all of the objects in the same cluster. In each successive iteration, a
cluster is split up into smaller clusters. This is done until each object is in
its own cluster or the termination condition holds. Hierarchical methods are rigid, i.e.,
once a merging or splitting is done, it can never be undone.
Approaches to Improve Quality of Hierarchical Clustering

Here are the two approaches that are used to improve the quality of
hierarchical clustering −
• Perform careful analysis of object linkages at each hierarchical
partitioning.
• Integrate hierarchical agglomeration with other approaches by first
using a hierarchical
agglomerative algorithm to group objects into micro-clusters, and
then performing macro-clustering on the micro-clusters.
Density-based Method

This method is based on the notion of density. The basic idea is to
continue growing the given cluster as long as the density in the
neighborhood exceeds some threshold, i.e., for each data point within a
given cluster, the neighborhood of a given radius has to contain at least a
minimum number of points.
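The density condition itself (at least a minimum number of points within a given radius) can be sketched as a simple test. This shows only the neighborhood check, not a complete density-based clustering algorithm; the point coordinates, radius and threshold are hypothetical, and a point is counted as part of its own neighborhood.

```python
def neighbors(points, center, radius):
    """Return all points within `radius` of `center` (Euclidean, 2-D)."""
    cx, cy = center
    return [(x, y) for x, y in points
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2]

def is_dense(points, center, radius, min_points):
    """A neighborhood is dense when it holds at least `min_points` points."""
    return len(neighbors(points, center, radius)) >= min_points

data = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]   # hypothetical data
dense_flags = {p: is_dense(data, p, radius=1.5, min_points=3) for p in data}
print(dense_flags)
```

The four points near the origin satisfy the condition, so a cluster could keep growing through them, while the isolated point (8, 8) does not.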
Grid-based Method
In this method, the object space is quantized
into a finite number of cells that form a grid structure.
Advantages
• The major advantage of this method is fast processing time.
• The processing time depends only on the number of cells in each
dimension in the quantized space.
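Quantizing the object space into cells can be sketched as below. The sketch only maps two-dimensional points to grid cells and counts the objects per cell; the cell size, the points and the function name grid_cell are assumptions for illustration.

```python
from collections import Counter

def grid_cell(point, cell_size):
    """Map a 2-D point to the integer coordinates of the cell containing it."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))

# Hypothetical objects in a 2-D space.
points = [(0.2, 0.3), (0.8, 0.1), (2.5, 2.7), (2.9, 2.1), (5.0, 0.4)]
cell_counts = Counter(grid_cell(p, cell_size=1.0) for p in points)
print(cell_counts)
```

Once the counts per cell are known, later processing can work on the cells instead of the individual objects, which is why the cost is tied to the number of cells.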
MODEL-BASED METHODS
In this method, a model is hypothesized for each cluster, and the method
finds the best fit of the data to the given model. It locates the
clusters by constructing a density function that reflects the spatial
distribution of the data points.
This method also provides a way to automatically determine the number
of clusters based on standard statistics, taking outliers and noise into
account. It therefore yields robust clustering methods.
CONSTRAINT-BASED METHOD
In this method, the clustering is performed by the incorporation of user
or application-oriented constraints. A constraint refers to the user’s
expectation or the properties of the desired clustering results.
Constraints provide us with an interactive way of communicating with the
clustering process. Constraints can be specified by the user or by the
application requirement.
WHAT IS KNOWLEDGE DISCOVERY?

Some people don’t differentiate data mining from knowledge discovery,
while others view data mining as an essential step in the process of
knowledge discovery. Here is the list of steps involved in the
knowledge discovery process −
• Data Cleaning − in this step, noise and inconsistent data are
removed.
• Data Integration − in this step, multiple data sources are
combined.
• Data Selection − in this step, data relevant to the analysis task are
retrieved from the database.
• Data Transformation − in this step, data are transformed or
consolidated into forms appropriate for mining by performing
summary or aggregation operations.
• Data Mining − in this step, intelligent methods are applied in
order to extract data patterns.
• Pattern Evaluation − in this step, data patterns are evaluated.
• Knowledge Presentation − in this step, the mined knowledge is
presented to the user.
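The steps above can be walked through on a toy example. The sketch below
is purely illustrative: the two record lists, the field names, and the
spending threshold used as the "mining" step are all invented for this
example and do not come from any real system.

```python
from collections import defaultdict

# Two hypothetical data sources; None marks a missing (noisy) value.
store_a = [{"customer": "ann", "amount": 20.0}, {"customer": "bob", "amount": None}]
store_b = [{"customer": "ann", "amount": 35.0}, {"customer": "cy", "amount": 5.0}]


def clean(records):
    """Data cleaning: drop records with missing amounts."""
    return [r for r in records if r["amount"] is not None]


integrated = clean(store_a) + clean(store_b)                    # data integration
selected = [(r["customer"], r["amount"]) for r in integrated]   # data selection

totals = defaultdict(float)                                     # data transformation
for customer, amount in selected:                               # (aggregation)
    totals[customer] += amount

# Data mining, reduced here to a trivial rule: customers spending at least 10.
patterns = {c: t for c, t in totals.items() if t >= 10.0}

# Pattern evaluation: rank the findings by total amount.
ranked = sorted(patterns.items(), key=lambda item: item[1], reverse=True)

for customer, total in ranked:                                  # knowledge presentation
    print(f"{customer}: {total:.2f}")
```

A real mining step would apply one of the techniques discussed in this
tutorial instead of a fixed threshold.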

The following diagram shows the process of knowledge discovery −
MEASURES OF INTERESTINGNESS
Interestingness measures play an important role in data mining,
regardless of the kind of patterns being mined. These measures are
intended for selecting and ranking patterns according to their potential
interest to the user. Good measures also allow the time and space costs
of the mining process to be reduced. This section reviews
interestingness measures for rules and summaries, classifies them from
several perspectives, and describes the criteria used to judge whether
a pattern is interesting.
Interestingness measures can be categorized into three classes:
objective, subjective, and semantics-based.
1. Objective Measure: An objective measure is based only on the
raw data. No knowledge about the user or application is required.
Most objective measures are based on theories in probability,
statistics, or information theory. Conciseness, generality,
reliability, peculiarity, and diversity depend only on the data and
patterns, and thus can be considered objective.
2. Subjective Measure: A subjective measure takes into account
both the data and the user of these data. To define a subjective
measure, access to the user’s domain or background knowledge
about the data is required. This access can be obtained by
interacting with the user during the data mining process or by
explicitly representing the user’s knowledge or expectations. In the
latter case, the key issue is the representation of the user’s
knowledge, which has been addressed by various frameworks and
procedures for data mining [Liu et al. 1997, 1999; Silberschatz and
Tuzhilin 1995, 1996; Sahar 1999]. Novelty and surprisingness
depend on the user of the patterns, as well as the data and patterns
themselves, and hence can be considered subjective.
Interestingness Measure Classifying Criteria
3. Semantic Measure: A semantic measure considers the semantics
and explanations of the patterns. Because semantic measures
involve domain knowledge from the user, some researchers
consider them a special type of subjective measure [Yao et al.
2006]. Utility and actionability depend on the semantics of the
data, and thus can be considered semantic. Utility-based measures,
where the relevant semantics are the utilities of the patterns in the
domain, are the most common type of semantic measure. To use a
utility-based approach, the user must specify additional knowledge
about the domain. Unlike subjective measures, where the domain
knowledge is about the data itself and is usually represented in a
format similar to that of the discovered pattern, the domain
knowledge required for semantic measures does not relate to the
user’s knowledge or expectations concerning the data. Instead, it
represents a utility function that reflects the user’s goals. This
function should be optimized in the mined results. For example, a
store manager might prefer association rules that relate to high-
profit items over those with higher statistical significance.
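The store manager example can be made concrete with a small sketch. The
two rules, their confidence values, and the profit figures are invented
for illustration, and the utility function (confidence multiplied by
profit per sale) is only one possible assumption about the user’s goals.

```python
# Hypothetical association rules with made-up confidence and profit values.
rules = [
    {"rule": "bread -> butter", "confidence": 0.90, "profit": 0.50},
    {"rule": "laptop -> warranty", "confidence": 0.40, "profit": 60.00},
]


def utility(rule):
    """Assumed utility function: expected profit per triggering purchase."""
    return rule["confidence"] * rule["profit"]


best_by_confidence = max(rules, key=lambda r: r["confidence"])
best_by_utility = max(rules, key=utility)

print(best_by_confidence["rule"])
print(best_by_utility["rule"])
```

The rule with the highest confidence is not the one with the highest
utility, which is why a utility-based measure can rank patterns
differently from a purely statistical one.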
4. Reliability: A pattern is reliable if the relationship described by
the pattern occurs in a high percentage of applicable cases. For
example, a classification rule is reliable if its predictions are highly
accurate, and an association rule is reliable if it has high
confidence. Many measures from probability, statistics, and
information retrieval have been proposed to measure the reliability
of association rules [Ohsaki et al. 2004; Tan et al. 2002].
5. Peculiarity: A pattern is peculiar if it is far away from other
discovered patterns according to some distance measure. Peculiar
patterns are generated from peculiar data (or outliers), which are
relatively few in number and significantly different from the rest of
the data [Knorr et al. 2000; Zhong et al. 2003]. Peculiar patterns may
be unknown to the user, hence interesting.
6. Diversity: A pattern is diverse if its elements differ significantly
from each other, while a set of patterns is diverse if the patterns in
the set differ significantly from each other. Diversity is a common
factor for measuring the interestingness of summaries [Hilderman
and Hamilton 2001]. According to a simple point of view, a
summary can be considered diverse if its probability distribution is
far from the uniform distribution. A diverse summary may be
interesting because in the absence of any relevant knowledge, a
user commonly assumes that the uniform distribution will hold in a
summary. According to this reasoning, the more diverse the
summary is, the more interesting it is. We are unaware of any
existing research on using diversity to measure the interestingness
of classification or association rules.
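The idea that a summary is diverse when its distribution is far from
uniform can be illustrated with a simple distance. The sketch below uses
the sum of squared differences between the observed proportions and the
uniform proportion. This is one simple choice made here for illustration
and is not claimed to be the measure used in the cited work.

```python
def distance_from_uniform(counts):
    """Sum of squared gaps between observed proportions and the uniform one."""
    total = sum(counts)
    uniform = 1 / len(counts)
    return sum((c / total - uniform) ** 2 for c in counts)


even_summary = [25, 25, 25, 25]      # hypothetical counts per summary row
skewed_summary = [97, 1, 1, 1]

print(distance_from_uniform(even_summary))
print(distance_from_uniform(skewed_summary))
```

The evenly spread summary has a distance of zero, while the skewed
summary scores higher and would be ranked as more diverse under this
reasoning.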
7. Novelty: A pattern is novel to a person if he or she did not know it
before and is not able to infer it from other known patterns. No
known data mining system represents everything that a user
knows, and thus, novelty cannot be measured explicitly with
reference to the user’s knowledge. Similarly, no known data
mining system represents what the user does not know, and
therefore, novelty cannot be measured explicitly with reference to
the user’s ignorance. Instead, novelty is detected by having the
user either explicitly identify a pattern as novel [Sahar 1999] or
notice that a pattern cannot be deduced from and does not
contradict previously discovered patterns. In the latter case, the
discovered patterns are being used as an approximation to the
user’s knowledge.

8. Surprising: A pattern is surprising (or unexpected) if it contradicts
a person’s existing knowledge or expectations [Liu et al. 1997, 1999;
Silberschatz and Tuzhilin 1995, 1996]. A pattern that is an
exception to a more general pattern which has already been
discovered can also be considered surprising [Bay and Pazzani
1999; Carvalho and Freitas 2000]. Surprising patterns are
interesting because they identify failings in previous knowledge
and may suggest an aspect of the data that needs further study. The
difference between surprisingness and novelty is that a novel
pattern is new and not contradicted by any pattern already known
to the user, while a surprising pattern contradicts the user’s
previous knowledge or expectations.
DATA MINING TECHNIQUES

Data mining involves the use of refined data analysis tools to find
previously unknown, valid patterns and relationships in huge data sets.
These tools can incorporate statistical models, machine learning
techniques, and mathematical algorithms, such as neural networks or
decision trees. Thus, data mining incorporates analysis and prediction.
Drawing on methods and technologies from the intersection of machine
learning, database management, and statistics, professionals in data
mining have devoted their careers to better understanding how to
process and draw conclusions from huge amounts of data. But what
methods do they use to make it happen?
In recent data mining projects, various major data mining techniques
have been developed and used, including association, classification,
clustering, prediction, sequential patterns, and regression.

1. Classification:
This technique is used to obtain important and relevant information
about data and metadata. This data mining technique helps to classify
data into different classes.
Data mining frameworks can be classified by different criteria, as follows:
i. Classification of data mining frameworks as per the type of
data sources mined:
This classification is as per the type of data handled, for example,
multimedia, spatial data, text data, time-series data, World Wide
Web data, and so on.
ii. Classification of data mining frameworks as per the database
involved:
This classification is based on the data model involved, for example,
object-oriented database, transactional database, relational
database, and so on.
iii. Classification of data mining frameworks as per the kind of
knowledge discovered:
This classification depends on the types of knowledge discovered
or data mining functionalities, for example, discrimination,
classification, clustering, characterization, etc. Some frameworks
are comprehensive and offer several data mining functionalities
together.
iv. Classification of data mining frameworks according to data
mining techniques used:
This classification is as per the data analysis approach utilized,
such as neural networks, machine learning, genetic algorithms,
visualization, statistics, data warehouse-oriented or
database-oriented approaches, etc.
The classification can also take into account the level of user
interaction involved in the data mining procedure, such as
query-driven systems, autonomous systems, or interactive exploratory
systems.
2. Clustering:
Clustering is the division of information into groups of similar objects.
Representing the data by a few clusters necessarily loses certain fine
details, but achieves simplification. It models data by its clusters.
Data modeling puts clustering in a historical perspective rooted in
statistics, mathematics, and numerical analysis. From a machine learning
point of view, clusters relate to hidden patterns, the search for clusters
is unsupervised learning, and the resulting framework represents a data
concept. From a practical point of view, clustering plays an
outstanding role in data mining applications such as scientific
data exploration, text mining, information retrieval, spatial database
applications, CRM, Web analysis, computational biology, medical
diagnostics, and much more.
In other words, clustering analysis is a data mining technique to
identify similar data. This technique helps to recognize the
differences and similarities between the data. Clustering is similar
to classification in that it groups data together based on their
similarities, but the groups are not defined in advance.
3. Regression:
Regression analysis is the data mining process used to identify and
analyze the relationship between variables, i.e., how one variable
changes in the presence of other factors. It is used to estimate the
likelihood of a specific variable, given the presence of other
variables. Regression is primarily a form of planning and modeling. For
example, we might use it to project certain costs, depending on other
factors such as availability, consumer demand, and competition.
Primarily, it describes the relationship between two or more variables
in the given data set.
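As a small illustration of projecting a cost from another factor, the
sketch below fits a straight line by ordinary least squares. It assumes
a single input variable, and the demand and cost figures are invented so
that the relationship is exactly cost = 2 * demand + 5. The function
name fit_line is also chosen here for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


demand = [10, 20, 30, 40]            # hypothetical consumer demand
cost = [25.0, 45.0, 65.0, 85.0]      # hypothetical cost, here 2 * demand + 5

slope, intercept = fit_line(demand, cost)
projected = slope * 50 + intercept   # projected cost at a demand of 50
print(slope, intercept, projected)
```

Real data rarely lies exactly on a line, so in practice the fitted line
is an estimate of the relationship rather than an exact rule.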
4. Association Rules:

This data mining technique helps to discover a link between two or more
items. It finds hidden patterns in the data set.
Association rules are if-then statements that help to show the
probability of interactions between data items within large data sets in
different types of databases. Association rule mining has several
applications and is commonly used to find sales correlations in
transactional data or patterns in medical data sets.
The algorithm works on data such as a list of grocery items that you
have been buying for the last six months. It calculates the percentage
of items being purchased together.
There are three major measurement techniques:
o Support:
This measurement technique measures how often items A and B are
purchased together, compared to the overall dataset.
(Transactions with both Item A and Item B) / (Entire dataset)
o Confidence:
This measurement technique measures how often item B is
purchased when item A is purchased as well.
(Transactions with both Item A and Item B) / (Transactions with Item A)
o Lift:
This measurement technique compares the confidence with how often
item B is purchased overall. A lift above 1 suggests that items A
and B are bought together more often than expected by chance.
(Confidence) / ((Transactions with Item B) / (Entire dataset))
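The three measures can be computed directly on a small list of
transactions. The sketch below is a minimal illustration: the four
shopping baskets are invented, lift is taken as the confidence divided
by the share of transactions containing item B, and the function names
are chosen here for the example.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]


def support(itemset):
    """Share of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(a, b):
    """How often the items in `b` are bought when the items in `a` are bought."""
    return support(a | b) / support(a)


def lift(a, b):
    """Confidence of a -> b relative to how often `b` is bought overall."""
    return confidence(a, b) / support(b)


a, b = {"bread"}, {"butter"}
print(support(a | b), confidence(a, b), lift(a, b))
```

Here bread and butter appear together in half of the baskets, butter is
bought in two out of three bread purchases, and the lift is above 1.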
5. Outlier detection:
This type of data mining technique relates to the observation of data
items in the data set which do not match an expected pattern or
expected behavior. This technique may be used in various domains like
intrusion detection, fraud detection, etc. It is also known as Outlier
Analysis or Outlier Mining. An outlier is a data point that diverges too
much from the rest of the dataset. Most real-world datasets contain
outliers. Outlier detection plays a significant role in the data
mining field and is valuable in numerous areas, such as network
intrusion identification, credit or debit card fraud detection, and
detecting outlying readings in wireless sensor network data.
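One simple way to flag a data point that diverges too much from the rest
is a z-score rule, sketched below. This is only one of many possible
approaches. The threshold of 2 standard deviations, the function name
find_outliers, and the transaction amounts are assumptions made for this
example.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` std. deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:                   # all values identical: nothing diverges
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


amounts = [10, 11, 9, 10, 12, 10, 11, 50]   # hypothetical card transactions
outliers = find_outliers(amounts)
print(outliers)
```

Only the amount of 50 is flagged. A large outlier also inflates the mean
and the standard deviation, so more robust rules are often preferred on
real data.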
6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized
for evaluating sequential data to discover sequential patterns. It
consists of finding interesting subsequences in a set of sequences,
where the interestingness of a subsequence can be measured in terms of
different criteria like length, occurrence frequency, etc.
In other words, this technique of data mining helps to discover or
recognize similar patterns in transaction data over a period of time.
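Occurrence frequency can be illustrated by counting ordered pairs of
items across purchase sequences. The sketch below is a deliberately
small illustration limited to subsequences of length two. The sequences,
the function name frequent_pairs, and the minimum count are invented for
the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories, each in the order the items were bought.
sequences = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone", "charger"],
]


def frequent_pairs(sequences, min_count):
    """Count ordered pairs (a before b) and keep those seen in enough sequences."""
    counts = Counter()
    for seq in sequences:
        # combinations keeps positional order, so (a, b) means a came before b;
        # the set ensures each sequence contributes at most once per pair
        counts.update(set(combinations(seq, 2)))
    return {pair: n for pair, n in counts.items() if n >= min_count}


patterns = frequent_pairs(sequences, min_count=2)
print(patterns)
```

In this data a phone is followed by a charger in all three sequences,
which makes that ordered pair the most frequent pattern.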
7. Prediction:
Prediction uses a combination of other data mining techniques, such as
trend analysis, clustering, classification, etc. It analyzes past events
or instances in the right sequence to predict a future event.
