Branch Office: 1st Floor Khan Complex near HDFC Bank Nishat Srinagar, 191121
Cell: +91-7006968379, +91-9419058379 Email: training@ematrixtech.in
Data Mining
Overview & Techniques
SEMESTER: 5TH STREAM: BCA
Head Office: 2nd Floor Shah Complex Maharaj Bazar Opp Jan Bakery Srinagar, 190001
machine learning models that use vast amounts of data to analyze
user interests and recommend products accordingly.
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning
Supervised learning is a type of machine learning in which we provide
labeled sample data to the machine learning system in order to train it;
on that basis, it predicts the output.
The system builds a model from the labeled data in order to understand
the dataset and learn about each data item. Once training and processing
are done, we test the model by providing sample data and checking
whether it predicts the correct output.
The goal of supervised learning is to map input data to output
data. Supervised learning is based on supervision, in the same way that
a student learns under the supervision of a teacher. An
example of supervised learning is spam filtering.
o Classification
o Regression
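The train-then-test idea behind supervised learning can be sketched in a few lines. This is only an illustrative sketch: the nearest-neighbour rule, the two features, and the tiny data set are assumptions made for the example and are not taken from these notes.

```python
# Minimal supervised-learning sketch: a 1-nearest-neighbour spam filter.
# The features and the tiny data set below are made up for illustration.

def distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(training_data, features):
    """Return the label of the training example closest to `features`."""
    nearest = min(training_data, key=lambda pair: distance(pair[0], features))
    return nearest[1]

# Labeled training data: (number of links, number of promotional words) -> label
training_data = [
    ((0, 0), "ham"),
    ((1, 0), "ham"),
    ((0, 1), "ham"),
    ((5, 4), "spam"),
    ((7, 3), "spam"),
    ((6, 6), "spam"),
]

# Held-out test data with known labels, used only to check the predictions.
test_data = [((1, 1), "ham"), ((6, 4), "spam")]

correct = sum(predict(training_data, x) == y for x, y in test_data)
accuracy = correct / len(test_data)
print(f"accuracy on test data: {accuracy:.2f}")
```

The labeled pairs play the role of the teacher: the model never sees the test labels while predicting, they are only used afterwards to check the output.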
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns
without any supervision.
The machine is trained on a set of data that has not
been labeled, classified, or categorized, and the algorithm needs to act
on that data without any supervision. The goal of unsupervised learning
is to restructure the input data into new features or into groups of
objects with similar patterns.
o Clustering
o Association
3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method in which a
learning agent gets a reward for each right action and a penalty for
each wrong action. The agent learns automatically from this feedback.
A robotic dog that automatically learns the movement of its arms
is an example of reinforcement learning.
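The reward-and-penalty loop can be illustrated with a very small sketch. Everything in it is hypothetical: the action names, the environment that rewards exactly one action, and the simple value-update rule are assumptions chosen for the example, not a method described in these notes.

```python
# Feedback-based learning sketch: +1 for the right action, -1 for a wrong one.
# The action names and the environment are hypothetical.

ACTIONS = ["step_left", "step_forward", "step_right"]
RIGHT_ACTION = "step_forward"  # assumed: the environment rewards only this action

def feedback(action):
    """Environment: reward for the right action, penalty otherwise."""
    return 1 if action == RIGHT_ACTION else -1

def train(episodes=20, learning_rate=0.5):
    """Learn a value estimate per action from rewards and penalties."""
    values = {action: 0.0 for action in ACTIONS}
    for _ in range(episodes):
        # Greedy choice: take the action with the highest current estimate
        # (ties are broken in favour of the earlier action in ACTIONS).
        action = max(ACTIONS, key=lambda a: values[a])
        reward = feedback(action)
        # Move the estimate a step towards the observed reward.
        values[action] += learning_rate * (reward - values[action])
    return values

values = train()
best_action = max(values, key=values.get)
print(best_action)
```

A penalised action drops below the untried ones, so the agent moves on until it finds the rewarded action and then keeps choosing it. Practical agents usually add some random exploration, which this sketch leaves out.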
The main goal of this kind of mining is to say something about future
results rather than about current behavior. It uses supervised learning
functions to predict the target value. The methods that fall under this
mining category include classification, time-series analysis and
It requires data
Type of
Prepared By: Er. Rathore Suhail (Scientist-C)
Website: www.ematrixtech.in
The process begins with determining the KDD objectives and ends with
the implementation of the discovered knowledge. At that point, the loop
is closed, and Active Data Mining starts. Subsequently, changes
would need to be made in the application domain, for example offering
various features to cell phone users in order to reduce churn. This closes
the loop: the impacts are then measured on the new data repositories,
and the KDD process starts again. Following is a concise description of
the nine-step KDD process, beginning with a managerial step:
Once the objectives are defined, the data that will be used for the
knowledge discovery process should be determined. This includes
finding out what data is accessible, obtaining the important data, and
then integrating all the data for knowledge discovery into one set
that includes the attributes to be considered for the process. This
step is important because Data Mining learns and discovers from
the accessible data, which is the evidence base for building the models.
If some significant attributes are missing, the entire study may be
unsuccessful; in this respect, the more attributes are considered, the
better. On the other hand, organizing, collecting, and operating
advanced data repositories is expensive, and there is a trade-off
against the opportunity to best understand the phenomena. This
4. Data Transformation
In this stage, data appropriate for Data Mining is prepared
and developed. Techniques here include dimension reduction (for
example, feature selection and extraction, and record sampling) and
attribute transformation (for example, discretization of numerical
attributes and functional transformation). This step can be essential for
the success of the entire KDD project, and it is typically very
project-specific. For example, in medical assessments, the quotient of
attributes may often be the most significant factor, rather than each
attribute by itself. In business, we may need to consider impacts beyond
our control as well as efforts and transient issues, for example when
studying the impact of advertising accumulation. However, if we do not
use the right transformation at the start, then we may obtain a surprising effect
We are now prepared to decide which kind of Data Mining to use, for
example classification, regression, or clustering. This mainly depends on
the KDD objectives and also on the previous steps. There are two
main objectives in Data Mining: the first is prediction, and
the second is description. Prediction is usually referred to as
supervised Data Mining, while descriptive Data Mining includes the
unsupervised and visualization aspects of Data Mining. Most Data
Mining techniques depend on inductive learning, where a model is built
explicitly or implicitly by generalizing from an adequate number of
training examples. The fundamental assumption of the inductive
approach is that the trained model applies to future cases. The
technique also takes into account the level of meta-learning for the
specific set of accessible data.
8. Evaluation
In this step, we assess and interpret the mined patterns and rules, and
their reliability, with respect to the objectives defined in the first step.
Here we consider the preprocessing steps with regard to their impact on
the Data Mining algorithm results, for example by including a feature in
step 4 and repeating from there. This step focuses on the
comprehensibility and utility of the induced model. In this step, the
identified knowledge is also recorded for further use. The last step is
the use of, and overall feedback on, the discovery results acquired by
Data Mining.
Now we are prepared to incorporate the knowledge into another system for
further action. The knowledge becomes effective in the sense that we
may make changes to the system and measure the impacts. The
success of this step determines the effectiveness of the whole KDD
process. There are numerous challenges in this step, such as losing the
"laboratory conditions" under which we have worked. For example, the
knowledge was discovered from a certain static snapshot, usually a
set of data, but now the data becomes dynamic. Data structures may
change, certain quantities may become unavailable, and the data domain
might be modified, for example when an attribute takes a value that was
not expected previously.
1. Data Cleaning:
The data can have many irrelevant and missing parts. Data cleaning
is done to handle these. It involves the handling of missing
data, noisy data, etc.
1. Binning Method:
2. Regression:
3. Clustering:
2. Data Transformation:
This step is taken in order to transform the data into
forms suitable for the mining process. It involves the following:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0
to 1.0 or 0.0 to 1.0).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process (a strategy also known as
attribute construction).
3. Discretization:
This is done to replace the raw values of a numeric attribute by
interval levels or conceptual levels.
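Normalization and discretization can be sketched as follows. This is a minimal illustration that assumes min-max scaling and equal-width intervals; the function names, the sample ages, and the interval labels are made up for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly into the range [new_min, new_max]."""
    low, high = min(values), max(values)
    if high == low:
        # All values are equal; map them to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - low) / (high - low) * (new_max - new_min)
            for v in values]

def discretize(values, labels):
    """Replace each numeric value by an interval label (equal-width bins)."""
    low, high = min(values), max(values)
    width = (high - low) / len(labels) or 1.0  # avoid a zero bin width
    result = []
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        index = min(int((v - low) / width), len(labels) - 1)
        result.append(labels[index])
    return result

ages = [20, 30, 40, 60]  # made-up sample values
print(min_max_normalize(ages))             # scaled into 0.0 to 1.0
print(min_max_normalize(ages, -1.0, 1.0))  # scaled into -1.0 to 1.0
print(discretize(ages, ["young", "middle", "senior"]))
```

Other normalization schemes (for example z-score) and other binning strategies (for example equal-frequency bins) follow the same pattern of replacing raw values by transformed ones.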
3. Data Reduction:
Since data mining is a technique that is used to handle huge
3. Numerosity Reduction:
This makes it possible to store a model of the data instead of the whole
data, for example: Regression Models.
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be
lossy or lossless. If the original data can be retrieved after
reconstruction from the compressed data, the reduction is called
lossless; otherwise it is called lossy. Two effective methods
of dimensionality reduction are wavelet transforms and PCA
(Principal Component Analysis).
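As an illustration of numerosity reduction with a regression model, the sketch below fits a straight line by least squares and then keeps only the two model parameters instead of the raw records. The data values and function names are hypothetical, and the sketch assumes the data really does follow a linear trend.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up records that happen to follow y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

# Store only the model (two numbers) instead of all the records.
slope, intercept = fit_line(xs, ys)

def estimate(x):
    """Approximate a value from the stored model."""
    return slope * x + intercept

print(slope, intercept, estimate(6))
```

When the data is noisy, the stored model only approximates the original values, so this kind of reduction is lossy in general.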
DATA MANAGEMENT ISSUES IN DATA
MINING ALGORITHMS
Data mining is not an easy task, as the algorithms used can get very
complex and the data is not always available in one place. It needs to be
integrated from various heterogeneous data sources. These factors also
create some issues. Here in this tutorial, we will discuss the major
issues regarding −
PERFORMANCE ISSUES
There can be performance-related issues such as the following −
Efficiency and scalability of data mining algorithms − In order
to effectively extract information from the huge amount of data in
databases, data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms −
Factors such as the huge size of databases, the wide distribution of data,
and the complexity of data mining methods motivate the development
of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, which are then
processed in parallel. The results from the partitions
are then merged. Incremental algorithms incorporate database updates
without mining the data again from scratch.
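The partition, process, and merge pattern, together with an incremental update, can be sketched as below. Counting item frequencies stands in for a real mining algorithm, and the transactions are made up. The thread pool only illustrates the structure; a real system would more likely use separate processes or machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Mine one partition: count how often each item occurs."""
    return Counter(item for transaction in partition for item in transaction)

def split(data, parts):
    """Divide the data into `parts` partitions (round-robin)."""
    return [data[i::parts] for i in range(parts)]

def parallel_count(data, parts=2):
    """Process the partitions concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partial_counts = list(pool.map(count_items, split(data, parts)))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

transactions = [
    ["milk", "bread"],
    ["milk"],
    ["bread", "eggs"],
    ["milk", "eggs"],
]
counts = parallel_count(transactions)

# Incremental step: fold in new data without re-mining the old transactions.
new_transactions = [["milk", "bread"]]
counts.update(count_items(new_transactions))
print(counts)
```

The merge step works here because item counts from different partitions can simply be added; many mining algorithms need a more careful merge.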
CLUSTERING
A cluster is a group of objects that belong to the same class. In other
words, similar objects are grouped in one cluster and dissimilar objects
are grouped in another cluster.
What is Clustering?
Clustering is the process of grouping a set of abstract objects into
classes of similar objects.
Points to Remember
A cluster of data objects can be treated as one group.
While doing cluster analysis, we first partition the set of data into
groups based on data similarity and then assign the labels to the
groups.
The main advantage of clustering over classification is that it is
adaptable to changes and helps single out useful features that
distinguish different groups.
CLUSTERING METHODS
Clustering methods can be classified into the following categories −
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
PARTITIONING METHOD
Suppose we are given a database of ‘n’ objects and the partitioning
method constructs ‘k’ partitions of the data. Each partition represents a
cluster, and k ≤ n. This means that the method classifies the data into
k groups that satisfy the following requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
Points to remember −
For a given number of partitions (say k), the partitioning method
will create an initial partitioning.
Then it uses the iterative relocation technique to improve the
partitioning by moving objects from one group to another.
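One well-known partitioning method is k-means, which these notes do not name; it is used here only as an example of initial partitioning followed by iterative relocation. The sketch assumes that the first k objects serve as the initial cluster centres, and the sample points are made up.

```python
def mean(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iterations=100):
    # Assumed initial partitioning: the first k objects act as centres.
    centres = list(points[:k])
    assignment = None
    for _ in range(max_iterations):
        # Relocation: each object goes to the group with the nearest centre,
        # so every object belongs to exactly one group.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centres[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # no object moved, the partitioning is stable
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centre if a group lost all members
                centres[j] = mean(members)
    return assignment, centres

points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
assignment, centres = k_means(points, k=2)
print(assignment)
```

The result depends on the initial partitioning, which is why practical implementations usually try several starting points.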
HIERARCHICAL METHODS
This method creates a hierarchical decomposition of the given set of
data objects. We can classify hierarchical methods on the basis of how
the hierarchical decomposition is formed. There are two approaches
here −
Agglomerative Approach
Divisive Approach
AGGLOMERATIVE APPROACH
DIVISIVE APPROACH
This approach is also known as the top-down approach. Here, we start
with all of the objects in the same cluster. In each successive iteration, a
cluster is split up into smaller clusters. This is done until each object is in
its own cluster or the termination condition holds. This method is rigid, i.e.,
once a merging or splitting is done, it can never be undone.
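The top-down idea can be sketched for one-dimensional data. The splitting rule used here (cut the cluster with the widest internal gap at that gap) and the termination condition (a maximum number of clusters) are assumptions chosen to keep the example short; they are not prescribed by these notes.

```python
def largest_gap(cluster):
    """Return (gap size, split index) for a sorted cluster."""
    gaps = [(cluster[i + 1] - cluster[i], i + 1) for i in range(len(cluster) - 1)]
    return max(gaps) if gaps else (0, None)

def divisive_clustering(values, max_clusters):
    clusters = [sorted(values)]          # start: all objects in one cluster
    while len(clusters) < max_clusters:  # assumed termination condition
        # Pick the cluster whose internal gap is widest and split it there.
        best = max(clusters, key=lambda c: largest_gap(c)[0])
        gap, index = largest_gap(best)
        if index is None or gap == 0:
            break  # nothing left that can be split meaningfully
        clusters.remove(best)
        # A split is final: the two parts are never merged again.
        clusters.extend([best[:index], best[index:]])
    return sorted(clusters)

values = [1, 2, 3, 10, 11, 25]  # made-up sample values
print(divisive_clustering(values, 3))
```

Setting `max_clusters` to the number of objects continues the splitting until every object sits in its own cluster.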
Density-based Method
Grid-based Method
In this method, the objects together form a grid. The object space is
quantized into a finite number of cells that form a grid structure.
Advantages
The major advantage of this method is its fast processing time.
The processing time depends only on the number of cells in each
dimension of the quantized space.
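Quantizing the object space into cells can be sketched as follows. The cell size, the density threshold `min_points`, and the sample points are assumptions made for the illustration; the notes do not specify how dense cells are selected.

```python
from collections import Counter

def to_cell(point, cell_size):
    """Map a point to the index of the grid cell that contains it."""
    return tuple(int(coordinate // cell_size) for coordinate in point)

def grid_counts(points, cell_size):
    """Quantize the object space: count how many objects fall in each cell."""
    return Counter(to_cell(p, cell_size) for p in points)

def dense_cells(points, cell_size, min_points):
    """Cells holding at least `min_points` objects (assumed density threshold)."""
    counts = grid_counts(points, cell_size)
    return sorted(cell for cell, n in counts.items() if n >= min_points)

points = [(0.5, 0.5), (0.7, 0.2), (0.9, 0.9), (3.1, 3.4), (3.8, 3.9), (7.5, 0.3)]
print(dense_cells(points, cell_size=1.0, min_points=2))
```

After the single pass that assigns objects to cells, all further work operates on the cell counts, which is why the cost depends on the number of cells rather than on the number of objects.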
MODEL-BASED METHODS
In this method, a model is hypothesized for each cluster to find the best
fit of data for a given model. This method locates the clusters by
clustering the density function. It reflects spatial distribution of the data
points.
This method also provides a way to automatically determine the number
of clusters based on standard statistics, taking outlier or noise into
account. It therefore yields robust clustering methods.
CONSTRAINT-BASED METHOD
MEASURES OF INTERESTINGNESS
Interestingness measures play an important role in data mining,
regardless of the kind of patterns being mined. These measures are
intended for selecting and ranking patterns according to their potential
interest to the user. Good measures also allow the time and space costs
1. Classification:
2. Clustering:
3. Regression:
4. Association Rules:
This data mining technique helps to discover a link between two or more
items. It finds hidden patterns in the data set.
The algorithm works on data such as, for example,
a list of grocery items that you have been buying for the last six months.
It calculates the percentage of items being purchased together.
o Lift:
This measure compares the confidence with
how often item B is purchased overall.
(Confidence) / ((Item B) / (Entire dataset))
o Support:
This measure describes how often multiple items
are purchased together, compared to the overall dataset.
(Item A + Item B) / (Entire dataset)
o Confidence:
This measure describes how often item B is
purchased when item A is purchased as well.
(Item A + Item B) / (Item A)
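The three measures can be computed directly from a list of transactions. The sketch below reads "(Item A + Item B)" as the number of transactions that contain both items; the grocery data and function names are made up for the example.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, a, b):
    """How often b is purchased when a is purchased."""
    return support(transactions, [a, b]) / support(transactions, [a])

def lift(transactions, a, b):
    """Confidence of a -> b relative to how often b is purchased overall."""
    return confidence(transactions, a, b) / support(transactions, [b])

# Made-up grocery transactions.
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
    ["bread", "milk"],
]

print(support(transactions, ["bread", "milk"]))
print(confidence(transactions, "bread", "milk"))
print(lift(transactions, "bread", "milk"))
```

A lift above 1 suggests that the two items are bought together more often than their individual frequencies would predict, while a lift below 1 suggests the opposite.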
5. Outlier detection:
6. Sequential Patterns:
7. Prediction: