Data Mining 5 Semester Bca
Presidency College (Autonomous)
Reaccredited by NAAC with A+
Presidency Group
5 SEMESTER BCA
By
Dr. J. Vijay Fidelis
Associate Professor, Dept of Computer Applications
Presidency College, Bangalore-24
UNIT – IV CONCEPT DESCRIPTION
Data mining is concerned with the kinds of patterns that can be mined.
From a data analysis point of view, data mining can be classified into the
following two categories:
1. Predictive data mining
2. Descriptive data mining
Descriptive data mining:
▪ It specifies the characteristics of the data in a target data set.
▪ It needs data aggregation and data mining.
▪ It provides precise data.
Predictive data mining:
▪ It executes induction over the current and past data so that prediction can happen.
▪ It needs statistics and data forecasting procedures.
▪ It produces outcomes without ensuring accuracy.
List of descriptive functions
❑ Class/Concept Description
❑ Mining of Frequent Patterns
❑ Mining of Associations
❑ Mining of Correlations
❑ Mining of Clusters
a) Class/Concept Description:
Note: The descriptive and predictive data mining techniques have huge
applications in data mining; they are used to mine the types of patterns.
Descriptive analysis is used to mine data and describe the current data in
terms of past events.
In contrast, predictive analysis answers queries about recent or previous
data, using historical data as the main basis for decisions.
DATA GENERALISATION
Declarative Generalization:
Declarative generalization involves manually deciding how large your data
bin sizes will be in any given scenario.
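As a rough illustration of choosing a bin size by hand, the sketch below maps each value to the bin that contains it. The function name `generalize_to_bin` and the sample ages are hypothetical and are not taken from the slides.

```python
def generalize_to_bin(value, bin_size):
    """Return the half-open bin [low, high) that contains value."""
    low = (value // bin_size) * bin_size
    return (low, low + bin_size)


ages = [23, 27, 31, 38, 45]  # hypothetical ages, for illustration only

# The analyst declares the bin size: 10-year bins hide more detail than 5-year bins.
decade_bins = [generalize_to_bin(a, 10) for a in ages]
five_year_bins = [generalize_to_bin(a, 5) for a in ages]

print(decade_bins)
print(five_year_bins)
```

A larger bin size gives less precise values, so more privacy at the cost of less detail; a smaller bin size does the opposite.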
Identifiers used in Data Generalization in Data
Mining
Identifiers are data points about a subject that can determine the subject's
identity and link to other personal information.
When is Data Generalization Important?
Data generalization:
Data generalization allows you to replace a data value with a less precise
one using a few different techniques. This preserves data utility while
protecting against some types of attacks that could lead to the
re-identification of individuals or reveal private information.
Data Generalization vs. Data Mining
Data generalization is the process of summarizing data by replacing
relatively low-level values with higher-level concepts.
In contrast, data mining involves investigating and analyzing vast data
blocks to uncover relevant patterns and trends.
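To make "replacing low-level values with higher-level concepts" concrete, here is a minimal sketch that rolls city names up to countries and then summarizes by count. The hierarchy `CITY_TO_COUNTRY` and the customer records are hypothetical examples, not data from the slides.

```python
from collections import Counter

# Hypothetical concept hierarchy: city (low level) -> country (higher level).
CITY_TO_COUNTRY = {
    "Bangalore": "India",
    "Chennai": "India",
    "Toronto": "Canada",
    "Vancouver": "Canada",
}


def generalize(records, hierarchy):
    """Replace each low-level value with its higher-level concept and count them."""
    return Counter(hierarchy[r] for r in records)


customers = ["Bangalore", "Chennai", "Toronto", "Bangalore", "Vancouver"]
summary = generalize(customers, CITY_TO_COUNTRY)
print(summary)
```

The summary holds one row per country instead of one row per customer, which is the summarization step that data generalization refers to.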
Approaches to Data Generalization
Statistical Measures in Large databases
Several descriptive statistical measures can be mined from large databases
and used for knowledge discovery.
These measures are listed below:
❑ Measuring Central Tendency.
❑ Measuring the Dispersion of Data.
❑ Boxplot Analysis.
❑ Visualization of Boxplot Dispersion.
❑ Histogram Analysis.
❑ Quantile Plot.
❑ Quantile-Quantile Plot.
❑ Scatter Plot.
❑ Loess Curve.
Measuring Central Tendency:
Median
The middle value of the sorted dataset is called the median. Consider a
dataset comprising ‘n’ elements.
▪ It is a holistic measure of data.
▪ With the data arranged in order, it is the middlemost value.
▪ If there is an odd number of values, the middle value is the median.
▪ If there is an even number of values, the median is the average of the two
middle values.
Mode
It is the value that occurs most frequently in the data.
If there is only one mode in the data, the data is unimodal.
If there are two modes in the data, the data is bimodal.
If there are three modes in the data, the data is trimodal.
For moderately skewed unimodal data, the empirical formula is
mean − mode = 3 × (mean − median).
Example 1. Consider the weight (in kg) of 5 children as 36, 40, 32, 42,
30. Let’s compute mean, median, and mode:
Solution:
Mean = (36 + 40 + 32 + 42 + 30)/5 = 180/5 = 36 kg
Median: Arrange the data in ascending order: 30, 32, 36, 40, 42. The middle
value is 36. So, median = 36 kg.
Mode: Every value occurs exactly once, so there is no clear mode.
Example 2. Consider the ages of five employees as 30, 30, 32, 38, 60
years. Calculate the measures of central tendency.
Solution:
Mean = (30 + 30 + 32 + 38 + 60)/5 = 190/5 = 38 years
Median: Arrange the data in ascending order: 30, 30, 32, 38, 60. The
middlemost value is 32. So, median = 32 years
Mode: 30 years occurs the most number of times, so mode = 30 years
In this example, we saw that mean, median and mode have different values.
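The two worked examples can be checked with Python's standard `statistics` module. This is only a small sketch; the helper name `central_tendency` is introduced here for illustration.

```python
import statistics

weights = [36, 40, 32, 42, 30]  # Example 1: weights in kg
ages = [30, 30, 32, 38, 60]     # Example 2: ages in years


def central_tendency(data):
    """Return the mean, the median, and the list of all modes of the data."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        # multimode returns every value tied for the highest frequency.
        "modes": statistics.multimode(data),
    }


example_1 = central_tendency(weights)
example_2 = central_tendency(ages)
print(example_1)
print(example_2)
```

For Example 1, `multimode` returns all five values because each occurs once, which is how "no clear mode" shows up in code.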
No clear mode, as all the data values occur the same number of times.
Case 2: Skewed distribution
[Figure: skewed distribution with the MEAN and MEDIAN marked]
Assume that his salary changes from 38k per month to 88k per month. This
is a case of right skew, as the data value has been shifted towards the right.
According to the figure, we expect the mean to be greater than the median.
Let us compute the new values of the mean and median.
The new dataset has the values 30, 40, 35, 32, 88.
Mean = (30 + 40 + 35 + 32 + 88)/5 = 225/5 = 45k rupees
Median:
Sort the data in ascending order: 30k, 32k, 35k, 40k, 88k.
Since the middlemost value in the sorted dataset is 35k, we can conclude that
the median salary = 35k rupees.
Thus, the mean value changed, but the median value is still 35k rupees.
It is evident that the mean value is extremely sensitive to changes in data,
whereas the median is relatively stable.
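The effect can be reproduced in a few lines. The "before" dataset is an assumption here: it is the new dataset with the changed salary set back to 38k, since the original values are not listed in the text above.

```python
import statistics

# Assumed original salaries (in thousands of rupees): the new dataset with 88 restored to 38.
before = [30, 40, 35, 32, 38]
# New dataset from the text, after one salary moves far to the right.
after = [30, 40, 35, 32, 88]

mean_before, median_before = statistics.mean(before), statistics.median(before)
mean_after, median_after = statistics.mean(after), statistics.median(after)

print("before:", mean_before, median_before)
print("after: ", mean_after, median_after)
```

Only the mean moves when the single large value changes; the median stays at the same middle value, which is why the median is often preferred for skewed data such as salaries.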
Measures of Dispersion
Merits of Range
▪ It is the simplest measure of dispersion.
▪ Easy to calculate.
▪ Easy to understand.
▪ Independent of change of origin.
Demerits of Range
▪ It is based on only the two extreme observations. Hence, it is affected
by fluctuations.
2. Variance: Subtract the mean from each value in the data set, square each
difference, add the squares, and divide the sum by the total number of
values in the data set. The result is the variance:
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
Merits of Standard Deviation
▪ Squaring the deviations overcomes the drawback of ignoring signs in mean
deviations.
▪ Suitable for further mathematical treatment.
▪ Least affected by the fluctuation of the observations.
▪ The standard deviation is zero if all the observations are equal.
▪ Independent of change of origin.
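A short sketch of the range, the population variance (dividing by N, as in the formula above), and the standard deviation, taken here as the square root of the variance. It reuses the weights from Example 1; the function names are illustrative, and the shifted list is added only to show independence of change of origin.

```python
import math


def data_range(data):
    """Range = largest observation minus smallest observation."""
    return max(data) - min(data)


def population_variance(data):
    """Mean of the squared deviations from the mean (divides by N)."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** 2 for x in data) / len(data)


def population_std(data):
    """Standard deviation = square root of the variance."""
    return math.sqrt(population_variance(data))


weights = [36, 40, 32, 42, 30]        # Example 1 data, in kg
shifted = [x + 10 for x in weights]   # change of origin: add a constant to every value

print(data_range(weights), population_variance(weights), population_std(weights))
print(data_range(shifted), population_variance(shifted), population_std(shifted))
```

Adding the same constant to every value moves the mean but leaves all three measures unchanged, and a list of identical values gives a standard deviation of zero.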
4. Quartiles and Quartile Deviation: The quartiles divide a data set
into quarters. The first quartile (Q1) is the middle number between
the smallest number and the median of the data. The second quartile
(Q2) is the median of the data set. The third quartile (Q3) is the
middle number between the median and the largest number.
Quartile deviation, or semi-interquartile deviation, is
Q = ½ × (Q3 – Q1)
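The definition above can be sketched as "median of the lower half" and "median of the upper half". Several quartile conventions exist; this sketch assumes the one that leaves the median out of both halves when the number of values is odd, so library functions may give slightly different results.

```python
import statistics


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    s = sorted(data)
    n = len(s)
    half = n // 2
    lower = s[:half]
    # For odd n the median itself is excluded from both halves (an assumed convention).
    upper = s[half + 1:] if n % 2 else s[half:]
    return statistics.median(lower), statistics.median(s), statistics.median(upper)


def quartile_deviation(data):
    """Semi-interquartile deviation: half the distance between Q3 and Q1."""
    q1, _, q3 = quartiles(data)
    return (q3 - q1) / 2


weights = [36, 40, 32, 42, 30]  # Example 1 data, in kg
q1, q2, q3 = quartiles(weights)
print(q1, q2, q3, quartile_deviation(weights))
```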
Qualitative data
Graphing:
Representing Data: Graphics don’t just report data, they show trends and
patterns. The graphic used is determined by the types of data collected:
pie charts, bar graphs, histograms, scatterplots.
Graphing:
▪ Is an important way of visually representing data.
▪ Provides a significant amount of information.
▪ Moves from reporting data to showing trends and patterns.
▪ Relationships are more easily identified in a graphic representation than
in a table.
Pie Chart: The area of each sector is proportional
to the frequency
Pie charts are used in data handling. They are circular charts divided into
segments (or 'slices'), each of which represents a value, so that values of
different sizes get slices of different sizes.
For example, in this pie chart, the circle represents a whole class.
Histogram
Scatterplot:
Correlation Between Variables:
Correlation is the relationship between two variables.
Correlation is positive when the values increase together.
Correlation is negative when one value decreases as the other increases.
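One common way to put a number on this is the Pearson correlation coefficient, which is positive when two variables rise together and negative when one falls as the other rises. The slides do not name a specific coefficient, so Pearson's is an assumption here, and the sample data is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations, and the two deviation norms.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


hours_studied = [1, 2, 3, 4, 5]   # hypothetical data
marks = [52, 58, 61, 70, 74]      # rises with hours studied
errors = [9, 7, 6, 4, 2]          # falls as hours studied rise

r_positive = pearson(hours_studied, marks)
r_negative = pearson(hours_studied, errors)
print(r_positive, r_negative)
```

The coefficient always lies between -1 and +1; values near zero indicate little linear relationship.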
Association rule mining:
How does Association Rule Learning work?
Confidence
Confidence indicates how often the rule has been found to be true, that is,
how often the items X and Y occur together in the dataset given that X
occurs. It is the ratio of the number of transactions that contain both X
and Y to the number of transactions that contain X.
Lift
It is the strength of any rule, defined as the ratio of the observed support
to the support expected if X and Y were independent of each other:
$\mathrm{Lift}(X \Rightarrow Y) = \frac{\mathrm{Supp}(X \text{ and } Y)}{\mathrm{Supp}(X) \times \mathrm{Supp}(Y)}$
It has three possible cases:
▪ Lift = 1: The occurrences of the antecedent and the consequent are
independent of each other.
▪ Lift > 1: It indicates the degree to which the two itemsets are dependent
on each other.
▪ Lift < 1: It tells us that one item is a substitute for the other, which
means one item has a negative effect on the other.
What is Apriori Algorithm?
The Apriori algorithm is used for mining frequent itemsets and the relevant
association rules.
Generally, the Apriori algorithm operates on a database containing a huge
number of transactions, for example, the items customers buy at a Big Bazaar.
The Apriori algorithm helps customers to buy their products with ease and
increases the sales performance of the particular store.
Support
Support refers to the default popularity of any product. You find the
support by dividing the number of transactions containing that product by
the total number of transactions. Hence, we get
Support (Biscuits) = (Transactions containing biscuits) / (Total transactions)
= 400/4000 = 10 percent.
Confidence
Confidence refers to the likelihood that customers who bought biscuits also
bought chocolates. So, you need to divide the number of transactions that
contain both biscuits and chocolates by the number of transactions that
contain biscuits to get the confidence.
Confidence = (Transactions containing both biscuits and chocolates) / (Transactions containing biscuits)
= 200/400
= 50 percent.
It means that 50 percent of customers who bought biscuits bought
chocolates also.
Lift
Consider the above example; lift refers to the increase in the ratio of
the sale of chocolates when you sell biscuits. The mathematical
equation of lift is given below.
Lift = Confidence (Biscuits → Chocolates) / Support (Chocolates)
Taking the support of chocolates to be 10 percent, the same as for biscuits,
we get
= 50/10 = 5
It means that customers who buy biscuits are five times more likely to buy
chocolates than customers in general.
If the lift value is below one, it indicates that people are unlikely to
buy both the items together. The larger the value, the better the
combination.
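The three measures can be computed directly from transaction counts. The sketch below uses the counts from the biscuit example (4000 transactions, 400 with biscuits, 200 with both). The number of transactions containing chocolates is not stated in the text, so `CHOCOLATES = 400` is an assumption chosen so that the lift matches the value of 5 quoted above.

```python
def support(count, total):
    """Fraction of all transactions that contain the itemset."""
    return count / total


def confidence(both, antecedent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return both / antecedent


def lift(both, antecedent, consequent, total):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(both, antecedent) / support(consequent, total)


TOTAL = 4000       # total transactions
BISCUITS = 400     # transactions containing biscuits
BOTH = 200         # transactions containing biscuits and chocolates
CHOCOLATES = 400   # ASSUMPTION: not given in the text

print("support(biscuits)             =", support(BISCUITS, TOTAL))
print("confidence(biscuits -> choc.) =", confidence(BOTH, BISCUITS))
print("lift(biscuits -> choc.)       =", lift(BOTH, BISCUITS, CHOCOLATES, TOTAL))
```

Dividing confidence by the support of the consequent is the same as dividing the observed joint support by the support expected under independence; with a different chocolate count the lift changes while support and confidence stay the same.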
How does the Apriori Algorithm work in Data
Mining?
The Apriori Algorithm makes the given
assumptions
Step 2
Create pairs of products such as RP, RO, RM, PO, PM, OM.
You will get the given frequency table:
Itemset   Frequency
RP        4
RO        3
RM        2
PO        4
PM        3
OM        2
Step 3
Apply the same threshold support of 50 percent and keep the pairs that meet
it. In our case, that means a frequency of at least 3.
Thus, we get RP, RO, PO, and PM.
Step 4
Now, look for a set of three products that the customers buy together. We get
the given combinations:
RP and RO give RPO
PO and PM give POM
Step 5
Calculate the frequency of the two itemsets, and you will get the given frequency
table:
Itemset   Frequency
RPO       4
POM       3
If you apply the threshold assumption, you can figure out that the set of
three products that customers buy together most often is RPO.
We have considered an easy example to discuss the Apriori algorithm in data
mining. In reality, you find thousands of such combinations.
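The level-by-level procedure in the steps above can be sketched as a small function. The six transactions below are hypothetical (the slide's own transaction table is not reproduced in the text), and a minimum count of 3 is assumed to stand for the 50 percent threshold. The prune step uses the standard Apriori property that every subset of a frequent itemset must itself be frequent.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset found in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


# Hypothetical transactions over products R, P, O, M (not the slide's data).
transactions = [
    ["R", "P", "O"],
    ["R", "P", "O"],
    ["R", "P", "O", "M"],
    ["P", "O", "M"],
    ["P", "M"],
    ["R"],
]

frequent = apriori(transactions, min_count=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
```

On this data the frequent pairs are RP, RO, PO and PM, and the only frequent set of three products is RPO; the candidate containing P, O and M is pruned without being counted because the pair OM is not frequent.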
APRIORI ALGORITHM
In the above example, you can see that the RPO combination was the
frequent itemset. Now, we find all the rules using RPO:
RP → O, RO → P, PO → R, O → RP, P → RO, R → PO
You can see that there are six different combinations. Therefore, if you have n
elements, there will be $2^n - 2$ candidate association rules.
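The candidate rules for an itemset can be listed by trying every non-empty proper subset as the antecedent and taking the remaining items as the consequent. This is a minimal sketch; the function name `candidate_rules` is illustrative.

```python
from itertools import combinations


def candidate_rules(itemset):
    """Return (antecedent, consequent) pairs for every non-empty proper subset of the itemset."""
    items = sorted(itemset)
    rules = []
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules


rules = candidate_rules({"R", "P", "O"})
for antecedent, consequent in rules:
    print("".join(antecedent), "->", "".join(consequent))
print(len(rules), "rules")
```

An itemset of n items has 2^n subsets; leaving out the empty set and the full set gives the 2^n − 2 candidate rules, which is 6 for n = 3.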