
DATA MINING

Presidency College
(Autonomous)

5 SEMESTER BCA

By
Dr. J. Vijay Fidelis
Associate Professor, Dept of Computer Applications
Presidency College, Bangalore-24

Reaccredited by NAAC with A+
Presidency Group
UNIT – IV CONCEPT DESCRIPTION

Data mining tasks are characterized by the kinds of patterns that can be mined.
From a data analysis point of view, data mining can be classified into the
following two categories:
1. Predictive data mining
2. Descriptive data mining

1. Predictive data mining:

'Predictive' means to predict something, so predictive data mining is
analysis done to predict future events, other data, or trends.
The main goal of predictive mining is to predict future results rather than
to describe current behavior.
E.g.: past sales data may enable a shopkeeper to project what will happen
in the business in future, so that business people can plan accordingly.
Advantages of Predictive mining in business

The most important business benefits of predictive mining are:

❑ It increases the company's production.
❑ It reduces risks in business.
❑ It helps business analysts make better decisions in a business organization.
❑ It helps to maintain a competitive environment.
2. Descriptive data mining:

As the name suggests, descriptive mining "describes" the data. Once the data
is captured, we convert it into a human-interpretable form.
Descriptive analytics is useful because it enables us to learn from the past.
Descriptive mining is usually used to provide correlation, cross-tabulation,
frequency, etc.
These techniques are used to determine regularities in the data and to
reveal patterns.
It targets the summarization and conversion of data into meaningful
information for reporting and monitoring.
What Can Descriptive Analytics Tell Us?
Descriptive data mining provides descriptive information about the data.
It can help in various fields; some examples are:
❖ Comparing pre- and post-assessment test results.
❖ Collecting survey responses and analyzing people's opinions.
❖ Understanding the total stock in inventory.
❖ Finding the average dollars spent per customer and the year-over-year
change in sales.
Descriptive data mining | Predictive data mining
Descriptive mining is usually used to provide correlation, cross-tabulation, frequency, etc. | The term 'Predictive' means to predict something, so predictive data mining is the analysis done to predict the future event or other data or trends.
It is based on the reactive approach. | It is based on the proactive approach.
It specifies the characteristics of the data in a target data set. | It executes the induction over the current and past data so that prediction can happen.
It needs data aggregation and data mining. | It needs statistics and data forecasting procedures.
It provides precise data. | It produces outcomes without ensuring accuracy.
List of descriptive functions

❑ Class/Concept Description
❑ Mining of Frequent Patterns
❑ Mining of Associations
❑ Mining of Correlations
❑ Mining of Clusters
a) Class/Concept Description:

A class or concept is a category with which data can be associated.
For example, in a company, the classes of items for sale include computers
and printers, and the concepts of customers include big spenders and budget
spenders.
Summarized descriptions of a class or a concept are called class/concept
descriptions.
These descriptions can be derived in the following two ways −
❖ Data Characterization − This refers to summarizing the data of the class
under study. The class under study is called the target class.
❖ Data Discrimination − This refers to comparing the target class with one
or more predefined contrasting groups or classes.
b) Mining of Frequent Patterns:
Frequent patterns are patterns that occur frequently in transactional data.
E.g.: finding a set of students who frequently show poor performance in
semester examinations.
The kinds of frequent patterns are −
❖ Frequent Item Set − A set of items that frequently appear together, for
example, milk and bread.
❖ Frequent Subsequence − A sequence of events that occurs frequently, such
as the purchase of a camera being followed by the purchase of a memory card.
❖ Frequent Substructure − Substructure refers to different structural forms,
such as graphs, trees, or lattices, which may be combined with item sets or
subsequences.
MINING OF ASSOCIATIONS AND MINING OF CORRELATIONS

c) Mining of Associations: Associations are used in retail sales to identify
items that are frequently purchased together. It is the process of uncovering
relationships among data and determining association rules.
For example, a retailer generates an association rule showing that milk is
sold with bread 70% of the time, while biscuits are sold with bread only 30%
of the time.
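
The percentages in the retailer example can be read as the fraction of bread transactions that also contain the other item. The sketch below illustrates that reading on a small set of made-up baskets; the transactions and the helper name co_occurrence_rate are assumptions for illustration, not data from the notes.

```python
# Hypothetical toy transactions; each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "biscuits"},
    {"bread", "milk"},
    {"bread", "biscuits"},
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "biscuits", "milk"},
    {"bread"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]


def co_occurrence_rate(baskets, given, other):
    """Fraction of baskets containing `given` that also contain `other`."""
    with_given = [b for b in baskets if given in b]
    if not with_given:
        return 0.0
    both = sum(1 for b in with_given if other in b)
    return both / len(with_given)


milk_rate = co_occurrence_rate(transactions, "bread", "milk")
biscuit_rate = co_occurrence_rate(transactions, "bread", "biscuits")
print(f"bread -> milk: {milk_rate:.0%}")
print(f"bread -> biscuits: {biscuit_rate:.0%}")
```

With these baskets, 7 of the 10 bread transactions contain milk and 3 contain biscuits, which reproduces the 70% and 30% figures.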

d) Mining of Correlations: This is an additional analysis performed to
uncover interesting statistical correlations between associated
attribute-value pairs or between two item sets, in order to determine whether
they have a positive, negative, or no effect on each other.
E.g.: after measuring everyone's height and weight, you could compare
heights and weights to see whether they are related to each other.
MINING OF CLUSTERS
e) Mining of Clusters: A cluster is a group of similar objects.
Cluster analysis refers to forming groups of objects that are very similar to
each other but highly different from the objects in other clusters.

Note: Descriptive and predictive data mining techniques are both widely
applied in data mining; they determine the types of patterns that are mined.
Descriptive analysis mines the data to describe current data and past events.
In contrast, predictive analysis answers queries about recent or previous
data, using historical data as the main basis for decisions.
DATA GENERALISATION

Data generalization is the process of broadening the classification of data
in a database.
This helps a user zoom out from the data to get a broader picture of trends
or insights.
Data generalization summarizes data by replacing relatively low-level values
(such as numeric values for an attribute age) with higher-level concepts
(such as young, middle-aged, and senior).
It is therefore a process that abstracts a large set of task-relevant data
in a database from a relatively low conceptual level to higher conceptual
levels.
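
As a minimal sketch of the age example, the code below replaces numeric ages with the higher-level concepts young, middle-aged, and senior, and then summarizes the generalized values. The cut-off ages (30 and 60), the sample ages, and the function name generalize_age are assumptions chosen for illustration; the notes do not specify them.

```python
from collections import Counter


def generalize_age(age):
    """Map a numeric age to a higher-level concept (cut-offs are assumed)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"


ages = [22, 25, 31, 47, 58, 63, 71]  # hypothetical low-level values
generalized = [generalize_age(a) for a in ages]
summary = Counter(generalized)       # how many records fall under each concept

print(generalized)
print(summary)
```

The seven distinct numbers collapse into three concepts with counts, which is the kind of broader picture that generalization is meant to give.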
EXAMPLE OF DATA GENERALISATION
[Figure: example of data generalisation]
DATA GENERALIZATION

Data generalization in data mining substitutes a precise value with a less
precise one, which may appear counterintuitive. Still, it is a practical and
widely used technique in data mining, analysis, and secure storage.
Two Forms of Data Generalization in Data Mining
There are two main forms of data generalization in data mining: automated
and declarative.
Automated Generalization:
Automated generalization uses algorithms to determine the minimum amount of
generalization or distortion required to ensure adequate privacy while
retaining accuracy.
DECLARATIVE GENERALIZATION

Declarative generalization involves manually deciding how large your data
bin sizes will be in any given scenario.
Identifiers used in Data Generalization in Data Mining

Identifiers are data points about a subject that can determine their
identity and link to other personal information.
There are two main types of identifiers:
Direct identifiers are information that explicitly identifies a person, such
as a name, a social security number, or biometric data.
Quasi-identifiers, or indirect identifiers, are personal attributes that are
true of an individual but not necessarily unique to them. Examples are age or
date of birth, race, salary, educational attainment, occupation, marital
status, and zip code.
Direct and quasi identifier
[Figure: direct and quasi-identifiers]
When is Data Generalization Important?

Data generalization allows you to replace a data value with a less precise
one using a few different techniques. This preserves data utility while
protecting against some types of attacks that could lead to re-identification
of individuals or reveal private information.
Data generalization is also the process of creating a broader categorization
of data in a database, essentially 'zooming out' from the data to create a
more general picture of the trends or insights it provides.
Data Generalization vs. Data Mining
Data generalization is the process of summarizing data by replacing
relatively low-level values with higher-level concepts.
In contrast, data mining involves investigating and analyzing large blocks
of data to uncover relevant patterns and trends.
Approaches to Data Generalization

There are two basic approaches to data generalization in data mining: the
data cube approach and attribute-oriented induction.
Data Cube Approach
❑ A data cube makes data easier to understand.
❑ It is very helpful when displaying data along dimensions that serve as
specific measures of business needs.
❑ Every cube dimension reflects a different aspect of the database, such as
daily, monthly, or yearly sales.
❑ The data in a data cube allows nearly all figures to be analysed for
virtually any or all customers, sales agents, products, and so on.
❑ As a result, a data cube can assist in identifying trends and analysing
performance.
Data Cube Approach

❑ It is also known as the OLAP (Online Analytical Processing) approach.
❑ In this method, the data cube is used to hold the computations and their
results.
❑ Roll-up and drill-down operations are performed on the data cube.
❑ Aggregate functions like count(), sum(), average(), and max() are commonly
used in these operations.
❑ The materialized views can then be used for decision-making, knowledge
discovery, and various other purposes.
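
The roll-up operation can be sketched with plain dictionaries: each base cell is keyed by its dimension values, and rolling up means aggregating over the dimensions that are dropped. The sales figures, the three dimensions (year, month, product), and the helper name roll_up are assumptions for illustration; a real OLAP engine would precompute and store such aggregates rather than recompute them per call.

```python
from collections import defaultdict

# Hypothetical base cells: (year, month, product) -> sales amount.
sales = {
    (2023, "Jan", "printer"): 120,
    (2023, "Jan", "computer"): 300,
    (2023, "Feb", "printer"): 80,
    (2023, "Feb", "computer"): 250,
    (2024, "Jan", "printer"): 100,
    (2024, "Jan", "computer"): 400,
}


def roll_up(cells, keep):
    """Sum the measure over every dimension whose index is not in `keep`."""
    totals = defaultdict(int)
    for key, amount in cells.items():
        totals[tuple(key[i] for i in keep)] += amount
    return dict(totals)


by_year_product = roll_up(sales, keep=(0, 2))  # drop the month dimension
by_year = roll_up(sales, keep=(0,))            # roll up further to year only

print(by_year_product)
print(by_year)
```

Drill-down is the reverse direction: moving from by_year back to the more detailed by_year_product or to the base cells.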
Attribute Oriented Induction
Attribute-Oriented Induction (AOI) is a database mining technique that
compresses the original data collection into a generalized relation, giving
concise and comprehensive information about huge datasets.
❑ Attribute-oriented induction is a query-oriented, generalization-based
approach to online data analysis.
❑ Generalization is performed based on the number of distinct values of each
attribute within the relevant data set.
❑ In contrast, the data cube approach performs offline aggregation before an
OLAP or data mining query is submitted for processing.
❑ AOI is not restricted to specific measures or to categorical data.
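
A rough sketch of the idea, following the way attribute-oriented induction is commonly described: attributes with many distinct values and no concept hierarchy are dropped, the remaining attributes are replaced by higher-level concepts, and identical generalized tuples are merged while their counts are accumulated. The student tuples, the city-to-state hierarchy, and the age cut-off are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical task-relevant tuples: (name, age, city).
students = [
    ("Asha", 19, "Bangalore"),
    ("Ravi", 21, "Mysore"),
    ("Meena", 20, "Bangalore"),
    ("John", 34, "Chennai"),
    ("Kiran", 36, "Chennai"),
]

# Assumed concept hierarchy for city -> state.
CITY_TO_STATE = {
    "Bangalore": "Karnataka",
    "Mysore": "Karnataka",
    "Chennai": "Tamil Nadu",
}


def generalize(row):
    """Drop `name` (many distinct values, no hierarchy) and climb the others."""
    _name, age, city = row
    age_group = "under 30" if age < 30 else "30 and over"
    return (age_group, CITY_TO_STATE[city])


# Identical generalized tuples are merged and their counts accumulated.
generalized_relation = Counter(generalize(r) for r in students)
for row, count in generalized_relation.items():
    print(row, "count =", count)
```

Five original tuples compress into two generalized tuples with counts, which is the generalized relation the text refers to.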
Examples of Data Generalization

Market Basket Analysis is one of the most well-known examples of data
generalization in data mining.
Market Basket Analysis is a method for analyzing the purchases made by
customers in a supermarket.
The idea is to identify the items that a customer buys together. What are
the chances that a person who buys bread will also buy butter? This analysis
helps in planning company offers and discounts. Data mining is used to carry
out this analysis.
Market Basket Analysis is commonly used in:
business reporting for sales or marketing,
management reporting,
business process management (BPM),
budgeting and forecasting,
financial reporting, and similar areas.
Analytical characterization

Analytical characterization is used to help identify weakly relevant or
irrelevant attributes. We can exclude these unwanted attributes when we
prepare the data for mining.
Analytical characterization is a very important activity in data mining for
the following reasons:
❑ OLAP tools are limited in their handling of complex objects.
❑ In the absence of automated generalization, we must explicitly tell the
system which attributes are irrelevant and must be removed, and likewise
which attributes are relevant and must be included in the class
characterization.
Analytical characterization

Example: Suppose we want to characterize a class or compare classes. What if
we are not sure which attributes to include in the class characterization or
class comparison? If we specify too many attributes, they can slow down the
overall data mining process. We can solve this problem with the help of
analytical characterization.
Attribute relevance analysis uses a measure to help identify irrelevant or
weakly relevant attributes that can be excluded from the concept description
process.
The incorporation of this processing step into class characterization or
comparison is referred to as analytical characterization or analytical
comparison.
Relevance Analysis

In web marketing, relevance analysis is a data-driven approach to determining
what potential customers want from a business on the web, how content and the
web site should be structured, and what terms and phrases should be used to
appeal to the target market (essentially very detailed digital marketing).
In concept description, the term refers to attribute relevance analysis,
which is carried out in the following steps.

Analysis of Attribute Relevance:
1. Data Collection:
Collect the data for both the target class and the contrasting class by
query processing.
2. Preliminary relevance analysis using conservative AOI:
This step identifies a set of dimensions and attributes on which the
selected relevance measure is to be applied.
The relation obtained by this application of Attribute-Oriented Induction is
called the candidate relation of the mining task.
Relevance Analysis
3. Remove irrelevant and weakly relevant attributes using the selected
relevance analysis measure:
❑ We evaluate each attribute in the candidate relation using the selected
relevance analysis measure.
❑ This step results in an initial target class working relation and an
initial contrasting class working relation.
4. Generate the concept description using AOI:
Perform Attribute-Oriented Induction using a less conservative set of
attribute generalization thresholds.
Relevance Measures

A quantitative relevance measure determines the classifying power of an
attribute within a set of data.
Some of the methods of quantitative relevance measure are:
❑ Information Gain (ID3)
❑ Gain Ratio (C4.5)
❑ Gini Index
❑ Contingency table statistics
❑ Uncertainty Coefficient
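
To make "classifying power" concrete, here is a small sketch of the first measure in the list, information gain: the entropy of the class label minus the expected entropy after splitting on an attribute. The four records, the attribute names, and the function names are hypothetical and only serve to show one attribute that separates the classes perfectly and one that does not separate them at all.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of `target` minus its expected entropy after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical records; "spender" is the class label.
rows = [
    {"age": "young", "city": "A", "spender": "big"},
    {"age": "young", "city": "B", "spender": "big"},
    {"age": "senior", "city": "A", "spender": "budget"},
    {"age": "senior", "city": "B", "spender": "budget"},
]
gain_age = information_gain(rows, "age", "spender")
gain_city = information_gain(rows, "city", "spender")
print(gain_age, gain_city)
```

Here age has the maximum possible gain of 1 bit and city has a gain of 0, so city would be a candidate for removal as an irrelevant attribute.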
Mining Class Comparisons

(Discriminating between different classes)
Class discrimination or comparison (hereafter referred to as class
comparison) mines descriptions that distinguish a target class from its
contrasting classes.
Note that the target and contrasting classes must be comparable in the sense
that they share similar dimensions and attributes.
Class Comparison Methods and Implementations

1. Data Collection: The set of relevant data in the database is collected by
query processing and is partitioned into a target class and one or more
contrasting classes.
2. Dimension relevance analysis: If there are many dimensions and analytical
comparison is desired, dimension relevance analysis should be performed on
these classes, and only the highly relevant dimensions are included in the
further analysis.
3. Synchronous Generalization: Generalization is performed on the target
class to the level controlled by a user- or expert-specified dimension
threshold, which results in a prime target class relation/cuboid. The
concepts in the contrasting class(es) are generalized to the same level as
those in the prime target class relation/cuboid, forming the prime
contrasting class relation/cuboid.
4. Presentation of the derived comparison:
The resulting class comparison description can be visualized in the form of
tables, graphs, and rules.
This presentation usually includes a "contrasting" measure (such as count%)
that reflects the comparison between the target and contrasting classes.
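
The count% measure mentioned in step 4 can be sketched as the share of a class's tuples that each generalized tuple covers. The two small relations below stand in for the prime target and prime contrasting relations after generalization; their contents and the helper name count_percent are assumptions for illustration.

```python
from collections import Counter

# Hypothetical prime relations after synchronous generalization:
# one generalized tuple (major, age_range) per original record.
target = [("Science", "21-25")] * 3 + [("Science", "26-30")] * 1
contrast = [("Science", "21-25")] * 1 + [("Science", "26-30")] * 3


def count_percent(relation):
    """Share of the class's tuples covered by each generalized tuple, in percent."""
    counts = Counter(relation)
    total = len(relation)
    return {row: 100.0 * n / total for row, n in counts.items()}


target_pct = count_percent(target)
contrast_pct = count_percent(contrast)
for row in sorted(set(target_pct) | set(contrast_pct)):
    print(row,
          f"target {target_pct.get(row, 0):.1f}%",
          f"contrast {contrast_pct.get(row, 0):.1f}%")
```

Putting the two percentages side by side for the same generalized tuple is what lets a reader see how the classes differ.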
Statistical Measures in Large databases
Several descriptive statistical measures can be mined from large databases
and used for knowledge discovery.
These measures are listed below:
❑ Measuring Central Tendency.
❑ Measuring the Dispersion of Data.
❑ Boxplot Analysis.
❑ Visualization of Boxplot Dispersion.
❑ Histogram Analysis.
❑ Quantile Plot.
❑ Quantile-Quantile Plot.
❑ Scatter Plot.
❑ Loess Curve.
Measuring Central Tendency:

Central tendency is a descriptive summary of a dataset through a single
value that reflects the centre of the data distribution. It is part of
descriptive statistics.
Central tendency is one of the most fundamental concepts in statistics.
Measuring Central Tendency:

Statistics is an important branch of mathematics that is widely used in a
variety of traditional disciplines like economics, commerce, research, and
surveys. In the present digital age, emerging technologies like data science
and machine learning have boomed.
These technologies are also centered around statistics. After all, statistics
is all about the collection, interpretation, and presentation of data.
Basically, statistics provides insights into the data.
The measures of central tendency are:
Mean
Median
Mode
Measuring Central Tendency
Mean
The "average" value is termed the mean of the dataset. It is very easy to
calculate.
Steps to calculate the mean:
Step 1. Count the number of data values. Let it be n.
Step 2. Add all the data values. Let the sum be s.
Step 3. Mean = sum of all data values (s) / total number of data values (n)

Median
The middle value of the sorted dataset is called the median. Consider a
dataset comprising n elements.
▪ It is a holistic measure of data.
▪ With the data arranged in order, it is the middlemost value.
▪ If n is odd, the middle value is the median.
▪ If n is even, the median is the average of the two middle values.
Measuring Central Tendency
Mode
The mode is the value that occurs most frequently in the data.
If there is only one mode in the data, it is unimodal data.
If there are two modes in the data, it is bimodal data.
If there are three modes in the data, it is trimodal data.
For moderately skewed unimodal data, the empirical relation is
mean − mode = 3 × (mean − median).

Example 1. Consider the weights (in kg) of 5 children: 36, 40, 32, 42, 30.
Let's compute the mean, median, and mode.
Solution:
Mean = (36 + 40 + 32 + 42 + 30)/5 = 180/5 = 36 kg
Median: Arrange the data in ascending order: 30, 32, 36, 40, 42. The middle
value is 36, so median = 36 kg.
Mode: Every value occurs exactly once, so this dataset has no clear mode.
In this example, the mean and the median are the same.
Measuring Central Tendency

Example 2. Consider the ages of five employees: 30, 30, 32, 38, 60 years.
Calculate the measures of central tendency.
Solution:
Mean = (30 + 30 + 32 + 38 + 60)/5 = 190/5 = 38 years
Median: Arrange the data in ascending order: 30, 30, 32, 38, 60. The
middlemost value is 32, so median = 32 years.
Mode: 30 years occurs the most number of times, so mode = 30 years.
In this example, the mean, median, and mode have different values.
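
The three measures for the ages in Example 2 can be checked with Python's standard statistics module. This is only a verification sketch; the variable names are chosen for illustration, and multimode requires Python 3.8 or later.

```python
import statistics

ages = [30, 30, 32, 38, 60]  # data from Example 2

mean_age = statistics.mean(ages)      # sum of values / number of values
median_age = statistics.median(ages)  # middle value of the sorted data
modes = statistics.multimode(ages)    # list of all most-frequent values

print(mean_age, median_age, modes)

# With an even number of values, the median is the average of the two middle ones.
even_median = statistics.median([30, 32, 36, 40])
print(even_median)
```

multimode returns a list, so bimodal or trimodal data would simply yield two or three entries.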
Measuring Central Tendency

Example 3. Five students A, B, C, D, E appeared in a test and scored 80, 95,
90, 85, and 100 marks respectively. Find the mean.
Solution:
Total number of students = 5
Sum of marks = 80 + 95 + 90 + 85 + 100 = 450
Mean = sum of marks / total number of students = 450/5 = 90 marks
Distributions and Mean: The mean is highly affected by extreme values in the
dataset. If the distribution is symmetric, the mean is located exactly at the
center. In skewed distributions, however, the mean is pulled away from the
center.
Case 1: Symmetric distribution
Consider a symmetric distribution. Assume the monthly salaries of employees
in an organization are 30k, 40k, 35k, 32k, 38k rupees.
Mean = (30 + 40 + 35 + 32 + 38)/5 = 175/5 = 35k rupees
Median: Sort the data in ascending order: 30k, 32k, 35k, 38k, 40k.
Since the middlemost value in the sorted dataset is 35k, the median salary =
35k rupees.
There is no clear mode, as every data value occurs the same number of times.
Symmetric distribution
[Figure: symmetric distribution]
Mean = median = mode in a symmetric distribution.
Case 2: Skewed distribution

In a skewed distribution, where one value is exceptionally different from
the other values, the mean changes drastically.
[Figure: right-skewed distribution]
Mean > median in a right-skewed distribution.
MEAN MEDIAN

Continuing the salary example from Case 1, assume that one employee's salary
changes from 38k per month to 88k per month. This is a case of right skew, as
the data value has shifted towards the right.
We therefore expect the mean to be greater than the median.
Let us compute the new values of the mean and median.
The new dataset has values 30, 40, 35, 32, 88.
Mean = (30 + 40 + 35 + 32 + 88)/5 = 225/5 = 45k rupees
Median: Sort the data in ascending order: 30k, 32k, 35k, 40k, 88k.
Since the middlemost value in the sorted dataset is 35k, the median salary =
35k rupees.
Thus the mean changed from 35k to 45k rupees, but the median is still 35k
rupees.
The mean is evidently very sensitive to extreme values in the data, whereas
the median is relatively stable.
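
The two salary datasets (30, 40, 35, 32, 38 and 30, 40, 35, 32, 88, in thousands of rupees) can be compared directly to see how one extreme value moves the mean but not the median. This is a small check using the standard statistics module; the variable names are illustrative.

```python
import statistics

salaries = [30, 40, 35, 32, 38]  # Case 1, in thousands of rupees
skewed = [30, 40, 35, 32, 88]    # Case 2: one salary raised to 88k

for label, data in (("symmetric", salaries), ("right-skewed", skewed)):
    print(label, "mean =", statistics.mean(data), "median =", statistics.median(data))

mean_shift = statistics.mean(skewed) - statistics.mean(salaries)
median_shift = statistics.median(skewed) - statistics.median(salaries)
print("mean shift:", mean_shift, "median shift:", median_shift)
```

The mean moves by 10k while the median does not move at all, which is why the median is often preferred for skewed data such as salaries.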
Measures of Dispersion

Dispersion is the state of being dispersed or spread. Statistical dispersion
is the extent to which numerical data is likely to vary about an average
value.
In other words, dispersion helps us understand the distribution of the data.
Measures of Dispersion

▪ A measure of dispersion shows the homogeneity or heterogeneity of the
distribution of the observations.
▪ In statistics, measures of dispersion help to interpret the variability of
data, i.e. to know how homogeneous or heterogeneous the data is.
▪ In simple terms, it shows how squeezed or scattered the values of the
variable are.
▪ It indicates how much the data values vary from one another and gives a
clear idea of the distribution of the data.
Types of Measures of Dispersion
There are two main types of dispersion measures in statistics:
Absolute Measure of Dispersion
Relative Measure of Dispersion

1. Absolute Measure of Dispersion: These measures are expressed in the same
units as the data. Some express the scattering of observations in terms of
distances, e.g. range and quartile deviation. Others express the variation in
terms of the average of deviations of observations, e.g. mean deviation and
standard deviation.
The types of absolute measures of dispersion are:
1. Range: It is simply the difference between the maximum value and the
minimum value in a data set.
Range = X max – X min
Example: 1, 3, 5, 6, 7 => Range = 7 – 1 = 6
RANGE

Merits of Range
It is the simplest measure of dispersion
Easy to calculate
Easy to understand
Independent of change of origin

Demerits of Range
It is based on two extreme observations, and hence is affected by
fluctuations
It is not a reliable measure of dispersion
Dependent on change of scale
Measures of Dispersion

2. Variance: Subtract the mean from each value in the data set, square each
difference, add the squares, and finally divide the sum by the total number
of values in the data set. The result is the variance:
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$

3. Standard Deviation: The square root of the variance is known as the
standard deviation, i.e. S.D. $= \sigma = \sqrt{\sigma^2}$.

Merits of Standard Deviation
Squaring the deviations overcomes the drawback of ignoring signs in the mean
deviation
Suitable for further mathematical treatment
Least affected by sampling fluctuations
The standard deviation is zero if all the observations are equal
Independent of change of origin
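
The variance steps can be followed literally in code. The sketch below applies them to the data set from the range example (1, 3, 5, 6, 7) and also checks two of the listed properties: a change of origin leaves the variance unchanged, while a change of scale does not. The function name and the shift and scale constants are chosen for illustration.

```python
import math


def population_variance(values):
    """sigma^2 = sum((x - mu)^2) / N, following the steps in the text."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


data = [1, 3, 5, 6, 7]  # the data set used in the range example
variance = population_variance(data)
std_dev = math.sqrt(variance)
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.3f}")

# Adding a constant (change of origin) leaves the variance unchanged;
# multiplying by a constant (change of scale) does not.
shifted = population_variance([x + 10 for x in data])
scaled = population_variance([x * 2 for x in data])
print(f"shifted = {shifted:.2f}, scaled = {scaled:.2f}")
```

Doubling every value multiplies the variance by four and the standard deviation by two.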
Measures of Dispersion

Demerits of Standard Deviation
Not easy to calculate
Difficult for a layman to understand
Dependent on change of scale

4. Quartiles and Quartile Deviation: The quartiles divide a data set into
quarters. The first quartile (Q1) is the middle number between the smallest
number and the median of the data. The second quartile (Q2) is the median of
the data set. The third quartile (Q3) is the middle number between the median
and the largest number.
Quartile deviation, or semi-interquartile range, is
Q.D. = ½ × (Q3 – Q1)
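
One way to read the quartile definitions is the median-of-halves convention: Q1 is the median of the values below the median position and Q3 is the median of the values above it. The sketch below uses that convention on made-up data; other quartile conventions exist and can give slightly different values, so treat the helper as an assumption rather than the only method.

```python
import statistics


def quartiles(values):
    """Q1, Q2, Q3 using the median-of-halves convention (needs 2+ values)."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]                    # values below the median position
    upper = data[half + len(data) % 2:]    # values above the median position
    return statistics.median(lower), statistics.median(data), statistics.median(upper)


data = [1, 3, 5, 6, 7, 9, 12]  # hypothetical values
q1, q2, q3 = quartiles(data)
quartile_deviation = (q3 - q1) / 2
print(q1, q2, q3, quartile_deviation)
```

For this data Q1 = 3, Q2 = 6, and Q3 = 9, so the quartile deviation is half of 9 − 3.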
Measures of Dispersion

Merits of Quartile Deviation
It overcomes the main drawback of the range, since it does not depend on the
two extreme values
It uses the middle half of the data
Independent of change of origin
The best measure of dispersion for open-end classification
Demerits of Quartile Deviation
It ignores 50% of the data
Dependent on change of scale
Not a reliable measure of dispersion

Characteristics of a Good Measure of Dispersion
It should be rigidly defined
It must be easy to calculate and understand
It should not be affected much by fluctuations of observations
It should be based on all observations
2. Relative Measure of Dispersion:

We use relative measures of dispersion to compare the distributions of two
or more data sets and for unit-free comparison.
They are:
❑ the coefficient of range,
❑ the coefficient of mean deviation,
❑ the coefficient of quartile deviation,
❑ the coefficient of variation, and
❑ the coefficient of standard deviation
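
Two of these coefficients are sketched below using their usual textbook definitions, which the notes do not spell out: coefficient of range = (max − min)/(max + min), and coefficient of variation = (standard deviation / mean) × 100. The example expresses the Case 1 salaries once in thousands of rupees and once in rupees to show that the results do not depend on the unit.

```python
import math


def coefficient_of_range(values):
    """(max - min) / (max + min); a unit-free version of the range."""
    return (max(values) - min(values)) / (max(values) + min(values))


def coefficient_of_variation(values):
    """Population standard deviation as a percentage of the mean."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return 100.0 * sigma / mu


salaries_k = [30, 40, 35, 32, 38]                 # thousands of rupees
salaries_rupees = [x * 1000 for x in salaries_k]  # same data, different unit

cr_k = coefficient_of_range(salaries_k)
cr_rupees = coefficient_of_range(salaries_rupees)
cv_k = coefficient_of_variation(salaries_k)
cv_rupees = coefficient_of_variation(salaries_rupees)
print(cr_k, cr_rupees)
print(cv_k, cv_rupees)
```

The absolute range changes from 10 to 10000 when the unit changes, but both coefficients stay the same, which is what makes them suitable for comparing different data sets.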
Graph Displays of Basic Statistical Class Description:
Introduction
Data are measurements or observations that are collected as a source of
information. There are a variety of different types of data, and different
ways to represent data.

A data unit is one entity (such as a person or business) in the population
being studied, about which data are collected. A data unit is also referred
to as a unit record or record.

A data item is a characteristic (or attribute) of a data unit which is
measured or counted, such as height, country of birth, or income. A data item
is also referred to as a variable because the characteristic may vary between
data units, and may vary over time.

An observation is an occurrence of a specific data item that is recorded
about a data unit.

A dataset is a complete collection of all observations.

QUANTITATIVE & QUALITATIVE MEASURES

Quantitative data are measures of values or counts and are expressed as
numbers.
Qualitative data are measures of 'types' and may be represented by a name,
symbol, or a number code.
Frequency counts:
The number of times an observation occurs (frequency) for a data item
(variable) can be shown for both quantitative and qualitative data.
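
Frequency counts work the same way for both kinds of data, as this short sketch with collections.Counter shows. The two lists of observations are invented for illustration.

```python
from collections import Counter

# Hypothetical observations for one quantitative and one qualitative data item.
children_per_family = [0, 2, 1, 2, 3, 2, 1, 0, 2]
country_of_birth = ["India", "Nepal", "India", "India", "Sri Lanka", "Nepal"]

quantitative_freq = Counter(children_per_family)
qualitative_freq = Counter(country_of_birth)

for value in sorted(quantitative_freq):
    print(value, quantitative_freq[value])
for name, freq in qualitative_freq.most_common():
    print(name, freq)
```

Quantitative values have a natural order, so they are printed sorted by value; qualitative categories have none, so they are printed by descending frequency.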
The graphs below arrange the quantitative
and qualitative data to show
[Figure: quantitative data]
[Figure: qualitative data]
Graphing:
Representing Data: Graphics don't just report data, they show trends and
patterns. The graphic used is determined by the type of data collected.
Common graphics: pie charts, bar graphs, histograms, scatterplots.
Graphing:
Is an important way of visually representing data
Provides a significant amount of information
Moves from reporting data to showing trends and patterns
Makes relationships easier to identify than in a table
Pie Chart: The area of each slice is proportional to the frequency

Pie charts are used in data handling. They are circular charts divided into
segments, each of which represents a value.
The sections (or 'slices') differ in size to represent values of different
sizes.
[Figure: pie chart of a class]
For example, in this pie chart, the circle represents a whole class.
Histogram

A histogram is a graphical representation of data points organized into
user-specified ranges. Similar in appearance to a bar graph, the histogram
condenses a data series into an easily interpreted visual by taking many data
points and grouping them into logical ranges or bins.
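
The grouping step behind a histogram can be sketched without any plotting library: each value is assigned to the bin whose range contains it, and the bin counts are then drawn as bars. The marks, the bin width of 10, and the starting point of 40 are user-specified assumptions for this illustration.

```python
from collections import Counter


def histogram(values, bin_width, start=0):
    """Count values per bin [lo, lo + bin_width); returns {lo: count} in bin order."""
    counts = Counter(start + ((v - start) // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


marks = [42, 55, 58, 61, 67, 68, 73, 79, 85, 91]  # hypothetical test marks
bins = histogram(marks, bin_width=10, start=40)
for lo, count in bins.items():
    print(f"{lo}-{lo + 9}: {'#' * count}")
```

Changing bin_width changes how much detail the histogram shows: wider bins give a smoother, coarser picture, and narrower bins a more detailed one.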
Scatterplot:

Shows the association between two numerical variables
Data are plotted as Cartesian (X, Y) coordinates
Suggests relationships between variables
[Figure: scatterplot of car price against the age of the car]
Correlation Between Variables:
Correlation is the relationship between two variables.
Correlation is positive when the values increase together.
Correlation is negative when one value decreases as the other increases.
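One common way to put a number on correlation is the Pearson correlation coefficient, which ranges from -1 to +1. The slides do not name a specific measure, so using Pearson here is an assumption, and the car data below is made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical data: car age in years and price in lakhs.
car_age = [1, 2, 3, 4, 5]
car_price = [9.0, 8.1, 6.9, 6.2, 5.0]

r_price = pearson_r(car_age, car_price)        # negative: price falls with age
r_perfect = pearson_r([1, 2, 3], [2, 4, 6])    # values rise together exactly
print(round(r_price, 3), round(r_perfect, 3))
```

A value near +1 means the values increase together, a value near -1 means one decreases as the other increases, and a value near 0 means no linear relationship.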
Association rule mining:
Finding frequent patterns, associations, correlations, or causal
structures among sets of items or objects in transaction databases,
relational databases, and other information repositories.
Applications:
Market Basket Data Analysis
Cross-Marketing
Catalog Design
Loss-Leader Analysis
Clustering
Classification
For example, if a customer buys bread, he is also likely to buy butter,
eggs, or milk, so these products are placed on the same shelf or close
to each other. Consider the diagram below:
How does Association Rule Learning work?
Association rule learning works on the concept of an if-then statement,
such as if A then B.
Here the "if" element is called the antecedent, and the "then" element is
called the consequent. A relationship of this type, where we find an
association between two single items, is known as single cardinality.
It is all about creating rules, and as the number of items increases,
the cardinality increases accordingly. So, to measure the associations
between thousands of data items, there are several metrics.
These metrics are given below:
Support
Confidence
Lift
Let's understand each of them:
Support
Support is how frequently an itemset appears in the dataset. It is defined as
the fraction of the transactions T that contain the itemset X. For an itemset X
and a set of transactions T, it can be written as:
Support (X) = (Number of transactions containing X) / (Total number of transactions in T)
Confidence
Confidence indicates how often the rule has been found to be true, that is, how
often the items X and Y occur together in the dataset given that X occurs. It is
the ratio of the number of transactions that contain both X and Y to the number
of transactions that contain X:
Confidence (X → Y) = (Number of transactions containing X and Y) / (Number of transactions containing X)
Lift
Lift is the strength of a rule, which can be defined by the formula below:
Lift (X → Y) = Support (X and Y) / (Support (X) × Support (Y))
It is the ratio of the observed support to the support expected if X and
Y were independent of each other. It has three possible ranges of values:
Lift = 1: The occurrences of the antecedent and the consequent are
independent of each other.
Lift > 1: The two itemsets are positively dependent on each other; the larger
the lift, the stronger the dependence.
Lift < 1: One item is a substitute for the other, which means the presence of
one item has a negative effect on the presence of the other.
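The three metrics can be sketched directly from their definitions. This is a minimal illustration rather than a library API: the function names and the four sample baskets are made up, and lift is computed as confidence divided by the support of the consequent, which is the same as observed support divided by the support expected under independence.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Of the transactions that contain X, the fraction that also contain Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


def lift(transactions, x, y):
    """Confidence of X -> Y divided by the support of Y."""
    return confidence(transactions, x, y) / support(transactions, y)


# Hypothetical baskets, made up for illustration.
baskets = [
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"eggs"},
]

s = support(baskets, {"bread"})                # 3 of 4 baskets
c = confidence(baskets, {"bread"}, {"butter"})  # 2 of the 3 bread baskets
l = lift(baskets, {"bread"}, {"butter"})
print(s, round(c, 3), round(l, 3))
```

Here the lift is above 1, so in this toy data bread and butter appear together more often than independence would predict.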
What is Apriori Algorithm?
The Apriori algorithm is used for mining frequent itemsets (sets of products)
and the association rules derived from them.
Generally, the Apriori algorithm operates on a database containing a huge
number of transactions, for example, the items customers buy at a Big Bazar.
The rules it finds help the store place related products together, so that
customers can buy their products with ease and the store can increase its sales.
Components of Apriori algorithm
The following three components comprise the Apriori algorithm.
Support
Confidence
Lift
Let's take an example to understand this
concept.
As discussed above, you need a huge database containing a large number of
transactions.
Suppose you have 4000 customer transactions in a Big Bazar. You have to
calculate the Support, Confidence, and Lift for two products, say Biscuits
and Chocolates.
This is because customers frequently buy these two items together.
Out of 4000 transactions, 400 contain Biscuits and 600 contain Chocolates,
and 200 transactions contain both Biscuits and Chocolates. Using this data,
we will find the support, confidence, and lift.
Support
Support refers to the default popularity of a product. You find the
support by dividing the number of transactions containing that product
by the total number of transactions. Hence, we get
Support (Biscuits) = (Transactions containing biscuits) / (Total transactions)
= 400/4000 = 10 percent.
Confidence
Confidence refers to the likelihood that customers who bought biscuits
also bought chocolates. So, you need to divide the number of
transactions that contain both biscuits and chocolates by the number of
transactions that contain biscuits to get the confidence.
Confidence (Biscuits → Chocolates) = (Transactions containing both biscuits and chocolates) / (Transactions containing biscuits)
= 200/400
= 50 percent.
It means that 50 percent of the customers who bought biscuits also bought
chocolates.
Lift
Consider the above example; lift refers to the increase in the ratio of
the sale of chocolates when you sell biscuits. The mathematical
equation of lift is given below.
Support (Chocolates) = 600/4000 = 15 percent.
Lift = Confidence (Biscuits → Chocolates) / Support (Chocolates)
= 50/15 ≈ 3.33
It means that customers who buy biscuits are about 3.3 times as likely to
buy chocolates as customers in general.
If the lift value is below one, customers are unlikely to buy both items
together. The larger the value, the better the combination.
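The Biscuits and Chocolates arithmetic can be checked with a short Python sketch. It uses the counts stated in the example and takes lift to be the observed support of the pair divided by the support expected if the two products were independent, which equals Confidence (Biscuits → Chocolates) divided by Support (Chocolates). The variable names are illustrative.

```python
# Counts taken from the Big Bazar example.
total = 4000        # all transactions
biscuits = 400      # transactions containing biscuits
chocolates = 600    # transactions containing chocolates
both = 200          # transactions containing both products

support_biscuits = biscuits / total
support_chocolates = chocolates / total
support_both = both / total

# Confidence of the rule Biscuits -> Chocolates.
confidence_b_to_c = both / biscuits

# Lift, computed in two equivalent ways.
lift_b_to_c = confidence_b_to_c / support_chocolates
lift_check = support_both / (support_biscuits * support_chocolates)

print(f"support(biscuits) = {support_biscuits:.0%}")
print(f"confidence        = {confidence_b_to_c:.0%}")
print(f"lift              = {lift_b_to_c:.2f}")
```

Both ways of computing lift agree, which is a quick check that the support and confidence values were combined correctly.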
How does the Apriori Algorithm work in Data
Mining?
We will understand this algorithm with the help of an example.
Consider a Big Bazar scenario where the product set is P = {Rice, Pulse, Oil,
Milk, Apple}. The database comprises six transactions, where 1 represents
the presence of the product and 0 represents its absence.
The Apriori Algorithm makes the given
assumptions
All subsets of a frequent itemset must be frequent.
All supersets of an infrequent itemset must be infrequent.
Fix a threshold support level. In our case, we have fixed it at 50 percent.
Step 1
Make a frequency table of all the products that appear in all the transactions.
Now, shorten the frequency table to include only those products with a
support level of at least 50 percent. We find the given frequency table.
The table below indicates the products frequently bought by the customers.
Step 2
Create pairs of products such as RP, RO, RM, PO, PM, OM.
You will get the given frequency table

Itemset    Frequency (Number of transactions)
RP         4
RO         3
RM         2
PO         4
PM         3
OM         2
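Counting the pairs can be sketched in Python. The six transactions below are hypothetical: they were constructed for this sketch so that the pair counts match the table above, and they are not claimed to be the original transaction table. R, P, O, M, and A stand for Rice, Pulse, Oil, Milk, and Apple.

```python
from itertools import combinations

# Hypothetical transactions, chosen so the pair counts match the table.
transactions = [
    {"R", "P", "O"},
    {"P", "O", "M"},
    {"M", "A"},
    {"R", "P", "M"},
    {"R", "P", "O"},
    {"R", "P", "O", "M", "A"},
]

# Products kept after Step 1 (Apple is assumed to have been dropped).
frequent_items = ["R", "P", "O", "M"]

# Count, for every pair, the transactions that contain both products.
pair_counts = {}
for pair in combinations(frequent_items, 2):
    label = "".join(pair)
    pair_counts[label] = sum(1 for t in transactions if set(pair) <= t)

for label, count in pair_counts.items():
    print(label, count)
```

`combinations` yields each unordered pair once, so RP and PR are not counted separately.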
Step 3
Apply the same threshold support of 50 percent and keep the pairs that
meet it. In our case, that means a frequency of at least 3 (50 percent of 6
transactions).
Thus, we get RP, RO, PO, and PM.
Step 4
Now, look for sets of three products that the customers buy together. We get
the given combinations.
RP and RO give RPO
PO and PM give POM
Step 5
Calculate the frequency of the two itemsets, and you will get the given frequency
table.

Itemset    Frequency (Number of transactions)
RPO        3
POM        2

A set of three products can never be more frequent than any pair it contains,
so RPO occurs in at most 3 transactions (the frequency of RO) and POM in at
most 2 (the frequency of OM).
If you apply the threshold, you can figure out that the customers' frequent
set of three products is RPO.
We have considered an easy example to discuss the Apriori algorithm in data
mining. In reality, you find thousands of such combinations.
In the above example, you can see that the RPO combination was the
frequent itemset. Now, we find all the rules using RPO.
RP → O, RO → P, PO → R, O → RP, P → RO, R → PO
You can see that there are six different combinations. In general, if an itemset
has n elements, there will be 2^n - 2 candidate association rules.
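The rule count can be checked by enumerating every way to split an itemset into a non-empty antecedent and a non-empty consequent. This is a minimal sketch; the function name `candidate_rules` is made up for illustration.

```python
from itertools import combinations


def candidate_rules(itemset):
    """List every (antecedent, consequent) split of `itemset`.

    Both sides must be non-empty, which gives 2**n - 2 rules for n items.
    """
    items = sorted(itemset)
    rules = []
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules


rules = candidate_rules({"R", "P", "O"})
for antecedent, consequent in rules:
    print("".join(antecedent), "->", "".join(consequent))
```

The two excluded subsets are the empty set and the full itemset, since a rule needs something on both sides.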

Advantages of Apriori Algorithm
It can be used to find large itemsets.
It is simple to understand and apply.

Disadvantages of Apriori Algorithm
Finding the support is expensive, since each calculation has to pass
through the whole database.
Sometimes a huge number of candidate rules is generated, which makes the
algorithm computationally more expensive.
APRIORI
Apriori is an algorithm for frequent itemset mining and association
rule learning over relational databases.
It proceeds by identifying the frequent individual items in the
database and extending them to larger and larger itemsets as long as
those itemsets appear sufficiently often in the database.
• Join Step: Ck is generated by joining Lk-1 with itself
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a
subset of a frequent k-itemset
Pseudo‐code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do
begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
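The pseudo-code can be turned into a small runnable Python sketch. Several things here are assumptions made for illustration: the function name `apriori`, the use of an absolute count `min_count` in place of a fractional min_support, a simple union-based join rather than the classical prefix-based join, and the six hypothetical transactions (R, P, O, M, A for Rice, Pulse, Oil, Milk, Apple), which are not claimed to be the original table.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for itemsets found in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count_frequent(candidates):
        # One pass over the database: count each candidate, keep the frequent ones.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_count}

    singles = {frozenset([item]) for t in transactions for item in t}
    level = count_frequent(singles)  # L1
    frequent = dict(level)
    k = 1
    while level:
        prev = set(level)
        # Join step: unite frequent k-itemsets that together form a (k+1)-itemset.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k))
        }
        level = count_frequent(candidates)  # L(k+1)
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical transactions, made up for illustration.
transactions = [
    {"R", "P", "O"},
    {"P", "O", "M"},
    {"M", "A"},
    {"R", "P", "M"},
    {"R", "P", "O"},
    {"R", "P", "O", "M", "A"},
]

result = apriori(transactions, min_count=3)  # 3 of 6 transactions = 50 percent
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), n)
```

In this data the prune step discards POM before any counting, because its subset OM is not frequent; that saving is the point of the Apriori property.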