APEX INSTITUTE OF TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
Data Mining and Warehousing (22CSH-380)
Faculty: Dr. Preeti Khera (E16576)
Lecture – 2.2.1, 2.2.2 & 2.2.3
Measuring Central Tendency, Measuring Dispersion of Data, Graph Displays of Basic
Statistical Class Description
DISCOVER . LEARN . EMPOWER
June 4, 2025 1
Data Mining and Warehousing : Course Objectives
COURSE OBJECTIVES
The Course aims to:
1. Develop an understanding of the key concepts of data mining and learn how to
extract useful characteristics from data using data pre-processing techniques.
2. Demonstrate methods to apply and analyze relevant attributes, compute statistical
measures to look for meaningful variation in data, and mine association rules for
transactional datasets.
3. Teach the use and application of data mining techniques such as classification, decision
trees, neural networks, backpropagation and many more, in various applications.
COURSE OUTCOMES
On completion of this course, the students shall be able to:-
CO1: Understand the concept of data mining and the usage of various tools for data warehousing and data mining.
CO2: Demonstrate the strengths and weaknesses of different methods of meaningful data mining.
CO3: Apply association rule, classification, and clustering algorithms for large data sets.
CO4: Evaluate and employ correct data mining techniques depending on characteristics of the dataset.
CO5: Verify and formulate the performance of various data mining techniques according to the dataset.
Unit-2 Syllabus
Unit-2
Concept Description: Definition, Data Generalization, Analytical Characterization,
Analysis of attribute relevance, Mining Class comparisons, Statistical measures in large
Databases. Measuring Central Tendency, Measuring Dispersion of Data, Graph Displays
of Basic Statistical class Description, Mining Association Rules in Large Databases,
Association rule mining, mining Single-Dimensional Boolean Association rules from
Transactional Databases – Apriori Algorithm, Mining Multilevel Association rules from
Transaction Databases and Mining Multi-Dimensional Association rules from Relational
Databases.
Table of Contents
• Measuring Central Tendency
• Measuring Dispersion of Data
• Graph Displays of Basic Statistical class Description
Mining Data Dispersion Characteristics
• Motivation
• To better understand the data: central tendency, variation and
spread
• Data dispersion characteristics
• median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
• Data dispersion: analyzed with multiple granularities of precision
• Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
• Folding measures into numerical dimensions
• Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
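• Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
• Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
• Median: A holistic measure
• Middle value if the number of values is odd, otherwise the average of the middle two values
• For grouped data, estimated by interpolation: $\text{median} = L_1 + \left( \frac{n/2 - (\sum f)_l}{f_{\text{median}}} \right) c$
• Here $L_1$ is the lower boundary of the median interval, $n$ the number of values, $(\sum f)_l$ the sum of the frequencies of all intervals below the median interval, $f_{\text{median}}$ the frequency of the median interval, and $c$ the interval width
• Mode
• Value that occurs most frequently in the data
• Unimodal, bimodal, trimodal
• Empirical formula (moderately skewed unimodal data): $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$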
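The measures above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the sample values and the helper names `weighted_mean`, `modes`, and `grouped_median` are hypothetical, and `grouped_median` assumes equal-width intervals.

```python
from collections import Counter
from statistics import mean, median


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def modes(values):
    """Return every value that attains the highest frequency, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def grouped_median(lower, width, freqs):
    """Interpolated median for grouped data (assumes equal-width intervals).

    lower: lower boundary of the first interval
    width: interval width c
    freqs: frequency of each interval, in increasing order of value
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("freqs must contain at least one observation")
    below = 0  # sum of frequencies of intervals below the current one
    for k, f in enumerate(freqs):
        if below + f >= n / 2:
            l1 = lower + k * width  # lower boundary of the median interval
            return l1 + (n / 2 - below) / f * width
        below += f


# Hypothetical sample data, chosen only for illustration.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(mean(data), median(data), modes(data))
print(weighted_mean([80, 90], [1, 3]))
print(grouped_median(0, 10, [2, 3, 5]))
```

Two values tie for the highest frequency in this sample, so `modes` returns both and the data is bimodal.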
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, M, Q3, max
• Boxplot: ends of the box are the quartiles, the median is marked,
whiskers extend from the box, and outliers are plotted individually
• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
• Variance and standard deviation
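• Variance $s^2$ (algebraic, scalable computation):
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
• Standard deviation $s$ is the square root of the variance $s^2$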
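The five-number summary and the 1.5 × IQR outlier rule can be sketched as follows. Quartile conventions differ between textbooks and tools; this sketch assumes the "inclusive" method of Python's `statistics.quantiles`, and the function names and sample data are hypothetical.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, med, q3, data[-1]


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


# Hypothetical sample data, chosen only for illustration.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(data))
print(iqr_outliers(data))
```

A different quartile convention may shift Q1 and Q3 slightly and therefore change which borderline values are flagged.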
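• Variance
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$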
• Standard deviation: the square root of the variance
• Measures spread about the mean
• It is zero if and only if all the values are equal
• Both the standard deviation and the variance are algebraic
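One way to read "algebraic" here: the variance can be computed from three distributive summaries (count, sum, sum of squares), which can be collected per data partition and then merged. The sketch below illustrates that idea under this reading; the function names and sample data are hypothetical.

```python
from math import sqrt


def variance_two_pass(values):
    """Definition form: average squared deviation from the mean, with n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def partial_sums(values):
    """Distributive summaries of one partition: (count, sum, sum of squares)."""
    return len(values), sum(values), sum(x * x for x in values)


def merge(a, b):
    """Combine the summaries of two partitions by element-wise addition."""
    return tuple(p + q for p, q in zip(a, b))


def variance_from_sums(n, total, total_sq):
    """Algebraic form: (sum of squares - (sum)^2 / n) / (n - 1)."""
    return (total_sq - total * total / n) / (n - 1)


# Hypothetical sample data, split into two partitions for illustration.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
merged = merge(partial_sums(data[:5]), partial_sums(data[5:]))
s2 = variance_from_sums(*merged)
print(s2, sqrt(s2), variance_two_pass(data))
```

With floating-point data whose mean is large relative to its spread, the sum-of-squares form can lose precision through cancellation, so the two-pass form is often the safer choice when all of the data fits in memory.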
Graph Displays of Basic Statistical class Description
Boxplot Analysis
• Data is represented with a box
• The ends of the box are at the first and
third quartiles, i.e., the height of the
box is the IQR
• The median is marked by a line within
the box
• Whiskers: two lines outside the box
extend to Minimum and Maximum
Histogram Analysis
• Graph displays of basic statistical class descriptions
• Frequency histograms
• A univariate graphical method
• Consists of a set of rectangles that reflect the counts or
frequencies of the classes present in the given data
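The counts behind a frequency histogram can be computed without a plotting library. The sketch below assumes equal-width classes spanning the data range, with the last class closed on the right; the function name `histogram_counts` and the sample data are hypothetical.

```python
def histogram_counts(values, num_bins):
    """Return (bin edges, counts) for equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("values must not all be equal")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in values:
        # The maximum value falls into the last bin instead of a new one.
        idx = min(int((x - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


# Hypothetical sample data, chosen only for illustration.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
edges, counts = histogram_counts(data, 4)
for left, right, count in zip(edges, edges[1:], counts):
    # One text "rectangle" per class; its length reflects the frequency.
    print(f"[{left:6.1f}, {right:6.1f}) {'#' * count}")
```

The number of bins is a free choice; too few hide structure and too many leave most classes nearly empty.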
Quantile Plots
• Displays all of the data (allowing the user to assess both the
overall behavior and unusual occurrences)
• Plots quantile information
• For data x_i sorted in increasing order, f_i indicates
that approximately 100 f_i% of the data are below or equal
to the value x_i
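The points of a quantile plot can be generated as below. The slides do not state how f_i is computed; this sketch assumes the common convention f_i = (i - 0.5) / n for the i-th smallest value, and the function name `quantile_plot_points` is hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (assumed convention)."""
    data = sorted(values)
    n = len(data)
    return [((i - 0.5) / n, x) for i, x in enumerate(data, start=1)]


# Hypothetical sample data, chosen only for illustration.
for f, x in quantile_plot_points([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]):
    print(f"{f:.3f}  {x}")
```

Plotting x against f gives the quantile plot; every observation appears, so both the overall shape and unusual values remain visible.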
Scatter Plots
• Provides a first look at bivariate data to see clusters of points, outliers
• Each pair of values is treated as a pair of coordinates and plotted as points in the
plane
• Used for determining a relationship, pattern, or trend between two numeric
attributes. Two attributes, X and Y, are correlated if one attribute implies the
other. Correlations can be positive, negative, or null (uncorrelated).
[Figure: scatter plots illustrating negative correlation and positive correlation]
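A scatter plot shows the direction of a relationship visually. One common way to quantify it is Pearson's correlation coefficient, which the slides do not name, so treat it here as an assumed choice that captures linear association only. The function name `pearson_r` and the sample data are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: near +1 positive, near -1 negative, near 0 none."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical (x, y) pairs, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))   # y rises with x
print(pearson_r(xs, [10, 8, 6, 4, 2]))   # y falls as x rises
print(pearson_r(xs, [1, 3, 2, 3, 1]))    # no linear trend
```

A coefficient near zero rules out only a linear trend, so the scatter plot is still worth inspecting for curved patterns or clusters.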
Graphic Displays of Basic Statistical Descriptions
• Histogram: A univariate graphical method that consists of a set of rectangles
that reflect the counts or frequencies of the classes present in the given data
• Boxplot: Data is represented with a box. The ends of the box are at the first
and third quartiles, i.e., the height of the box is the IQR
• Quantile plot: each value x_i is paired with f_i, indicating that approximately
100 f_i % of the data are ≤ x_i
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate
distribution against the corresponding quantiles of another
• Scatter plot: each pair of values is a pair of coordinates and plotted as points
in the plane
• Loess (local regression) curve: add a smooth curve to a scatter plot to provide
better perception of the pattern of dependence
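The q-q plot in the list above pairs matching quantiles of two samples, which may differ in size. The sketch below is a minimal illustration: it assumes the "inclusive" interpolation of Python's `statistics.quantiles`, and the function name `qq_points` and both samples are hypothetical.

```python
from statistics import quantiles


def qq_points(sample_a, sample_b, n=4):
    """Pair the n - 1 cut points of sample_a with those of sample_b."""
    qa = quantiles(sample_a, n=n, method="inclusive")
    qb = quantiles(sample_b, n=n, method="inclusive")
    return list(zip(qa, qb))


# Hypothetical samples of different sizes, chosen only for illustration.
a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50, 60, 70, 80, 90]
for qa, qb in qq_points(a, b):
    print(qa, qb)
```

If the plotted pairs fall close to a straight line, the two distributions have a similar shape; a larger `n` gives more points to judge this by.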
Summary
• Mining Data Dispersion Characteristics
• Measures of Central Tendency
• Measures of dispersion of data
• Graphic Displays of Basic Statistical Description
Assignment
• Explain Statistical Measures in Large Databases.
• Discuss the difference between a box plot and a scatter plot.
• Describe all the measures of data dispersion with definition and
formulas.
References
TEXT BOOKS
T1: Tan, Steinbach and Vipin Kumar. Introduction to Data Mining, Pearson Education, 2016.
T2: Zaki MJ, Meira Jr W, Meira W. Data mining and machine learning: Fundamental concepts and algorithms.
Cambridge University Press; 2020 Jan 30.
T3: King RS. Cluster analysis and data mining: An introduction. Mercury Learning and Information; 2015 May
12.
REFERENCE BOOKS
R1: Han, Kamber and Pei. Data Mining: Concepts and Techniques, Elsevier, 2011.
R2: Halgamuge SK, Wang L, editors. Classification and clustering for knowledge discovery. Springer Science
& Business Media; 2005 Sep 2.
R3: Bhatia P. Data mining and data warehousing: principles and practical techniques. Cambridge University
Press; 2019 Jun 27.
References
RESEARCH PAPER
Alasadi SA, Bhaya WS. Review of data preprocessing techniques in data mining. Journal of Engineering and Applied Sciences.
2017 Sep;12(16):4102-7.
Freitas AA. A survey of evolutionary algorithms for data mining and knowledge discovery. In: Advances in evolutionary
computing: theory and applications. 2003 Jan 1 (pp. 819-845). Berlin, Heidelberg: Springer Berlin Heidelberg.
Kumbhare TA, Chobe SV. An overview of association rule mining algorithms. International Journal of Computer Science and
Information Technologies. 2014 Feb;5(1):927-30.
Srivastava S. Weka: a tool for data preprocessing, classification, ensemble, clustering and association rule mining. International
Journal of Computer Applications. 2014 Jan 1;88(10).
Dol SM, Jawandhiya PM. Classification technique and its combination with clustering and association rule mining in
educational data mining—A survey. Engineering Applications of Artificial Intelligence. 2023 Jun 1; 122:106071.
THANK YOU
For queries
Email: preeti.e16576@[Link]