Data Mining: Summary Statistics

30-09-2020
DATA MINING:
Summary Statistics
Prof. Sherica Lavinia Menezes
Asst. Professor
Computer Engineering Department
Goa College of Engineering
SUMMARY STATISTICS
Summary statistics, such as the mean and standard deviation of a set of values, and visualization techniques, such as scatter plots and histograms, are the standard methods widely employed for data exploration.
Exploratory Data Analysis (EDA) is a field that deals with exploring the input data. It was pioneered in the 1970s by the statistician John Tukey. Summary statistics is a small topic within Data Mining that overlaps with EDA.
AGENDA
01 The IRIS dataset
02 Measures of Location: Mean and Median
03 Measures of Spread: Range and Variance
04 Multivariate Summary Statistics
LEARNING OBJECTIVES
01 Define different measures of exploring data
02 Compare the mean and median
03 Compare the range and variance
04 Compute the mean/median/range/variance of a given dataset
The IRIS Dataset
150 flowers, 50 each from 3 species: Iris Setosa, Iris Versicolor, Iris Virginica
5 attributes:
Sepal Length in cm
Sepal Width in cm
Petal Length in cm
Petal Width in cm
Class: {setosa, versicolor, virginica}
Frequency and Mode
Given a categorical attribute $x$ that can take the values $v_1, v_2, \ldots, v_i, \ldots, v_k$ and a set of $m$ objects, the frequency of a value $v_i$ is given as:
$$\mathrm{freq}(v_i) = \frac{\text{no. of objects with attribute value } v_i}{m}$$
The mode of the attribute is the value that has the highest frequency.
Examples of frequency and mode
Value                   Freq
Build wind float        70/214 = 0.33
Build wind non-float    76/214 = 0.36   (MODE)
Vehic wind float        17/214 = 0.08
Vehic wind non-float     0/214 = 0
Containers              13/214 = 0.06
Tableware                9/214 = 0.04
Headlamps               29/214 = 0.14
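As a minimal sketch of the frequency and mode definitions, the table can be reproduced from its raw counts. The variable names (`counts`, `freq`, `mode`) are illustrative choices, not part of the slides.

```python
from collections import Counter

# Class counts copied from the table above (214 objects in total).
counts = Counter({
    "Build wind float": 70,
    "Build wind non-float": 76,
    "Vehic wind float": 17,
    "Vehic wind non-float": 0,
    "Containers": 13,
    "Tableware": 9,
    "Headlamps": 29,
})

m = sum(counts.values())  # total number of objects

# freq(v_i) = (number of objects with attribute value v_i) / m
freq = {value: n / m for value, n in counts.items()}

# The mode is the value with the highest frequency.
mode = max(freq, key=freq.get)

for value, f in freq.items():
    print(f"{value:22s} {f:.2f}")
print("Mode:", mode)
```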
Frequency and Mode
For attributes whose values are evenly distributed, these measures are not useful.
For continuous data, a single value may not occur more than once, so these measures are not helpful.
The mode can indicate the presence of missing values in a categorical attribute, for example when a placeholder value turns out to be the most frequent one.
PERCENTILES
Can be used for ordered data.
Given an ordinal or continuous attribute $x$ and a number $p$ between 0 and 100, the $p$-th percentile $x_p$ is a value of $x$ such that $p\%$ of the observed values of $x$ are less than $x_p$.
Percentile   Sepal length   Sepal width
0            4.3            2.0
10           4.8            2.5
20           5.0            2.7
30           5.2            2.8
40           5.6            3.0
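How to compute Percentiles
1. Input: attribute x of m instances and percentile p.
2. Sort the dataset in ascending order.
3. Compute Index = m * p, with p written as a fraction (for example 0.80 for the 80th percentile).
4. Is Index a whole number?
   N: round Index up and take $x_{p\%} = x[\mathrm{index}]$.
   Y: take $x_{p\%} = (x[\mathrm{index}] + x[\mathrm{index} + 1]) / 2$.
Positions in the sorted dataset are counted from 1.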
Example of Percentiles
Dataset: {85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 94, 74, 65, 56, 97}
Sorted: {34, 42, 51, 56, 65, 69, 74, 78, 84, 85, 85, 86, 87, 94, 97}
m = 15; find the 80th percentile.
Index = 15 * 0.80 = 12, a whole number, so average the 12th and 13th sorted values:
$$x_{80\%} = \frac{86 + 87}{2} = 86.5$$
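The percentile procedure and the worked example can be sketched in Python as follows. This is only one reading of the slide's rule: the function name `percentile` and the restriction to 0 < p < 100 are assumptions made here, and statistical libraries often use different interpolation rules that give slightly different values.

```python
import math

def percentile(values, p):
    """Return the p-th percentile (0 < p < 100) using the slide's index rule."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    xs = sorted(values)                  # sort in ascending order
    m = len(xs)
    index = m * p / 100                  # 1-based position, p used as a fraction
    if index == int(index):              # whole number: average two neighbours
        i = int(index)
        return (xs[i - 1] + xs[i]) / 2   # x[index] and x[index + 1], 0-based shift
    return xs[math.ceil(index) - 1]      # otherwise round up and take x[index]

scores = [85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 94, 74, 65, 56, 97]
print(percentile(scores, 80))  # 86.5, matching the worked example
```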
Measures of Location: MEAN and MEDIAN
Mainly for continuous data.
Measure the location of a set of values in a given attribute.
Let $\{x_1, x_2, \ldots, x_m\}$ be the values of attribute $x$ for the given $m$ objects, where $x_1 \le x_2 \le \cdots \le x_m$, so that $x_1 = \min(x)$ and $x_m = \max(x)$.
$$\mathrm{mean}(x) = \bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i$$
$$\mathrm{median}(x) = \begin{cases} x_{r+1} & \text{if } m \text{ is odd, i.e. } m = 2r + 1 \\ \frac{1}{2}(x_r + x_{r+1}) & \text{if } m \text{ is even, i.e. } m = 2r \end{cases}$$
Compute the mean and median for the following dataset
{85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 94, 74, 65, 56, 97}
Sorted (m = 15): {34, 42, 51, 56, 65, 69, 74, 78, 84, 85, 85, 86, 87, 94, 97}
Mean = 1/15(34 + 42 + 51 + 56 + 65 + 69 + 74 + 78 + 84 + 85 + 85 + 86 + 87 + 94 + 97) ≈ 72.5
Median(x) = 78
With the low value 2 added (m = 16): {2, 34, 42, 51, 56, 65, 69, 74, 78, 84, 85, 85, 86, 87, 94, 97}
Mean = 1/16(2 + 34 + 42 + 51 + 56 + 65 + 69 + 74 + 78 + 84 + 85 + 85 + 86 + 87 + 94 + 97) ≈ 68.1
Median(x) = (74 + 78)/2 = 76
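The two computations above can be checked with a short sketch that follows the mean and median formulas directly. The function and variable names are illustrative only.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count m."""
    return sum(values) / len(values)

def median(values):
    """Middle value for odd m, average of the two middle values for even m."""
    xs = sorted(values)
    m = len(xs)
    r = m // 2
    if m % 2 == 1:                   # m = 2r + 1: x_{r+1}, which is xs[r] 0-based
        return xs[r]
    return (xs[r - 1] + xs[r]) / 2   # m = 2r: average of x_r and x_{r+1}

scores = [85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 94, 74, 65, 56, 97]
with_outlier = scores + [2]          # the second dataset from the example

print(round(mean(scores), 1), median(scores))              # 72.5 78
print(round(mean(with_outlier), 1), median(with_outlier))  # 68.1 76.0
```

Adding the single low value moves the mean by more than four units, while the median shifts by only two.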
Mean and Median
The mean is sometimes interpreted as the middle value, but it does not work well for skewed data or data with outliers.
If the data is skewed or prone to noise or outliers, the median is a more robust measure of the middle.
Trimmed Mean: assuming the values are sorted
• A percentage p between 0 and 100 is specified
• The top (p/2)% and bottom (p/2)% of the values are thrown out
• The mean is computed on the remaining data
Example of Trimmed Mean
Dataset: {1, 2, 34, 42, 51, 78, 69, 94, 74, 65, 56, 97}
Sorted: {1, 2, 34, 42, 51, 56, 65, 69, 74, 78, 94, 97}
Mean = 55.2
Trimmed Mean (40%): eliminate the top 20% and bottom 20% of the values. With 12 values this removes 2 values from each end (12 * 0.20 = 2.4, rounded down), leaving {34, 42, 51, 56, 65, 69, 74, 78}.
$$TM_{40\%} = \frac{1}{8}(34 + 42 + 51 + 56 + 65 + 69 + 74 + 78) = 58.6$$
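A minimal sketch of the trimmed mean, checked against the example above. The slides do not say how to round the number of discarded values when (p/2)% of m is not a whole number; rounding down is an assumption made here because it reproduces the example (2.4 becomes 2). The name `trimmed_mean` is illustrative.

```python
def trimmed_mean(values, p):
    """Mean after discarding the top and bottom (p/2)% of the sorted values."""
    if not 0 <= p < 100:
        raise ValueError("p must satisfy 0 <= p < 100")
    xs = sorted(values)
    m = len(xs)
    k = int(m * (p / 2) / 100)   # values dropped per end, rounded down (assumption)
    kept = xs[k:m - k]
    return sum(kept) / len(kept)

data = [1, 2, 34, 42, 51, 78, 69, 94, 74, 65, 56, 97]
print(trimmed_mean(data, 0))    # 55.25, the ordinary mean
print(trimmed_mean(data, 40))   # 58.625, shown as 58.6 in the example
```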
Measures of Spread: RANGE and VARIANCE
Measure the dispersion or spread of a set of values for continuous data.
Let $\{x_1, x_2, \ldots, x_m\}$ be the values of attribute $x$ for the given $m$ objects, where $x_1 \le x_2 \le \cdots \le x_m$.
$$\mathrm{range}(x) = \max(x) - \min(x) = x_m - x_1$$
$$\mathrm{variance}(x) = s_x^2 = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2$$
$$\mathrm{stddev}(x) = \sqrt{\mathrm{variance}(x)} = s_x$$
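The three formulas translate almost line for line into code. The sketch below uses the 15-value dataset from the earlier examples; the function names are illustrative, and the variance divides by m - 1 as in the formula above.

```python
import math

def value_range(values):
    """range(x) = max(x) - min(x)."""
    return max(values) - min(values)

def variance(values):
    """Sample variance: squared deviations from the mean, divided by m - 1."""
    m = len(values)
    x_bar = sum(values) / m
    return sum((x - x_bar) ** 2 for x in values) / (m - 1)

def std_dev(values):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(values))

scores = [85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 94, 74, 65, 56, 97]
print("range   :", value_range(scores))
print("variance:", round(variance(scores), 2))
print("stddev  :", round(std_dev(scores), 2))
```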
Measures of Spread: RANGE and VARIANCE
The range can be misleading if most values are concentrated in a narrow band but a small number of values are extreme.
x1 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, range(x1) = 9
x2 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 30}, range(x2) = 29
Therefore, for such cases variance serves as a better measure.
The mean is sensitive to outliers; therefore the variance is also sensitive to outliers.
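To illustrate the last point, the two datasets can be compared with Python's standard `statistics` module. This is a small sketch; the `summary` dictionary is just a convenient container chosen here.

```python
import statistics

x1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]   # same data with one extreme value

summary = {}
for name, xs in (("x1", x1), ("x2", x2)):
    rng = max(xs) - min(xs)
    avg = statistics.mean(xs)
    var = statistics.variance(xs)       # sample variance, divides by m - 1
    summary[name] = (rng, avg, var)
    print(f"{name}: range={rng}, mean={avg}, variance={var:.2f}")
```

Replacing the value 10 with 30 raises the range from 9 to 29, shifts the mean from 5.5 to 7.5, and raises the variance from about 9.17 to about 69.17, so the variance reacts to the outlier as well.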
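More Robust Measures
Absolute average deviation
$$\mathrm{AAD}(x) = \frac{1}{m} \sum_{i=1}^{m} |x_i - \bar{x}|$$
Median Absolute Deviation
$$\mathrm{MAD}(x) = \mathrm{median}\left(\{|x_1 - \bar{x}|, |x_2 - \bar{x}|, \ldots, |x_m - \bar{x}|\}\right)$$
Interquartile Range
$$\mathrm{IQR}(x) = x_{75\%} - x_{25\%}$$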
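The three measures can be sketched as below, following the formulas exactly as written on the slide. Two assumptions are worth flagging: MAD here takes deviations from the mean, as in the slide's formula, whereas many libraries take deviations from the median; and the IQR uses a helper `percentile` that applies the index rule from the percentile slide, so library quartile functions may return slightly different values.

```python
import math
import statistics

def percentile(values, p):
    """p-th percentile (0 < p < 100) with the index rule from the percentile slide."""
    xs = sorted(values)
    index = len(xs) * p / 100            # 1-based position
    if index == int(index):
        i = int(index)
        return (xs[i - 1] + xs[i]) / 2
    return xs[math.ceil(index) - 1]

def aad(values):
    """Absolute average deviation: mean of |x_i - mean|."""
    x_bar = statistics.mean(values)
    return sum(abs(x - x_bar) for x in values) / len(values)

def mad(values):
    """Median of |x_i - mean|, as on the slide (libraries often use the median)."""
    x_bar = statistics.mean(values)
    return statistics.median([abs(x - x_bar) for x in values])

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return percentile(values, 75) - percentile(values, 25)

x2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]     # dataset with one extreme value
print("AAD:", aad(x2))
print("MAD:", mad(x2))
print("IQR:", iqr(x2))
```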
More Robust Measures
Measure    Sepal Length   Sepal Width   Petal Length   Petal Width
Range      3.6            2.4           5.9            2.4
Std_dev    0.8            0.4           1.8            0.8
AAD        0.7            0.3           1.6            0.6
MAD        0.7            0.3           1.2            0.7
IQR        1.3            0.5           3.5            1.5
MULTIVARIATE SUMMARY STATISTICS
Measure the location or spread of a set of values for continuous data containing several attributes.
Let $\bar{\mathbf{x}} = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n\}$ be the mean values of the $n$ attributes.
Compute spread and location for each attribute.
Covariance matrix and correlation matrix
$$\mathrm{covariance}(x_i, x_j) = \frac{1}{m-1} \sum_{k=1}^{m} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)$$
Covariance is a measure of the degree to which two attributes vary together, and it depends on the magnitudes of the variables.
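A small sketch of the covariance formula and the resulting covariance matrix. The height and weight values are hypothetical illustration data, not taken from the slides, and the function names are illustrative. Rescaling one attribute shows how covariance depends on the magnitude of the variables.

```python
def covariance(xi, xj):
    """Sample covariance of two equally long attribute columns."""
    m = len(xi)
    mean_i = sum(xi) / m
    mean_j = sum(xj) / m
    return sum((a - mean_i) * (b - mean_j) for a, b in zip(xi, xj)) / (m - 1)

def covariance_matrix(columns):
    """Entry (i, j) is the covariance of attribute i with attribute j."""
    return [[covariance(ci, cj) for cj in columns] for ci in columns]

# Hypothetical data: five objects with two attributes.
height_cm = [150, 160, 170, 180, 190]
weight_kg = [50, 58, 66, 74, 82]
height_m = [h / 100 for h in height_cm]   # same heights in metres

print(covariance(height_cm, weight_kg))   # 200.0
print(covariance(height_m, weight_kg))    # about 2.0: same relation, smaller scale
matrix = covariance_matrix([height_cm, weight_kg])
print(matrix)                             # diagonal entries are the variances
```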
THANKS
