30-09-2020

DATA MINING:
Summary Statistics
Prof. Sherica Lavinia Menezes
Asst. Professor
Computer Engineering Department
Goa College of Engineering

SUMMARY STATISTICS

Summary statistics, such as the mean and standard deviation of a set of
values, and visualization techniques, such as scatter plots and
histograms, are the standard methods widely used for data exploration.

Exploratory Data Analysis (EDA) is a specific domain that deals
with exploring the input data. It was created in the 1970s by the
statistician John Tukey. Summary statistics is a small topic of Data
Mining that overlaps with EDA.


AGENDA

01 The IRIS dataset
02 Measures of Location: Mean and Median
03 Measures of Spread: Range and Variance
04 Multivariate Summary Statistics

LEARNING OBJECTIVES
01 Define different measures of exploring data
02 Compare the mean and median
03 Compare the range and variance
04 Compute the mean/median/range/variance of a given dataset


The IRIS Dataset

150 flowers, 50 each from 3 species: Iris Setosa, Iris Versicolor, Iris Virginica

5 attributes:
Sepal Length in cm
Sepal Width in cm
Petal Length in cm
Petal Width in cm
Class: {setosa, versicolor, virginica}

Frequency and Mode

Given a categorical attribute $x$ that can take the values
$v_1, v_2, v_3, \ldots, v_i, \ldots, v_k$ and a set of $m$ objects, the
frequency of a value $v_i$ is given as:

$$\text{freq}(v_i) = \frac{\text{number of objects with attribute value } v_i}{m}$$

The mode of the attribute is the value that has the highest frequency.


Examples of frequency and mode

Value                  Freq
Build wind float       70/214 = 0.33
Build wind non-float   76/214 = 0.36
Vehic wind float       17/214 = 0.08
Vehic wind non-float   0/214 = 0
Containers             13/214 = 0.06
Tableware              9/214 = 0.04
Headlamps              29/214 = 0.14

MODE: Build wind non-float (highest frequency, 0.36)
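
The frequency and mode definitions can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `frequencies` and `mode` are made up here, and the list of observations is rebuilt from the counts in the table, since the raw records are not part of these slides.

```python
from collections import Counter

def frequencies(values):
    """Return {value: fraction of objects having that value}."""
    m = len(values)
    return {value: count / m for value, count in Counter(values).items()}

def mode(values):
    """Return the value with the highest frequency (one of them if tied)."""
    return Counter(values).most_common(1)[0][0]

# Counts copied from the table; the category with 0 objects is left out
# because it contributes no observations.
counts = {
    "Build wind float": 70,
    "Build wind non-float": 76,
    "Vehic wind float": 17,
    "Containers": 13,
    "Tableware": 9,
    "Headlamps": 29,
}
observations = [value for value, count in counts.items() for _ in range(count)]

for value, freq in frequencies(observations).items():
    print(f"{value}: {freq:.2f}")
print("mode:", mode(observations))
```

A value that never occurs, such as "Vehic wind non-float", does not appear in the result at all; its frequency is implicitly 0.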

Frequency and Mode

For attributes whose values are evenly distributed, these
measures are not useful.

For continuous data, a single value may not occur more than
once, so these measures are not helpful.

The mode can be used to indicate a missing value in a
categorical attribute.


PERCENTILES
Can be used for ordered data

Given an ordinal or continuous attribute $x$ and a number $p$ between
0 and 100, the $p$th percentile $x_p$ is a value of $x$ such that $p\%$
of the observed values of $x$ are less than $x_p$.

Percentile   Sepal length   Sepal width
0            4.3            2.0
10           4.8            2.5
20           5.0            2.7
30           5.2            2.8
40           5.6            3.0

How to compute Percentiles

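1. Take attribute x of m instances and the percentile p.
2. Sort the dataset in ascending order.
3. Compute Index = m * p, with p written as a fraction (for example, 0.80 for the 80th percentile).
4. Is Index a whole number?
   N: round Index up, then x_p% = x[Index]
   Y: x_p% = (x[Index] + x[Index + 1]) / 2

Here x[i] is the i-th value of the sorted dataset, counting from 1.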


Example of Percentiles
{85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 94, 74,
65, 56, 97}

{34, 42, 51, 56, 65, 69, 74, 78, 84, 85, 85, 86,
87, 94, 97}

m = 15; find the 80th percentile

Index = 15 * 0.80 = 12, a whole number, so average the 12th and 13th sorted values (86 and 87):

$$x_{80\%} = \frac{86 + 87}{2} = 86.5$$
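
The percentile procedure can be sketched in Python as follows. This is a minimal sketch rather than a standard library routine: the name `percentile` is made up here, `p` is assumed to be an integer percentage so that the whole-number test can use exact integer arithmetic, and returning the minimum for p = 0 and the maximum for p = 100 is an assumption, because the slides do not cover those edge cases.

```python
def percentile(values, p):
    """p-th percentile (p an integer from 0 to 100) using the slide's procedure."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    xs = sorted(values)                      # sort in ascending order
    m = len(xs)
    index, remainder = divmod(m * p, 100)    # Index = m * p / 100, kept exact
    if remainder != 0:
        # Not a whole number: round up. The 1-based position index + 1
        # is xs[index] in Python's 0-based indexing.
        return xs[index]
    if index == 0:                           # assumption: p = 0 gives the minimum
        return xs[0]
    if index == m:                           # assumption: p = 100 gives the maximum
        return xs[-1]
    # Whole number: average the values at 1-based positions index and index + 1.
    return (xs[index - 1] + xs[index]) / 2

scores = [85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 94, 74, 65, 56, 97]
print(percentile(scores, 80))  # 86.5, matching the worked example
```

Statistical libraries often use a different interpolation rule by default, so their results can differ slightly from this procedure.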

Measures of Location: MEAN and MEDIAN


Mainly for continuous data

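Measure the location of a set of values in a given attribute.

Let $\{x_1, x_2, \ldots, x_m\}$ be the values of attribute $x$ for the
given $m$ objects, where $x_1 \le x_2 \le \cdots \le x_m$.

$$x_1 = \min(x) \qquad x_m = \max(x)$$

$$\text{mean}(x) = \bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i$$

$$\text{median}(x) = \begin{cases} x_{r+1} & \text{if } m \text{ is odd, i.e. } m = 2r + 1 \\ \frac{1}{2}(x_r + x_{r+1}) & \text{if } m \text{ is even, i.e. } m = 2r \end{cases}$$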


Compute the mean and median for the following dataset

{85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 94, 74, 65, 56, 97}

Sorted (m = 15):
{34, 42, 51, 56, 65, 69, 74, 78, 84, 85, 85, 86, 87, 94, 97}
Mean = 1/15 (34 + 42 + 51 + 56 + 65 + 69 + 74 + 78 + 84 + 85 + 85 + 86 + 87 + 94 + 97) ≈ 72.5
Median(x) = 78

With the value 2 added (m = 16):
{2, 34, 42, 51, 56, 65, 69, 74, 78, 84, 85, 85, 86, 87, 94, 97}
Mean = 1/16 (2 + 34 + 42 + 51 + 56 + 65 + 69 + 74 + 78 + 84 + 85 + 85 + 86 + 87 + 94 + 97) ≈ 68.1
Median(x) = (74 + 78)/2 = 76
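
The two computations above can be reproduced with a short Python sketch. The function names `mean` and `median` are illustrative; the median follows the odd/even definition from the previous slide.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count m."""
    return sum(values) / len(values)

def median(values):
    """Middle value if m is odd, average of the two middle values if m is even."""
    xs = sorted(values)
    m = len(xs)
    r = m // 2
    if m % 2 == 1:
        return xs[r]                  # m = 2r + 1: x_{r+1} in 1-based indexing
    return (xs[r - 1] + xs[r]) / 2    # m = 2r: average of x_r and x_{r+1}

data = [85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 94, 74, 65, 56, 97]
with_outlier = data + [2]

print(round(mean(data), 1), median(data))                  # 72.5 78
print(round(mean(with_outlier), 1), median(with_outlier))  # 68.1 76.0
```

Adding the single value 2 moves the mean by more than 4 units, while the median moves by only 2.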

Mean and Median

The mean is sometimes interpreted as the middle of a set of
values, but it is not reliable when the data contain outliers
or are skewed.
If the data are skewed or prone to noise or outliers, the
median is a more reliable measure of the middle.
Trimmed Mean: assuming the values are sorted
• A percentage p between 0 and 100 is specified
• The top (p/2)% and bottom (p/2)% of the values are
thrown out
• The mean is computed on the remaining data


Example of Trimmed Mean


{1, 2, 34, 42, 51, 78, 69, 94, 74, 65, 56, 97}

Sorted: {1, 2, 34, 42, 51, 56, 65, 69, 74, 78, 94, 97}

Mean = 663/12 = 55.25

Trimmed Mean (40%): eliminate the top 20% and bottom 20% of the values. With 12 values, this removes 2 values from each end (1, 2 and 94, 97).

$$TM_{@40\%} = \frac{1}{8}(34 + 42 + 51 + 56 + 65 + 69 + 74 + 78) \approx 58.6$$
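
A minimal Python sketch of the trimmed mean follows. The name `trimmed_mean` is illustrative, and truncating the number of dropped values to a whole number is an assumption chosen to agree with the example, where 20% of 12 values (2.4) leads to 2 values being dropped at each end.

```python
def trimmed_mean(values, p):
    """Mean after discarding the top (p/2)% and bottom (p/2)% of the values.

    p is an integer percentage with 0 <= p < 100. The number of values
    dropped at each end is truncated to a whole number (an assumption).
    """
    if not 0 <= p < 100:
        raise ValueError("p must satisfy 0 <= p < 100")
    xs = sorted(values)
    m = len(xs)
    k = (m * p) // 200        # floor of m * (p/2) / 100
    kept = xs[k:m - k]        # drop k values from each end
    return sum(kept) / len(kept)

values = [1, 2, 34, 42, 51, 78, 69, 94, 74, 65, 56, 97]
print(trimmed_mean(values, 0))   # 55.25, the ordinary mean
print(trimmed_mean(values, 40))  # 58.625
```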

Measures of Spread: RANGE and VARIANCE


Measure the dispersion or spread of a set of values for continuous data.

Let $\{x_1, x_2, \ldots, x_m\}$ be the values of attribute $x$ for the
given $m$ objects, where $x_1 \le x_2 \le \cdots \le x_m$.

$$\text{range}(x) = \max(x) - \min(x) = x_m - x_1$$

$$\text{variance}(x) = s_x^2 = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2$$

$$\text{stddev}(x) = \sqrt{\text{variance}(x)} = s_x$$
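
The three formulas translate directly into Python. The sketch below uses illustrative function names and the 15-value dataset from the earlier slides; the variance divides by m - 1, as in the formula above.

```python
import math

def value_range(values):
    """range(x) = max(x) - min(x)."""
    return max(values) - min(values)

def variance(values):
    """Sample variance: squared deviations from the mean, divided by m - 1."""
    m = len(values)
    if m < 2:
        raise ValueError("variance needs at least two values")
    mean = sum(values) / m
    return sum((x - mean) ** 2 for x in values) / (m - 1)

def stddev(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(variance(values))

data = [85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 94, 74, 65, 56, 97]
print(value_range(data))  # 63, i.e. 97 - 34
print(variance(data))
print(stddev(data))
```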


Measures of Spread: RANGE and VARIANCE


The range can be misleading if most values are concentrated in a
narrow band but a small number of values are extreme.

x1 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}    x2 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 30}

range(x1) = 9    range(x2) = 29

Therefore, in such cases the variance serves as a better measure.

However, the mean is sensitive to outliers, so the variance, which is
computed from the mean, is also sensitive to outliers.
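
As a worked check on the two sets above, the sample variance (with $m - 1 = 9$ in the denominator) can be computed by hand. The means are $\bar{x} = 5.5$ for the first set and $\bar{x} = 7.5$ for the second:

$$s^2 = \frac{1}{9} \sum_{i=1}^{10} (x_i - 5.5)^2 = \frac{82.5}{9} \approx 9.17 \qquad \text{for } \{1, 2, \ldots, 9, 10\}$$

$$s^2 = \frac{1}{9} \sum_{i=1}^{10} (x_i - 7.5)^2 = \frac{622.5}{9} \approx 69.17 \qquad \text{for } \{1, 2, \ldots, 9, 30\}$$

Replacing the single value 10 with 30 raises the variance from about 9.17 to about 69.17, which shows how strongly one extreme value affects it.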

More Robust Measures


Absolute Average Deviation

$$\text{AAD}(x) = \frac{1}{m} \sum_{i=1}^{m} |x_i - \bar{x}|$$

Median Absolute Deviation

$$\text{MAD}(x) = \text{median}\left(\{|x_1 - \bar{x}|, |x_2 - \bar{x}|, \ldots, |x_m - \bar{x}|\}\right)$$

Interquartile Range

$$\text{IQR}(x) = x_{75\%} - x_{25\%}$$
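
The three measures can be sketched in Python as below. The function names are illustrative. MAD takes deviations from the mean, as written in the formula above, and the `percentile` helper follows the percentile procedure from the earlier slide, with `p` as an integer percentage; its handling of p = 0 and p = 100 is an assumption.

```python
def median(values):
    xs = sorted(values)
    m = len(xs)
    r = m // 2
    return xs[r] if m % 2 == 1 else (xs[r - 1] + xs[r]) / 2

def percentile(values, p):
    """p-th percentile (integer p, 0..100) using the slide's index rule."""
    xs = sorted(values)
    m = len(xs)
    index, remainder = divmod(m * p, 100)
    if remainder != 0:
        return xs[index]              # round up, converted to 0-based indexing
    if index == 0:
        return xs[0]                  # assumption for p = 0
    if index == m:
        return xs[-1]                 # assumption for p = 100
    return (xs[index - 1] + xs[index]) / 2

def aad(values):
    """Absolute average deviation from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

def mad(values):
    """Median of the absolute deviations from the mean."""
    mean = sum(values) / len(values)
    return median([abs(x - mean) for x in values])

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return percentile(values, 75) - percentile(values, 25)

data = [85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 94, 74, 65, 56, 97]
print(aad(data), mad(data), iqr(data))
```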


More Robust Measures

Measure   Sepal Length   Sepal Width   Petal Length   Petal Width
Range     3.6            2.4           5.9            2.4
Std_dev   0.8            0.4           1.8            0.8
AAD       0.7            0.3           1.6            0.6
MAD       0.7            0.3           1.2            0.7
IQR       1.3            0.5           3.5            1.5

MULTIVARIATE SUMMARY STATISTICS


Measures of location or spread for continuous data containing
several attributes.

Let $\bar{x} = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n\}$ be the mean values of the $n$
attributes.

Compute the spread and location for each attribute.

Covariance matrix and correlation matrix

$$\text{covariance}(x_i, x_j) = \frac{1}{m-1} \sum_{k=1}^{m} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)$$

where $x_{ki}$ is the value of attribute $i$ for object $k$.

Covariance measures the degree to which two attributes vary
together; its value depends on the magnitudes of the variables.
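
A small Python sketch of the covariance matrix follows. The function names and the three-row dataset are made up for illustration. The correlation formula, which divides each covariance by the product of the two standard deviations, is the standard Pearson form and is not spelled out on the slide; it is included to show how the dependence on magnitude is removed.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix for m objects (rows) and n attributes (columns)."""
    m = len(rows)
    if m < 2:
        raise ValueError("need at least two objects")
    n = len(rows[0])
    means = [sum(row[i] for row in rows) / m for i in range(n)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / (m - 1)
            for j in range(n)
        ]
        for i in range(n)
    ]

def correlation_matrix(rows):
    """Covariance scaled by the standard deviations (assumes no constant attribute)."""
    cov = covariance_matrix(rows)
    n = len(cov)
    std = [math.sqrt(cov[i][i]) for i in range(n)]
    return [[cov[i][j] / (std[i] * std[j]) for j in range(n)] for i in range(n)]

# Hypothetical data: the second attribute is exactly twice the first.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
print(covariance_matrix(rows))   # [[1.0, 2.0], [2.0, 4.0]]
print(correlation_matrix(rows))
```

The diagonal of the covariance matrix holds the variance of each attribute, and rescaling an attribute changes its covariances but not its correlations.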


THANKS

CREDITS: This presentation template was created
by Slidesgo, including icons by Flaticon, and
infographics & images by Freepik.
Please keep this slide for attribution.
