
Course STAT2: STATISTICAL STRUCTURES IN DATA (SSD)
Teacher: AMITA PAL, Interdisciplinary Statistical Research Unit (ISRU), ISI Kolkata

Postgraduate Diploma in Business Analytics (PGDBA): 2022-24 Batch


Descriptive Statistics (contd.)

Statistical Structures in Data, PGDBA Programme, ISI, 2022 October 12, 2022
Example: Insect Data

[Figure: insect data in two panels, With Insecticide A and With Insecticide B]


Measures of Central Tendency

Central Tendency

• A data point around which most of the points are located


• Some sort of a central point
• Common Measures
  • Mode: the most frequent value
  • Median: the middle-most observation in the ordered dataset
  • Mean or Arithmetic Mean: the average value

Mode: The Most Frequent Observation

• For a data set (3, 7, 3, 9, 9, 3, 5, 1, 8, 5) (left histogram), the unique mode is 3.
• For a data set (2, 4, 9, 6, 4, 6, 6, 2, 8, 2) (right histogram), there are two modes: 2 and 6.
• A distribution with a single mode is said to be unimodal. A distribution with more than one mode is said to be bimodal, trimodal, etc., or in general, multimodal.
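
The two data sets above can be checked with a short sketch. The helper name `modes` is an illustrative choice, not something defined in the slides.

```python
from collections import Counter

def modes(data):
    """Return every value that attains the highest frequency, in sorted order."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(value for value, freq in counts.items() if freq == top)

left = [3, 7, 3, 9, 9, 3, 5, 1, 8, 5]
right = [2, 4, 9, 6, 4, 6, 6, 2, 8, 2]

print(modes(left))   # [3]     -> unimodal
print(modes(right))  # [2, 6]  -> bimodal
```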
Computing the Mode from Grouped Data

When raw data is not available
• Data is provided in grouped form
• Mode is computed by linear interpolation within the class having the highest frequency.

[Figure: histogram around the modal class, with labels L, H and P]

Mode from Grouped Data (contd.)

• Let
  • $f_M$ denote the frequency of the class interval with the highest frequency (the modal class interval)
  • $L$, $H$ denote the lower and upper boundaries of the modal class interval
  • $f_L$ and $f_H$ denote the frequencies of the class intervals just before and just after the modal class interval
• Then the estimated mode is
  $M = L + (H - L)\,\dfrac{f_M - f_L}{(f_M - f_L) + (f_M - f_H)}$
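
A minimal sketch of this interpolation, assuming the standard form M = L + (H - L)(f_M - f_L) / ((f_M - f_L) + (f_M - f_H)). The class boundaries and frequencies in the example are hypothetical, not taken from the insect data.

```python
def grouped_mode(L, H, f_M, f_L, f_H):
    """Estimate the mode by linear interpolation inside the modal class [L, H]."""
    excess_before = f_M - f_L   # how far the modal class rises above the class before it
    excess_after = f_M - f_H    # how far it rises above the class after it
    if excess_before + excess_after == 0:
        raise ValueError("modal class is not higher than its neighbours")
    return L + (H - L) * excess_before / (excess_before + excess_after)

# Hypothetical modal class 20-30 with frequency 12; neighbouring frequencies 7 and 9.
print(grouped_mode(L=20, H=30, f_M=12, f_L=7, f_H=9))  # 26.25
```

When the two neighbouring frequencies are equal, the estimate falls at the mid-point of the modal class.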
Mode: Illustration

With Insect Data

[Figure: insect data with the MODE marked]

Multimodal data

Median: The Middle-most Observation

• MIDDLE-MOST VALUE in an ordered array of sample observations
• Unaffected by extremely large and extremely small values

[Figure: ordered observations from Smallest to Largest, with the MEDIAN in the middle]

Median: Example with an Odd Number of Observations

Ordered observations (17 in number)

3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21 22
• The median is the 9th ordered observation, that is, 15,
  • since $(n + 1)/2 = (17 + 1)/2 = 9$.
• Observe that
  • If the largest observation 22 is replaced by 100, the median remains unchanged.
  • If the smallest observation 3 is replaced by −103, the median remains unchanged.
• The median is not affected by outliers.
Median: Example with an Even Number of Observations

Ordered observations (16 in number)

4 5 7 8 9 11 14 15 16 16 17 19 19 20 21 22
• The median is the average of the 8th and 9th ordered observations, that is, 15 and 16, and is equal to 15.5,
  • since $n/2 = 16/2 = 8$.

Example: Insect Data

[Figure: insect data. With Insecticide A: MEDIAN = 40. With Insecticide B: MEDIAN = 16.5.]


Median: Computational Procedure

• Arrange the $n$ observations of the dataset in increasing order.
• If $n$ is odd, the median is the middle term of the ordered array.
• If $n$ is even, the median is the average of the middle two terms.
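
The procedure can be sketched as follows, using the ordered observations from the two earlier median examples. The function name `median` is an illustrative choice.

```python
def median(values):
    """Middle term for odd n; average of the two middle terms for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]
even = odd[1:]  # the 16-observation example drops the smallest value

print(median(odd))                 # 15
print(median(even))                # 15.5
print(median(odd[:-1] + [100]))    # 15, largest value replaced by 100
print(median([-103] + odd[1:]))    # 15, smallest value replaced by -103
```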

Computing the Median from Grouped Data

When raw data is not available
• Data is provided in grouped form
• Median is computed by linear interpolation within the class containing the observation below which 50% of the ordered observations lie.

[Figure: histogram around the median class, with boundaries L and H]
Median from Grouped Data (contd.)

• Let
  • $f$ denote the frequency of the class interval which contains the $n/2$-th ordered observation (the median class interval)
  • $L$, $H$ denote the lower and upper boundaries of the median class interval
  • $f_L$ denote the cumulative frequency of all class intervals below the median class interval (that is, up to its lower boundary $L$)
• Then the estimated median is
  $M_e = L + (H - L)\,\dfrac{\frac{n}{2} - f_L}{f}$
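
A minimal sketch of this interpolation, taking f_L as the cumulative frequency of the classes below the median class. The frequency table is hypothetical.

```python
def grouped_median(boundaries, freqs):
    """Interpolate the median from class boundaries and class frequencies.

    boundaries has one more entry than freqs: class i spans
    boundaries[i] to boundaries[i + 1].
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("empty frequency table")
    cumulative = 0  # cumulative frequency below the current class (f_L)
    for L, H, f in zip(boundaries, boundaries[1:], freqs):
        if cumulative + f >= n / 2:      # this class contains the n/2-th observation
            return L + (H - L) * (n / 2 - cumulative) / f
        cumulative += f

# Hypothetical table: classes 0-10, 10-20, 20-30, 30-40, 40-50.
boundaries = [0, 10, 20, 30, 40, 50]
freqs = [3, 5, 9, 6, 2]
print(grouped_median(boundaries, freqs))  # 25.0
```

Here n = 25, so n/2 = 12.5 falls in the class 20-30, which has f = 9 and f_L = 8.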
Arithmetic Mean (or, simply, Mean)

• Is the average of a group of numbers
• Computed by summing all values in the data set and dividing the sum by the number of values in the data set
• Affected by each value in the data set, including extreme values (outliers)
Computing the Mean from Grouped Data

When raw data is not available and data is provided in grouped form
• Mean is computed by
  • assuming that the $f_i$ observations in the $i$-th class interval can be approximated by $x_i$, the mid-point of the class interval, $i = 1, 2, \dots, k$.
• The arithmetic mean is given by
  $\bar{x} = \dfrac{1}{n} \sum_{i=1}^{k} f_i x_i$, with $n = \sum_{i=1}^{k} f_i$.
Illustration

Class Interval      Mid-point                        Frequency
$\ell_1 - u_1$      $x_1 = (\ell_1 + u_1)/2$         $f_1$
$\ell_2 - u_2$      $x_2 = (\ell_2 + u_2)/2$         $f_2$
$\ell_3 - u_3$      $x_3 = (\ell_3 + u_3)/2$         $f_3$
⋮                   ⋮                                ⋮
$\ell_k - u_k$      $x_k = (\ell_k + u_k)/2$         $f_k$

$\ell_i$: lower class boundary; $u_i$: upper class boundary

$\bar{x} = \dfrac{1}{n} \sum_{i=1}^{k} f_i x_i = \dfrac{705}{25} = 28.2$
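
The mid-point approximation can be sketched as below. The frequency table is hypothetical and is not the one behind the 705/25 figure on the slide.

```python
def grouped_mean(lowers, uppers, freqs):
    """Approximate the mean by placing every observation at its class mid-point."""
    midpoints = [(lo + up) / 2 for lo, up in zip(lowers, uppers)]
    n = sum(freqs)
    return sum(f * x for f, x in zip(freqs, midpoints)) / n

# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50.
lowers = [0, 10, 20, 30, 40]
uppers = [10, 20, 30, 40, 50]
freqs = [3, 5, 9, 6, 2]
print(grouped_mean(lowers, uppers, freqs))  # 615 / 25 = 24.6
```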
Example: Insect Data

[Figure: insect data. With Insecticide A: MEAN = 39.74. With Insecticide B: MEAN = 17.24.]


Quartiles

Divide a dataset into four subgroups
• ¼ of the dataset is below the 1st quartile (Q1)
• ½ of the dataset is below the 2nd quartile (Q2)
• ¾ of the dataset is below the 3rd quartile (Q3)

[Figure: ordered data from Smallest to Largest, with Q1, Q2 and Q3 marked]

Quartiles

[Figure: Q1, Q2 and Q3 split the distribution into four parts of 25% each]

Example: Insect Data

[Figure: insect data with Insecticide A. Q1 = 39, Q2 = 40, Q3 = 42.]
Generalization: Quantiles

• Sample quantiles are sets of cut points dividing the observations in a sample into continuous intervals with equal proportions of observations.
• There is one fewer quantile than the number of groups created.
• A set of $q$-quantiles ($q$: positive integer) are values that partition a finite set of values into $q$ subsets of (nearly) equal sizes. There are $q - 1$ such quantiles, one for each integer $k$ satisfying $0 < k < q$.
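
As a quick check of the count of cut points, the sketch below uses `statistics.quantiles` from the Python standard library on hypothetical data (the integers 1 to 200) and confirms that q groups need q - 1 cut points.

```python
from statistics import quantiles

data = list(range(1, 201))  # hypothetical sample: 1, 2, ..., 200

cut_point_counts = {}
for q in (4, 10, 100):      # quartiles, deciles, percentiles
    cuts = quantiles(data, n=q)
    cut_point_counts[q] = len(cuts)
    print(q, "groups ->", len(cuts), "cut points")
```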
Special Cases

• Quartiles ($q = 4$), denoted by $Q_1, Q_2, Q_3$
• Deciles ($q = 10$), denoted by $D_1, D_2, \dots, D_9$
• Percentiles ($q = 100$), denoted by $P_1, P_2, \dots, P_{99}$
• Observe that the median is the same as $Q_2$, $D_5$, $P_{50}$
Fractiles

• Let $p$ be a number lying between 0 and 1.
• Then the $p$-fractile ($z_p$) of a set of observations $\{x_1, x_2, x_3, \dots, x_n\}$ on a variable $X$ is that value $z_p$ of the variable for which a fraction $p$ of the observations are less than or equal to it.
• In other words,
  $\dfrac{\#\{i : x_{(i)} \le z_p\}}{n} = p.$
Collections of Fractiles

$z_{p_1}, z_{p_2}, \dots, z_{p_{k-1}}$

where $0 < p_i < 1$ for $i = 1, 2, \dots, k - 1$, and $p_i - p_{i-1} = \dfrac{1}{k}$ with $p_0 = 0$, $p_k = 1$.

Special cases
• Quartiles ($k = 4$)
• Deciles ($k = 10$)
• Percentiles ($k = 100$)

Computation of Quantiles: Ungrouped Data

Special Case: Percentiles

• Step 1: Arrange the data in ascending order.
• Step 2: Compute $i = \dfrac{p}{100}(n + 1)$.
  • If $i$ is an integer, the $p$-th percentile is the $i$-th ordered observation (for $p = 50$ and odd $n$ this is the $(n + 1)/2$-th observation, the median).
  • If $i$ is not an integer, then either round up to the nearest integer and take the value at that position, or use simple interpolation to locate the percentile between the $\lfloor i \rfloor$-th and $(\lfloor i \rfloor + 1)$-th ordered observations.
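
A minimal sketch of the interpolation variant, assuming the position i = p(n + 1)/100, with an integer i pointing directly at the i-th ordered observation. Clamping positions that fall outside 1..n to the smallest or largest observation is an added assumption, not something stated in the slides.

```python
def percentile(values, p):
    """p-th percentile using position i = p/100 * (n + 1) and linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    i = p / 100 * (n + 1)        # 1-based, possibly fractional position
    whole = int(i)               # floor(i), since i is non-negative
    frac = i - whole
    if whole < 1:                # assumption: clamp below the first observation
        return ordered[0]
    if whole >= n:               # assumption: clamp above the last observation
        return ordered[-1]
    below, above = ordered[whole - 1], ordered[whole]
    return below + frac * (above - below)

odd = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]
even = odd[1:]

print(percentile(odd, 50))   # position 9, value 15 (the median)
print(percentile(even, 50))  # position 8.5, value 15.5
print(percentile(odd, 25))   # position 4.5, value 7.5
```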
Illustration

Computation of Quantiles (contd.)

[Figure: two panels, Ungrouped Data and Grouped Data]

Measures of Dispersion or Spread

Dispersion

• Two data sets may have the same central tendency and yet be different.
• One reason could be a difference in the spread or variability of the data points with respect to the central tendency.
Measures of Variability

• Describe the spread or the dispersion in a dataset.
• Common Measures of Variability
  • Range
  • Interquartile Range
  • Mean Absolute Deviation
  • Variance
  • Standard Deviation
  • Coefficient of Variation
Range

• The difference between the largest and the smallest values in a dataset
• Simple to compute
• Ignores information contained in data points other than the two extremes
• Example:

  35 41 44 45
  37 41 44 46
  37 43 44 46
  39 43 44 46
  40 43 44 46
  40 43 45 48

  Range = Largest - Smallest = 48 - 35 = 13

Interquartile Range

INTERQUARTILE RANGE
• Difference between the third and first quartiles: $IQR = Q_3 - Q_1$
• Range of the “middle half”
• Less influenced by extremes

RANGE
• The difference between the largest and the smallest values in a set of data
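
The sketch below computes both measures for the 24 values of the range example. It relies on `statistics.quantiles`, whose default 'exclusive' method places quartiles using positions based on n + 1 and interpolates between neighbours; other quartile conventions can give slightly different values.

```python
from statistics import quantiles

data = [35, 41, 44, 45, 37, 41, 44, 46, 37, 43, 44, 46,
        39, 43, 44, 46, 40, 43, 44, 46, 40, 43, 45, 48]

data_range = max(data) - min(data)   # uses only the two extremes
q1, q2, q3 = quantiles(data, n=4)    # three cut points: Q1, Q2, Q3
iqr = q3 - q1                        # spread of the middle half

print(data_range)  # 13
print(q1, q3, iqr)  # 40.25 45.0 4.75
```

Replacing the largest value 48 by a much larger number would inflate the range but leave the IQR unchanged here.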

Mean Absolute Deviation (MAD)

• Data set: 5, 9, 16, 17, 18
• Mean: $\bar{x} = \dfrac{\sum x}{n} = \dfrac{65}{5} = 13$
• Deviations from the mean: -8, -4, +3, +4, +5
• Mean deviation is 0!
• Does not reflect the actual picture.
• A more realistic measure of variability is the mean absolute deviation (MAD), or the average of the absolute values of the deviations. In this example, MAD = 4.8.

[Figure: the five data points on a number line from 0 to 20, with their deviations from $\bar{x} = 13$]
Mean Absolute Deviation

• Average of the absolute deviations from the arithmetic mean

  $x$     $x - \bar{x}$     $|x - \bar{x}|$
  5       -8                8
  9       -4                4
  16      3                 3
  17      4                 4
  18      5                 5

  $\bar{x} = \dfrac{\sum x}{n} = \dfrac{65}{5} = 13$

  $MAD = \dfrac{\sum |x - \bar{x}|}{n} = \dfrac{24}{5} = 4.8$
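
The table can be reproduced with a few lines; variable names are illustrative.

```python
data = [5, 9, 16, 17, 18]
n = len(data)

mean = sum(data) / n                                 # 65 / 5 = 13.0
deviations = [x - mean for x in data]                # -8, -4, 3, 4, 5
mean_deviation = sum(deviations) / n                 # always 0: signs cancel
mad = sum(abs(d) for d in deviations) / n            # 24 / 5 = 4.8

print(mean, mean_deviation, mad)
```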

Variance and Standard Deviation

• Variance ($s^2$): Average of the SQUARED deviations from the arithmetic mean

  $x$     $x - \bar{x}$     $(x - \bar{x})^2$
  5       -8                64
  9       -4                16
  16      3                 9
  17      4                 16
  18      5                 25

  $\bar{x} = \dfrac{\sum x}{n} = \dfrac{65}{5} = 13$

  $s^2 = \dfrac{\sum (x - \bar{x})^2}{n} = \dfrac{130}{5} = 26$

  $s = \sqrt{26} \approx 5.1$
• Standard deviation ($s$): positive square root of the variance
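
A short sketch of the same calculation. Note that it divides by n, as on the slide, rather than by n - 1.

```python
import math

data = [5, 9, 16, 17, 18]
n = len(data)

mean = sum(data) / n
squared_deviations = [(x - mean) ** 2 for x in data]   # 64, 16, 9, 16, 25
variance = sum(squared_deviations) / n                  # 130 / 5 = 26.0 (divisor n)
std_dev = math.sqrt(variance)                           # about 5.1

print(variance, round(std_dev, 1))
```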
Coefficient of Variation

• Ratio of the standard deviation to the mean, expressed as a percentage
• Unit-free measure of relative dispersion

  $C.V. = \dfrac{s}{\bar{x}} \times 100$
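
As an illustration, the coefficient of variation of the small data set 5, 9, 16, 17, 18 used in the preceding slides can be computed as below (standard deviation with divisor n, as in those slides).

```python
import math

data = [5, 9, 16, 17, 18]
n = len(data)

mean = sum(data) / n
std_dev = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
cv = std_dev / mean * 100     # percentage, free of the data's units

print(round(cv, 1))  # 39.2
```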

Measures of Shape

• Skewness
  • Absence of symmetry
  • Majority of extreme values to one side of a distribution
• Kurtosis
  • Peakedness/flatness of a distribution

Shape Descriptors

• Measures of Skewness
• Measures of Kurtosis

Skewness

[Figure: three distributions: Negatively Skewed, Symmetric (Not Skewed), Positively Skewed]

Skewness

[Figure: positions of mean, median and mode. Negatively Skewed: mean, median, mode from left to right. Symmetric (Not Skewed): mean, median and mode coincide. Positively Skewed: mode, median, mean from left to right.]

Measures of Skewness based on Quartiles

• Galton’s (Bowley’s) measure

  $S = \dfrac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1}$
• Interpretation (also applies to the measures listed in the next slide)
  • If $S < 0$, the distribution is negatively skewed (skewed to the left).
  • If $S = 0$, the distribution is symmetric (not skewed).
  • If $S > 0$, the distribution is positively skewed (skewed to the right).
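
As a worked instance, the sketch below applies the quartile-based measure S = ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1) to the quartiles reported earlier for the Insecticide A data (Q1 = 39, Q2 = 40, Q3 = 42). The function name is an illustrative choice.

```python
def bowley_skewness(q1, q2, q3):
    """Quartile-based skewness: compares the upper and lower halves of the box."""
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)

s = bowley_skewness(39, 40, 42)   # (2 - 1) / 3
print(round(s, 3))                # 0.333, positive -> skewed to the right
```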

Pearson’s Skewness Coefficients

• Pearson’s skewness coefficient (of the first type): $S_1 = \dfrac{\text{Mean} - \text{Mode}}{\text{standard deviation}}$
• Pearson’s skewness coefficient (of the second type): $S_2 = \dfrac{3(\text{Mean} - \text{Median})}{\text{standard deviation}}$
• The second is approximately equivalent to the first in view of the empirical relationship for moderately skewed distributions whereby
  $\text{Mean} - \text{Mode} \approx 3(\text{Mean} - \text{Median})$
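
A small sketch of both coefficients, with S1 = (mean - mode) / sd and S2 = 3(mean - median) / sd. The second is evaluated on the data set 5, 9, 16, 17, 18 from the dispersion slides; that data set has no repeated value, so the first coefficient is shown only as a function.

```python
import math

def pearson_first(mean, mode, sd):
    return (mean - mode) / sd

def pearson_second(mean, median, sd):
    return 3 * (mean - median) / sd

data = [5, 9, 16, 17, 18]
n = len(data)
mean = sum(data) / n                                          # 13.0
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)        # about 5.1
median = sorted(data)[n // 2]                                 # 16 (n is odd)

s2 = pearson_second(mean, median, sd)
print(round(s2, 2))   # negative -> skewed to the left
```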

Kurtosis

Peakedness of a distribution
• Leptokurtic
  • high and thin
• Mesokurtic
  • normal in shape
• Platykurtic
  • flat and spread out

[Figure: Leptokurtic, Mesokurtic and Platykurtic curves]
Box and Whisker Plots (Box Plots)

• Graphic display of a distribution
• Reveals
  • central tendency
  • dispersion
  • skewness

Box Plots

• Five descriptive measures are used:
  • Median, Q2
  • First quartile, Q1
  • Third quartile, Q3
  • Minimum value in the data set
  • Maximum value in the data set

Box Plots (contd.)

To identify outliers, one of the following sets of fences is used (IQR = Q3 - Q1):
• Inner Fences
  • Lower inner fence = Q1 - 1.5 IQR
  • Upper inner fence = Q3 + 1.5 IQR
• Outer Fences
  • Lower outer fence = Q1 - 3.0 IQR
  • Upper outer fence = Q3 + 3.0 IQR
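
The fences can be sketched as below, using the quartiles reported earlier for the Insecticide A data (Q1 = 39, Q3 = 42). The values passed to `classify` are hypothetical observations, and the three labels are illustrative wording.

```python
def fences(q1, q3):
    """Return (inner, outer) fences, each as a (lower, upper) pair."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer

def classify(x, q1, q3):
    """Say where an observation lies relative to the two sets of fences."""
    inner, outer = fences(q1, q3)
    if x < outer[0] or x > outer[1]:
        return "beyond outer fences"
    if x < inner[0] or x > inner[1]:
        return "between inner and outer fences"
    return "within inner fences"

print(fences(39, 42))        # ((34.5, 46.5), (30.0, 51.0))
print(classify(40, 39, 42))  # within inner fences
print(classify(48, 39, 42))  # between inner and outer fences
print(classify(25, 39, 42))  # beyond outer fences
```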

Box and Whisker Plot

[Figure: BOX spanning Q1 to Q3 with Q2 marked inside, and a WHISKER on each side extending to the Minimum and the Maximum]

Skewness and Box Plots

[Figure: box plots for Negatively Skewed (S < 0), Symmetric / Not Skewed (S = 0) and Positively Skewed (S > 0) data]
A Typical Box and Whisker Plot

[Figure: box plot labelled with the upper quartile, median, lower quartile, inter-quartile range (IQR) and the two whiskers]

Comparing Different Datasets with Boxplots

[Figure: side-by-side box plots of three datasets, labelled 1, 2 and 3]
Whiskers: Variations

• The minimum and maximum of the data
• The lowest data point within 1.5×IQR of the lower quartile, and the highest data point within 1.5×IQR of the upper quartile (the Tukey boxplot)
• One standard deviation above and below the mean of the data
• The 9th percentile and the 91st percentile
• The 2nd percentile and the 98th percentile
Detection of Outliers

• Any data point not included between the whiskers is plotted as an outlier with a dot or a similar symbol

Detection of Outliers with Fences

Motivation for the fences

Using the properties of the normal distribution
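
One way to make this concrete, offered as a sketch rather than as the argument shown on the slide: for a normal distribution, compute the quartiles, place the fences at Q1 - k·IQR and Q3 + k·IQR, and ask what share of the distribution falls outside them. The code uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

std_normal = NormalDist()            # mean 0, standard deviation 1
q1 = std_normal.inv_cdf(0.25)        # about -0.674
q3 = std_normal.inv_cdf(0.75)        # about +0.674
iqr = q3 - q1                        # about 1.349 standard deviations

def share_outside(k):
    """Probability that a normal observation falls outside Q1 - k*IQR, Q3 + k*IQR."""
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return std_normal.cdf(lower) + (1 - std_normal.cdf(upper))

inner_share = share_outside(1.5)     # inner fences
outer_share = share_outside(3.0)     # outer fences
print(f"outside inner fences: {inner_share:.2%}")
print(f"outside outer fences: {outer_share:.5%}")
```

Under normality only about 0.7% of observations lie beyond the inner fences, and far fewer beyond the outer fences, so points out there are natural candidates for outliers.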

Variations of the Box Plot

Variable-width Box Plots
• Illustrate the size of each group whose data is being plotted by making the width of the box proportional to the size of the group.
• A popular convention is to make the box width proportional to the square root of the size of the group.

Variations of the Box Plot

Notched Box Plots
• There is a "notch" or narrowing of the box around the median.
• Notches are useful in offering a rough guide to the significance of a difference between medians.
• If the notches of two boxes do not overlap, this offers evidence of a statistically significant difference between the medians.
• The width of the notches is proportional to the IQR of the sample and inversely proportional to the square root of the size of the sample.
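
A sketch of the notch rule, assuming the commonly used form median ± c·IQR/√n. The constant c = 1.57 is a widespread plotting convention and is an assumption here, since the slides state only the proportionality. All sample values below are hypothetical.

```python
import math

def notch_interval(median, iqr, n, c=1.57):
    """Notch limits: median -/+ c * IQR / sqrt(n); c = 1.57 is an assumed convention."""
    half_width = c * iqr / math.sqrt(n)
    return median - half_width, median + half_width

def notches_overlap(a, b):
    """True if two (low, high) notch intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical groups: (median, IQR, sample size).
group_1 = notch_interval(median=40.0, iqr=3.0, n=36)   # half-width 0.785
group_2 = notch_interval(median=16.5, iqr=6.0, n=36)   # half-width 1.57

print(group_1, group_2)
print(notches_overlap(group_1, group_2))  # False -> medians appear to differ
```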
Comparison with Notched Boxplots

Example: Insect data

[Figure: insect data in two panels, With Insecticide A and With Insecticide B]
