
CENTRAL TENDENCY AND DISPERSION:

For grouped data

Applied Statistics and Computing Lab

Indian School of Business


Learning goals

Understanding data with class intervals

Learning to evaluate various measures of central tendency and dispersion, for grouped data


Introduction

We studied measures of central tendency and dispersion for discrete data

The data was represented in the form of a list

How do we deal with data with class intervals?

Can we find a value that represents a given class interval?

Class intervals could emerge from both discrete as well as continuous data

We would look at a dataset consisting of ‘N’ observations, distributed across ‘n’ classes


Class mark

Class mark is the midpoint of a class interval

Calculated as the arithmetic mean of the class limits

E.g. if we are looking at the number of students whose scores lie between 60 and 70 (60 is the lower limit and 70 is the upper limit), then

$\frac{60 + 70}{2} = 65$

is the class mark or the midpoint of the class interval 60-70

Class mark cannot be determined for data with open classes (intervals indicated by an open bracket on either side)

In case of overlapping classes (where the upper limit of a class and the lower limit of the next one are equal), we assign that overlapping value to that class where the value is the lower limit
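
The two rules above (midpoint of the limits, and a shared boundary value going to the class where it is the lower limit) can be sketched in a few lines of Python. The function names and the sample score classes are hypothetical, chosen only for illustration.

```python
def class_mark(lower, upper):
    """Midpoint of a class interval: arithmetic mean of its two limits."""
    return (lower + upper) / 2


def assign_class(value, classes):
    """Index of the class holding `value`, or None if no class does.

    Each class is treated as lower <= value < upper, so a value shared by
    two adjacent classes goes to the class where it is the lower limit.
    """
    for index, (lower, upper) in enumerate(classes):
        if lower <= value < upper:
            return index
    return None


# Hypothetical overlapping score classes (illustrative only).
classes = [(50, 60), (60, 70), (70, 80)]

print(class_mark(60, 70))         # midpoint of the class 60-70
print(assign_class(70, classes))  # 70 is counted in 70-80, not in 60-70
```

Treating every class as closed on the left and open on the right is one way to encode the overlapping-class rule; a value equal to the last upper limit is then left unassigned.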


Cumulative frequency

Cumulative frequency is the frequency of values up to the upper limit of the corresponding class interval

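For the $i$-th class, with lower limit $l_i$ and upper limit $u_i$, denote its frequency as $f_i$, cumulative frequency as $F_i$ and class mark as $x_i$

| Class # | Class interval | Frequency | Cumulative frequency | Class mark |
|---|---|---|---|---|
| 1 | $l_1$ - $u_1$ | $f_1$ | $F_1 = f_1$ | $x_1 = \frac{l_1 + u_1}{2}$ |
| 2 | $l_2$ - $u_2$ | $f_2$ | $F_2 = f_1 + f_2$ | $x_2 = \frac{l_2 + u_2}{2}$ |
| 3 | $l_3$ - $u_3$ | $f_3$ | $F_3 = f_1 + f_2 + f_3$ | $x_3 = \frac{l_3 + u_3}{2}$ |
| ... | ... | ... | ... | ... |
| i | $l_i$ - $u_i$ | $f_i$ | $F_i = f_1 + f_2 + \dots + f_i$ | $x_i = \frac{l_i + u_i}{2}$ |
| ... | ... | ... | ... | ... |
| n | $l_n$ - $u_n$ | $f_n$ | $F_n = f_1 + f_2 + \dots + f_n = N$ | $x_n = \frac{l_n + u_n}{2}$ |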


Example


Weights from the ‘body measurement’ data used earlier

Weight values are given up to one decimal point

| Class interval | Frequency | Cumulative frequency | Class mark |
|---|---|---|---|
| 40-49.9 | 27 | 27 | 44.95 |
| 50-59.9 | 124 | 151 | 54.95 |
| 60-69.9 | 120 | 271 | 64.95 |
| 70-79.9 | 115 | 386 | 74.95 |
| 80-89.9 | 87 | 473 | 84.95 |
| 90-99.9 | 25 | 498 | 94.95 |
| 100-109.9 | 8 | 506 | 104.95 |
| 110-119.9 | 1 | 507 | 114.95 |
| Total | 507 | - | - |
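
The cumulative frequency and class mark columns of the weights table can be rebuilt from the class limits and frequencies. The following is a small sketch; the variable names are illustrative and not taken from the source.

```python
from itertools import accumulate

# Class limits and frequencies of the weights example.
lowers = [40, 50, 60, 70, 80, 90, 100, 110]
uppers = [49.9, 59.9, 69.9, 79.9, 89.9, 99.9, 109.9, 119.9]
freqs = [27, 124, 120, 115, 87, 25, 8, 1]

# Running total of the frequencies gives the cumulative frequency column.
cum_freqs = list(accumulate(freqs))

# Class mark = mean of the two class limits (rounded to tidy float noise).
class_marks = [round((l + u) / 2, 2) for l, u in zip(lowers, uppers)]

for l, u, f, F, x in zip(lowers, uppers, freqs, cum_freqs, class_marks):
    print(f"{l}-{u}\t{f}\t{F}\t{x}")
```

The last cumulative frequency equals the total number of observations, which is a quick sanity check on the table.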


Summation of all values

In an ungrouped data set, the value of each observation is considered

In grouped data, that is not possible

How can we account for all the values of a dataset?

The class mark (midpoint) is taken to represent every value belonging to its class interval, so it serves as a proxy for all of those values

We can repeat the class mark as many times as there are values in that class interval, which is nothing but the frequency

Hence, the sum of all values is taken as $\sum_{i=1}^{n} f_i x_i$, where $f_i$ is the frequency and $x_i$ the class mark of the $i$-th class


Means


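For data consisting of ‘N’ observations distributed across ‘n’ distinct class intervals, $N = \sum_{i=1}^{n} f_i$

Arithmetic mean: $\bar{x} = \frac{1}{N} \sum_{i=1}^{n} f_i x_i$

Geometric mean: $\left( \prod_{i=1}^{n} x_i^{f_i} \right)^{1/N}$

Harmonic mean: $\frac{N}{\sum_{i=1}^{n} \frac{f_i}{x_i}}$

where $f_i$ is the frequency and $x_i$ the class mark of the $i$-th class

For the weights data,

Arithmetic mean = 69.15

Geometric mean = 67.88

Harmonic mean = 66.64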
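
As a rough check, the three grouped means can be computed from the class marks and frequencies of the weights table. This is a sketch under the assumption that the tabulated class marks (44.95, 54.95, ...) are used as the proxies; the function name is illustrative. With these marks the arithmetic mean comes out near 69.35, which differs slightly from the 69.15 quoted above; the source does not say how the quoted figures were computed.

```python
import math

# Class marks and frequencies from the weights example.
class_marks = [44.95, 54.95, 64.95, 74.95, 84.95, 94.95, 104.95, 114.95]
freqs = [27, 124, 120, 115, 87, 25, 8, 1]


def grouped_means(marks, freqs):
    """Arithmetic, geometric and harmonic means with class marks as proxies."""
    n_obs = sum(freqs)
    arithmetic = sum(f * x for f, x in zip(freqs, marks)) / n_obs
    # Geometric mean via logarithms, avoiding a huge product of x_i ** f_i.
    log_sum = sum(f * math.log(x) for f, x in zip(freqs, marks))
    geometric = math.exp(log_sum / n_obs)
    harmonic = n_obs / sum(f / x for f, x in zip(freqs, marks))
    return arithmetic, geometric, harmonic


am, gm, hm = grouped_means(class_marks, freqs)
print(f"AM = {am:.2f}, GM = {gm:.2f}, HM = {hm:.2f}")
```

For positive data that are not all equal, the harmonic mean is smaller than the geometric mean, which in turn is smaller than the arithmetic mean; the printed values should respect that ordering.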


Median


How do we determine the value that has 50% of the data on each of its two sides?

Initially we can at least determine the class interval in which the value would lie, the median class

Let $u$ be the upper limit and $l$ be the lower limit of the median class

Let $f$ indicate the frequency of the median class and $F$ indicate the cumulative frequency of the class preceding the median class, then

$\text{Median} = l + \frac{(u - l)\left(\frac{N}{2} - F\right)}{f}$

This is obtained by linear interpolation, under the assumption that the cumulative frequency increases steadily from one class limit to the next


Median (contd.)

Total 507 observations ∴ the $\frac{N}{2} = 253.5$th observation splits the data into 2 equal halves

| Class interval | Frequency | Cumulative frequency |
|---|---|---|
| 40-49.9 | 27 | 27 |
| 50-59.9 | 124 | 151 |
| 60-69.9 | 120 | 271 |
| 70-79.9 | 115 | 386 |
| 80-89.9 | 87 | 473 |
| 90-99.9 | 25 | 498 |
| 100-109.9 | 8 | 506 |
| 110-119.9 | 1 | 507 |
| Total | 507 | - |

Median class: 60-69.9, as the 253.5th observation would lie in this interval

$\text{Median} = l + \frac{(u - l)\left(\frac{N}{2} - F\right)}{f} = 60 + \frac{(69.9 - 60)\left(\frac{507}{2} - 151\right)}{120} = 68.46$
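
The same calculation can be written as a short Python sketch: locate the first class whose cumulative frequency reaches N/2, then interpolate inside it. The function name is illustrative, and the sketch assumes the classes are listed in ascending order with at least one observation.

```python
from itertools import accumulate


def grouped_median(lowers, uppers, freqs):
    """Median = l + (u - l) * (N/2 - F) / f, evaluated for the median class."""
    half = sum(freqs) / 2
    cum_freqs = list(accumulate(freqs))
    # Median class: first class whose cumulative frequency reaches N/2.
    m = next(i for i, F in enumerate(cum_freqs) if F >= half)
    # Cumulative frequency of the class preceding the median class.
    F_prev = cum_freqs[m - 1] if m > 0 else 0
    return lowers[m] + (uppers[m] - lowers[m]) * (half - F_prev) / freqs[m]


lowers = [40, 50, 60, 70, 80, 90, 100, 110]
uppers = [49.9, 59.9, 69.9, 79.9, 89.9, 99.9, 109.9, 119.9]
freqs = [27, 124, 120, 115, 87, 25, 8, 1]

median = grouped_median(lowers, uppers, freqs)
print(round(median, 2))  # agrees with the hand calculation, 68.46
```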

Quantiles


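Suppose ‘k’ is the number of quantiles: k=4 for quartiles, k=10 for deciles and k=100 for percentiles

The $j$-th quantile is the $\left(\frac{jN}{k}\right)$-th value of the data

Note that $\frac{jN}{k}$ is not the numerical value of the quantile; it is only the position corresponding to the quantile when the data is organised in ascending order

For the median, i.e. the 2nd quartile, it was the $\left(\frac{2N}{4}\right) = \left(\frac{N}{2}\right)$-th value

Using cumulative frequencies, we can then determine the class to which the given quantile belongs

As per the notations used earlier, with $l$, $u$ and $f$ now the lower limit, upper limit and frequency of the quantile class and $F$ the cumulative frequency of the class preceding it,

$Q_j = l + \left[ \frac{(u - l)\left(\frac{jN}{k} - F\right)}{f} \right]$

$k = 4$ and $j = 2$ gives the median

Interquartile range:

$Q_3 - Q_1 = \left\{ l_{Q_3} + \frac{(u_{Q_3} - l_{Q_3})\left(\frac{3N}{4} - F_{Q_3}\right)}{f_{Q_3}} \right\} - \left\{ l_{Q_1} + \frac{(u_{Q_1} - l_{Q_1})\left(\frac{N}{4} - F_{Q_1}\right)}{f_{Q_1}} \right\}$

Where the $l$s and $u$s refer to the lower and upper limits of the corresponding quantile classes, and the $f$s and $F$s to their frequencies and preceding cumulative frequencies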
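
A general quantile routine follows the same pattern as the median: find the class containing position jN/k, then interpolate within it. The sketch below uses the index name j for the quantile number, which is an assumed symbol, and applies the formula to the weights data to obtain the quartiles.

```python
def grouped_quantile(j, k, lowers, uppers, freqs):
    """j-th of k quantiles: l + (u - l) * (j*N/k - F) / f for the quantile class."""
    if not 0 < j <= k:
        raise ValueError("j must satisfy 0 < j <= k")
    position = j * sum(freqs) / k   # position of the quantile, not its value
    cum_before = 0                  # F: cumulative frequency of preceding classes
    for l, u, f in zip(lowers, uppers, freqs):
        if f > 0 and cum_before + f >= position:
            return l + (u - l) * (position - cum_before) / f
        cum_before += f
    raise ValueError("no observations in the data")


lowers = [40, 50, 60, 70, 80, 90, 100, 110]
uppers = [49.9, 59.9, 69.9, 79.9, 89.9, 99.9, 109.9, 119.9]
freqs = [27, 124, 120, 115, 87, 25, 8, 1]

q1 = grouped_quantile(1, 4, lowers, uppers, freqs)
q2 = grouped_quantile(2, 4, lowers, uppers, freqs)  # the median
q3 = grouped_quantile(3, 4, lowers, uppers, freqs)
print(round(q1, 2), round(q2, 2), round(q3, 2), round(q3 - q1, 2))
```

Deciles and percentiles use the same function with k=10 or k=100.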


Mode


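Can easily identify the class interval with the highest frequency; the modal class

How do we determine the value which has the highest density?

Formula given by:

$\text{Mode} = l + \left[ \frac{(u - l)(f_m - f_{m-1})}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \right]$

where;

$l$ ≡ lower limit of the modal class

$u$ ≡ upper limit of the modal class

$f_m$ ≡ frequency of the modal class

$f_{m-1}$ ≡ frequency of the class preceding the modal class

$f_{m+1}$ ≡ frequency of the class succeeding the modal class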


Mode (contd.)

| Class interval | Frequency |
|---|---|
| 40-49.9 | 27 |
| 50-59.9 | 124 |
| 60-69.9 | 120 |
| 70-79.9 | 115 |
| 80-89.9 | 87 |
| 90-99.9 | 25 |
| 100-109.9 | 8 |
| 110-119.9 | 1 |
| Total | 507 |

Modal class: 50-59.9, the class interval with the highest frequency

$\text{Mode} = l + \frac{(u - l)(f_m - f_{m-1})}{(f_m - f_{m-1}) + (f_m - f_{m+1})} = 50 + \frac{(59.9 - 50)(124 - 27)}{(124 - 27) + (124 - 120)} = 59.51$
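
The mode calculation can be reproduced with a short Python sketch. The function name is illustrative. Treating a missing neighbour as frequency 0 when the modal class is the first or last class is an assumption of this sketch, not something the source states.

```python
def grouped_mode(lowers, uppers, freqs):
    """Mode = l + (u - l) * (f_m - f_prev) / ((f_m - f_prev) + (f_m - f_next))."""
    # Modal class: the (first) class with the highest frequency.
    m = max(range(len(freqs)), key=freqs.__getitem__)
    f_m = freqs[m]
    # Assumption: a missing neighbouring class counts as frequency 0.
    f_prev = freqs[m - 1] if m > 0 else 0
    f_next = freqs[m + 1] if m + 1 < len(freqs) else 0
    denominator = (f_m - f_prev) + (f_m - f_next)
    if denominator == 0:
        raise ValueError("formula undefined: neighbours match the modal frequency")
    return lowers[m] + (uppers[m] - lowers[m]) * (f_m - f_prev) / denominator


lowers = [40, 50, 60, 70, 80, 90, 100, 110]
uppers = [49.9, 59.9, 69.9, 79.9, 89.9, 99.9, 109.9, 119.9]
freqs = [27, 124, 120, 115, 87, 25, 8, 1]

mode = grouped_mode(lowers, uppers, freqs)
print(round(mode, 2))  # agrees with the hand calculation, 59.51
```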


Absolute deviations

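For data consisting of ‘N’ observations distributed across ‘n’ distinct class intervals,

Mean absolute deviation about the mean $= \frac{\sum_{i=1}^{n} f_i \left| x_i - \bar{x} \right|}{\sum_{i=1}^{n} f_i} = \frac{1}{N} \sum_{i=1}^{n} f_i \left| x_i - \bar{x} \right|$

where $x_i$ is the class mark, $f_i$ the class frequency and $\bar{x}$ the arithmetic mean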
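
A minimal sketch of the grouped mean absolute deviation about the mean, applied to the weights data with the tabulated class marks as proxies. The function name is illustrative, and the sketch covers only the deviation about the arithmetic mean.

```python
def mean_absolute_deviation(marks, freqs):
    """(1/N) * sum of f_i * |x_i - mean|, with class marks x_i as proxies."""
    n_obs = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, marks)) / n_obs
    return sum(f * abs(x - mean) for f, x in zip(freqs, marks)) / n_obs


class_marks = [44.95, 54.95, 64.95, 74.95, 84.95, 94.95, 104.95, 114.95]
freqs = [27, 124, 120, 115, 87, 25, 8, 1]

mad = mean_absolute_deviation(class_marks, freqs)
print(round(mad, 2))
```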


Central moments

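For grouped data consisting of ‘N’ observations distributed across ‘n’ distinct class intervals,

$r$-th central moment: $\mu_r = \frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^r}{\sum_{i=1}^{n} f_i} = \frac{1}{N} \sum_{i=1}^{n} f_i (x_i - \bar{x})^r$, where $x_i$ is the class mark

Variance: $\sigma^2 = \mu_2 = \frac{1}{N} \sum_{i=1}^{n} f_i (x_i - \bar{x})^2$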

Coefficient of skewness and kurtosis can be calculated accordingly
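
A sketch of the grouped central moments for the weights data follows. The skewness and kurtosis lines use the common moment-based definitions ($\mu_3 / \mu_2^{3/2}$ and $\mu_4 / \mu_2^2$); the source does not spell out which definitions it intends, so treat these as an assumption. The function name is illustrative.

```python
def central_moment(marks, freqs, r):
    """r-th central moment: (1/N) * sum of f_i * (x_i - mean) ** r."""
    n_obs = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, marks)) / n_obs
    return sum(f * (x - mean) ** r for f, x in zip(freqs, marks)) / n_obs


class_marks = [44.95, 54.95, 64.95, 74.95, 84.95, 94.95, 104.95, 114.95]
freqs = [27, 124, 120, 115, 87, 25, 8, 1]

variance = central_moment(class_marks, freqs, 2)
# Assumed moment-based coefficients (not confirmed by the source).
skewness = central_moment(class_marks, freqs, 3) / variance ** 1.5
kurtosis = central_moment(class_marks, freqs, 4) / variance ** 2

print(f"variance = {variance:.2f}, skewness = {skewness:.3f}, kurtosis = {kurtosis:.3f}")
```

The first central moment is always zero, which makes a convenient check that the mean and the moment routine agree.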


Conclusion

We can verify that the values obtained with the formulae for grouped data are very close to the values obtained by considering the data as ungrouped

In many situations, describing data using class intervals is more insightful

Therefore these formulae can be useful for quick hand calculation

In this age of extensive computational power, these measures can be calculated without dividing the data into class intervals

Yet, these formulae are important from a theoretical point of view


Thank you
