
Handling Data

Summarizing Data
Numerical Summary Measures

Diriba D. (MPH-Epidemiology) 
Tel: 0917793542
E-mail: debaba.tolosa@gmail.com

August, 2022

04/19/2023 1
Numerical Summary Measures

• Although frequency distributions serve useful purposes, they
don’t summarize data by indicating the average value and the
spread of the values.
• Numerical summary measures are single numbers that quantify
the characteristics of a distribution of values.
– Measures of central tendency (MCT), also called measures of location
– Measures of dispersion (variability)


Characteristics of a good MCT


• An MCT is good or satisfactory if it possesses the following
characteristics:
1. It should be based on all the observations.
2. It should not be affected by extreme values.
3. It should have a definite value.
4. It should not require complicated or tedious calculations.
5. It should be capable of further algebraic treatment.
6. It should be stable with regard to sampling.
• The most common measures of central tendency include:
– Mean
– Median
– Mode
Properties of the Arithmetic Mean

• For a given set of data there is one and only one
arithmetic mean (uniqueness).
• It is easy to calculate and understand (simplicity).
• It is influenced by each and every value in a data set and
is hence affected by extreme values.
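
The effect of an extreme value on the mean can be seen in a small sketch. The numbers below are hypothetical and chosen only for illustration; they do not come from the slides.

```python
from statistics import mean

# Hypothetical values, for illustration only.
values = [2, 3, 4, 5, 6]
with_extreme = values + [100]      # the same data plus one extreme value

mean_without = mean(values)        # (2 + 3 + 4 + 5 + 6) / 5
mean_with = mean(with_extreme)     # (2 + 3 + 4 + 5 + 6 + 100) / 6

print(mean_without, mean_with)
```

Because every observation enters the sum, a single extreme value pulls the mean well away from the bulk of the data.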

Pros and cons of mean

Exercise 1
Calculate the mean of the following data:
1, 5, 4, 3, 2, 6, 7, 8, 8, 10

2. Median

• The median is the value which divides the data set into
two equal parts.
• If the number of values is odd, the median will be the
middle value when all values are arranged in order of
magnitude.
• When the number of observations is even, there is no
single middle value but two middle observations.
• In this case, the median is the mean of these two middle
observations, when all observations have been arranged
in the order of their magnitude.
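
The odd/even rule above can be sketched as a short function. The function name and the sample data are assumptions made for illustration; Python's standard library already offers statistics.median, which follows the same rule.

```python
def median(values):
    """Return the median using the odd/even rule described above."""
    ordered = sorted(values)          # arrange in order of magnitude
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd number of values: the single middle value.
        return ordered[middle]
    # Even number of values: mean of the two middle observations.
    return (ordered[middle - 1] + ordered[middle]) / 2

# Hypothetical data, for illustration only.
odd_result = median([3, 1, 4, 1, 5])        # ordered: 1, 1, 3, 4, 5
even_result = median([3, 1, 4, 1, 5, 9])    # ordered: 1, 1, 3, 4, 5, 9
```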
Solution

Properties of the median

• There is only one median for a given set of data
(uniqueness).
• The median is easy to calculate.
• The median is a positional average and hence it is
insensitive to very large or very small values.
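
A brief sketch of this insensitivity, using hypothetical values chosen only for illustration: replacing the largest value with a much larger one leaves the middle position, and therefore the median, unchanged.

```python
from statistics import median

# Hypothetical values, for illustration only.
values = [2, 3, 4, 5, 6]
with_extreme = [2, 3, 4, 5, 100]   # largest value replaced by an extreme one

median_before = median(values)
median_after = median(with_extreme)

print(median_before, median_after)
```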

Pros and Cons of median

Exercise 2
What is the median of each of the following data sets?
A) 7, 2, 5, 9, 10, 12, 16
B) 8, 7, 2, 5, 12, 16

3. Mode

• The mode is the value that occurs most often in a data set.
• It is not influenced by extreme values.
• It is possible to have more than one mode or
no mode.
• It can be used for either numerical or categorical data.
• It is not a good summary of the majority of the data.
• It may be used for describing qualitative data.
Example: the most frequently occurring type of diagnosis
in the GRH is the modal diagnosis.
Example 1
• Data are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6
Mode is ___________
Example 2
• Data are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8
There are ______ modes = _________
This distribution is said to be “bimodal”
Example 3
• Data are: 2.62, 2.75, 2.76, 2.86, 3.05, 3.12
Mode of these data is/are ________
Distributions are described by the number of modes they have:
– Unimodal: a distribution with one mode
– Bimodal: a distribution with two modes
– Trimodal: a distribution with three modes
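
One way to check answers to the three examples is a small helper that counts how often each value occurs. This is only a sketch: the function name is hypothetical, and it assumes the convention that a data set in which no value repeats has no mode.

```python
from collections import Counter

def modes(values):
    """Return the list of modes, or [] when every value occurs only once."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1 and len(counts) > 1:
        # Assumed convention: no value repeats, so there is no mode.
        return []
    return sorted(value for value, count in counts.items() if count == highest)

example_1 = modes([1, 2, 3, 4, 4, 4, 4, 5, 5, 6])
example_2 = modes([1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8])
example_3 = modes([2.62, 2.75, 2.76, 2.86, 3.05, 3.12])

print(example_1, example_2, example_3)
```

The length of the returned list tells whether the data are unimodal, bimodal, or trimodal.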

Pros and Cons of Mode

Which MCT is best for a given set of data?

• The mean can be used for discrete and continuous data.
• The median is appropriate for discrete and continuous
data as well, but can also be used for ordinal data.
• The mode can be used for all types of data, but may be
especially useful for nominal and ordinal measurements.

Measures of Dispersion

Range (R)

• The range is the difference between the largest and smallest
observations in a sample.
Range = Maximum value – Minimum value
Example: Data values: 5, 9, 12, 16, 23, 34, 37, 42
Range = 42 – 5 = 37
• A data set with a larger range exhibits more variability.

Example

Properties of range

• It is the simplest and crudest measure of dispersion and is
easily understood.
• It takes into account only two values, which makes it
a poor measure of dispersion.
• It is very sensitive to extreme values (outliers).
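
The sensitivity of the range to a single extreme value can be sketched with the data from the range example. The added value of 150 is hypothetical and used only for illustration, as is the function name.

```python
def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)

data = [5, 9, 12, 16, 23, 34, 37, 42]     # data from the range example
with_outlier = data + [150]               # hypothetical extreme value added

range_original = value_range(data)                # 42 - 5
range_with_outlier = value_range(with_outlier)    # 150 - 5

print(range_original, range_with_outlier)
```

Only the two end values matter, so one unusual observation changes the range completely while the other observations have no influence.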

Percentiles and quartiles

• Given a set of n observations x1, x2, …, xn, the pth percentile P
is the value of x such that p percent or less of the
observations are less than P and (100 – p) percent or less of the
observations are greater than P.

• Just as the median is the value above and below which lie
half the set of data, one can define measures above or
below which lie other fractional parts of the data.

• The median divides the data into two equal parts.
• Other location parameters divide the data into more parts:
quartiles into four and percentiles into one hundred.
Percentiles and quartiles
• The subscript on P serves to distinguish one percentile from
another (for example P25, P50, P75).
25th percentile = first quartile (Q1)
50th percentile = median (Q2)
75th percentile = third quartile (Q3)
Q1 = 0.25(n + 1)th ordered observation
Q2 = 0.5(n + 1)th ordered observation
Q3 = 0.75(n + 1)th ordered observation
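
The position formulas above can be sketched as follows. The function name is hypothetical, and the handling of fractional positions by linear interpolation between the two neighbouring ordered values is an assumption, since the formulas themselves do not say what to do when the position is not a whole number. The data are reused from the range example.

```python
import math

def ordered_observation(values, fraction):
    """Return the fraction*(n + 1)th ordered observation.

    Fractional positions are resolved by linear interpolation between the
    two neighbouring ordered values (an assumed convention).
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one observation")
    # 1-based position, kept inside the range 1..n.
    position = min(max(fraction * (n + 1), 1), n)
    lower = math.floor(position)      # whole part of the position
    remainder = position - lower      # fractional part of the position
    if lower == n:                    # already at the largest value
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + remainder * (above - below)

data = [5, 9, 12, 16, 23, 34, 37, 42]      # n = 8, so n + 1 = 9
q1 = ordered_observation(data, 0.25)       # position 0.25 * 9 = 2.25
q2 = ordered_observation(data, 0.50)       # position 0.50 * 9 = 4.5
q3 = ordered_observation(data, 0.75)       # position 0.75 * 9 = 6.75

print(q1, q2, q3)
```

Statistical packages use several slightly different quartile conventions, so results from other software may differ a little from this one.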

Quartiles

Inter-Quartile Range (IQR)

• The range provides only a crude measure of the
variability in a set of data (it is computed from only two
values).
• The IQR indicates the spread of the middle 50% of the
observations in a data set.
IQR = Q3 – Q1
• A large IQR indicates a large amount of variability
among the middle 50% of the observations and a
small IQR indicates a small amount of variability.
Example:
• Suppose the first and third quartiles for the weights of girls
12 months of age are 8.8 kg and 10.2 kg, respectively.
IQR = 10.2 kg – 8.8 kg = 1.4 kg
i.e., the middle 50% of the infant girls weigh between 8.8 and 10.2
kg.

Properties of IQR:

• It is a simple measure.
• It encloses the central 50% of the observations.
• Since it excludes the lowest 25% and the highest 25% of
values, it is largely unaffected by extreme values (outliers).
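
A rough sketch of this robustness, comparing the IQR with the range before and after adding one extreme value. It assumes Python's statistics.quantiles with method="exclusive", which places the quartiles at the 0.25(n + 1)th and 0.75(n + 1)th ordered positions and interpolates linearly; the added value of 150 is hypothetical.

```python
from statistics import quantiles

def iqr(values):
    """IQR = Q3 - Q1, with quartiles taken at the (n + 1)-based positions."""
    q1, _q2, q3 = quantiles(values, n=4, method="exclusive")
    return q3 - q1

data = [5, 9, 12, 16, 23, 34, 37, 42]      # data from the range example
with_outlier = data + [150]                # hypothetical extreme value added

iqr_original = iqr(data)
iqr_with_outlier = iqr(with_outlier)
range_original = max(data) - min(data)
range_with_outlier = max(with_outlier) - min(with_outlier)

print(iqr_original, iqr_with_outlier)
print(range_original, range_with_outlier)
```

The IQR shifts only slightly when the extreme value is added, while the range changes several-fold.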

Variance (σ², s²)
• The variance is the average of the squared
deviations of the individual values from the
mean of the set. For a sample, the sum of squared
deviations is divided by n – 1 rather than n.
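
The definition can be sketched step by step. The function name, the sample flag, and the data are assumptions made for illustration; the divisor n – 1 for a sample and n for a whole population follows the usual convention.

```python
def variance(values, sample=True):
    """Average squared deviation from the mean.

    sample=True divides by n - 1 (sample variance, s²);
    sample=False divides by n (population variance, σ²).
    """
    n = len(values)
    if n < (2 if sample else 1):
        raise ValueError("not enough observations")
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    divisor = n - 1 if sample else n
    return sum(squared_deviations) / divisor

# Hypothetical data, for illustration only (mean = 5).
data = [2, 4, 4, 4, 5, 5, 7, 9]
population_variance = variance(data, sample=False)   # 32 / 8
sample_variance = variance(data)                     # 32 / 7

print(population_variance, sample_variance)
```

The standard library functions statistics.pvariance and statistics.variance compute the same two quantities.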

• Variance is used to measure the dispersion of
values relative to the mean.
• When values are close to their mean (narrow
range) the dispersion is less than when there is
scattering over a wide range.

• The main disadvantage of the variance is that its unit is the
square of the unit of the original measurements.
• The variance of a distribution of weights is not expressed in
kg, but in kg².
Example: weight = 36.5 kg, S² = 257 kg²
• The variance gives more weight to extreme values than
to values near the mean, because the deviations
are squared.

Standard deviation (σ, s)
• The standard deviation is based on deviations from the
mean of the data.
• It is the square root of the variance.
• This produces a measure having the same
scale as that of the individual values.
• It is the most commonly used measure of dispersion.
• It shows variation about the mean.
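
As a small sketch of the square-root relationship, the variance quoted earlier for the weight example (S² = 257 kg²) can be converted back to the original unit. The variable names are chosen here for illustration.

```python
import math

variance_kg2 = 257                   # S² from the weight example, in kg²
sd_kg = math.sqrt(variance_kg2)      # standard deviation, back in kg

print(round(sd_kg, 2))
```

Taking the square root undoes the squaring of the unit, so the result can be read on the same scale as the weights themselves.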

Example 2

Properties of SD

• The SD has the advantage of being expressed in the
same units of measurement as the mean.

• The SD is considered to be the best measure of dispersion
and is used widely because of the properties of the
theoretical normal curve.

• However, if the units of measurement of the variables
in two data sets are not the same, their variability
can’t be compared by comparing the values of the SD.

Coefficient of variation (CV)

• When two data sets have different units of
measurement, or their means differ substantially in
size, the CV should be used as a measure of dispersion.

• The data set with the smaller coefficient of variation is considered
more consistent.
• It expresses the standard deviation as a percentage of the
mean: CV = (SD / mean) × 100%.
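
A minimal sketch of the CV for two data sets measured in different units. The function name and both data sets are hypothetical and used only for illustration; the sample standard deviation (divisor n – 1) is assumed.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV as a percentage: (standard deviation / mean) * 100."""
    average = mean(values)
    if average == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / average * 100

# Hypothetical measurements, for illustration only.
weights_kg = [60, 65, 70, 75, 80]
heights_cm = [150, 160, 170, 180, 190]

cv_weight = coefficient_of_variation(weights_kg)
cv_height = coefficient_of_variation(heights_cm)

print(round(cv_weight, 1), round(cv_height, 1))
```

In this made-up example the heights have the larger SD but the smaller CV, which is why the SDs of variables in different units should not be compared directly.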

Example

“Give a man a fish, and you feed him for a day. Teach a
man to fish, and you feed him for a lifetime.”
Thank you