
Biostatistics

Data Presentation
• Dr. Dharmendra Dubey
• Assistant professor
• Symbiosis Institute of health sciences
Statistics

• Statistics deals with the methods of data collection, compilation,
tabulation and analysis to provide meaningful and valid
interpretations. In other words, it deals with the scientific treatment
of data derived from individuals.
Statistics

• Statistics is the science which deals with the collection, classification and
tabulation of numerical facts as the basis for the explanation, description
and comparison of phenomena. (Lovitt)
Biostatistics
• Statistical methods applied in the fields of medicine, biology and
public health are termed ‘biostatistics’, also called ‘biometry’,
which means the ‘measurement of life’.

• Biostatistics is also known by the following names:


• - Medical statistics,
• - Health statistics and
• - Vital statistics
Medical statistics

• Statistics related to clinical and laboratory parameters, their relationship,


prediction after treatment, clinical trials, bioassays, diagnostic analysis,
quality control, etc. may be included in ‘medical statistics’.


Health statistics

• Statistics related to the health of the people in a community;


epidemiology of disease; association of socioeconomic and
demographic variables, personality and behavioral variables,
environmental factors and nutrition with the occurrence of various
diseases; control and prevention of disease, promotion of health, etc.,
are included in ‘health statistics’.
Vital statistics

• Statistics related to the vital events in life – birth, illness, death, marriage,
divorce, adoption, etc. – their rates of occurrence, causes of increase or
decrease in the vital rates, expectation of life at birth and at a given age, etc.
are included in vital statistics.
Examples in Biostatistics
• Smoking causes cancer: “But not everyone who smokes gets cancer, and some
people who never smoke get cancer.” The cause and effect chain has an element of
uncertainty in it: studies investigating the effects of smoking need to appropriately
address this uncertainty.
Examples in Biostatistics

• In designing health facilities, planners need to take into account demographic shifts,
changes in “standard of care,” economics, and new technology. The uncertainties
associated with these variables need to be considered.
Tabulation:

Tabulation is the process of summarizing classified or grouped data in the


form of a table so that it is easily understood and an investigator is quickly able to locate
the desired information.
Advantages of Tabulation:

Statistical data arranged in a tabular form serve the following objectives:

1. It simplifies complex data and the data presented are easily understood.
2. It facilitates comparison of related facts.
3. It facilitates computation of various statistical measures like averages,
dispersion, correlation etc.
4. Tabulated data are good for references and they make it easier to present the
information in the form of graphs and diagrams.
Preparing a Table:
An ideal table should consist of the following main parts:

1. Table number
2. Title of the table
3. Captions or column headings
4. Stubs or row designation
5. Body of the table
6. Footnotes
7. Sources of data
A model structure of a table is given below:
Type of Tables:
Tables may be classified as follows:

1. Simple or one-way table

2. Two way table

3. Manifold table
Simple or one-way Table:

For example:

The blank table given below may be used to show the number of adults in
different occupations in a locality.
Two-way Table:

Example:
The caption may be further divided in respect of ‘sex’.
Manifold Table:

Example:
The table below shows three characteristics, namely occupation, sex and
marital status.
FREQUENCY DISTRIBUTION

A frequency distribution is a series in which observations with similar or
closely related values are put into separate groups, with the groups arranged
in order of magnitude.
A frequency distribution is constructed for three main reasons:

1. To facilitate the analysis of data.


2. To estimate frequencies of the unknown population distribution from the
distribution of sample data and
3. To facilitate the computation of various statistical measures
Type of frequency distribution:

There are two types of frequency distribution:


1. Discrete (or) Ungrouped frequency distribution
2. Continuous frequency distribution
Type of frequency distribution:

Discrete (or) Ungrouped frequency distribution Continuous frequency distribution


Nature of class:

1. Class limits
2. Class Interval
3. Width or size of the class interval
4. Range
5. Mid-value or mid-point
6. Frequency
7. Number of class intervals
8. Size of the class interval
Class limits:

The class limits are the lowest and the highest values that can be
included in the class.

For example, take the class 30-40. The lowest value of the class is 30 and
the highest value is 40.
Class Interval

The class interval may be defined as the size of each grouping of data.

For example, 50-75, 75-100, 100-125… are class intervals.


Width or size of the class interval

The difference between the lower and upper class limits is
called the width or size of the class interval and is denoted by ‘C’.

Range:

The difference between the largest and smallest values of the observations is called the
range and is denoted by ‘R’, i.e.
R = Largest value - Smallest value
R = L - S
Mid-value or mid-point:

The central point of a class interval is called the mid-value or mid-point. It is found
by adding the upper and lower limits of a class and dividing the sum by 2.

For example, if the class interval is 20-30 then the mid-value is (20 + 30)/2 = 25.


Frequency:

The number of observations falling within a particular class interval is called the
frequency of that class.

Let us consider the frequency distribution of weights of persons working in a
company.
Number of class intervals

The number of class intervals can vary from 5 to 15.


The number of classes can be determined by the formula,
$K = 1 + 3.322 \log_{10} N$
Where
N = Total number of observations
$\log_{10}$ = logarithm to base 10
K = Number of class intervals.
Number of class intervals

Thus, if the number of observations is 10, then the number of class intervals is

$K = 1 + 3.322 \log_{10} 10 = 4.322 \approx 4$

If 100 observations are being studied, the number of class intervals is

$K = 1 + 3.322 \log_{10} 100 = 7.644 \approx 8$


and so on.
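The two evaluations above can be checked with a short Python sketch of the formula K = 1 + 3.322 log10 N. The function name `sturges_classes` is chosen here for illustration and is not part of the slides.

```python
import math

def sturges_classes(n_observations):
    """Return K = 1 + 3.322 * log10(N) before rounding."""
    if n_observations < 1:
        raise ValueError("need at least one observation")
    return 1 + 3.322 * math.log10(n_observations)

for n in (10, 100):
    k = sturges_classes(n)
    # Print N, the unrounded K, and K rounded to the nearest whole class.
    print(n, round(k, 3), round(k))
```

Rounding to the nearest integer reproduces the 4 and 8 classes quoted above; in practice the result is only a starting point and the analyst may still adjust it to stay within the 5 to 15 range.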
Types of class intervals:

There are three methods of classifying the data according to class intervals
namely,

1. Exclusive method
2. Inclusive method
3. Open-end classes
Exclusive method:

When the class intervals are so fixed that the upper limit of one class is the
lower limit of the next class; it is known as the exclusive method of classification.
The following data are classified on this basis.
Inclusive method:

In this method, the overlapping of the class intervals is avoided. Both the lower and
upper limits are included in the class interval.

This type of classification may be used for a grouped frequency
distribution of a discrete variable, such as the number of members in a family or
the number of workers in a factory, where the variable takes only integral values.

It cannot be used with variables that take fractional values, such as age, height or weight.
Inclusive method:

This method may be illustrated as follows:


Open end classes:

In open-end classes, a class limit is missing at the lower end of the first class
interval, at the upper end of the last class interval, or at both.
Types of class intervals:

Let us consider the weights in kg of 50 college students.

Construct a frequency distribution table using suitable class interval.


Solution:

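Given that,
Number of college students, N = 50
Highest value, H = 64
Lowest value, L = 32
Range, R = H - L = 64 - 32 = 32

Using the formula for the number of classes, $K = 1 + 3.322 \log_{10} 50 \approx 6.64 \approx 7$,
and the size of each class is $R/K = 32/7 \approx 4.57 \approx 5$.

Thus the number of class intervals is 7 and the size of each class is 5.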

The required frequency distribution is prepared using tally marks as given below:
Percentage Frequency:

A frequency from a frequency distribution is called a percentage
frequency if it is expressed as a percentage of the total frequency, that is

$\text{Percentage frequency of a class} = \dfrac{\text{Frequency of the class}}{\text{Total frequency}} \times 100$
Percentage Frequency:

An example is given below to construct a percentage frequency table.

$\text{Percentage frequency of a class} = \dfrac{\text{Frequency of the class}}{\text{Total frequency}} \times 100$
Relative Frequency:

Relative frequency is the ratio of the frequency of a certain class to the
total frequency of the frequency distribution.

$\text{Relative frequency} = \dfrac{\text{Frequency of the class}}{\text{Total frequency}}$
Relative Frequency:

For example, the relative frequencies of a frequency distribution are given below
(relative frequency = frequency of the class / total frequency):

Class    Frequency   Relative frequency
30-35    4           4/44 = 0.09
35-40    10          0.23
40-45    20          0.45
45-50    8           0.18
50-55    2           0.05
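As a check on the relative frequency column, the following Python sketch recomputes the relative and percentage frequencies from the class frequencies in the table. The variable names are illustrative only.

```python
classes = ["30-35", "35-40", "40-45", "45-50", "50-55"]
frequencies = [4, 10, 20, 8, 2]

total = sum(frequencies)                       # total frequency
relative = [f / total for f in frequencies]    # frequency / total
percentage = [100 * r for r in relative]       # relative frequency x 100

for c, f, r, p in zip(classes, frequencies, relative, percentage):
    print(f"{c:>6} {f:>3} {r:.2f} {p:.1f}%")
```

The relative frequencies always sum to 1 (and the percentage frequencies to 100), which is a convenient check on a hand-built table.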
Cumulative Frequency:

The total frequency of all values less than the upper class boundary of a
given class interval is called the cumulative frequency up to and including that class
interval.
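A minimal Python sketch of this definition, reusing the class frequencies 4, 10, 20, 8, 2 (classes 30-35 to 50-55) from the relative frequency table in this section; the running total is taken with `itertools.accumulate`.

```python
from itertools import accumulate

upper_boundaries = [35, 40, 45, 50, 55]
frequencies = [4, 10, 20, 8, 2]

# Running total of the frequencies: each entry counts all values
# below the corresponding upper class boundary.
cumulative = list(accumulate(frequencies))

for boundary, cf in zip(upper_boundaries, cumulative):
    print(f"less than {boundary}: {cf}")
```

The last cumulative frequency equals the total frequency, here 44.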
Cumulative Frequency:
Diagrams:

A diagram is a visual form for presentation of statistical data, highlighting their basic
facts and relationship.
Significance of Diagrams and Graphs:
Diagrams and graphs are extremely useful because of the following reasons:
1. They are attractive and impressive.
2. They make data simple and intelligible.
3. They make comparison possible
4. They save time and labour.
5. They have universal utility.
6. They give more information.
7. They have a great memorizing effect.
Types of diagrams:

They may be divided under the following heads:

1. One-dimensional diagrams
2. Two-dimensional diagrams
3. Three-dimensional diagrams
4. Pictograms and Cartograms
One-dimensional diagrams:

These diagrams are in the form of bar or line charts and can be classified
as:

1. Line Diagram
2. Simple Bar Diagram
3. Multiple Bar Diagram
4. Sub-divided Bar Diagram
5. Percentage Bar Diagram
Line graphs or line diagrams
• Data having some order such as the year-wise immunization
coverage or age-wise incidence of a disease can be represented by
a line diagram.
Bar diagram

• Qualitative or categorical data can also be represented by another diagram called


the bar diagram.

• In this diagram, we show the category of the variable on the X-axis and the
frequencies on the Y-axis on a graph paper.
Bar diagram
• A bar for each category of the variable is erected and the height of the bar is
proportional to the frequency of that category.

• Each bar should have an equal width.

• Since the data is of a qualitative nature (discrete), bars should not be next to each
other and there should be an equal gap between two successive bars.
Simple Bar Diagram:
Adjacent bar diagram
• Bar diagrams can be extended to compare two or more data sets (of a qualitative
nature) with regard to the same variable. The bars of each category of the variable
for the different data sets are drawn adjacent to each other.

• The principles of bar width and the gap between the bars will remain similar to
those in the simple bar diagram.
Multiple Bar Diagram:
Component bar diagram
• The adjacent bar diagram does not convey anything about the
similarity of the relative proportions of each type in the two
centers being compared.

• To depict this aspect, we can present the data in another form of


the bar diagram called the component bar diagram. As the name
indicates, all the components of data form the bars being
compared.
Sub-divided Bar Diagram:
Percentage bar diagram:
Percentage bar diagram:
Pie chart

• The pie diagram is generally used to represent qualitative or categorical data.

• Before beginning to represent the data by a pie diagram, we should have the frequencies of
different categories at hand.

• For each category, find its proportionate share of the total of 360° in the circle.

• For example, if a category has 25% of the total frequency, it should be allotted 25% of 360°, that is
90°. So a sector with an angle of 90° represents this category of the variable.
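The conversion from frequencies to sector angles can be sketched in Python as follows. The category names and frequencies are hypothetical and chosen only so that one category holds 25% of the total.

```python
def pie_angles(frequencies):
    """Map each category to its sector angle in degrees (360 * f / total)."""
    total = sum(frequencies.values())
    return {name: 360 * f / total for name, f in frequencies.items()}

# Hypothetical data: category "A" has 25 of 100 observations (25%).
angles = pie_angles({"A": 25, "B": 50, "C": 25})
for name, angle in angles.items():
    print(name, angle)
```

The angles always add up to 360°, so a quick sum is a useful check before drawing the chart.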
Pie Diagram

Draw a Pie diagram for the following data of production of sugar in quintals of
various countries.
Pie Diagram
Graphs:

A graph is a visual form of presentation of statistical data. A graph is more
attractive than a table of figures.
Some important types of graphs:
1. Histogram
2. Frequency Polygon
3. Frequency Curve

4. Ogive
Histogram
• The most popular of all diagrams is the histogram, which is used to depict the
frequency distribution of a quantitative variable.

• The class intervals are shown on the horizontal axis (X-axis) and corresponding
frequencies in the form of vertical bars on the vertical axis (Y-axis).

• The width of each bar need not be the same as it depends on the width of the
class intervals.
Histogram:
Example: Draw a histogram for the following data.
Solution:
Histogram:
Example: Draw a histogram for the following data.
Example:

For the following data, draw the histogram.


Histogram:
Solution:
For drawing a histogram, the frequency distribution should be continuous. If it is not
continuous, then first make it continuous as follows.
Example:
For the following data, draw the histogram.
Solution:
When the class intervals are unequal, a
correction for unequal class intervals must be
made.

The frequencies are adjusted as follows:

The frequency of the class 30-50 is divided by 2,
since its width is double the common class width.
Similarly, the frequency of the class 50-80 is
divided by 3, since its width is three times the common class width.
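The adjustment can be sketched in Python as below. Only the class limits 30-50 and 50-80 come from the slide; the other classes, all the frequencies, and the common class width of 10 are assumptions made for illustration.

```python
# (lower limit, upper limit, frequency); frequencies are hypothetical.
classes = [(10, 20, 4), (20, 30, 6), (30, 50, 10), (50, 80, 9)]
base_width = 10  # assumed common class width

def adjusted_frequency(lower, upper, frequency, base_width):
    """Divide the frequency by how many base widths the class spans."""
    return frequency / ((upper - lower) / base_width)

adjusted = [adjusted_frequency(lo, hi, f, base_width) for lo, hi, f in classes]
for (lo, hi, f), adj in zip(classes, adjusted):
    print(f"{lo}-{hi}: frequency {f}, bar height {adj}")
```

With these bar heights, the area of each histogram bar stays proportional to the class frequency, which is the purpose of the correction.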
Histogram
Frequency Polygon
• To facilitate comparison, another diagram called the frequency
polygon is used.

• This diagram is made by simply joining the mid-points of the
tops of all the bars with straight lines, and then removing the
bars.

• It is fairly easy to draw, and useful in comparing two or more
distributions, as each can be represented on the same graph.
Frequency Polygon:
Example: Draw a frequency polygon for the following data.
Solution:
Frequency Curve:

If the mid-points of the upper boundaries of the rectangles of a histogram are
connected by a smooth freehand curve, then that diagram is called a frequency curve. The
curve should begin and end at the base line.
Example:
Draw a frequency curve for the following data.
Solution:
Ogives:
For a set of observations, we know how to construct a frequency distribution.
In some cases we may require the number of observations less than
a given value or more than a given value.

There are two methods of constructing an ogive, namely:

1. The ‘less than ogive’ method

2. The ‘more than ogive’ method.
Example
Draw the Ogives for the following data.
Solution:
Tabular and Graphical Procedures
Data

Qualitative Data

Tabular Methods
• Frequency Distribution
• Relative Freq. Distribution
• Percent Freq. Distribution
• Crosstabulation

Graphical Methods
• Bar Graph
• Pie Chart

Quantitative Data

Tabular Methods
• Frequency Dist.
• Rel. Freq. Dist.
• % Freq. Dist.
• Cum. Freq. Dist.
• Cum. Rel. Freq. Distribution
• Cum. % Freq. Distribution
• Crosstabulation

Graphical Methods
• Dot Plot
• Histogram
• Ogive
• Stem-and-Leaf Display
• Scatter Diagram
Measures of Dispersion

• Dr. Dharmendra Dubey


• Assistant Professor
• Symbiosis Institute of Health Sciences, Pune
Measures of Dispersion

• The distance of different values from the central value is


called dispersion.
Definition
• Measures of dispersion are descriptive statistics that describe
how similar a set of scores are to each other
• The more similar the scores are to each other, the lower the
measure of dispersion will be

• The less similar the scores are to each other, the higher the
measure of dispersion will be

• In general, the more spread out a distribution is, the larger


the measure of dispersion will be
Classification of measures of dispersion

Absolute measures of dispersion    Relative measures of dispersion
Range                              Coefficient of range
Mean deviation                     Coefficient of mean deviation
Standard deviation                 Coefficient of variation
Quartile deviation                 Coefficient of quartile deviation
The four important absolute measures of dispersion are as follows:

i. Range
ii. Mean or average deviation
iii. Standard deviation
iv. Quartile deviation
When data sets are measured in different units, relative measures of dispersion may be
used. A relative measure of dispersion is independent of the original units. Generally, relative
measures of dispersion are expressed as a ratio or a percentage.

The relative measures of dispersion are as follows:


1. Coefficient of range
2. Coefficient of mean deviation
3. Coefficient of variation
4. Coefficient of quartile deviation
The range of a set of observations is the
difference between the two extreme values, i.e. the
difference between the maximum and
minimum values.
Therefore, it indicates the limits within which all observations
fall.
In the form of an equation:
Range = Highest value - Lowest value
What is the RANGE?
How do we find it?
• The RANGE is the difference between the lowest and
highest values.
Data: 63 73 84 86 88 95 97 97

97 - 63 = 34
34 is the RANGE, or spread, of this set of data.
Let us consider a set of observations $x_1, x_2, x_3, \dots, x_n$, where $X_H$ is the
maximum and $X_L$ is the minimum.
Then Range = $X_H - X_L$.

Example:
Find out the range of the set of observations,
-7, -2, -4, 0, 8.

Solution:
Here, maximum value, $X_H = 8$ and
minimum value, $X_L = -7$
Therefore, Range = $X_H - X_L = 8 - (-7) = 15$
Range For Grouped data:
In this case, the range is the difference between the upper boundary of the highest
class and the lower boundary of the lowest class.
Then Range = $X_U - X_L$
Where,
$X_U$ = The upper boundary of the highest class.
$X_L$ = The lower boundary of the lowest class.
Example: Determine the range from the following frequency distribution.

Salary (TK.)     1700-1800   1800-1900   1900-2000   2000-2100   2100-2200
No. of workers   420         460         500         300         200

Solution:
From the given frequency distribution, we have,
The upper boundary of the highest class, $X_U$ = TK. 2200
and the lower boundary of the lowest class, $X_L$ = TK. 1700
Then Range = $X_U - X_L$ = TK. 2200 - TK. 1700 = TK. 500
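Both forms of the range can be sketched in Python. The helper names `range_ungrouped` and `range_grouped` are illustrative; the data are the two examples above (the observations -7, -2, -4, 0, 8 and the salary classes).

```python
def range_ungrouped(values):
    """Range = highest value - lowest value."""
    return max(values) - min(values)

def range_grouped(class_boundaries):
    """Range = upper boundary of highest class - lower boundary of lowest class.

    class_boundaries is a list of (lower, upper) pairs.
    """
    lowest = min(lower for lower, _ in class_boundaries)
    highest = max(upper for _, upper in class_boundaries)
    return highest - lowest

observations = [-7, -2, -4, 0, 8]
salary_classes = [(1700, 1800), (1800, 1900), (1900, 2000),
                  (2000, 2100), (2100, 2200)]

print(range_ungrouped(observations))   # 15
print(range_grouped(salary_classes))   # 500
```

Note that the grouped range uses only the class boundaries, so the class frequencies play no part in the calculation.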
Advantages of Range:

1. It is the simplest measure of dispersion.


2. It is easy to understand and calculate the range.
3. The important merit of range is that it gives us a quick idea of the variability of a set
of data.
4. It does not depend on the measures of central tendency.
Disadvantages of Range:

1. It is influenced by the extreme values.


2. It cannot be computed for open end distribution.
3. It is not suitable for further mathematical treatment.
Use of Range:

1. It is time saving and widely used in industrial quality control and weather forecasting.
2. Variations in the stock exchange can be studied by range.
Measures of Dispersion
Mean deviation
Mean deviation or Average deviation:
Definition of Mean deviation:
Mean deviation is the mean of absolute deviations of the items from an average
like mean, median or mode. Normally, we consider the arithmetic mean as the
average.
1. Mean deviation for ungrouped data:
If $x_1, x_2, x_3, \dots, x_n$ is a set of n observations or values, then the mean
deviation is expressed and defined as:
i. Mean deviation about the arithmetic mean, $M.D.(\bar{x}) = \dfrac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$; $\bar{x}$ = arithmetic mean
ii. Mean deviation about the median, $M.D.(M_e) = \dfrac{\sum_{i=1}^{n} |x_i - M_e|}{n}$; $M_e$ = median
iii. Mean deviation about the mode, $M.D.(M_o) = \dfrac{\sum_{i=1}^{n} |x_i - M_o|}{n}$; $M_o$ = mode
The Mean Deviation (cont.)
Example:
The number of patients seen in the emergency room at Ibrahim Memorial Hospital for a
sample of 5 days last year were: 103, 97, 101, 106 and 103.
Determine the mean deviation and interpret.
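A short Python sketch of the mean deviation about the arithmetic mean, applied to the five daily patient counts above. The function name `mean_deviation` is illustrative.

```python
def mean_deviation(values):
    """Mean of the absolute deviations from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

patients = [103, 97, 101, 106, 103]
mean = sum(patients) / len(patients)   # 510 / 5 = 102
md = mean_deviation(patients)          # (1 + 5 + 1 + 4 + 1) / 5

print(mean, md)
```

One way to read the result: on average, the number of patients seen per day differs from the mean of 102 by about 2.4 patients.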
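Example:
Calculate the mean deviation from the following data.

Income (TK.)     0-10   10-20   20-30   30-40   40-50   50-60   60-70
No. of persons   6      8       10      12      7       4       3

Solution:
Table for calculation of mean deviation from the mean

Income (TK.)   Class mid-point (x)   No. of persons (f)   $fx$   $|x - \bar{x}|$   $f|x - \bar{x}|$
0-10           5                     6                    30     26                156
10-20          15                    8                    120    16                128
20-30          25                    10                   250    6                 60
30-40          35                    12                   420    4                 48
40-50          45                    7                    315    14                98
50-60          55                    4                    220    24                96
60-70          65                    3                    195    34                102
Total                                $\sum f_i = 50$      $\sum f_i x_i = 1550$      $\sum f_i |x_i - \bar{x}| = 688$

Arithmetic mean, $\bar{x} = \dfrac{\sum f_i x_i}{\sum f_i} = \dfrac{1550}{50} = 31$
Mean deviation, $M.D.(\bar{x}) = \dfrac{\sum f_i |x_i - \bar{x}|}{\sum f_i} = \dfrac{688}{50} = 13.76$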
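The grouped calculation in the table can be reproduced with the following Python sketch, which uses the class mid-points as representative values. Variable names are illustrative.

```python
# (lower limit, upper limit, number of persons) from the income table
classes = [(0, 10, 6), (10, 20, 8), (20, 30, 10), (30, 40, 12),
           (40, 50, 7), (50, 60, 4), (60, 70, 3)]

midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]

n = sum(freqs)                                            # sum of f
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n   # sum of fx / sum of f
abs_dev_total = sum(f * abs(x - mean) for f, x in zip(freqs, midpoints))
mean_dev = abs_dev_total / n

print(n, mean, abs_dev_total, mean_dev)
```

The totals agree with the table: 50 persons, a weighted sum of 1550 (mean 31), and a total absolute deviation of 688, giving a mean deviation of 13.76.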
Advantages of Mean Deviation:

1. It is easy to calculate and understand.


2. It is based on all the observations.
3. It is a useful measure of dispersion.
4. It is not greatly affected by extreme values.
Disadvantages of Mean Deviation:

1. It is not suitable for further mathematical treatment.

2. Algebraic positive and negative signs are ignored.

Uses of mean deviation:


1. It is used in certain economic and anthropological studies.
Example
Calculate mean deviation for grouped data

Systolic BP (mmHg) Frequency (f)

90 - 100 3
100 – 110 5
110 – 120 7
120 – 130 10
130 – 140 15
140 – 150 11
150 – 160 9
160 – 170 6
170 – 180 2
Total 68
Measures of Dispersion
Standard deviation
Standard Deviation:
The standard deviation is the most important measure of dispersion.
Definition:
The standard deviation is the positive square root of the mean of the squared deviations
of a set of observations from their mean. It can be written as

$\text{Standard deviation, } \sigma = \sqrt{\dfrac{\text{sum of squares of deviations from the mean}}{\text{total no. of observations}}} = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n}}$
Standard Deviation s
• Most commonly used measure of variation
• Shows variation about the mean
• Is the square root of the variance
• Has the same units as the original data

• Sample standard deviation: $S = \sqrt{\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$


For A Population:
Standard Deviation σ
• Most commonly used measure of variation
• Shows variation about the mean
• Is the square root of the population variance
• Has the same units as the original data

• Population standard deviation: $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$


Sample Standard Deviation:
Calculation Example
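Sample Data (Xi) : 10 12 14 15 17 18 18 24

n = 8    Mean = $\bar{X}$ = 16

$S = \sqrt{\dfrac{(10-16)^2 + (12-16)^2 + (14-16)^2 + \dots + (24-16)^2}{8 - 1}} = \sqrt{\dfrac{130}{7}} \approx 4.31$

A measure of the “average” scatter around the mean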
Comparing Standard Deviations

Smaller standard deviation

Larger standard deviation


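Example:
Find the standard deviation for the following data.
75, 73, 70, 77, 72, 75, 76, 72, 74, 76

Solution:
Table for calculation of standard deviation

Values (x)   $(x - \bar{x})$   $(x - \bar{x})^2$
75           1                 1
73           -1                1
70           -4                16
77           3                 9
72           -2                4
75           1                 1
76           2                 4
72           -2                4
74           0                 0
76           2                 4
$\sum x_i = 740$               $\sum (x - \bar{x})^2 = 44$

Arithmetic mean, $\bar{x} = \dfrac{740}{10} = 74$
Standard deviation, $\sigma = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}} = \sqrt{\dfrac{44}{10}} \approx 2.10$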
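The column totals of this table can be recomputed with a few lines of Python. The sketch divides by n, following the definition of the standard deviation given earlier in this section; dividing by n - 1 instead would give the sample standard deviation.

```python
import math

values = [75, 73, 70, 77, 72, 75, 76, 72, 74, 76]

n = len(values)
mean = sum(values) / n                                   # 740 / 10
squared_devs = [(x - mean) ** 2 for x in values]         # (x - mean)^2 column
sum_sq = sum(squared_devs)
sigma = math.sqrt(sum_sq / n)                            # divide by n

print(mean, sum_sq, round(sigma, 2))
```

Summing the squared deviation column row by row (1 + 1 + 16 + 9 + 4 + 1 + 4 + 4 + 0 + 4) gives 44, so the standard deviation is the square root of 4.4, roughly 2.10.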
Example:
Calculate the standard deviation from the following data.

Income (TK.)     0-10   10-20   20-30   30-40   40-50   50-60   60-70
No. of persons   6      8       10      12      7       4       3
Advantages of Standard deviation:

1. It is rigidly defined.
2. It is less affected by sampling fluctuations.
3. It is useful for calculating the skewness, kurtosis,
coefficient of correlation, coefficient of variation and so on.
4. It measures the consistency of data.
Disadvantages of standard deviation:
1. It is not so easy to compute.
2. It is affected by extreme values.

Use of standard deviation:

1. It is useful for calculating the skewness, kurtosis,
coefficient of correlation, coefficient of variation
and so on.
2. It measures the consistency of data.
Example
Calculate standard deviation for grouped data

Systolic BP (mmHg) Frequency (f)

90 - 100 3
100 – 110 5
110 – 120 7
120 – 130 10
130 – 140 15
140 – 150 11
150 – 160 9
160 – 170 6
170 – 180 2
Total 68
Measures of Dispersion
Variance
The Variance
• Average (approximately) of squared deviations of values from the mean

• Sample variance: $S^2 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$

Where $\bar{X}$ = arithmetic mean
n = sample size
$X_i$ = ith value of the variable X
Variance: The variance of a set of observations is the average of the
squares of the deviations of the observations from their mean. In
symbols, the (sample) variance of the n observations $x_1, x_2, \dots, x_n$ is

$s^2 = \dfrac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

Variance of 5, 7, 3? Mean is (5+7+3)/3 = 5 and the variance is

$s^2 = \dfrac{(5-5)^2 + (7-5)^2 + (3-5)^2}{3 - 1} = \dfrac{8}{2} = 4$

Standard Deviation: Square root of the variance. The standard deviation
of the above example is 2.
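For the data 5, 7, 3, a standard deviation of exactly 2 corresponds to dividing the sum of squared deviations by n - 1 (the sample variance). The sketch below uses Python's standard `statistics` module to show both divisors side by side; the variable names are illustrative.

```python
import statistics

data = [5, 7, 3]

sample_variance = statistics.variance(data)        # divides by n - 1
sample_sd = statistics.stdev(data)                 # square root of the above
population_variance = statistics.pvariance(data)   # divides by n

print(sample_variance, sample_sd, population_variance)
```

With n - 1 in the denominator the variance is 8/2 = 4 and the standard deviation is 2; with n it would be 8/3, about 2.67.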
What Does the Variance Formula Mean?

• One of the definitions of the mean was that it always made the sum of the scores minus the mean equal
to 0

• Thus, the average of the deviates must be 0 since the sum of the deviates must equal 0

• To avoid this problem, statisticians square the deviate score prior to averaging them

• Squaring the deviate score makes all the squared scores positive
What Does the Variance Formula Mean?

• Variance is the mean of the squared deviation scores

• The larger the variance is, the more the scores deviate, on average, away from the mean

• The smaller the variance is, the less the scores deviate, on average, from the mean

Relative Measures of Dispersion
Example:
Calculate the coefficient of range from the following frequency distribution.

Age (Year)   10-20   20-30   30-40   40-50   50-60   60-70   70-80
Frequency    4       7       15      18      16      12      8
Coefficient of Variation (CV):
Coefficient of variation is the most commonly used relative measure of
dispersion. It is 100 times the ratio of the standard deviation to the arithmetic mean. It is
denoted by C.V. and written as:

$C.V. = \dfrac{\sigma}{\bar{x}} \times 100; \quad \bar{x} \neq 0$
The Coefficient of Variation

• 100 times the co-efficient of dispersion based upon standard deviation is called co-
efficient of variation (C.V.).

• C.V. is the percentage variation in the mean, standard deviation being considered as
the total variation in the mean.
• For comparing the variability of two series, we calculate the C.V. for each series.
• The series having greater C.V. is said to be more variable than the other and the series
having lesser C.V. is said to be more consistent (or homogenous) than the other.
Example

• Find the coefficient of variation of the following sample set of numbers:
{1, 5, 6, 8, 10, 40, 65, 88}.
Solution:

Sample mean = (1 + 5 + 6 + 8 + 10 + 40 + 65 + 88)/8 = 223/8 = 27.875

Sum of squared deviations
= (1−27.875)² + (5−27.875)² + (6−27.875)² + (8−27.875)² + (10−27.875)² + (40−27.875)²
+ (65−27.875)² + (88−27.875)² = 7578.875

Sample variance = 7578.875/(8 − 1) = 1082.696

Standard deviation:
s = √1082.696 = 32.904

Coefficient of variation = 32.904/27.875 = 1.180, i.e. about 118.0%
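The same calculation can be sketched with Python's standard `statistics` module, whose `stdev` function uses the n - 1 divisor, matching the sample variance 7578.875/7 in the worked solution.

```python
import statistics

sample = [1, 5, 6, 8, 10, 40, 65, 88]

mean = statistics.mean(sample)     # 223 / 8
sd = statistics.stdev(sample)      # sample standard deviation (n - 1)
cv_percent = 100 * sd / mean       # coefficient of variation in percent

print(mean, round(sd, 3), round(cv_percent, 1))
```

A coefficient of variation above 100% means the standard deviation exceeds the mean, which reflects how widely these eight values are spread.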


Example

• A company has two sections with 40 and 65 employees respectively. Their
average weekly wages are $450 and $350. The standard deviations are 7 and
9.
• (i) Which section has a larger wage bill?
• (ii) Which section has larger variability in wages?
Example

• Solution:
(i) Wage bill for section A = 40 x 450 = 18000
Wage bill for section B = 65 x 350 = 22750
Section B has the larger wage bill.

• (ii) Coefficient of variation for Section A = 7/450 x 100 = 1.56%

Coefficient of variation for Section B = 9/350 x 100 = 2.57%

Section A has the smaller coefficient of variation and is therefore more consistent; there is greater variability in the wages of section B.
Sample statistics versus population parameters

Measure              Population parameter   Sample statistic
Mean                 $\mu$                  $\bar{X}$
Variance             $\sigma^2$             $S^2$
Standard deviation   $\sigma$               $S$
Summary Characteristics

• The more the data are spread out, the greater the range,
variance, and standard deviation.

• The less the data are spread out, the smaller the range,
variance, and standard deviation.

• If the values are all the same (no variation), all these
measures will be zero.

• None of these measures are ever negative.


Example:
Compute the coefficient of variation from the following
data:
Monthly Income 2501-5000 5001-7500 7501-10000 10001-12500 12501-15000
No. of Families 65 130 215 100 70
Example:
The run-scores of two cricketers for 10 innings are given
below:
Cricketer-A 114 45 0 31 75 102 198 8 0 7

Cricketer-B 15 25 18 30 11 4 23 21 31 22

Which of the two is the more consistent batsman?

Solution:
In order to find out which batsman is more consistent, we have to calculate the coefficient
of variation for each batsman.
Table for calculation of C.V.
Cricketer-A Cricketer-B
Score (x) 𝒙𝟐 Score (y) 𝒚𝟐
114 12996 15 225
45 2025 25 625
0 0 18 324
31 961 30 900
75 5625 11 121
102 10404 4 16
198 39204 23 529
8 64 21 441
0 0 31 961
7 49 22 484

$\sum x = 580$   $\sum x^2 = 71328$   $\sum y = 200$   $\sum y^2 = 4626$


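We know, Coefficient of Variation, $C.V. = \dfrac{\sigma}{\bar{x}} \times 100$

For Cricketer-A

Arithmetic Mean,
$\bar{x} = \dfrac{\sum x}{n} = \dfrac{580}{10} = 58$

Standard Deviation,
$\sigma = \sqrt{\dfrac{\sum x^2}{n} - \left(\dfrac{\sum x}{n}\right)^2} = \sqrt{\dfrac{71328}{10} - \left(\dfrac{580}{10}\right)^2} = 61.39$

Coefficient of Variation,
$C.V. = \dfrac{\sigma}{\bar{x}} \times 100 = \dfrac{61.39}{58} \times 100 = 105.84\%$

For Cricketer-B

Arithmetic Mean,
$\bar{y} = \dfrac{\sum y}{n} = \dfrac{200}{10} = 20$

Standard Deviation,
$\sigma = \sqrt{\dfrac{\sum y^2}{n} - \left(\dfrac{\sum y}{n}\right)^2} = \sqrt{\dfrac{4626}{10} - \left(\dfrac{200}{10}\right)^2} = 7.91$

Coefficient of Variation,
$C.V. = \dfrac{\sigma}{\bar{y}} \times 100 = \dfrac{7.91}{20} \times 100 = 39.56\%$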

Comment:
From the above result, we see that C.V. (A) = 105.84% and C.V. (B) = 39.56%.
Since C.V. (A) > C.V. (B), cricketer B is the more consistent batsman.
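The comparison can be sketched in Python using the same computational formula for the standard deviation (mean of the squares minus the square of the mean, with divisor n). The function name `cv_percent` is illustrative.

```python
import math

def cv_percent(scores):
    """Coefficient of variation in percent, using the population (n) divisor."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum(x * x for x in scores) / n - mean ** 2
    return 100 * math.sqrt(variance) / mean

cricketer_a = [114, 45, 0, 31, 75, 102, 198, 8, 0, 7]
cricketer_b = [15, 25, 18, 30, 11, 4, 23, 21, 31, 22]

cv_a = cv_percent(cricketer_a)
cv_b = cv_percent(cricketer_b)
more_consistent = "B" if cv_b < cv_a else "A"

print(round(cv_a, 1), round(cv_b, 1), more_consistent)
```

The unrounded values come out at roughly 105.8% and 39.6%; the small differences from the hand calculation arise because the hand calculation rounds the standard deviation to two decimals before dividing.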
Quartile Measures

• Quartiles split the ranked data into 4 segments with an


equal number of values per segment

25% 25% 25% 25%

Q1 Q2 Q3
• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75%
are larger

• Q2 is the same as the median (50% of the observations are smaller and 50% are larger)

• Only 25% of the observations are greater than the third quartile, Q3
Quartile Measures: Locating Quartiles

Find a quartile by determining the value in the


appropriate position in the ranked data, where

First quartile position: Q1 = (n+1)/4 ranked value

Second quartile position: Q2 = (n+1)/2 ranked value

Third quartile position: Q3 = 3(n+1)/4 ranked value

where n is the number of observed values


Quartile Measures: Locating Quartiles

Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22

(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data
so use the value half way between the 2nd and 3rd values,

so Q1 = 12.5
Q1 and Q3 are measures of non-central location
Q2 = median, is a measure of central tendency
Example

Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22

(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
so Q1 = (12+13)/2 = 12.5

Q2 is in the (9+1)/2 = 5th position of the ranked data,


so Q2 = median = 16

Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data,


so Q3 = (18+21)/2 = 19.5
Q1 and Q3 are measures of non-central location
Q2 = median, is a measure of central tendency
Quartile Measures: Calculation Rules

• When calculating the ranked position use the following


rules

• If the result is a whole number then it is the ranked position to


use

• If the result is a fractional half (e.g. 2.5, 7.5, 8.5, etc.) then
average the two corresponding data values.

• If the result is not a whole number or a fractional half then


round the result to the nearest integer to find the ranked
position.
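The position formula and the three calculation rules above can be sketched in Python as follows. The function name `quartile` and the minimum sample size check are choices made for this illustration; the data are the ordered array from the example (n = 9).

```python
def quartile(sorted_values, q):
    """Quartile q (1, 2 or 3) located at ranked position q * (n + 1) / 4."""
    n = len(sorted_values)
    if q not in (1, 2, 3) or n < 3:
        raise ValueError("q must be 1, 2 or 3 and n at least 3")
    position = q * (n + 1) / 4          # 1-based ranked position
    lower = int(position)
    fraction = position - lower
    if fraction == 0:                   # whole number: use that ranked value
        return sorted_values[lower - 1]
    if fraction == 0.5:                 # fractional half: average two neighbours
        return (sorted_values[lower - 1] + sorted_values[lower]) / 2
    # otherwise: round the position to the nearest integer
    return sorted_values[round(position) - 1]

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]
q1, q2, q3 = (quartile(data, q) for q in (1, 2, 3))
print(q1, q2, q3)
```

Statistical software often interpolates quartiles differently, so library functions may return slightly different values from this rule for the same data.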
Methods of Variability Measurement

Quartiles: Data can be divided into four regions that


cover the total range of observed values. Cut points
for these regions are known as quartiles.

In notation, the q-th quartile of a data set is the (q(n+1)/4)-th
observation of the ordered data, where q is the desired quartile (1, 2 or 3)
and n is the number of observations.
Methods of Variability Measurement

The first quartile (Q1) is the first 25% of the data. The
second quartile (Q2) is between the 25th and 50th
percentage points in the data. The upper bound of Q2 is
the median. The third quartile (Q3) is the 25% of the data
lying between the median and the 75% cut point in the
data.

Q1 is the median of the first half of the ordered


observations and Q3 is the median of the second half of
the ordered observations.
The Semi-Interquartile Range

• The semi-interquartile range (or SIR) is defined as the difference of


the first and third quartiles divided by two
• The first quartile is the 25th percentile
• The third quartile is the 75th percentile
• SIR = (Q3 - Q1) / 2
Example
Calculation of quartiles for grouped data

Systolic BP (mmHg) Frequency (f) Cumulative


frequency
90 - 100 3 3
100 – 110 5 8
110 – 120 7 15
120 – 130 10 25
130 – 140 15 40
140 – 150 11 51
150 – 160 9 60
160 – 170 6 66
170 – 180 2 68
Total 68
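Solution

• Lower quartile: $Q_1 = L_{q1} + \dfrac{W}{f}\left(\dfrac{n}{4} - CF\right)$

• Where
• we calculate n/4 = 68/4 = 17
• Lower quartile class interval = 120 – 130 mmHg (the first class whose cumulative frequency reaches 17)
• Lq1 = the lower limit of the lower quartile class = 120
• CF = the cumulative frequency of the classes before the lower quartile class = 15
• f = the frequency of the lower quartile class = 10
• W = the width of the lower quartile class = 10
• Lower quartile Q1 = 120 + [(10/10) X (17-15)] = 120 + 2 = 122 mmHg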
Solution

• Q3 = Lq3 + (W / f) × (3n/4 − CF)

• Where
• 3n/4 = 3 × 68/4 = 3 × 17 = 51, so Q3 lies in the first class whose cumulative frequency reaches 51
• Upper quartile class interval = 140 - 150 mmHg
• Lq3 = the lower limit of the upper quartile class = 140
• CF = the cumulative frequency of the classes below the upper quartile class = 40
• f = the frequency of the upper quartile class = 11
• W = the width of the upper quartile class = 10
• Upper quartile Q3 = 140 + [(10/11) × (51 − 40)] = 140 + 10 = 150 mmHg
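The two interpolation steps above can be sketched in code. This is a minimal illustration, assuming equal class widths; the function name `grouped_quantile` and its arguments are hypothetical names chosen for this sketch, not part of the lecture.

```python
def grouped_quantile(lower_limits, width, freqs, fraction):
    """Interpolate the value below which `fraction` of the observations fall.

    Implements L + (W / f) * (fraction * n - CF) for grouped data,
    assuming every class has the same width.
    """
    n = sum(freqs)
    target = fraction * n          # n/4 for Q1, 3n/4 for Q3
    cum = 0                        # cumulative frequency below the current class
    for lower, f in zip(lower_limits, freqs):
        if cum + f >= target:      # first class whose cumulative frequency reaches the target
            return lower + width * (target - cum) / f
        cum += f
    raise ValueError("fraction must lie between 0 and 1")


# Systolic BP table from the example (class lower limits in mmHg)
lower_limits = [90, 100, 110, 120, 130, 140, 150, 160, 170]
freqs = [3, 5, 7, 10, 15, 11, 9, 6, 2]

q1 = grouped_quantile(lower_limits, 10, freqs, 0.25)
q3 = grouped_quantile(lower_limits, 10, freqs, 0.75)
print(q1, q3)
```

With the table above, `q1` and `q3` should agree with the hand calculation (122 and 150 mmHg).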
Deciles and Percentiles

Deciles: If data are ordered and divided into 10 equal parts, then the cut points are called deciles.

Percentiles: If data are ordered and divided into 100 equal parts, then the cut points are called percentiles. The 25th percentile is Q1, the 50th percentile is the median (Q2) and the 75th percentile is Q3.
Example

• Question: Find the median, lower quartile, upper quartile and inter-quartile range of the following data set of scores: 19, 21, 23, 20, 23, 27, 25, 24, 31.
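One way to check an answer to this question is the "median of each half" rule stated earlier (Q1 is the median of the first half of the ordered observations, Q3 the median of the second half). The sketch below assumes that, for an odd number of observations, the median itself is left out of both halves; the helper name `quartiles_by_halves` is hypothetical.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]              # observations below the median position
    upper = data[half + n % 2:]      # observations above it (median skipped when n is odd)
    return median(lower), median(data), median(upper)


scores = [19, 21, 23, 20, 23, 27, 25, 24, 31]
q1, q2, q3 = quartiles_by_halves(scores)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

For these nine scores the position rule q(n+1)/4 gives the same quartiles, since positions 2.5 and 7.5 fall midway between two ranked values.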
Measures of Dispersion
Interquartile range
Quartile Measures:
The Interquartile Range (IQR)
• The IQR is Q3 – Q1 and measures the spread in the middle 50% of the data

• The IQR is a measure of variability that is not influenced by outliers or extreme values

• Measures like Q1, Q3, and IQR that are not influenced by outliers are called resistant
measures
Calculating The Interquartile Range
Example:
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70
(each of the four intervals between these points contains 25% of the data)

Interquartile range = 57 - 30 = 27
The Boxplot or Box and Whisker Diagram

• The Boxplot: A graphical display of the data.

Xsmallest -- Q1 -- Median -- Q3 -- Xlargest

Example:
[Figure: boxplot with Xsmallest, Q1, Median, Q3 and Xlargest marked; each of the four segments contains 25% of the data]
Shape of Boxplots

• If data are symmetric around the median, then the box and central line are centered between the endpoints

[Figure: symmetric boxplot with Xsmallest, Q1, Median, Q3 and Xlargest marked]

• A boxplot can be shown in either a vertical or horizontal orientation


Quartile Deviation (Or Semi-interquartile Range):
The quartile deviation is another type of range obtained from the quartiles. It is obtained by dividing the difference between the upper quartile (Q3) and the lower quartile (Q1) by 2.

Mathematically, the quartile deviation (Q.D) can be written as:

Q.D = (Upper quartile − Lower quartile) / 2 = (Q3 − Q1) / 2

The term (Q3 − Q1) is known as the interquartile range, and the quartile deviation (Q3 − Q1) / 2 is also known as the semi-interquartile range.
Example:
The Automobile Association checks the prices of gasoline before many holiday weekends.
Listed below are the self-service prices for a sample of 8 retail outlets during the May
2004 Memorial Day Weekend.
40, 22, 60, 30, 45, 66, 70, 55
Determine the quartile deviation, Interquartile range and semi-interquartile range.
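One possible way to work this example is the position rule q(n+1)/4, interpolating linearly between neighbouring ranked values when the position is fractional. The sketch below assumes that convention and uses Python's standard `statistics.quantiles` with `method="exclusive"`, which follows it. Other quartile conventions (medians of each half, or rounding the position) give somewhat different values.

```python
from statistics import quantiles

# Self-service gasoline prices from the example
prices = [40, 22, 60, 30, 45, 66, 70, 55]

# n=4 asks for the three quartile cut points; the data are sorted internally
q1, q2, q3 = quantiles(prices, n=4, method="exclusive")

iqr = q3 - q1                   # interquartile range
quartile_deviation = iqr / 2    # same quantity as the semi-interquartile range
print(q1, q3, iqr, quartile_deviation)
```

Under this convention Q1 sits at ranked position 2.25 and Q3 at position 6.75 of the eight sorted prices.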
Example:
Find the quartile deviation, interquartile range and semi-interquartile range from the following frequency distribution:

Class       Less than 10   10-15   15-20   20-25   25-30   More than 30
Frequency   2              6       7       10      3       1

So, quartile deviation,
Q.D = (Q3 − Q1) / 2 = (23.375 − 14.375) / 2 = 4.5

Interquartile range,
IQR = (Q3 − Q1) = 23.375 − 14.375 = 9

Semi-interquartile range = (Q3 − Q1) / 2 = 4.5
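The quartile values 14.375 and 23.375 can be traced with the grouped-data interpolation formula. The derivation below is a sketch, assuming a class width of 5 for the two quartile classes (10-15 and 20-25); the open-ended first and last classes do not affect it because neither quartile falls in them. The cumulative frequencies are 2, 8, 15, 25, 28, 29.

```latex
\begin{align*}
N &= 2 + 6 + 7 + 10 + 3 + 1 = 29, \qquad \tfrac{N}{4} = 7.25, \qquad \tfrac{3N}{4} = 21.75 \\
Q_1 &= 10 + \frac{5}{6}\,(7.25 - 2) = 10 + 4.375 = 14.375 \\
Q_3 &= 20 + \frac{5}{10}\,(21.75 - 15) = 20 + 3.375 = 23.375
\end{align*}
```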


Measures of Central Tendency

Dr. Dharmendra Dubey


Assistant Professor
Symbiosis Institute of Health Sciences, Pune
Basic Statistics

Measures of Central Tendency


• Measures of central tendency are also commonly called averages.

• They give us an idea about the concentration of the values in the central part of the distribution.
The following are the five measures of central tendency that are in common use:

• Arithmetic Mean

• Median

• Mode

• Geometric Mean

• Harmonic Mean
[Cartoon: "It's too hot!" / "It's too cold!" / "On average, I feel fine."]
Measures of Central Tendency

Summarizing Data
Measures of central tendency give you one score or measure that represents, or is typical of, an entire group of scores:
• The Mean
• The Median
• The Mode
Most scores tend to center toward a point in the distribution.

[Figure: frequency curve; x-axis: score, y-axis: frequency]
Central Tendency
[Figure: a set of raw scores summarized in two ways. Frequency Tables & Graphs: tabulating, graphing, measurement scales. Measures of Central Tendency: averaging, giving the mean, the median and the mode.]
The Mean
Methods of Center Measurement
A measure of center is a summary measure of the overall level of a dataset.

Mean: The sum of all the observations divided by the number of observations. The mean of 20, 30, 40 is (20+30+40)/3 = 30.
Definition: For ungrouped data, the population mean is the sum of all the population values divided by the total number of population values. To compute the population mean, use the following formula.

μ = ΣX / N

where μ is the population mean, Σ (sigma) denotes summation, X is an individual value and N is the population size.
THE SAMPLE MEAN
Definition: For ungrouped data, the sample mean is the sum of all the sample values divided by the number of sample values. To compute the sample mean, use the following formula.

X̄ = ΣX / n

where X̄ (X-bar) is the sample mean, Σ (sigma) denotes summation, X is an individual value and n is the sample size.
Measures of Central Tendency as Inferential Statistics

[Diagram: Parameters (Mean, Median, Mode) -> Sampling -> Statistics (Mean, Median, Mode); difference between parameter and statistics: sampling errors]


Let's find the MEAN of the Statistics test scores:
97, 84, 88, 100, 95, 63, 73, 86, 97
Sum = 783
783 ÷ 9 = 87
The mean is 87
Example

Find the mean of the following numbers.


23, 25, 26, 29, 39, 42, 50
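The mean calculation can be sketched in a few lines. This minimal illustration uses a hypothetical helper `arithmetic_mean` and applies it to the nine test scores shown earlier and to the numbers in this example.

```python
def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    if not values:
        raise ValueError("at least one observation is required")
    return sum(values) / len(values)


test_scores = [97, 84, 88, 100, 95, 63, 73, 86, 97]
numbers = [23, 25, 26, 29, 39, 42, 50]

mean_scores = arithmetic_mean(test_scores)   # 783 / 9
mean_numbers = arithmetic_mean(numbers)      # 234 / 7
print(mean_scores, round(mean_numbers, 2))
```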
Merits of Arithmetic Mean

• It is rigidly defined

• It is easy to understand and easy to calculate

• It is based upon all the observations

• Of all the averages, arithmetic mean is affected least by fluctuations of sampling.

• This property is sometimes described by saying that arithmetic mean is a stable average.
Demerits of Arithmetic Mean
• It cannot be determined by inspection, nor can it be located graphically.

• Arithmetic mean cannot be used if we are dealing with qualitative characteristics which cannot be measured quantitatively, such as intelligence, honesty, beauty, etc.

• Arithmetic mean cannot be obtained if a single observation is missing, lost or illegible, unless we drop it and compute the arithmetic mean of the remaining values.

• Arithmetic mean is affected very much by extreme values. In case of extreme items, arithmetic mean gives a distorted picture of the distribution and no longer remains representative of the distribution.
“The Median”
The Median is the 50th percentile of a distribution
- The point where half of the observations fall below
and half of the observations fall above
In any distribution there will always be an equal number
of cases above and below the Median.

[Cartoon: "Oh my!! Where is the median?"]

Location
Median
Median: The middle value in an ordered sequence of observations. That is, to find the median we need to order the data set and then find the middle value.

In case of an even number of observations, the average of the two middlemost values is the median.

For example, to find the median of {9, 3, 6, 7, 5}, we first sort the data, giving {3, 5, 6, 7, 9}. Since n = 5 is odd, the position number is (n+1)/2 = 3, so we choose the middle (third) value, 6.


Median

For example, if the number of observations is even, e.g., {9, 3, 6, 7, 5, 2}, we first sort the data, giving {2, 3, 5, 6, 7, 9}.

Since n = 6 is even, the position numbers are n/2 = 3 and (n/2) + 1 = 4.

The median is then the average of these two middle values from the sorted sequence, in this case (5 + 6) / 2 = 5.5.
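The position rules for odd and even n can be sketched as follows. The function name `median_by_position` is a hypothetical name for this illustration; positions in the slides are 1-based, so the code subtracts 1 when indexing.

```python
def median_by_position(values):
    """Median via the position rules: (n+1)/2 if n is odd, n/2 and n/2 + 1 if n is even."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("at least one observation is required")
    if n % 2 == 1:
        return data[(n + 1) // 2 - 1]              # the single middle value
    return (data[n // 2 - 1] + data[n // 2]) / 2   # average of the two middle values


odd_case = median_by_position([9, 3, 6, 7, 5])
even_case = median_by_position([9, 3, 6, 7, 5, 2])
print(odd_case, even_case)
```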


Merits of Median

• It is rigidly defined
• It is easily understood and is easy to calculate. In some cases it can be located
merely by inspection
• It is not at all affected by extreme values
• It can be calculated for distributions with open ended classes
Demerits of Median

• In case of an even number of observations, the median cannot be determined exactly. We merely estimate it by taking the mean of the two middle terms.

• It is not amenable to algebraic treatment.

• As compared with the mean, it is affected more by fluctuations of sampling.

• It is not based on all the observations. For example, the median of 10, 25, 50, 60 and 65 is 50. We can replace the observations 10 and 25 by any two values smaller than 50, and the observations 60 and 65 by any two values greater than 50, without affecting the value of the median.

• This property is sometimes described by saying that the median is insensitive.


Median for Grouped data
• Median = L + (h / f) × (N/2 − C.F.)

• Where,
• L = Lower limit of the class interval where the median occurs
• f = Frequency of the class where the median occurs
• h = Width of the median class
• N = Total number of observations
• C.F. = Cumulative frequency of the class preceding the median class
Mean or Median
• The median is less sensitive to outliers (extreme scores) than the mean and thus a better measure
than the mean for highly skewed distributions, e.g. family income.

• For example
Suppose we are given the observations 20, 30, 40, and 990.
Mean?
Median?
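A quick numerical check of this comparison, as a small sketch using Python's standard `statistics` module, shows how far a single extreme value pulls the mean while the median stays near the bulk of the data.

```python
from statistics import mean, median

observations = [20, 30, 40, 990]

obs_mean = mean(observations)       # (20 + 30 + 40 + 990) / 4
obs_median = median(observations)   # average of the two middle values, 30 and 40
print(obs_mean, obs_median)
```

Here the mean is 270, larger than three of the four observations, while the median is 35.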
63 73 84 86 88 95 97 97 100

The median is 88.

Half the numbers are less than the median. Half the numbers are greater than the median.


Median

Sounds like
MEDIUM
Think middle when you hear median.
A Hint for remembering the MODE…

The first two letters give you a hint… MOde


Most Often
Definition
• A la mode – the most popular
or that which is in fashion.

Baseball caps are a la mode today.


Solutions

I.Q.        Frequency (F)   Mid-value (X)   FX      C.F.
90 - 100    11              95              1045    11
100 - 110   27              105             2835    38
110 - 120   36              115             4140    74
120 - 130   38              125             4750    112
130 - 140   43              135             5805    155
140 - 150   28              145             4060    183
150 - 160   16              155             2480    199
160 - 170   1               165             165     200
Total       200                             25280
Solution
• Mean = ΣFX / ΣF = 25280/200 = 126.4
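Solution
• Median = L + (h / f) × (N/2 − C.F.)
• Where, N/2 = 200/2 = 100, so the median class is 120 - 130
• L = 120
• f = 38
• h = 10
• C.F. = 74
• Median = 120 + (10/38) × (100 − 74) = 120 + (260/38) = 120 + 6.84 = 126.84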
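Solution
• Mode = L + [(f1 − f0) / (2f1 − f0 − f2)] × h

where L is the lower limit of the modal class (130 - 140), f1 is its frequency, f0 and f2 are the frequencies of the preceding and following classes, and h is the class width.

= 130 + [(43 − 38) / (2 × 43 − 38 − 28)] × 10
= 130 + [5/20] × 10
= 130 + 2.5 = 132.5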
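The three grouped-data calculations for the I.Q. table can be checked together. This is a minimal sketch, assuming equal class widths of 10 and the usual interpolation formulas for the median and mode; all variable names are illustrative. Note that 260/38 is about 6.84, so the median evaluates to about 126.84.

```python
# I.Q. frequency table: class lower limits and frequencies
lower = [90, 100, 110, 120, 130, 140, 150, 160]
freq = [11, 27, 36, 38, 43, 28, 16, 1]
h = 10                                   # class width
n = sum(freq)                            # total frequency

# Mean: sum of (frequency x mid-value) divided by total frequency
mids = [l + h / 2 for l in lower]
grouped_mean = sum(f * x for f, x in zip(freq, mids)) / n

# Median: L + (h / f) * (N/2 - C.F.) in the first class whose cumulative frequency reaches N/2
cum = 0
for l, f in zip(lower, freq):
    if cum + f >= n / 2:
        grouped_median = l + h * (n / 2 - cum) / f
        break
    cum += f

# Mode: L + (f1 - f0) / (2*f1 - f0 - f2) * h in the class with the highest frequency
i = freq.index(max(freq))
f1 = freq[i]
f0 = freq[i - 1] if i > 0 else 0
f2 = freq[i + 1] if i < len(freq) - 1 else 0
grouped_mode = lower[i] + h * (f1 - f0) / (2 * f1 - f0 - f2)

print(round(grouped_mean, 2), round(grouped_median, 2), round(grouped_mode, 2))
```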
Geometric Mean
• This measure of the central tendency is called the geometric mean (GM) and is defined as the arithmetic mean of the values taken on a log scale.

• The GM can also be interpreted as the nth root of the product of n observations.
For example

• If 20, 25 and 15 are the weights (kg) of three children, the GM of these 3 observations will be the cube root of the product 20 × 25 × 15.
GM = (20 × 25 × 15)^(1/3)
= (7500)^(1/3)
= 19.57 kg.
Example
• What is the geometric mean of 2, 3, and 6 ?
Example
• What is the geometric mean of 4, 8, 3, 9, and 17 ?
Geometric Mean
GM is used more in microbiological or serological research.

One major limitation of GM is that it cannot be obtained if any observation is zero or negative, as the log values of such observations are not defined.
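Both descriptions of the GM (mean on a log scale, and nth root of the product) can be illustrated with a short sketch. The helper `geometric_mean` below is a hypothetical name; it averages the natural logs and transforms back, and it rejects zero or negative values because their logs are undefined.

```python
import math


def geometric_mean(values):
    """Arithmetic mean of the logs, transformed back with exp."""
    if not values:
        raise ValueError("at least one observation is required")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive observations")
    return math.exp(sum(math.log(v) for v in values) / len(values))


gm_weights = geometric_mean([20, 25, 15])    # cube root of 7500
gm_three = geometric_mean([2, 3, 6])         # cube root of 36
gm_five = geometric_mean([4, 8, 3, 9, 17])   # fifth root of 14688
print(round(gm_weights, 2), round(gm_three, 2), round(gm_five, 2))
```

Working on the log scale avoids forming a very large product when there are many observations.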
Harmonic mean
• Another measure of the central tendency of a frequency distribution, called the harmonic mean, is based on the reciprocals of the observations.

• It is defined as the reciprocal of the arithmetic mean of the reciprocals of the observations.

• Thus, when we have the observations x1, x2, ..., xn, HM will be

HM = n / (1/x1 + 1/x2 + ... + 1/xn)


Harmonic mean

• The use of HM may be appropriate when the reciprocals of the observations seem more useful for determining the central tendency.

• HM may be used to compute the average of rates, say, typing speed (number of words per minute), speed of a car (km per hour), etc. It is used very rarely in medical or biological applications.
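The use of HM for averaging rates can be illustrated with a small sketch. The speeds below are hypothetical values chosen only for illustration (they are not from the lecture), and the example assumes the same distance is covered at each speed.

```python
def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    if not values:
        raise ValueError("at least one observation is required")
    if any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs strictly positive observations")
    return len(values) / sum(1 / v for v in values)


# Hypothetical data: a car covers the same distance at 40 km/h and again at 60 km/h
speeds_kmh = [40, 60]
average_speed = harmonic_mean(speeds_kmh)
print(round(average_speed, 2))
```

The result is below the arithmetic mean of 50 km/h because more time is spent travelling at the slower speed.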
