Definition
The word ‘statistics’ originated from the Latin word ‘status’, which means ‘state’ or ‘position’. It is so called because the first statistics consisted of data collected on the total number of people residing in Egypt, for estimating the number of persons capable of paying tax to the king.
In olden days, statistics was mainly used for various administrative requirements, for example: estimating budget requirements, the amount of tax to be collected, the labour and military force required, and the amount of clothing, food, schools and hospitals required, etc.
The word statistics is used in two different ways – one in a plural sense and the other in a singular sense.
In the singular sense, it is the ‘science’ dealing with the methods of data collection, compilation, tabulation and analysis needed to provide meaningful and valid interpretations.
Biostatistics
Statistical methods applied in the fields of medicine, biology and public health are termed biostatistics. It may be described as the application of statistical methods to the solution of biological problems. Biostatistics is known by many names, such as medical statistics, health statistics and vital statistics.
Vital statistics: Statistics related to vital events in life, such as births, deaths, marriages, morbidity, etc. These terms are overlapping and not exclusive of each other.
Uses of Biostatistics
Statistical methods are widely used in almost all fields. Most of the basic as well as
advanced statistical methods are applied in fields such as medicine, biology, public
health etc.
Constant: A constant is a value that does not change with any situation. Example: the value of π is 3.1416, which does not change with time, place or person. The value of ‘e’ is 2.718.
Variable: A variable is a characteristic that can take on different values with respect to person, time, place or any other factor. Example: blood pressure, height, weight, age, blood group, etc.
Populations under study can be finite or infinite in number. A population may refer to the values of a variable for a concrete collection of objects, e.g. birth weights of infants, anthropometric measurements of adults, nutrient contents of varieties of foods, etc.
Sample : A small portion of the population which truly represents the population with
respect to the study characteristics of the population is known as sample. A
representative sample represents the population with respect to the characteristics
under study. Sample can be representative, if it is drawn carefully with an
appropriate size and good procedure.
Example: 100 persons selected randomly from a study population of 1000 persons for estimating the mean Hb.
When the sample size increases, the estimated statistic values will be as close as possible to the population parameter values.
Variables are classified under four levels of measurement such as nominal, ordinal,
interval and ratio.
1. Nominal : Nominal variables are those variables measured with attributes that
are exhaustive and mutually exclusive.
Exhaustive means every participant in the study can be classified into one of the attributes. Example: the attributes of the variable religion may be Hindu, Muslim, Christian, Others. Then every participant of the study can be included in one of the categories. If the category ‘Others’ is not included as an attribute, a participant who is a Buddhist is not able to mark any attribute.
Mutually exclusive means every participant in the study should fall in one and only one attribute. Example: gender (attributes: male, female), place of residence (attributes: rural, urban), socio-economic status (attributes: low, average, high), etc.
2. Ordinal : Ordinal variables are those variables measured with attributes that have the characteristics of a nominal variable (exhaustive and mutually exclusive) and can be rank-ordered. Rank-ordered means that one attribute is greater than or less than another attribute, but it is not possible to state precisely how much greater or how much less. In this scale the relative order is meaningful, but the differences between categories are not known.
Example : Pain (Attributes : No pain, Mild pain, Moderate pain, Severe pain),
Satisfaction – 5 point Likert scale ( Attributes : Very dissatisfied, dissatisfied,
neither dissatisfied nor satisfied, satisfied, very satisfied) etc.
If feasible, collect data using interval or ratio level measurement, because one can recode an interval or ratio level variable into a nominal or ordinal variable. So it is possible to convert a higher level of measurement into a lower level, but not the opposite.
Example : The age of a participant in a study can be recorded as the actual age based on the last birthday. In this case age is defined as a ratio level measurement, which is appropriate for parametric tests. Later the age can be recoded into an ordinal level of measurement (0-10, 10-20, 20-30, etc.). If the age is collected at the nominal or ordinal level, it is not possible to convert it into the ratio level.
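As an illustration of recoding a ratio-level variable into an ordinal one, here is a minimal Python sketch; the cut-points and the ages in the list are assumptions chosen for illustration, not values from the text.

```python
def age_group(age_years):
    """Recode a ratio-level age into an ordinal category (hypothetical cut-points)."""
    if age_years < 10:
        return "0-9"
    elif age_years < 20:
        return "10-19"
    elif age_years < 30:
        return "20-29"
    else:
        return "30+"

ages = [4, 12, 19, 25, 33, 8]                # actual ages (ratio level)
print([age_group(a) for a in ages])          # ordinal categories; the reverse conversion is impossible
```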
Qualitative data: Qualitative data is defined as the data that is not precisely
measurable.
Ordinal data : If the qualitative data is expressed in some intrinsic order, such
data is called ordinal data.
(Example: pain – nil/mild/moderate/severe; income – poor/middle/high; attitude – strongly agree, agree, disagree, strongly disagree, etc.)
Quantitative data are divided into two types, namely discrete and continuous.
Discrete data: The quantitative data that can be expressed only in whole
numbers is called discrete data. (Example: Pulse rate, Number of RBC, number
of injections given to patients etc.).
Continuous Data: The quantitative data that is capable of assuming any fractional value within a range of numbers is called continuous data. (Example: weight, height, temperature.)
[Diagram: Classification of data into qualitative and quantitative types]
The form in which the data were originally collected is called raw data, which looks like a mass of numbers. Useful information is usually not immediately evident from raw data, so rearrangement of the data is essential to extract information from it.
The tabulation of qualitative data is carried out by counting the occurrence of each of the attributes of the variable and presenting them in a frequency distribution.
Example : The smoking status of 20 lung cancer patients was collected as Yes (y) and No (n). The raw data appear as y, n, y, y, y, y, n, y, n, y, n, y, n, y, n, y, n, y, y, n. Simply count the occurrences of ‘y’ and ‘n’ in the raw data, which are 12 and 8 respectively. The resulting tabulation of the data is presented in the following table.
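A minimal Python sketch of this tally, using the raw data quoted above; the counting itself is the only point being illustrated.

```python
from collections import Counter

# smoking status of 20 lung cancer patients, as listed in the example above
raw = ['y', 'n', 'y', 'y', 'y', 'y', 'n', 'y', 'n', 'y',
       'n', 'y', 'n', 'y', 'n', 'y', 'n', 'y', 'y', 'n']
freq = Counter(raw)            # frequency of each attribute
print(freq['y'], freq['n'])    # 12 8
```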
(b) Class Limits: Consider a class 20 – 30. The minimum value which can be included in the class is the lower class limit, and the maximum value that can be included in the class is the upper class limit. So when the class is 10 – 20, then 10 is the lower class limit and 20 is the upper class limit. Now consider the classes 0 – 9, 10 – 19, 20 – 29, 30 – 39, 40 – 49, 50 – 59, etc. Then for the class 20 – 29, 20 is the lower limit and 29 is the upper limit.
(d) Class Interval: The difference between class boundaries is called the class interval, class width or span of the class. E.g., for classes 10 – 20, 20 – 30, 30 – 40, … the class interval is the difference between 20 and 10, or 30 and 20, etc. When the classes are 10 – 19, 20 – 29, 30 – 39, etc., the class interval is the difference between 19.5 and 9.5, 29.5 and 19.5, etc.
The steps for the formation of frequency distribution for quantitative data are
Find out the minimum and maximum values of the given data and divide the
total range of observation into some arbitrary intervals called class interval
Draw a table with the first column indicating the class interval. This column
should have an appropriate label along with unit of measurement.
Label the second column ‘tally marks’. Going through the data, cross off (‘/’) each observation and put a tally mark ‘|’ against the interval in which that observation falls.
Continue with the other observations, indicating every fifth tally mark in an
interval by crossing the previous four tally marks as shown IIII , so that it will
be easy to count multiples of five.
After placing the tally marks for all the observations in the appropriate groups,
count the tally marks and indicate the number as the frequency of that class in
the next column
The total of frequencies of all the class intervals will add up to the total
number of observations in the data set.
Finally give a suitable heading or titles to the table.
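The steps above can be mirrored in a short Python sketch; the data values and class boundaries below are assumptions made purely for illustration (lower boundary inclusive, upper exclusive).

```python
# tally quantitative data into class intervals and print the frequency distribution
data = [23, 45, 12, 67, 34, 29, 41, 55, 18, 62, 37, 48]
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]

freq = {c: 0 for c in classes}
for value in data:
    for lower, upper in classes:
        if lower <= value < upper:        # place the observation in its class
            freq[(lower, upper)] += 1
            break

for (lower, upper), f in freq.items():
    print(f"{lower} - {upper}: {f}")
print("Total:", sum(freq.values()))       # equals the number of observations
```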
(i) The classes should be clearly defined and should not lead to any ambiguity.
(ii) The classes should be exhaustive, i.e., each of the given values should be included in one of the classes.
(iii) The classes should be mutually exclusive and non-overlapping.
(iv) The classes should be of equal width. This principle, however, cannot always be rigidly followed. E.g.: children below 15 years can be classified with unequal class intervals such as <7 days, 7 – 28 days, 28 days – 1 year, 1 – 4 years, 5 – 9 years, etc. to see the mortality pattern.
(v) Indeterminate classes, e.g. the open-ended classes ‘less than a’ or ‘greater than b’, should be avoided as far as possible, since they create difficulty in analysis and interpretation.
(vi) The number of classes should neither be too large nor too small. It should
preferably lie between 5 and 15.
Number of children:
0,1,2,1,8,8,8,2,8,3,4,5,2,0,0,1,5,4, 1,3,5,3,6,6,5,2,1,3,7,1,3,2,1,0,5,6
Explanation: Consider the first observation ‘0’ and put a tally mark against ‘0’. Now consider the second observation ‘1’ and put a tally mark against ‘1’. Proceed in the same way. Every fifth tally mark is to be crossed as shown in the table. Then count the number of tally marks, which gives the frequency.
Sol: We need to form classes of ages. The minimum age is 10 and the maximum age is 89. Let us take the classes as 10 – 20, 20 – 30, … etc. and prepare the following table.
From the preliminary table, a final table is constructed. In the final table the ‘tally mark’ column is omitted, and the column heading ‘frequency’ is changed to the number of the specific items analysed. An ideal frequency table can have 5 to 20 classes, so that it can be accommodated on one page. The tables should be serially numbered and there must be a proper heading, which is self-explanatory. A vertical arrangement is preferred to a horizontal arrangement. Often, a percentage column is also appended to give a comparative picture. Thus the final table for the age of patients can be presented as follows.
The tabular arrangement obtained by counting the different combinations that can be formed with two variables is called a two-way table or two-way classification. When three variables are considered simultaneously, the tabulation is called a three-way table. In general, when more than one variable is studied, resulting in a subdivision of classes, the classification is known as a multi-way table or manifold classification.
Consider the data regarding the variable blood group of nursing students. The
simple classification will be
When two variables are considered – blood group and Rh factor (+ve and −ve) – the information can be arranged in a tabular form by counting the different combinations that can be formed with the two variables, namely A+ve, A−ve, B+ve, B−ve, etc. Such a presentation is called a two-way table. The following table shows the two-way and three-way tables.
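A minimal Python sketch of building such a two-way table by counting combinations; the (blood group, Rh) pairs below are hypothetical observations, not the data from the table in the text.

```python
from collections import Counter

# hypothetical (blood group, Rh factor) observations for nursing students
students = [("A", "+"), ("O", "+"), ("B", "-"), ("A", "+"), ("AB", "+"),
            ("O", "-"), ("A", "-"), ("O", "+"), ("B", "+"), ("AB", "-")]

two_way = Counter(students)                       # count each combination once
for group in ["A", "B", "AB", "O"]:
    pos = two_way[(group, "+")]
    neg = two_way[(group, "-")]
    print(f"{group}: Rh positive = {pos}, Rh negative = {neg}")
```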
Frequency tables provide a view of the data and its principal characteristics, whereas diagrams provide a visual method of examining data. Diagrams help to get a real grasp of the overall picture rather than specific details. Many types of diagrams and graphs are used to represent different types of data.
1. Bar Diagrams
One of the most commonly seen diagrams is the bar diagram. The commonly used types of bar diagram are the simple, component, multiple and percentage bar diagrams.
In a bar diagram, mark the categories of the variable on the X axis and the frequency on the Y axis. A rectangular bar is erected for each category of the variable, and the height of the bar is proportional to the frequency of that category. Each bar should have an equal width, and there should be equal spacing between two successive bars.
The bar diagram representing information shown in the table regarding
distribution of blood group is presented below.
Distribution of nursing students according to blood group
Blood groups No. of nursing students
A 12
B 3
AB 7
O 18
[Figure: Simple bar diagram of the number of nursing students by blood group (A = 12, B = 3, AB = 7, O = 18)]
A bar diagram can also be made with the percentage values for the above table.
Blood group | No. of nursing students | Percent
A | 12 | 30.0
B | 3 | 7.5
AB | 7 | 17.5
O | 18 | 45.0
Total | 40 | 100.0
[Figure: Simple bar diagram showing the blood group of students, with the percentage of nursing students on the Y axis (A = 30.0%, B = 7.5%, AB = 17.5%, O = 45.0%)]
[Figure: Component and percentage bar diagrams showing the Rh type (Rh positive and Rh negative) within each blood group A, B, AB and O]
There may be situations where neither the total number nor the percentage is given any importance, but the component parts alone are taken into account for graphical presentation. Then the data can be presented as a ‘multiple bar diagram’.
[Figure: Multiple bar diagram showing the number of nursing students by blood group and Rh type (Rh positive and Rh negative bars for groups A, B, AB and O)]
2. Histogram
Example :
The table below gives the distribution of patients according to haemoglobin level.
[Figure: Histogram of the number of patients by HB level (class intervals 6 – 8, 8 – 10, 10 – 12 and 12 – 14)]
How a histogram differs from a simple bar diagram: A histogram is different from a bar diagram in three ways: (i) in a simple bar diagram all the bars are of equal width, but in a histogram the width varies with the class interval; (ii) in a simple bar diagram there is some gap between bars, but there is no gap between bars in a histogram; (iii) in a simple bar diagram the length of the bar is proportional to the frequency of the category, but in a histogram the area of the bar is proportional to the frequency of the class.
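A minimal Python (matplotlib) sketch of a histogram drawn with touching bars over class intervals; the haemoglobin values are invented for illustration, and the class boundaries 6, 8, 10, 12 and 14 are taken from the figure above.

```python
import matplotlib.pyplot as plt

# hypothetical haemoglobin values (g%) for illustration only
hb = [7.2, 8.1, 8.9, 9.4, 9.8, 10.2, 10.6, 11.0, 11.3, 11.9,
      12.1, 12.4, 12.8, 13.0, 13.5, 9.1, 10.9, 11.6, 12.2, 8.6]

# bars touch each other; with equal class widths the area is proportional to the frequency
plt.hist(hb, bins=[6, 8, 10, 12, 14], edgecolor="black")
plt.xlabel("HB level (g%)")
plt.ylabel("No. of patients")
plt.title("Histogram of haemoglobin levels")
plt.show()
```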
3. Frequency Polygon/Curve
To draw a Frequency Polygon, the mid-values are marked along the X axis on a
suitable scale and the frequencies along the Y axis. The frequencies corresponding
to each mid-point will be shown in the chart area with plotted points.
Example : Consider the incidence of female breast cancer in Mumbai and Delhi as
given in the following Table.
From the frequency polygon, it can be seen that the age pattern in the number of breast cancer cases is almost the same up to the age of 45 years, and thereafter more cases are seen in Mumbai compared to Delhi. These differences are absolute differences in the number of cases and do not reflect the incidence rates. In order to compare, the incidence rate has to be calculated (the number of cases per women at risk in different age groups).
The drawing of the Frequency curve is exactly similar to that of drawing a frequency
polygon except that the plotted points are joined by a smooth free hand curve.
If the frequency polygon is drawn for the cumulative frequencies rather than the absolute frequencies of the class intervals, it is called a cumulative frequency polygon. If the points of the cumulative frequency polygon are joined with a smooth curve instead of straight lines, it is called an ogive. With the help of this curve one may find the number of observations falling below or above a specific value. This is useful for the calculation of quartiles, deciles, percentiles, the median and the mode, and for comparison between two or more groups.
Example :
[Figure: Less-than ogive of the cumulative number of hepatitis cases (No. of patients) by day of reporting]
From the less-than ogive, for any given value on the X axis one can determine the number of observations that are less than or equal to that value by extending a vertical line from that value up to the point where it intersects the ogive curve.
[Figure: Greater-than ogive of the cumulative number of hepatitis cases (No. of patients) by day of reporting]
The greater than ogive is helpful in determining the number of observations above a
specific value, within the range of the observed data.
[Figure: Less-than and greater-than cumulative frequency curves of reported hepatitis cases]
When both ogives are drawn on the same graph, they intersect at some point, and the corresponding value of the variable indicated on the X axis is the median of the entire data.
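A minimal Python sketch of computing less-than cumulative frequencies and reading off an approximate median by linear interpolation; the class intervals and frequencies are assumptions for illustration, not the hepatitis data above.

```python
# less-than cumulative frequencies and a median estimate by linear interpolation
classes = [(1, 3), (3, 5), (5, 7), (7, 9), (9, 11)]     # assumed class boundaries
freq =    [14,     25,     31,     24,     14]          # assumed frequencies

cum, running = [], 0
for f in freq:
    running += f
    cum.append(running)                 # cumulative frequency at each upper boundary

n = running
half = n / 2
for (lower, upper), c, f in zip(classes, cum, freq):
    if c >= half:                       # median class: where N/2 is first reached
        below = c - f                   # cumulative frequency below the median class
        median = lower + (half - below) / f * (upper - lower)
        print("Estimated median:", round(median, 2))
        break
```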
4. Line Diagram
Let us consider the number of deaths reported from the Medical College
Health Centre area for the period from 2001 to 2004.
In this case, the number of deaths is spread over a period of four years. The objective is to see the mortality trend from a suitable diagram or chart. Certainly, a ‘line diagram’ may be the better choice.
[Figure: Line diagram of the number of deaths reported from the Medical College Health Centre area, 2001 to 2004]
In line graphs, time is taken as the independent variable (years, months, hours, etc.) on the X axis and one or more variables are taken on the Y axis. Corresponding to each such point on the X axis, the number of observations for that year is indicated by a dot or plotted point. The line obtained by joining all the plotted points is called a line graph.
Two or more variables depending on the same time variable can be presented on the same graph. This diagram is very useful for comparing the trend of two or more variables over a period of time. The following figure gives information on the variation in SBP at different time intervals after surgery in both groups; a comparison of SBP between the groups at different time intervals is also possible.
[Figure: Line graph of SBP (mm Hg) at different time intervals after surgery for the experimental and control groups]
Another advantage of these diagrams is that the variable under study can be interpolated for any given time period within the observed range.
The measures that describe the tendency of the data to lie in the centre (middle) are called ‘Measures of Central Tendency’ or ‘Statistical Averages’. The measures of central tendency reduce or condense a group of observations into a single number and make possible the comparison of different sets of data.
The important measures of central tendency are – (i) The Arithmetic Mean (ii) The
Median and (iii) The Mode.
Ungrouped data
For ungrouped data the arithmetic mean (x̄) can be calculated by the formula x̄ = Σx / n.
Example 5.1: The gains in weight of 5 albino rats over a period of 5 days are 5, 6, 4, 8 and 7. The arithmetic mean or mean is
x̄ = (5 + 6 + 4 + 8 + 7) / 5 = 30 / 5 = 6.0
Grouped data
Example:
Following data gives age in years in case of child deaths. Find the average age.
A.M. = Σ(x·f) / N = 275 / 172 = 1.6 years
In this case the exact mark (the x value for each student) is not known. So we assume that all the students in the ‘5 – 15’ class might have scored 10 marks each (the mid-value of the class, calculated as (lower limit + upper limit) / 2), all the students in the ‘15 – 25’ class are assumed to have scored 20 marks each, and so on. The table is to be reconstructed with two more columns, viz. mid-value (x) and x·f, as follows:
Marks awarded to each of 100 students in a class
Again, the Arithmetic Mean will be calculated with the same formula:
A.M. = Σ(x·f) / N = 2400 / 100 = 24 marks
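A minimal Python sketch of the grouped-data arithmetic mean, A.M. = Σ(x·f)/N with x the class mid-value; the class intervals and frequencies are assumptions chosen so that Σ(x·f) = 2400 and N = 100, matching the worked result above.

```python
classes = [(5, 15), (15, 25), (25, 35), (35, 45)]    # assumed class intervals (marks)
freq    = [20,      40,       20,       20]          # assumed frequencies, N = 100

mid = [(low + high) / 2 for low, high in classes]    # mid-value of each class
N = sum(freq)
mean = sum(x * f for x, f in zip(mid, freq)) / N
print(mean)                                          # 24.0 marks
```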
The advantage of the A.M. is that it has all the desirable properties of a good measure of central tendency. Computation of the A.M. uses all the observations, it can be used for further mathematical and statistical operations, and the behaviour of the sample mean over repeated samples is easy to predict.
Disadvantages: (i) The sample mean cannot be computed for nominal and non-numerical ordinal data. (ii) The sample mean is sensitive to extreme values in the data set, i.e. it will be influenced considerably by abnormal values. (iii) The mean cannot be computed from a grouped distribution with open-ended class intervals.
B. Median
The median is the middle value, which divides the observed values into two equal halves. Since it is the middle-most value by position, the median is also called a positional average.
For calculation of the median, the values should be arranged in order of magnitude, and the central value represents the median.
When the total number of observations (n) is odd:
Median = the [(n + 1)/2]th term, if the observations are arranged in either ascending or descending order.
When the total number of observations (n) is even
Median = The average of (n/2)th and [(n/2) +1)]th term, if the observations are
arranged either in ascending or descending order
Grouped data
In this case the data is grouped (classified), and the median can be obtained with the help of a mathematical formula:
Median = l + [(N/2 − m) / f] × C
where l = lower boundary of the median class, m = cumulative frequency up to the median class, f = frequency of the median class and C = class interval. The ‘median class’ is defined as the class in which N/2 lies.
Example : Calculation of median dosage of drug for the data given in table.
The median is not a good measure of central tendency because it has none of the desirable properties, except that it is easy to understand, easy to calculate if the data is ungrouped, and has a well-defined formula. Still, it is the best measure if there are a few abnormal observations, or if some of the observations are missing but their positions in the distribution are known.
C. Mode
The mode is that value of the variable which occurs most frequently.
Grouped data
In this case the data is grouped, so the mode can be calculated with the formula
Mode = l + [f2 / (f1 + f2)] × C
where l = lower boundary of the modal class,
C = class interval,
f1 = frequency of the class just preceding the modal class,
f2 = frequency of the class just succeeding the modal class.
The modal class is defined as the class in which the maximum frequency lies.
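A minimal Python sketch of the grouped-data mode formula quoted above; the modal class boundary, class interval and neighbouring frequencies used in the call are hypothetical values, not data from the text.

```python
def grouped_mode(lower_boundary, class_interval, f_preceding, f_succeeding):
    """Mode = l + [f2 / (f1 + f2)] * C, using the symbols defined above."""
    return lower_boundary + (f_succeeding / (f_preceding + f_succeeding)) * class_interval

# hypothetical example: modal class 150-170 (l = 150, C = 20), f1 = 20, f2 = 20
print(grouped_mode(150, 20, 20, 20))   # 160.0
```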
Example :
The mean is easy to calculate and understand, is based on all the observations and is least affected by sampling fluctuations, but it is affected very much by extreme values.
The median is easy to calculate and is not affected by extreme values. With an even number of observations, it is not an exact value. It is not based on all the observations, unlike the mean.
The mode is used when the values vary widely. For skewed distributions, or samples where there is wide variation, the mode and median are useful.
MEASURES OF DISPERSION
The measures of central tendency – Mean, Median and Mode – will not be enough to provide comparison and valid inference. Two distributions may centre around the same point, i.e. the same arithmetic mean, but differ in variation from the arithmetic mean. Such variation is called dispersion, spread, scatter or variability.
A ‘Good Measure of Dispersion’ must have the properties like – (i) Easy to calculate
and Easy to understand (ii) Rigidly defined (iii) Representative of the whole group (iv)
Less amenable to sampling fluctuation and (v) Amenable for further mathematical
and statistical operations.
1. Range
The Range [R] is defined as the difference between the Highest (H) and Lowest
(L) observation in a given set of data. It is calculated as: R = H – L
It is the simplest measure of dispersion, which will provide quick and easy
inference of the data.
2. Mean Deviation
The Mean Deviation (Average Deviation) is defined as the mean of the absolute
Deviation of observations from any one of the measures of central tendency.
If the deviation is taken from the A.M., it is called ‘Mean Deviation about Mean’. For
deviation from the Median, it is known as ‘M.D. about Median’ and if the deviation is
computed from the Mode, it is called the ‘M.D. about Mode’. If it is not specifically
stated, M.D. signifies the ‘Mean Deviation about Mean’.
Ungrouped data : For ungrouped data the Mean Deviation is calculated with the formula
M.D. = Σ|x − x̄| / n
Find the sum of |x − x̄| and divide by n (here n = 10). The computation can then be done as follows.
Grouped data:
Mean Deviation for grouped data can be calculated with the formula
M.D. = Σ|x − x̄|·f / N
Example : Calculate the mean deviation for the distribution given below
In order to apply the formula for grouped data, the table has to be reconstructed with additional columns: mid-value (x), x·f, (x − x̄), |x − x̄| and |x − x̄|·f.

Systolic B.P. | Mid-value (x) | No. of patients (f) | x·f | (x − x̄) | |x − x̄| | |x − x̄|·f
110 – 130 | 120 | 10 | 1200 | −40 | 40 | 400
130 – 150 | 140 | 20 | 2800 | −20 | 20 | 400
150 – 170 | 160 | 40 | 6400 | 0 | 0 | 0
170 – 190 | 180 | 20 | 3600 | +20 | 20 | 400
190 – 210 | 200 | 10 | 2000 | +40 | 40 | 400
Total | | N = 100 | Σx·f = 16000 | | | Σ|x − x̄|·f = 1600

x̄ = Σx·f / N = 16000 / 100 = 160 mm Hg

M.D. about x̄ (grouped) = Σ|x − x̄|·f / N = 1600 / 100 = 16 mm Hg
3. Standard Deviation
The Standard Deviation (S.D.) is defined as the ‘root mean square deviation’, i.e. the square root of the average of the sum of the squares of the deviations taken from the A.M. It is denoted by the symbol ‘σ’ (sigma) if all the items of the population are taken into account, and by ‘s’ for a sample estimate.
If the square root is not computed, it is called variance. Or, the square of standard
deviation is called variance. The standard deviation is considered as the best
measure of dispersion.
S.D. = √( Σ(x − x̄)² / n )
Find the sum of (x − x̄)² and divide by n. The computation can be done as follows.

Systolic BP (x) | (x − x̄) | (x − x̄)²
140 | −20 | 400
120 | −40 | 1600
260 | +100 | 10000
120 | −40 | 1600
150 | −10 | 100
140 | −20 | 400
200 | +40 | 1600
170 | +10 | 100
120 | −40 | 1600
180 | +20 | 400
Σx = 1600 | | Σ(x − x̄)² = 17800

x̄ = Σx / n = 1600 / 10 = 160 (mean)

S.D. = √( Σ(x − x̄)² / n ) = √(17800 / 10) = √1780 = 42.2 mm Hg
Grouped data : For grouped data the standard deviation is calculated by the formula

S.D. = √( Σ(x − x̄)²·f / N )

Systolic B.P. | Mid-value (x) | No. of patients (f) | x·f | (x − x̄) | (x − x̄)² | (x − x̄)²·f
110 – 130 | 120 | 10 | 1200 | −40 | 1600 | 16000
130 – 150 | 140 | 20 | 2800 | −20 | 400 | 8000
150 – 170 | 160 | 40 | 6400 | 0 | 0 | 0
170 – 190 | 180 | 20 | 3600 | +20 | 400 | 8000
190 – 210 | 200 | 10 | 2000 | +40 | 1600 | 16000
Total | | N = 100 | Σx·f = 16000 | | | Σ(x − x̄)²·f = 48000

where x̄ = Σx·f / N = 16000 / 100 = 160 mm Hg

S.D. (grouped data) = √( Σ(x − x̄)²·f / N ) = √(48000 / 100) = √480 = 22 mm Hg
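A minimal Python sketch that reproduces the grouped-data mean and standard deviation for the systolic BP table above, using the class mid-values and frequencies from that table.

```python
from math import sqrt

mid  = [120, 140, 160, 180, 200]     # class mid-values (mm Hg)
freq = [10, 20, 40, 20, 10]          # number of patients in each class

N = sum(freq)
mean = sum(x * f for x, f in zip(mid, freq)) / N                    # 160.0 mm Hg
variance = sum((x - mean) ** 2 * f for x, f in zip(mid, freq)) / N  # 480.0
sd = sqrt(variance)
print(mean, round(sd, 1))            # 160.0 21.9 (about 22 mm Hg)
```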
The square of the S.D. is called the ‘variance’, which can be calculated by omitting the square root sign in the formula for the S.D.; it is denoted by ‘s²’ for the sample variance and ‘σ²’ for the population variance.
4. Percentiles
Quartiles: They are three different points located on the entire range of a variable such as height – Q1, Q2 and Q3. Q1, or the lower quartile, will have 25% of the observations falling to its left and 75% to its right; Q2, or the median, will have 50% of the observations on either side; and Q3, or the upper quartile, will have 75% of the observations to its left and 25% to its right.
Quintiles: Quintiles, four in number, divide the distribution into 5 equal parts. So the 20th percentile, or first quintile, will have 20% of the observations falling to its left and 80% to its right.
Deciles: Nine in number, they divide the distribution into 10 equal parts. The first decile, or 10th percentile, will divide the distribution into 10% and 90%, while the 9th decile will divide it into 90% and 10%, and the 5th decile is the same as the median. So the median of a variable can also be called the second quartile Q2, the 5th decile, or the 50th percentile P50.
5. Quartile Deviation
The Quartile Deviation (Q.D.) is defined as the average deviation between the first quartile and the third quartile, or the ‘semi-interquartile range’. It is calculated with the formula
Q.D. = (Q3 − Q1) / 2
where Q1 is the 1st quartile value and Q3 is the 3rd quartile value.
In the case of ungrouped data, the Q.D. is not usually calculated due to limited
number of observations.
In the case of grouped data:
Q1 = l1 + [(N/4) − m1] × C / f1
where l1 = lower boundary of the 1st quartile class (the class where N/4 lies), N = total frequency, m1 = cumulative frequency (less than) up to the 1st quartile class, C = class interval and f1 = frequency in the 1st quartile class; Q3 is obtained similarly, using 3N/4.
The Q.D. is usually computed if there are a few abnormal observations below the 1st quartile or above the 3rd quartile, so as to nullify their effect. It is not considered a good measure, as it signifies only the average deviation between the first and third quartiles, and most of the desirable properties are not satisfied.
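A minimal Python sketch of the grouped-data quartile formula above (with k = 1 for Q1 and k = 3 for Q3) and the resulting quartile deviation; the class boundaries and frequencies are the systolic BP values used earlier, reused here only as an illustration.

```python
def grouped_quartile(k, classes, freq):
    """Q = l + [(k*N/4 - m) / f] * C, following the formula above."""
    N = sum(freq)
    target = k * N / 4
    cum = 0
    for (lower, upper), f in zip(classes, freq):
        if cum + f >= target:                          # quartile class found
            return lower + (target - cum) / f * (upper - lower)
        cum += f

classes = [(110, 130), (130, 150), (150, 170), (170, 190), (190, 210)]
freq = [10, 20, 40, 20, 10]
q1 = grouped_quartile(1, classes, freq)
q3 = grouped_quartile(3, classes, freq)
print(q1, q3, (q3 - q1) / 2)   # 145.0 175.0 15.0 (Q1, Q3 and Q.D. in mm Hg)
```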
RELATIVE MEASURES OF DISPERSION
The measures of dispersion – Range, M.D., S.D. and Q.D. – are always positive quantities expressed in the units of the original observations, and hence they are often called ‘absolute measures of dispersion’. To compare the variability of two sets of data, a relative measure, the coefficient of variation (C.V.), is used:
C.V. = (S.D. / A.M.) × 100
Measures of relationship
Correlation
Sometimes we have to work with more than one variable at a time to understand their relationship, or the dependence of one variable on another. For example: to understand the relationship between the birth weight of newborns and gestational age, or between characteristics measured in the same person such as weight and cholesterol, or weight and height. At other times, the same characteristic is measured in two related groups, such as tallness in parents and tallness in children, intelligence quotient (IQ) in sisters and brothers, degree of heaviness or obesity in parents and their children, and so on.
If only two variables are involved, it is called ‘bivariate correlation’, and if more than two variables are involved, it is known as ‘multivariate correlation’. If the plotted points of a correlation graph fall on or near a line, it is called linear correlation; if on or near a curve, it is curvilinear correlation, and so on.
Example : If the cholesterol level increases, the BP may also increase, i.e. if there is an increase in one variable, there will be a corresponding increase in the other. Similarly, if height increases, there may be a proportional increase in weight; the birth weight of a newborn is positively correlated with gestational age, and so on.
Correlation Coefficient
The coefficient of correlation measures the strength and direction of the linear relationship between two quantitative variables.
Direction: The sign (+ or −) of the correlation coefficient indicates the direction of the relationship. A correlation coefficient between 0 and 1 indicates a positive or direct relationship; a correlation coefficient between 0 and −1 indicates a negative correlation or inverse relationship.
Strength: The absolute value of the correlation coefficient |r| indicates the strength of the relationship between the variables. For example, a correlation coefficient of −0.65 is stronger than a correlation coefficient of 0.60, because the absolute value of −0.65 is greater than the absolute value of 0.60. The closer the correlation coefficient is to −1 or +1, the stronger the correlation. There are no hard and fast rules for what constitutes a ‘strong’, ‘moderate’ or ‘weak’ correlation; some rough guidelines are given below.
Statistical tables are available to check whether the obtained correlation coefficient is statistically significant. If the obtained correlation coefficient is greater than the table value of r for the appropriate degrees of freedom (d.f.), there is a significant relationship between the variables. If the obtained correlation value is less than the table value, the relationship is not statistically significant.
Coefficient of determination
The square of the correlation coefficient (r²) is called the coefficient of determination. It is the proportion of the variance in one variable that can be explained by the other.
Example : The correlation coefficient r between the birth weight of newborns and gestational age is 0.745. The coefficient of determination r² = 0.555, or 55.5%, which implies that 55.5% of the variation in birth weight (dependent variable) can be explained by the gestational age (independent variable) of the newborn.
Calculation of Pearson correlation coefficient
There are various formulae for computing the correlation coefficient ‘r’; the simplest one is
r = Covariance(x, y) / (S.D. of x × S.D. of y)
where Covariance(x, y) = Σ(x − x̄)(y − ȳ) / n
As the S.D. of x, the S.D. of y and the covariance of (x, y) involve a lot of calculation, another formula is suggested for the computation of the correlation coefficient:
r = [nΣxy − (Σx)(Σy)] / √{ [nΣx² − (Σx)²] [nΣy² − (Σy)²] }
where n is the number of pairs. If r is positive, there is a positive correlation; if it is negative, there is a negative correlation; and if it is zero, there is no correlation.
Example:
In a study to assess the hours of study per day, for second year B.Sc
Nursing exam and the marks obtained in the university Exam, the following data
were obtained. Calculate the correlation coefficient.
Hours of study (x) | Marks obtained in the university exam (y) | xy | x² | y²
2 | 20 | 40 | 4 | 400
8 | 40 | 320 | 64 | 1600
10 | 42 | 420 | 100 | 1764
6 | 24 | 144 | 36 | 576
3 | 36 | 108 | 9 | 1296
Σx = 29 | Σy = 162 | Σxy = 1032 | Σx² = 213 | Σy² = 5636
Correlation coefficient, r = [5(1032) − (29)(162)] / √{ [5(213) − (29)²] [5(5636) − (162)²] } = +0.70
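A minimal Python sketch of the computational formula for r given above, checked against the hours-of-study example; nothing beyond that formula is assumed.

```python
from math import sqrt

def pearson_r(x, y):
    """r = [n*Sxy - Sx*Sy] / sqrt([n*Sxx - Sx^2] * [n*Syy - Sy^2])"""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

hours = [2, 8, 10, 6, 3]            # hours of study (x)
marks = [20, 40, 42, 24, 36]        # university exam marks (y)
print(round(pearson_r(hours, marks), 2))   # about +0.70
```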
Rank Correlation
When precise measurements of the variables are not available, the Pearson correlation coefficient cannot be calculated. Even if precise measurements are available, it is not justifiable to calculate ‘r’ unless the variables follow a normal distribution.
Spearman rank correlation is appropriate to use when either or both the variables
are measured at ordinal level of measurement, or at interval / ratio level that do not
meet the assumptions of normality. The dependent and independent variables must
be paired observations
It was introduced by Spearman and hence is known as ‘Spearman’s Rank Correlation’, denoted by ‘R’. Like the Pearson correlation coefficient, the Spearman rank correlation coefficient ranges between −1 and +1, where the sign indicates the direction of the relationship and the absolute value indicates the strength of the relationship.
Example:
Steps involved:
1. Reconstruct the table with additional columns d = [Rank
given by 1st examiner (x) – 2nd Examiner (y)] and d2
Candidate | Examiner I (x) | Examiner II (y) | d = x − y | d²
A | 2 | 3 | −1 | 1
B | 1 | 4 | −3 | 9
C | 4 | 2 | 2 | 4
D | 5 | 5 | 0 | 0
E | 3 | 1 | 2 | 4
F | 7 | 6 | 1 | 1
G | 6 | 7 | −1 | 1
Total (n = number of candidates = 7) | | | | Σd² = 20
Rank correlation R = 1 − 6Σd² / [n(n² − 1)] = 1 − (6 × 20) / [7(7² − 1)] = 1 − 120/336 = 1 − 0.36 = +0.64
Interpretation : Since the rank correlation R = +0.64, which is more than +0.5, it is suggestive of a moderate positive correlation, i.e. if the marks given by one examiner increase, the marks given by examiner 2 also tend to increase.
Qn. 2 Information on the IQ and personality scores of 6 students is given below. Study the relationship between these two variables using the Spearman correlation.
Sl no | IQ index (x) | Rank of x | Personality score (y) | Rank of y | Difference in ranks (d) | d²
1 | 10 | 6 | 9 | 5.5 | 0.5 | 0.25
2 | 9 | 4.5 | 9 | 5.5 | −1 | 1
3 | 6 | 2 | 7 | 3 | −1 | 1
4 | 9 | 4.5 | 5 | 1.5 | 3 | 9
5 | 8 | 3 | 8 | 4 | −1 | 1
6 | 4 | 1 | 5 | 1.5 | −0.5 | 0.25
Total | | | | | | Σd² = 12.5
R = 1 − 6Σd² / [n(n² − 1)] = 1 − (6 × 12.5) / [6(6² − 1)] = 1 − 75/210 = 1 − 0.36 = 0.64
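A minimal Python sketch that ranks the observations (giving tied values the average of their ranks, as in the table above) and applies R = 1 − 6Σd²/[n(n² − 1)]; it reproduces the IQ and personality example.

```python
def average_ranks(values):
    """Rank values (1 = smallest); tied values share the average of their ranks."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1          # first 1-based position of v
        count = ordered.count(v)
        ranks.append(first + (count - 1) / 2)
    return ranks

def spearman_R(x, y):
    rx, ry = average_ranks(x), average_ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    n = len(x)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

iq          = [10, 9, 6, 9, 8, 4]
personality = [9, 9, 7, 5, 8, 5]
print(round(spearman_R(iq, personality), 2))   # about 0.64
```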
Scatter diagram
This is useful to assess the relationship between two variables. In plotting data of
this type, one variable is placed on the x –axis and the second variable on the y –
axis. The (x,y) points are indicated by means of dots. The pattern made by these
dots is indicative of a possible relationship between two variables. The scatter
may be linear, curve linear, or exponential.
Example : The data of 10 albino rats on intake of proteins and gain on weight.
[Figure: Scatter diagram of protein intake versus gain in weight (gm) for 10 albino rats]
A scatter plot is a graph with points plotted to show a possible relationship between two sets of data. In plotting data of this type, one variable is placed on the X axis and the second variable on the Y axis. The (x, y) points are indicated by means of dots. After plotting all the observations, one can see how the values on the Y axis are scattered for a given value on the X axis and, similarly, how the values on the X axis are scattered for a given value on the Y axis. This diagram shows the scatter of the two variables with respect to each other.
The dependent variable represents the output or outcome whose variation is being studied. The independent variables represent inputs or causes, i.e. potential reasons for variation.
Regression
Regression analysis is a method of developing a mathematical equation that predicts the dependent variable for a given value of the independent variable, based on a sample of measurements of both the dependent and independent variables.
Linear Regression
The first step in regression is to draw a scatter diagram for the independent variable X and the dependent variable Y. If the trend in the scatter diagram of the values of the dependent variable Y for given values of X happens to be like a line, the regression technique applied for the estimation of the dependent variable Y is called linear regression.
If one independent variable is used to develop a linear equation that describes the
relationship between dependent and independent variable, such linear regression is
called simple linear regression.
The concept of regression lies in identifying a line, called the regression line, that is nearest to the data points marked on the scatter diagram, so that for a given value of X a close prediction of the value of Y can be made. For this, the observed data are used to find the mathematical quantities of the equation of a straight line. The identified line passes through the point whose coordinates are the means of X and Y. Mathematically, the equation of the straight line is
Ŷ = β̂0 + β̂1X
Fitting a regression line to the data means obtaining the values of β̂1 and β̂0 in the equation. The method of estimating the slope and the intercept of the linear regression is called the least squares method.
The method of least squares ensures that the sum of squares of all vertical distances between the fitted regression line and the data points is the least among all possible regression lines for this data.
The expressions for the least squares estimates of the slope and the intercept are obtained as

β̂1 = SSxy / SSxx = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

Equivalently, the regression coefficient β̂1 = [Σxy − (Σx)(Σy)/n] / [Σx² − (Σx)²/n]

Intercept: β̂0 = ȳ − β̂1 x̄
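A minimal Python sketch of the least squares estimates written above (slope = SSxy/SSxx, intercept = ȳ − β̂1x̄); the protein and weight-gain numbers are invented for illustration and are not the albino rat data from the scatter diagram.

```python
def least_squares_fit(x, y):
    """Return (intercept, slope) of the simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx              # slope
    b0 = y_bar - b1 * x_bar         # intercept
    return b0, b1

# hypothetical protein intake (g) and gain in weight (gm)
protein = [10, 11, 12, 13, 14, 15, 16]
gain    = [12, 13, 15, 16, 18, 20, 21]
b0, b1 = least_squares_fit(protein, gain)
print(f"Y = {b0:.2f} + {b1:.2f} X")   # predicted gain for a given protein intake
```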
Many statements are made with certain elements of uncertainty – it may probably rain tomorrow, a patient will probably recover after surgery, a newly tested drug may be effective, etc. No conclusion can be drawn with 100 per cent certainty. Probability is the measure of the chance or uncertainty associated with a conclusion.
If an event can occur in N mutually exclusive and equally likely ways and if m of
them possess a specific characteristic E, then
P(E) = m/N
Probability is usually expressed by the symbol ‘p’. It ranges from zero to one. When
p=0, it means that no chance of an event happening or its occurrence is impossible
(example, an animal giving birth to a human child). If p=1, it means that the chances
of an event happening are 100% (example, death for any living being).
Types of Probability ; There are two types of probability (a) mathematical and (b)
statistical
To find out the probabilities in all the above problems, evidence based on
empirical data is required.
1. If past experience indicates that 1000 first pregnancies resulted in the delivery of 530 girls, the probability of getting a girl in the first pregnancy is 530/1000 = 0.53.
2. If past data show that out of 200 cases of kidney transplantation, 80 succeeded, then the probability of survival after transplantation is 80/200 = 0.4.
3. If past data show that out of 1000 first deliveries, 180 developed postpartum depression, the probability that a woman develops postpartum depression after her first delivery is 180/1000 = 0.18.
Sum of the numbers obtained with two dice:

Die 2 \ Die 1 | 1 | 2 | 3 | 4 | 5 | 6
6 | 7 | 8 | 9 | 10 | 11 | 12
5 | 6 | 7 | 8 | 9 | 10 | 11
4 | 5 | 6 | 7 | 8 | 9 | 10
3 | 4 | 5 | 6 | 7 | 8 | 9
2 | 3 | 4 | 5 | 6 | 7 | 8
1 | 2 | 3 | 4 | 5 | 6 | 7
Theoretical probability – the total number of ways of getting a sum of 7 with two dice, divided by the total number of outcomes, is 6/36 = 0.167.
If probability of an event happening is p and that of not happening is defined by q,
then q=1-p or p+q=1
PROBABILITY LAWS
Law of Additivity
If A and B are mutually exclusive outcomes, then the probability that either A or B will occur is P(A or B) = P(A) + P(B).
Probabilities:
P(2) = 1/6
P(5) = 1/6
P(2 or 5) = P(2) + P(5) = 1/6 + 1/6 = 2/6 = 1/3
Example 3: A glass jar contains 1 red, 3 green, 2 blue, and 4 yellow
marbles. If a single marble is chosen at random from the jar,
what is the probability that it is yellow or green?
Probabilities:
P(yellow) = 4/10
P(green) = 3/10
P(yellow or green) = P(yellow) + P(green) = 4/10 + 3/10 = 7/10
The probability that independent events will occur jointly is the product of the probabilities of each event. If A and B are independent events, then the probability that both A and B will occur, P(AB), is P(AB) = P(A) × P(B).
Example 1: What is the probability of tossing a coin twice and getting a head on each toss?
Probabilities:
Probability of getting a head on the first toss: P(H1) = 1/2
Probability of getting a head on the second toss: P(H2) = 1/2
P(H1H2) = P(H1) × P(H2) = 1/2 × 1/2 = 1/4
Example 2: A coin is tossed and a single 6-sided die is rolled. Find the probability of
landing on the head side of the coin and rolling a 3 on the die.
Probabilities:
P(head) = 1/2
P(3) = 1/6
P(head and 3) = P(head) × P(3) = 1/2 × 1/6 = 1/12
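A minimal Python simulation that checks the multiplication law used in Example 2; the trial count and the random seed are arbitrary choices.

```python
import random

# P(head on a coin AND 3 on a die) should be close to 1/2 * 1/6 = 1/12 (about 0.083)
random.seed(1)
trials = 100_000
hits = 0
for _ in range(trials):
    coin = random.choice(["H", "T"])
    die = random.randint(1, 6)
    if coin == "H" and die == 3:
        hits += 1
print(hits / trials)   # close to 0.083
```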
Example 3: Newborns classified by blood ‘Rh’ factor and sex are cited below. From the table, the probability of a child being female with Rh +ve = 45/100 = 0.45.
The probability that event B occurs, given that event A has already occurred, is P(B | A) = P(A and B) / P(A).
Example 1 : If the analysis of records shows that 90 per cent of a large number of patients with abdominal tuberculosis came with complaints of pain in the abdomen, vomiting and constipation of long duration, then the conditional probability of these complaints, given abdominal tuberculosis, is 0.90. The conditional probability is restricted to a specific group; in the above example the restricted group is the patients with abdominal TB.
Uses of probability
2. All the 3 measures of Central tendency – mean, median and mode are
equal (i.e. mean = median = mode).
6. If the total area under the curve is considered as 100%, (Mean ± 0.67 S.D.) covers 50% of the area, (Mean ± 1.96 S.D.) covers 95% of the area and (Mean ± 2.58 S.D.) covers 99% of the area.
7. The 1st Quartile value (Q1) and 3rd Quartile value (Q3) are equidistant from
the Mean (Q2).
5. Various other statistical tests, like the t-test, chi-square test, F test, etc., are also developed on the basis of normality principles. In short, the entire body of tests of significance is founded upon the principles of normality.
6. During generalization of results, the researcher often has to predict a possible interval in which 95 per cent of the sample estimates may lie (the 95% confidence interval). Its calculation is made possible through the theory of the normal distribution.
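A minimal Python check of the areas quoted in point 6, using the error function for the standard normal distribution; no other assumptions are involved.

```python
from math import erf, sqrt

def area_within(z):
    """Proportion of a normal distribution lying within mean ± z standard deviations."""
    return erf(z / sqrt(2))

for z in (0.67, 1.96, 2.58):
    print(z, round(area_within(z), 3))
# 0.67 -> about 0.50, 1.96 -> about 0.95, 2.58 -> about 0.99
```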
The deviation from normality can be of two types : skewness and Kurtosis
Negative Skewness: If more values are concentrated above the median, it is called negative skewness. A negatively skewed distribution has a long tail to the left; the mean and median are both less than the mode, and the mean is less than the median. For a negatively skewed distribution, Mean < Median < Mode.
Positive Skewness: If more values are concentrated below the median, it is called positive skewness. A positively skewed distribution has a long tail to the right; the mean and median are both greater than the mode, and the mean is greater than the median. For a positively skewed distribution, Mean > Median > Mode.
If the values are symmetrically distributed on either side of the median, there is no
skewness and it is normal curve. A normal distribution curve has skewness zero.
There are many measures of skewness. A simple measure of skewness (Karl Pearson's coefficient) is
Skewness = (Mean − Mode) / S.D.
or, equivalently,
Skewness = 3(Mean − Median) / S.D.
Mesokurtic : A distribution that is peaked in the same way as any normal distribution
is said to be mesokurtic or normokurtic. The peak of a mesokurtic distribution is
neither high nor low rather it is considered to be a baseline for the two other
classifications
Leptokurtic : Leptokurtic distributions are those that have a peak greater than that of a mesokurtic distribution. Leptokurtic distributions are identified by peaks that are thin and tall.
Platykurtic : Platykurtic distributions are those that have a peak lower than a
mesokurtic distribution. Platykurtic distributions are characterized by a certain
flatness to the peak, and have slender tails
There are many measures of kurtosis; a simple formula for kurtosis is given by K = (Q3 − Q1) / [2(P90 − P10)], where Q1 = first quartile, Q3 = third quartile, P10 = 10th percentile and P90 = 90th percentile.
If K = 0.263, the curve is normokurtic. If it is greater than 0.263, it is platykurtic. If it is less than 0.263, it is leptokurtic.
Sampling error
For example, if one measures the height of a thousand individuals from a place with a population of one lakh, the average height of the thousand is typically not the same as the average height of all one lakh people of that place. Since sampling is typically done to determine the characteristics of a whole population, the difference between the sample and population values is considered to be sampling error.
Exact measurement of sampling error is generally not feasible since the true
population values are unknown
DESIGN OF EXPERIMENTS
It can be defined as “the logical construction of the experiment in which the degree of uncertainty with which the inference is drawn may be well defined”.
The specific questions that the experiment is intended to answer must be clearly
identified before carrying out the experiment.
Experiment
An experiment is a device or a means of getting an answer to problem under
consideration. Experiment can be classified into two categories as Absolute and
Comparative.
Absolute experiments consist in determining the absolute value of some characteristic, e.g. (i) obtaining the average intelligence quotient of a group of people, (ii) finding the correlation coefficient between two variables in a bivariate distribution.
Comparative experiments are designed to compare the effect of two or more
objects on some population characteristics.
Treatments
Various objects of comparison in a comparative experiment are termed as
treatments.
Example 1. A corn field is divided into four parts; each part is ‘treated’ with a different fertiliser to see which produces the most corn.
Example 2. A teacher practises different teaching methods on different groups in her class to see which yields the best results.
Example 3. A doctor treats a patient with a skin condition with different creams to see which is most effective.
Experimental units
The smallest division of the experimental material to which we apply the
treatments and on which we make observations on the variable under study is
termed as experimental unit. e.g., in field experiments, the plot of “land” is the
experimental unit.
Blocks
In agricultural experiments, most of the times we divide the whole
experimental units into relatively homogeneous sub-groups or strata. These strata,
which are more uniform amongst themselves than the field as a whole are known as
blocks.
Yield
The measurement of the variable under study on different experimental units
are termed as yields.
Example : A farmer wishes to evaluate a new fertilizer. He uses the new fertilizer on
one field of crops (A), while using his current fertilizer on another field of crops (B).
The irrigation system on field A has recently been repaired and provides adequate
water to all of the crops, while the system on field B will not be repaired until next
season. He concludes that the new fertilizer is far superior.
The problem with this experiment is that the farmer has neglected to control for the
effect of the differences in irrigation. This leads to experimental bias, the favoring of
certain outcomes over others. To avoid this bias, the farmer should have tested the
new fertilizer in identical conditions to the control group, which did not receive the
treatment. Without controlling for outside variables, the farmer cannot conclude that
it was the effect of the fertilizer, and not the irrigation system, that produced a better
yield of crops.
Another type of bias that is most apparent in medical experiments is the placebo effect. Since many patients are confident that a treatment will positively affect them, they react to a control treatment which actually has no physical effect at all, such as a sugar pill. For this reason, it is important to include control, or placebo, groups in medical experiments to evaluate the difference between the placebo effect and the actual effect of the treatment.
The simple existence of placebo groups is sometimes not sufficient for avoiding bias
in experiments. If members of the placebo group have any knowledge (or suspicion)
that they are not being given an actual treatment, then the effect of the treatment
cannot be accurately assessed. For this reason, double-blind experiments are
generally preferable. In this case, neither the experimenters nor the subjects are
aware of the subjects' group status. This eliminates the possibility that the
experimenters will treat the placebo group differently from the treatment group,
further reducing experimental bias.
Experimental Error
A large homogeneous field is divided into different plots and different
treatments are applied to these plots. Experience tells us that even if the same
treatment is used on all plots, the yields would still vary due to the difference in soil
fertility. Such variation from plot to plot, which is due to random factors beyond
human control, is called as experimental error.
Replication
Replication means ‘the repetition of the treatments under investigation’.
Randomization
‘Randomization’ is the process of assigning the treatments to the various experimental units in a purely chance manner.
Local Control
The process of reducing the experimental error by dividing the relatively heterogeneous experimental area (field) into homogeneous blocks is known as local control.
Randomization
It is the method of creating homogeneous treatment groups to eliminate any potential biases.
One standard method for assigning subjects to treatment groups is to label each
subject, then use a table of random numbers to select from the labeled subjects.
If, for instance, an experimenter had reason to believe that age might be a significant
factor in the effect of a given medication, he might choose to first divide the
experimental subjects into age groups, such as under 30 years old, 30-60 years old,
and over 60 years old. Then, within each age level, individuals would be assigned to
treatment groups using a completely randomized design.
Example
A researcher is carrying out a study of the effectiveness of four different skin creams
for the treatment of a certain skin disease. He has eighty subjects and plans to divide
them into 4 treatment groups of twenty subjects each. Using a randomized block
design, the subjects are assessed and put in blocks of four according to how severe
their skin condition is; the four most severe cases are the first block, the next four
most severe cases are the second block, and so on to the twentieth block. The four
members of each block are then randomly assigned, one to each of the four
treatment groups
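A minimal Python sketch of random assignment within blocks, in the spirit of the randomized block design just described; the subject IDs, the severity ordering and the cream names are placeholders, not details from the study.

```python
import random

random.seed(42)
treatments = ["Cream A", "Cream B", "Cream C", "Cream D"]
subjects = [f"S{i:02d}" for i in range(1, 81)]    # 80 subjects, assumed ordered by severity

for start in range(0, len(subjects), 4):          # each block = 4 consecutive subjects
    block = subjects[start:start + 4]
    order = treatments[:]
    random.shuffle(order)                         # random assignment within the block
    for subject, treatment in zip(block, order):
        print(subject, treatment)
```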
A matched pairs design is a special case of the randomized block design. It is used
when the experiment has only two treatment conditions; and participants can be
grouped into pairs, based on some blocking variable. Then, within each pair,
participants are randomly assigned to different treatments.
Consider a matched pairs design for the above-mentioned skin disease experiment. The 80 participants are grouped into 40 matched pairs; each pair is matched on gender and age. For example, pair 1 might be two women, both aged 21; pair 2 might be two women, both aged 22, and so on.
For the above example, the matched pairs design is an improvement over the
completely randomized design and the randomized block design. Like the other
designs, the matched pairs design uses randomization to control for confounding.
Example : with 4 treatments A, B, C and D, one typical arrangement of a 4×4 LSD (Latin square design) is given below.
A B C D
B A D C
C D A B
D C B A
TESTING OF HYPOTHESIS OR TESTS OF SIGNIFICANCE
Essentially, a hypothesis is an informed guess and a test of significance is a search for its proof. It is a search for the truth when two or more arguments appear to be ‘apparently correct’.
Hypotheses are the anticipated results; they are the starting point for any research. A hypothesis is usually formulated on the basis of a literature search, material evidence, real-life experience or even intuition, and the arguments put forward to accept or reject the claim are often termed tests of significance.
The methodology used to see whether the difference between a sample estimate (statistic) and the true value of the population (parameter), or between two or more independent sample estimates, is due to sampling variation (the peculiar nature of the sample) or otherwise is called testing of hypothesis or a test of significance. It measures the strength of evidence for believing that the claims put forward by the investigator are true or false.
The testing of hypothesis is often equated to a criminal trial, because every citizen in India is considered innocent before the court of law till his guilt is proved. Similarly, till the statistical significance is tested with suitable tests and proved, it is believed that there is no difference between the groups, or between the sample estimate and the true value of the population.
For example, if a researcher intends to test the association between smoking and cancer, he starts with the hypothesis that there is no association between smoking and cancer. Such a hypothesis, indicating ‘no difference’ or ‘equal’, is called the ‘Null Hypothesis’, denoted by ‘H0’.
Example : 10 patients are waiting in the outpatient section. The doctor can call any one of them as the first patient, the 2nd patient, the 3rd patient and so on with some amount of freedom, but the 10th patient must go last, and for her there is no freedom to be called or to enter, since she is the last one in the queue. So in a sample of 10 patients, all 9 patients except the last one have some amount of freedom to enter the study as the 1st, 2nd, 3rd, and so on. So if n is the sample size, the degrees of freedom (d.f.) are always one less than the total number, i.e. (n − 1). Thus, in a one-sample test with n items the degrees of freedom will be (n − 1), and in a two-sample test with sample sizes n1 = 10 (first sample) and n2 = 20 (second sample), the total d.f. = (n1 − 1) + (n2 − 1) = (n1 + n2 − 2) = 10 + 20 − 2 = 28.
6. Compare the calculated value of the test statistic with the table value and
interpret
Refer to the appropriate statistical table (for a normality test refer to the normal table, for a t-test the t table, for a chi-square test the chi-square table, and so on). As a general rule, refer to the table at the 5 per cent probability level (p = 0.05) corresponding to the appropriate degrees of freedom to get the minimum level of significance. If the calculated value of the statistic (z, t, chi-square or F) is more than the table value, the test is statistically significant at the 5 per cent level, i.e. the probability that the observed difference is due only to sampling variation is less than 5 in 100; in other words, in 95 per cent or more of such sample studies the argument of the researcher (the alternate hypothesis) may be true. If the calculated value is less than or equal to the table value, the inference is that the test is not statistically significant (p > 0.05), which means the null hypothesis is accepted, that is, there is no difference between the two groups, and any difference present numerically may be due to sampling variation.
Generally, the calculated values of the test statistic are compared with the table
value (2 sided or 2 tail).
Table x summarizes the state of affairs in the population and the nature of Type I and Type II errors.

State of the null hypothesis in the population | Decision: Accept H0 | Decision: Reject H0
H0 is true | Correct – no error | Type I error (α error)
H0 is false | Type II error (β error) | Correct – no error
If the decision has been made to reject the null hypothesis and, in fact, the null hypothesis is true, we have made a Type I error. A Type I error occurs when the researcher concludes that there is a statistically significant difference when in reality it does not exist. Type I and Type II errors are inversely related: when the probability of a Type II error decreases, the probability of a Type I error increases.
A Type I error has the probability of alpha (α), the level of statistical significance that
the researcher has set up.
If the alternative hypothesis is in fact true and the null hypothesis is actually false, but the decision maker concludes that the null hypothesis should not be rejected, then we have made what is called a Type II error. The probability of making this incorrect decision is called beta (β). The quantity (1 − β) is called the power of the test.
No error is made if the null hypothesis is true and the decision is made to
accept it. A correct decision is also made if the null hypothesis is false and the
decision is made to reject the null hypothesis.
b) Confidence Interval
Confidence interval estimation is one way to make inference about the parameter.
After drawing a random sample of adequate size, a sample statistic, either the sample mean or the sample proportion, is calculated. This value is called the point estimate. Then the researcher defines an interval around this value within which the population value is likely to lie. Since the researcher uses a sample, the first step in constructing an interval estimate is to decide on the risk the researcher is willing to take of being wrong. An interval estimate is wrong if it does not contain the population parameter. This probability of error is called α (alpha).
The exact value of α will depend on the nature of research question, but a 5 %
(.05) probability is commonly used. Setting α = .05,also called 95% confidence level,
means that over the long run the researcher is willing to be wrong only 5% of the
time. In other words, if the researcher draws 100 random samples of same size and
if the limits are calculated every time then it may not contain the parameter in 5
times.
From the table of the Normal distribution, it is known that mean ± 1.96 SD will
contain 95 per cent of the cases and only 5 per cent of the cases will lie outside it.
Similarly, mean ± 2.58 SD will contain 99 per cent of the cases and only 1 per cent
will lie outside. These properties can be used in the estimation procedure.
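For instance, the 95% and 99% limits for a mean follow directly from these multipliers applied to the standard error of the mean; the sketch below uses made-up summary figures (mean, SD and n are hypothetical).

```python
# Sketch: 95% and 99% confidence intervals for a mean (hypothetical data).
import math

mean, sd, n = 11.5, 1.8, 100           # hypothetical sample mean, SD and size
se = sd / math.sqrt(n)                 # standard error of the mean

ci95 = (mean - 1.96 * se, mean + 1.96 * se)
ci99 = (mean - 2.58 * se, mean + 2.58 * se)
print(f"95% CI: {ci95[0]:.2f} to {ci95[1]:.2f}")
print(f"99% CI: {ci99[0]:.2f} to {ci99[1]:.2f}")
```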
Introduction
There are various types of problems for which tests of significance are used for
drawing conclusions. Different types of problems need different tests, but the basis of
all tests and the steps involved in the procedure are the same. The common types of
problems are
PROCEDURE
Where
S = √[(n1S1² + n2S2²) / (n1 + n2)]
n1 - Size of the first sample
n2 - Size of the second sample
S1, S2 - Standard deviations of the first and second samples
Example:
Solution:
Given
                Sample 1    Sample 2
Sample size     62          76
Mean            15.5        20.0
SD              6.5         7.1
(b) Null hypothesis. There is no difference between the means of the hearing
thresholds taken in the sound proof room and in the field, that is, the two samples
have come from the same population.
Critical ratio = (x̄1 − x̄2) / [S √(1/n1 + 1/n2)] = 3.846
(e) Comparison with the theoretical value. The probability of observing this
value (3.846) or a greater value by chance is less than 1%. Hence the null hypothesis
is rejected (p<0.01).
(f) Inference. There is evidence to believe that the hearing level tested in the
sound-proof room is different from the hearing level tested in the field.
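The critical ratio of 3.846 can be reproduced from the summary figures tabulated above; the following sketch simply codes the pooled-SD formula used in this example, with no numbers beyond those already given.

```python
# Sketch: reproduce the critical ratio from the summary statistics above.
import math

n1, mean1, sd1 = 62, 15.5, 6.5
n2, mean2, sd2 = 76, 20.0, 7.1

# Pooled standard deviation: S = sqrt((n1*S1^2 + n2*S2^2) / (n1 + n2))
s = math.sqrt((n1 * sd1**2 + n2 * sd2**2) / (n1 + n2))
se = s * math.sqrt(1 / n1 + 1 / n2)          # standard error of the difference
ratio = abs(mean1 - mean2) / se

print(f"S = {s:.3f}, SE = {se:.3f}, critical ratio = {ratio:.3f}")   # about 3.846
```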
t = (x̄1 − x̄2) / [S √(1/n1 + 1/n2)]
SE = S √(1/n1 + 1/n2)
where
S = √[(Σ(x1 − x̄1)² + Σ(x2 − x̄2)²) / (n1 + n2 − 2)] is the pooled standard deviation of the two samples
Example:
In the feeding trial, 17 children were given a high protein food supplement in addition
to their normal diet and 15 comparable children were kept on the normal diet. The
total calorie intake per child per day in the high protein group was 1296 and in the
control group 1293. They were kept on this feeding trial for a period of seven months.
At the end of the study, the changes (initial – final) in the haemoglobin (g%) level of
the two groups were assessed and are given in Table 12.2. Does it provide any
evidence to say that the change in the haemoglobin level of the children who
received the high protein food is different from that of the control group?
Table 12.2 Changes in Haemoglobin Levels of Children in the High Protein Diet and
Control Groups
(b) Null hypothesis. The two samples have come from the population with
same mean. In other words, there is no difference in means of the change in
haemoglobin values between the children fed on the high protein diet and normal
diet.
(c) Standard error of the difference in means. This estimate of the standard
error of difference in means is given by the formula.
SE = S √(1/n1 + 1/n2)
In this problem,
S = √[(41.2704 + 33.8646) / (17 + 15 − 2)] = 1.5826 g%
t = (x̄1 − x̄2) / [S √(1/n1 + 1/n2)] = 2.923
(e) Comparison with the theoretical value. This critical ratio, t, follows a
t-distribution with n1 + n2 − 2 = (17 + 15 − 2 = 30) degrees of freedom. The
t-distribution for 30 degrees of freedom gives the 5% level as 2.042 and the 1% level
as 2.750. Our observed value is 2.923, which is greater than the 1% level. This
means the probability of getting by chance a value as large as 2.923 or greater is
less than 1% (p < 0.01).
(f) Inference. This experiment provides evidence to show that the mean
change in the haemoglobin level of the children fed on high protein diet is different
from the mean change in the haemoglobin level of the children fed on normal diet.
Comparison of Means of Two Correlated Samples (i.e., with the Same Subjects
in Both Groups): The Paired t-test
Critical Ratio.
t = (d̄ − 0) / (S / √n)
where
S = √[Σ(d − d̄)² / (n − 1)]
Example
Solution
(b) Null hypothesis. The sample is taken from the population in which there is
no difference in the skin-fold thickness.
(c) Standard error of the mean of difference. The mean of the difference is
0.75 mm. The estimate of the population standard deviation is given by
S = √[Σ(d − d̄)² / (n − 1)]
Where n stands for the number of subjects included in the study. In this problem,
S = √(20.25 / 11) = 1.357 mm
t = (d̄ − 0) / (S / √n) = 0.75 / (1.357 / √12) = 1.91
(e) Comparison with the theoretical value. This critical ratio, t, follows a t-
distribution with n − 1 (12 − 1 = 11) degrees of freedom. The 5% level is 2.201 and
the 1% level is 3.106 for 11 degrees of freedom. The value 1.91 is less than the 5%
level. This means the probability of getting by chance a value as large as 1.91 or
greater is more than 5%. Therefore, we do not reject the null hypothesis.
(f) Inference. This experiment does not provide any evidence to say that there
is a difference between the initial and final values of skin-fold thickness.
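The paired t value of 1.91 follows directly from the mean difference (0.75 mm), the standard deviation of the differences (1.357 mm) and n = 12 quoted above; the sketch below redoes this arithmetic, and the commented line shows how the same test could be run from raw paired data with scipy.

```python
# Sketch: paired t-test from the summary figures quoted above.
import math

d_bar, s_d, n = 0.75, 1.357, 12       # mean difference, SD of differences, pairs
t = (d_bar - 0) / (s_d / math.sqrt(n))
print(f"t = {t:.2f} on {n - 1} degrees of freedom")   # about 1.91

# With raw before/after measurements the equivalent call would be
# scipy.stats.ttest_rel(before, after).
```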
Elements of Analysis of Variance (F Test)
An ANOVA (analysis of variance), sometimes called the F test, is closely related to
the t test. The major difference is that, while the t test measures the difference
between the means of two groups, an ANOVA tests the difference between the
means of two or more groups.
Thus, while comparing the means of more than two groups, the total
variability is divided into two parts: (i) the share attributable to the assignable cause,
often termed the 'between group variation', and (ii) the share attributable to the
chance cause, called the 'within group variation', and the two are compared. It is
difficult to estimate directly the portion of the variation due to chance, because it is
beyond the control of the researcher. Still, we can estimate the total variation due to
all factors, from which the portion due to the assignable cause is subtracted to obtain
the portion due to the chance cause. Then the ratio of the variability due to the
assignable cause to the variability due to the chance cause (error) is called the
F value.
F = Mean variability between groups / Mean variability within groups (error)
As F is computed as a ratio, it is also known as the F ratio.
A one-way ANOVA, or single factor ANOVA, tests differences between groups that
are classified on only one independent variable. One potential drawback of an
ANOVA is that the F value tells only that there is a significant difference among the
groups, not which groups are significantly different from each other. To find out
where the differences exist, post-hoc comparisons are used. Some commonly used
post-hoc tests are Scheffé's and Tukey's.
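A one-way ANOVA on hypothetical data (the three groups and their values below are made up for illustration) could be run as follows; scipy.stats.f_oneway returns the F ratio described above and its p-value.

```python
# Sketch: one-way ANOVA on three hypothetical groups of measurements.
from scipy import stats

group_a = [12.1, 13.4, 11.8, 12.9, 13.0]
group_b = [14.2, 15.1, 14.8, 13.9, 15.3]
group_c = [12.5, 12.9, 13.1, 12.2, 13.6]

f_ratio, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_ratio:.2f}, p = {p_value:.4f}")
# A significant F only says the group means differ somewhere;
# a post-hoc test (e.g. Tukey's) is needed to locate the difference.
```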
The Chi-square test is a 'non-parametric test' or 'distribution free test'. It is a 'test for
association' or 'test for independence'. This test is most commonly used when the
data are in frequencies, such as the number of responses in two or more categories.
It can be used with any data which can be reduced to proportions or percentages.
The Chi-square test is denoted by the Greek letter χ² and is pronounced
'kye square'.
Chi-square (χ²) = Σ (O − E)² / E
Expected frequency (E) = (Row total × Column total) / Grand total
If no relationship exists between the row and column variables, the value of χ² will
be small. If there is a relationship between the variables, the value of the χ² statistic
will be large.
The d.f. is calculated by (C − 1)(R − 1), where C = number of columns and R =
number of rows.
The chi-square test can be used irrespective of the size of the sample. Still, it is
advisable to use it only if the expected frequency is more than 5 in each cell, and it is
not at all advisable if any expected frequency is zero.
However, in case the expected frequency is ≤5 in any cell, a correction formula
suggested by Yates is recommended. If Yates' correction is applied, the general
formula for the computation of chi-square is given below:
Chi-square (χ²) = Σ (|O − E| − ½)² / E
Illustration
Hypothetical data for chi-square showing the effectiveness of a new type of surgery
χ² = 9 + 9 + 3 + 3 = 24
Since the calculated value is greater than the table value, the observed
difference is statistically significant. The percentage having improvement in condition
is more in the treatment group (40%).
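Since the original 2 × 2 table for this illustration is not reproduced here, the sketch below uses a hypothetical 2 × 2 table (all counts are made up) to show how the chi-square statistic, its degrees of freedom and the expected frequencies are obtained; scipy applies Yates' correction for 2 × 2 tables when correction=True.

```python
# Sketch: chi-square test of association on a hypothetical 2x2 table.
import numpy as np
from scipy import stats

# Rows: treatment / control; columns: improved / not improved (made-up counts).
observed = np.array([[40, 60],
                     [20, 80]])

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=True)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")
print("Expected frequencies:\n", expected.round(2))
```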
Mann-Whitney U Test
U1 = n1n2 + n1(n1 + 1)/2 − R1
U2 = n1n2 + n2(n2 + 1)/2 − R2
Where n1 = First sample size
n2 = Second sample size
R1 = Sum of ranks of the first group
R2 = Sum of ranks of the second group
Now U is the minimum of (U1, U2). Tables of U values are available for different
levels of significance and for different values of n1 and n2. For statistical significance,
the calculated value must be equal to or lower than the table value.
Alternatively, the z statistic can be computed as follows:
Z = (U − n1n2/2) / √[n1n2(n1 + n2 + 1) / 12]
Now, the critical value of z can be used to assess the statistical significance.
Example
A survey was conducted among mothers to ascertain their views on the PNDT Act.
One group consisted of primigravid mothers and the other of multigravid mothers;
the scores are given below. Can the researcher conclude that the opinions are the
same for the two groups?
Opinion on the PNDT Act
Subject   Score of primigravida   Rank   Score of multigravida   Rank
1         12                      3.5    14                      7
2         14                      7      25                      18
3         12                      3.5    17                      11
4         13                      5      18                      12
5         20                      14     19                      13
6         16                      10     21                      15
7         15                      9      24                      17
8         14                      7      22                      16
9         9                       1
10        11                      2
          n1 = 10       R1 = 62          n2 = 8        R2 = 109
U1 = n1n2 + n1(n1 + 1)/2 − R1
   = 10 × 8 + 10(10 + 1)/2 − 62
   = 73
U2 = n1n2 + n2(n2 + 1)/2 − R2
   = 10 × 8 + 8(8 + 1)/2 − 109
   = 7
z = (U − n1n2/2) / √[n1n2(n1 + n2 + 1) / 12]
  = (7 − 10 × 8/2) / √[10 × 8(10 + 8 + 1) / 12]
  = (7 − 40) / √126.67
  = −33 / 11.25
  = −2.93 (i.e. |z| = 2.93)
Since the calculated value of |z| is greater than the critical value at the 5% level
(1.96), the observed difference is significant. Multigravid women favour the act more
than primigravid women (find the mean and median of the two sets for clarity).
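The same calculation can be checked in code; the sketch below re-ranks the two sets of scores from the table above with scipy's rankdata (which handles ties with mean ranks, as in the table) and applies the U formulas of the text. Note that scipy.stats.mannwhitneyu uses a slightly different, but equivalent, convention for U, so the manual formulas are used here.

```python
# Sketch: Mann-Whitney U for the PNDT-opinion scores tabulated above.
import numpy as np
from scipy import stats

primi = [12, 14, 12, 13, 20, 16, 15, 14, 9, 11]   # n1 = 10
multi = [14, 25, 17, 18, 19, 21, 24, 22]          # n2 = 8

ranks = stats.rankdata(primi + multi)             # mean ranks for ties
r1, r2 = ranks[:len(primi)].sum(), ranks[len(primi):].sum()
n1, n2 = len(primi), len(multi)

u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1             # 73
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2             # 7
u = min(u1, u2)

z = (u - n1 * n2 / 2) / np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
print(f"R1 = {r1}, R2 = {r2}, U1 = {u1}, U2 = {u2}, |z| = {abs(z):.2f}")  # about 2.93
```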
Wilcoxon Signed-Ranks Test
This test is an alternative to the paired 't' test. It is ideal when the level of
measurement is ordinal. It can also be used with interval or ratio level data when the
assumption of normality is not met. The computational procedure starts with finding
the difference for each pair of observations. In the next step the differences are
ranked ignoring the positive or negative signs; zero differences are ignored and are
not ranked. Assign rank 1 to the smallest difference, 2 to the next smallest, and so
on. When two or more differences are the same, use the procedure for tied ranks.
Now restore the signs to the ranks and find the sum of the positive ranks and the
negative ranks separately. Let the letter T denote the smaller of the two totals. For
significance, the calculated value of T must be equal to or lower than the table value.
Alternatively, the Z statistic can be computed as
Z = [T − n(n + 1)/4] / √[n(n + 1)(2n + 1) / 24]
Now the critical values of z can be used to assess the statistical significance.
Wilcoxon signed rank test
Knowledge scores of 13 subjects before and after an intervention programme are
given. Test whether the education programme was effective.
T= smaller of totals = 6
n=13
z = [T − n(n + 1)/4] / √[n(n + 1)(2n + 1) / 24]
  = (6 − 13 × 14/4) / √(13 × 14 × 27 / 24)
  = −39.5 / 14.31
  = −2.76 (i.e. |z| = 2.76)
Since the calculated value is greater than the table value (1.96), the observed
difference is statistically significant.
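The normal approximation above uses only T and n; the sketch below reproduces it for T = 6 and n = 13 as quoted in the example (with the raw before/after scores, scipy.stats.wilcoxon would be the direct equivalent).

```python
# Sketch: normal approximation for the Wilcoxon signed-rank statistic.
import math

T, n = 6, 13                                  # smaller rank total and sample size
mean_T = n * (n + 1) / 4                      # 45.5 under the null hypothesis
sd_T = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (T - mean_T) / sd_T
print(f"|z| = {abs(z):.2f}")                  # about 2.76, beyond 1.96

# With the raw before/after scores the equivalent call would be
# scipy.stats.wilcoxon(before, after).
```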
RELIABILITY AND VALIDITY OF TEST SCORE
A test score is called reliable when we have reasons for believing the score to be
stable and trustworthy. Stability and trustworthiness depend upon the degree to
which the score is an index of 'true ability', that is, is free of chance error. Scores
achieved on unreliable tests are neither stable nor trustworthy. In fact, a comparison
of scores made upon repetition of an unreliable test, or upon two parallel forms of the
same test, will reveal many discrepancies, some large and some small, in the two
scores made by each individual in the group. The correlation of the test with itself,
computed in several ways, is called the reliability coefficient of the test.
There are mainly three procedures in common use for computing the reliability
coefficient (sometimes called the self-correlation) of a test. These are
In the split-half method, the test is first divided into two equivalent
"halves" and the correlation found for these half-tests. From the reliability of
the half-test, the self-correlation of the whole test is then estimated by the
Spearman-Brown prophecy formula. The first set of scores, for example,
represents performance on the odd-numbered items 1, 3, 5, 7, etc.; and the
second set of scores, performance on the even-numbered items 2, 4, 6, 8, etc.
Other ways of making up two half-tests which will be comparable in content,
difficulty and susceptibility to practice are employed, but the odds-evens split
is the one most commonly used. From the self-correlation of the half-tests, the
reliability coefficient of the whole test may be estimated from the formula
r = 2r1 / (1 + r1)
(Spearman-Brown prophecy formula for estimating reliability from two comparable
halves of a test; r1 is the correlation between the two half-tests)
The split-half method is regarded by many as the best of the methods for measuring
test reliability. One of its main advantages is the fact that all data for computing
reliability are obtained on one occasion, so that variations brought about by
differences between two testing situations are eliminated. A marked disadvantage of
the split-half technique lies in the fact that chance errors may affect scores on the
two halves of the test in the same way, thus tending to make the reliability coefficient
too high. This follows because the test is administered only once. The longer the
test, the less the probability that the effects of temporary and variable disturbances
will be cumulative in one direction, and the more accurate the estimate of score
reliability.
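A split-half reliability computation on hypothetical item scores (the simulated data, number of examinees and number of items below are all made up) would correlate odd-item and even-item totals and then apply the Spearman-Brown formula given above.

```python
# Sketch: split-half reliability with the Spearman-Brown correction.
import numpy as np

rng = np.random.default_rng(1)
n_examinees, n_items = 50, 20

# Hypothetical 0/1 item scores driven by a latent "true ability",
# so the two halves share something to correlate on.
ability = rng.normal(size=(n_examinees, 1))
prob_correct = 1 / (1 + np.exp(-ability))
scores = (rng.random((n_examinees, n_items)) < prob_correct).astype(int)

odd_total = scores[:, 0::2].sum(axis=1)     # items 1, 3, 5, ...
even_total = scores[:, 1::2].sum(axis=1)    # items 2, 4, 6, ...

r_half = np.corrcoef(odd_total, even_total)[0, 1]   # correlation of the half-tests
r_full = 2 * r_half / (1 + r_half)                  # Spearman-Brown prophecy formula
print(f"half-test r = {r_half:.2f}, estimated whole-test reliability = {r_full:.2f}")
```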
VITAL STATISTICS
Vital statistics have great importance in health. They help to identify the health
problems of the community and to work out solutions for them. Vital statistics are
systematically collected information regarding the events which occur in human life,
such as births, marriages, deaths etc., in a given population.
i. Census
ii. Records of vital Registration
iii. To see the met and unmet health needs of the community
iv. To fix up priority for National Health Programmes
4. Indicators of health
i. Mortality Indicators
ii. Morbidity indicators and
i. Mortality Indicators
Mortality means death. Death is very much related to the health status of a country.
Mortality rates are generally computed without considering any of the influencing
factors like age, sex, occupation etc. Such rates, computed in a crude form, are
called the Crude Death Rate (C.D.R.). If the 'factors influencing deaths' are taken
into account at the time of computation, the rate is called a Specific Death Rate.
The word Crude means not refined. Crude Death Rate means the death rate
that is computed without looking into the specific factors responsible for death such
as age, sex, occupation etc. So the Crude Death Rate is defined as the ratio of all
deaths from all causes during one year in a specified geographical area to the total
mid-year population, expressed per 1000.
If various factors responsible for deaths are considered, such rates are called
specific death rates. The important specific death rates are:
The word infants denote children below 1 year. In this age group, deaths are
more due to various reasons and hence, it is considered as a sensitive index. The
Infant Mortality Rate (IMR) is defined as the ratio of Infant deaths (deaths below 1
year) to the total live births in one year and always expressed per 1000.
It is defined as the ratio of deaths of infants from 28 days of age to less than 1 year
to the total live births, expressed per 1000. It is computed with the formula:
The Maternal Mortality Rate is defined as the ratio of maternal deaths (deaths
during pregnancy or after delivery but within 42 days) to total live births and
expressed per 1000.
MMR = (No. of maternal deaths × 1000) / Total live births in the year
Fatality means the chance of non-survival. So the fatality rate indicates the
severity of an illness. It is computed as the ratio of the number of deaths due to a
particular disease to the total cases registered in the hospital with the same disease,
and usually it is expressed as a percentage. It is computed with the formula:
PMR (age) = (No. of deaths after 60 years of age × 100) / Total deaths at all ages
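These mortality indicators are simple ratios; the sketch below computes the crude death rate, infant mortality rate and maternal mortality rate from hypothetical counts (all figures are made up for illustration).

```python
# Sketch: common mortality indicators from hypothetical counts.
def rate(numerator, denominator, per=1000):
    return numerator * per / denominator

deaths_all_causes = 800          # deaths in one year (hypothetical)
mid_year_population = 100_000
live_births = 2_500
infant_deaths = 60               # deaths below 1 year of age
maternal_deaths = 4              # deaths during pregnancy or within 42 days of delivery

cdr = rate(deaths_all_causes, mid_year_population)   # per 1000 mid-year population
imr = rate(infant_deaths, live_births)               # per 1000 live births
mmr = rate(maternal_deaths, live_births)             # per 1000 live births

print(f"CDR = {cdr:.1f}, IMR = {imr:.1f}, MMR = {mmr:.2f} (all per 1000)")
```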
a) Prevalence rate.
1. Population at risk
In mortality statistics, the base for the computation of various indices is generally the
mid-year population. But in the case of morbidity, the rates are computed by
considering the population at risk. It is defined as the aggregate of all the people
who are supposed to have the risk of getting the disease; it may be the total persons
residing in a geographic area, the total persons working in a factory, the students
and teachers assembled in a class, all family members residing together, and so on.
2. Incidence Rate
Incidence indicates new sickness. So incidence rates are used to measure the
number of newly affected persons or the new spells of sickness reported. If the
number of persons developing the disease during a specified time is considered, it is
called the Incidence Rate (persons), which is defined as the ratio of persons newly
affected to the population at risk and expressed per 1000 or even per 100. Likewise,
the Incidence Rate (spells or episodes) is the ratio of new spells of sickness to the
total population at risk per 1000. These rates can be calculated by the formula:
3. Prevalence Rate
Usually, prevalence rates are computed for chronic illnesses like tuberculosis,
leprosy etc. In this case, the word prevalence signifies cases, both old and new.
The period prevalence rate (persons) is defined as the ratio of current cases of
illness (both old and new) for a specified period to the total population at risk,
expressed per 1000.
Also, the period prevalence rate (spells) can be defined as the ratio of current spells
of sickness (both old spells and new spells) to the total population at risk, multiplied
by 1000.
At the same time, the point prevalence rate is defined as the ratio of current illness
at a particular point of time to the population at risk, expressed per 1000.
Period prevalence rate (persons) = (No. of persons currently sick during a specified period of time × 1000) / Total population at risk
Period prevalence rate (spells) = (No. of current spells of sickness during a period of time × 1000) / Total population at risk
Point prevalence rate = (No. of persons sick (both old and new cases) at a particular point of time × 1000) / Population at risk
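The morbidity rates above follow the same pattern, with the population at risk in the denominator; the sketch below uses hypothetical counts for a single community.

```python
# Sketch: incidence and prevalence rates from hypothetical counts.
population_at_risk = 5_000
new_cases = 40                    # persons newly affected during the period
new_spells = 55                   # spells (episodes) of sickness during the period
current_cases_period = 120        # old + new cases during the period
current_cases_point = 90          # cases present at a particular point of time

incidence_persons = new_cases * 1000 / population_at_risk
incidence_spells = new_spells * 1000 / population_at_risk
period_prevalence = current_cases_period * 1000 / population_at_risk
point_prevalence = current_cases_point * 1000 / population_at_risk

print(f"Incidence (persons) = {incidence_persons:.1f} per 1000")
print(f"Incidence (spells)  = {incidence_spells:.1f} per 1000")
print(f"Period prevalence   = {period_prevalence:.1f} per 1000")
print(f"Point prevalence    = {point_prevalence:.1f} per 1000")
```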
This rate is called crude because none of the factors like age, sex, religion,
occupation etc. responsible for high fertility has been considered at the time of its
computation. The Crude Birth Rate is defined as the ratio of total live births during
one year to the mid-year estimated population, expressed per 1000.
In the computation of the General Fertility Rate, one restriction is applied to the
denominator compared with the C.B.R., i.e. in the denominator only the female
population in the reproductive age group is considered. It is defined as the ratio of
total live births to the female population in the reproductive age group (15-44 or
15-49 years), expressed per 1000.
It is defined as the ratio of total live births in a specified age group (say age
group 20-24 years) to the total women in the same age group (20-24 years)
multiplied by 1000.
The sum of all single-year age-specific fertility rates is called the Total Fertility Rate.
If age-specific fertility rates are computed for intervals, say women aged 15-19
years, 20-24 years etc., the sum must be multiplied by the class interval. If the
single-year age-specific birth rates are computed per 1000 women, the total fertility
rate is also expressed per 1000 women.
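For example, with age-specific fertility rates per 1000 women in the usual five-year age groups (all rates below are hypothetical), the total fertility rate is the sum of the group rates multiplied by the class interval.

```python
# Sketch: total fertility rate from hypothetical 5-year age-specific rates.
# Rates are live births per 1000 women per year in each age group.
asfr_per_1000 = {
    "15-19": 40, "20-24": 180, "25-29": 160,
    "30-34": 90, "35-39": 40, "40-44": 15, "45-49": 5,
}
class_interval = 5   # width of each age group in years

tfr_per_1000_women = sum(asfr_per_1000.values()) * class_interval
tfr_per_woman = tfr_per_1000_women / 1000
print(f"TFR = {tfr_per_1000_women} per 1000 women = {tfr_per_woman:.2f} children per woman")
```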
The Net Reproduction Rate is the ideal rate to assess the fertility status of a country.
The Gross Reproduction Rate provides the average number of female children that
would be born to a female over her entire reproductive life span. But there is no
guarantee that all the women in the reproductive span, or their female offspring, are
going to survive throughout the reproductive life. Therefore, a deduction is made
from the GRR to that effect by considering the current mortality conditions of
females. The calculations are done by life table techniques. Almost all the developing
countries are trying to attain an N.R.R. = 1, so that the population may be able to
maintain itself. Then every woman in the reproductive age group will be replaced by
one, which makes no change in the population. If the N.R.R. is more than 1, it
indicates an increase in the population and if it is less than 1, it is suggestive of a
declining trend.