to the
Biostatistics 1
Course instructor: Dr. JMA Hannan
E-mail: jmahannan@northsouth.edu
How to do well in this class?
1. Forget about your previous failure.
2. Attend lectures and take notes.
3. Effort = Result.
4. Read the syllabus.
5. Read exam questions carefully.
6. Answer all parts of a given question.
7. Turn assignments in on time.
8. Ask if you have questions.
General Policy
Examination Marks
Midterm 1 20%
Midterm 2 20%
Final exam 40%
Class tests 10%
Assignment 5%
Class participation 5%
STATISTICS - HISTORICAL PERSPECTIVES
W.S. Gosset
R.A. Fisher
DEFINITION OF STATISTICS
• Simplifies complexity
• Helps to compare
APPLICATION OF BIOSTATISTICS
The concepts of statistics may be applied to a number of fields, including public
health, the pharmaceutical industry, business, psychology and agriculture.
In Medicine
In the field of medicine, statistical methods are used to evaluate the effectiveness
of a new drug or method of treatment. A drug is given to animals or humans, and
statistical methods are used to explore whether the changes produced are due to the
action of the drug or to chance. The same methods are used to compare the action of
two or more different drugs, or of different dosages of the same drug.
To find an association between disease and risk factors such as myocardial
infarction (MI) and alcohol intake, we need the help of statistics.
To define the normal range/limit of physiological and biochemical parameters. For
example, the average systolic blood pressure is 120 mmHg and the average random
blood glucose level is 6.7 mmol/l, but the limits on either side of the average
within which a value may still be considered normal must be established with an
appropriate statistical technique.
What factors increase the risk that an individual will develop coronary heart disease?
• Data are the outcome of
facts (sex, occupation),
events (birth, death, disease)
measurements (height, weight)
recorded for many individuals, i.e. when these are recorded for a number of people
they become data, e.g.
Sex: male/female,
Birth: live birth/still birth
Death: cause/age/sex
Occupation: teacher/physician/labor etc
Types of data
1. Qualitative data
» Nominal data
» Rank data
2. Numerical or Quantitative
» Discrete data
» Continuous data
Qualitative Data
Nominal Data
• Nominal data are data that one can name.
Continuous Data
• Such data are measurement that can, in theory
at least, take any value within a given range.
• Example: Diastolic blood pressure, which is
continuous, is converted into hypertension and
normotension.
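The conversion described above can be sketched in a few lines of Python. This is only an illustration: the 90 mmHg cut-off and the readings are assumptions chosen for the example, not values prescribed in these notes.

```python
# Hypothetical cut-off: a diastolic pressure above 90 mmHg is labelled
# "hypertension"; anything at or below it is labelled "normotension".
CUTOFF_MMHG = 90


def classify_diastolic(pressure_mmhg: float) -> str:
    """Turn a continuous diastolic reading into a qualitative category."""
    return "hypertension" if pressure_mmhg > CUTOFF_MMHG else "normotension"


# Made-up continuous readings (mmHg), used only for illustration.
readings = [72, 85, 90, 96, 104]
labels = [classify_diastolic(r) for r in readings]

for reading, label in zip(readings, labels):
    print(reading, label)
```

Note that the conversion loses information: 96 and 104 end up in the same category.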
Collection of data
1. Interviewing or enumeration
2. Questionnaire
3. Experiments
Data from physiology, pharmacology and clinical pathology labs,
hospital wards, fundamental research etc.
4. Surveys
Data of incidence/prevalence of health or disease situation in a
community such as incidence of malaria or prevalence of leprosy etc
5. Records
Records are maintained as a routine in registers or books over a long
period of time for still births, deaths etc. Data are collected from these
records.
Methods of presentation of data
1. Tabulation of data
2. Diagrammatic presentation
Tabulation of data
Objectives of tabulation:
To clarify the object of investigation.
To simplify complex data.
To facilitate comparison
Rules for Tabulation of data
[Figure: layout of a table, showing the row sub-headings and the body]
The following table shows the cigarette consumption per person
among adolescent boys:
Year    Number of cigarettes
1996    654
1997    700
1998    900
1999    1200
2000    1500
2001    1350
Example: Tabulation of data
Data are presented as mean ± SD. An unpaired t-test was used as the test of
significance. *p<0.05, **p<0.01.
Graphical presentation of data
Importance of diagrams:
They are attractive and impressive.
They save time and labour to understand
They make data simple.
They make comparison easy
They provide more information than table
Types of diagram
Line diagram
Bar diagram (simple & multiple)
Pie diagram
Histogram
Scatter diagram
Line Diagram
[Figure: line diagram of the number of cigarettes consumed per adolescent boy, 1996 to 2001]
[Figure: line diagram over time, 0 min to 120 min]
[Figure: bar diagram by year, 1996 to 2001]
[Figure: bar diagram of AUC and glycemic index for glucose only and diabetic food]
[Figure: pie diagram with the segments pill, injectable, condom, sterilization, IUD and Norplant, and traditional; the shares shown are 45%, 19%, 17%, 10%, 7% and 2%]
SUMMARIZING DATA: Measures of location
Objective of average:
1. To get a single value that represents the entire data.
2. To facilitate comparison between groups of data of a similar nature.
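For n observations $x_1, x_2, x_3, \ldots, x_n$ the mean is
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum x}{n}$$
where
x stands for an observed value.
n stands for the number of observations in the data set.
$\sum x$ stands for the sum of all observed x values.
$\bar{x}$ stands for the mean value of x.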
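As a worked illustration of the formula, the sketch below applies it to the six yearly consumption figures (1996 to 2001) tabulated earlier in these notes. The function name `arithmetic_mean` is just a label chosen for this example.

```python
def arithmetic_mean(values):
    """Mean = (sum of all observed values) / (number of observations)."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Yearly consumption per adolescent boy, 1996 to 2001 (from the table above).
consumption = [654, 700, 900, 1200, 1500, 1350]

mean_consumption = arithmetic_mean(consumption)
print(round(mean_consumption, 1))  # 6304 / 6, i.e. 1050.7
```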
Merits:
• It is the most popular average, easy to understand and easy to calculate.
• It takes all the observations into account.
• The mean is used in computing other statistics (such as the variance, standard
deviation etc.)
Limitation:
• Mean is affected by extremely high or low values.
• It is not a good measure of average in extremely asymmetric distribution
of observations.
Median
Here n = 75, therefore n/2 = 75/2 = 37.5. Looking at the cumulative frequency column in the table,
we find that n/2 (37.5) falls in the class 17 – 21. This means that the median value lies between 17
and 21.
Here L = 17, F = 35, f = 19, c = 5.
$$\text{Median} = L + \frac{\frac{n}{2} - F}{f} \times c = 17 + \frac{37.5 - 35}{19} \times 5 = 17.66$$
Where, L = The lower limit of the median class (the median class is the class that contains the (n/2)th observation of the
series).
n = Total number of observations.
F = Cumulative frequency of the class just preceding the median class.
f = Frequency of the median class.
c = The class interval of the median class.
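The grouped-data calculation can be checked with a short Python sketch. The function name `grouped_median` is an assumption made for this example; the arguments mirror the symbols L, n, F, f and c defined above.

```python
def grouped_median(L, n, F, f, c):
    """Median of grouped data: L + ((n/2 - F) / f) * c.

    L: lower limit of the median class
    n: total number of observations
    F: cumulative frequency of the class preceding the median class
    f: frequency of the median class
    c: class interval of the median class
    """
    return L + ((n / 2 - F) / f) * c


median = grouped_median(L=17, n=75, F=35, f=19, c=5)
print(round(median, 2))  # 17.66
```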
Median
Merits:
• Median is easy to understand and easy to calculate.
• It is not affected by extremely high or low values.
Limitation:
• It is not based on all the observations. It is a positional average
and thus it is not determined by each and every observation.
• It is a less reliable average than the mean when the number of
observations is small.
Mode
Merits:
• Mode is easy to understand and easy to calculate.
• Like median, mode is not at all affected by extremely high or low
values.
• When there is a large frequency in a distribution, mode happens to be
meaningful as an average.
Limitation:
• It is not based on all the observations.
• It is a less reliable average than the mean when the number of observations is
small.
Since an average is a single value representing a
group of values, it must be properly interpreted;
otherwise it may lead to a wrong
conclusion.
COMPARISON OF MEAN, MEDIAN & MODE
• The mode is useful for non-numeric data. It provides little information about the rest of the
values in the data.
• The mean can be seriously affected by the presence of outliers, but the median is not. An
outlier is an observation that is very different from all other observations in a data set, i.e.
a very small or very large value, e.g. 200 in the second data set below.
5 7 8 8 12 15 19 21 23
median = 12, mean = 13.1
5 7 8 8 12 15 19 21 23 200
median = 13.5, mean = 31.8
• The median (a positional average) barely changes (12 to 13.5) because it depends only on the
middle observations. The mean changes considerably (13.1 to 31.8) because it depends on the
value of every observation. So, in the above example, as the value of the last observation
increases, so too does the mean.
• Outliers can sometimes occur as a result of error or deliberate misinformation. In these cases,
the outliers should be excluded from the measure of central tendency. Other times, outliers just
show how different one value is, and this can be a very useful piece of data.
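The two data sets above can be compared with Python's standard `statistics` module. This is a minimal sketch to reproduce the figures quoted in the example, not part of the original slides.

```python
import statistics

# The data set from the example, without and with the outlier 200.
data = [5, 7, 8, 8, 12, 15, 19, 21, 23]
with_outlier = data + [200]

for label, values in (("without outlier", data), ("with outlier", with_outlier)):
    print(
        label,
        "mean =", round(statistics.mean(values), 1),
        "median =", statistics.median(values),
        "mode =", statistics.mode(values),
    )
```

The mean moves from 13.1 to 31.8, the median only from 12 to 13.5, and the mode stays at 8.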
COMPARISON OF MEAN, MEDIAN & MODE Cont’d
3. Half of the data lies below the median and half of the data lies above it. This will be
approximately true for the mean when the data is symmetric. If the data is skewed, then the
median may differ significantly from the mean and usually the median would be used.
4. By choosing a wrong measure of central tendency, one can mislead people with statistics. In fact,
this is commonly done.
SUMMARIZING DATA: Measures of variation
The median and mean mark for both tests is 20, but data A is more spread out
than data B.
Important measures of dispersion are:
1. Range
2. Variance & standard deviation
3. Standard error of Mean
4. Co-efficient of variation.
Range
Application:
Range is used in medical science to define the normal limits of biological
characteristics.
Example: normal ranges of systolic and diastolic blood pressure are 100 – 140
mmHg and 80 – 90 mmHg respectively. Ordinarily, observations falling within a
particular range are considered normal and those falling outside the normal
range are considered abnormal.
The range for a biological characteristic such as blood cholesterol, fasting blood sugar,
hemoglobin, bilirubin etc. is worked out after measuring the characteristic in a
large number of healthy persons of the same age, sex, class etc.
Range
Merits:
• It is simple to compute and understand.
• It gives a rough but quick answer
Limitation:
1. It is not a satisfactory measure as it is based only on two
extreme values, ignoring the distribution of all other
observations within the extremes. These extreme values vary
from study to study, depending upon the size and nature of
sample and type of study.
Variance & Standard deviation
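Variance
$$\text{Variance} = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Standard deviation
$$S.D. = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
The standard deviation is computed as the square root of the average squared deviation of each
number from its mean. For example, for the numbers 1, 2, and 3 the mean is 2 and the standard
deviation is:
$$SD = \sqrt{\frac{(1 - 2)^2 + (2 - 2)^2 + (3 - 2)^2}{3 - 1}} = \sqrt{1} = 1$$
(Dividing by n instead of n - 1 gives $\sqrt{0.667} = 0.82$.)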
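The example with the numbers 1, 2 and 3 can be worked through step by step in Python. The sketch below computes the SD with the n - 1 divisor and, for comparison, with the divisor n; the variable names are chosen for this illustration only.

```python
import math
import statistics

values = [1, 2, 3]
n = len(values)
mean = sum(values) / n

# Sum of squared deviations of each value from the mean.
sum_sq_dev = sum((x - mean) ** 2 for x in values)

sample_variance = sum_sq_dev / (n - 1)        # divisor n - 1
sample_sd = math.sqrt(sample_variance)
population_sd = math.sqrt(sum_sq_dev / n)     # divisor n, for comparison

print(mean, sample_variance, sample_sd)       # 2.0 1.0 1.0
print(round(population_sd, 3))                # 0.816

# The standard library gives the same results.
print(statistics.stdev(values), round(statistics.pstdev(values), 3))
```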
Standard deviation
Merits:
• It is the most important and widely used measure of dispersion.
• It is based on all the observations and the actual sign of deviations
are used.
• Standard deviation provides the unit of measurement for the
normal distribution.
• It is the basis for measuring the coefficient of correlation, sampling
and statistical inference.
Limitation:
• It is not easy to understand and is difficult to calculate.
• It is affected by the value of every item in the series.
Calculations of SD:
Mean of group A = 20, mean of group B = 20
In these two groups the means are the same (20) but their variation (SD) is different (SD of A = 8.2 and SD of B = 5.5).
Calculations of SD with alternative formulas:
The greater the SD, the greater the variation of the observations.
The standard error of a sample mean is just the sample standard deviation
divided by the square root of the sample size.
$$SE = \frac{SD}{\sqrt{n}}$$
If we draw a series of samples from the same population and calculate the mean of
the observations in each, we have a series of means. The series of means, like
the series of observations in each sample, has a standard deviation. The SE of
the mean of one sample is an estimate of the SD that would be obtained from
the means of a large number of samples drawn from the population.
In addition, if we draw random samples from the population, their means
will vary from one sample to another. This variation depends on the variation of
the population and the size of the samples. We do not know the variation of the
population, so we use the variation of the sample as an estimate of it. This is
expressed as the SD, and if we divide the SD by the square root of the number of
observations in the sample we have an estimate of the SE of the mean: $SEM = SD/\sqrt{n}$
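The idea that the SE estimates the SD of many sample means can be illustrated with a small simulation. Everything numeric here is an assumption made for the sketch: a normally distributed population with mean 120 and SD 15, samples of 25 observations, 2000 repeated samples, and a fixed random seed.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population parameters and sampling plan.
POP_MEAN, POP_SD = 120.0, 15.0
SAMPLE_SIZE = 25
N_SAMPLES = 2000


def draw_sample():
    """Draw one random sample of SAMPLE_SIZE observations."""
    return [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]


# SE estimated from a single sample: SD / sqrt(n).
one_sample = draw_sample()
sem_estimate = statistics.stdev(one_sample) / math.sqrt(SAMPLE_SIZE)

# SD of the means of many samples drawn from the same population.
sample_means = [statistics.mean(draw_sample()) for _ in range(N_SAMPLES)]
sd_of_means = statistics.stdev(sample_means)

# Value expected from the population SD: 15 / sqrt(25) = 3.
theoretical_se = POP_SD / math.sqrt(SAMPLE_SIZE)

print("SE from one sample:   ", round(sem_estimate, 2))
print("SD of sample means:   ", round(sd_of_means, 2))
print("Population SD/sqrt(n):", theoretical_se)
```

The SD of the sample means should come out close to 3, and the single-sample estimate should be in the same neighbourhood, though it varies from sample to sample.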
Advantage of SE
$$z = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
$$SE = \frac{SD}{\sqrt{n}}, \qquad SD = \sqrt{n} \times SE$$
The greater the SE, the greater the variation among sample means.
$$C.V. = \frac{SD}{\text{Mean}} \times 100$$
Example: Heights (cm) of adults and children are given in the table
(columns: Mean, SD, CV).
Although height in adults shows the greater variation in terms of SD, the
relative variation (CV) is actually greater in children.
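The comparison can be sketched in Python. The table values did not survive in these notes, so the means and SDs below are hypothetical figures chosen only to show how a group with the larger SD can still have the smaller CV.

```python
def coefficient_of_variation(mean, sd):
    """C.V. = SD / Mean x 100, expressed as a percentage."""
    return sd / mean * 100


# Hypothetical heights in cm; NOT the values from the original table.
adults = {"mean": 160.0, "sd": 10.0}
children = {"mean": 60.0, "sd": 5.0}

cv_adults = coefficient_of_variation(adults["mean"], adults["sd"])
cv_children = coefficient_of_variation(children["mean"], children["sd"])

print("Adults:   SD =", adults["sd"], "CV =", round(cv_adults, 2), "%")
print("Children: SD =", children["sd"], "CV =", round(cv_children, 2), "%")
```

With these assumed figures the adults have the larger SD (10 against 5) but the children have the larger CV (about 8.33% against 6.25%).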
Population & Sample
Population
• All possible values of a variable or all possible objects whose characteristics
are of interest in any particular investigation or enquiry.
• If the income of the citizens of a country is of interest to us, the aggregate of
all relevant incomes will constitute the population.
Sample
• A sample is a part of a population.
• Although we are primarily interested in the properties of a population or
universe, it is often impracticable or even impossible to study the entire
universe.
• Thus inferences about a population are usually drawn on the basis of a
sample. It represents the population.
Normal Distribution
Characteristics :
Relationship between two variables
The Correlation Coefficient
The Spearman Rank Correlation Coefficient
Nonparametric Regression Analysis
Simple Linear Regression