
BIOSTATISTICS

COMH 601

By

Tadesse Alemayehu (PhD)


College of Health & Medical Sciences, HU
January 2016
Course objectives
At the end of the course, students will be able to:
• Differentiate the types of measurement scales and the different
types of data
• Calculate the most common summary values for a set of data
• Differentiate between point and interval estimation and explain
the meaning and application of confidence interval
• Use methods of association in measurement data – correlation
analysis
• Describe relationship of variables using simple and multiple
variable regression analysis and interpret coefficients and
ratios
• Apply methods of regression analysis for measurement data in
two or more variables
• Identify model assumptions – parameter estimation, hypothesis
testing and prediction
• Use Analysis of Variance (ANOVA) and multivariate analysis of
variance (MANOVA)
• Use different non-parametric tests for distribution free data
• Understand categorical data and methods of analysis
• Apply the chi-square test for nominal and ordinal data and
interpret associations
• Identify modern methods used to analyze time-to-event data.
• Use survivorship functions, Kaplan-Meier curves, log rank test,
Cox regression, model fitting strategies, model interpretation,
stratification, time dependent covariates, and introduction to
parametric survival models
Methods of Instruction
• Lectures

• Take-home assignments

• Practical exercises and tutorials

• Questions and answers

• Group discussions

• Independent reading on exercises

• Attendance at classes is strictly compulsory

• Computer lab on data entry and analysis


Contents
1. Introduction to descriptive statistics
2. Statistical Estimation
a. Estimation
b. Test of hypothesis
3. Sampling and sampling distribution
4. Categorical data analysis
5. Bivariate and Multivariate analysis of
measurement data
a. Correlation analysis
b. Regression analysis
c. Analysis of variance
d. Non-parametric tests
6. Survival Analysis
References
1. Daniel W. Biostatistics: A Foundation for
analysis in Health Sciences
2. Campbell. Medical Statistics
3. M. Pagano & K. Gauvreau: Principles of
Biostatistics
4. Jaypee. Methods in Biostatistics. 6th edition
5. Armitage P. & Berry G. : Statistical Methods
in Medical Research
6. Stanton. Primer of Biostatistics
7. F. L. Hernandez: Biostatistics: A Guide to
Design, Analysis and Discovery
8. H. Motulsky. Intuitive Biostatistics
9. C. Friis. Introduction to Biostatistics for Health
Sciences
10. V. Shukla: Biostatistics: perspective in
Healthcare Research & Practice
11. G. Van Belle: Biostatistics: A Methodology for
health Sciences
12. Rao: Biostatistics: A manual of methods use in
Health, Nutrition and Anthropology. 2nd edition
13. Rosner. Fundamentals of Biostatistics. 7th
edition

Descriptive Statistics

• Biostatistics: a science that deals with the
collection, organization, analysis, interpretation
and presentation of information that can be
stated numerically.
– The application of statistical methods to the
fields of biological and medical sciences or
public health.
– Has central role in medical investigations
– Concerned with interpretation of biological data
& the communication of information about data

Uses of biostatistics
• Provide a way of organizing information
• Assessment of health status
• Health program evaluation
• Resource allocation
• Aids in summarizing the results
• Helps us recognize underlying trends and
tendencies in the data
• Aids in communicating the results to others
• Magnitude of association
– Strong vs weak association between
exposure and outcome
• Assessing risk factors
– Cause & effect relationship
• Evaluation of a new vaccine or drug
– What can be concluded if the proportion of
people free from the disease is greater among
the vaccinated than the unvaccinated?
– How effective is the vaccine (drug)?
– Is the effect due to chance or some bias?
• Drawing of inferences
– Information from sample to population

What does biostatistics cover?

Biostatistical thinking contributes to every step in a research project:

1. Research planning
2. Design
3. Execution (data collection)
4. Data processing
5. Data analysis
6. Presentation
7. Interpretation
8. Publication

The best way to learn about biostatistics is to follow the flow of a
research project from inception to the final publication.
Types of Statistics

1. Descriptive statistics:
• Ways of organizing and summarizing data
• Helps to identify the general features and
trends in a set of data and extracting
useful information
• Also very important in conveying the final
results of a study
– Example: tables, graphs, numerical summary
measures
2. Inferential statistics:
• Methods used for drawing conclusions
about a population based on the
information obtained from a sample of
observations drawn from that population
– Example: Principles of probability, estimation,
CI, comparison of two or more means or
proportions, hypothesis testing, etc.

• Population
– complete set of individuals, objects or
measurements

• Sample
– a sub-set of a population

• A parameter is a characteristic of a population


– e.g., the average height of all Ethiopians.

• A statistic is a characteristic of a sample


– e.g., the average height of a sample of
Ethiopians.
Data

• Data are numbers which can be measurements
or can be obtained by counting
• The raw material for statistics
• Can be obtained from:
– Routinely kept records
– Surveys
– Counting
– Experiments
– Reports
– Observation

Type of data and scales of measurement
• Not all measurements are the same.
• Measuring the weight of a patient, e.g. 40 kg
• Measuring the status of a patient on a scale, e.g.
"improved", "stable", "not improved".

• Measuring scales differ according to the
degree of precision involved.

• There are four types of scales of measurement.


1. Nominal scale: uses names, labels, or symbols
to assign each measurement to one of
a limited number of categories that cannot
be ordered.
Examples: Blood type, sex, race, marital
status
2. Ordinal scale: assigns each measurement
to one of a limited number of categories
that are ranked in terms of a graded order.
Examples: Patient status, Cancer stages

3. Interval scale: assigns each measurement
to one of an unlimited number of categories
that are equally spaced.
• It has no true zero point.
Example: Temperature measured on
Celsius or Fahrenheit

4. Ratio scale: measurement begins at a true
zero point and the scale has equal intervals.
Examples: Height, weight, blood pressure
[Figure: degree of precision in measuring, increasing from Nominal to Ordinal to Interval to Ratio]
Variables
Variable: A characteristic which takes different
values in different persons, places, or things
Qualitative variable: The notion of magnitude is
absent or implicit.
Quantitative variable: Variable that has magnitude.

Discrete variable: It can only have a finite number of
values in any given interval.
Continuous variable: It can have an infinite number
of possible values in any given interval.

SUMMARY

Types of variables and their measurement scales:

• Qualitative or categorical variables
– Nominal (not ordered), e.g. ethnic group
– Ordinal (ordered), e.g. response to treatment
• Quantitative (measurement) variables
– Discrete (count data), e.g. # of admissions
– Continuous (real-valued), e.g. height
Methods of Data Organization
and Presentation

• The actual organization and
summarization of data starts from
frequency distribution.

• Frequency distribution: A table which
has a list of each of the possible values
that the data can assume along with the
number of times each value occurs.

Frequency Distributions
• Simple frequency distribution:
– It is useful for categorical variables
– For continuous variables it is not common.
– But the following can be obtained if the
number of observations is not too large:
• it allows you to pick up at a glance some valuable information,
such as the highest and lowest values
• it lets you ascertain the general shape or form of the distribution
• it lets you make an informed guess about central tendency values

a) Qualitative variable:
• Count the number of cases in each category.

- Example 1: The ICU type of 25 patients entering the intensive
care unit at a given hospital:
1. Medical
2. Surgical
3. Cardiac
4. Other

Table: Distribution of 25 patients entering ICU at a
certain period in "X" Hospital:

ICU Type    Frequency      Relative frequency         Percentage
            (how often)    (proportionately often)    (%)
Medical     12             0.48                        48
Surgical     6             0.24                        24
Cardiac      5             0.20                        20
Other        2             0.08                         8
Total       25             1.00                       100

b) Quantitative variable:
- Select a set of continuous, non-overlapping
intervals such that each value can be placed
in one, and only one, of the intervals.
- The first consideration is how many intervals
to include

For a continuous variable
(e.g. – age), the frequency
distribution of the individual
ages is not so interesting.
We “see more” in frequencies
of age values in “groupings”.
Here, 10 year groupings
make sense.
Example:
– Leisure time (hours) per week for 120 college
students:
23 24 18 14 20 36 24 26 23 21 16 15 19 20
22 14 13 10 19 27 29 22 38 28 34 32 23 19
21 31 16 28 19 18 12 27 15 21 25 16 23 24
18 14 20 36 24 26 23 21 15 16 19 20 22 14
13 19 27 10 19 23 32 28 34 38 28 22 29 31
21 16 28 19 18 24 23 16 21 25 15 27 12 18
14 14 20 22 20 36 24 16 19 23 25 23 21 20
13 31 29 22 27 19 19 23 32 28 38 28 34 32
16 21 25 15 27 12 18 21
Grouped frequency Distribution

• Too few intervals result in loss of information

• Too many intervals mean that the objective of
organization will not be met

To determine the number of class intervals and the
corresponding width, we may use:

Sturges' rule:
$K = 1 + 3.322 \log_{10} n$
$W = \dfrac{L - S}{K}$
where
K = number of class intervals
n = no. of observations
W = width of the class interval
L = the largest value
S = the smallest value
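
As a rough illustration, Sturges' rule can be applied to the leisure-time data above (n = 120, largest value 38, smallest value 10). This is only a sketch: the function name `sturges` is hypothetical, and rounding K up to a whole number is an assumption, since the rule itself does not say how to round.

```python
import math

def sturges(n, largest, smallest):
    """Return (K, W): number of class intervals and class width."""
    # Assumption: round K up to the next whole number of classes.
    k = math.ceil(1 + 3.322 * math.log10(n))
    w = (largest - smallest) / k
    return k, w

# Values read from the leisure-time data: n = 120, L = 38, S = 10.
k, w = sturges(n=120, largest=38, smallest=10)
print(k, w)
```

In practice the width is usually rounded to a convenient value before the intervals are laid out.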

Reading Assignment

Read about constructing a frequency distribution
for a quantitative variable – Grouped frequency
Distribution

Diagrammatic presentation

Importance of diagrammatic representation:

1. Diagrams have greater attraction than
mere figures.
2. They give a quick overall impression of the
data.
3. They have greater memorizing value than
mere figures.
4. They facilitate comparison
5. Used to understand patterns and trends

• Well designed graphs can be powerful
means of communicating a great deal of
information

• When graphs are poorly designed, they not
only fail to convey the message effectively, but
are often misleading.

Specific types of graphs include:

Nominal, ordinal data:
• Bar graph
• Pie chart

Quantitative data:
• Histogram
• Stem-and-leaf plot
• Box plot
• Scatter plot
• Line graph
• Frequency polygon
• Ogive

• Others
Reading Assignment

Read about diagrammatic representation:
how to construct the different graphs,
diagrams, charts, etc.

Numerical Summary Measures
– Single numbers which quantify the
characteristics of a distribution of values
• Measures of central tendency (location)
• Measures of dispersion

Measures of Central Tendency (MCT)
• On the scale of values of a variable there is
a certain stage at which the largest number
of items tend to cluster.
• Since this stage is usually in the centre of
distribution, the tendency of the statistical
data to get concentrated at a certain value
is called “central tendency”
• The various methods of determining the
point about which the observations tend to
concentrate are called MCT.
• The objective of calculating MCT is to
determine a single figure which may be
used to represent the whole data set.

• In that sense it is an even more compact
description of the statistical data than the
frequency distribution.

• Since a MCT represents the entire data, it
facilitates comparison within one group or
between groups of data.

Characteristics of a good MCT
A MCT is good or satisfactory if it possesses
the following characteristics.
1. It should be based on all the observations
2. It should not be affected by the extreme values
3. It should have a definite value
4. It should not be subjected to complicated and tedious
calculations
5. It should be capable of further algebraic treatment
6. It should be stable with regard to sampling

• The most common measures of central
tendency include:
– Mean
– Median
– Mode

1. Arithmetic Mean

A. Ungrouped Data
• The arithmetic mean is the "average" of the data
set and by far the most widely used measure of
central location
• Is the sum of all the observations divided by the
total number of observations.

B. Grouped Data
• In calculating the mean for grouped data, we
assume that all values falling into a particular class
interval are located at the mid-point of the interval
• Because of this, it is calculated as follows:

$\bar{x} = \dfrac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$

where k = the number of class intervals
$m_i$ = the mid-point of the ith class interval
$f_i$ = the frequency of the ith class interval
Example. Compute the mean age of 169
subjects from the grouped data.

Class interval    Mid-point (mi)    Frequency (fi)    mi·fi
10-19             14.5                4                 58.0
20-29             24.5               66               1617.0
30-39             34.5               47               1621.5
40-49             44.5               36               1602.0
50-59             54.5               12                654.0
60-69             64.5                4                258.0
Total                               169               5810.5

Mean = 5810.5/169 = 34.38 years
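
The grouped-mean calculation can be checked with a few lines of Python. This is a minimal sketch using the mid-points and frequencies from the table; the variable names are illustrative only.

```python
# Mid-points and frequencies of the six age classes in the example.
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
freqs = [4, 66, 47, 36, 12, 4]

n = sum(freqs)                                        # total number of subjects
total = sum(m * f for m, f in zip(midpoints, freqs))  # sum of mi * fi
mean_age = total / n

print(n, total, round(mean_age, 2))
```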


The mean can be thought of as a “balancing
point”, “center of gravity”

When the data are skewed, the mean is
“dragged” in the direction of the skewness

• It is possible in extreme cases for all but one of the sample
points to be on one side of the arithmetic mean. In this
case, the mean is a poor measure of central location and does
not reflect the center of the sample.
Mean, Median, Mode

[Figure: positions of the mean, median and mode in negatively skewed, symmetric (not skewed) and positively skewed distributions]
2. Median
a) Ungrouped data
• The median is the value which divides the data set into
two equal parts.

• If the number of values is odd, the median will be the
middle value when all values are arranged in order of
magnitude.

• When the number of observations is even, there is no
single middle value but two middle observations.

• In this case the median is the mean of these two middle
observations, when all observations have been arranged
in the order of their magnitude.
If the sample size n is odd: Median = $\left(\dfrac{n+1}{2}\right)$th value
If the sample size n is even: Median = average of the $\left(\dfrac{n}{2}\right)$th and $\left(\dfrac{n}{2}+1\right)$th values

• The median is a better description (than the mean) of
the majority when the distribution is skewed
• Example
– Data: 14, 89, 93, 95, 96
– Skewness is reflected in the outlying low value of 14
– The sample mean is 77.4
– The median is 93
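
The contrast between the two measures in this example can be reproduced with Python's standard `statistics` module; a small sketch:

```python
import statistics

data = [14, 89, 93, 95, 96]  # the skewed sample from the example

mean_value = statistics.mean(data)      # pulled down by the outlying 14
median_value = statistics.median(data)  # middle value of the ordered data

print(mean_value, median_value)
```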

b) Grouped data
• In calculating the median from grouped data, we
assume that the values within a class-interval
are evenly distributed through the interval.

• The first step is to locate the class interval in
which the median is located, using the following
procedure.

• Find n/2 and locate the first class interval whose
cumulative frequency is at least n/2.
• Then, use the following formula.
$\tilde{x} = L_m + \left(\dfrac{\frac{n}{2} - F_c}{f_m}\right) W$
where,
$L_m$ = lower true class boundary of the interval containing the median
$F_c$ = cumulative frequency of the interval immediately preceding the median class
interval
$f_m$ = frequency of the interval containing the median
W = class interval width
n = total number of observations

Example. Compute the median age of 169
subjects from the grouped data.

n/2 = 169/2 = 84.5

Class interval    Mid-point (mi)    Frequency (fi)    Cum. freq
10-19             14.5                4                  4
20-29             24.5               66                 70
30-39             34.5               47                117
40-49             44.5               36                153
50-59             54.5               12                165
60-69             64.5                4                169
Total                               169

• n/2 = 84.5 falls in the 3rd class interval (30-39)
• Lower boundary = 29.5, Upper boundary = 39.5
• Frequency of the class = 47
• (n/2 – Fc) = 84.5 – 70 = 14.5

• Median = 29.5 + (14.5/47)(10) = 32.59 ≈ 33 years
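
The same steps can be sketched in Python. The function name `grouped_median` is hypothetical, and the code assumes whole-number class limits, so the true lower boundary of a class is its lower limit minus 0.5.

```python
def grouped_median(lower_limits, freqs, width):
    """Median of grouped data by linear interpolation within the median class."""
    n = sum(freqs)
    half = n / 2
    cum = 0  # cumulative frequency of the classes before the current one (Fc)
    for lower, f in zip(lower_limits, freqs):
        if cum + f >= half:
            boundary = lower - 0.5  # assumption: limits are whole numbers
            return boundary + (half - cum) / f * width
        cum += f
    raise ValueError("no observations")

lower_limits = [10, 20, 30, 40, 50, 60]
freqs = [4, 66, 47, 36, 12, 4]
median_age = grouped_median(lower_limits, freqs, width=10)
print(round(median_age, 2))
```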

3. Mode

• The mode is the most frequently occurring
value among all the observations in a set
of data.
• It is not influenced by extreme values.
• It is possible to have more than one mode
or no mode.
• It is not a good summary of the majority of
the data.
[Figure: a distribution with three modes marked. Source: T. Ancelle, D. Coulombie]
a) Ungrouped data

• It is a value which occurs most
frequently in a set of values.
• If all the values are different there is no
mode, on the other hand, a set of
values may have more than one mode.

• Example
• Data are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6
• Mode is 4 “Unimodal”
• Example
• Data are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8
• There are two modes – 2 & 5
• This distribution is said to be “bi-modal”
• Example
• Data are: 2.62, 2.75, 2.76, 2.86, 3.05, 3.12
• No mode, since all the values are different
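
The three cases above can be checked with a short sketch. The helper `modes` is a hypothetical name; it follows the convention used here that a data set in which every value occurs once has no mode.

```python
from collections import Counter

def modes(values):
    """Return the list of modes; empty if every value occurs only once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []  # all values are different: no mode
    return sorted(v for v, c in counts.items() if c == top)

unimodal = modes([1, 2, 3, 4, 4, 4, 4, 5, 5, 6])
bimodal = modes([1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8])
no_mode = modes([2.62, 2.75, 2.76, 2.86, 3.05, 3.12])
print(unimodal, bimodal, no_mode)
```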
b) Grouped data
• To find the mode of grouped data, we usually
refer to the modal class, where the modal
class is the class interval with the highest
frequency.
• There are two ways of determining the mode:
1. For a rough estimate of the mode, the mid-point
of the modal class interval could be
used.
2. It is also possible to approximate the mode for a
grouped frequency distribution based on the
formula:

$\hat{x} = \text{L.c.b}_{mod} + \left(\dfrac{d_1}{d_1 + d_2}\right) W$

Where: $\text{L.c.b}_{mod}$ = lower class boundary (L.c.b.) of the modal class
$d_1 = f_{mod} - f_l$
$f_l$ = frequency next lower to the modal class
$d_2 = f_{mod} - f_h$
$f_h$ = frequency next higher to the modal class
$f_{mod}$ = frequency value of the modal class
W = class interval width
modal class = class with the highest frequency
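
As an illustration only, the formula can be applied to the age data used in the grouped mean and median examples, where the modal class is 20-29 with lower class boundary 19.5. The function name `grouped_mode` is hypothetical, and treating a missing neighbouring class as having frequency 0 is an assumption.

```python
def grouped_mode(lower_boundaries, freqs, width):
    """Approximate the mode of grouped data from the modal class."""
    i = max(range(len(freqs)), key=freqs.__getitem__)       # index of modal class
    f_mod = freqs[i]
    f_low = freqs[i - 1] if i > 0 else 0                    # assumption: 0 if absent
    f_high = freqs[i + 1] if i < len(freqs) - 1 else 0      # assumption: 0 if absent
    d1 = f_mod - f_low
    d2 = f_mod - f_high
    # Note: d1 + d2 is zero when the neighbours equal the modal frequency.
    return lower_boundaries[i] + d1 / (d1 + d2) * width

lower_boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5]
freqs = [4, 66, 47, 36, 12, 4]
mode_age = grouped_mode(lower_boundaries, freqs, width=10)
print(round(mode_age, 2))
```

Here d1 = 66 – 4 = 62 and d2 = 66 – 47 = 19, so the estimate is 19.5 + (62/81)(10).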
4. Geometric mean (GM)
• Mainly used for many types of laboratory data,
specifically data in the form of concentrations of
one substance in another

If $x_1, x_2, \ldots, x_n$ are n positive observed values, then

$GM = \sqrt[n]{\prod_{i=1}^{n} x_i}$

and

$\log GM = \dfrac{\sum_{i=1}^{n} \log x_i}{n}$

The geometric mean is generally used with data measured on a logarithmic scale, such
as titers of anti-neutrophil immunoglobulin G.
• Example: the minimum inhibitory concentration of
penicillin in urine for N. gonorrhoeae in 74 patients

(µg/ml)    Frequency    (µg/ml)    Frequency
0.03125    21           0.250      19
0.0625      6           0.50       17
0.1250      8           1.0         3

Solution:
logGM = [21log(0.03125) + 6log(0.0625) +
8log(0.125) + 19log(0.25) + 17log(0.5)
+ 3log(1.0)]/74 = -0.846
The GM = the antilogarithm of -0.846 = 0.143 µg/ml
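
The calculation can be reproduced with base-10 logarithms. This is a sketch using the concentrations and frequencies from the table; the divisor is taken as the sum of the frequencies.

```python
import math

concentrations = [0.03125, 0.0625, 0.125, 0.25, 0.5, 1.0]  # µg/ml
freqs = [21, 6, 8, 19, 17, 3]

n = sum(freqs)  # number of patients
log_gm = sum(f * math.log10(x) for x, f in zip(concentrations, freqs)) / n
gm = 10 ** log_gm  # antilogarithm

print(n, round(log_gm, 3), gm)
```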
5. Harmonic mean (HM)
• Just as the geometric mean is based on an
arithmetic mean of logarithms, so is the
harmonic mean based on arithmetic mean of
the reciprocals.
• Pertains to rates and time
• We define it as the reciprocal of the arithmetic
mean of the reciprocal of the given numbers.

If the given numbers are $x_1, x_2, \ldots, x_n$, then

$HM = \dfrac{1}{\dfrac{1}{n}\sum_{i=1}^{n} \dfrac{1}{x_i}}$
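
A small sketch of the definition, using hypothetical values (two speeds of 40 and 60 km/h over equal distances) that are not from the slides. Python's `statistics.harmonic_mean` gives the same result as the formula written out by hand.

```python
import statistics

speeds = [40, 60]  # hypothetical speeds in km/h over equal distances

# Reciprocal of the arithmetic mean of the reciprocals.
hm_manual = 1 / (sum(1 / x for x in speeds) / len(speeds))
hm_library = statistics.harmonic_mean(speeds)

print(hm_manual, hm_library)
```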
6. Weighted mean (WM)
• In a weighted mean, separate outcomes
have separate influences.

• The influence attached to an outcome is
the weight.

• Familiar is the calculation of a course
grade as a weighted average of scores on
separate outcomes.
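
A sketch of the course-grade case, using the usual definition of a weighted mean (sum of weight times value, divided by the sum of the weights). The scores and weights below are hypothetical, not taken from the slides.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

scores = [70, 80, 90]   # hypothetical assignment, midterm and final scores
weights = [20, 30, 50]  # hypothetical weights (percent of the grade)

grade = weighted_mean(scores, weights)
print(grade)
```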

Example:

Which measure of central tendency is best with a
given set of data?

• Two factors are important in making this
decision:
– The scale of measurement (type of data)
– The shape of the distribution of the
observations

• The mean can be used for discrete and
continuous data
• The median is appropriate for discrete and
continuous data as well, but can also be
used for ordinal data
• The mode can be used for all types of
data, but may be especially useful for
nominal and ordinal measurements
• For discrete or continuous data, the
"modal class" can be used
• The geometric mean is used primarily for
observations measured on a logarithmic
scale.
• Harmonic mean is a suitable MCT when
the data pertains to rates and time.
• Weighted mean is commonly used in the
calculation of mean for different
outcomes.

Measures of Dispersion

Consider the following two sets of data:

A: 177 193 195 209 226 Mean = 200

B: 192 197 200 202 209 Mean = 200

Two or more sets may have the same mean and/or
median but they may be quite different.
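
To make the difference between sets A and B concrete, the sketch below computes two spread measures that are defined later in this section, the range and the sample standard deviation. The dictionary name `summary` is illustrative.

```python
import statistics

A = [177, 193, 195, 209, 226]
B = [192, 197, 200, 202, 209]

summary = {}
for name, data in (("A", A), ("B", B)):
    data_range = max(data) - min(data)
    sd = statistics.stdev(data)  # sample SD (divisor n - 1)
    summary[name] = (statistics.mean(data), data_range, sd)
    print(name, summary[name])
```

Both sets have mean 200, but A is spread over a much wider range than B.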

These two distributions have the same mean,
median, and mode

• MCT are not enough to give a clear
understanding about the distribution of
the data.

• We need to know something about the
variability or spread of the values:
whether they tend to be clustered close
together, or spread out over a broad
range

• Measures of dispersion quantify the variation or
dispersion of a set of data from its central
location

• Dispersion refers to the variety exhibited by
the values of the data.

• The amount may be small when the values are
close together.

• If all the values are the same, there is no dispersion

• Measures of dispersion include:
– Range
– Inter-quartile range
– Variance
– Standard deviation
– Coefficient of variation
– Standard error
– Others

1. Range (R)

• The difference between the largest and smallest
observations in a sample.

• Range = Maximum value – Minimum value

• Example –
– Data values: 5, 9, 12, 16, 23, 34, 37, 42
– Range = 42-5 = 37

2. Interquartile range (IQR)
Quartiles
• Split ordered data into 4 quarters

  25%  |  25%  |  25%  |  25%
       Q1      Q2      Q3

• Q1 = first quartile
• Q2 = second quartile = Median
• Q3 = third quartile


• Indicates the spread of the middle 50% of
the observations, and used with median

IQR = Q3 - Q1

• Example: Suppose the first and third quartiles for
weights of girls 12 months of age are 8.8 kg and
10.2 kg, respectively.
IQR = 10.2 kg – 8.8 kg = 1.4 kg
i.e., the middle 50% of the infant girls weigh between 8.8 and
10.2 kg.
3. Mean deviation (MD)

• Mean deviation is the average of the
absolute deviations taken from a central
value, generally the mean or median.
• Consider a set of n observations $x_1, x_2, \ldots, x_n$. Then:

$MD = \dfrac{1}{n} \sum_{i=1}^{n} |x_i - A|$

• 'A' is a central value (arithmetic mean or
median).
4. Variance (σ², s²)
• The main objection to the mean deviation, that
the negative signs are ignored, is removed
by taking the square of the deviations from
the mean.

• The variance is the average of the squares
of the deviations taken from the mean.
• The deviations are squared because the sum of the
deviations of the individual observations of
a sample about the sample mean is
always 0:

$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$

• The variance can be thought of as an
average of squared deviations
• Variance is used to measure the
dispersion of values relative to the mean.
• When values are close to their mean
(narrow range) the dispersion is less than
when there is scattering over a wide
range.
• Population variance = σ2
• Sample variance = S2

a) Ungrouped data
• Let $X_1, X_2, \ldots, X_N$ be the measurements on
N population units, then:

$\sigma^2 = \dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$

where

$\mu = \dfrac{\sum_{i=1}^{N} X_i}{N}$

is the population mean.
A sample variance is calculated for a sample of
individual values ($X_1, X_2, \ldots, X_n$) and uses the sample
mean $\bar{x}$ rather than the population mean µ:

$S^2 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{x})^2}{n - 1}$
b) Grouped data

$S^2 = \dfrac{\sum_{i=1}^{k} (m_i - \bar{x})^2 f_i}{\sum_{i=1}^{k} f_i - 1}$

where
$m_i$ = the mid-point of the ith class interval
$f_i$ = the frequency of the ith class interval
$\bar{x}$ = the sample mean
k = the number of class intervals
5. Standard deviation (σ, s)

• It is the square root of the variance.

• This produces a measure having the
same scale as that of the individual
values.

$\sigma = \sqrt{\sigma^2}$ and $S = \sqrt{S^2}$
Example. Compute the variance and SD of the age of 169
subjects from the grouped data.
Mean = 5810.5/169 = 34.38 years
S² = 20197.63/(169 – 1) = 120.22
SD = √S² = √120.22 = 10.96

Class interval    mi      fi     mi – Mean    (mi – Mean)²    (mi – Mean)² fi
10-19             14.5      4    -19.88        395.2144         1580.8576
20-29             24.5     66     -9.88         97.6144         6442.5504
30-39             34.5     47      0.12          0.0144            0.6768
40-49             44.5     36     10.12        102.4144         3686.9184
50-59             54.5     12     20.12        404.8144         4857.7728
60-69             64.5      4     30.12        907.2144         3628.8576
Total                     169                 1907.2864        20197.6336
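
The grouped variance and SD can be checked numerically. This sketch applies the grouped-data formula with divisor (Σfi – 1) and keeps the mean unrounded, so the last decimal may differ slightly from a hand calculation that rounds the mean first; it gives a variance of about 120.2 and an SD of about 10.96.

```python
import math

midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
freqs = [4, 66, 47, 36, 12, 4]

n = sum(freqs)
mean_age = sum(m * f for m, f in zip(midpoints, freqs)) / n

# Sum of squared deviations of the mid-points, weighted by frequency.
ss = sum((m - mean_age) ** 2 * f for m, f in zip(midpoints, freqs))
variance = ss / (n - 1)
sd = math.sqrt(variance)

print(round(variance, 2), round(sd, 2))
```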

SD Vs Standard Error (SE)

• SD describes the variability among individual
values in a given data set
• SE is used to describe the variability among
separate sample means obtained from one
sample to another

• We interpret the SE of the mean to mean that
another similarly conducted study may give a
mean that lies within ± SE of the observed mean.
Standard Error

• SD is about the variability of individuals

• SE is used to describe the variability in
the means of repeated samples taken
from the same population.

• For example, imagine 5,000 samples, each of the same
size n=11. This would produce 5,000 sample means.
This new collection has its own pattern of variability. We
describe this new pattern of variability using the SE, not
the SD.
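
The thought experiment can be simulated. The sketch below assumes a hypothetical normal population with mean 100 and SD 15 (values not from the slides) and compares the spread of the 5,000 sample means with the standard result SE = SD/√n, which the slides do not state explicitly.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

POP_MEAN, POP_SD = 100.0, 15.0  # hypothetical population values
N_SAMPLES, n = 5000, 11         # as in the example above

sample_means = []
for _ in range(N_SAMPLES):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(n)]
    sample_means.append(statistics.mean(sample))

se_empirical = statistics.stdev(sample_means)  # spread of the 5,000 means
se_theory = POP_SD / n ** 0.5                  # SD / sqrt(n)

print(round(se_empirical, 2), round(se_theory, 2))
```

The spread of the sample means is much smaller than the SD of the individual values, which is the distinction the two terms are meant to capture.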
6. Coefficient of variation (CV)
• When two data sets have different units
of measurement, or their means differ
sufficiently in size, the CV should be
used as a measure of dispersion.
• It is the best measure to compare the
variability of two series or sets of
observations.
• Data with a smaller coefficient of variation are
considered more consistent.
CV is the ratio of the SD to the mean multiplied by 100%.

$CV = \dfrac{S}{\bar{x}} \times 100\%$

               SD          Mean         CV (%)
SBP            15 mmHg     130 mmHg     11.5
Cholesterol    40 mg/dl    200 mg/dl    20.0

• "Cholesterol is more variable than systolic blood
pressure"
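
The two CV values in the table can be reproduced directly from the formula; a minimal sketch (the function name `cv` is illustrative):

```python
def cv(sd, mean):
    """Coefficient of variation as a percentage."""
    return sd / mean * 100

cv_sbp = cv(15, 130)    # systolic blood pressure
cv_chol = cv(40, 200)   # cholesterol

print(round(cv_sbp, 1), round(cv_chol, 1))
```

Because the CV is unit-free, the two variables can be compared even though they are measured in different units.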
