Lecture 1 - IntrotoBiostats - 1

Sikhulile Moyo, PhD, MSc, MPH.
Research Associate/Snr Lecturer
Introduction to Biostatistics
Objectives
• Explain the need for studying biostatistics in science

• List applications of biostatistics
• Define terms used in biostatistics
• Explain descriptive statistics
• Describe and produce descriptive statistics for categorical data
Statistics and the Scientific Method
Scientific Method:
Question → Design Study → Collect Data → Analyze Data → Make Conclusions
Why study statistics?
• MLS and CHS degree programmes prescribe a research project
• Much research relies on statistics
• The literature is full of statistics
• Statistics supports the study of other courses
Defining Statistics (1)
To analyze data, we use statistics.
Statistics are methods/tools that we use to:

• Collect
• Organize
• Summarize
• Analyze
• Present
• Understand (draw conclusions from the data)
the data we collect in health and disease, and in research studies.
Examples of Statistics (a)
Example: A study was done on reversible bone loss in breast-feeding women compared to controls.
Group = 1 (breast-feeding women) or 2 (non-breast-feeding women)
Bone = Percent change in the mineral content of women's spines
Examples of Statistics (b)
The raw data from this study:

ID Group Bone ID Group Bone ID Group Bone ID Group Bone
1 2 -4.4 21 1 -0.4 41 2 -3.6 61 2 -4.7
2 1 2.9 22 2 -2.1 42 2 0.4 62 2 -3.3
3 1 -1.5 23 2 -4.9 43 1 2.2 63 2 -6.8
4 1 1.2 24 1 0.7 44 2 0.2 64 1 1.7
5 2 -2.1 25 2 -6.5 45 1 0.9 65 2 -5.1
6 1 0.3 26 2 -2.7 46 2 1.7 66 2 -3.1
7 2 -4.3 27 1 -0.4 47 2 -3.8 67 2 -0.8
8 2 -6.5 28 2 -5.2 48 2 2.2 68 2 -2
9 2 -7 29 1 -0.1 49 2 -5.3 69 2 -8.3
10 2 -0.3 30 1 -0.4 50 2 -2.5
11 2 -4 31 2 -1 51 2 -1.8
12 1 0 32 1 -2.2 52 2 -6.8
13 2 -6.2 33 2 -5.2 53 2 -5.7
14 1 -0.1 34 1 2.4 54 2 -4.7
15 1 -1.6 35 2 -2.3 55 2 -5.3
16 2 0.3 36 1 -0.6 56 2 -5.9
17 2 -2.2 37 2 -4.9 57 2 -7.8
18 2 -2.2 38 2 -2.5 58 1 1.1
19 1 -0.1 39 2 -1 59 1 1
20 2 -5.6 40 2 -3 60 1 -0.2
Source: Baldi, B, Moore, DS. Practice of Statistics in the Life Sciences. New York: W.H. Freeman and Company; 2009
Examples of Statistics (c)
The data summarized in a box plot:
[Figure: side-by-side box plots of % change in mineral content (y-axis, from -8 to 2) for other women and breast-feeding women]
In statistics, we use a
study sample to make
inference about some
population that we are
interested in.
Sample
N=5
What are some types of statistics that you are familiar with?
… in all of these examples, statistics are a way to organize the

raw data in some meaningful way.
Need for Biostatistics (1)
Biostatistics is the application of statistical methods to problems in the biological and related sciences (medicine, public health, etc.).
In biology and health sciences, there is always variability. Examples:
• Range of normal BP values in the human body
• Amount of hemoglobin in the blood
• Size of tumor
• Proportion with genetic abnormality
• Growth rates, etc
Need for Biostatistics (2)
The field of biostatistics provides graphical and numerical

methods that can:
• Quantify data
• Present data
• Account for biological variation
in biological research studies.
Applications of Biostatistics (1)
Biostatistical methods have a role in:

• Official health statistics
– Ex.: Studying trends of number of cases of a disease over time
• Epidemiology
– Ex.: Association of diseases with some aetiological factors
• Clinical studies
– Ex.: Comparison of treatments in clinical trials
Applications of Biostatistics (2)
Biostatistical methods have a role in:

• Human biology
– Ex.: Growth pattern
• Agriculture
– Ex.: crop yields
• Laboratory studies
– Ex.: Dose-response studies
– Ex: Reference intervals
• Health service administration
– Ex.: With limited resources, there may be need to prioritize target groups for necessary interventions
Parameter vs. Statistic (1)
Parameter: a numeric quantity, usually unknown, that describes a certain population characteristic
• Examples: mean height, median income, proportion infected with HIV, prevalence of breast cancer, etc.
Statistic: a quantity, calculated from a sample of data, that describes the sample and is used to estimate a parameter; also called an estimate
In other words:
• A parameter is a feature of the population.
• A statistic is a feature of the sample (random subset of a
population)
Common parameters and statistics and their symbols
Quantity                 Population Parameter      Sample Statistic
Mean                     µ (“mu”)                  x̄ (“x-bar”)
Standard deviation       σ (“sigma”)               s (“s”)
Variance                 σ² (“sigma-squared”)      s² (“s-squared”)
Proportion               p (“p”)                   p̂ (“p-hat”)
Correlation              ρ (“rho”)                 r (“r”)
Regression coefficient   β (“beta”)                b (“b”)
Variables
A variable is a quantity or measure that can be different (varies) from person to person (or object to object). It is a characteristic of a population.
Examples:
Variable Possible Values
Height (cm) 158, 169.3, 170, 200.6
Weight (kg) 10.2, 50, 69.4, 84
Outcome of disease Recovery, chronic illness, death
Parity 0, 1, 6, 8, 10
Marital status Single, married, widowed, separated, cohabiting
Stage of disease/cancer I, II, III, IV
Hemoglobin (g/dl) 8.9, 14.2, 12.7
Number of AIDS cases 278, 301, 313, 350
Classification of Variables
Variable
  Categorical (Qualitative): Nominal or Ordinal
  Numerical (Quantitative): Discrete or Continuous
Variables can be classified as:

• Categorical (qualitative, factor)
• A categorical variable places an individual (or observation)
into one of several groups or categories
• Categorical variables are also called ‘qualitative variables’
• Categorical variable is also referred to as a ‘factor.’
• Can be measured on a nominal or ordinal scale.
Variables: Categorical
• Categorical Nominal
– No natural order of the levels, mutually exclusive
• Example: Gender, race, species, HIV status, blood
groups, alive/dead, village of birth, eye colour, tall/short,
marital status, beliefs, types of tumor, etc.
• Name only, no order, magnitude unimportant
Variables: Categorical
• Categorical Ordinal
– Some natural ordering of the levels
• Example: Severity scale, good/better/best;
no/mild/moderate/severe, low/middle/high income,
tumor (benign/malignant), etc
• Order important, magnitude unimportant
Variables: Types
• Categorical Ordinal (Rank Data)
– numbers used only to order the data, thus the name, rank data.
• Example: ten leading causes of death, class position in a test, Olympic medals, etc.
Top 10 Causes of Death in Botswana 2013
1. HIV/AIDS 32%
2. Malaria 7%
3. TB 6%
4. Diarrheal Diseases 4%
5. Cancer 4%
6. Pre-Term Birth Complications 2%
7. Ischemic Heart Disease 2%
8. Stroke 2%
9. STDs 2%
10. Road Injuries 2%
Quantitative Variables (numerical, scale):

• A quantitative variable takes numerical values for which
arithmetic operations such as adding and averaging make
sense.
• The values of a quantitative variable are usually recorded in a
unit of measurement such as seconds or kilograms.
• Quantitative variables are also called ‘numerical variables.’
• Can be measured at a discrete or continuous level.
Variables: Quantitative
Types of variables (measurement scales):

• Quantitative (numerical)
– Discrete: takes on integer values, does not take
intermediate values, (count/integer data)
• Example: Number of children in a family, number of deaths in
millions, number of pills per dose, number of visits to the doctor
– Continuous: takes on many values, may be decimal

• Example: Blood pressure, bilirubin, Hb, sodium, glucose, time,
height, mass, temperature, etc
Variables: Roles
Roles that variables may play in statistical

analyses:
• Predictor: a variable used to predict or explain
the value of another variable
– Also called: explanatory variable, covariate, exposure,
independent variable, study factor
• Outcome: a variable that is predicted or

explained by one or more predictor variables
– Also called: response variable, dependent variable
Variables: Distributions
The distribution of a variable tells us what

values the variable takes and how often it
takes these values (frequency distribution).
• For example, the distribution of a categorical variable
lists the categories and gives either the count or percent
of individuals that fall in each category.
– Males: 62%
– Females: 38%
• For example, the distribution of a quantitative variable is often summarized by its mean, median, and standard deviation.
What is data?
• Measurements collected on a variable as

a result of taking observations.
• Data may have units (e.g. mm, g, etc).
• Can be classified as quantitative or
categorical
Derived Data
Derived data are computed from more than one variable:

• Percentages
– Ratio expressed as percentage.
– e.g. change in outcome following intervention
• Ratios
– Quotient of two variables e.g. BMI
• Rates
– e.g. disease rates
Class Activity: Activity 1
• Classifying Variables
• Number of deaths in Botswana in a specific year
• Number of previous miscarriages an expectant mother has had
• Anti-streptolysin O titre (ASOT)
• Estimated glomerular filtration rate (eGFR)
• Arterial PCO2, mmHg
• Concentration of chlorine in water
• Disease outcome
• Body mass index
• Stages of cancer
• Weight, kg
• Malaria parasitemia
• HIV viral load
• Level of education (illiterate, primary, secondary and tertiary)
• Haptoglobin phenotypes
• Number of passion killings in Botswana
• Length of time to recovery after a heart attack in years
Key Points (1)
• Biostatistics is the application of statistics to biological problems.

• Biostatistical methods provide a way of quantifying the data and
handling the variability that is present in nearly all biological data.
• Applications of biostatistics include: official health statistics,
epidemiology, clinical trials, human biology, laboratory studies,
and health service administration.
Key Points (2)
• A parameter is a feature of the population; a statistic is a feature

of the sample.
• Variables can be categorical (nominal or ordinal) or quantitative
(discrete or continuous).
• A variable may play the role of a predictor or an outcome in a
statistical analysis.
• Descriptive statistics for categorical variables include frequency
tables and bar charts.
Descriptive Methods for Categorical
Data
Classification of Statistics
Two main areas:

a. Descriptive statistics
i. Categorical data
ii. Quantitative data
b. Inferential statistics
i. Categorical data
ii. Quantitative data
Descriptive Statistics (1)
Descriptive statistics are used to:

• Identify missing data, errors in measurement, and
other data collection problems
• Assess the validity of assumptions needed for more
formal (inferential) analyses
• Understand basic aspects of the data
– Details of the distribution of each variable
– Sizes of subgroups
– Relationships between variables
– Familiarize with data
Descriptive Statistics (2)
The tools of descriptive statistics are:

• Tables
• Graphs / charts
• Numerical summaries
Descriptives for Categorical Data (1)
Categorical variables are summarized with percentages or

proportions.
Descriptive summaries are usually presented in:

• Frequency tables
– Typically used in journal articles, presentations, longer reports
• Bar charts
– Typically used only in presentations or longer reports
Descriptives for Categorical Data (2)
Example: You have conducted a study on birth control

method use in a sample of 290 individuals. You obtain
the following data:
Method Number of individuals
Abstinence 9
Oral contraceptive 93
Depo-provera 26
Loop 49
Spermicide 20
Condoms 75
Vasectomy 9
Hysterectomy 6
Norplant 3
One-way Frequency Tables
Birth control method is a categorical nominal variable. One

way you might choose to present this data is with a one-
way frequency table:
Birth control method   Frequency   Percent   Cumulative Percent
Abstinence             9           3         3
Oral contraceptive     93          32        35
Depo-Provera           26          9         44
Loop                   49          17        61
Spermicide             20          7         68
Condoms                75          26        94
Vasectomy              9           3         97
Hysterectomy           6           2         99
Norplant               3           1         100
Total                  290         100       100
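A one-way frequency table like this can also be built programmatically. The following is a minimal Python sketch, not part of the original lecture; the function name `frequency_table` and the dictionary layout are illustrative assumptions. Cumulative percent is computed from the running count, so rounding happens only when printing.

```python
# Counts copied from the birth control example above.
counts = {
    "Abstinence": 9,
    "Oral contraceptive": 93,
    "Depo-Provera": 26,
    "Loop": 49,
    "Spermicide": 20,
    "Condoms": 75,
    "Vasectomy": 9,
    "Hysterectomy": 6,
    "Norplant": 3,
}


def frequency_table(counts):
    """Return rows of (category, frequency, percent, cumulative percent)."""
    total = sum(counts.values())
    rows = []
    running = 0  # running count used for the cumulative percent
    for category, freq in counts.items():
        running += freq
        rows.append((category, freq, 100 * freq / total, 100 * running / total))
    return rows


table = frequency_table(counts)
for category, freq, pct, cum_pct in table:
    # Percentages are rounded to whole numbers only for display.
    print(f"{category:<20}{freq:>6}{pct:>8.0f}{cum_pct:>8.0f}")
```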
Bar Charts
You might also show the data in a bar chart:
Descriptives for Categorical Data (1):
2 variables
Frequency tables and bar charts can also be used to summarize

information on two categorical variables at the same time.
• These types of tables and charts can be used to examine relationships
between the two variables.
2 Variables
Example: 2165 individuals were sampled in a study designed to assess

whether HIV infection is a risk factor for pulmonary tuberculosis (PTB).
Among the study subjects:
• 651 were HIV(-)
– Of these, 57 were PTB(+) and 594 were PTB(-)
• 1514 were HIV(+)

– Of these, 875 were PTB(+) and 639 were PTB(-)
2 Variables
HIV status and PTB status are binary variables:

• Only two possible values: (+) or (-)
• Such variables are often coded with 0/1 values
...
Two-way Frequency Tables (1)
HIV status and PTB status are both categorical nominal variables. You could present this data in a two-way frequency table:
                          PTB status
HIV status   Positive       Negative       Total
Positive     875 (57.8%)    639 (42.2%)    1514 (100.0%)
Negative     57 (8.8%)      594 (91.2%)    651 (100.0%)
Total        932 (43.0%)    1233 (57.0%)   2165 (100.0%)
Note that the percentages shown in brackets are

row percentages
• This is because the aim of the study was to assess
whether the percentage of PTB(+) individuals was higher
among HIV(+) individuals than among HIV(-) individuals.
• Depending on the study question, you might choose to
show either row or column percentages.
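Row percentages divide each cell by its row total. A minimal Python sketch of that calculation for the HIV / PTB counts follows; the nested-dictionary layout and the name `row_percentages` are assumptions made for illustration only.

```python
# Counts from the HIV / PTB example: counts[hiv_status][ptb_status]
counts = {
    "HIV(+)": {"PTB(+)": 875, "PTB(-)": 639},
    "HIV(-)": {"PTB(+)": 57, "PTB(-)": 594},
}


def row_percentages(counts):
    """Express each cell as a percentage of its row total."""
    result = {}
    for row, cells in counts.items():
        row_total = sum(cells.values())
        result[row] = {col: 100 * n / row_total for col, n in cells.items()}
    return result


row_pct = row_percentages(counts)
for row, cells in row_pct.items():
    print(row, {col: round(p, 1) for col, p in cells.items()})
```

Column percentages would be obtained the same way after swapping the roles of the two variables.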
What do you notice about percentage of PTB(+) individuals

among those with HIV vs. those without HIV?
Positive 875 (57.8%) 639 (42.2%) 1514 (100.0%)
Negative 57 (8.8%) 594 (91.2%) 651 (100.0%)
Total 932 (43.0%) 1233 (57.0%) 2165 (100.0%)
• Note that you are making an observation about the relationship,
or association, between the two variables.
Bar charts - Frequencies
You might also show this same data in a bar chart:
Bar charts - Percentages
Another option would be to show percentages rather than

frequencies.
Key Points
• Descriptive statistics for categorical variables include frequency

tables and bar charts.
Measures of Central Tendency and
Dispersion
Module #2
Objectives of Descriptive Statistics
This session will address the following topics:
• Calculation and interpretation of measures of central tendency

– Arithmetic mean
– Median
– Mode
• The appropriate application of measures of central tendency
• Calculation and interpretation of measures of variability

– Range
– Inter-quartile range
– Standard deviation
– Standard error for the mean
• For continuous variables we have two major
mathematical descriptions at our disposal and we need
both to completely describe the shape of the distribution
of observations
a) Measures of location
b) Measures of dispersion/variability/spread
• In addition to describing the data in mathematical terms, these summary statistics are necessary for precise and efficient comparisons of different sets of data.
• Consider Figures 1 and 2. Distributions may differ in location, in variability, or in both.
Figure 1. Distribution of the value of factor X in two populations A and B (y-axis: no. of people): same location, different variability
Figure 2. Distribution of the value of factor Y in two populations A and B (y-axis: no. of people): same variability, different locations
Measures of Central Tendency
• Three measures frequently used to provide a “Typical Value” for a given continuous
variable in a specific population.
Measures of Central Tendency
Quick definitions
– Mode
• the most frequently occurring score
– Median
• the mid-point of a set of ordered scores
– Mean
• the result of dividing the arithmetic sum of
scores by the number of scores
Finding the Mode
• Annual salary
– 4, 3, 3, 2, 3, 8, 4, 3, 7, 2
• units of $10k
• Incubation period for 6 Hepatitis affected persons

– 29, 31, 24, 29, 30, 25
Calculating the Mode
To compute the mode:
• Arrange the data in sequence from low to high

• Count the number of times each value appears
• The most frequently appearing value is the mode
Finding the Mode
• Annual salary
– 2, 2, 3, 3, 3, 3, 4, 4, 7, 8
• The mode is 3
• Incubation period for 6 Hepatitis affected persons

– 24, 25, 29, 29, 30, 31
• Mode is 29
• The Mode of a distribution is the value that is observed
most frequently in a given data set (rarely used).
- There may be no mode - when ?
- There may be more than one mode - when ?
- Not very amenable to statistical tests.
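The counting procedure for the mode can be sketched in Python as below. This is an illustrative sketch rather than part of the lecture: the helper name `modes` is an assumption, and it assumes a non-empty list. It returns every value tied for the highest count, and an empty list when no value repeats.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or [] when every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # no value repeats, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)


salaries = [4, 3, 3, 2, 3, 8, 4, 3, 7, 2]   # annual salary, units of $10k
incubation = [29, 31, 24, 29, 30, 25]       # incubation periods

print(modes(salaries))      # [3]
print(modes(incubation))    # [29]
print(modes([1, 1, 2, 2]))  # [1, 2], two modes
print(modes([1, 2, 3]))     # [], no mode
```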

Median
• The Median describes literally the middle of the data. It is defined as the value above and below which half (50%) of the observations fall.
Finding the Median
Exercise
Person Sex Salary ($)
1 F 4000
2 F 3000
3 M 3000
4 F 2000
5 F 3000
6 M 8000
7 M 4000
8 M 3000
9 F 7000
10 F 2000
Which salary figure is the median?

Computing the Median
– The number of observations or scores is referred to as "n".
• Arrange the scores in order from smallest to largest (ascending order)
• Count the number of scores (determine n)
• If n is an odd number, then median = the ((n+1)/2)th observation
For example, consider the observations
8, 25, 7, 5, 8, 3, 10, 12, 9
Arranged in order, the observations are
3, 5, 7, 8, 8, 9, 10, 12, 25
In this case, n = 9 (an odd number); therefore, the median is the (9+1)/2 = 5th observation, which is 8.
Computing the Median
(even number of observations)
– For another example, consider the observations
• 11, 7, 10, 9, 15, 13
• Arranged in order, the observations are
• 7, 9, 10, 11, 13, 15
• In this case, n = 6 (an even number); therefore, the median is:
• the average of the (n/2)th and (n/2 + 1)th observations
• i.e. the average of the 3rd and 4th observations
= (10+11)/2
= 10.5
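The odd and even cases can be combined in one short function. The sketch below is illustrative only (the function name `median` is chosen here and shadows nothing from the lecture); it uses 0-based list indexing, so the ((n+1)/2)th observation is at index n // 2.

```python
def median(values):
    """Median by the rank rule: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n+1)/2)th observation, which is index n // 2 (0-based).
        return ordered[middle]
    # Even n: average of the (n/2)th and (n/2 + 1)th observations.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([8, 25, 7, 5, 8, 3, 10, 12, 9]))  # 8
print(median([11, 7, 10, 9, 15, 13]))          # 10.5
```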
Median
• The advantage of this measure is that it is unaffected

by extreme values !
• The disadvantage is that it is selected by its rank and
does not contain information on the other values in
the distribution.
• It is also less amenable than the mean to statistical
tests.
Mean (arithmetic average)
Most commonly used measure of location. It is calculated by adding all the observed values and dividing by the total sample size.
Each observation is denoted x
The total number of observations is n
Summation is denoted by sigma, Σ
The mean itself is written x̄ (“x-bar”):
$$\bar{x} = \frac{\sum x}{n}$$
Computing the Mean
Exercise
Person Sex Salary ($)
1 F 4000
2 F 3000
3 M 3000
4 F 2000
5 F 3000
6 M 8000
7 M 4000
8 M 3000
9 F 7000
10 F 2000
For this simple problem, you could compute the mean with pencil and paper by summing the
numbers in the salary column and dividing by “n” (10).
Method for Computing the Mean
To compute the mean:
– Count the number of scores (determine “n”)
– Determine the sum of the scores by adding
them
– Divide the sum by “n”
• For example, consider the observations

– 8 , 25 , 7 , 5 , 8 , 3 , 10 , 12 , 9
– In this case, n=9 and the sum=87; therefore,

the mean
= 87 / 9
= 9.67
• The mean has a lot of good theoretical properties and it is
used as the basis of many statistical tests . For a
symmetrical distribution the mean is a good summary
statistic. It is less useful for an asymmetric distribution
Q. What is its limitation as a summary statistic in

asymmetrical distributions?
A. It can be distorted by outliers, therefore giving a poor
“typical” value.
Imagine weight in kg in a sample of 5 people:
50, 60, 50, 40, 120
The mean is calculated as 320 / 5 = 64 kg. Is this value of 64 kg “typical” for the observations?
Figure 4: Symmetric and asymmetric distributions (two panels; x-axes: value of factor K and value of factor J; y-axis: no. of people)
The mean and the median
• If the extreme score 25 in the earlier example (8, 25, 7, 5, 8, 3, 10, 12, 9) is changed to 45, the median stays the same, 8 (the 5th value), for both sets of data.
• On the other hand, the mean changed

from 9.67 to 11.89 with the one extreme
score changing from 25 to 45.
• Extreme scores in a set of data have a

more pronounced effect on the mean than
on the median.
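This comparison can be checked with Python's standard `statistics` module. The sketch below uses the nine observations from the earlier median example and a copy in which the extreme score 25 is replaced by 45; the variable names are illustrative.

```python
from statistics import mean, median

original = [8, 25, 7, 5, 8, 3, 10, 12, 9]
modified = [8, 45, 7, 5, 8, 3, 10, 12, 9]  # extreme score changed from 25 to 45

for label, scores in (("original", original), ("modified", modified)):
    # The mean shifts with the extreme score; the median does not.
    print(f"{label}: mean = {mean(scores):.2f}, median = {median(scores)}")
```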
Choosing a Measure of Central Tendency
• Depends on the nature of the distribution
• For continuous variables in a unimodal and

symmetric distribution the mean, median and mode
are identical.
• With a skewed distribution the median may be more

useful
• For statistical analyses the mean is the preferred

measure.
Measures of Spread, Dispersion, Variability
• In addition to a measure of central tendency, in describing a distribution it is

important to provide information concerning the relative position of other data points
in the sample, (that is, a measure of spread or variability).
Range: the simplest measure = highest value minus lowest value
• Take a sample of 10 heights (70, 95, 100, 103, 105, 107, 110, 112, 115, 140 cm)
Lowest (minimum) value = 70 cm
Highest (maximum) value = 140 cm
Range is therefore 140 – 70 = 70 cm
Simple to understand but far from perfect - why ?
• The range is derived from extreme values. It says nothing about the values in between
• Not stable (as sample size increases the range can change dramatically)
• Not amenable to statistical tests.
Figure 8. Two distributions with the same range but different mean and variability (y-axis: no. of people)
• Percentiles: Those values in a series of observations, arranged in ascending order of magnitude, which divide the distribution into 100 equal parts (thus the median is the 50th percentile).
• Quartiles: The values which divide a series of observations,

arranged in ascending order, into 4 equal parts. (Thus the
2nd Quartile is the Median).
• The Interquartile Range represents the central portion of

the distribution and is calculated as the difference between
the third quartile and the first quartile. This range includes
about one-half of the observations in the set, leaving one
quarter of the observations on each side.
Median and quartiles
Sort the data in increasing order
The median is the middle value (if n is odd) or the average of the two middle values (if n is even); it is a measure of the “center” of the data
Quartiles: dividing the set of ordered values into 4 equal parts
Q2 = second quartile = median
first 25% | Q1 | second 25% | Q2 | third 25% | Q3 | fourth 25%
IQR = Interquartile range = Q3 - Q1
Measures of Data Variability
• Interquartile Range
– the difference between the score representing the 75th percentile and the score
representing the 25th percentile
– Arrange the observations in ascending order
– Find the positions of Q1 and Q3
– Identify the values; the interquartile range = Q3 - Q1
– Example: 29, 31, 24, 29, 30, 25
– Arrange: 24, 25, 29, 29, 30, 31
» Position of Q1 = (n+1)/4 = 7/4 = 1.75
» Q1 = 24 + 0.75(25-24) = 24.75
» Position of Q3 = 3(n+1)/4 = 21/4 = 5.25
» Q3 = 30 + 0.25(31-30) = 30.25
» Q3 – Q1 = 30.25 – 24.75 = 5.50
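The (n+1)-based positions used in this example correspond to the "exclusive" method of `statistics.quantiles` in Python 3.8 and later, which interpolates linearly between neighbouring ordered values. A minimal sketch reproducing the hand calculation, with illustrative variable names:

```python
from statistics import quantiles

incubation = [29, 31, 24, 29, 30, 25]

# n=4 asks for quartiles; "exclusive" places the i-th quartile at position
# i * (n + 1) / 4 in the ordered data, as in the worked example.
q1, q2, q3 = quantiles(incubation, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q3, iqr)  # 24.75 30.25 5.5
```

Other software may use different quartile conventions, so results for small samples can differ slightly between packages.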
Exercise
• Determine the first and third quartiles and

interquartile range for the following data
– 0, 3, 0, 7, 2, 1, 0, 1, 5, 2, 4, 2, 8, 1, 3, 0, 1, 2, 1
So how do we get a single mathematical measure that summarises the variability of an observed set of values?
• The most frequent and most informative measure is the VARIANCE and its related functions
• The variance is computed in stages:

• 1. Calculate the mean as a measure of central location
(MEAN)
• 2. Calculate the difference between each observation and the mean (DEVIATION)
$(x - \bar{x})$
• 3. Next square the differences (SQUARED DEVIATION)
$(x - \bar{x})^2$
• Q. What is the effect of this ?
- Negative and positive deviations will not cancel each other out.
- Values further from the mean have a bigger impact.
• 4. Sum up these squared deviations (SUM OF THE SQUARED DEVIATIONS)
$\sum (x - \bar{x})^2$
• 5. Divide this SUM OF THE SQUARED DEVIATIONS by the total number of observations minus 1 (n-1) to give the VARIANCE
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
This is a measure of the variability of the data
Why divide by n - 1 ?
This is an adjustment for the fact that the sample mean is just an estimate of the true population mean. Dividing by n - 1 rather than n makes the variance slightly bigger.
Measures of Data Variability
• Standard Deviation
– The standard deviation is the square root of the average squared deviation from the mean
$$SD = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
– Computational formula:
$$SD = \sqrt{\frac{n \sum x_i^2 - \left(\sum x_i\right)^2}{n(n - 1)}}$$
Calculating Standard Deviation
Score (x)   Mean (x̄)   Deviation (x − x̄)   Squared deviation (x − x̄)²
13
12
13
14
10
16
15
24
20
18
Σx = 155
Calculating Standard Deviation
Score (x)   Mean (x̄)   Deviation (x − x̄)   Squared deviation (x − x̄)²
13          15.5        -2.5                 6.25
12          15.5        -3.5                 12.25
13          15.5        -2.5                 6.25
14          15.5        -1.5                 2.25
10          15.5        -5.5                 30.25
16          15.5        0.5                  0.25
15          15.5        -0.5                 0.25
24          15.5        8.5                  72.25
20          15.5        4.5                  20.25
18          15.5        2.5                  6.25
Σx = 155                Σ(x − x̄) = 0        Σ(x − x̄)² = 156.5

$$SD = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{156.5}{9}} = 4.17$$

Let's use the computational formula...
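Both routes to the standard deviation can be checked numerically. The following Python sketch is illustrative and not part of the lecture: it computes the sum of squared deviations, then the standard deviation from the definitional and the computational formulas, and compares both with `statistics.stdev`, which also uses the n - 1 denominator.

```python
from math import sqrt
from statistics import stdev

scores = [13, 12, 13, 14, 10, 16, 15, 24, 20, 18]
n = len(scores)
mean = sum(scores) / n  # 155 / 10 = 15.5

# Definitional formula: sum of squared deviations divided by n - 1.
sum_sq_dev = sum((x - mean) ** 2 for x in scores)
sd_definition = sqrt(sum_sq_dev / (n - 1))

# Computational formula: uses only the sum of x and the sum of x squared.
sum_x = sum(scores)
sum_x2 = sum(x * x for x in scores)
sd_computational = sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

print(sum_sq_dev)                   # 156.5
print(round(sd_definition, 2))      # 4.17
print(round(sd_computational, 2))   # 4.17
print(round(stdev(scores), 2))      # 4.17
```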
Choosing the Measures of
Central Location and Dispersion
Grouped data
Mean
• Step 1: construct a frequency distribution
• Step 2: find x, the midpoint of each class
• Step 3: calculate x*f, i.e. x multiplied by the frequency
• Step 4: find the sum of x*f
• Step 5: divide the sum of x*f by the total frequency
$$\bar{x} = \frac{\sum xf}{\sum f}$$
• Determine the mean, median for the data
presented below
Class interval frequency

0–2 1
3–5 3
6–8 5
9 – 11 4
12 – 14 2
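The five steps for the grouped mean can be sketched in Python as follows. This is a hedged illustration: the data layout (a list of class limits paired with frequencies) and the name `grouped_mean` are assumptions, and the midpoint is taken as the average of the stated class limits.

```python
# Each entry: ((lower limit, upper limit), frequency), from the table above.
classes = [((0, 2), 1), ((3, 5), 3), ((6, 8), 5), ((9, 11), 4), ((12, 14), 2)]


def grouped_mean(classes):
    """Mean of grouped data: sum of midpoint * frequency over total frequency."""
    total_f = 0
    total_xf = 0.0
    for (lower, upper), f in classes:
        midpoint = (lower + upper) / 2  # Step 2: class midpoint x
        total_xf += midpoint * f        # Steps 3 and 4: accumulate x * f
        total_f += f
    return total_xf / total_f           # Step 5: divide by total frequency


print(grouped_mean(classes))
```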
The Mean, median and variance for grouped data
• As part of the soccer camp study, the investigators wanted to
estimate how much the respondents would be willing to pay for
their child to attend the camp. They felt it was best to measure
with price ranges (an ordinal measurement scale) rather than
with specific prices (a ratio measurement scale). The following
question was asked as part of their survey:
• Q) Assuming the camp would run for five days, two hours
each day, how much would you be willing to pay for your child to
attend the camp?
• 1. Less than $10
• 2. $11 - $25
• 3. $26 - $50
• 4. More than $50
• What is the mean price the respondents would be willing to pay?
• What is the median price the respondents would be willing to
pay?
• Calculate the variance and hence the standard deviation?
PRICE
                           Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Less than $10    2           1.0       1.4             1.4
          $11-25           20          10.0      14.3            15.7
          $26-50           63          31.5      45.0            60.7
          More than $50    55          27.5      39.3            100.0
          Total            140         70.0      100.0
Missing   System Missing   60          30.0
          Total            60          30.0
Total                      200         100.0
1. The median for grouped data calculation requires you to use the frequency distribution output and the class intervals that are the question’s response categories.
2. The formula for the median is:
$$Med = L_m + \left(\frac{\frac{n}{2} - cf_p}{f_m}\right) C_m$$
where:
Lm = lower boundary of class containing median
n = sample size
cfp = cumulative frequency of classes preceding class containing the median
fm = number of observations in class containing the median
Cm = width of the interval containing the median
• Step 1: set up the frequency distribution table
• Step 2. Identify the median class i.e the class interval with 50% of
the values above it or below it.
• Step 3: use the formula to find the median
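In our example,
The median class interval is the 26 - 50 class interval.
Lm = 26
n = 140
cfp = 2 + 20 = 22 (cumulative frequency of the two preceding classes)
fm = 63
Cm = 50 - 26 = 24
$$Med = L_m + \left(\frac{\frac{n}{2} - cf_p}{f_m}\right) C_m = 26 + \left(\frac{70 - 22}{63}\right) 24 = 44.29$$
The median price is $44.29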
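The grouped-median formula can be written as a small Python function. In this sketch the function and parameter names are illustrative assumptions; cf_p is taken as the cumulative frequency of the classes preceding the median class (2 + 20 = 22), as the definition of cf_p requires, with the lower boundary (26) and width (24) used in the example.

```python
def grouped_median(lower_boundary, n, cf_preceding, f_median, width):
    """Med = Lm + ((n/2 - cfp) / fm) * Cm for grouped data."""
    return lower_boundary + (n / 2 - cf_preceding) / f_median * width


med = grouped_median(
    lower_boundary=26,    # Lm: lower boundary of the median class ($26 - $50)
    n=140,                # valid responses
    cf_preceding=2 + 20,  # cfp: frequencies of the classes before the median class
    f_median=63,          # fm: frequency of the median class
    width=24,             # Cm: width of the median class
)
print(round(med, 2))  # 44.29
```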

Variance and standard deviation
$$\text{Variance} = \frac{1}{n - 1}\left[\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{n}\right]$$
End of Class
Lab Session
Excel Tutorial
Returning to the sputum exam results:
1 2 1 1 4 1 1 4 3 2 1 4 1
1 2 3 1 1 4 1 2 4 1 1 1
If we were going to conduct analyses on this variable, we would enter it in a column in Excel.
Frequencies in Excel (1)
COUNTIF function
• Create a place to put the frequency values
Bar Charts in Excel (1)
Frequency table
To create a bar chart, first select the frequency values.
HIV / PTB Data (frequencies shown)
Select the data & insert a column chart
Bar Charts in Excel: Two-way Tables
Homework
• Refer to Handout 1.2: Homework and Handout 1.3: Documentation for Ferroportin Data
