
Sikhulile Moyo, PhD, MSc, MPH.

Research Associate/Snr Lecturer

Introduction to Biostatistics
Objectives

• Explain the need for studying biostatistics in science


• List applications of biostatistics
• Define terms used in biostatistics
• Explain descriptive statistics
• Describe and produce descriptive statistics for categorical data

Statistics and the Scientific Method

Scientific Method (steps shown in the diagram):

Question → Design Study → Collect Data → Analyze Data → Make Conclusions

Why study statistics?
• MLS and CHS degree programmes prescribe a research project
• Much research relies on statistics
• The literature is full of statistics
• Statistics supports the study of other courses

Defining Statistics (1)

To analyze data, we use statistics.

Statistics are methods/tools that we use to:


• Collect
• Organize
• Summarize
• Analyze
• Present
• Understand (draw conclusions from the data)
the data we collect in health and disease, and in research studies.
Examples of Statistics (a)

Example: A study was done on reversible bone loss in breast-feeding women compared to controls.
Group = 1 (breast-feeding women) or 2 (non-breast-feeding women)
Bone = Percent change in the mineral content of women's spines

Examples of Statistics (b)

The raw data from this study:


ID Group Bone ID Group Bone ID Group Bone ID Group Bone
1 2 -4.4 21 1 -0.4 41 2 -3.6 61 2 -4.7
2 1 2.9 22 2 -2.1 42 2 0.4 62 2 -3.3
3 1 -1.5 23 2 -4.9 43 1 2.2 63 2 -6.8
4 1 1.2 24 1 0.7 44 2 0.2 64 1 1.7
5 2 -2.1 25 2 -6.5 45 1 0.9 65 2 -5.1
6 1 0.3 26 2 -2.7 46 2 1.7 66 2 -3.1
7 2 -4.3 27 1 -0.4 47 2 -3.8 67 2 -0.8
8 2 -6.5 28 2 -5.2 48 2 2.2 68 2 -2
9 2 -7 29 1 -0.1 49 2 -5.3 69 2 -8.3
10 2 -0.3 30 1 -0.4 50 2 -2.5
11 2 -4 31 2 -1 51 2 -1.8
12 1 0 32 1 -2.2 52 2 -6.8
13 2 -6.2 33 2 -5.2 53 2 -5.7
14 1 -0.1 34 1 2.4 54 2 -4.7
15 1 -1.6 35 2 -2.3 55 2 -5.3
16 2 0.3 36 1 -0.6 56 2 -5.9
17 2 -2.2 37 2 -4.9 57 2 -7.8
18 2 -2.2 38 2 -2.5 58 1 1.1
19 1 -0.1 39 2 -1 59 1 1
20 2 -5.6 40 2 -3 60 1 -0.2
Source: Baldi, B, Moore, DS. Practice of Statistics in the Life Sciences. New York: W.H. Freeman and Company; 2009
Examples of Statistics (c)

The data summarized in a box plot:

[Box plot: % Change in Mineral Content (y-axis, about -8 to 2) for Other Women and Breast-feeding Women]

Defining Statistics (2)

In statistics, we use a study sample to make inferences about some population that we are interested in.

Defining Statistics (3)

[Figure: Sample, N=5]
Defining Statistics (4)

What are some types of statistics that you are familiar with?

Defining Statistics (5)

What are some types of statistics that you are familiar with?

… in all of these examples, statistics are a way to organize the raw data in some meaningful way.
Need for Biostatistics (1)

Biostatistics is the application of statistical methods to problems in the biological and related sciences (medicine, public health, etc.).

In biology and health sciences, there is always variability. Examples:
• Range of normal BP values in the human body
• Amount of hemoglobin in the blood
• Size of tumor
• Proportion with genetic abnormality
• Growth rates, etc.
Need for Biostatistics (2)

The field of biostatistics provides graphical and numerical methods that can:
• Quantify data
• Present data
• Account for biological variation
in biological research studies.
Applications of Biostatistics (1)

Biostatistical methods have a role in:


• Official health statistics
– Ex.: Studying trends of number of cases of a disease over time
• Epidemiology
– Ex.: Association of diseases with some aetiological factors
• Clinical studies
– Ex.: Comparison of treatments in clinical trials

Applications of Biostatistics (2)

Biostatistical methods have a role in:


• Human biology
– Ex.: Growth pattern
• Agriculture
– Ex.: crop yields
• Laboratory studies
– Ex.: Dose-response studies
– Ex: Reference intervals
• Health service administration
– Ex.: With limited resources, there may be a need to prioritize target groups for necessary interventions
Parameter vs. Statistic (1)

Parameter: a numeric quantity, usually unknown, that describes a certain population characteristic
• Examples: mean height, median income, proportion infected with HIV, prevalence of breast cancer, etc.

Statistic: a quantity, calculated from a sample of data, that describes the sample and is used to estimate a parameter; also called an estimate
Parameter vs. Statistic (2)

In other words:
• A parameter is a feature of the population.
• A statistic is a feature of the sample (random subset of a
population)

Parameter vs. Statistic (3)

Common parameters and statistics and their symbols

Quantity                  Population Parameter      Sample Statistic
Mean                      µ (“mu”)                  x̄ (“x-bar”)
Standard deviation        σ (“sigma”)               s (“s”)
Variance                  σ² (“sigma-squared”)      s² (“s-squared”)
Proportion                p (“p”)                   p̂ (“p-hat”)
Correlation               ρ (“rho”)                 r (“r”)
Regression coefficient    β (“beta”)                b (“b”)
Variables

A variable is a quantity or measure that can be different (varies) from person to person (or object to object). It is a characteristic of a population.
Examples:
Variable                   Possible Values
Height (cm)                158, 169.3, 170, 200.6
Weight (kg)                10.2, 50, 69.4, 84
Outcome of disease         Recovery, chronic illness, death
Parity                     0, 1, 6, 8, 10
Marital status             Single, married, widowed, separated, cohabiting
Stage of disease/cancer    I, II, III, IV
Hemoglobin (g/dl)          8.9, 14.2, 12.7
Number of AIDS cases       278, 301, 313, 350
Classification of Variables

Variable
• Categorical (Qualitative): Nominal, Ordinal
• Numerical (Quantitative): Discrete, Continuous
Classification of Variables

Variables can be classified as:


• Categorical (qualitative, factor)
• A categorical variable places an individual (or observation)
into one of several groups or categories
• Categorical variables are also called ‘qualitative variables’
• A categorical variable is also referred to as a ‘factor’.
• Can be measured on a nominal or ordinal scale.
Variables: Categorical

• Categorical Nominal
– No natural order of the levels, mutually exclusive
• Example: Gender, race, species, HIV status, blood
groups, alive/dead, village of birth, eye colour, tall/short,
marital status, beliefs, types of tumor, etc.
• Name only, no order, magnitude unimportant

Variables: Categorical

• Categorical Ordinal
– Some natural ordering of the levels
• Example: Severity scale, good/better/best;
no/mild/moderate/severe, low/middle/high income,
tumor (benign/malignant), etc
• Order important, magnitude unimportant

Variables: Types
• Categorical Ordinal (Rank Data)
– numbers used only to order the data, thus the name, rank data.
• Example: ten leading causes of death, class position in a test, Olympic medals, etc.
Top 10 Causes of Death in Botswana 2013
1. HIV/AIDS 32%
2. Malaria 7%
3. TB 6%
4. Diarrheal Diseases 4%
5. Cancer 4%
6. Pre-Term Birth Complications 2%
7. Ischemic Heart Disease 2%
8. Stroke 2%
9. STDs 2%
10. Road Injuries 2%
Classification of Variables

Quantitative Variables (numerical, scale):


• A quantitative variable takes numerical values for which
arithmetic operations such as adding and averaging make
sense.
• The values of a quantitative variable are usually recorded in a
unit of measurement such as seconds or kilograms.
• Quantitative variables are also called ‘numerical variables.’
• Can be measured at a discrete or continuous level.

Variables: Quantitative

Types of variables (measurement scales):


• Quantitative (numerical)
– Discrete: takes on integer values, does not take
intermediate values, (count/integer data)
• Example: Number of children in a family, number of deaths in
millions, number of pills per dose, number of visits to the doctor

– Continuous: can take any value within a range, including decimal values
• Example: Blood pressure, bilirubin, Hb, sodium, glucose, time, height, mass, temperature, etc.
Variables: Roles

Roles that variables may play in statistical analyses:
• Predictor: a variable used to predict or explain the value of another variable
– Also called: explanatory variable, covariate, exposure, independent variable, study factor
• Outcome: a variable that is predicted or explained by one or more predictor variables
– Also called: response variable, dependent variable
Variables: Distributions

The distribution of a variable tells us what values the variable takes and how often it takes these values (frequency distribution).
• For example, the distribution of a categorical variable lists the categories and gives either the count or percent of individuals that fall in each category.
– Males: 62%
– Females: 38%
• The distribution of a quantitative variable is usually summarized numerically, for example by its mean, median, and standard deviation.
What is data?

• Measurements collected on a variable as a result of taking observations.
• Data may have units (e.g. mm, g, etc.).
• Can be classified as quantitative or categorical.
Derived Data

Derived data are calculated from more than one variable:
• Percentages
– Ratio expressed as a percentage.
– e.g. change in outcome following intervention
• Ratios
– Quotient of two variables, e.g. BMI
• Rates
– e.g. disease rates
Class Activity: Activity 1

• Classifying Variables

• Number of deaths in Botswana in a specific year
• Number of previous miscarriages an expectant mother has had
• Anti-streptolysin O titre (ASOT)
• Estimated glomerular filtration rate (eGFR)
• Arterial PCO2, mmHg
• Concentration of chlorine in water
• Disease outcome
• Body mass index
• Stages of cancer
• Weight, kg
• Malaria parasitemia
• HIV viral load
• Level of education (illiterate, primary, secondary and tertiary)
• Haptoglobin phenotypes
• Number of passion killings in Botswana
• Length of time to recovery after a heart attack, in years
Key Points (1)

• Biostatistics is the application of statistics to biological problems.
• Biostatistical methods provide a way of quantifying the data and handling the variability that is present in nearly all biological data.
• Applications of biostatistics include: official health statistics, epidemiology, clinical trials, human biology, laboratory studies, and health service administration.
Key Points (2)

• A parameter is a feature of the population; a statistic is a feature of the sample.
• Variables can be categorical (nominal or ordinal) or quantitative (discrete or continuous).
• A variable may play the role of a predictor or an outcome in a statistical analysis.
• Descriptive statistics for categorical variables include frequency tables and bar charts.
Descriptive Methods for Categorical Data
Classification of Statistics

Two main areas:
a. Descriptive statistics
i. Categorical data
ii. Quantitative data
b. Inferential statistics
i. Categorical data
ii. Quantitative data
Descriptive Statistics (1)

Descriptive statistics are used to:


• Identify missing data, errors in measurement, and
other data collection problems
• Assess the validity of assumptions needed for more
formal (inferential) analyses
• Understand basic aspects of the data
– Details of the distribution of each variable
– Sizes of subgroups
– Relationships between variables
– Familiarity with the data
Descriptive Statistics (2)

The tools of descriptive statistics are:


• Tables
• Graphs / charts
• Numerical summaries

Descriptives for Categorical Data (1)

Categorical variables are summarized with percentages or proportions.

Descriptive summaries are usually presented in:
• Frequency tables
– Typically used in journal articles, presentations, longer reports
• Bar charts
– Typically used only in presentations or longer reports
Descriptives for Categorical Data (2)

Example: You have conducted a study on birth control method use in a sample of 290 individuals. You obtain the following data:
Method                Number of individuals
Abstinence            9
Oral contraceptive    93
Depo-Provera          26
Loop                  49
Spermicide            20
Condoms               75
Vasectomy             9
Hysterectomy          6
Norplant              3
One-way Frequency Tables

Birth control method is a categorical nominal variable. One way you might choose to present this data is with a one-way frequency table:
Birth control method    Frequency    Percent    Cumulative Percent
Abstinence              9            3          3
Oral contraceptive      93           32         35
Depo-Provera            26           9          44
Loop                    49           17         61
Spermicide              20           7          68
Condoms                 75           26         94
Vasectomy               9            3          97
Hysterectomy            6            2          99
Norplant                3            1          100
Total                   290          100        100
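The percent and cumulative percent columns of a one-way frequency table can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original lab: the function name `frequency_table` is illustrative, and percentages are rounded to whole numbers as in the table above.

```python
# Counts taken from the birth control example (n = 290).
counts = {
    "Abstinence": 9,
    "Oral contraceptive": 93,
    "Depo-Provera": 26,
    "Loop": 49,
    "Spermicide": 20,
    "Condoms": 75,
    "Vasectomy": 9,
    "Hysterectomy": 6,
    "Norplant": 3,
}


def frequency_table(counts):
    """Return rows of (category, frequency, percent, cumulative percent)."""
    total = sum(counts.values())
    rows = []
    running = 0  # running count used for the cumulative percent
    for category, freq in counts.items():
        running += freq
        percent = round(100 * freq / total)
        cumulative = round(100 * running / total)
        rows.append((category, freq, percent, cumulative))
    return rows


table = frequency_table(counts)
for category, freq, percent, cumulative in table:
    print(f"{category:<20}{freq:>5}{percent:>6}{cumulative:>6}")
```

The cumulative percent is computed from the running count rather than by adding rounded percentages, so rounding errors do not accumulate.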
Bar Charts

You might also show the data in a bar chart:

Descriptives for Categorical Data (1):
2 variables

Frequency tables and bar charts can also be used to summarize information on two categorical variables at the same time.
• These types of tables and charts can be used to examine relationships between the two variables.
Descriptives for Categorical Data (2):
2 Variables

Example: 2165 individuals were sampled in a study designed to assess whether HIV infection is a risk factor for pulmonary tuberculosis (PTB).
Among the study subjects:
• 651 were HIV(-)
– Of these, 57 were PTB(+) and 594 were PTB(-)
• 1514 were HIV(+)
– Of these, 875 were PTB(+) and 639 were PTB(-)
Descriptives for Categorical Data (3):
2 Variables

HIV status and PTB status are binary variables:


• Only two possible values: (+) or (-)
• Such variables are often coded with 0/1 values

...
Two-way Frequency Tables (1)

HIV status and PTB status are both categorical nominal variables. You could present this data in a two-way frequency table:

HIV status    PTB Positive    PTB Negative    Total
Positive      875 (57.8%)     639 (42.2%)     1514 (100.0%)
Negative      57 (8.8%)       594 (91.2%)     651 (100.0%)
Total         932 (43.0%)     1233 (57.0%)    2165 (100.0%)
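The row percentages in a two-way table are each cell count divided by its row total. A minimal Python sketch using the HIV/PTB counts from the example; the nested-dictionary layout and the function name `row_percentages` are illustrative assumptions, not something the lab prescribes.

```python
# Cell counts from the HIV / PTB example (rows = HIV status, columns = PTB status).
table = {
    "HIV positive": {"PTB positive": 875, "PTB negative": 639},
    "HIV negative": {"PTB positive": 57, "PTB negative": 594},
}


def row_percentages(table):
    """Return each cell as a percentage of its row total, to one decimal place."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {
            col: round(100 * count / row_total, 1) for col, count in cells.items()
        }
    return result


row_pct = row_percentages(table)
for row, cells in row_pct.items():
    print(row, cells)
```

Column percentages would follow the same pattern with the roles of rows and columns swapped; which one to report depends on the study question.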

Two-way Frequency Tables (2)

Note that the percentages shown in brackets are row percentages.
• This is because the aim of the study was to assess whether the percentage of PTB(+) individuals was higher among HIV(+) individuals than among HIV(-) individuals.
• Depending on the study question, you might choose to show either row or column percentages.
Two-way Frequency Tables (3)

What do you notice about the percentage of PTB(+) individuals among those with HIV vs. those without HIV?

HIV status    PTB Positive    PTB Negative    Total
Positive      875 (57.8%)     639 (42.2%)     1514 (100.0%)
Negative      57 (8.8%)       594 (91.2%)     651 (100.0%)
Total         932 (43.0%)     1233 (57.0%)    2165 (100.0%)
Two-way Frequency Tables (4)

What do you notice about the percentage of PTB(+) individuals among those with HIV vs. those without HIV?
• Note that you are making an observation about the relationship, or association, between the two variables.

HIV status    PTB Positive    PTB Negative    Total
Positive      875 (57.8%)     639 (42.2%)     1514 (100.0%)
Negative      57 (8.8%)       594 (91.2%)     651 (100.0%)
Total         932 (43.0%)     1233 (57.0%)    2165 (100.0%)
Bar charts - Frequencies

You might also show this same data in a bar chart:

Bar charts - Percentages

Another option would be to show percentages rather than frequencies.
Key Points

• Descriptive statistics for categorical variables include frequency tables and bar charts.
Measures of Central Tendency and Dispersion

Module #2
Objectives of Descriptive Statistics

This session will address the following topics:

• Calculation and interpretation of measures of central tendency
– Arithmetic mean
– Median
– Mode
• The appropriate application of measures of central tendency
• Calculation and interpretation of measures of variability
– Range
– Inter-quartile range
– Standard deviation
– Standard error of the mean
• For continuous variables we have two major mathematical descriptions at our disposal, and we need both to completely describe the shape of the distribution of observations:

a) Measures of location
b) Measures of dispersion/variability/spread

• In addition to providing a description of the data in mathematical terms, these summary statistics are necessary for precise and efficient comparisons of different sets of data.

• Consider Figures 1 and 2. There may be differences in the location of the distributions and in their variability.
Figure 1. Distribution of the value of factor X in two populations A and B: same location, different variability. (Axes: Factor X vs. No. of People.)
Figure 2. Distribution of the value of factor Y in two populations A and B: same variability, different locations. (Axes: Factor Y vs. No. of People.)
Measures of Central Tendency

• Three measures are frequently used to provide a “typical value” for a given continuous variable in a specific population.
Measures of Central Tendency

Quick definitions
– Mode
• the most frequently occurring score
– Median
• the mid-point of a set of ordered scores
– Mean
• the result of dividing the arithmetic sum of scores by the number of scores
Finding the Mode

• Annual salary
– 4, 3, 3, 2, 3, 8, 4, 3, 7, 2
• units of $10k

• Incubation period for 6 Hepatitis affected persons
– 29, 31, 24, 29, 30, 25
Calculating the Mode

To compute the mode:

• Arrange the data in sequence from low to high


• Count the number of times each value appears
• The most frequently appearing value is the mode
Finding the Mode

• Annual salary
– 2, 2, 3, 3, 3, 3, 4, 4, 7, 8
• The mode is 3 (it appears four times)

• Incubation period for 6 Hepatitis affected persons
– 24, 25, 29, 29, 30, 31
• The mode is 29
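The counting procedure for the mode can be checked in Python. This sketch uses `statistics.multimode` from the standard library (Python 3.8 or later), which returns every value tied for the highest count; the third data set is a made-up illustration of a sample with two modes.

```python
from statistics import multimode

# Data from the two examples above.
salaries = [4, 3, 3, 2, 3, 8, 4, 3, 7, 2]   # annual salary, in the slide's units
incubation = [29, 31, 24, 29, 30, 25]       # incubation periods

salary_modes = multimode(salaries)          # values with the highest count
incubation_modes = multimode(incubation)

# Hypothetical data (not from the slides) with two equally frequent values.
two_modes = multimode([2, 2, 5, 5, 9])

print(salary_modes, incubation_modes, two_modes)
```

Returning a list rather than a single number makes ties visible instead of silently picking one value.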
• The mode of a distribution is the value that is observed most frequently in a given data set (rarely used).
- There may be no mode. When?
- There may be more than one mode. When?
- Not very amenable to statistical tests.


Median

• The median is literally the middle of the data. It is defined as the value above and below which half (50%) of the observations fall.
Finding the Median
Exercise
Person Sex Salary ($)
1 F 4000
2 F 3000
3 M 3000
4 F 2000
5 F 3000
6 M 8000
7 M 4000
8 M 3000
9 F 7000
10 F 2000

Which salary figure is the median?


Computing the Median
– The number of observations or scores is referred to as "n".
• Arrange the scores in order from smallest to largest (ascending order)
• Count the number of scores (determine n)
• If n is an odd number, then the median is the ((n+1)/2)th observation

For example, consider the observations
8, 25, 7, 5, 8, 3, 10, 12, 9
Arranged in order, the observations are
3, 5, 7, 8, 8, 9, 10, 12, 25

In this case, n = 9 (an odd number); therefore, the median is the (9+1)/2 = 5th observation, which is 8.
Computing the Median
(even number of observations)

– For another example, consider the observations
• 11, 7, 10, 9, 15, 13
• Arranged in order, the observations are
• 7, 9, 10, 11, 13, 15
• In this case, n = 6 (an even number); therefore, the median is the average of the (n/2)th and (n/2 + 1)th observations
• That is, the average of the 3rd and 4th observations
= (10 + 11)/2
= 10.5
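The odd and even rules for the median can be written as one short Python function. This is a sketch under the definitions given above (sort, then pick the middle value or average the two middle values); the function name `median` is our own and shadows nothing from the slides.

```python
def median(values):
    """Median by position: middle value if n is odd, mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2  # zero-based index of the upper middle value
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th observation, counting from 1.
        return ordered[mid]
    # Even n: average of the (n / 2)th and (n / 2 + 1)th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_example = median([8, 25, 7, 5, 8, 3, 10, 12, 9])
even_example = median([11, 7, 10, 9, 15, 13])
print(odd_example, even_example)
```

The standard library's `statistics.median` applies the same rule and could be used instead.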
Median

• The advantage of this measure is that it is unaffected by extreme values!
• The disadvantage is that it is selected by its rank and does not contain information on the other values in the distribution.
• It is also less amenable than the mean to statistical tests.
Mean (arithmetic average)

Most commonly used measure of location. It is calculated by adding all the observed values and dividing by the total sample size.
Each observation is denoted x
The total number of observations is denoted n
Summation is denoted by sigma, Σ
The mean itself is written x̄ (“x-bar”), so that $\bar{x} = \frac{\sum x}{n}$
Computing the Mean
Exercise
Person Sex Salary ($)
1 F 4000
2 F 3000
3 M 3000
4 F 2000
5 F 3000
6 M 8000
7 M 4000
8 M 3000
9 F 7000
10 F 2000

For this simple problem, you could compute the mean with pencil and paper by summing the
numbers in the salary column and dividing by “n” (10).
Method for Computing the Mean
To compute the mean:
– Count the number of scores (determine “n”)
– Determine the sum of the scores by adding them
– Divide the sum by “n”

• For example, consider the observations
– 8, 25, 7, 5, 8, 3, 10, 12, 9
– In this case, n = 9 and the sum = 87; therefore, the mean = 87 / 9 = 9.67
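The three steps for the mean (count, sum, divide) translate directly into Python. A minimal sketch; the function name `mean` is illustrative, and the salary list is copied from the exercise table above.

```python
def mean(values):
    """Arithmetic mean: the sum of the scores divided by the number of scores."""
    n = len(values)            # step 1: count the scores
    if n == 0:
        raise ValueError("mean of an empty data set is undefined")
    total = sum(values)        # step 2: add the scores
    return total / n           # step 3: divide the sum by n


observations = [8, 25, 7, 5, 8, 3, 10, 12, 9]
salaries = [4000, 3000, 3000, 2000, 3000, 8000, 4000, 3000, 7000, 2000]

obs_mean = mean(observations)
salary_mean = mean(salaries)
print(round(obs_mean, 2), salary_mean)
```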
• The mean has many good theoretical properties and is used as the basis of many statistical tests. For a symmetrical distribution the mean is a good summary statistic. It is less useful for an asymmetric distribution.

Q. What is its limitation as a summary statistic in asymmetrical distributions?
A. It can be distorted by outliers, therefore giving a poor “typical” value.

Imagine weight in kg in a sample of 5 people:

50, 60, 50, 40, 120

The mean is calculated as (50 + 60 + 50 + 40 + 120) / 5 = 320 / 5 = 64 kg. Is this value of 64 kg “typical” for the observations?
Figure 4: Symmetric and asymmetric distributions. (Two panels: No. of People vs. Value of Factor K, and No. of People vs. Value of Factor J.)
The mean and the median

• Did you notice that the median stays the same, 8 (the 5th value), when the extreme score in the example 8, 25, 7, 5, 8, 3, 10, 12, 9 is changed from 25 to 45?

• On the other hand, the mean changes from 9.67 to 11.89 with that one extreme score changing from 25 to 45.

• Extreme scores in a set of data have a more pronounced effect on the mean than on the median.
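The effect of one extreme score can be verified numerically. This sketch assumes the second data set is the nine-observation example with its largest value, 25, replaced by 45, and uses the standard library's `statistics.mean` and `statistics.median`.

```python
from statistics import mean, median

original = [8, 25, 7, 5, 8, 3, 10, 12, 9]
# Same data with the extreme score 25 replaced by 45.
modified = [45 if x == 25 else x for x in original]

original_mean = round(mean(original), 2)
modified_mean = round(mean(modified), 2)
original_median = median(original)
modified_median = median(modified)

print("mean:  ", original_mean, "->", modified_mean)
print("median:", original_median, "->", modified_median)
```

Only the mean moves, because the median depends on the rank of the middle observation and not on how far the largest value lies from the rest.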
Choosing a Measure of Central Tendency

• Depends on the nature of the distribution
• For continuous variables with a unimodal and symmetric distribution, the mean, median and mode are identical.
• With a skewed distribution the median may be more useful.
• For statistical analyses the mean is the preferred measure.
Measures of Spread, Dispersion, Variability

• In addition to a measure of central tendency, in describing a distribution it is important to provide information concerning the relative position of other data points in the sample (that is, a measure of spread or variability).
Range: the simplest measure = highest value minus lowest value

• Take a sample of 10 heights (70, 95, 100, 103, 105, 107, 110, 112, 115, 140 cm)
Lowest (minimum) value = 70 cm
Highest (maximum) value = 140 cm
Range is therefore 140 – 70 = 70 cm
Simple to understand but far from perfect. Why?
• The range is derived from extreme values. It says nothing about the values in between.
• Not stable (as sample size increases the range can change dramatically).
• Not amenable to statistical tests.
Figure 8. Two distributions with the same range but different mean and variability. (y-axis: No. of People.)
• Percentiles: the values in a series of observations, arranged in ascending order of magnitude, which divide the distribution into 100 equal parts (thus the median is the 50th percentile).

• Quartiles: the values which divide a series of observations, arranged in ascending order, into 4 equal parts (thus the 2nd quartile is the median).

• The Interquartile Range represents the central portion of the distribution and is calculated as the difference between the third quartile and the first quartile. This range includes about one-half of the observations in the set, leaving one quarter of the observations on each side.
Median and quartiles
Sort the data in increasing order.

The median is the middle value (if n is odd) or the average of the two middle values (if n is even); it is a measure of the “center” of the data.

Quartiles: divide the set of ordered values into 4 equal parts

first 25% | Q1 | second 25% | Q2 | third 25% | Q3 | fourth 25%

Q2 = second quartile = median
IQR = Interquartile range = Q3 - Q1
Measures of Data Variability

• Interquartile Range
– the difference between the score representing the 75th percentile and the score representing the 25th percentile
– Arrange the observations in ascending order
– Find the positions of Q1 and Q3
– Identify the values; the inter-quartile range = Q3 - Q1
– Example: 29, 31, 24, 29, 30, 25
– Arranged: 24, 25, 29, 29, 30, 31
» Position of Q1 = (n+1)/4 = 7/4 = 1.75
» Q1 = 24 + 0.75(25 - 24) = 24.75
» Position of Q3 = 3(n+1)/4 = 21/4 = 5.25
» Q3 = 30 + 0.25(31 - 30) = 30.25
» Q3 – Q1 = 30.25 – 24.75 = 5.50
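The position-and-interpolate method used above can be sketched in Python. The helper name `quantile_by_position` is our own; it places the quantile at position (n + 1) × p, counting from 1, and interpolates linearly between the two neighbouring observations. Other software may use different quartile conventions and give slightly different values.

```python
def quantile_by_position(values, p):
    """Quantile at 1-based position (n + 1) * p, with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("quantile of an empty data set is undefined")
    position = (n + 1) * p
    lower = int(position)          # whole part: rank of the lower neighbour
    fraction = position - lower    # fractional part: how far towards the next value
    if lower < 1:
        return ordered[0]          # position falls before the first observation
    if lower >= n:
        return ordered[-1]         # position falls at or beyond the last observation
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


incubation = [29, 31, 24, 29, 30, 25]
q1 = quantile_by_position(incubation, 0.25)
q3 = quantile_by_position(incubation, 0.75)
iqr = q3 - q1
print(q1, q3, iqr)
```

The standard library's `statistics.quantiles(data, n=4, method="exclusive")` follows the same (n + 1) positioning rule.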
Exercise

• Determine the first and third quartiles and the interquartile range for the following data
– 0, 3, 0, 7, 2, 1, 0, 1, 5, 2, 4, 2, 8, 1, 3, 0, 1, 2, 1
So how do we get a single mathematical measure to summarise the variability of an observed set of values?

• The most frequently used and most informative measure is the VARIANCE and its related functions

• The variance is computed in stages:

• 1. Calculate the mean as a measure of central location (MEAN): $\bar{x}$

• 2. Calculate the difference between each observation and the mean (DEVIATION): $(x - \bar{x})$

• 3. Next square the differences (SQUARED DEVIATION): $(x - \bar{x})^2$

• Q. What is the effect of this?
- Negative and positive deviations will not cancel each other out.
- Values further from the mean have a bigger impact.

• 4. Sum up these squared deviations (SUM OF THE SQUARED DEVIATIONS): $\sum (x - \bar{x})^2$

• 5. Divide this SUM OF THE SQUARED DEVIATIONS by the total number of observations minus 1 (n-1) to give the VARIANCE:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

This is a measure of the variability of the data.

Why divide by n - 1?

This is an adjustment for the fact that the sample mean is just an estimate of the true population mean. Dividing by n - 1 rather than n makes the variance slightly bigger.
Measures of Data Variability
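• Standard Deviation
– The standard deviation is the square root of the variance, that is, of the sum of squared deviations from the mean divided by n - 1

$$SD = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

Computational formula:

$$SD = \sqrt{\frac{n \sum x_i^2 - \left(\sum x_i\right)^2}{n(n - 1)}}$$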
Calculating Standard Deviation
Score (x)    Mean (x̄)    Deviation (x – x̄)    Squared deviation (x – x̄)²
13
12
13
14
10
16
15
24
20
18
Σx = 155
Calculating Standard Deviation

Score (x)    Mean (x̄)    Deviation (x – x̄)    Squared deviation (x – x̄)²
13           15.5        -2.5                 6.25
12           15.5        -3.5                 12.25
13           15.5        -2.5                 6.25
14           15.5        -1.5                 2.25
10           15.5        -5.5                 30.25
16           15.5        0.5                  0.25
15           15.5        -0.5                 0.25
24           15.5        8.5                  72.25
20           15.5        4.5                  20.25
18           15.5        2.5                  6.25
Σx = 155                 Σ(x – x̄) = 0         Σ(x – x̄)² = 156.5

$$SD = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{156.5}{9}} = 4.17$$

Let's use the computational formula…
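Both routes to the standard deviation can be checked against each other in Python. This is a sketch with illustrative function names: `sd_definitional` follows the staged calculation (mean, deviations, squares, sum, divide by n - 1, square root), while `sd_computational` uses the shortcut based on Σx and Σx².

```python
import math

scores = [13, 12, 13, 14, 10, 16, 15, 24, 20, 18]


def sd_definitional(values):
    """Square root of the sum of squared deviations divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = sum(values) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in values)
    return math.sqrt(sum_sq_dev / (n - 1))


def sd_computational(values):
    """Same quantity via (n * sum(x^2) - (sum(x))^2) / (n * (n - 1))."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))


sd_a = sd_definitional(scores)
sd_b = sd_computational(scores)
print(round(sd_a, 2), round(sd_b, 2))
```

The computational form avoids calculating each deviation by hand, which is why it is convenient on paper; on a computer the definitional form is usually preferred because it is less sensitive to rounding error with large values.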
Choosing the Measures of
Central Location and Dispersion
Grouped data
Mean

• Step 1: construct a frequency distribution
• Step 2: find x, which is the mid-point of each class
• Step 3: calculate x*f, i.e. x multiplied by the frequency
• Step 4: find the sum of x*f
• Step 5: divide the sum of x*f by the total frequency

$$\bar{x} = \frac{\sum xf}{\sum f}$$
• Determine the mean and median for the data presented below

Class interval    Frequency
0 – 2             1
3 – 5             3
6 – 8             5
9 – 11            4
12 – 14           2
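The five steps for a grouped mean can be sketched in Python using the class-interval table above. The representation of each class as a `(lower, upper, frequency)` tuple and the function name `grouped_mean` are assumptions made for illustration.

```python
import math

# (lower limit, upper limit, frequency) for each class interval in the table.
classes = [(0, 2, 1), (3, 5, 3), (6, 8, 5), (9, 11, 4), (12, 14, 2)]


def grouped_mean(classes):
    """Mean of grouped data: sum of (mid-point * frequency) / total frequency."""
    total_frequency = sum(f for _, _, f in classes)
    if total_frequency == 0:
        raise ValueError("total frequency must be positive")
    # Steps 2 to 4: class mid-point x, product x * f, and the sum of the products.
    sum_xf = sum(((lower + upper) / 2) * f for lower, upper, f in classes)
    # Step 5: divide by the total frequency.
    return sum_xf / total_frequency


mean_estimate = grouped_mean(classes)
print(mean_estimate)
```

Because every observation in a class is treated as if it sat at the class mid-point, the result is an estimate of the mean of the underlying raw data rather than its exact value.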
The Mean, median and variance for grouped data
• As part of the soccer camp study, the investigators wanted to estimate how much the respondents would be willing to pay for their child to attend the camp. They felt it was best to measure with price ranges (an ordinal measurement scale) rather than with specific prices (a ratio measurement scale). The following question was asked as part of their survey:
• Q) Assuming the camp would run for five days, two hours each day, how much would you be willing to pay for your child to attend the camp?
• 1. Less than $10
• 2. $11 - $25
• 3. $26 - $50
• 4. More than $50
• What is the mean price the respondents would be willing to pay?
• What is the median price the respondents would be willing to pay?
• Calculate the variance and hence the standard deviation.
PRICE

                            Frequency    Percent    Valid Percent    Cumulative Percent
Valid      Less than $10    2            1.0        1.4              1.4
           $11-25           20           10.0       14.3             15.7
           $26-50           63           31.5       45.0             60.7
           More than $50    55           27.5       39.3             100.0
           Total            140          70.0       100.0
Missing    System Missing   60           30.0
           Total            60           30.0
Total                       200          100.0

1. The median for grouped data calculation requires you to use the frequency distribution output and the class intervals that are the question’s response categories.
2. The formula for the median is:

$$\text{Med} = L_m + \left(\frac{\frac{n}{2} - cf_p}{f_m}\right) C_m$$

where:
Lm = lower boundary of class containing median
n = sample size
cfp = cumulative frequency of classes preceding class containing the median
fm = number of observations in class containing the median
Cm = width of the interval containing the median
• Step 1: Set up the frequency distribution table
• Step 2: Identify the median class, i.e. the class interval that contains the middle (n/2-th) observation, so that 50% of the values lie above it and 50% below it.
• Step 3: Use the formula to find the median

In our example,
The median class interval is the 26 - 50 class interval.
Lm = 26
n = 140
cfp = 2 + 20 = 22
fm = 63
Cm = 24

$$\text{Med} = L_m + \left(\frac{\frac{n}{2} - cf_p}{f_m}\right) C_m = 26 + \left(\frac{140/2 - 22}{63}\right) \times 24 = 44.29$$

The median price is about $44.29
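The grouped-median formula can be evaluated in Python to check the arithmetic. This sketch takes cfp as a cumulative frequency, as the definition requires: the counts of the two classes before the median class, 2 + 20 = 22, rather than the cumulative percent 15.7. The lower boundary 26 and interval width 24 are the values used in the slide; the function and parameter names are illustrative.

```python
def grouped_median(lower_boundary, n, cum_freq_before, freq_median_class, width):
    """Med = Lm + ((n / 2 - cfp) / fm) * Cm for grouped data."""
    if freq_median_class <= 0:
        raise ValueError("the median class must contain at least one observation")
    return lower_boundary + ((n / 2 - cum_freq_before) / freq_median_class) * width


# Soccer camp example: the median class is $26 - $50.
median_price = grouped_median(
    lower_boundary=26,      # Lm
    n=140,                  # valid responses
    cum_freq_before=2 + 20, # cfp: frequencies of the two preceding classes
    freq_median_class=63,   # fm
    width=24,               # Cm, as used in the slide
)
print(round(median_price, 2))
```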


Variance and standard deviation

$$\text{Variance} = \frac{1}{n - 1}\left[\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{n}\right]$$
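A minimal Python sketch of the grouped-data variance formula. Because the price categories in the survey are open-ended at both ends, the sketch instead assumes the closed class intervals from the earlier grouped-mean exercise (0–2, 3–5, 6–8, 9–11, 12–14) and uses class mid-points for x_i; the function name `grouped_variance` is illustrative.

```python
import math

# (lower limit, upper limit, frequency) from the earlier class-interval table.
classes = [(0, 2, 1), (3, 5, 3), (6, 8, 5), (9, 11, 4), (12, 14, 2)]


def grouped_variance(classes):
    """Variance = [sum(f * x^2) - (sum(f * x))^2 / n] / (n - 1), with x the class mid-point."""
    n = sum(f for _, _, f in classes)
    if n < 2:
        raise ValueError("at least two observations are required")
    midpoints = [((lower + upper) / 2, f) for lower, upper, f in classes]
    sum_fx = sum(f * x for x, f in midpoints)
    sum_fx2 = sum(f * x * x for x, f in midpoints)
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


variance = grouped_variance(classes)
standard_deviation = math.sqrt(variance)
print(round(variance, 2), round(standard_deviation, 2))
```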
End of Class

Lab Session

Excel Tutorial

Returning to the sputum exam results:

1 2 1 1 4 1 1 4 3 2 1 4 1
1 2 3 1 1 4 1 2 4 1 1 1

If we were going to conduct analyses on this variable, we would enter it in a column in Excel.
Frequencies in Excel (1)

COUNTIF function
• Create a place to put the frequency values
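For readers who want to cross-check the Excel tally outside the spreadsheet, the same frequency count can be sketched in Python. This is an illustrative alternative to COUNTIF, not a step in the lab: `collections.Counter` counts how often each sputum result code appears in the 25 values listed above.

```python
from collections import Counter

# The 25 sputum exam results from the tutorial data.
sputum_results = [
    1, 2, 1, 1, 4, 1, 1, 4, 3, 2, 1, 4, 1,
    1, 2, 3, 1, 1, 4, 1, 2, 4, 1, 1, 1,
]

# Counter plays the role of one COUNTIF formula per distinct value.
frequencies = Counter(sputum_results)

for code in sorted(frequencies):
    count = frequencies[code]
    percent = 100 * count / len(sputum_results)
    print(f"result {code}: frequency {count}, {percent:.0f}%")
```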

Frequencies in Excel (2)

Frequencies in Excel (3)

Bar Charts in Excel (1)

Frequency table

To create a bar chart, first select the frequency values.
Bar Charts in Excel (2)

Bar Charts in Excel (3)

Bar Charts in Excel (4)

Bar Charts in Excel (5)

Bar Charts in Excel (6)

Bar Charts in Excel (7)

Bar Charts in Excel (8)

Bar Charts in Excel (9)

Bar Charts in Excel (10)

Bar Charts in Excel (11)

HIV / PTB Data (frequencies shown)

Select the data & insert a column chart
Bar Charts in Excel (12)

Bar Charts in Excel (13)

Bar Charts in Excel (14)

Bar Charts in Excel: Two-way Tables

Homework

• Refer to Handout 1.2: Homework and Handout 1.3: Documentation for Ferroportin Data
