





Principles of Statistics

1. Introduce students to statistical concepts

2. Students able to use basic inference terms such as population,

3. Be able to describe data using central tendency, dispersion and
Key biostatistical concepts

Let's begin the concepts by asking “What is Statistics?”

 Imagine this (hypothetical) conversation with a man in a taxi cab:

 Man: What do you do?

 You: I’m a statistician.
 Man: What does that mean? What do you actually do?
 You: …

 …what would you say?

Key biostatistical concepts

Let's begin the concepts by asking “What is Statistics?”

What I would say…

 A statistician uses mathematical techniques (sometimes simple, sometimes more complex) to summarize data in a way that can be used to answer (or at least attempt to answer) a scientific question of interest.

 As well as attempting to give the most accurate answer that we can, we also attempt to quantify our degree of confidence in this answer. In other words:
 How strongly do our data support our conclusion?
 What range of other conclusions is supported almost as strongly?
 How well suited are these data to answering the question in the first place?
Some Scientific Questions Of Interest

 The scientific questions can differ quite substantially in their nature.
• For example:
 What is the mean systolic blood pressure for all men between the ages of 35 and 60 in Zambia? – Estimation
 What is the effect of alcohol consumption on blood pressure? – Effect estimation
 Is drug A better than drug B in lowering blood pressure? – Hypothesis testing
Some Scientific Questions Of Interest

• The study of statistics explores the collection, organization, analysis, and interpretation of numerical data.

• The concepts of statistics may be applied to a number of fields, including business, psychology, and agriculture.

• When the focus is on the biological and health sciences, we use the term biostatistics.

 Statistics is the science of planning studies and experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions based on the data.
Statistical Inference

Population - the set of all elements of interest in a particular study

Sample - a subset of the population

Statistical inference - the process of using data obtained from a sample to make estimates and test hypotheses about the characteristics of a population

Census - collecting data for a population

Sample survey - collecting data for a sample


 Population: the complete collection of all individuals (scores, people, measurements, and so on) to be studied; the collection is complete in the sense that it includes all of the individuals to be studied.
Census versus Sample

 Census: the collection of data from every member of a population.
 Sample: a subcollection of members selected from a population.

 Parameter: a numerical measurement describing some characteristic of a population.

 Statistic: a numerical measurement describing some characteristic of a sample.


Types of Data


Data
  Qualitative
    Numerical: nominal, ordinal
    Non-numerical: nominal, ordinal
  Quantitative
    Numerical: discrete, continuous

Quantitative Data

 Quantitative (or numerical) data consists of numbers representing counts or measurements.
Example: The weights of students
Example: The ages of respondents
Qualitative Data

 Qualitative data consists of names or labels.
Example: The genders (male/female) of professional athletes
Example: Shirt numbers on professional athletes' uniforms - substitutes for names.

 Discrete data result when the number of possible values is either a finite number or a ‘countable’ number (i.e. the number of possible values is 0, 1, 2, 3, . . .).

Example: The number of eggs that a hen lays


 Continuous (numerical) data result from infinitely many possible values that correspond to some continuous scale that covers a range of values without gaps, interruptions, or jumps.

• Descriptive statistics are the tabular, graphical, and numerical methods used to summarize data.
• A measure of central tendency is a descriptive statistic that indicates:
– The average or typical observed value of a variable in a data set.
• It gives an indication of where most of the data are clustered.

– A measure of center is a summary measure of the overall level of a dataset.
– Commonly used measures are the mean, median, mode, geometric mean, etc.
– Mean: the sum of all the observations divided by the number of observations. The mean of 20, 30, 40 is (20+30+40)/3 = 30.

Notation: Let x_1, x_2, ..., x_n be n observations of a variable x. Then the mean of this variable is

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Or: mean = sum of values / # of observations

Example: age (yrs) of children (8, 11, 10, 8, 9)

Mean = sum / # obs. = (8+8+9+10+11) / 5 = 46 / 5 = 9.2 yrs

Adding an outlier (age = 56) changes the mean to (46+56) / 6 = 17 yrs
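
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and it assumes the outlier is added as a sixth observation rather than replacing an existing one.

```python
from statistics import mean

ages = [8, 11, 10, 8, 9]            # ages (yrs) of the five children
mean_ages = mean(ages)              # 46 / 5

# Assumption: the outlier (age = 56) is appended as a sixth value
ages_with_outlier = ages + [56]
mean_with_outlier = mean(ages_with_outlier)   # 102 / 6

print(mean_ages)           # 9.2
print(mean_with_outlier)   # 17
```

A single extreme value moves the mean from 9.2 to 17 years, well above the age of every child in the original group.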


Median = Middle value

Example: age (yrs) of children (8, 11, 10, 8, 9)

Median:
- Order the data: 8, 8, 9, 10, 11
- Pick the middle value. Here it is the 3rd: 9 years

• Median: The middle value in an ordered sequence of observations. That is, to find the median we need to order the data set and then find the middle value.
• If the number of observations is even, the median is the average of the two middle values.
• For example, to find the median of {9, 3, 6, 7, 5}, we first sort the data, giving {3, 5, 6, 7, 9}, then choose the middle value, 6. If the number of observations is even, e.g., {9, 3, 6, 7, 5, 2}, then the median is the average of the two middle values from the sorted sequence {2, 3, 5, 6, 7, 9}, in this case (5 + 6) / 2 = 5.5.
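
The sort-then-pick procedure can be sketched as a small Python function. The function name `median_of` is hypothetical and chosen for illustration; the standard library's `statistics.median` follows the same rule.

```python
def median_of(values):
    """Return the middle value of the sorted data (average of the two
    middle values when the number of observations is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_median = median_of([9, 3, 6, 7, 5])       # sorted: 3, 5, 6, 7, 9
even_median = median_of([9, 3, 6, 7, 5, 2])   # sorted: 2, 3, 5, 6, 7, 9

print(odd_median)    # 6
print(even_median)   # 5.5
```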

– Strength of the median: The median is less sensitive to outliers (extreme scores) than the mean and is thus a better measure than the mean for highly skewed distributions, e.g. family income. For example, the mean of 20, 30, 40, and 990 is (20+30+40+990)/4 = 270.
– The median of these four observations is (30+40)/2 = 35. Here 3 observations out of 4 lie between 20 and 40, so the mean of 270 fails to give a realistic picture of the major part of the data. It is influenced by the extreme value 990.
– Weakness of the mean: it does not work well for skewed data and is not robust to outliers.
– Weakness of the median: it relies only on the central values and ignores all the other data.
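
As a quick check of the comparison above, the sketch below computes both measures for the four observations with Python's standard `statistics` module. The variable names are illustrative only.

```python
from statistics import mean, median

observations = [20, 30, 40, 990]

mean_value = mean(observations)       # pulled upward by the extreme value 990
median_value = median(observations)   # average of the two middle values, 30 and 40

print(mean_value)     # 270
print(median_value)   # 35.0
```

Three of the four observations are at most 40, so the median of 35 describes the bulk of the data far better than the mean of 270.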

Mode: The value that is observed most frequently.

Problems with the mode:

• The mode is undefined for sequences in which no observation is repeated.
• At times it might be nowhere near the center of a data set.
• Sometimes there is more than one mode.

Example: 9 12 15 15 15 16 16 20 26

Observation   Number of occurrences
9             1
12            1
15            3
16            2
20            1
26            1

The mode is 15, the observation that occurs most frequently.
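
The frequency table can be reproduced with `collections.Counter`. This is a minimal sketch: the helper `find_modes` is a hypothetical name, and returning an empty list when no observation is repeated is one possible convention for the "undefined" case.

```python
from collections import Counter

def find_modes(observations):
    """Return all values that share the highest count, or an empty list
    when no observation is repeated (the mode is then undefined)."""
    counts = Counter(observations)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(value for value, count in counts.items() if count == highest)

data = [9, 12, 15, 15, 15, 16, 16, 20, 26]
modes = find_modes(data)

print(Counter(data)[15])   # 3
print(modes)               # [15]
```

Returning a list also covers the case of more than one mode, for example `find_modes([1, 1, 2, 2, 3])` gives `[1, 2]`.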

Variability (or dispersion) measures the amount of scatter in a dataset.

Commonly used measures: range, variance, standard deviation, interquartile range, coefficient of variation, etc.
Range: The difference between the largest and the smallest observations. The range of 10, 5, 2, 100 is (100-2) = 98. It’s a crude measure of variability.

Variance: The sample variance of a set of observations is the sum of the squares of the deviations of the observations from their mean, divided by n − 1. In symbols, the variance of the n observations x_1, x_2, ..., x_n is

$$s^2 = \frac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$$

Variance of 5, 7, 3? The mean is (5+7+3)/3 = 5 and the variance is

$$s^2 = \frac{(5 - 5)^2 + (3 - 5)^2 + (7 - 5)^2}{3 - 1} = \frac{8}{2} = 4$$

Standard Deviation (s): Square root of the variance. The standard deviation of the above example is 2.
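
The worked example can be verified with a short Python sketch that follows the definition directly. The function name `sample_variance` is illustrative, not taken from the slides.

```python
from math import sqrt

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

data = [5, 7, 3]
variance = sample_variance(data)   # (0 + 4 + 4) / (3 - 1)
std_dev = sqrt(variance)

print(variance)   # 4.0
print(std_dev)    # 2.0
```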



• Synonyms: measures of variation, spread and scatter

• Dispersion refers to the variability exhibited by a set of observations

• A measure of dispersion describes the amount of variability present in a set of data

• There are different measures of variation that are used in statistics. Those included in this module are:
– The range, variance, standard deviation and coefficient of variation.

 The variance is a measure of variability which takes into account the differences between each observation and the sample mean

 It measures the scatter of the values in a set of data about the mean

 When the values are close to the mean the dispersion is small, and vice versa
 Hence the logic of measuring the variation of the values from the mean

 Sample variance = the sum of the squared deviations, divided by (n – 1).

 Mathematical notation: s² = Σ(x – x̄)² / (n – 1)
 The quantity s² is called the sample estimate of the variance

 Population variance:
 Mathematical notation: σ² = Σ(x – μ)² / N, where N is the number of values in the population

• It takes into consideration all the values in the set of data
• The units of measure are squared, which may be difficult to communicate
– e.g. the variance of weight will be in kg squared.

• The way around the difficulty of s² is to use the square root of the variance as a measure of dispersion
• The quantity, denoted by s, is called the sample standard deviation

• Thus, if s² = Σ(x – x̄)² / (n – 1)

Then s = √(Σ(x – x̄)² / (n – 1))

The population standard deviation will therefore be denoted as:

σ = √σ², where σ² = Σ(x – μ)² / N and N is the number of values in the population
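
To see the difference between the sample and population formulas in practice, the sketch below applies Python's standard `statistics` functions to the small data set 5, 7, 3 used earlier for the variance. Treating these three values as a whole population is an assumption made purely for comparison.

```python
from statistics import pstdev, pvariance, stdev, variance

values = [5, 7, 3]

sample_var = variance(values)        # divides by n - 1
population_var = pvariance(values)   # divides by N
sample_sd = stdev(values)            # square root of the sample variance
population_sd = pstdev(values)       # square root of the population variance

print(sample_var, sample_sd)
print(population_var, population_sd)
```

The sample versions are always the larger of the two, because the same sum of squared deviations is divided by n − 1 rather than N.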

• The data given is of plasma volume

$$\text{Variance} = s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

Variance = 0.097
Standard dev. = 0.31

Quartiles: Data can be divided into four regions that cover the total range of observed values. Cut points for these regions are known as quartiles.
In notation, the q-th quartile of a data set is the ((n+1)/4)·q-th observation of the ordered data, where q is the desired quartile and n is the number of observations.
The first quartile (Q1) is the cut point below which the first 25% of the data lie. The second quartile (Q2), the cut point at 50%, is the median. The third quartile (Q3) is the 75% cut point, so 25% of the data lie between the median and Q3.
Q1 is the median of the first half of the ordered observations and Q3 is the median of the second half of the ordered observations.

An example with 15 ordered numbers:
3 6 7 11 13 22 30 40 44 50 52 61 68 80 94

Here Q1 = ((15+1)/4) × 1 = 4th observation of the data. The 4th observation is 11, so Q1 of this data is 11.

The first quartile is Q1 = 11 (4th observation). The second quartile is Q2 = 40 (8th observation; this is also the median). The third quartile is Q3 = 61 (12th observation).

Inter-quartile Range: Difference between Q3 and Q1. The inter-quartile range of this example is 61 - 11 = 50. The middle half of the ordered data lie between 11 and 61.
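
The (n+1)/4 position rule can be sketched in Python as follows. The function name `quartile` is hypothetical, and the linear interpolation used when the position is not a whole number is an assumption: the slides only cover the whole-number case, and other quartile conventions exist.

```python
def quartile(values, q):
    """Return the q-th quartile (q = 1, 2 or 3) using the
    ((n + 1) / 4) * q position rule on the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least three observations")
    position = q * (n + 1) / 4    # 1-based position in the ordered data
    lower = int(position)         # whole part of the position
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    # Assumption: interpolate linearly between the two neighbouring observations
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]
q1, q2, q3 = (quartile(data, q) for q in (1, 2, 3))
iqr = q3 - q1

print(q1, q2, q3)   # 11 40 61
print(iqr)          # 50
```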


Box plots are useful as they provide a visual summary of the data, enabling researchers to quickly identify the median, the dispersion of the data set, and signs of skewness.
06/18/2023 1 46
Practical illustration

 Please write the following on the data sheet on the flip chart:

 Weight (kg),
 Height (meters), and
 Exercise habit: do you exercise at least once a week?
 NB: If you can’t remember your actual height and weight, then please just write down any reasonable answer.
Now we are going to do the following exercise

1. Select the weights of five people; add them up and divide by five
2. Repeat step 1 nine more times; you can select a particular person more than once
3. Record the results from steps 1 and 2. This will give you ten records in total
4. Now add the results in step 3 and divide by 10

Note: What have we just done?
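
The four steps of the exercise can be imitated in Python. Everything specific here is an assumption for illustration: the ten weights are made up to stand in for the flip-chart data, and each sample of five is drawn without repeating a person within the sample, while the same person may appear in several samples.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical weights (kg) standing in for the flip-chart data sheet
weights = [54, 58, 61, 63, 66, 70, 72, 75, 81, 90]

def mean_of_five(values):
    """Step 1: pick five people at random and average their weights."""
    return mean(random.sample(values, 5))

# Steps 2 and 3: ten sample means in total
sample_means = [mean_of_five(weights) for _ in range(10)]

# Step 4: average the ten sample means
mean_of_means = mean(sample_means)

print(sample_means)
print(mean_of_means, mean(weights))
```

Comparing the last two printed numbers shows how close the average of the sample means comes to the mean of the whole group.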

Population & Parameters

 Population refers to the entire set of subjects about whom we want information.

 If we were to take our measurement on all subjects in the population, a descriptive measure would give us exact information about the population and our analysis would be finished.

 This descriptive measure is called a parameter

 A parameter is any quantity that can be calculated from the population data

Sample and Statistic

 Because it is generally not possible to collect data on all subjects of the population, we gather information on a portion of the population, known as a sample.

 Note that the same group of subjects can be a sample for one question about its characteristics and a population for another.
 We use descriptive statistics from the sample to estimate the characteristics of the population.

 A statistic is any quantity that can be calculated from the sample data, which does not require knowledge of unknown quantities.

 To make a dependable generalization about certain population characteristics, the sample must represent the population in those characteristics.

 One important approach to ensuring representative samples is to choose the samples randomly.
Framework for statistical inference

[Figure: Population → (sampling process) → Sample]

 There is more than one framework for statistical inference.

 The traditional and most widely used approach is termed the “classical” or “frequentist” approach, and this is the one pursued in this course, as illustrated above.

 An important alternative, the “Bayesian” approach, is growing in influence.

 The inference is underpinned by the concept of the sampling distribution of a statistic.

Sampling distribution of a statistic
 We consider the systolic blood pressure measurements for 1600 workers as the population.

 The population mean was, say, μ and the population standard deviation was, say, σ. Each value was written on a small disc and put into a bag. Each student was asked to shake the bag, pick 10 discs, write down the ten systolic blood pressures, work out their mean, x̄, and return the discs to the bag.

 We imagine that the original sampling process is repeated over and over again, each time working out the sample mean.
 Over a large number of repetitions this builds up a distribution for the sample means. This distribution is called the sampling distribution of the mean.
 We can work out:
 The mean of these sample means, which is close to the population mean
 The standard deviation of these sample means, known as the standard error.
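
The bag-of-discs experiment can be simulated in Python. This is a sketch under stated assumptions: the 1600 blood pressures are simulated from a normal distribution with mean 125 and standard deviation 15 mmHg (made-up values, since the slides give no data), and 5000 repetitions stand in for "a large number". It also compares the standard deviation of the sample means with σ/√n, the usual formula for the standard error of the mean, which the slides name but do not show.

```python
import random
from math import sqrt
from statistics import fmean, pstdev, stdev

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: 1600 simulated systolic blood pressures (mmHg)
population = [random.gauss(125, 15) for _ in range(1600)]
mu = fmean(population)       # population mean
sigma = pstdev(population)   # population standard deviation

sample_size = 10
repetitions = 5000

# Each repetition: draw 10 discs, record their mean, return the discs
sample_means = [
    fmean(random.sample(population, sample_size))
    for _ in range(repetitions)
]

mean_of_means = fmean(sample_means)
standard_error = stdev(sample_means)          # spread of the sample means
theoretical_se = sigma / sqrt(sample_size)    # sigma / sqrt(n)

print(mu, mean_of_means)
print(standard_error, theoretical_se)
```

The two numbers on each printed line should be close to each other, which is the sense in which the sample mean "targets" μ with a spread given by the standard error.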

Standard Error

Central limit theorem

 Suppose that we repeat the sampling process for a sample size of 30 discs (instead of 10 discs) and build up the distribution for the sample means.
 As we increase the sample size, you will notice that the sampling distribution for the larger sample size becomes bell-shaped. This is an example of the central limit theorem (CLT), which plays an important role in many parts of biostatistics.

 Every statistic has a sampling distribution, and it is through this distribution that we are able to relate a statistic to its corresponding population quantity (inference).

 In practice, however, we have just one sample that we wish to use to estimate the mean of a larger population, which it represents.
 Provided the sample size is large (≥ 30), the sampling distribution should approximate the normal distribution.
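
A small simulation can make the central limit theorem visible without a plot. The sketch below is illustrative only: it assumes a deliberately right-skewed hypothetical population (exponential, mean 20) so that the effect is easy to see, and it uses skewness as a rough numerical stand-in for "how far from bell-shaped" a distribution is (0 for a symmetric bell shape).

```python
import random
from statistics import fmean, pstdev

random.seed(7)  # fixed seed so the run is repeatable

def skewness(values):
    """Average cubed standardized deviation; 0 for a symmetric distribution."""
    m = fmean(values)
    s = pstdev(values)
    return fmean([((x - m) / s) ** 3 for x in values])

def simulate_sample_means(population, sample_size, repetitions=4000):
    """Repeatedly draw a sample and record its mean."""
    return [
        fmean(random.sample(population, sample_size))
        for _ in range(repetitions)
    ]

# Hypothetical, strongly right-skewed population of 1600 values
population = [random.expovariate(1 / 20) for _ in range(1600)]

means_10 = simulate_sample_means(population, 10)
means_30 = simulate_sample_means(population, 30)

skew_population = skewness(population)
skew_10 = skewness(means_10)
skew_30 = skewness(means_30)

print(skew_population, skew_10, skew_30)
print(pstdev(means_10), pstdev(means_30))
```

The skewness shrinks toward 0 as the sample size grows from 10 to 30, and the spread of the sample means narrows, even though the population itself is far from normal.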
Sampling methods

 A sampling distribution can only describe the correct behavior of the corresponding statistic if it is derived from a sampling process whose properties are known.

 The most common and best-known example of this is random sampling:
– In its simple form this implies that all members of a population are equally likely to appear in a sample
– Such simple random sampling allows sampling distributions to be derived in a comparatively simple way, and inference procedures tend to be straightforward.
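
Simple random sampling is easy to sketch with Python's standard `random.sample`, which selects without replacement so that every member is equally likely to be chosen. The ID numbers and the sample size of 10 are assumptions made for illustration.

```python
import random

random.seed(3)  # fixed seed so the run is repeatable

# Hypothetical sampling frame: ID numbers for 1600 subjects
population_ids = list(range(1, 1601))

# Simple random sample of 10 subjects, drawn without replacement
sample_ids = random.sample(population_ids, 10)

print(sorted(sample_ids))
```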
Statistical models

 To elucidate the information about population quantities contained in sample statistics we need a precise and formal description of the whole sampling process from population to sample. This description is called the statistical model.

 Relevant features of the population are represented by parameters, such as the mean or variance.

 The structure of the population, together with the sampling process, allows a model to be formulated that describes the statistical behavior of the sample, which in turn allows us to postulate sampling distributions for the relevant statistics.
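
As one illustration of such a model (an assumption chosen for simplicity, not a model stated in these slides), suppose the n measurements are independent draws from a normal population with mean μ and variance σ². The sampling distribution of the sample mean then follows directly:

```latex
X_1, \dots, X_n \overset{\text{iid}}{\sim} N(\mu, \sigma^2)
\quad\Longrightarrow\quad
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \sim N\!\left(\mu, \frac{\sigma^2}{n}\right),
\qquad
\mathrm{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}}
```

Here μ and σ² are the parameters of the model, and the standard error σ/√n is the standard deviation of the sampling distribution of the mean.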