
UNIVERSITY OF ZAMBIA

SCHOOL OF MEDICINE
DEPARTMENT OF MEDICAL EDUCATION

INTRODUCTION TO STATISTICS
@2023

ISAAC FWEMBA, PhD


BSc(ZAM), MPH(UK), PhD(GH)
LECTURER
SCHOOL OF MEDICINE
UNIVERSITY OF ZAMBIA

Principles of Statistics
Objective

1. Introduce students to statistical concepts

2. Students should be able to use basic inference terms such as population and statistic
3. Be able to describe data using central tendency, dispersion and
Key bio statistical concepts

Let's begin the concepts by asking “What is Statistics?”

• Imagine this (hypothetical) conversation with a man in a taxi cab

 Man: what do you do?


 You: I’m a statistician.
 Man: what does that mean? What do you actually do?
 You:…

 …what would you say?


Key bio statistical concepts

Let's begin the concepts by asking “What is Statistics?”

What I would say…

• A statistician uses mathematical techniques (sometimes simple, sometimes
more complex) to summarize data in a way that can be used to answer (or at
least attempt to answer) a scientific question of interest.

• As well as attempting to give the most accurate answer that we can, we also
attempt to quantify our degree of confidence in this answer. In other words,
  – How strongly do our data support our conclusion?
  – What range of other conclusions are supported almost as strongly?
  – How well-suited are these data to answering the question in the first place?
Some Scientific Questions Of Interest

• The scientific questions can differ quite substantially in their nature
• For example:
  – What is the mean systolic blood pressure for all men between the ages of
    35 and 60 in Zambia? (Descriptive)
  – What is the effect of alcohol consumption on blood pressure? (Effect estimation)
  – Is drug A better than drug B in lowering blood pressure? (Hypothesis testing)
Some Scientific Questions Of Interest

• The study of statistics explores the collection, organization, analysis,
and interpretation of numerical data.

• The concepts of statistics may be applied to a number of fields that
include business, psychology, and agriculture.

• When the focus is on the biological and health sciences, we use the term
biostatistics.
Statistics

• Statistics is the science of planning studies and experiments, obtaining
data, and then organizing, summarizing, presenting, analyzing, interpreting,
and drawing conclusions based on the data
Statistical Inference

Population - the set of all elements of interest in a particular study

Sample - a subset of the population

Statistical inference - the process of using data obtained from a sample to
make estimates and test hypotheses about the characteristics of a population

Census - collecting data for a population

Sample survey - collecting data for a sample


Population

• Population: the complete collection of all individuals (scores, people,
measurements, and so on) to be studied; the collection is complete in the
sense that it includes all of the individuals to be studied
Census versus Sample

• Census: collection of data from every member of a population
• Sample: subcollection of members selected from a population
Parameter

• Parameter: a numerical measurement describing some characteristic of a
population.

[Diagram labels: population, parameter]
Statistic

• Statistic: a numerical measurement describing some characteristic of a
sample.

[Diagram labels: sample, statistic]
Types of Data

Data
– Qualitative
    – Numerical: nominal, ordinal
    – Non-numerical: nominal, ordinal
– Quantitative
    – Numerical: discrete, continuous

Quantitative Data

• Quantitative (or numerical) data consist of numbers representing counts or
measurements.
Example: The weights of students
Example: The ages of respondents
Qualitative Data

• Qualitative data consist of names or labels.
Example: The genders (male/female) of professional athletes
Example: Shirt numbers on professional athletes' uniforms, which are
substitutes for names.
DISCRETE DATA

• Discrete data result when the number of possible values is either a finite
number or a ‘countable’ number (i.e. the possible values can be listed and
counted, such as 0, 1, 2, 3, . . .)

Example: The number of eggs that a hen lays

CONTINUOUS DATA

• Continuous (numerical) data result from infinitely many possible values
that correspond to some continuous scale that covers a range of values
without gaps, interruptions, or jumps
DESCRIPTIVE STATISTICS

• Descriptive statistics are the tabular, graphical, and numerical methods
used to summarize data.
• A measure of central tendency is a descriptive statistic that indicates the
average or typical observed value of a variable in a data set.
• It gives an indication of where most of the data lie, or where most of the
data are clustered.
MEAN

– A measure of center is a summary measure of the overall level of a dataset
– Commonly used methods are mean, median, mode, geometric mean etc.
– Mean: Sum all the observations and divide by the number of observations.
The mean of 20, 30, 40 is (20+30+40)/3 = 30.

Notation: Let x_1, x_2, ..., x_n be n observations of a variable x. Then the
mean of this variable is

\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \]
MEAN

Or mean = (sum of values) / (number of observations)

Example: age (yrs) of children (8, 11, 10, 8, 9)

Mean = sum / # obs. = (8+8+9+10+11)/5 = 46/5 = 9.2 yrs

Adding an outlier (age = 56) changes the mean dramatically!
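The effect of the outlier can be checked directly. The short Python sketch below recomputes the mean for the five ages in the example and again after adding the age of 56; the helper name `mean` is only an illustrative choice.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    return sum(values) / len(values)

ages = [8, 11, 10, 8, 9]                 # ages (yrs) from the example
mean_ages = mean(ages)                   # 46 / 5

ages_with_outlier = ages + [56]          # add the outlier (age = 56)
mean_with_outlier = mean(ages_with_outlier)   # 102 / 6

print(mean_ages)          # 9.2
print(mean_with_outlier)  # 17.0
```

A single extreme value moves the mean from 9.2 to 17 years, a value larger than every one of the original five ages.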
MEDIAN

Median = Middle value

Example: age (yrs) of children (8, 11, 10, 8, 9)

Median:
- Order the data: 8, 8, 9, 10, 11
- Pick the middle value. Here it is the 3rd: 9 years
MEDIAN

• Median: The middle value in an ordered sequence of observations. That is,
to find the median we need to order the data set and then find the middle
value.
• In case of an even number of observations, the average of the two middle
values is the median.
• For example, to find the median of {9, 3, 6, 7, 5}, we first sort the data
giving {3, 5, 6, 7, 9}, then choose the middle value 6. If the number of
observations is even, e.g., {9, 3, 6, 7, 5, 2}, then the median is the
average of the two middle values from the sorted sequence {2, 3, 5, 6, 7, 9},
in this case (5 + 6) / 2 = 5.5.
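The two cases (odd and even number of observations) can be written as a small Python function. This is a minimal sketch of the rule described above; the function name `median` is an illustrative choice.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle
    values when the number of observations is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                      # odd n: single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average of two

odd_median = median([9, 3, 6, 7, 5])       # sorted: 3, 5, 6, 7, 9
even_median = median([9, 3, 6, 7, 5, 2])   # sorted: 2, 3, 5, 6, 7, 9

print(odd_median)   # 6
print(even_median)  # 5.5
```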
MEAN OR MEDIAN

– Strength of median: The median is less sensitive to outliers (extreme
scores) than the mean and thus a better measure than the mean for highly
skewed distributions, e.g. family income. For example, the mean of 20, 30,
40, and 990 is (20+30+40+990)/4 = 270.
– The median of these four observations is (30+40)/2 = 35. Here 3
observations out of 4 lie between 20 and 40, so the mean of 270 fails to give
a realistic picture of the major part of the data. It is influenced by the
extreme value 990.
– Weakness of mean: it does not work well for skewed data and is not robust
to outliers.
– Weakness of median: it relies only on the central values and ignores all
the other data.
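The comparison above can be reproduced with Python's standard `statistics` module. The sketch below uses the same four observations and, for contrast, the three observations without the extreme value.

```python
from statistics import mean, median

values = [20, 30, 40, 990]          # data from the example, with the outlier
mean_value = mean(values)           # (20 + 30 + 40 + 990) / 4
median_value = median(values)       # (30 + 40) / 2

without_outlier = [20, 30, 40]
mean_without = mean(without_outlier)
median_without = median(without_outlier)

print(mean_value, median_value)        # 270 35.0
print(mean_without, median_without)    # 30 30
```

Removing the single value 990 moves the mean from 270 to 30, while the median only moves from 35 to 30.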
MODE

Mode: The value that is observed most frequently.

Problems with mode:


• The mode is undefined for sequences in which
no observation is repeated.
• At times it might be nowhere near the center of
a data set.
• Sometimes there is more than one mode.
MODE

Mode: the observation that occurs most frequently

Data: 9 12 15 15 15 16 16 20 26

Observation    Number of occurrences
9              1
12             1
15             3
16             2
20             1
26             1

The mode is 15, which occurs 3 times.
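The frequency table for the mode can be built by counting occurrences. The following Python sketch uses `collections.Counter` on the nine observations listed above and returns every value that shares the highest count, so it also handles data with more than one mode.

```python
from collections import Counter

observations = [9, 12, 15, 15, 15, 16, 16, 20, 26]

counts = Counter(observations)        # observation -> number of occurrences
highest = max(counts.values())        # largest number of occurrences
modes = [value for value, count in counts.items() if count == highest]

for value in sorted(counts):
    print(value, counts[value])
print("mode(s):", modes)              # mode(s): [15]
```

If no observation is repeated, every value ties with a count of 1, which is the situation where the mode is not a useful summary.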
DISTRIBUTION
METHODS OF VARIABILITY MEASUREMENT

Variability (or dispersion) measures the amount of scatter in a dataset.

Commonly used methods: range, variance, standard deviation, interquartile
range, coefficient of variation etc.

Range: The difference between the largest and the smallest observations. The
range of 10, 5, 2, 100 is (100-2) = 98. It’s a crude measure of variability.
METHODS OF VARIABILITY MEASUREMENT

Variance: The variance of a set of observations is the average of the
squares of the deviations of the observations from their mean, where the sum
of squares is divided by n - 1 rather than n. In symbols, the variance of the
n observations x_1, x_2, ..., x_n is

\[ s^2 = \frac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1} \]

Variance of 5, 7, 3? Mean is (5+7+3)/3 = 5 and the variance is

\[ \frac{(5 - 5)^2 + (3 - 5)^2 + (7 - 5)^2}{3 - 1} = \frac{8}{2} = 4 \]

Standard Deviation (s): Square root of the variance. The standard deviation
of the above example is 2.
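The worked example can be verified step by step. The Python sketch below follows the definition (squared deviations from the mean, divided by n - 1) for the observations 5, 7, 3; the function name `sample_variance` is an illustrative choice.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)

data = [5, 7, 3]
s2 = sample_variance(data)   # (0 + 4 + 4) / 2
s = math.sqrt(s2)            # standard deviation = square root of variance

print(s2)  # 4.0
print(s)   # 2.0
```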
MEASURES OF DISPERSION

• Synonyms: measures of variation, spread and scatter

• Dispersion refers to the variability exhibited by a set of observations

• A measure of dispersion describes the amount of variability present in a
set of data

• There are different measures of variation that are used in statistics.
Those included in this module are:
– The range, variance, standard deviation and coefficient of variation.
VARIANCE

• The variance is a measure of variability which takes into account the
differences between each observation and the sample mean
• It measures the scatter of the values in a set of data about the mean
• The dispersion is small when the values are close to the mean, and large
when they are far from it
• Hence the logic of measuring the variation of values from the mean
CALCULATION OF THE VARIANCE

• Sample variance = the sum of the squared deviations, divided by (n – 1).

• Mathematical notation:
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]
• The quantity s² is called the sample estimate of the variance

• Population variance, mathematical notation:
\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} \]
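The only difference between the two formulas is the divisor. As a hedged illustration, Python's standard `statistics` module provides `variance` (divides by n - 1) and `pvariance` (divides by N); the data 5, 7, 3 are reused here purely as an example and are treated first as a sample, then as a whole population.

```python
from statistics import pvariance, variance

data = [5, 7, 3]                  # sum of squared deviations from 5 is 8

sample_var = variance(data)       # 8 / (3 - 1): data treated as a sample
population_var = pvariance(data)  # 8 / 3: data treated as the whole population

print(sample_var)        # 4
print(population_var)    # about 2.67
```

Dividing by n - 1 always gives a slightly larger value than dividing by N, and the gap shrinks as the number of observations grows.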
ADVANTAGES AND DISADVANTAGES OF THE VARIANCE

Advantage
• It takes into consideration all the values in the set of
data.
Disadvantage
• The units of measure are squared which may be
difficult to communicate
– e.g. variance of weight will be in kg squared.
STANDARD DEVIATION

• The way around the difficulty of the squared units of s² is to use the
square root of the variance as a measure of variability.
• The quantity denoted by s is called the sample standard deviation

• Thus, if
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]
then
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]

The population standard deviation will therefore be denoted as:

\[ \sigma = \sqrt{\sigma^2}, \quad \text{where} \quad \sigma^2 = \frac{\sum (x - \mu)^2}{N} \]
EXAMPLE 1

• The data given is of plasma volume

\[ \text{Variance} = s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]

Variance = 0.097
Standard dev. = 0.31
METHODS OF VARIABILITY MEASUREMENT

Quartiles: Data can be divided into four regions that cover the total range
of observed values. Cut points for these regions are known as quartiles.
In notation, the q-th quartile of a data set is the ((n+1)/4)×q-th
observation of the ordered data, where q is the desired quartile (1, 2 or 3)
and n is the number of observations.
The first 25% of the data lie below the first quartile (Q1). The next 25%
lie between Q1 and the second quartile (Q2), which is the median. The third
25% lie between the median and the third quartile (Q3), the 75% cut point in
the data, and the last 25% lie above Q3.
Q1 is the median of the first half of the ordered observations and Q3 is the
median of the second half of the ordered observations.
METHODS OF VARIABILITY MEASUREMENT

An example with 15 numbers (ordered):
3 6 7 11 13 22 30 40 44 50 52 61 68 80 94
Q1 = 11 (4th observation), Q2 = 40 (8th), Q3 = 61 (12th)

In this example Q1 is the ((15+1)/4)×1 = 4th observation of the data. The
4th observation is 11, so Q1 of this data is 11.

The first quartile is Q1=11. The second quartile is Q2=40 (this is also the
median). The third quartile is Q3=61.

Inter-quartile Range: Difference between Q3 and Q1. The inter-quartile range
of this example is 61 - 11 = 50. The middle half of the ordered data lie
between 11 and 61.
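The ((n+1)/4)×q position rule can be sketched in Python as follows. The function name `quartile` is illustrative, and the linear interpolation used when the position is not a whole number is an assumption added here (the slides only show a case where the position is a whole number); other software may use slightly different conventions.

```python
def quartile(ordered, q):
    """q-th quartile (q = 1, 2, 3) using the ((n + 1) / 4) * q position rule."""
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least 3 observations for this rule")
    position = q * (n + 1) / 4          # 1-based position in the ordered data
    lower = int(position)
    fraction = position - lower
    if fraction == 0:                   # position is a whole number
        return ordered[lower - 1]
    # Assumption: interpolate linearly between the two neighbouring values.
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

data = sorted([3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94])

q1 = quartile(data, 1)    # 4th observation
q2 = quartile(data, 2)    # 8th observation (the median)
q3 = quartile(data, 3)    # 12th observation
iqr = q3 - q1             # inter-quartile range

print(q1, q2, q3, iqr)    # 11 40 61 50
```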
BOXPLOT

Box plots are useful as they provide a visual summary of the data, enabling
researchers to quickly identify the median, the dispersion of the data set,
and signs of skewness.
Key bio statistical concepts

SAMPLING DISTRIBUTIONS

06/18/2023 1 46
Practical illustration

• Please write on the data sheet on the flip chart your:
  – Weight (kg),
  – Height (meters), and
  – Exercise habit: Do you exercise at least once a week? (Yes=1/No=0).
• NB: If you can’t remember your actual height and weight, then please just
write down any reasonable answer.

Now we are going to do the following exercise

1. Select the weight of five people; add them up and divide by five
2. Repeat step 1 nine more times; you can select a particular person more than once
3. Record the results from steps 1 and 2. This will give you ten records in total
4. Now add the results in step 3 and divide by 10

Note: What have we just done?
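The four steps of the exercise can be simulated in Python. This is only a sketch: the list `class_weights` is hypothetical (made-up weights standing in for the flip-chart data), and the code assumes that a person may reappear in different samples but not twice within the same sample of five.

```python
import random

def mean(values):
    return sum(values) / len(values)

# Hypothetical weights (kg) standing in for the flip-chart data sheet.
class_weights = [52.0, 61.5, 70.2, 58.4, 66.0, 80.1, 49.5, 73.3, 68.7, 55.9, 62.8, 77.4]

rng = random.Random(0)   # fixed seed so the sketch is reproducible

# Steps 1-3: ten samples of five people each, recording each sample mean.
sample_means = []
for _ in range(10):
    chosen = rng.sample(class_weights, 5)   # five different people per sample
    sample_means.append(mean(chosen))

# Step 4: add the ten sample means and divide by 10.
mean_of_sample_means = mean(sample_means)

print([round(m, 1) for m in sample_means])
print(round(mean_of_sample_means, 1), "vs class mean", round(mean(class_weights), 1))
```

The ten recorded values differ from one another, yet their average tends to sit close to the mean weight of the whole class. This is a small-scale version of building a sampling distribution of the mean.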


Population & Parameters

• Population refers to the entire set of subjects about whom we want
information.
• If we were to take our measurement on all subjects in the population, a
descriptive measure would give us exact information about the population and
our analysis would be finished.
• This descriptive measure is called a parameter
• A parameter is any quantity that can be calculated from the population data

Sample and Statistic

• Because it is generally not possible to collect data on all subjects of
the population, we gather information on a portion of the population, known
as a sample.
• Note that the same group of subjects can be a sample for one question
about its characteristics and a population for another question.
• We use descriptive statistics from the sample to estimate the
characteristics of the population.
• A statistic is any quantity that can be calculated from the sample data,
which does not require knowledge of unknown quantities.
Randomness

• To make a dependable generalization about certain population
characteristics, the sample must represent the population in those
characteristics.
• One important approach to ensure representative samples is to choose the
samples randomly.
Framework for statistical inference

[Diagram: Population to Sample via the sampling process; Sample back to
Population via inference]

• There is more than one framework for statistical inference.
• The traditional and most widely used approach is termed the “classical” or
“frequentist”, and this is the one pursued in this course, as illustrated
above
• An important alternative, the “Bayesian” approach, is growing in influence.
• The inference is underpinned by the concept of the sampling distribution
of a statistic.

Sampling distribution of a statistic
• We consider the systolic blood pressure measurements for 1600 workers as
the population.
• The population mean was, say, μ and the population standard deviation was,
say, σ. Each value was written on a small disc and put into a bag. Each
student was asked to shake the bag, pick 10 discs, write down the ten
systolic blood pressures, work out their mean, x̄, and return the discs to
the bag.
• We imagine that the original sampling process is repeated over and over
again, each time working out the sample mean.
• Over a large number of repetitions this builds up a distribution for the
sample means. This distribution is called the sampling distribution of the
mean
• We can work out:
  – The mean of these sample means, which is close to the population mean
  – The standard deviation of these sample means, called the standard error.

Standard Error
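The standard error of the mean is the standard deviation of the sample means, and for samples of size n it is approximately σ/√n. The Python sketch below imitates the bag-of-discs experiment; the population values are hypothetical (1600 simulated blood pressures drawn with an assumed mean of 125 and standard deviation of 15 mmHg), so the printed numbers illustrate the idea rather than report real data.

```python
import math
import random
import statistics

rng = random.Random(2023)   # fixed seed so the sketch is reproducible

# Hypothetical population: 1600 systolic blood pressures (mmHg).
population = [rng.gauss(125, 15) for _ in range(1600)]
mu = statistics.fmean(population)        # population mean
sigma = statistics.pstdev(population)    # population standard deviation

n = 10           # discs drawn per sample
repeats = 5000   # number of repeated samples

# Draw n discs, record their mean, return the discs, and repeat.
sample_means = [statistics.fmean(rng.sample(population, n)) for _ in range(repeats)]

mean_of_means = statistics.fmean(sample_means)   # close to mu
simulated_se = statistics.stdev(sample_means)    # spread of the sample means
theoretical_se = sigma / math.sqrt(n)            # sigma / sqrt(n)

print(f"population mean = {mu:.2f}, mean of sample means = {mean_of_means:.2f}")
print(f"simulated SE = {simulated_se:.2f}, sigma/sqrt(n) = {theoretical_se:.2f}")
```

Increasing `n` makes the standard error smaller, which is why larger samples give more precise estimates of the population mean.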

Central limit theorem

• Suppose that we repeat the sampling process for a sample size of 30 discs
(instead of 10 discs) and build up the distribution for the sample means.
• As we increase the sample size, you will notice that the sampling
distribution of the mean becomes more closely bell-shaped. This is an example
of the central limit theorem (CLT), which plays an important role in many
parts of biostatistics.
• Every statistic has a sampling distribution, and it is through this
distribution that we are able to relate a statistic to its corresponding
population quantity (inference)
• In practice, however, we have just one sample that we wish to use to
estimate the mean of a larger population, which it represents
• Provided the sample size is large (≥ 30), the sampling distribution of the
mean should approximate the normal distribution.
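The central limit theorem can be illustrated with a deliberately non-normal population. In the Python sketch below the population is hypothetical (1600 right-skewed values drawn from an exponential distribution), and skewness is used as a rough numerical stand-in for "how far from bell-shaped" a distribution is: a value near 0 indicates a symmetric shape.

```python
import random
import statistics

def skewness(values):
    """Moment-based skewness: about 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    m2 = statistics.fmean([(v - m) ** 2 for v in values])
    m3 = statistics.fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

rng = random.Random(7)   # fixed seed so the sketch is reproducible

# Hypothetical, strongly right-skewed population of 1600 values.
population = [rng.expovariate(1.0) for _ in range(1600)]

def sample_means(size, repeats=4000):
    """Means of repeated random samples of the given size."""
    return [statistics.fmean(rng.sample(population, size)) for _ in range(repeats)]

skew_population = skewness(population)
skew_10 = skewness(sample_means(10))   # samples of 10 discs
skew_30 = skewness(sample_means(30))   # samples of 30 discs

print(f"population skewness: {skew_population:.2f}")
print(f"sample means, n=10:  {skew_10:.2f}")
print(f"sample means, n=30:  {skew_30:.2f}")
```

Even though the individual values are far from normally distributed, the distribution of the sample means is much more symmetric. The "≥ 30" figure is a rule of thumb: very skewed populations may need larger samples.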
Sampling methods

• A sampling distribution can only describe the correct behavior of the
corresponding statistic if it is derived from a sampling process whose
properties are known.
• The most common and best-known example of this is random sampling:
– In its simple form this implies that all members of a population are
equally likely to appear in a sample
– Such simple random sampling allows sampling distributions to be derived in
a comparatively simple way, and inference procedures tend to be
straightforward.
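Simple random sampling is straightforward to carry out with a random number generator. The Python sketch below is a minimal illustration: the worker ID numbers 1 to 1600 are hypothetical, and `random.sample` draws without replacement so that every member of the list is equally likely to be chosen.

```python
import random

rng = random.Random(1)   # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: ID numbers for a population of 1600 workers.
population_ids = list(range(1, 1601))

# Simple random sample of 10 workers, drawn without replacement.
sample_ids = rng.sample(population_ids, k=10)

print(sorted(sample_ids))
```

In practice the list of IDs would come from a real sampling frame, such as a register of all members of the population.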
Statistical models

• To elucidate the information about population quantities contained in
sample statistics, we need a precise and formal description of the whole
sampling process from population to sample. This description is called the
statistical model.
• Relevant features of the population are represented by parameters, such as
the mean or variance.
• The structure of the population, together with the sampling process,
allows a model to be formulated that describes the statistical behavior of
the sample, which in turn allows us to postulate sampling distributions for
the relevant statistics