
Biostatistics

MPH 2023/24
Dr. Nyinypiu Tong/MBBS/MSc (PH)
0925240002
Learning Objectives
1. Define Statistics and Biostatistics
2. Types of Statistical Applications
3. Distinguish between different types of data/variables
4. Describe and graph categorical data
5. Describe and graph numerical data
– Central tendency (location): mean, median and mode
– Spread: min, max, range, P25, P75, standard deviation and variance,
(outliers)
What is statistics?
Making sense out of numbers!
• More formally: Statistics is the science of collecting, summarizing, presenting and
interpreting data
• It provides:
- a way of organising information on a more formal basis than relying on the
exchange of anecdotes and personal experience
- taking variation into account
Introduction to Biostatistics
• Biostatistics can be defined as the application of the mathematical tools used in
statistics to the fields of biological sciences and medicine.
• It is the application of statistical methods in studies in biology, and encompasses
the design of experiments, the collection of data from them, and the analysis and
interpretation of data.
• The data come from a wide range of sources, including genomic studies,
experiments with cells and organisms, and clinical trials.
Types of Statistical Applications
1. Research Interpretations and Conclusions
• Statistics forms an important part of most sciences, helping researchers test
hypotheses, confirm (or reject) theories, and arrive at reliable conclusions. The
data generated from experiments and studies is never straightforward — one
has to take into account randomness and uncertainty, eliminate coincidences
and arrive at the most accurate findings.
• Statistical analysis helps reduce or eliminate errors so that researchers can
confidently make conclusions that will then direct further research.
Types of Statistical Applications
2. Meta-Analysis of Literature Reviews
• Before a researcher or scientist embarks on new research, it is customary to
perform a comprehensive literature search of all the available published
information on a specific topic.
• A statistical analysis of these studies helps extract the common truth underlying
all these studies, or uncover a hidden pattern or relationship.
Types of Statistical Applications
3. Clinical Trial Design
• One of the most important applications of statistical analysis is in designing
clinical trials. When a new drug or treatment is discovered, it has to first be
tested on a group/groups of people to understand its efficacy and safety.
• Biostatisticians can take on the task of performing a statistical analysis of the
study, helping not only to design it but also analyze and determine the
outcomes.
Types of Statistical Applications
4. Designing Surveys
• Surveys require careful design and implementation, considerations about the
survey format, accounting for bias and fatigue, etc.
• Data collected from surveys have to be carefully studied by statistical analysis
experts who also use their own discretion and experience to derive the most
meaningful information from a survey.
Types of Statistical Applications
5. Epidemiological Studies
• Statistical analysis helps identify the most likely cause of a disease,
for example the link between smoking and lung cancer. This information is used
to develop public health policies and implement preventive healthcare
programmes.
• Data visualization and statistical analysis also played an important role in
understanding the Ebola epidemic in West Africa.
Types of Statistical Applications
6. Statistical Modeling
• Statistical modeling involves building predictive models based on pattern
recognition and knowledge discovery. It is used in environmental and
geographical studies, predicting election outcomes, survival analysis of
populations, and more.
• Meteorologists use statistical tools to help them predict the weather. The line
between statistical modelling and machine learning is becoming increasingly
blurry: Robert Tibshirani, a statistician at Stanford, called machine learning
“glorified statistics”.
Types of Statistical Applications
7. Monitoring & Evaluation
• Statistical analysis can be used to monitor programmes through data collection
and interpretation.
• Evaluation involves periodic measurement of indicators and the use of new
interventions to optimize progress on a particular project.
Description and Inference
1. Descriptive Statistics
• Numerical description of events – making sense out of numbers
2. Inferential Statistics
• Use information from a sample to draw conclusions about a population
Statistical inference
Descriptive statistics
• To describe a population in numerical terms:
– Value computed from the whole population: parameter
– Value computed from a sample of the population: statistic
Descriptive statistics: why?
To provide:
- a precise, numerical description of the data
- taking variation into account

Variability at the heart of statistics:


https://www.youtube.com/watch?v=ipYaHqutMds&t=5s
Useful description ?
• Some people are young and some people are a bit older, and other people
are even older and very few people are extremely old, and I forgot .. Some
people are extremely young when they finish school
Learning objectives
1. Define Statistics
2. Introduction to Biostatistics
3. Types of Statistical Applications
4. Distinguish between different types of data/variables
5. Describe and graph categorical data
6. Describe and graph numerical data
– Central tendency (location): mean, median and mode
– Spread: min, max, range, P25, P75, standard deviation and variance, (outliers)
Types of data/variable
Exercise
• What type of data are the following variables?
– Number of visits to GP in a year
– Blood group (A, B, AB and O)
– Level of Education (Primary, Secondary, Tertiary)
– Having Children (Yes, No)
– Number of Children (0, 1, 2, …..)
– Mother’s age (years)
Describe and graph categorical data
• Describe categorical data
– Number of observations per category (frequency counts)
– Percentages (number per category/ total number of observations)
• Graph categorical data
– Bar charts
– Pie charts
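The frequency counts and percentages described above can be sketched in a few lines of Python. The blood group values below are made up for illustration and are not course data.

```python
from collections import Counter

# Hypothetical blood groups for 10 people (illustrative data only)
blood_groups = ["A", "O", "B", "O", "A", "AB", "O", "A", "O", "B"]

counts = Counter(blood_groups)        # frequency count per category
total = sum(counts.values())          # total number of observations

# Percentage = number per category / total number of observations * 100
percentages = {group: 100 * n / total for group, n in counts.items()}

for group in sorted(counts):
    print(f"{group}: n = {counts[group]}, {percentages[group]:.0f}%")
```

The same counts are what a bar chart or pie chart would display, one bar or slice per category.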
Frequency counts and percentages
Graphs: categorical data
Describe and graph numerical data
• Describe numerical data:
– Central tendency (location): mean, median and mode
– Spread: min, max, range, P25, P75, standard deviation and variance,
(outliers)
• Graph numerical data
– Histogram
– Box plot
Histogram
Exercise

• What does the histogram of height measurements from this class look like?
Describe and graph numerical data
How to describe the variable Age in a school leaving population ?
• Minimum age
• Maximum age
• Range: Max – Min: Measure of spread
Calculation of Min, Max, Range
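A minimal sketch of this calculation, assuming a small made-up list of school-leaving ages (the values are hypothetical):

```python
# Hypothetical school-leaving ages in years (illustrative data only)
ages = [16, 17, 17, 18, 18, 18, 19, 21, 24]

minimum = min(ages)               # youngest school leaver
maximum = max(ages)               # oldest school leaver
age_range = maximum - minimum     # Range = Max - Min

print(f"Min = {minimum}, Max = {maximum}, Range = {age_range}")
```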
Describe and graph numerical data
• What else do we want to know?
• The most frequent value?
– Mode: value that occurs most often
• Some measure of the mid-value?
– Median: Mid value of ordered data set
• Some measures of how values are spread around the median?
– P25, P75
Calculation of median (odd number of
observations)
Calculation of median (even number of
observations)
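The two cases can be sketched as follows. The helper name `median_of` and the ages are assumptions made for illustration: with an odd number of observations the median is the single middle value, and with an even number it is the average of the two middle values.

```python
import statistics

def median_of(values):
    """Return the middle value of the ordered data (hypothetical helper)."""
    ordered = sorted(values)          # the data must be ordered first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd n: the single middle value
    # even n: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_ages = [18, 16, 21, 17, 19]        # 5 observations (illustrative)
even_ages = [18, 16, 21, 17, 19, 24]   # 6 observations (illustrative)

print(median_of(odd_ages), median_of(even_ages))
# The standard library gives the same results
print(statistics.median(odd_ages), statistics.median(even_ages))
```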
What information does the median give us?
• Median is the middle value of an ordered data set
• It divides the ordered values in two halves
• An equal number of values lie below and above the median
– Thus 50% of all data have a value lower than the median
– And 50% of all data have a value higher than the median
• Median is also called P50, the 50th percentile
Calculation of Percentiles
Interpretation of the IQR
• The interquartile range (IQR) is the difference between the first and third
quartiles, that is, between P25 and P75.
• It contains the central 50% of the observations in the ordered set
– 25% of observations lie below it
– 25% of observations lie above it
• Is the IQR influenced by a low minimum value or a high maximum value (outliers)?
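One way to explore the question about outliers is to compute the quartiles twice, once with and once without an extreme maximum. This is a sketch on made-up ages. It assumes the "inclusive" interpolation rule of Python's `statistics.quantiles`; other textbooks and packages use slightly different percentile conventions and may give slightly different values.

```python
import statistics

def quartile_summary(values):
    """Return P25, P50, P75, IQR and range (hypothetical helper)."""
    p25, p50, p75 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "P25": p25,
        "P50": p50,
        "P75": p75,
        "IQR": p75 - p25,
        "range": max(values) - min(values),
    }

# Hypothetical school-leaving ages (illustrative data only)
ages = [16, 17, 17, 18, 18, 18, 19, 21, 24]
# Same data, but the oldest person is replaced by an extreme value
ages_with_outlier = [16, 17, 17, 18, 18, 18, 19, 21, 60]

print(quartile_summary(ages))
print(quartile_summary(ages_with_outlier))
```

In this example the range changes a great deal when the outlier is introduced, while P25, P75 and the IQR stay the same.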
Box plot
Calculation of mean/average
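As a minimal sketch, the mean is the sum of all observations divided by the number of observations. The ages below are hypothetical.

```python
import statistics

# Hypothetical school-leaving ages (illustrative data only)
ages = [16, 17, 17, 18, 18, 18, 19, 21, 24]

total_age = sum(ages)            # sum of all observations
n = len(ages)                    # number of observations
mean_age = total_age / n         # mean = sum / n

print(f"Sum = {total_age}, n = {n}, mean = {mean_age:.2f}")
# The standard library function gives the same value
print(f"statistics.mean = {statistics.mean(ages):.2f}")
```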
Variance – A fundamental concept to
understand
• A measure of the average squared distance between each of a set of data points
and their mean value
• The extent to which each observation deviates from the mean
• The larger the deviations, the greater the variability of the observations
• The larger the difference between a data point and the mean, the more that
point contributes to the variance.
• The Variance is a measure of the spread and variation in a data set
Formula of the variance
Total deviation = zero
Variance – squared deviation from mean for
each observation
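The formula itself did not survive on these slides, so the following is the standard textbook form, not necessarily the exact notation used in the lecture. It assumes observations x_1, ..., x_n with mean x̄, and it uses the sample variance (divisor n - 1); for a whole population the divisor is n instead.

```latex
% Deviations from the mean always sum to zero,
% which is why they are squared before averaging
\sum_{i=1}^{n} (x_i - \bar{x}) = 0,
\qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

% Sample variance: sum of squared deviations divided by n - 1
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

% Standard deviation: square root of the variance
s = \sqrt{s^2}
```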
Standard Deviation
• In practice we do not work directly with the variance, but with the standard
deviation (SD)
– Square root of the variance
– Expressed in the same units as the data, so easier to interpret than the variance
• Why SD and not variance: https://www.youtube.com/watch?v=rCAK7uYK3Bc
Formula of the standard deviation
Variance and Standard Deviation
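The steps can be sketched on a small made-up data set. The values are hypothetical, and both the sample version (divisor n - 1) and the population version (divisor n) are shown, since which one applies depends on whether the data are a sample or the whole population.

```python
import math
import statistics

# Hypothetical measurements (illustrative data only)
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)

mean = sum(values) / n
deviations = [x - mean for x in values]     # distance of each value from the mean
total_deviation = sum(deviations)           # deviations cancel out to zero
sum_of_squares = sum(d ** 2 for d in deviations)

sample_variance = sum_of_squares / (n - 1)  # data treated as a sample
population_variance = sum_of_squares / n    # data treated as a whole population

sample_sd = math.sqrt(sample_variance)      # SD = square root of the variance
population_sd = math.sqrt(population_variance)

print(f"mean = {mean}, total deviation = {total_deviation}")
print(f"sample variance = {sample_variance:.3f}, sample SD = {sample_sd:.3f}")
print(f"population variance = {population_variance}, population SD = {population_sd}")

# Cross-check against the standard library
assert math.isclose(sample_variance, statistics.variance(values))
assert math.isclose(population_variance, statistics.pvariance(values))
```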
Summary of measures of spread
• Min-max range: the difference between the maximum and minimum value
• 25th percentile (P25): 25% of observations are below, and 75% of observations
are above this value
• 75th percentile (P75): 75% of observations are below, and 25% of observations
are above this value
• Interquartile range: The difference between the 25th and 75th percentiles
• Standard deviation: Measures the spread or dispersion around the mean. It
is the most widely used measure of spread.
Measures of central tendency
• Mean: Average value
• Median: Middle value of an ordered data set
• Mode: The value(s) that occurs most frequently in a data set
• Whenever you report a measure of central tendency, it needs to come with
a measure of spread – Mean + SD and Median + IQR
• The mean/median tells us what the most central value is
• The SD/IQR tells us how much the values are spread around the mean/median
The normal distribution
• Many variables in nature have a normal
distribution
– Weight (healthy people)
– Height
– Cholesterol
• Normal distribution:
– Symmetric
– Bell-shaped
The normal distribution
• The normal distribution has many
desirable properties:
– Mean = median = mode (170 cm in this example)
– The shape of the distribution is
defined by the mean and variance
– Fixed percentage of observations
are within 1, 2, or 3 SD of the mean
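The fixed percentages can be checked with Python's standard library. This sketch assumes a normal distribution of height with mean 170 cm (as on the slide) and a hypothetical SD of 10 cm. The SD is an assumption for illustration only, and the percentages come out the same for any mean and SD.

```python
from statistics import NormalDist

# Mean from the slide; the SD of 10 cm is an assumed value for illustration
heights = NormalDist(mu=170, sigma=10)

def share_within(dist, k):
    """Proportion of a normal distribution within k SD of its mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

for k in (1, 2, 3):
    print(f"Within {k} SD of the mean: {100 * share_within(heights, k):.1f}%")
```

This gives roughly 68%, 95% and 99.7% of observations within 1, 2 and 3 SD of the mean.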
Skewed distributions

Examples of skewed distributions: https://www.youtube.com/watch?v=XSSRrVMOqlQ
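A small made-up example of a right-skewed variable, here hypothetical lengths of hospital stay in days, shows how a few large values pull the mean away from the median. The data are invented for illustration.

```python
import statistics

# Hypothetical lengths of hospital stay in days (illustrative data only);
# most stays are short, one is very long, so the data are right-skewed
stay_days = [1, 2, 2, 3, 3, 3, 4, 5, 30]

mean_stay = statistics.mean(stay_days)
median_stay = statistics.median(stay_days)

print(f"mean = {mean_stay:.2f} days, median = {median_stay} days")
```

Here the mean is noticeably larger than the median, which is one reason the median and IQR are usually reported for skewed data.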


Large variance, large spread of values
Small variance, values close to the mean
Exercise

• Calculate the mean, median, mode, variance and standard deviation of the
finger length of MPH students, using three different samples:
- 6 finger lengths chosen at random
- Every nth finger (systematic sample)
- A randomly selected cluster
• Interpretation?
Learning objectives
1. Define Statistics
2. Introduction to Biostatistics
3. Types of Statistical Applications
4. Distinguish between different types of data/variables
5. Describe and graph categorical data
6. Describe and graph numerical data
– Central tendency (location): mean, median and mode
– Spread: min, max, range, P25, P75, standard deviation and variance, (outliers)
