
5/20/2021

Dr. Faran Emmanuel


University of Manitoba


THE POWER OF STATISTICS


Statistical Thinking for Success in Life and Career
By Michael I. Parzen and Emanuel Parzen
May 30, 2013

Statistical thinking helps one’s success in life and career
by quantifying uncertainty using probability. It is
important to distinguish between outcomes that
are conceivable (i.e. zero probability), possible
(i.e. positive probability for an interval of similar
outcomes), and probable (i.e. positive probability).
Statistical thinking is used to answer questions about
what one knows and how one knows it, based on analysis
of data more than expert opinions.


FROM EPIDEMIOLOGY TO BIOSTATISTICS


▪ Statistics is the body of techniques and procedures dealing with the collection,
organization, analysis, interpretation, and presentation of information that
can be stated numerically. It is the science and art of understanding and analyzing
data to obtain reliable results and conclusions.

▪ Biostatistics is the application of statistics to problems in the
biological sciences, health, and medicine.

SO WHAT DOES STATISTICS DO???


▪ Statistics uses sample statistics to estimate population parameters, also termed
population characteristics.

▪ One common example is the population mean: we calculate the sample mean to
estimate the population mean. Likewise, we look at the proportion of people having a
specific disease within our sample and estimate the population prevalence.

▪ Although population parameters are sometimes considered unobservable, they
are taken to be fixed and potentially measurable quantities using survey statistics.

▪ Sample statistics vary from one sample to the next… that’s why they are treated as random variables.


THIS IS WHAT RESEARCHERS DO


▪ Identify an issue or a problem (usually a statistic)
▪ Literature review to understand the issue (usually review data…)
▪ Develop rationale to support the investigation (numbers)
▪ Develop a Hypothesis or Research question (hypothesis testing)
▪ Develop Research Protocol & Questionnaires (sample size, variables, questions)
▪ Ethical approval from an ERB
▪ Pre-test the questionnaire & research process (preliminary analysis)
▪ Finalize data collection process and study teams
▪ Field data collection
▪ Data management and analysis (many steps…)
▪ Conclusions and Recommendations
▪ Dissemination and Response (crisp numbers)

FROM EPIDEMIOLOGY TO BIOSTATISTICS


Data vs Information
DATA (questionnaire info; consists of numbers) → ANALYSIS → INFORMATION (results and conclusions)

▪ The methods and tools of biostatistics are used to analyze the data for
decision making

▪ They allow us to make valid inferences from known samples about the populations from
which they were drawn.


DATA ANALYSIS
There are various types of Statistical Analysis:
▪ Descriptive analysis: used to describe the data set
▪ Inferential analysis: used to generate conclusions about the
population’s characteristics based on the sample data
o Differences analysis: used to compare the mean of the responses
of one group to that of another group
o Associative analysis: determines the strength and direction of
relationships between two or more variables
▪ Predictive analysis: allows one to make forecasts for future events


UNDERSTANDING VARIABLES
▪ Any population or sample characteristic that we want to study
is called a “variable”, e.g., age, sex, years of education, HIV status,
income, etc.

▪ The term "variable" makes sense because the value of the characteristic
varies from one subject to another. This happens because of inherent variation
among individuals and because of errors, called measurement errors, made in
measuring and recording a subject's value on a characteristic.

▪ Dependent (outcome) and Independent (predictors) variables

▪ We start by looking at the types of variables we have in our data set

TYPES OF DATA

TYPES OF VARIABLES/DATA
2 major types

QUALITATIVE (Categorical)
▪ NOMINAL: identification of subjects, e.g., Study ID, Gender, Address, Name of the school
▪ ORDINAL: order of value exists, e.g., income categories, educational categories, state of health

QUANTITATIVE (Numerical/Scale)
▪ DISCRETE: absolute values… no continuity, e.g., No. of children, No. of patients seen
▪ CONTINUOUS: a score or value within a scale. The difference between each value has a real meaning


NUMERIC APPROACH
Data is usually described in form of numbers – Tables are the most common presentation of data

QUANTITATIVE
▪ Measure of central tendency: mean, median, mode
▪ Measure of dispersion: Variance, Standard Deviation, Range, Inter-quartile Range
▪ Distribution: Skewness, Kurtosis

QUALITATIVE (Categorical)
▪ Frequencies/Proportions


DESCRIPTIVE – NUMERIC – MEASURES OF CENTRAL TENDENCY


▪ Measures of central tendency are measures of the location of the middle
or the center of a distribution
▪ Why describe central tendency?
Data often cluster around a central value that lies between the two
extremes. This single number can describe the value of scores in the
entire data set.
▪ Three measures of central tendency are usually used:
1) Mean
2) Median
3) Mode

DESCRIPTIVE – NUMERIC – MEASURES OF CENTRAL TENDENCY - MEAN


▪ THE MEAN is the most commonly used, also known as “average”
▪ Population mean μ, and sample mean x̄
▪ Sum of all the scores divided by the number of scores
▪ Weighted mean - weight given to each value according to its importance
• Most commonly used
• When the variable is normally distributed
• Mean (sd)

▪ MEDIAN is the middle score when all scores in the data set are arranged in order.
▪ Half the scores lie above and half lie below the median.
• Used when distribution is skewed (e.g., income)
• Median (range)

▪ MODE is the most frequently occurring number in a set of data.
▪ If there are two modes, the data set is bimodal.
▪ If there are more than two modes, the data set is said to be multimodal.
• Usually used with Categorical data
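
As a rough illustration, the three measures can be computed with Python's standard `statistics` module. The scores are Group B from the dispersion example in this deck; the variable names are only illustrative.

```python
import statistics

# Group B scores from the dispersion example in this deck
scores = [11, 14, 14, 15, 16, 16, 19]

mean = statistics.mean(scores)        # sum of all scores / number of scores
median = statistics.median(scores)    # middle score once the data are sorted
modes = statistics.multimode(scores)  # every value tied for most frequent

print("mean:", mean)
print("median:", median)
print("mode(s):", modes)
```

Two values tie for most frequent here, so this small data set is bimodal.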


DESCRIPTIVE – NUMERIC – MEASURES OF DISPERSION


▪ Let's look at an example…
▪ Group A: 0, 5, 10, 15, 20, 25, 30

▪ Group B: 11, 14, 14, 15, 16, 16, 19

▪ Group C: 0, 0, 15, 15, 15, 30, 30

▪ All three groups have the same mean (15)… So can we say that all three
groups are the same?
▪ Other than the mean/average, another quantity determines the
characteristics of a group…
▪ Let us look at another example…
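
A quick check of the three groups, sketched with Python's standard `statistics` module, shows that the means agree while the spread does not. The sample standard deviation is used here as one possible measure of spread.

```python
import statistics

groups = {
    "A": [0, 5, 10, 15, 20, 25, 30],
    "B": [11, 14, 14, 15, 16, 16, 19],
    "C": [0, 0, 15, 15, 15, 30, 30],
}

# (mean, sample standard deviation) for each group
summary = {
    name: (statistics.mean(values), statistics.stdev(values))
    for name, values in groups.items()
}

for name, (mean, sd) in summary.items():
    print(f"Group {name}: mean = {mean}, sample SD = {sd:.2f}")
```

Group B is tightly clustered around 15, while Groups A and C are far more spread out, so the mean alone does not describe a group.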


DESCRIPTIVE – NUMERIC – MEASURES OF DISPERSION


▪ A variable’s spread is the degree to which values on the variable differ
from each other.
▪ The more the values differ from each other, the greater the spread of the data.
▪ If every score on the variable were about equal, the variable would have
very little spread.
▪ Variability and dispersion are synonyms for spread.
▪ There are various measures of dispersion, e.g., Range, Variance, Standard
Deviation, Interquartile Range.

DESCRIPTIVE – NUMERIC – MEASURES OF DISPERSION - RANGE


▪ The range is the simplest measure of spread or dispersion.
▪ It is equal to the difference between the largest and the smallest values.
e.g., 100, 74, 68, 68, 57, 56
Range = H - L = 100 - 56 = 44
▪ Range is very sensitive to extreme scores since it is based on only two
values.
▪ The range should almost never be used as the only measure of spread,
but can be informative if used as a supplement to other measures of
spread


DESCRIPTIVE – NUMERIC – MEASURES OF DISPERSION – VARIANCE

• Variance is the sum of the squared deviations from the mean divided by N.

• Population variance is given by σ²:
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
• Sample variance is given by s²:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

DESCRIPTIVE – NUMERIC – MEASURES OF DISPERSION

STANDARD DEVIATION
▪ Most commonly used measure of dispersion
▪ To calculate the Standard Deviation, simply calculate the square root of the
variance.
▪ Population Standard deviation is given by σ
▪ Sample Standard deviation is given by s
▪ Average or mean is always presented along with standard deviation
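
The range, variance, and standard deviation can be worked through by hand in a few lines of Python. This sketch reuses the six scores from the range example and follows the usual definitions (divide by N for a population, by n - 1 for a sample); the variable names are illustrative.

```python
import math

scores = [100, 74, 68, 68, 57, 56]  # data from the range example

n = len(scores)
mean = sum(scores) / n
squared_devs = [(x - mean) ** 2 for x in scores]  # squared deviations from the mean

value_range = max(scores) - min(scores)          # H - L
pop_variance = sum(squared_devs) / n             # treat the data as a population
sample_variance = sum(squared_devs) / (n - 1)    # treat the data as a sample
pop_sd = math.sqrt(pop_variance)
sample_sd = math.sqrt(sample_variance)

print("range:", value_range)
print("population variance:", round(pop_variance, 2), "SD:", round(pop_sd, 2))
print("sample variance:", round(sample_variance, 2), "SD:", round(sample_sd, 2))
```

Dividing by n - 1 rather than N makes the sample variance slightly larger than the population variance computed from the same numbers.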


DESCRIPTIVE – NUMERIC – MEASURES OF DISPERSION – INTERQUARTILE RANGE


▪ The interquartile range is a measure of where the “middle fifty percent” is in a data set.

▪ Where range is a measure of where the beginning and end are in a set, an
interquartile range is a measure of where the bulk of the values lie. That’s why it’s
preferred over many other measures of spread when reporting things like school
performance or SAT scores.

▪ The interquartile range formula is the first quartile subtracted from the
third quartile:
IQR = Q3 – Q1.
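
A minimal sketch of IQR = Q3 – Q1 using `statistics.quantiles` from the Python standard library. Software packages use slightly different conventions for locating quartiles, so results on small data sets may differ from a hand calculation; the default "exclusive" method is assumed here, and the helper name `iqr` is illustrative.

```python
import statistics

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    return q3 - q1

group_a = [0, 5, 10, 15, 20, 25, 30]      # Group A from the dispersion example
group_b = [11, 14, 14, 15, 16, 16, 19]    # Group B from the dispersion example

print("IQR of Group A:", iqr(group_a))
print("IQR of Group B:", iqr(group_b))
```

Both groups have the same mean, but the middle half of Group A covers a much wider interval than the middle half of Group B.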


DESCRIPTIVE – NUMERIC – DISTRIBUTION


▪ When we talk of the distribution of data, we
basically mean how a variable or a
characteristic is distributed in the
population.

▪ For example, “what is the distribution of
age among this class?”

▪ We can develop a histogram showing how
many people are within each age category
(each year or class intervals).

[Figure: histogram of age; x-axis ticks at 20, 24, 28, 32, 36, 42, 46]
▪ This histogram shows the age distribution

NORMAL DISTRIBUTION
▪ The most famous probability distribution in statistics.
▪ Also called the “Gaussian distribution” or “bell shaped curve”
▪ It is continuous, smooth, bell shaped and has only one peak (unimodal)
▪ The curve is symmetrical about the mean (shape is the same on both sides)
▪ The mean, median and mode are equal and located at the center of the distribution
▪ Two parameters define the normal distribution, the mean (μ) and the standard
deviation (σ).
▪ Since it is a probability distribution, total area under the curve is 1.00 or 100%


WHY IS NORMAL DISTRIBUTION IMPORTANT?

▪ Countless phenomena follow (or closely approximate) the normal
distribution, e.g., height, weight, serum cholesterol, body temp of healthy
persons
▪ Much statistical theory and methodology was developed on this assumption,
and it is the basis for inferential statistics

SKEWNESS OF DATA

▪ Skewness is a measure of symmetry, or more precisely, the lack of
symmetry. A distribution, or data set, is symmetric if it looks the same to
the left and right of the center.
▪ The skewness for a normal distribution is zero. Negative values for the
skewness indicate data that are skewed left and positive values for the
skewness indicate data that are skewed right.
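
The sign convention can be checked with a small sketch. The slides do not say which skewness formula is intended, so this assumes the common moment-based coefficient (third central moment divided by the 1.5 power of the second); the skewed data sets are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [0, 5, 10, 15, 20, 25, 30]   # Group A from the dispersion example
right_tail = [1, 2, 2, 3, 3, 3, 20]      # hypothetical: one large value
left_tail = [-x for x in right_tail]     # mirror image of the data above

print("symmetric:", skewness(symmetric))
print("right tail:", round(skewness(right_tail), 2))
print("left tail:", round(skewness(left_tail), 2))
```

The symmetric data give a skewness of zero, the long right tail gives a positive value, and its mirror image gives the same value with a negative sign.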


TYPES OF DISTRIBUTION/CURVE
[Figure: three curves. Normal distribution: mean and median are the same. Positive skew: the median lies to the left of the mean. Negative skew: the mean lies to the left of the median.]

▪ A distribution is skewed if one of its tails is longer than the other.


▪ The distribution can be positively skewed. This means that it has a long tail in the
positive direction… "skewed to the right"
▪ The distribution can be negatively skewed if it has a long tail in the negative direction….
"skewed to the left"


OUTLIERS
▪ An outlier is commonly defined as any observation that falls more than 3 standard
deviations away from the mean.
▪ Outliers are extremely important because they can significantly skew
distributions that are otherwise normal.

▪ A decision needs to be taken about how to deal with outliers.
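
One way to apply the 3 standard deviation rule is sketched below. The readings are hypothetical, the helper name `find_outliers` is illustrative, and the sample standard deviation is assumed.

```python
import statistics

def find_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` SDs from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [x for x in values if abs(x - mean) > threshold * sd]

# Hypothetical readings: twenty values close to 50 and one extreme value
readings = [48, 49, 50, 51, 52] * 4 + [120]

print(find_outliers(readings))
```

Note that the extreme value itself inflates both the mean and the standard deviation, so in very small samples no observation can lie 3 standard deviations from the mean and this rule will flag nothing.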


APPLICATION OF NORMAL DISTRIBUTION


▪ Knowing a distribution alone is not enough… Scientists use this to answer
research questions
▪ … we just said that countless phenomena follow (or closely approximate)
normal distribution e.g., height, weight, serum cholesterol, body temp of
healthy persons
▪ E.g., if systolic blood pressure is normally distributed with a mean of 120 and
standard deviation of 12, what proportion of people will have a normal systolic
blood pressure? (Hypertensive is >135, while low blood pressure is <95.)
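
The blood pressure question can be answered numerically as a sketch with `statistics.NormalDist` from the Python standard library, assuming "normal" means a systolic pressure between 95 and 135.

```python
from statistics import NormalDist

# Systolic blood pressure: mean 120, standard deviation 12 (from the example)
systolic = NormalDist(mu=120, sigma=12)

low, high = 95, 135  # cut-offs for low blood pressure and hypertension

# Area under the curve between the two cut-offs
p_normal = systolic.cdf(high) - systolic.cdf(low)

print(f"Proportion with normal systolic pressure: {p_normal:.3f}")
```

Under these assumptions, a little under nine in ten people fall in the normal range.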

STANDARD NORMAL DISTRIBUTION


▪ A normal distribution is transformed to the standard normal distribution by the formula:
$$z = \frac{x - \mu}{\sigma}$$
where x is a score from the original normal distribution, μ is the mean of the original
normal distribution, and σ is the standard deviation.

▪ The standard normal distribution is also called the z distribution.


▪ A z score is the number of standard deviations an observation is away from the mean.

By using this formula we can convert any normally distributed
variable’s distribution into a standard
normal distribution with a mean of zero
and standard deviation of 1.


NORMAL DISTRIBUTION

Z SCORE TABLE
AREA UNDER THE CURVE

One half of the distribution is a
mirror image of the other half.


Z SCORE TABLE – USING THE TABLE


For instance, suppose you scored 120 on a test with a
mean of 100 and a standard deviation of 10.
What is your z score?
z = (120 – 100) / 10 = 20/10 = 2
The z score tells you how many standard
deviations from the mean your score is. In this
example, your score is 2 standard
deviations above the mean.
▪ What proportion of people scored above 120?
▪ What proportion of people scored between 90 and 120?
▪ What proportion scored above 100?
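
The z score table lookups for this test score example can be reproduced with `statistics.NormalDist`. This is a sketch that assumes the test scores are normally distributed with mean 100 and standard deviation 10; the variable names are illustrative.

```python
from statistics import NormalDist

test_scores = NormalDist(mu=100, sigma=10)

z = (120 - 100) / 10  # z score for a test score of 120

p_above_120 = 1 - test_scores.cdf(120)                      # area to the right of 120
p_90_to_120 = test_scores.cdf(120) - test_scores.cdf(90)    # area between 90 and 120
p_above_100 = 1 - test_scores.cdf(100)                      # area to the right of the mean

print("z score:", z)
print(f"above 120: {p_above_120:.4f}")
print(f"between 90 and 120: {p_90_to_120:.4f}")
print(f"above 100: {p_above_100:.4f}")
```

Because the curve is symmetric about the mean, exactly half of the area lies above 100.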

Z SCORE TABLE
AREA UNDER THE CURVE

▪ What is the 90th percentile?
(X = μ + Zσ)
▪ To solve this, look for 90% in the table
and check the z score for it.
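
The reverse lookup (from an area to a z score) can be sketched with `NormalDist.inv_cdf`, then plugged into X = μ + Zσ. The mean of 100 and standard deviation of 10 are carried over from the test score example as an assumption.

```python
from statistics import NormalDist

mu, sigma = 100, 10  # assumed: the test score example (mean 100, SD 10)

# z score that has 90% of the area under the standard normal curve to its left
z_90 = NormalDist().inv_cdf(0.90)

# Convert the z score back to the original scale: X = mu + Z * sigma
x_90 = mu + z_90 * sigma

print(f"z for the 90th percentile: {z_90:.3f}")
print(f"90th percentile score: {x_90:.1f}")
```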


Z DISTRIBUTION : THE CATCH


▪ The z distribution works on 2 major assumptions:
▪ The sample size is more than 30, i.e., N>30
▪ We know the value of σ.

IS THERE A PROBLEM WITH THAT ??

▪ Even if the sample size is high, in reality, there will never be a situation
where you know the true population variance or standard deviation…


T DISTRIBUTION
▪ The t distribution (Student’s t-distribution) is a probability distribution that is used
to estimate population parameters when
i. the sample size is small
ii. and/or when the population variance is unknown.

▪ The t statistic (also known as the t score) is given by:
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
where x̄ is the sample mean, μ is the population mean, s is the standard deviation of the
sample, and n is the sample size. The distribution of the t statistic is called the t
distribution or the Student t distribution.
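
A minimal sketch of the t statistic for a single sample, using the definitions above (sample mean, sample standard deviation, sample size). The readings and the population mean of 120 are hypothetical, and the helper name `t_statistic` is illustrative.

```python
import math
import statistics

def t_statistic(sample, mu):
    """Return (t, degrees of freedom) for a single sample against mean mu."""
    n = len(sample)
    x_bar = statistics.mean(sample)   # sample mean
    s = statistics.stdev(sample)      # sample standard deviation (n - 1 divisor)
    t = (x_bar - mu) / (s / math.sqrt(n))
    return t, n - 1

# Hypothetical systolic readings from 8 people; hypothetical population mean of 120
readings = [118, 122, 125, 130, 121, 127, 119, 126]
t, df = t_statistic(readings, mu=120)

print(f"t = {t:.3f} with {df} degrees of freedom")
```

With a sample of size 8, the resulting t score would be compared against a t distribution with 7 degrees of freedom.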

T DISTRIBUTION

▪ There are many different “t” distributions.
▪ The particular form of the “t” distribution
is determined by its degrees of freedom.
▪ Degrees of freedom refers to the number of independent observations in a set of data.
▪ When estimating a mean score or a proportion from a single sample, the number of
independent observations is equal to the sample size minus one. Hence, the
distribution of the t statistic from samples of size 8 would be described by a t
distribution having 8 - 1 or 7 degrees of freedom. Similarly, a t distribution having 15
degrees of freedom would be used with a sample of size 16.


T DISTRIBUTION
Following are the characteristics of the t distribution:
(1) The t statistic lies between −∞ < t < ∞.
(2) The probability distribution appears to be symmetric about t = 0.
(3) The probability distribution appears to be bell-shaped.
(4) The density curve looks like a standard normal curve, but the tails of the t-
distribution are "heavier" than the tails of the normal distribution. That is, we are
more likely to get extreme t-values than extreme z-values. There are no outliers.
(5) As the degrees of freedom increases, the t-distribution appears to approach the
standard normal z-distribution.

T SCORE TABLE
AREA UNDER THE CURVE

▪ What would be the t score if the area
under the curve is 90% with n=24,
and with n=1000?

▪ What would be the t score if the area
under the curve is 95% with n=100?

▪ It is interesting to note that when the
sample size is large, the distribution
of the t statistic matches the distribution of z.


CHI SQUARE DISTRIBUTION


▪ When the data is categorical, Chi Square statistic is most commonly used for
testing relationships between categorical variables.
▪ Chi square distribution (χ2) is a probability distribution widely used in statistical
inference
▪ The chi-squared distribution (also chi-square or χ2-distribution) with k degrees of
freedom is the distribution of a sum of the squares of “k” independent standard
normal random variables.

CHI SQUARE DISTRIBUTION


▪ The distribution shape depends on the degrees of freedom.
▪ The higher the degrees of freedom, the more closely the Chi square
distribution (χ2) resembles a normal distribution.
▪ DF = (r - 1) * (c - 1), where r and c are the numbers of rows and columns
in the table. Thus in a 2X2 table the DF is 1, i.e.,
(2-1)*(2-1) = 1
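
To make the degrees of freedom concrete, the sketch below computes a chi-square statistic for a 2X2 table. The slides do not spell out the statistic itself, so the standard Pearson form (sum of (observed - expected)² / expected) is assumed; the counts and the helper name are hypothetical.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were unrelated
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)  # DF = (r - 1) * (c - 1)
    return chi2, df

# Hypothetical 2X2 table: rows = exposed / not exposed, columns = disease / no disease
observed = [[30, 20],
            [10, 40]]

chi2, df = chi_square_statistic(observed)
print(f"chi-square = {chi2:.2f}, df = {df}")
```

The statistic is then compared with the chi-square table value for the same degrees of freedom.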


CHI SQUARE TABLE
▪ No negative values
▪ The value of χ2 is different with each degree of freedom
▪ What will be the value of the χ2 statistic if df=1 and the confidence
level is 95%?

CHI SQUARE AND STANDARD NORMAL DISTRIBUTION


▪ If a random variable (Z) has a standard normal distribution, then Z² will follow a χ²
distribution with one degree of freedom
▪ The χ² distribution is related to a standard normal distribution. The simplest chi-squared
distribution is the square of a standard normal distribution.

χ² at 0.05 (df = 1) = 3.84 = (1.96)², where z at 0.05 = 1.96
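
The link between the two critical values can be verified in a couple of lines. This sketch assumes a two-sided z value at the 0.05 level, which is what 1.96 corresponds to.

```python
from statistics import NormalDist

alpha = 0.05

# Two-sided critical z value: 2.5% of the area in each tail
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# Squaring it gives the chi-square critical value with 1 degree of freedom
chi2_crit = z_crit ** 2

print(f"z = {z_crit:.2f}, z squared = {chi2_crit:.2f}")
```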


PROBABILITY DISTRIBUTIONS
DISCRETE
• Binomial distribution (Logistic regression)

CONTINUOUS
• The “F” Distribution (ANOVA)

WHICH DISTRIBUTION HAS A LARGER VARIANCE?


Descriptive statistics

GRAPHICAL APPROACH
Visual presentation of data in form of figures and illustrations … more commonly done for reports
▪ Bar Charts
▪ Histograms/Frequency Polygons
▪ Pie Charts
▪ Scatter Plots

NUMERIC APPROACH
Data is usually described in form of numbers – Tables are the most common presentation of data
▪ Measure of central tendency: mean, median, mode
▪ Measure of dispersion: Variance, Standard Deviation, Range, Inter-quartile Range
▪ Distribution: Skewness, Kurtosis


DESCRIPTIVE – GRAPHICAL – BAR CHARTS

[Figure: bar chart, y-axis 0% to 100%, by site: BHWL, BNU, DGK, GJRN, GJRT, HYD, KHI, KSUR, LRK, MPK, NWB, PSH, QTA, RWP, SHKP, SLKT, SKKR, TRBT, OVERALL. Legend: Home, KK, Brothel, Hotel/Msg, Street, Phone]

Best use : whole numbers, Nominal or discrete data

DESCRIPTIVE – GRAPHICAL – HISTOGRAM

Best use : Continuous data


DESCRIPTIVE – GRAPHICAL – PIE CHARTS


[Figure: pie chart. Home 29%, KK 36%, Brothel 1%, Hotel/Msg 3%, Street 14%, Cell Phone* 17%]

Best use : Percentages

DESCRIPTIVE – GRAPHICAL – BOX PLOTS

Best use : more descriptive, shows measure of Central tendency + dispersion


DESCRIPTIVE – GRAPHICAL – SCATTERPLOTS

Best use : relationship between two variables

HIV PREVALENCE TRENDS – THE POWER OF GRAPHICS


        2005    2007    2008    2011    2016
IDUs    10.8%   15.8%   20.8%   36.7%   38.4%
TGs     0.8%    2.1%    6.4%    7.3%    7.2%
MSM     0.4%    1.5%    0.9%    3.1%    5.6%
FSWs    0.4%    0.2%    0.5%    0.8%    2.2%

[Figure: line chart of the same data, one line per group (IDUs, TGs, MSM, FSWs), y-axis 0% to 45%, x-axis 2005 to 2016]


DESCRIPTIVE – GRAPHICAL – BUBBLE CHARTS


DESCRIPTIVE ANALYSIS
▪ Understand our data set
▪ Scale/Numerical data:
▪ Check for distribution: skewness & kurtosis (symmetrical or skewed… remember the z distribution and t distribution)
▪ Measures of central tendency (Mean, Median, Mode)
▪ Measures of Dispersion (Standard deviation, variance, range, IQR)
▪ DECIDE – Do you want to make categories or present as such?

▪ Qualitative/Categorical data:
▪ Check for proportions within each category
▪ Relationships between categorical variables are tested using the Chi square distribution

▪ How do you present your descriptive data?
▪ Numeric presentation (Tables)
▪ Graphical presentation (Figures/graphs/illustrations)


THANKS
PLEASE READ THESE BASIC CONCEPTS….

will be happy to answer any questions


faran.emmanuel@umanitoba.ca
