
Biostatistics

Prem Prasad Panta


Asso. Prof. in Biostatistics
Karnali Academy of Health Sciences
Email: pantaprem7@gmail.com

Probability
• Proportion of happening outcome from the
total
• Relative frequency of the total
• Chance of happening outcome from the
experiment
• Denoted by p
• So probability = favorable outcomes / total outcomes = m/n
Example
• In coin tossing, total outcomes = {head, tail}=2
Probability of head = P (H) = ½
Probability of not head = P (T) = q = 1-1/2 =1/2
• In dice tossing, total outcomes = {1,2,3,4,5,6} = 6
– Probability of 5 = P(5) = 1/6
– Probability of 1 = P(1) = 1/6
• Probability of not happening event =q
• So p+q = 1
• Probability of sure event = 1
• probability of impossible event = 0
• Therefore probability lies between 0 to 1
Terms used in probability

• Independent events: first and second coin tosses
• Dependent events: drawing a first card and then a second card from a pack of cards (without replacing the first)
• Mutually exclusive events: head and tail in a single coin toss
• Equally likely events: occurrence of head and tail in a coin toss

Screening test

• Sensitivity: probability of a positive test (true positive) among all diseased cases
• Specificity: probability of a negative test (true negative) among all healthy cases
• Positive predictive value: probability of
having the disease among a positive
screening test result
• Negative predictive value: probability of not
having disease among a negative screening
test result
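
The four definitions above can be sketched as a small calculation from the cells of a 2x2 screening table. This is a minimal illustration only: the function name `screening_metrics` and the counts (true positives, false positives, false negatives, true negatives) are hypothetical and do not come from the slides.

```python
def screening_metrics(tp, fp, fn, tn):
    """Return the four screening-test measures from 2x2 counts.

    tp = true positives, fp = false positives,
    fn = false negatives, tn = true negatives.
    """
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all diseased
        "specificity": tn / (tn + fp),  # true negatives among all healthy
        "ppv": tp / (tp + fp),          # diseased among all test positives
        "npv": tn / (tn + fn),          # healthy among all test negatives
    }


# Hypothetical counts, invented for illustration only.
metrics = screening_metrics(tp=80, fp=30, fn=20, tn=170)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that sensitivity and specificity are computed down the disease-status columns, while the predictive values are computed across the test-result rows.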
MCQ
1. A bag contains 3 red and 5 green marbles. A
marble is drawn at random. The probability of
drawing a blue marble is
a. 5/8 b. 3/8 c. 0/8 d. 1/8
2. The sum of the probability of an event and non
event is :
a. 2 b. 1 c. 0 d. 0.5
3. If three coins are tossed simultaneously, then the
total number of outcomes will be:
a. 6 b. 3 c. 8 d. 1
• { H,T} =2
• {H,T} =2
• {H,T} =2
• 2X2X2 =8 outcomes
• {HHH, HHT, HTH, THH, ………TTT} =8
• P(all head) =P(HHH) = 1/8
• P(TTT) = 1/8
• P ( two head and one tail) = 3/8
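
The three-coin counts above can be checked by listing every outcome. The sketch below is one possible way to do it; the helper name `prob` is an assumption made for illustration.

```python
from fractions import Fraction
from itertools import product

# All outcomes of tossing three coins: HHH, HHT, ..., TTT
outcomes = ["".join(toss) for toss in product("HT", repeat=3)]
total = len(outcomes)  # 2 x 2 x 2


def prob(event):
    """Probability = favorable outcomes / total outcomes."""
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, total)


p_all_heads = prob(lambda o: o == "HHH")
p_all_tails = prob(lambda o: o == "TTT")
p_two_heads_one_tail = prob(lambda o: o.count("H") == 2)

print(total, p_all_heads, p_all_tails, p_two_heads_one_tail)
```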

• When 2 dice are tossed at once how many outcomes are there?
• 6 X6 =36
• First die: 1 2 3 4 5 6
• Second die: 1 2 3 4 5 6
• Outcomes are ordered pairs: (1,1), (1,2), (1,3), …, (6,6)
Frequency distribution
X (value)    F (frequency)
20           3
30           5
40           10
50           5
Probability distribution
• Discrete probability distribution: for discrete
random variables, e.g. no. of deaths, no.
of births, household size
– Binomial probability distribution
– Poisson probability distribution
• Continuous probability distribution
– Normal probability distribution
– E.g. height, weight, marks, income, expenditure etc.

Binomial probability distribution
• Known as Bernoulli trials or a Bernoulli process
• Deals with
– experiments having two mutually exclusive
outcomes (binary events),
– independent trials,
– constant probability of success (p); the outcome
not happening is called failure (q) [p + q = 1],
– a finite number of trials (n < 20)
• Coin tossing(head/tail), birth(male/female),
test(positive/negative), result(pass/fail) etc.
Binomial probability
• The probability of the outcome happening is called
success and is denoted by p; the probability of it
not happening is q
• So p + q = 1
• Parameters of the binomial distribution are n and p
• Mean = np
• Variance = npq
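
As a hedged illustration of mean = np and variance = npq, the sketch below builds the binomial probabilities with the standard formula P(X = k) = C(n, k) p^k q^(n-k). That formula is not stated on the slides, and the values n = 10 and p = 0.5 are chosen only as an example.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials with success probability p."""
    q = 1 - p  # probability of failure
    return comb(n, k) * p**k * q ** (n - k)


n, p = 10, 0.5  # example values, e.g. ten tosses of a fair coin
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed directly from the distribution
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print("mean:", mean, "np:", n * p)
print("variance:", variance, "npq:", n * p * (1 - p))
```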

Poisson probability distribution

• Discrete probability distribution
• It deals with events occurring within a
certain time interval
• It is also used for rare diseases
• Limiting case of the binomial distribution
• When the probability of success is very small and the no.
of trials is large, the Poisson probability
distribution is used.
• The parameter of Poisson distribution is mean
• Note: mean = variance =λ
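
The "limiting case" idea can be illustrated by comparing binomial and Poisson probabilities when n is large and p is small. This is a sketch under assumed values (n = 1000, p = 0.002, so λ = np = 2); the Poisson formula P(X = k) = e^(-λ) λ^k / k! is the standard one and is not written out on the slides.

```python
from math import comb, exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam**k / factorial(k)


def binomial_pmf(k, n, p):
    """P(X = k) for a binomial distribution with parameters n and p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 1000, 0.002  # hypothetical: many trials, rare event
lam = n * p         # Poisson mean (and variance)

for k in range(5):
    print(k, round(binomial_pmf(k, n, p), 4), round(poisson_pmf(k, lam), 4))

# Largest difference between the two distributions over k = 0..9
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(10))
print("largest difference:", max_gap)
```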
• Variance = Sum (X - mean of X)² / n
• SD = square root (Sum (X - mean of X)² / n)
• Data: 2 4 6 8 10
• Distance from mean (6):
• -4 -2 0 2 4
• Squared distances: 16 4 0 4 16, sum = 40
• N = 5
• 40/5 = 8 = variance
• SD = square root of 8 = square root of (4 x 2) = 2 root 2 ≈ 2.83
• Square of SD = variance
• Square root of variance = SD
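
The worked example above (data 2, 4, 6, 8, 10) can be reproduced step by step. The sketch divides by n as on the slide, which corresponds to Python's `statistics.pvariance` (population variance).

```python
from statistics import pstdev, pvariance

data = [2, 4, 6, 8, 10]
n = len(data)

mean = sum(data) / n                          # 6.0
distances = [x - mean for x in data]          # -4, -2, 0, 2, 4
squared = [d**2 for d in distances]           # 16, 4, 0, 4, 16
variance = sum(squared) / n                   # 40 / 5
sd = variance**0.5                            # square root of the variance

print(mean, variance, round(sd, 2))
# Cross-check with the standard library (population formulas, divide by n)
print(pvariance(data), round(pstdev(data), 2))
```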
Normal probability distribution
• Continuous probability distribution
• Known as Gaussian distribution
• E.g. height, weight, marks, systolic blood pressure (SBP) etc.
• Bell-shaped and symmetrical curve
• Mean = median = mode
• Total area under the curve = 1
• Parameters = mean and standard deviation

MCQ
• A variable that can assume any value between two given points is called
___________
a) Continuous random variable
b) Discrete random variable
c) Irregular random variable
d) Uncertain random variable
• Which of the following mentioned Probability distribution is continuous probability
distribution?
a) Gaussian probability Distribution
b) Poisson probability Distribution
c) Binomial probability Distribution
d) none of them
• A variable which can assume finite or countably infinite number of values is known
as:
a. Continuous b. Discrete c. Qualitative d. None of them
• Total area under the curve of a continuous probability density function is always
equal to:
a. Zero b. One c. -1 d. None of them
MCQ
• Which of the following is not a characteristic of the
normal distribution?
a. the mean is always zero
b. the area under the curve equals one
c. the mean, median and mode are equal
d. it is a symmetrical distribution
• the parameters of normal distribution are:
a. Mean and median
b. Mean and standard deviation
c. Mean and mode
d. Mean, median and mode
Estimation
• Point estimation: single point
Example: estimation of mean and estimation of
prevalence
• Interval estimation: estimation of population
parameter within the certain range
• Confidence interval: interval estimation
having certain confidence i.e. certain level of
probability
• 90% CI (Z= 1.64), 95% CI (Z = 1.96) and 99%
CI (Z =2.58)
Confidence Interval (CI)

95% CI = Mean ± 1.96 SE
where SE for the sample mean = SD/√n
Standard error: variability of sample means,
calculated as SD/√n
Standard deviation: variability of individual observations
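
A minimal sketch of the formula Mean ± Z x SE follows. The function name `mean_ci` and the numbers (mean SBP 120, SD 15, n = 100) are hypothetical and used only to show the arithmetic.

```python
from math import sqrt


def mean_ci(mean, sd, n, z=1.96):
    """Confidence interval for a mean: mean ± z * SE, with SE = SD / sqrt(n)."""
    se = sd / sqrt(n)
    return mean - z * se, mean + z * se


# Hypothetical sample: mean SBP 120, SD 15, n = 100
lower, upper = mean_ci(120, 15, 100)            # 95% CI, Z = 1.96
lower99, upper99 = mean_ci(120, 15, 100, 2.58)  # 99% CI, Z = 2.58

print(round(lower, 2), round(upper, 2))
print(round(lower99, 2), round(upper99, 2))
```

With the same data, the 99% interval is wider than the 95% interval: higher confidence costs precision.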

Z value for α and β

Confidence level | Two tailed test | One tailed test
90% CI | Zα/2 = 1.64 | Zα = 1.28
95% CI | Zα/2 = 1.96 | Zα = 1.64
99% CI | Zα/2 = 2.58 | Zα = 2.33

Power of test | Zβ
80% | 0.84
90% | 1.28
95% | 1.65
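
The tabulated Z values can be recovered from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper names are assumptions for illustration. Exact values differ slightly from the rounded table entries (for example 1.645 rather than 1.64 or 1.65).

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def z_two_tailed(confidence):
    """Z(alpha/2): alpha is split equally between the two tails."""
    alpha = 1 - confidence
    return std_normal.inv_cdf(1 - alpha / 2)


def z_one_tailed(confidence):
    """Z(alpha): all of alpha sits in one tail."""
    return std_normal.inv_cdf(confidence)


def z_beta(power):
    """Z(beta) for a given power of test (power = 1 - beta)."""
    return std_normal.inv_cdf(power)


for level in (0.90, 0.95, 0.99):
    print(level, round(z_two_tailed(level), 3), round(z_one_tailed(level), 3))
for power in (0.80, 0.90, 0.95):
    print(power, round(z_beta(power), 3))
```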
Which one is reliable?
• The mean systolic blood pressure (120) lies,
with 95% confidence, between 110 and 130

• The mean systolic blood pressure (120) lies
surely between 100 and 140.
Note: the narrower the interval, the better the precision

Hypothesis testing

• Null hypothesis(H0):
– Two means are equal ( not significantly different)
– Two proportions are equal
– There is no correlation between two variables
– There is no association between two variables
• Alternative hypothesis(H1)
– Two means are not equal (significantly different)
– Two proportions are significantly different
– There is significant correlation between two variables
– There is significant association between two variables

Types of alternative hypothesis

• Two tailed test (non-directional): not equal
• One tailed test (directional)
– Right tailed (first mean > second mean)
– Left tailed (first mean < second mean)

Two tailed test

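Acceptance region: Z between -1.96 and +1.96, area = 0.475 + 0.475 = 0.95
Rejection regions: Z below -1.96 or above +1.96, area = 0.025 in each tail

Accept H0 if the sample mean falls in the acceptance region.
Reject H0 if the sample mean falls in either of the two rejection regions.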

One tailed test

Rejection region: Z below -1.645, area = 0.05
Acceptance region: Z at or above -1.645, area = 0.45 + 0.5 = 0.95

Accept H0 if the sample mean falls in the acceptance region.
Reject H0 if the sample mean falls in the rejection region.

Errors in hypothesis

False positive and false negative

P value
• In technical terms, a P value is the probability of
obtaining an effect at least as extreme as the
one in our sample data, assuming the truth of
the null hypothesis.
• High P values: Our data are likely with a true null
hypothesis
• Low P values: Our data are unlikely with a true
null hypothesis

Interpretation of p value

• The P value is compared with the 5% and 1% levels of
significance
• P < 0.05 (α = 5%, level of significance):
significant (the means of two groups are significantly
different, or two variables are significantly
associated)
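
As an illustration of comparing a P value with the 5% level, the sketch below converts a Z statistic into a two-tailed P value using the identity P = erfc(|z| / √2) for the standard normal distribution. The function name `two_tailed_p` and the example Z values are assumptions, not taken from the slides.

```python
from math import erfc, sqrt


def two_tailed_p(z):
    """Two-tailed P value for a standard normal test statistic z."""
    return erfc(abs(z) / sqrt(2))


alpha = 0.05  # 5% level of significance
for z in (1.0, 1.96, 2.58):
    p = two_tailed_p(z)
    verdict = "significant" if p < alpha else "not significant"
    print(f"z = {z}: P = {p:.4f} ({verdict})")
```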

Quantitative Variable
H0: Distribution of sample is normal

Test of normality (Kolmogorov-Smirnov test)

• Fail to reject H0 (accepted): normal distribution, use a parametric test
– Test for mean
– Test for relationship (Pearson's correlation)
• H0 rejected: non-normal distribution, use a non-parametric test
– Test for mean
– Test for relationship (Spearman rank correlation)

Selection of test

Samples | Comparison of two averages | Parametric test (follows normality) | Non-parametric test (does not follow normality)
Independent | Different groups or samples | t test (independent t test) | Mann-Whitney U test
Dependent (same group) | Same groups of samples | Paired t test (dependent t test) | Wilcoxon matched-pairs signed-rank test

ANOVA test
More than two groups or samples
Samples | Comparison of more than two averages | Parametric test (follows normality) | Non-parametric test (does not follow normality)
Independent | Different groups or samples | One-way / two-way ANOVA | Kruskal-Wallis one-way ANOVA by ranks
Dependent (same group) | Same groups of samples | Repeated-measures ANOVA | Friedman test (ANOVA)

Measures of Association
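Samples | Variable 1 | Variable 2 | Statistics
Independent (two different groups) | Nominal | Nominal | Chi square test, Fisher Exact test*, Yates Correction*
Independent (two different groups) | Ordinal | Nominal | Chi square test, Fisher Exact test*, Yates Correction*
Independent (two different groups) | Nominal | Ordinal | Chi square test, Fisher Exact test*, Yates Correction*
Dependent (same group or sample) | Nominal | Nominal | McNemar Test*

* Applied only for 2x2 table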


Measures of Relationship
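Statistics | Variable 1 | Variable 2 | Range
Non-parametric: Spearman rank correlation | Ordinal | Ordinal | -1 to +1
Non-parametric: Spearman rank correlation | Ordinal | Ratio | -1 to +1
Non-parametric: Spearman rank correlation | Ratio | Ordinal | -1 to +1
Parametric: Pearson coefficient of correlation | Interval and ratio | Interval and ratio | -1 to +1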

Z test
Assumptions
• Randomness
• Known variance
• Sample size > 30
• Normality
Interpretation
• If calculated |Z| > tabulated Z value (Z = 1.96 at
the 5% level of significance, two tailed), reject the null
hypothesis; otherwise accept the null
hypothesis
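
The decision rule above can be sketched for the simplest case, a sample mean compared with a population mean. The statistic Z = (sample mean - population mean) / (SD / √n) is the standard one-sample form and is not written out on the slides; the function name and the numbers are hypothetical.

```python
from math import sqrt


def one_sample_z(sample_mean, pop_mean, pop_sd, n):
    """Z statistic for a sample mean against a known population mean and SD."""
    standard_error = pop_sd / sqrt(n)
    return (sample_mean - pop_mean) / standard_error


# Hypothetical data: sample mean 124 from n = 100, population mean 120, SD 15
z = one_sample_z(124, 120, 15, 100)
tabulated_z = 1.96  # two tailed, 5% level of significance

reject_h0 = abs(z) > tabulated_z
print(round(z, 3), "reject H0" if reject_h0 else "accept H0")
```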
Types of Z test
• Comparison between
– Sample mean and population mean
– Two sample means
– Sample proportion and population proportion
– Two sample proportions

T test and types
Assumptions
• Randomness
• Normality
• Sample size less than 30
• Unknown variance
Degree of freedom
= n - 1 (for one sample mean test and for paired data)
= (n1 - 1) + (n2 - 1) = n1 + n2 - 2 (for two sample mean
test)
Chi square test

Assumptions: use
i. when row and column values in the contingency
table are categorical or qualitative data
ii. when none of the cells has an expected frequency of zero
iii. when the expected cell frequency is at least five
iv. when the sample size is adequate (n ≥ 50)
Types
i) Test of association between two categorical variables
ii) Test of goodness of fit

Contingency table ( 2x2 table)
Variable 1 (row) | Variable 2 (column): Yes | Variable 2 (column): No | Total
Yes | a | b | a+b
No | c | d | c+d
Total | a+c | b+d | N = a+b+c+d

Degree of freedom = (rows - 1) x (columns - 1) = (2-1)(2-1) = 1

For 2x2 table, DF = 1
For 2x3 table, DF = 2
For 3x2 table, DF = 2
For 3x3 table, DF = 4
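
A hedged sketch of the chi square calculation for a 2x2 table with cells a, b, c, d follows. It uses the standard statistic, the sum of (observed - expected)² / expected, which the slides do not spell out; the cell counts are invented for illustration, and 3.84 is the usual tabulated value for DF = 1 at the 5% level.

```python
def chi_square_2x2(a, b, c, d):
    """Chi square statistic for a 2x2 table with cells a, b (row 1) and c, d (row 2)."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


def degrees_of_freedom(rows, cols):
    """DF = (rows - 1) x (columns - 1)."""
    return (rows - 1) * (cols - 1)


# Hypothetical counts, for illustration only
chi2 = chi_square_2x2(30, 20, 10, 40)
df = degrees_of_freedom(2, 2)
print(round(chi2, 2), df, "significant" if chi2 > 3.84 else "not significant")
```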

MCQ
1. A statement made about a population for testing purpose is
called?
a) Statistic b) Hypothesis c) Level of Significance d) Test-Statistic
2. If the null hypothesis is false then which of the following is
accepted?
a) Null Hypothesis b) Positive Hypothesis
c) Negative Hypothesis d) Alternative Hypothesis
3. The point where the Null Hypothesis gets rejected is called as?
a) Significant Value b) Rejection Value
c) Acceptance Value d) Critical Value
4. The alternative hypothesis is also called:
a) Statistical hypothesis b) research hypothesis
c) Simple hypothesis d) null hypothesis

MCQ

5. Probability of rejecting the null hypothesis when it is true is called
a. Power of test c. Level of significance (type I error)
b. Level of confidence d. Type II error
6. If a healthy client is admitted to the hospital, the error is said to be
a. Type I error c. Type II error
b. Sampling error d. Unbiased error
7. Type II error is committed when
a. We reject a null hypothesis when it is not true
b. We reject a null hypothesis when it is true
c. We accept a null hypothesis when it is not true
d. We accept a null hypothesis when it is true
8. Testing H0: µ = 25 against H1: µ ≠ 25 leads to:
a. Two-tailed test c. Left-tailed test
b. Right-tailed test d. None of (a), (b) and (c)
Correlation and Regression
• Correlation: relationship between two quantitative variables (X =
independent variable and Y = dependent variable)
• The correlation coefficient (r) lies between -1 and +1
• Near to +/- 1: strong correlation
• Near to 0: poor correlation
• r = 0: no correlation exists
• Types of correlation
• Positive correlation and negative correlation:
– Positive correlation :e.g. age and blood pressure, income and
expenditure etc.
– Negative correlation: age and immunity, income and fertility etc
Cont.
• Simple, multiple and partial correlation
– Simple: relationship between only two variables:
age and height
– Multiple: relationship between more than two
variables: age, height and weight
– Partial: controlling one and finding the relationship
between others: age and weight controlling height
• Linear and non linear
– Linear: straight line relationship
– Non linear: exponential, hyperbolic relationship
Cont.
• Parametric method: continuous data
– Karl Pearson coefficient of correlation
• Non parametric method: ordinal data
– Spearman rank correlation
• Coefficient of determination (r²)
– Measures the proportion of variation in the Y variable
explained by the X variable
– Lies between 0 to 1
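
The sketch below computes the Karl Pearson coefficient of correlation (r) and the coefficient of determination (r²) with the usual sums-of-products formula, which the slides name but do not write out. The age and systolic blood pressure values are hypothetical, chosen to mimic the positive-correlation example of age and blood pressure.

```python
from math import sqrt


def pearson_r(x, y):
    """Karl Pearson coefficient of correlation between paired lists x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical paired data: age (years) and systolic blood pressure (mmHg)
age = [30, 40, 50, 60, 70]
sbp = [118, 124, 130, 135, 142]

r = pearson_r(age, sbp)
r_squared = r**2  # coefficient of determination, between 0 and 1
print(round(r, 3), round(r_squared, 3))
```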
Regression
• Cause and effect relationship between two variables
• Cause/predictor/explanatory/independent variable/factor
• Effect/outcome/response/dependent variable
• The amount of change in the Y variable per unit
change of the X variable is known as the regression coefficient
(beta)
• Y = a+bx
– Y = dependent variable
– X =independent variable
– b= regression coefficient or slope of the line
– a= y intercept
Regression line
• Positive regression
• Negative regression
Regression
• If correlation value is negative, regression
coefficient is also negative
• If correlation value is positive, regression
coefficient is also positive
• Regression is used to predict the dependent
variable on the basis of given information about the
independent variable
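
Prediction with Y = a + bX can be sketched with an ordinary least-squares fit, a common way to estimate a and b that the slides do not specify. The function name `fit_line` and the data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares estimates of intercept a and slope b in Y = a + bX."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # regression coefficient (slope)
    a = mean_y - b * mean_x  # y intercept
    return a, b


# Hypothetical paired data: age (years) and systolic blood pressure (mmHg)
age = [30, 40, 50, 60, 70]
sbp = [118, 124, 130, 135, 142]

a, b = fit_line(age, sbp)
predicted_sbp = a + b * 55  # predicted Y for X = 55
print(round(a, 2), round(b, 2), round(predicted_sbp, 2))
```

Here the slope is positive, matching the positive correlation between the two hypothetical variables.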
MCQ
1. A process by which we estimate the value of dependent variable
on the basis of one or more independent variables is called:
(a) Correlation (b) Regression (c) Residual (d) Slope
2. All data points falling along a straight line is called:
(a) Linear relationship (b) Non linear relationship
(c) Residual (d) Scatter diagram
4. The slope of the regression line of Y on X is also called the:
(a) Correlation coefficient of X on Y
(b) Correlation coefficient of Y on X
(c) Regression coefficient of X on Y
(d) Regression coefficient of Y on X
5. In simple linear regression, the numbers of unknown constants
are:
(a) One (b) Two (c) Three (d) Four
MCQ
6. If the value of any regression coefficient is zero, then
two variables are:
(a) Qualitative (b) Correlation
(c) Dependent (d) Independent
7. The straight line graph of the linear equation
Y = a+ bX, slope will be upward if:
(a) b = 0 (b) b < 0 (c) b > 0 (d) b ≠ 0
8. The independent variable is also called:
(a) predictor (b) response
(c) outcome (d) Estimated
9. A measure of the strength of the linear
relationship that exists between two variables is
called:
(a) Slope
(b) Intercept
(c) Correlation coefficient
(d) Regression equation
10. If one item is fixed and unchangeable and the
other item varies, the correlation coefficient will be:
(a) Positive (b) Negative (c) Zero (d) Undecided
Sampling technique
• Sample – statistic
• Population – parameter
• Population and census
• Sample and sampling
• Sampling error: difference between sample
value and population parameter
• Non sampling error: human error
Types of sampling
• Probability sampling
– Simple random sampling
– Systematic sampling
– Stratified random sampling
– Cluster sampling
– Multistage sampling
• Non probability sampling
– Convenience sampling
– Purposive sampling
– Quota sampling
– Snowball sampling
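
Three of the probability sampling methods listed above can be sketched with Python's `random` module. The sampling frame of 100 numbered units, the sample size of 10, and the two strata are all hypothetical; the seed is fixed only so the sketch is repeatable.

```python
import random

rng = random.Random(42)           # fixed seed, for repeatability only
population = list(range(1, 101))  # hypothetical sampling frame of 100 units
sample_size = 10

# Simple random sampling: every unit has the same chance of selection
simple = rng.sample(population, sample_size)

# Systematic sampling: random start, then every k-th unit
k = len(population) // sample_size
start = rng.randrange(k)
systematic = population[start::k]

# Stratified random sampling: sample separately within each stratum,
# here in proportion to stratum size (10% of each)
strata = {"stratum_a": population[:60], "stratum_b": population[60:]}
stratified = {
    name: rng.sample(units, len(units) // 10) for name, units in strata.items()
}

print(simple)
print(systematic)
print(stratified)
```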
Thank you
