
Introduction to biostatistics
Lecture plan

1. Basics
2. Variable types
3. Descriptive statistics:
   Categorical data
   Numerical data
4. Inferential statistics:
   Confidence intervals
   Hypothesis testing

DEFINITIONS
STATISTICS can mean 2 things:
- the numbers we get when we measure and count things (data)
- a collection of procedures for describing and analysing data.
BIOSTATISTICS: the application of statistics in the natural sciences, when biomedical problems are analysed.

Why do we need statistics?

????

Basic parts of statistics:

Descriptive
Inferential

Terminology

Population
Sample

Variables

Variable types

Categorical (qualitative)

Numerical (quantitative)

Combined

Categorical data
Nominal

2 categories
>2 categories

Ordinal

Numerical data

Continuous
Discrete

Description of categorical data
Arranging data
Frequencies, tables
Visualization (graphical presentation)

Frequencies and contingency tables
Of those who were unsatisfied, 4 were males and 6 were females.

              Total        Males        Females
Satisfied     40 (80%)     14 (77.8%)   26 (81.3%)
Unsatisfied   10 (20%)     4 (22.2%)    6 (18.7%)
Total         50 (100%)    18 (100%)    32 (100%)
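The column percentages in the contingency table can be reproduced with a few lines of Python. This is only a minimal sketch: the counts come from the slide, while the names `counts` and `column_percentages` are illustrative assumptions.

```python
# Counts from the slide: rows are satisfaction levels, columns are sexes.
counts = {
    "Satisfied": {"Males": 14, "Females": 26},
    "Unsatisfied": {"Males": 4, "Females": 6},
}


def column_percentages(table):
    """Return each cell as a percentage of its column total."""
    columns = list(next(iter(table.values())))
    totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {
        label: {c: 100 * row[c] / totals[c] for c in columns}
        for label, row in table.items()
    }


percentages = column_percentages(counts)
for label, row in percentages.items():
    # Two decimals are printed so that no rounding convention has to be chosen.
    print(label, {c: f"{v:.2f}%" for c, v in row.items()})
```

Percentages are taken within each column (sex), which is what allows the share of unsatisfied males to be compared with the share of unsatisfied females.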

Graphical presentation
[Figures: example charts shown on four slides; images not recovered]
Other:
- Maps
- Chernoff faces
- Star plots, etc.


Description of numerical data

Arranging data
Frequencies (relative and cumulative), graphical presentation
Measures of central tendency and variance (dispersion)
Assessing normality

Grouping

Sorting data
Groups (5-17 groups), according to the researcher's criteria.

Used to assess the distribution and for graphical presentation in Excel.

Frequencies, their comparison and calculation
197 students were asked about the amount of money (litas) they had in cash at the [...]
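Grouping and the calculation of absolute, relative and cumulative frequencies might look as follows in Python. This is a sketch under stated assumptions: the amounts are hypothetical (they are not the survey data from the slide), and the group width of 20 is an arbitrary choice that yields fewer groups than a real analysis would normally use.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical cash amounts; NOT the survey data from the slide.
amounts = [5, 12, 18, 25, 25, 31, 40, 47, 52, 60]
width = 20  # assumed group width

# Assign each observation to a group identified by its lower bound: 0, 20, 40, ...
groups = Counter((x // width) * width for x in amounts)
lower_bounds = sorted(groups)
n = len(amounts)

absolute = [groups[b] for b in lower_bounds]   # counts per group
relative = [100 * f / n for f in absolute]     # percent of all observations
cumulative = list(accumulate(relative))        # running total of the percents

for b, f, r, c in zip(lower_bounds, absolute, relative, cumulative):
    print(f"{b}-{b + width - 1}: n={f}, {r:.0f}%, cumulative {c:.0f}%")
```

The cumulative column always ends at 100%, which is a quick check that no observation was lost during grouping.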

Graphical presentation of frequencies

Normal distributions
Most observations lie around the centre.
Fewer observations lie above and below the central values, in approximately equal proportions.
Most often this is the Gaussian distribution.

Non-normal distributions

More observations fall in one part of the range.

Asymmetrical distribution

How would you describe/present your respondents if the data are numeric?
2 groups of measures:
1. Central tendency (central value, average)
2. Variance (dispersion)

MEASURES OF CENTRAL TENDENCY

Means/averages (arithmetic, geometric, harmonic, etc.)
Mode
Median
Quartiles

MEASURES OF CENTRAL TENDENCY

Arithmetic mean ($\bar{X}$, $\mu$)

MEASURES OF CENTRAL TENDENCY

Median (Me): the middle value, or 50th percentile (the value of the observation that divides the sorted data into two almost equal parts).
It is found as follows:
When n is odd, the median is the middle observation.
When n is even, the median is the average of the values of the two middle observations.
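The odd/even rule for the median can be written out directly. The sketch below uses a hypothetical helper name, `median_of`, and made-up samples; it is cross-checked against `statistics.median` from the Python standard library.

```python
import statistics


def median_of(values):
    """Middle value for odd n, mean of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [7, 3, 9, 1, 5]        # hypothetical data, n = 5
even_sample = [7, 3, 9, 1, 5, 11]   # hypothetical data, n = 6

print(median_of(odd_sample))   # middle of 1, 3, 5, 7, 9
print(median_of(even_sample))  # average of 5 and 7

# Cross-check against the standard library implementation.
assert median_of(odd_sample) == statistics.median(odd_sample)
assert median_of(even_sample) == statistics.median(even_sample)
```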

MEASURES OF CENTRAL TENDENCY

Mode (Mo): the most common value.

There can be more than one mode.
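Because a sample can have more than one mode, it helps to use a function that returns all of them. A small sketch with made-up data, using `statistics.multimode` (available in Python 3.8 and later):

```python
from statistics import multimode

one_mode = [1, 2, 2, 3, 4]    # hypothetical data: 2 occurs most often
two_modes = [1, 1, 2, 3, 3]   # hypothetical data: 1 and 3 are tied

# multimode returns every value tied for the highest frequency.
print(multimode(one_mode))
print(multimode(two_modes))
```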

MEASURES OF CENTRAL TENDENCY

Quartiles (Q1, Q2, Q3, Q4): the sorted sample is divided into 4 equal parts, with 25% of the observations in each of them.
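Quartile cut points can be obtained with `statistics.quantiles`. This is a sketch on made-up data; note that several conventions for interpolating quartiles exist, so other software may report slightly different values for the same sample.

```python
from statistics import median, quantiles

data = list(range(1, 12))  # hypothetical sample: 1, 2, ..., 11

# n=4 asks for the three cut points that split the data into four parts.
quartiles = quantiles(data, n=4)
q1, q2, q3 = quartiles

print(q1, q2, q3)
# The second quartile is the median.
assert q2 == median(data)
```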

Is a measure of central tendency enough to describe the respondents?

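MEASURES OF VARIANCE (DISPERSION)
Min and max
Range
Standard deviation (SD): the square root of the variance, $SD = \sqrt{V}$
Variance: $V = \sum (x_i - \bar{X})^2 / (n - 1)$
Interquartile range (IQRT): $Q_3 - Q_1$, i.e. the 75th percentile minus the 25th percentile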
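The measures listed above can all be computed with the Python standard library. A minimal sketch on a made-up sample; `statistics.variance` and `statistics.stdev` use the n - 1 denominator, matching the variance formula on the slide.

```python
import statistics

sample = [4, 8, 6, 5, 3, 7, 9, 6]  # hypothetical measurements

minimum, maximum = min(sample), max(sample)
value_range = maximum - minimum

variance = statistics.variance(sample)  # sum of squared deviations / (n - 1)
sd = statistics.stdev(sample)           # square root of the sample variance

# Quartile cut points; the interpolation convention is the library default.
q1, _, q3 = statistics.quantiles(sample, n=4)
iqr = q3 - q1

print(minimum, maximum, value_range, variance, sd, iqr)
```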

What measures are to be used for sample description?

If the distribution is NORMAL:
Mean
Variance (or standard deviation)

If the distribution is NOT NORMAL:
Median
IQRT or min/max

These measures are also used with numeric ordinal data.

$\bar{X}$, Mo, Me
Mean ≈ Median ≈ Mode
SD and the empirical rule

EMPIRICAL RULE

Percentage of observations within 1, 2 and 2.5 SD of the mean, if the distribution is normal

Example

$\bar{X} = 8$
$SD = 2.5$
[Figure: -2SD and +2SD marked around the mean]
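The percentages behind the empirical rule, and the limits in the example, can be checked numerically. This sketch assumes an exactly normal distribution and uses `statistics.NormalDist`; the function name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1


def share_within(k):
    """Proportion of a normal distribution lying within k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 2.5):
    print(f"within {k} SD of the mean: {100 * share_within(k):.1f}%")

# Example from the slide: mean 8, SD 2.5, limits at -2 SD and +2 SD.
mean, sd = 8, 2.5
lower, upper = mean - 2 * sd, mean + 2 * sd
print(lower, upper)
```

With a mean of 8 and an SD of 2.5, the limits at 2 SD are 3 and 13, so roughly 95% of observations would be expected between those values if the data are normally distributed.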

Normality assessment
Summary

Graphical
Comparison of measures of central tendency; empirical rule (mean and standard deviation)
Skewness and kurtosis (both = 0 if the distribution is Gaussian)
Kolmogorov-Smirnov test
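Skewness and kurtosis can be computed from the central moments of the sample. The sketch below uses the simple moment-based (population) formulas and reports excess kurtosis, i.e. kurtosis minus 3, which is the version that equals 0 for a Gaussian distribution. Statistical packages often apply small-sample corrections, so their output may differ slightly. The data and function name are illustrative assumptions.

```python
from statistics import fmean


def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no small-sample correction)."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # subtract 3 so a Gaussian gives 0
    return skewness, excess_kurtosis


symmetric = [1, 2, 3, 4, 5]          # hypothetical symmetric sample
right_skewed = [1, 1, 2, 2, 3, 10]   # hypothetical sample with a long right tail

print(skewness_and_excess_kurtosis(symmetric))
print(skewness_and_excess_kurtosis(right_skewed))
```

A skewness near 0 suggests symmetry, while a clearly positive value indicates a longer right tail.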

Boxplot
[Figure: boxplot with labels for the 75th percentile, mean (*), median, 25th percentile and outliers]

Boxplot example
[Figure: boxplot; vertical axis from 14.00 to 26.00]

Central limit theorem

Inferential statistics

Confidence intervals
Hypothesis testing

Confidence intervals
The interval in which the true value most likely lies.

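The variability of samples and their measures
Sample 1: $\bar{X}_1$, $SD_1$; $p_1$
Sample 2: $\bar{X}_2$, $SD_2$; $p_2$
Sample 3: $\bar{X}_3$, $SD_3$; $p_3$
Sample 4: $\bar{X}_4$, $SD_4$; $p_4$
Population: $\mu$, $\sigma$; $p_0$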

The variability of samples and confidence intervals
Population: $\mu$; $p_0$
[Figure not recovered]

Confidence interval
Statistical definition:
If the study were carried out 100 times, giving 100 results and 100 CIs, about 95 of the 100 intervals would contain the true value and about 5 of the 100 would not.
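The repeated-study definition can be illustrated with a small simulation. This is only a sketch under assumed values: the "true" mean and SD, the sample size and the number of repeated studies are all hypothetical, and 1000 repetitions are used instead of 100 so that the observed share is more stable.

```python
import random
from statistics import fmean, stdev

random.seed(1)                       # fixed seed so the run is repeatable
TRUE_MEAN, TRUE_SD = 8, 2.5          # assumed population values
N_STUDIES, SAMPLE_SIZE = 1000, 50    # assumed number of studies and sample size

covered = 0
for _ in range(N_STUDIES):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    se = stdev(sample) / SAMPLE_SIZE ** 0.5
    lower = fmean(sample) - 1.96 * se
    upper = fmean(sample) + 1.96 * se
    # Count the study if its interval contains the true mean.
    covered += lower <= TRUE_MEAN <= upper

coverage = covered / N_STUDIES
print(f"{100 * coverage:.1f}% of the intervals contain the true mean")
```

The share printed should be close to 95%, but not exactly 95%, because the simulation itself is subject to chance.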

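Confidence intervals (general, most common calculation)

For a mean:
95% CI: $\bar{X} \pm 1.96 \cdot SE$, giving the interval ($X_{min}$; $X_{max}$)
Note: for a normal distribution, when n is large

For a proportion:
95% CI: $p \pm 1.96 \cdot SE$, giving the interval ($p_{min}$; $p_{max}$)
Note: when p and 1 - p > 5/n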

Standard error (SE)

Numeric data ($\bar{X}$): $SE = SD / \sqrt{n}$
Categorical data (p): $SE = \sqrt{p(1 - p) / n}$
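Putting the standard error together with the 1.96 multiplier gives the 95% confidence interval for a mean or for a proportion. A minimal sketch: the function names are illustrative, the numeric example (mean 8, SD 2.5, n = 100) is hypothetical, and the proportion example uses the 40 satisfied respondents out of 50 from the contingency table.

```python
from math import sqrt

Z_95 = 1.96  # multiplier for a 95% confidence level


def ci_mean(mean, sd, n):
    """95% CI for a mean: mean +/- 1.96 * SD / sqrt(n)."""
    se = sd / sqrt(n)
    return mean - Z_95 * se, mean + Z_95 * se


def ci_proportion(p, n):
    """95% CI for a proportion: p +/- 1.96 * sqrt(p * (1 - p) / n)."""
    se = sqrt(p * (1 - p) / n)
    return p - Z_95 * se, p + Z_95 * se


# Hypothetical numeric sample: mean 8, SD 2.5, n = 100.
mean_low, mean_high = ci_mean(8, 2.5, 100)
print(f"mean: {mean_low:.2f} to {mean_high:.2f}")

# Share of satisfied respondents: 40 of 50.
prop_low, prop_high = ci_proportion(40 / 50, 50)
print(f"proportion: {prop_low:.3f} to {prop_high:.3f}")
```

In the proportion example both p = 0.8 and 1 - p = 0.2 exceed 5/n = 0.1, so the condition stated for the approximation is met.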

Width of the confidence interval depends on:
a) Sample size;
b) Confidence level (usually 95%, but any level can be chosen);
c) Dispersion.
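The effect of each of the three factors on the width can be seen by changing one at a time. The sketch below assumes a normal-approximation interval for a mean; the function name `ci_width` and all input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def ci_width(sd, n, confidence=0.95):
    """Full width of a normal-approximation CI for a mean."""
    # Two-sided multiplier, e.g. about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * sd / sqrt(n)


baseline = ci_width(sd=2.5, n=100)
larger_sample = ci_width(sd=2.5, n=400)                       # a) narrower
higher_confidence = ci_width(sd=2.5, n=100, confidence=0.99)  # b) wider
more_dispersion = ci_width(sd=5.0, n=100)                     # c) wider

print(baseline, larger_sample, higher_confidence, more_dispersion)
```

Quadrupling the sample size halves the width, because the standard error shrinks with the square root of n.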

Hypothesis testing
$H_0$: $\mu_1 = \mu_2$; $p_1 = p_2$ (RR = 1, OR = 1, difference = 0)
$H_A$: $\mu_1 \neq \mu_2$; $p_1 \neq p_2$ (two-sided or one-sided)

Hypothesis testing
Significance level (by convention 0.05).
Test used to obtain the P value (t test, $\chi^2$ test, etc.).
The P value is the probability of obtaining the observed difference (association), or a more extreme one, if the null hypothesis is true.
Or: the P value is the probability of obtaining such a difference (association) due to chance alone, when the null hypothesis is true.

Statistical conventions

If P < 0.05, we say that the results can't be explained by chance alone, therefore we reject $H_0$ and accept $H_A$.

If P ≥ 0.05, we say that the difference found can be due to chance alone, therefore we do not reject $H_0$.
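The decision rule can be illustrated with a two-sided z test for the difference between two proportions. This is a sketch, not a recommendation of this test for any particular study: the counts are hypothetical, and the pooled-proportion form of the test is an assumed choice suited to large samples.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z test of H0: p1 = p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Probability of a |z| at least this large if H0 is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


ALPHA = 0.05  # significance level agreed in advance

# Hypothetical counts: 60 of 100 versus 45 of 100.
z, p_value = two_proportion_z_test(60, 100, 45, 100)
decision = "reject H0" if p_value < ALPHA else "do not reject H0"
print(f"z = {z:.2f}, P = {p_value:.3f}: {decision}")
```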

Tests
The choice of test depends on:

Study design,
Variable type,
Distribution,
Number of groups, etc.

Tests (probability distributions):

z test
t test (one sample, two independent samples, paired)
$\chi^2$ test (+ test for trend)
F test
Fisher exact test
Mann-Whitney test
Wilcoxon test and others.
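As one example from the list, the Fisher exact test for a 2x2 table can be written with the standard library alone. This sketch uses the common two-sided definition (summing the probabilities of all tables that are no more likely than the observed one); the function name is illustrative, and the counts are those of the satisfaction table shown earlier in the lecture.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over all tables that are no more likely than the observed one
    # (the small tolerance guards against floating-point ties).
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Rows: satisfied / unsatisfied; columns: males / females.
p_value = fisher_exact_two_sided(14, 26, 4, 6)
print(f"P = {p_value:.3f}")
```

For these counts the observed table is the most probable one under the null hypothesis, so the P value comes out at essentially 1 and the null hypothesis of no association between sex and satisfaction is not rejected.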

Inferential statistics
Summary

The P value tells us whether there is a statistically significant difference (association).

The CI gives the interval in which the true value is likely to lie.

Inferential statistics
Summary

Neither the P value nor the CI accounts for other explanations of the result (bias and confounding).

Neither the P value nor the CI tells us anything about the biological, clinical or public health meaning of the results.