
BIOSTATISTICS

The word statistics is derived from the Latin word status, meaning information useful to the state, e.g., the sizes of the populations and armed forces.

BIOSTATISTICS

Statistics refers to the numerical data relating to an aggregate of facts.

Also used to refer to the procedures and techniques used to collect, process and analyze data to make inferences and to reach

IMPORTANT CHARACTERISTICS OF
BIOSTATISTICS:

It deals with uncertainties in population groups and events.

It deals with data subjected to random variations, like the height of children etc.

The study design and data collection procedures have to be correct to obtain meaningful statistics.

Biostatistics can be divided into:

1. Descriptive: Deals with the concepts and methods concerned with summarization and description of the important aspects of the numerical data.
2. Inferential: Deals with procedures for making inferences about the characteristics of large groups or populations by using a part of the data, called the sample.

Definitions
Population: the set of all measurements of interest to the sample collector.
Sample: any subset of measurements selected from the population.
Element/Unit: an entity on which measurements are obtained.

Observation: the set of measurements obtained for each element.
Data: facts and figures that are collected, summarised and analysed.
Data set: a set of different variables in a particular study.

Statistical analyses need variability; otherwise there is nothing to study.

Statistics is concerned, mainly, with variables.

Variation is important!

Any type of observation which can take different values for different people, times, places, species etc. is called a VARIABLE.
E.g., height, weight, uric acid level, X-ray findings, parity, social class etc.

A mathematical constant takes a fixed value, e.g., the ratio of the circumference of a circle to its diameter is a constant, approximately 3.141592654, for circles of all sizes.

Types of variables
A QUALITATIVE variable is one which does not take a numerical value. It is concerned with characteristics, e.g., gender, survival or death, place of birth, colour of eyes etc.
A QUANTITATIVE variable takes a numerical value, e.g., height, blood pressure, lung capacity, exact age, parity, number of cases in a study, completed family size, age last birthday etc.

TYPES OF VARIABLES

Variable
    Qualitative or categorical
        Nominal (not ordered), e.g. ethnic group
        Ordinal (ordered), e.g. response to treatment
    Quantitative measurement
        Discrete (count data), e.g. number of admissions
        Continuous (real-valued), e.g. height

CATEGORICAL VARIABLES

Cannot be measured numerically
Categories must not overlap and must cover all possibilities

CATEGORICAL NOMINAL
VARIABLES
Named categories
No implied order among categories
Examples:

Gender: Male/Female
Blood Groups: O, A, B, AB
Ethnic Group: Chinese, Malay, Indian, Jordanian
Eye color: brown/black/blue/green/mixed

CATEGORICAL ORDINAL VARIABLES

Same as nominal but with ordered categories
Differences between categories may not be considered equal
Examples:

Grading: excellent, satisfactory, unsatisfactory
Pain severity: no pain, slight pain, moderate pain, severe pain

QUANTITATIVE VARIABLES
Can be measured numerically
Examples:

weight
# of admissions to the hospital
concentration of chlorine

Can be discrete or continuous

DISCRETE NUMERICAL VARIABLES


Integers that correspond to a count
Can assume only whole numbers
Examples:

# of bacterial colonies on a plate
# of missing teeth
# of accidents in a time period
# of illnesses in a time period

CONTINUOUS DATA

Continuous data are measured
Can take any value within a defined range
Limitations imposed by the measuring stick

Examples: blood pressure, height, weight, time

WHY DOES IT MATTER?


Categorical and quantitative variables are statistically summarized and presented in different ways

Variable Type    Data Presentation
Quantitative     Graphs, Tables
Categorical      Charts, Tables

TYPES of DATA
Qualitative data = Categorical data
Quantitative data = Numerical data

Qualitative/Categorical Data
There are two types of categorical data:

nominal
ordinal

NOMINAL DATA

In NOMINAL DATA, the variables are divided into named categories. These categories, however, cannot be ordered one above another (as they are not greater or less than each other).

Example:
NOMINAL DATA CATEGORIES
Sex/Gender: male, female
Marital status: single, married, widowed, separated, divorced

ORDINAL DATA

In ORDINAL DATA, the variables are also divided into a number of categories, but they can be ordered one above another, from lowest to highest or vice versa.

Example:
ORDINAL DATA CATEGORIES
Level of knowledge: good, average, poor
Opinion on a statement: fully agree, agree, disagree, totally disagree

Numerical Data
We speak of NUMERICAL DATA if the
VARIABLES are expressed in numbers. They
can be examined through:
Frequency Distribution
Percentages, Proportions, Ratios and Rates
Figures ETC.

Numerical Data
May be:
Discrete or Continuous
Discrete numerical data considers counts
which can be expressed only as whole
numbers e.g., number of people, parity,
number of males/females in a family etc.
Continuous numerical data considers
measures which can take any value
between two whole numbers e.g., weight,
height, uric acid levels etc.

SCALES OF MEASUREMENT
There are four scales (or levels) at which we measure:

Level      Scale      Characteristic
Lowest     Nominal    naming
           Ordinal    ordering
           Interval   equal interval without absolute zero
Highest    Ratio      equal interval with absolute zero

DATA SUMMARIZATION

Measures of Central Location
Measures of Dispersion
Measures of Shapes

[Figure: distribution of number of people by age, annotated with central location and spread]

MEASURES OF CENTRAL LOCATION

Definition: a single value that represents (is a good summary of) an entire distribution of data

Also known as:

Measure of central tendency
Measure of central position

Common measures

Arithmetic mean
Median
Mode

Raw data set: Ages of students in a class (years)

Age: 27, 30, 28, 31, 28, 36, 29, 37, 29, 34, 30, 30, 27, 30

Order the data set from the lowest value to the highest value
Add observation numbers

Obs   Age
1     27
2     27
3     28
4     28
5     28
6     29
7     29
8     29
9     29
10    30
11    30
12    30
13    30
14    30
15    31
16    31
17    32
18    34
19    36
20    37

MODE
Definition: Mode is the value that occurs
most frequently
Method for identification
1. Arrange data into frequency
distribution or histogram,
showing the values of the
variable and the frequency with
which each value occurs
2. Identify the value that occurs
most often
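
The two steps above can be sketched in Python. This is only an illustration: the `ages` list is the ordered class-age data from these slides, and the helper name `find_modes` is made up for the example.

```python
from collections import Counter

# Ages of the 20 students used throughout these slides
ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]


def find_modes(values):
    """Return every value that shares the highest frequency."""
    freq = Counter(values)       # step 1: frequency distribution
    top = max(freq.values())     # step 2: the highest frequency
    return sorted(v for v, count in freq.items() if count == top)


modes = find_modes(ages)
print(modes)  # [30]
```

Returning a list rather than a single number allows for data with more than one mode.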

Mode

Age   Frequency
27    2
28    3
29    4
30    5
31    2
32    1
33    0
34    1
35    0
36    1
37    1

Mode
The most frequent value of the variable

Mode = 30

[Figure: histogram of frequency by age (years), with the tallest bar at age 30]

FINDING MODE FROM LENGTH OF STAY DATA

0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49

Mode = 10

FINDING MODE FROM HISTOGRAM

MODE SENSITIVE TO OUTLIERS?

[Figure: Unimodal Distribution (vertical axis: Population)]

[Figure: Bimodal Distribution (vertical axis: Population)]

MODE PROPERTIES / USES

Easiest measure to understand, explain, identify
Always equals an original value
Insensitive to extreme values (outliers)
Good descriptive measure, but poor statistical properties
May be more than one mode
May be no mode
Does not use all the data

MEDIAN
Definition: Median is the middle value; also, the value that splits the distribution into two equal parts

50% of observations are below the median
50% of observations are above the median

Method for identification

1. Arrange observations in order
2. Find middle position as (n + 1) / 2
3. Identify the value at the middle
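
A minimal Python sketch of this three-step method, assuming the class-age and length-of-stay data from these slides; the function name `median_by_position` is hypothetical. When (n + 1) / 2 is not a whole number, the two neighbouring observations are averaged.

```python
def median_by_position(values):
    """Median via the (n + 1) / 2 middle-position rule."""
    ordered = sorted(values)               # step 1: arrange in order
    n = len(ordered)
    position = (n + 1) / 2                 # step 2: 1-based middle position
    lower = ordered[int(position) - 1]     # step 3: value at (or just below) the middle
    if position == int(position):
        return lower                       # odd n: a single middle value
    upper = ordered[int(position)]
    return (lower + upper) / 2             # even n: average the two middle values


ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]
stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
        9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
        12, 13, 14, 16, 18, 18, 19, 22, 27, 49]
stay_outlier = stay[:-1] + [149]           # same data with 49 replaced by 149

print(median_by_position(ages[:19]))       # odd n = 19
print(median_by_position(ages))            # even n = 20
print(median_by_position(stay), median_by_position(stay_outlier))
```

The last line compares the length-of-stay median with and without the extreme value, which is one way to see how the median reacts to an outlier.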

Median: Odd Number of Values

Ordered ages, observations 1 to 19:
27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 30, 31, 31, 32, 34, 36

n = 19
Median observation = (n + 1) / 2 = (19 + 1) / 2 = 20 / 2 = 10

Median age = 30 years

Median: Even Number of Values

Ordered ages, observations 1 to 20:
27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 30, 31, 31, 32, 34, 36, 37

n = 20
Median observation = (n + 1) / 2 = (20 + 1) / 2 = 21 / 2 = 10.5

Median age = Average value between 10th and 11th observation = (30 + 30) / 2 = 30 years

FIND MEDIAN OF LENGTH OF STAY DATA; IS MEDIAN SENSITIVE TO OUTLIERS?

0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49

0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 149

Median at 50% = 10

MEDIAN PROPERTIES / USES

Does not use all the data available
Insensitive to extreme values (outliers)
Good descriptive measure but poor statistical properties
Measure of choice for skewed data
Equals an original value if n is odd

Quartiles
Definition: Quartiles are the values that split the distribution into four equal parts
25% of observations are below the first quartile (Q1)
25% of observations are between Q1 and Q2 (median)
25% of observations are between Q2 (median) and Q3
25% of observations are above Q3

[Figure: distribution divided at Q1, Q2 and Q3]

Quartiles

Ordered ages, observations 1 to 20:
27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 30, 31, 31, 32, 34, 36, 37

Q1 observation = round((n + 1) / 4) = round((20 + 1) / 4) = round(21 / 4) = round(5.25), i.e. the 5th obs
Q1 age = 28

Q2 observation = 10.5 (median)
Q2 age = 30

Q3 observation = round(3(n + 1) / 4) = round(3(20 + 1) / 4) = round(3(21) / 4) = round(15.75), i.e. the 16th obs
Q3 age = 31
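
The rounding rule for Q1 and Q3 can be sketched as follows. This is an illustration under the assumption that the position k(n + 1) / 4 is simply rounded to the nearest observation number; the helper name `quartile_by_rounding` is made up, and other textbooks and software interpolate instead of rounding.

```python
def quartile_by_rounding(values, k):
    """k-th quartile (k = 1 or 3) taken at the rounded position k(n + 1) / 4."""
    ordered = sorted(values)
    n = len(ordered)
    # Positions here are 5.25 and 15.75, so Python's tie-breaking in round() is not involved
    position = round(k * (n + 1) / 4)
    return ordered[position - 1]    # convert the 1-based position to a 0-based index


ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]

q1 = quartile_by_rounding(ages, 1)
q3 = quartile_by_rounding(ages, 3)
print(q1, q3)  # 28 31
```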

Percentiles
Values of the variable that split the distribution into 100 equal parts
35% of observations are below the 35th percentile
65% of observations are above the 35th percentile

Percentiles

Value (Age)   Freq   Percent (Freq/Total)   Cumulative Percent
27            2      10%                    10%
28            3      15%                    25%    <- 25th Percentile
29            4      20%                    45%
30            5      25%                    70%
31            2      10%                    80%
32            1      5%                     85%
34            1      5%                     90%    <- 90th Percentile
36            1      5%                     95%
37            1      5%                     100%
Total         20     100%
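
A short Python sketch of how a cumulative-percent table of this kind can be built and read. It assumes the 20 class ages from these slides, and it reads a percentile as the smallest value whose cumulative percent reaches the target; the helper name `percentile_value` is hypothetical.

```python
from collections import Counter

ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]

counts = Counter(ages)
total = len(ages)

# Each row: (age, frequency, percent, cumulative percent)
rows = []
cumulative = 0
for age in sorted(counts):
    cumulative += counts[age]
    rows.append((age, counts[age],
                 100 * counts[age] / total,
                 100 * cumulative / total))


def percentile_value(table, p):
    """Smallest value whose cumulative percent reaches p."""
    for age, _freq, _pct, cum_pct in table:
        if cum_pct >= p:
            return age
    return table[-1][0]


for row in rows:
    print(row)
print(percentile_value(rows, 25), percentile_value(rows, 90))
```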

ARITHMETIC MEAN
Arithmetic mean = average value

Method for identification

1. Sum up all of the values
2. Divide the sum by the number of observations (n)
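
As a small illustration, the two steps translate directly into Python. The data are the class ages and length-of-stay values from these slides; the function name `arithmetic_mean` is only for this sketch.

```python
def arithmetic_mean(values):
    """Sum all the values, then divide by the number of observations."""
    return sum(values) / len(values)


ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]
stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
        9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
        12, 13, 14, 16, 18, 18, 19, 22, 27, 49]
stay_outlier = stay[:-1] + [149]   # largest value replaced by 149

mean_age = arithmetic_mean(ages)                    # 605 / 20
mean_stay = arithmetic_mean(stay)                   # 360 / 30
mean_stay_outlier = arithmetic_mean(stay_outlier)   # 460 / 30

print(mean_age, mean_stay, round(mean_stay_outlier, 1))
```

Changing a single value from 49 to 149 moves the mean by more than three nights, because every observation enters the sum.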

Arithmetic Mean

\[ \bar{x} = \frac{\sum x_i}{n} \]

n = 20
\[ \sum x_i = 605 \]

\[ \bar{x} = \frac{605}{20} = 30.25 \]

FINDING THE MEAN LENGTH OF STAY DATA

0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Sum = 360
n = 30
Mean = 360 / 30 = 12

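CENTERING PROPERTY OF MEAN

Value - Mean = Deviation (length of stay data, mean = 12)

 0 - 12 = -12       9 - 12 = -3      12 - 12 =  0
 2 - 12 = -10       9 - 12 = -3      13 - 12 =  1
 3 - 12 =  -9      10 - 12 = -2      14 - 12 =  2
 4 - 12 =  -8      10 - 12 = -2      16 - 12 =  4
 5 - 12 =  -7      10 - 12 = -2      18 - 12 =  6
 5 - 12 =  -7      10 - 12 = -2      18 - 12 =  6
 6 - 12 =  -6      10 - 12 = -2      19 - 12 =  7
 7 - 12 =  -5      11 - 12 = -1      22 - 12 = 10
 8 - 12 =  -4      12 - 12 =  0      27 - 12 = 15
 9 - 12 =  -3      12 - 12 =  0      49 - 12 = 37
Sum:      -71      Sum:      -17     Sum:      88

The deviations from the mean sum to zero: -71 - 17 + 88 = 0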
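
The centering property can be checked numerically. This sketch assumes the 30 length-of-stay values from these slides and simply adds up the deviations below and above the mean.

```python
stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
        9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
        12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

mean_stay = sum(stay) / len(stay)              # 12.0
deviations = [x - mean_stay for x in stay]     # value minus mean

negative_total = sum(d for d in deviations if d < 0)
positive_total = sum(d for d in deviations if d > 0)

# Deviations below the mean balance those above it
print(negative_total, positive_total, sum(deviations))
```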

MEAN USES ALL DATA, SO SENSITIVE TO OUTLIERS

[Figure: histogram of nights of stay. Mean = 12.0 for the original data; Mean = 15.3 when the largest value is 149 instead of 49]

When to use the arithmetic mean?

Centered distribution
Approximately symmetrical
Few extreme values (outliers)
OK!

ARITHMETIC MEAN PROPERTIES / USES

Probably best known measure of central location
Uses all of the data
Affected by extreme values (outliers)
Best for normally distributed data
Not usually equal to one of the original values
Good statistical properties

Var A   Var B   Var C
0       0       0
0       4       1
1       4       2
1       4       3
1       5       4
5       5       5
9       5       6
9       6       7
9       6       8
10      6       9
10      10      10

For each variable, find the:
Sum
Mean
Median
Mode
Minimum value
Maximum value

          Var A   Var B   Var C
Sum:      55      55      55
Mean:     5       5       5
Median:   5       5       5
Mode:     1,9     4,5,6   none
Min:      0       0       0
Max:      10      10      10
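
The exercise can be checked with a short Python sketch. The three lists are Var A, Var B and Var C as read from the slide; the helper `modes_or_none` is hypothetical and, as an assumption of this sketch, reports no mode when every value occurs only once.

```python
from collections import Counter
from statistics import median

variables = {
    "A": [0, 0, 1, 1, 1, 5, 9, 9, 9, 10, 10],
    "B": [0, 4, 4, 4, 5, 5, 5, 6, 6, 6, 10],
    "C": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}


def modes_or_none(values):
    """All most-frequent values, or an empty list if nothing repeats."""
    freq = Counter(values)
    top = max(freq.values())
    if top == 1:
        return []
    return sorted(v for v, count in freq.items() if count == top)


summary = {}
for name, values in variables.items():
    summary[name] = {
        "sum": sum(values),
        "mean": sum(values) / len(values),
        "median": median(values),
        "mode": modes_or_none(values),
        "min": min(values),
        "max": max(values),
    }
    print(name, summary[name])
```

The three variables share the same sum, mean, median, minimum and maximum, yet their modes differ, so the measures of central location alone do not describe how the values are spread.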

Comparison of Mode, Median and Mean


Symmetrical:

Mode = Median = Mean

Skewed right:
Mode < Median < Mean

Skewed left:
Mean < Median < Mode

Measures of Central Location Summary


Measure of Central Location: single measure that represents an entire distribution
Mode: most common value
Median: central value
Arithmetic mean: average value
Mean uses all data, so sensitive to outliers
Mean has best statistical properties
Mean preferred for normally distributed data
Median preferred for skewed data

[Figure: distributions with the same center but different dispersions]

MEASURES OF SPREAD
Definition: Measures that quantify the variation or dispersion of a set of data from its central location
Also known as:

Measure of dispersion
Measure of variation

Common measures

Range
Interquartile range
Variance / standard deviation
Standard error
95% confidence interval

RANGE
Definition: difference between largest and
smallest values

Properties / Uses
Greatly affected by outliers
Usually used with median
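
A minimal sketch of the range and of its sensitivity to a single extreme value, assuming the length-of-stay data used in these slides; the function name `value_range` is made up for the example.

```python
def value_range(values):
    """Difference between the largest and smallest values."""
    return max(values) - min(values)


stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
        9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
        12, 13, 14, 16, 18, 18, 19, 22, 27, 49]
stay_outlier = stay[:-1] + [149]   # largest value replaced by 149

print(value_range(stay))           # 49 - 0
print(value_range(stay_outlier))   # 149 - 0
```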

FINDING THE RANGE OF LENGTH OF STAY DATA

0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49

RANGE SENSITIVE TO OUTLIERS?

Range = 49 - 0 = 49
Range = 149 - 0 = 149 (when the largest value is 149 instead of 49)

[Figure: histogram of nights of stay]

INTERQUARTILE RANGE
Definition: the central 50% of a distribution
Properties / Uses
Used with median
Five-number summary for box-and-whiskers diagram:

Maximum (100%, largest value)
Third quartile (75%)
Median (50%)
First quartile (25%)
Minimum (0%, smallest value)

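INTERQUARTILE RANGE
LENGTH OF STAY DATA

0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49

Q1 = 25th percentile @ (30 + 1) / 4 = 7.75, value 6.75 (three quarters of the way from the 7th value, 6, to the 8th value, 7)
Median = 50th percentile @ 15.5, value 10
Q3 = 75th percentile @ 3 (30 + 1) / 4 = 23.25, value 14.5 (one quarter of the way from the 23rd value, 14, to the 24th value, 16)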
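
One way to reproduce quartile positions of the form (n + 1) / 4 and 3(n + 1) / 4 is Python's `statistics.quantiles` with `method="exclusive"`, which interpolates linearly between neighbouring observations. This is offered as a sketch under that assumption; other software may use a different quartile convention and give slightly different values.

```python
from statistics import quantiles

stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
        9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
        12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

# n=4 asks for the three cut points that split the data into quarters
q1, q2, q3 = quantiles(stay, n=4, method="exclusive")
iqr = q3 - q1    # width of the central 50% of the distribution

five_number_summary = (min(stay), q1, q2, q3, max(stay))
print(five_number_summary, iqr)
```

The tuple `five_number_summary` holds the five values a box-and-whiskers diagram is drawn from.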

BOX-AND-WHISKERS DIAGRAM
LENGTH OF STAY DATA

BOX-AND-WHISKERS DIAGRAMS
VARIABLES A, B, C

VARIANCE AND STANDARD DEVIATION
Definition: measures of variation that quantify how closely clustered the observed values are to the mean

Variance
= average of squared deviations from mean
= Sum (x - mean)^2 / (n - 1)
Standard deviation
= square root of variance
EQUATIONS FOR VARIANCE AND STANDARD DEVIATION

$\bar{x}$: mean
$x_i$: value
n: number
$s^2$: variance
s (SD): standard deviation

\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]

\[ SD = s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]
STEPS TO CALCULATE VARIANCE AND STANDARD DEVIATION

$\bar{x}$: mean
$x_i$: value
n: number
$s^2$: variance
s (SD): standard deviation

1. Calculate the arithmetic mean: $\bar{x}$
2. Subtract the mean from each observation: $x_i - \bar{x}$
3. Square the difference: $(x_i - \bar{x})^2$
4. Sum the squared differences: $\sum (x_i - \bar{x})^2$
5. Divide the sum of the squared differences by n - 1: $s^2 = \sum (x_i - \bar{x})^2 / (n - 1)$
6. Take the square root of the variance: $SD = \sqrt{s^2}$
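
The six steps can be followed one by one in Python. This is a sketch using the length-of-stay data from these slides; the variable names are illustrative, and `statistics.stdev` is used only as a cross-check because it also divides by n - 1.

```python
import math
import statistics

stay = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
        9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
        12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

n = len(stay)
mean_stay = sum(stay) / n                        # step 1: arithmetic mean
deviations = [x - mean_stay for x in stay]       # step 2: subtract the mean
squared = [d ** 2 for d in deviations]           # step 3: square each difference
sum_of_squares = sum(squared)                    # step 4: sum the squared differences
variance = sum_of_squares / (n - 1)              # step 5: divide by n - 1
sd = math.sqrt(variance)                         # step 6: square root of the variance

print(sum_of_squares, round(variance, 2), round(sd, 2))
print(math.isclose(sd, statistics.stdev(stay)))  # cross-check with the standard library
```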


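LENGTH OF STAY DATA

Squared deviations from the mean (mean = 12)

(0 - 12)^2 = 144     (9 - 12)^2 = 9      (12 - 12)^2 = 0
(2 - 12)^2 = 100     (9 - 12)^2 = 9      (13 - 12)^2 = 1
(3 - 12)^2 = 81      (10 - 12)^2 = 4     (14 - 12)^2 = 4
(4 - 12)^2 = 64      (10 - 12)^2 = 4     (16 - 12)^2 = 16
(5 - 12)^2 = 49      (10 - 12)^2 = 4     (18 - 12)^2 = 36
(5 - 12)^2 = 49      (10 - 12)^2 = 4     (18 - 12)^2 = 36
(6 - 12)^2 = 36      (10 - 12)^2 = 4     (19 - 12)^2 = 49
(7 - 12)^2 = 25      (11 - 12)^2 = 1     (22 - 12)^2 = 100
(8 - 12)^2 = 16      (12 - 12)^2 = 0     (27 - 12)^2 = 225
(9 - 12)^2 = 9       (12 - 12)^2 = 0     (49 - 12)^2 = 1369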

STANDARD DEVIATION PROPERTIES / USES

Standard deviation is usually calculated only when data are more or less normally distributed (bell shaped curve)
For normally distributed data,
about 68% of the data fall within 1 SD of the mean
about 95% of the data fall within 2 SD of the mean
about 99.7% of the data fall within 3 SD of the mean
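
These percentages follow from the normal curve itself. As a sketch, the share of a normal distribution lying within k standard deviations of the mean can be computed with the error function from Python's standard library; the function name `share_within` is made up for the example.

```python
import math


def share_within(k):
    """Fraction of a normal distribution within k SD of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 1.96, 2, 3):
    print(k, round(100 * share_within(k), 2))
```

The value k = 1.96 is included because it is the multiplier that gives almost exactly 95%.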

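NORMAL DISTRIBUTION

[Figure: normal curve centred on the mean, with 68% of the area within 1 standard deviation, 95% within about 2 standard deviations, and 2.5% in each tail]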

Match the Measures of Central Location & Spread

Central location: Mode, Median, Arithmetic mean
Spread: Standard deviation, Range, Interquartile range


NAME THE APPROPRIATE MEASURES OF CENTRAL LOCATION AND SPREAD

Distribution                    Central Location   Spread
Single peak, symmetrical        Mean*              Standard deviation
Skewed or data with outliers    Median             Range or interquartile range

* Median and mode will be similar

Properties of
Measures of Central Location & Spread
For quantitative / continuous variables
Mode: simple, descriptive, not always useful
Median: best for skewed data
Arithmetic mean: best for normally distributed data
Range: use with median
Standard deviation: use with mean
Standard error: used to construct confidence intervals

[Figure: histogram of Population by Age, annotated with Mode, Median, Minimum, 1st quartile, 3rd quartile, Maximum, Interquartile interval and Range]

Measures of Shapes

THE NORMAL DISTRIBUTION

Many variables have a normal distribution. This is a bell shaped curve with most of the values clustered near the mean and a few values out near the tails.

MEASURES OF VARIATION
Range is defined as the difference in value between the highest (maximum) and the lowest (minimum) observation.
Variance is defined as the sum of the squares of the deviations about the sample mean divided by one less than the total number of items.
Standard deviation is the square root of the variance.

[Figure: histogram of a normally distributed variable (vertical axis: Fraction; horizontal axis: Var)]

The normal distribution is symmetrical around the mean. The mean, the median and the mode of a normal distribution have the same value.

An important characteristic of a normally distributed variable is that 95% of the measurements have values that are approximately within 2 standard deviations (SD) of the mean.

ESTIMATIONS

The basic problems to which Statistics are applied in practice arise when trying to deduce something about a population from the evidence provided by a sample of observations taken from that population.

The population parameters do not change and remain constant, whereas the sample estimates can change and take any random value.

Population parameters and the corresponding sample estimates:

Mean
Standard deviation (sample estimate: SD)
Proportion
Correlation coefficient

HOW TO DETERMINE THE EXTENT TO WHICH THE SAMPLE REPRESENTS THE POPULATION AS A WHOLE.

To find out to what extent a particular sample value deviates from the population value, a range or an interval around the sample value can be worked out which will most probably contain the population value.

This range or interval is called the CONFIDENCE INTERVAL.

The calculation of a confidence interval takes into account the STANDARD ERROR. The standard error gives an estimate of the degree to which the sample mean varies from the population mean. It is computed on the basis of the standard deviation.

The standard error for the mean is calculated by dividing the standard deviation by the square root of the sample size:

\[ SE = \frac{\text{standard deviation}}{\sqrt{\text{sample size}}} = \frac{SD}{\sqrt{n}} \]
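
As an illustration, the standard error of the mean age can be computed for the 20 class ages used earlier in these slides. The sketch assumes the sample standard deviation (dividing by n - 1), which is what `statistics.stdev` returns.

```python
import math
import statistics

ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]

n = len(ages)
sd = statistics.stdev(ages)    # sample standard deviation
se = sd / math.sqrt(n)         # standard error of the mean

print(round(sd, 3), round(se, 3))
```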

95% CONFIDENCE INTERVAL

When describing variables statistically you usually present the calculated sample mean plus or minus 1.96 times its standard error:

\[ \bar{x} \pm 1.96 \times SE(\bar{x}) \]

This is then called the 95% CONFIDENCE INTERVAL. It means that, if the sampling were repeated many times, about 95% of the intervals calculated in this way would contain the population mean.
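
A short sketch of the 95% confidence interval for the mean age, again assuming the 20 class ages from these slides. The multiplier 1.96 comes from the normal distribution; with a sample this small a t-based multiplier is often used instead, so the numbers here are only illustrative.

```python
import math
import statistics

ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]

mean_age = statistics.mean(ages)
se = statistics.stdev(ages) / math.sqrt(len(ages))   # standard error of the mean

margin = 1.96 * se
ci_lower = mean_age - margin
ci_upper = mean_age + margin

print(round(ci_lower, 2), round(ci_upper, 2))
```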

Note that the larger the sample size, the smaller the standard error and the narrower the confidence interval will be. Thus the advantage of having a large sample size is that the sample mean will be a better estimate of the population mean.

If the sample size is large, small differences can be statistically significant, whereas a large difference may fail to achieve statistical significance when the sample size is small. This leads us to calculating the Confidence Intervals.