
Biostatistics

By Afework H. (MPH in Epidemiology)


Teaching methods and Assessment
• Credit hours: 4 hr (7 ECTS)

• Lectures will be the major mode of delivery of the subject

• Method of evaluation: continuous assessment


• Quiz 5 %

• Group Assignment 20%

• Individual assignment 10%

• Test 1 10%

• Group work with presentation 15%

• Final written examination 40%

• Total 100%
01/15/2024 Biostatistics 2
References
• 1. Getu Degu and Fasil Tessema. Lecture Notes: Biostatistics for Health Science Students.
https://www.cartercenter.org/resources/pdfs/health/ephti/library/lecture_notes/env_health_science_students/ln_biostat_hss_final.pdf
• 2. Introductory Biostatistics for the Health Sciences. John Wiley & Sons, Inc., 2003.
https://onlinelibrary.wiley.com/doi/book/10.1002/0471458716
• 3. Knapp RG & Miller MC III. Clinical Epidemiology and Biostatistics. Williams and Wilkins, Baltimore, Maryland, 1992.

Chapter one
Introduction to Biostatistics
OBJECTIVES

1. Define Statistics and Biostatistics

2. Enumerate the importance and limitations of statistics

3. Define and Identify the different types of data

4. Understand why we need to classify variables

5. Classify Biostatistics based on data handling.

6. Mention the importance of statistics in the health field


Introduction
Definition:

The term ‘statistics’ can take on different meanings:

I. In the plural sense, it can mean ‘statistical data’:

- collection of numerical facts

- counts or measurements

- E.g., the population statistics of Ethiopia include: total population
number, age-sex distribution of the population, fertility rates, birth
rates, death rates, educational status of the population, etc.

• II. In the singular sense, a ‘statistic’ refers to a summary measure obtained from a sample of a population.
- E.g.,
- Mean
- Proportion
- Standard deviation, etc.
• III. ‘Statistics’ is also used to mean ‘statistical methods’

- In this context, it refers to a body of methods that are used for:

- collecting,

- organizing,

- analyzing, and

- interpreting data for understanding a phenomenon or making wise decisions.

• Unless otherwise explicitly indicated, keep this last meaning of the term in mind whenever we talk about statistics.

• Thus, biostatistics is a discipline in which statistical methods are applied to:
- biological,
- medical, and
- public health data.

• Statistical data: in this sense, the term refers to numerical descriptions of things.

• NB: Although statistical data always denote figures (numerical descriptions), not all numerical descriptions are statistical data.

• For numerical descriptions to be called statistics, they must possess the following characteristics:
Characteristics of statistics
• They must be in aggregates.

• They are affected by many causes.

• They should be numerically expressed.

• They must be enumerated or estimated accurately.

• They should be collected in a systematic manner.

• They should be collected for a predetermined purpose.

• They should be capable of being placed in relation to each other.

• NB: All statistics are numerical data, but numerical data are not statistics unless they satisfy all the essential characteristics of statistics.
Statistics can be classified as:
(i) Descriptive statistics

• Used to describe the basic features of the data in a study.

• Provide simple summaries about the sample and the measures.

• Thus used to organize, present and summarize data.

(ii) Inferential statistics

• Used to draw conclusions about the general population based on results obtained from a sample.
Rationales of statistics
• Why study statistics?

We have the following rationales:

1. To organize data on a wider and more formal basis.

2. Medicine and public health are becoming increasingly quantitative.

3. There is a great deal of inherent variation in most biological processes.

4. Planning, conduct and interpretation of research rely on statistics.

5. Statistics and statistical jargon pervade the medical and public health literature.
Limitations of statistics:
1. It deals with only those subjects of inquiry that are capable of being
quantitatively measured and numerically expressed.

2. It deals with aggregates of facts, and no importance is attached to individual items; it is suited only to the study of group characteristics.

3. Statistical data are only approximately, not mathematically, correct.
Scales of measurement (Types of data)

• Statistical data: results of measurements or observations of variables in a statistical study.

• Variable: a characteristic or attribute of a person, an object, etc. that can be measured and can take different values for different persons or objects.
• There are four levels of measurement scales and, therefore, four types of data.

1. Nominal data

2. Ordinal data

3. Interval data

4. Ratio data
Nominal Data
• Represent categories or names, with no implied order.

• Individuals are simply placed into the proper categories or names.

• Each item must fit into one and only one category.

• Often reported as non-numerical labels, but can be numerically coded.

• Yet arithmetic operations cannot be applied to the codes.

• Examples of nominal data:


• Ethnic group—Anglo-Saxon, Afro-American, Hispanic, other.

• Sex—Male, Female

• Marital status—Single, Married, Divorced, Widowed

• Educational status—literate, Illiterate

Ordinal Data
• Represent categories or names.

• But here, the categories or names have a ranked order.

Example: Opinion of people on an issue
1. strongly agree 2. agree 3. no opinion 4. disagree 5. strongly disagree

• The distance between two consecutive ranks, or between any two ranks, cannot be meaningfully interpreted.

• Ordinal data can also be numerically coded; however, the numerical codes cannot be treated arithmetically.
Interval Data

• Applies to numerical data that have no true zero origin.
◦ E.g., temperature in degrees Celsius or degrees Fahrenheit

• Quantitative differences can be meaningfully interpreted.
◦ E.g., the difference between 10 degrees and 12 degrees is the same as the difference between 30 and 32 degrees.
• But ratios cannot be meaningfully interpreted.
◦ E.g., 20 degrees Celsius is not twice as hot as 10 degrees Celsius. Here “zero” does not mean “none”: 0 degrees is an arbitrary reference point, not the absence of temperature.

• Addition and subtraction are applicable.

• Division and multiplication are not applicable.
Ratio data
• Applies to numerical data that have an absolute zero-point origin.
◦ E.g., length, height, weight, pressure, etc.

• Here “zero” means “none”. Thus, ratios can be meaningfully interpreted.
◦ E.g., an area of 5 m² is half as large as 10 m²; a person who weighs 90 kg weighs 1.5 times as much as a person who weighs 60 kg; etc.

• All arithmetic operations are applicable.
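The contrast between interval and ratio data can be checked numerically. The sketch below is only an illustration: the kelvin conversion is not part of the lecture and is used here as an assumed example of a temperature scale with a true zero, and all variable names are hypothetical.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert an interval-scale Celsius reading to kelvin, a scale with a true zero."""
    return celsius + 273.15

low_c, high_c = 10.0, 20.0

# Differences are meaningful on an interval scale: they survive a change of scale.
diff_c = high_c - low_c
diff_k = celsius_to_kelvin(high_c) - celsius_to_kelvin(low_c)

# Ratios are not: the naive Celsius ratio is 2.0, but the same two
# temperatures expressed in kelvin differ by only a few percent.
ratio_c = high_c / low_c
ratio_k = celsius_to_kelvin(high_c) / celsius_to_kelvin(low_c)

# Weight is ratio data, so the ratio is the same in any unit.
ratio_kg = 90 / 60
ratio_g = 90_000 / 60_000

print(f"difference: {diff_c:.1f} C vs {diff_k:.1f} K")
print(f"ratio: {ratio_c:.2f} (Celsius) vs {ratio_k:.3f} (kelvin)")
print(f"weight ratio: {ratio_kg} (kg) vs {ratio_g} (g)")
```

The temperature difference is 10 on both scales, while the ratio changes with the scale; the weight ratio stays at 1.5 whichever unit is used.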
Exercise-1

The following is a list of different attributes/variables or data. Classify them into the different measurement scales.
1. Your checking account number as a name for your account.
2. A response to the statement "Abortion is a woman's right" where
"Strongly Disagree" = 1, "Disagree" = 2, "Agree" = 3, and "Strongly
Agree" = 4, as a measure of attitude toward abortion.
3. Times for swimmers to complete a 50-meter race
4. Months of the year as September, October…
5. Economic status of a family when classified as low, middle and
upper classes.
6. Blood type of individuals as A, B, AB and O.
7. Regions of Ethiopia as region 1, region 2, region 3…
• The four types of data can be broadly
classified as “Quantitative” and “Qualitative”.

• Quantitative data can be classified into:

(I) Quantitative discrete data
◦ Integers that represent a count of some sort.
◦ Examples: size of the population of Ethiopia in 2010, number of heartbeats per minute, number of RBCs per ml of blood, etc.

(II) Quantitative continuous data
◦ Observations theoretically lie along a continuum.
◦ The restricting factor is the degree of accuracy of the measuring instrument.
◦ Most clinical measurements are numerical continuous, e.g., blood pressure, serum cholesterol level, weight, height, etc.
Chapter two
Data collection, organization and presentation

• At the end of this chapter, the students will be able to:

1. Identify the different methods of data organization and presentation

2. Understand the criteria for selecting a method to organize and present data

3. Identify the different methods of data collection and the criteria used to select a method of data collection

4. Define a questionnaire, identify the different parts of a questionnaire and indicate the procedures to prepare a questionnaire
Data collection
• Before any statistical work can be done data must be collected.

• Depending on the type of variable and the objective of the study, different data collection methods can be employed.

• Various data collection techniques can be used such as:

• Observation

• Face-to-face and self-administered interviews

• Postal or mail method and telephone interviews

• Using available information

• Focus group discussions (FGD)

• Other data collection techniques:
◦ Rapid appraisal techniques
◦ 3L technique
◦ Nominal group techniques
◦ Delphi techniques
◦ Life histories
◦ Case studies, etc.
Problems in gathering data
• Common problems might include:

◦ Language barriers

◦ Lack of adequate time

◦ Expense

◦ Inadequately trained and experienced staff

◦ Invasion of privacy

◦ Suspicion

◦ Bias (spatial, project, person, season, diplomatic, professional)

◦ Cultural norms (e.g. which may preclude men interviewing women)

NB: Some of these problems can be addressed by selecting appropriate data collection methods and training the staff involved.
Choosing a Method of Data Collection
• Decision-makers need information that is relevant, timely, accurate, and usable.

• Some methods pay attention to timeliness and reduction in cost.

• Others pay attention to the strength of the method in using scientific approaches.

• The challenge is to find ways that lead to information that is cost-effective, relevant, timely, and important for immediate use.

• Generally, the choice of a data collection method is largely based on the accuracy of the information it yields.
Choosing a Method of Data Collection…

• Statistical data may be classified according to their source.

1) Primary data: data collected by the investigator himself or herself for the purpose of a specific inquiry or study.

• They are original in character and are mostly generated by surveys conducted by individuals or research institutions.


2) Secondary data: data that have already been collected by others and are used by another investigator.

• Such data are primary data for the agency that collected them initially and secondary for someone else who uses them for his or her own purposes.

• They are less expensive to collect, both in money and time.

• They may contain errors, because the purpose for which they were collected differs from the purpose of the user.

• There may be bias, and the size of the sample may be inadequate.

• There may have been arithmetic or definition errors.

• Hence, it is necessary to critically investigate the validity of secondary data.
• The selection of the method of data collection is also based on practical
considerations, such as:

• 1) The need for personnel, skills, equipment, etc.

• 2) The acceptability of the procedures to the subjects –

• 3) The probability that the method will provide good coverage, i.e.

• The investigator’s familiarity with a study procedure may be a valid consideration.


Types of Questions
• Interviews and self-administered questionnaires are probably the most commonly
used research data collection techniques

• Therefore, designing good “questioning tools” forms an important and time-consuming phase in the development of most research proposals.

• Standardized methods of asking questions are usually preferred in community medicine research, since they provide more assurance that the data will be reproducible.

• Depending on how questions are asked and recorded, we can distinguish two major possibilities:
- Open-ended questions, and
- Closed questions
Steps in Designing a Questionnaire

• Step 1: CONTENT: Take your objectives and variables as your starting point.

• Step 2: FORMULATING QUESTIONS

• Step 3: SEQUENCING OF QUESTIONS

• Step 4: FORMATTING THE QUESTIONNAIRE

• Step 5: TRANSLATION
Methods of data organization & presentation
• Raw data cannot readily show useful information.

• Collected data need to be organized in a way that shows patterns of variation clearly.

• To make data easier to appreciate and to allow quick comparisons, use:
◦ 1) Tables, and/or
◦ 2) Graphs
◦ 3) Frequency distributions
1. Tables
• A statistical table is an orderly and systematic presentation of data in rows and columns.

• Importance of statistical tables
1. Tabulated data can be easily understood.
2. Leave a lasting impression.
3. Facilitate comparison.
4. Make easier the summation of items and detection of omissions and errors.
5. Avoid unnecessary repetitions and details

What do you feel?
• Example: Consider the following narrative description!
“Seven (4.8%) of the smokers and 28 (9.5%) of the chewers started the habit when they were primary school students….. Forty six (31.7%) of the lifetime smokers and 134 (45.6%) of the lifetime chewers started smoking and chewing when they were senior secondary school students. Thirty seven (25.5%) of the ever smokers and 52 (17.7%) of the lifetime chewers started smoking and chewing during their first year at college.”
Construction of tables
1. Tables should be as simple as possible.
2. Tables should be self-explanatory. For that purpose
• The title should be clear and to the point (a good title answers: what? when? where? how classified?) and should be placed above the table.
• Each row and column should be labeled.
• Numerical entries of zero should be explicitly written rather than indicated by a dash. Dashes are reserved for missing or unobserved data.
• Totals should be shown either in the top row and the first column or in the last row and last column.
3. If data are not original, their source should be given in a footnote
Parts of a statistical table
Title
Caption
Stub
Body
Head note [optional]
Foot note [optional]
Source [optional]

Types of Tables

1. Simple or one-way table
◦ Shows only one characteristic.
◦ For example, the educational level of people in a company can be given in a simple table.

Education level    Number
Illiterate            150
Literate               70
2. Two-way table (cross-tabulation): shows two characteristics and is formed when either the caption or the stub is divided into two or more parts.
Example: HIV serostatus vs. sex

Sex       HIV serostatus
          Positive   Negative
Male             3          5
Female           4          7
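A two-way table is obtained by counting individuals on two characteristics at once. The sketch below is a minimal illustration using only the standard library; the individual records are hypothetical and were constructed solely so that the counts match the HIV serostatus by sex example.

```python
from collections import Counter

# Hypothetical individual records (sex, HIV serostatus), built only to
# reproduce the cell counts of the example table.
records = (
    [("male", "positive")] * 3
    + [("male", "negative")] * 5
    + [("female", "positive")] * 4
    + [("female", "negative")] * 7
)

cells = Counter(records)                                 # body of the table
row_totals = Counter(sex for sex, _ in records)          # totals per sex
col_totals = Counter(status for _, status in records)    # totals per serostatus

print("sex     positive  negative  total")
for sex in ("male", "female"):
    print(f"{sex:<7} {cells[(sex, 'positive')]:>8}  "
          f"{cells[(sex, 'negative')]:>8}  {row_totals[sex]:>5}")
print(f"{'total':<7} {col_totals['positive']:>8}  "
      f"{col_totals['negative']:>8}  {len(records):>5}")
```

Adding the marginal totals in this way also follows the rule that totals go in the last row and last column.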
3. Higher-order table: one in which three or more characteristics are represented.
Table 4. Distribution of health professionals by sex and residence.

Profession   Sex      Urban   Rural   Total
Doctors      Male         8      35      43
             Female       2      16      18
Nurses       Male        46      36      82
             Female      23      77     100
Total                    79     164     243
2. Graphical presentation of data
Why graphs?
◦ Attraction.
◦ Help in deriving required information in less time and with ease.
◦ Facilitate comparison.
◦ Reveal unsuspected patterns.
◦ Greater memorizing value

Limitations of Diagrammatic Representation

1. Diagrammatic representation is used only for purposes of comparison.

2. Diagrammatic representation is not an alternative to tabulation.

3. It gives only an approximate idea, so diagrams are not suitable where greater accuracy is needed.

4. Diagrams fail to bring small differences to light.


Construction of graphs
• The choice of a particular diagram depends on personal preference and/or the type of data.
• Bar and pie charts are commonly used for qualitative or quantitative discrete data.
• Histograms and frequency polygons are used for quantitative continuous data.
There are, however, general rules that are commonly accepted about the construction of graphs.
Construction of graphs
1. Every graph should be self-explanatory and as simple as possible.

2. Titles are usually placed below the graph and should again answer the questions: what? where? when? how classified?

3. Legends or keys should be used to differentiate variables if more than one is shown.

4. Axis labels should be placed so that they read from the left side and from the bottom.

5. The units into which the scale is divided should be clearly indicated.

6. The numerical scale representing frequency must start at zero, or a break in the line should be shown.
Types of Diagrams
◦ line diagram (graph)
◦ bar diagram (graph)
◦ pie diagram (chart)
◦ Histogram
◦ Frequency polygon
◦ Ogive curve
◦ ‘box-and-whisker’ plot
◦ Scatter plots
Line Graph
• Used to study a variable over the passage of time.

• Time is plotted along the x-axis.

• The value of the quantity being studied is plotted along the y-axis.

• Depicts the trend of a series over a long period.
[Figure: line graph. Response to administration of zidovudine in two groups of AIDS patients in hospital X, 1999. x-axis: time since administration (min); y-axis: blood zidovudine concentration; series: fat malabsorption, normal fat absorption.]
Bar graph
• Used to present a categorical variable

• Magnitudes are represented by proportional lengths of bars

• A space is left in between the bars

• The bars differ in length, not in width.


Three types:
a. Simple Bar graph.
b. Multiple bar graph
c. Component bar graph
a. Simple bar graph

• Used to show only a single variable.

See the example below.
[Figure: simple bar graph. Distribution of patients in hospital X by source of referral, 1999. x-axis: source of referral (other hospital, GP, OPD, casualty, other); y-axis: number of patients. Bar labels in the extracted text: 769, 623, 256, 161 and 97; their assignment to categories is not recoverable.]
b. Multiple bar graph

• Used to depict two or more variables

• Each bar has separate components adjoining each other.

Can you read this graph?

c. Component (stacked) bar graph
• If there are different quantities forming the sub-divisions of the totals,
simple bars may be sub-divided in the ratio of the various sub-divisions
to exhibit the relationship of the parts to the whole.

• The order in which the components are shown in a “bar” is followed in all bars used in the diagram.

• Can be classified into

a. An actual component and

b. Percentage component bar graph


[Figure: percentage component bar graph. Example: Plasmodium species distribution for confirmed malaria cases, Zeway, 2003. x-axis: month (August, October, December 2003); y-axis: percent (0 to 100); components: P. falciparum, P. vivax, mixed.]
Pie chart
• A circle divided into sectors, where the angle at the center of each sector is proportional to the quantity of the item being represented.

• Used to present a single categorical variable (preferably a variable with few categories).
Distribution of cause of death for females in England and Wales, 1989

[Figure: pie chart with the following sectors.]
Circulatory system      42%
Neoplasms               30%
Respiratory system      13%
Others                   8%
Digestive system         4%
Injury and poisoning     3%
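Because each sector's angle is proportional to the quantity it represents, the angles follow directly from the shares. The sketch below applies this to the cause-of-death percentages from the pie chart example; the function name `sector_angles` is a hypothetical helper, not something defined in the lecture.

```python
def sector_angles(shares: dict) -> dict:
    """Return the central angle (in degrees) of each pie sector.

    Each angle is the item's share of the total multiplied by 360.
    """
    total = sum(shares.values())
    return {label: 360 * value / total for label, value in shares.items()}

# Percentages taken from the cause-of-death example (females, England and Wales, 1989).
cause_of_death = {
    "Circulatory system": 42,
    "Neoplasms": 30,
    "Respiratory system": 13,
    "Others": 8,
    "Digestive system": 4,
    "Injury and poisoning": 3,
}

angles = sector_angles(cause_of_death)
for label, angle in angles.items():
    print(f"{label:<22} {angle:6.1f} degrees")
```

The angles always sum to 360 degrees, which is a quick check that no category has been left out.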
Histogram

• Used to present a single numerical continuous variable.

• Class boundaries are presented along the x-axis.

• For each class, a bar is erected whose width extends from the lower boundary to the upper boundary of the class and whose height is determined by the class frequency.

• There is no gap between the bars.

• Frequencies are labeled along the y-axis.

• Used to show the distributional pattern of a variable.
Age group      15-19  20-24  25-29  30-34  35-39  40-44  45-49
No. of women      11     36     28     13      7      3      2

[Figure: histogram. Age of women at the time of marriage. x-axis: age group, with class boundaries 14.5-19.5, 19.5-24.5, 24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5, 44.5-49.5; y-axis: number of women.]
Frequency polygon
• Used to present only one numerical continuous variable.
• Based on a grouped frequency distribution.
• Each class is represented by its mid-point.
• The frequency of each class is plotted on the y-axis against the mid-point of the class.
• Points representing the mid-points of successive classes are joined by straight lines.
• The curve must be extended to the x-axis at each end.
• The total area under the polygon will be equal to the total area under the histogram.
Frequency polygon
[Figure: frequency polygon of the variable N1AGEMOTH. x-axis: 15.0 to 55.0; y-axis: frequency (0 to 700). Mean = 27.6, Std. Dev = 6.13, N = 2087.]
Ogive curve
• A cumulative frequency curve.

• The upper class boundary of each class is graphed against the cumulative frequency of that class.

• Compute the cumulative frequencies of the distribution.

• Prepare a graph with the cumulative frequency on the vertical axis and the true upper class limits (class boundaries) scaled along the X-axis (horizontal axis).
Cumulative frequency and cumulative relative frequency of age of 25 ICU patients

Age interval   Frequency   Relative        Cumulative   Cumulative
                           frequency (%)   frequency    rel. freq. (%)
10-19                  3              12            3               12
20-29                  1               4            4               16
30-39                  3              12            7               28
40-49                  0               0            7               28
50-59                  6              24           13               52
60-69                  1               4           14               56
70-79                  9              36           23               92
80-89                  2               8           25              100

Total                 25             100
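The relative and cumulative columns of the ICU table can all be derived from the frequency column. The sketch below is one way to do so with the standard library; the upper class boundaries assume ages recorded in whole years (so 0.5 is added to each upper limit), and the variable names are illustrative only.

```python
from itertools import accumulate

intervals = ["10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89"]
frequency = [3, 1, 3, 0, 6, 1, 9, 2]          # frequency column of the ICU table

n = sum(frequency)                             # total number of patients
relative = [100 * f / n for f in frequency]    # relative frequency (%)
cumulative = list(accumulate(frequency))       # running total of frequencies
cumulative_relative = [100 * c / n for c in cumulative]

# Points of the ogive: upper class boundary against cumulative frequency.
# Assumes whole-year ages, so the boundary is the upper limit plus 0.5.
upper_boundaries = [int(label.split("-")[1]) + 0.5 for label in intervals]
ogive_points = list(zip(upper_boundaries, cumulative))

for row in zip(intervals, frequency, relative, cumulative, cumulative_relative):
    print("{:<6} {:>3} {:>6.0f} {:>4} {:>6.0f}".format(*row))
```

The last cumulative frequency must equal the total number of observations, and the last cumulative relative frequency must equal 100%.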
Cumulative frequency of 25 ICU patients

Box and whisker plot

• Five-figure summary of a distribution.

• Shows (in ascending order):
◦ Minimum
◦ Lower quartile
◦ Median
◦ Upper quartile
◦ Maximum
• Also shows outlier values.
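The five figures behind a box and whisker plot can be computed as sketched below. Two assumptions are made that the lecture does not state: quartiles are computed with the default method of Python's `statistics.quantiles` (other textbooks and packages use slightly different conventions), and outliers are flagged with the common 1.5 × IQR rule. The data values are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    return min(values), q1, q2, q3, max(values)

def outliers(values):
    """Flag values beyond 1.5 x IQR from the quartiles (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical measurements with one unusually large value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 30]

summary = five_number_summary(data)
flagged = outliers(data)
print("min, Q1, median, Q3, max:", summary)
print("outliers:", flagged)
```

In the plot, the box spans the lower to the upper quartile, the line inside the box marks the median, and values flagged as outliers are drawn as separate points.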
Scatter plot

• Used to show the relationship between two numerical continuous variables.
• E.g., between body weight and height, between body-mass index and systolic blood pressure, etc.
• See the example below.
• A scatter diagram is constructed by drawing X- and Y-axes.
• Each observation is represented by a point or dot (•).

[Figure: scatter plot. Age and percentage saturation of bile for women patients in hospital Z, 1998. x-axis: age (0 to 80); y-axis: saturation of bile (0 to 160).]
3. Frequency distribution

• A frequency distribution is a summarized presentation of the values of a variable, arranged in order of magnitude either individually (for a discrete variable), in classes (for a continuous variable), or in categories (for qualitative data), along with their frequencies, in rows and columns.
Frequency distribution
A frequency distribution has two main parts, namely:

i. The values of the variable (if quantitative) or the categories (if qualitative), and

ii. The number of observations (frequency) corresponding to the values or categories.

Gender    Frequency
Male             40
Female           38
Frequency Distributions
Frequency distribution (FD) tables consist of:
◦ Classes/categories,
◦ Frequencies of the classes or categories, and
◦ Other pertinent information
They can be:
1. Categorical distributions, or
2. Grouped frequency distributions
1. Categorical distribution

Frequency distribution for categorical or non-numerical data.

Example: Frequency distribution of marital status for a group of people

Marital status   Frequency   Relative frequency
Single                  36   36/80
Married                 24   24/80
Widowed                  7   7/80
Divorced                10   10/80
Separated                3   3/80
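A relative frequency is a category's frequency divided by the total number of observations. The sketch below computes it from the frequency column of the marital status example; the variable names are illustrative only.

```python
# Frequency column of the marital status example.
counts = {
    "Single": 36,
    "Married": 24,
    "Widowed": 7,
    "Divorced": 10,
    "Separated": 3,
}

total = sum(counts.values())  # number of people in the group

# Relative frequency = category frequency / total number of observations.
relative = {status: count / total for status, count in counts.items()}

for status, count in counts.items():
    print(f"{status:<10} {count:>3}  {count}/{total} = {relative[status]:.3f}")
```

The relative frequencies of all categories sum to 1 (or 100% when expressed as percentages).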
2. Grouped frequency distribution
• Used to summarize and present voluminous numerical data sets.

• A range (interval) of values is included in each class.

• The number of observations included in a given class is called the frequency of that class.
Steps to construct GFD
• (1) Choosing the classes,

• (2) sorting (or tallying) of the data into these classes,

• (3) counting the number of items in each class, and

• (4) displaying the results in the form of a chart or table
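The four steps above can be sketched as follows. The ages are hypothetical values chosen for illustration, the classes reuse the five-year age groups from the histogram example, and `tally` is an assumed helper name rather than anything defined in the lecture.

```python
from collections import Counter

def tally(values, classes):
    """Count how many values fall in each (lower limit, upper limit) class.

    Class limits are inclusive, so a value v belongs to the class
    where lower <= v <= upper.
    """
    counts = Counter()
    for value in values:
        for lower, upper in classes:
            if lower <= value <= upper:
                counts[(lower, upper)] += 1
                break
    return [(lower, upper, counts[(lower, upper)]) for lower, upper in classes]

# Step 1: choose the classes.
classes = [(15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44), (45, 49)]

# Hypothetical ages in whole years.
ages = [16, 18, 21, 22, 22, 23, 25, 27, 29, 31, 33, 36, 41, 47]

# Steps 2 and 3: sort the data into the classes and count them.
table = tally(ages, classes)

# Step 4: display the result as a table.
print("Class   Frequency")
for lower, upper, frequency in table:
    print(f"{lower}-{upper}   {frequency:>5}")
```

The class frequencies must add up to the number of observations; a shortfall means some value fell outside every class.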


Rules for the construction of a grouped frequency distribution:
• 1. Choosing the number of classes
◦ Not too few
◦ Not too many
◦ Depends on your knowledge of the data and the purpose of the data presentation.
◦ Suggested: between 5 and 20 classes.
◦ Rule of thumb: Sturges' formula

• K = 1 + 3.322 × log10(n), where K is the number of classes and n is the number of observations
2. Determine the length or width (W) of the class interval
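Sturges' formula and the class width can be computed as in the sketch below. Two points are assumptions rather than statements from the slides: K is rounded up to the next whole number, and the width is taken as the range of the data divided by K, a common convention (the slide's own formula for W did not survive extraction). The function names are hypothetical.

```python
import math

def sturges_classes(n: int) -> int:
    """Number of classes K = 1 + 3.322 * log10(n), rounded up (assumed rounding)."""
    return math.ceil(1 + 3.322 * math.log10(n))

def class_width(values, k: int) -> float:
    """Class width as (largest value - smallest value) / K (assumed convention)."""
    return (max(values) - min(values)) / k

# With 25 observations, as in the ICU example, K works out to 6.
k = sturges_classes(25)

# Hypothetical extreme ages of 12 and 86 give a raw width of about 12.3,
# which would normally be rounded to a convenient value.
width = class_width([12, 86], k)

print("K =", k)
print("W =", round(width, 2))
```

The result is only a starting point; the suggestion of using between 5 and 20 classes still applies.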
3. Determine class limits
• Definition: the smallest and largest values that can go into any class are called class limits.

• They can be either upper or lower class limits.

Consider the following:
• Class limits should be definite and clearly stated, i.e. avoid open-ended classes.

• To find the upper limit of the first class, subtract one unit from the lower limit of the second class. Then keep adding the class width to this upper limit to find the rest of the upper limits.

• Find the boundaries by subtracting 0.5 units from the lower limits and adding 0.5 units to the upper limits. The boundaries are also halfway between the upper limit of one class and the lower limit of the next class. Depending on what you are trying to accomplish, it may not be necessary to find the boundaries.
Determine the true class limits or class boundaries
• Limits that are determined mathematically to make an interval of a continuous variable continuous in both directions.

• The true limits are what the tabulated limits would correspond to if one could measure exactly.

• E.g., if the tabulated lower limit is 5.2, the true lower limit is 5.15.

• Use one more decimal place than that used in the data set.

• Adding the class width (w) to the lower boundary of a class gives the upper boundary of that class, which is also the lower boundary of the next higher class.
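The rules for limits and boundaries can be put together as in the sketch below. It assumes that `unit` is the precision of the recorded data (1 for whole years, 0.1 for one decimal place) and that boundaries lie half a unit beyond the limits; `build_classes` is a hypothetical helper name.

```python
def build_classes(first_lower, width, k, unit=1):
    """Return (lower limit, upper limit, lower boundary, upper boundary) per class.

    The upper limit is one unit below the next class's lower limit, and each
    boundary lies half a unit beyond the corresponding limit.
    """
    classes = []
    for i in range(k):
        lower = first_lower + i * width
        upper = lower + width - unit
        classes.append((lower, upper, lower - unit / 2, upper + unit / 2))
    return classes

# Age groups from the histogram example: 15-19, 20-24, ..., 45-49.
age_classes = build_classes(first_lower=15, width=5, k=7)
for lower, upper, low_b, up_b in age_classes:
    print(f"limits {lower}-{upper}   boundaries {low_b}-{up_b}")

# Data recorded to one decimal place: a tabulated lower limit of 5.2
# corresponds to a true lower limit of 5.15.
decimal_class = build_classes(first_lower=5.2, width=1.0, k=1, unit=0.1)[0]
```

Each upper boundary coincides with the lower boundary of the next class, so the classes cover the scale without gaps.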
Determine the mid-points of classes (class marks or Xc)
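The slide's own formula for the class mark did not survive extraction; the sketch below uses the usual definition, the average of a class's lower and upper limits, which is an assumption here. It is applied to the age groups from the histogram example, and `class_mark` is a hypothetical helper name.

```python
def class_mark(lower_limit: float, upper_limit: float) -> float:
    """Mid-point of a class: the average of its lower and upper limits."""
    return (lower_limit + upper_limit) / 2

# Age groups from the histogram example.
limits = [(15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44), (45, 49)]
marks = [class_mark(lower, upper) for lower, upper in limits]

# The same mid-point is obtained from the class boundaries.
mark_from_boundaries = class_mark(14.5, 19.5)

print(marks)
```

These mid-points are the x-values at which the points of a frequency polygon are plotted.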
