
Introduction to Biostatistics

Wondimu Ayele (MSc, PhD fellow)


SP, AAU
January 2019
Objectives
– Define statistics and its importance in different disciplines

– Define variable and data

– Describe types of data and measurement scales

– Organize and display data

– Define and calculate measures of central tendency and measures of spread
Biostatistics -Notes WA , SPH AAU ,2016
• References
1. M. Pagano & K. Gauvreau: Principles of Biostatistics
2. Colton T.: Statistics in Medicine
3. Bland M.: An Introduction to Medical Statistics
4. Daniel W.: Biostatistics: A Foundation for Analysis in the Health Sciences
5. David S. Moore & G. P. McCabe: Introduction to the Practice of Statistics
6. Kleinbaum & K. Muller: Applied Regression Analysis and Other Multivariable Methods
7. L. D. Fisher & G. Van Belle: Biostatistics
8. Kirkwood B.: Essentials of Medical Statistics
9. A. R. Feinstein: Principles of Medical Statistics
10. R. G. Knapp & M. C. Miller: Clinical Epidemiology and Biostatistics
11. D. J. Sheskin: Handbook of Parametric and Nonparametric Statistical Procedures
12. Armitage P. & Berry G.: Statistical Methods in Medical Research
13. P. S. R. S. Rao: Sampling Methodologies with Applications
14. R. N. Forthofer & E. S. Lee: Introduction to Biostatistics



Introduction
• What is Statistics?
• Methods for collecting, organizing, presenting and analyzing data, and for drawing inferences about a body of data when only a part of the data is observed.



WHY WE NEED STATISTICS
• To present data in a concise and definite form: statistics helps in classifying and tabulating raw data for processing and for further tabulation for end users.

• To make large and complex data easy to understand: this is done by presenting the data in tables, graphs, diagrams, etc., or by condensing the data with measures such as means and dispersion.

• For comparison: tables and measures of means and dispersion help in comparing different sets of data.



WHY WE NEED STATISTICS
• To measure the magnitude of a phenomenon.
• Statistics has made it possible to count the population of a country and to measure its industrial growth, agricultural growth, educational level and health status.
• Everything in medicine, be it research, diagnosis or treatment, depends on counting or measurement:
– High or low blood pressure
– Pulse rate
– Incidence of disease
– Death rate
– Enlargement of the liver or spleen



Statistics and Health
• Biostatistics
• Health statistics
• Medical Statistics
• Vital Statistics
• These terms are not mutually exclusive.
What is biostatistics?
• The application of statistical methods to biological phenomena.



Why do we need biostatistics?
1. Main reason: handling variation
– Biological variation
• Attributes differ not only among individuals but also within the same individual over time
• Example: height, weight, blood pressure, eye color.
– Sampling variation
• Biomedical research projects are usually carried out on small numbers of study subjects



Why do we need to learn biostatistics?
2. Essential for scientific method of investigation
– Formulate hypothesis
– Design study to objectively test hypothesis
– Collect reliable and unbiased data
– Process and evaluate data rigorously
– Interpret and draw appropriate conclusions
3. Essential for understanding, appraisal and critique of
scientific literature



Examples of uses of biostatistics
• To define what is normal/ healthy in a population (Setting
limits of normality).

• To compare drug action (potency/efficacy)

• To confirm an association between two attributes, e.g. cancer and smoking, or socioeconomic status and malnutrition

• Usefulness of vaccines



Uses in Public Health Planning

• Recording of vital events

• Incidence/prevalence of disease.

• Leading causes of death/ morbidity in the community

• Demographic characteristics of a community.

• Health system research.



Application of Biostatistics
1. Statistical genetics
2. Numerical taxonomy
3. Statistical ecology
4. Statistical ethnology
5. Forest mensuration
6. Forest and agricultural yield tables
7. Biomass estimation
8. Statistical environment management
9. Demography
10. Medical sciences
11. Biological variation and uncertainties

Limitations of statistics
• Statistics does not deal with individual measurements. Since statistics deals with aggregates of facts, it cannot be used to study the changes that have taken place in individual cases.
• Statistics cannot be used directly to study qualitative phenomena such as morality, intelligence or beauty, because these cannot be quantified. However, it may be possible to analyze such problems statistically once they are expressed numerically.
• Statistical results are true only on average. Conclusions obtained statistically are not universal truths; they are true only under certain conditions. This is because statistics as a science is less exact than the natural sciences.



Limitation of statistics

• Statistical data should be treated as approximations or estimates, not as precise measurements.
• Statistical results might lead to fallacious conclusions.
• Only someone with a sound knowledge of statistical methods can handle statistical data efficiently.



Types of Statistics
Statistics
• Probability and sampling theory
• Descriptive statistics
  – Tabular representation
  – Diagrammatic representation
  – Measure of central tendency
  – Measure of variability
• Inferential statistics
  – Test of hypothesis
    · Non-parametric test
    · Parametric test
  – Estimation theory
    · Point estimation
    · Interval estimation

Population & Sample
• Target population: A collection of items that have
something in common for which we wish to draw
conclusions at a particular time.
• Study Population: The specific population from which data
are collected
• Sample: A subset of a study population, about which
information is actually obtained.
• Generalizability is a two-stage procedure: we want to be able to generalize from the sample to the study population and then from the study population to the target population



Population and sample
• Example: In a study of the prevalence of HIV among students at Addis Ababa University (AAU), a random sample of pharmacy students in the College of Health Sciences of AAU was included.
• Target population: all students at Addis Ababa University
• Study population: all students in the College of Health Sciences of AAU
• Sample: the pharmacy students in the College of Health Sciences of AAU who were included

[Figure: nested diagram showing the sample inside the study population, inside the target population]

Parameter and Statistic
• Parameter: A descriptive measure computed from the data of a population.
• Statistic: A descriptive measure computed from the data of a sample.



Scales of measurement
• Clearly, not all measurements are the same.
• Measuring an individual's weight is qualitatively different from measuring their response to some treatment on a three-category scale: “improved”, “stable”, “not improved”.
• Measurement scales differ in the degree of precision involved.

• There are four types of scales of measurement.



Scales of measurement
1. Nominal scale: uses names, labels, or symbols to assign
each measurement to one of a limited number of categories
that cannot be ordered.
Examples: Blood type, sex, race, marital status, Adolescence
stage, Color of cars.
2. Ordinal scale: assigns each measurement to one of a
limited number of categories that are ranked in terms of a
graded order.
• Examples: Patient status, Cancer stages, Socioeconomic
status, IQ of children.



Scales of measurement
3. Interval scale: assigns each measurement to one of an unlimited number of categories that are equally spaced. It has no true zero point.

Example: Temperature measured in Celsius or Fahrenheit

4. Ratio scale: measurement begins at a true zero point and the scale has equal spacing.

Examples: Height, weight, blood pressure



• Data: A collection of information, recorded on either individuals or groups.

• Variable: A characteristic which takes different values in different persons, places, or things.

Example: Animals of the same species may differ in their length, weight, age, sex, diastolic BP, heart rate, etc.



Types of variable
Qualitative/categorical variable: records which group or category an individual or observation belongs to; it classifies.
• It does not make sense to perform arithmetic on this type of variable.
Examples: gender, ethnic group, type of diagnosis (present or absent), etc.
Quantitative variable: a variable that has magnitude.
• A true numerical value; it indicates an amount and is often obtained from a measuring instrument.
• It makes sense to perform arithmetic on this type of variable. E.g. weight, length, age, etc.

Types of Variable
Discrete variable: It can only have a finite number
of values in any given interval.
– Indivisible units
– Restricted to whole numbers
– Can be counted
• Example.
– # of children in a family
– # of houses in a neighborhood
– # of patients discharged from the hospital on a given day



SUMMARY

Types of variables

• Qualitative or categorical
  – Nominal (not ordered), e.g. ethnic group
  – Ordinal (ordered), e.g. response to treatment
• Quantitative (measurement)
  – Discrete (count data), e.g. # of admissions
  – Continuous (real-valued), e.g. height

[Diagram label: Measurement scales]

Types of variable
Continuous variable: It can have an infinite number
of possible values in any given interval.
• Unlimited number of possible values
• Infinite number of values can fall b/n any 2
observed values
• No gaps between units
Examples: time taken to solve a problem, height, weight, temperature of patients



Sources of data
Routinely kept records
– Hospital medical records, accounting records
Survey
– Mode of transportation used by patients to visit the
clinic.
Experiments
– Best strategies for maximizing patient compliance.
External sources
– Already published data



Types of data
1. Primary source data: data collected by the investigator himself or herself for the purpose of a specific goal or study.

Example: data gathered from interviews, questionnaires, or field observation by the investigator or researcher.

2. Secondary source data: data that have already been collected by others. Secondary sources can be individuals or agencies that supply data originally collected, by them or by others, for other purposes.

• They are less expensive in time and cost than primary data.
• Usually they are published or unpublished materials, records, reports, etc.



Descriptive Statistics
• Techniques used to organize and summarize a
set of data in a concise way.
–Organization of data
–Summarization of data
–Presentation of data
• Numbers that have not been summarized
and organized are called raw data.



Descriptive cont...
• Statistics is used to organize and interpret
research observations and findings

• Before interpretation and communication of the findings, the raw data must be organized and presented in a clear and understandable way



Descriptive cont….
Ordered array: A simple arrangement of individual
observations in order of magnitude.
Frequency distribution: A table which involves a listing
of all observed values of the variable being studied and
how many times each value is observed.
a) Qualitative variable: Count the number of cases in each category.
b) Quantitative variable: Select a set of continuous, non-overlapping intervals such that each value in the set of observations can be placed in one, and only one, of the intervals.

Descriptive cont…
Frequency distribution:
• The actual summarization and organization of
data starts from frequency distribution.
• The distribution condenses the raw data into a
more useful form and allows for a quick visual
interpretation of the data.



Frequency distributions for
categorical variables
• Summarizing categorical variables (nominal and ordinal) is simple

• Count the number of observations (frequency) in each category and present them as relative frequencies (percentages)

• Often presented in the form of tables, bar charts and pie charts

Frequency , categorical cont...
• A relative frequency distribution: shows the
proportion of counts that fall into each class or
category
• A relative frequency value for any category is
obtained by dividing the number of
observations in that category by the total
number of observations.
• This can be reported as a percentage by
multiplying the resulting fraction by 100.
Cumulative frequency distribution
• Cumulative frequency: obtained when the frequencies of two or more classes are added, i.e. the frequency of a class plus the frequencies of all classes below it.

• Cumulative relative frequency: the percentage of the total number of observations that have a value either in that interval or below it.

• Mid-point: the value that lies midway between the lower and the upper limits of a class.



Cumulative frequency cont…
True limits (class boundaries): the limits that make an interval of a continuous variable continuous in both directions.

Used for smoothing of the class intervals.

Subtract 0.5 from the lower limit and add 0.5 to the upper limit.
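
As a rough illustration of true limits and mid-points, the sketch below applies the 0.5 adjustment to a few class intervals. The age classes and the function names are hypothetical, and the measurement unit is assumed to be 1, so that half a unit is 0.5.

```python
def class_boundaries(lower, upper, unit=1):
    """Return the true limits of a class with stated limits lower..upper.

    `unit` is the measurement precision (assumed to be 1 here), so half a
    unit is subtracted from the lower limit and added to the upper limit.
    """
    half = unit / 2
    return lower - half, upper + half


def class_midpoint(lower, upper):
    """Mid-point: the value midway between the lower and upper class limits."""
    return (lower + upper) / 2


# Hypothetical age classes, not taken from the notes.
for lo, hi in [(10, 19), (20, 29), (30, 39)]:
    print(lo, hi, class_boundaries(lo, hi), class_midpoint(lo, hi))
```

With these classes the true limits of adjacent intervals meet (19.5 closes one class and opens the next), which is what makes the intervals continuous.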



Frequency distributions
• Data contain information, and summarization is a way of making it easier to determine the nature of that information.
• Relative frequency distributions are most often used in scientific publications to describe quantitative data sets. They are better suited to the description of large data sets and they permit greater flexibility in the choice of class widths.
– A frequency distribution is a table that organizes data into classes.
– The classes are non-overlapping, i.e. classes without common items.



Guidelines for constructing tables
• Keep them simple
• Limit the number of variables to three or fewer
• All tables should be self-explanatory
• Include clear title telling what, when and where
• Clearly label the rows and columns
• State clearly the unit of measurement used
• Explain codes and abbreviations in the foot-note
• Show totals
• If data is not original, indicate the source in foot-note
• Example 1 The classification of students of a group by
the score on the subject “Statistical analysis” is presented
in Table 2.0a. The table of frequencies for the data set
generated by computer using the software SPSS is shown
in Figure 2.1.



             Frequency   Percent   Valid percent   Cumulative percent
Bad              6         13.3        13.3              13.3
Excellent       18         40.0        40.0              53.3
Good            15         33.3        33.3              86.7
Medium           6         13.3        13.3             100.0
Total           45        100.0       100.0
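
The percent and cumulative percent columns of such a table can be computed directly from the frequencies. The following is a minimal sketch using the four frequencies of Example 1; the function name is hypothetical, and rounding to one decimal place is an assumption about how the percentages are displayed.

```python
def frequency_table(counts):
    """Return rows of (category, frequency, percent, cumulative percent)."""
    total = sum(counts.values())
    rows = []
    running = 0  # running sum of frequencies for the cumulative column
    for category, freq in counts.items():
        running += freq
        percent = round(100 * freq / total, 1)
        cumulative = round(100 * running / total, 1)
        rows.append((category, freq, percent, cumulative))
    return rows


# Frequencies from Example 1 (score categories of 45 students).
scores = {"Bad": 6, "Excellent": 18, "Good": 15, "Medium": 6}
table = frequency_table(scores)
for row in table:
    print(*row)
```

Each percent is the relative frequency (frequency divided by the total) multiplied by 100, and the cumulative percent of the last category is always 100.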



Steps to follow to construct a grouped frequency
distribution.
1. Make sure that you have quantitative data
2. Find the range of the data
3. Determine the number of classes that you wish to have, or use Sturges' rule
4. Determine the width of the class
5. Determine the lower class limit of the first class and all the subsequent lower class limits
6. Write all the upper class limits of the classes
7. Finally, for each class, count the number of observations and construct the frequency distribution accordingly

• Example 3.6 Construct a frequency table for the data set of the above example on the age of 189 subjects.
K = 1 + 3.322 log10(n) ≈ 9 (use 6 for simplicity)
W = R/K ≈ 5.788 (use 10 for simplicity)
• where
K = number of class intervals, n = number of observations
W = width of the class interval
R = range, R = L - S, where L = the largest value and S = the smallest value in the set of observations.
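
The two formulas above can be sketched in code as follows. The function names are hypothetical, and rounding K up to the next whole number is an assumption (the notes only say K ≈ 9); the largest and smallest values in the width example are made up, since the raw ages are not reproduced here.

```python
import math


def sturges_classes(n):
    """Number of class intervals by Sturges' rule, K = 1 + 3.322 log10(n).

    Rounding up to a whole number of classes is an assumption.
    """
    return math.ceil(1 + 3.322 * math.log10(n))


def class_width(largest, smallest, k):
    """Class width W = R / K, where R = L - S is the range."""
    return (largest - smallest) / k


k = sturges_classes(189)            # n = 189 subjects, as in Example 3.6
w = class_width(80, 30, 5)          # hypothetical L = 80, S = 30, K = 5
print(k, w)
```

In practice the computed K and W are only starting points; as in the example, they are often replaced by rounder values that give convenient class limits.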

Remarks:

• All classes of a frequency table must be mutually exclusive.

• Classes may be open-ended when either the lower or the upper end of a quantitative classification scheme is limitless.

For example, class: age
– birth to 7, 8 to 15, ..., 64 to 71, 72 and older

– Classification schemes can be either discrete or continuous.



Diagrammatic Representation

A pictorial or graphic presentation of numerical data.



Graphical description of quantitative data:
Histogram and Polygon:
• There is an old saying that “one picture is worth a thousand words”.
• Indeed, statisticians have employed graphical techniques to describe sets of data more vividly.
• Bar charts and pie charts were presented before to describe qualitative data.
• For quantitative data summarized into frequency or relative frequency tables, however, histograms and polygons are used to describe the data.



Importance of diagrammatic representation
• More attractive than mere figures
• Required information can be obtained in less time and without mental strain
• Facilitates comparison
• Patterns of change in the data can be detected easily
• Stays in memory for longer
• Used to understand patterns and trends

Limitations of diagrams
• Cannot be used as a substitute for data
• Not an alternative to tabulation
• Accuracy is not ensured; diagrams give only an approximate idea
• When graphs are poorly designed, they not only fail to convey your message effectively but often mislead and confuse.



Diagrammatic……
Specific types of graphs include:
• For nominal and ordinal data:
– Bar graph
– Pie chart
• For quantitative data:
– Histogram
– Stem-and-leaf plot
– Box plot
– Scatter plot
– Line graph
• Others



Graphical description of qualitative data
• Bar graphs and pie charts are two of the most widely used
graphical methods for describing qualitative data sets.
• Bar graphs give the frequency (or relative frequency) of
each category
• Example 1.3a (Bar Graph)

[Figure 1.3: Bar graph showing the number of students in each category (Bad, Excellent, Good, Medium)]



Two-way table (Cross tabulation):

• This table shows two characteristics and is formed when either of the two variables (the caption or the stub) is divided into two or more parts.

• For instance, marital status and cervical cancer status can be presented in the following two-way table.

Marital status    Cervical cancer status
                  Positive    Negative
Single               49          47
Married             216         108
Widowed              87          86
Div/sep              15          45



Graphical (Diagrammatic) Presentation of Data.
• I. Bar Graph

• The bar graph is very commonly used and is well suited to representing qualitative data. The bars are vertical, their lengths are proportional to the corresponding numerical values, and they should be equally spaced.

• Example: the following data show the number of clinical nurses in a given woreda; they can be presented using different diagrams.

           Degree   Diploma   Certificate
Private      45       66         21
Gov't        48       46         12
NGO          12       24          4

[Graph 2.1: Bar graph of the number of clinical nurses in a given woreda, by employer (Private, Gov't, NGO) and qualification (Degree, Diploma, Certificate)]


Multiple bar graph

Sub-divided bar graph



III. Pie diagram (Pie chart)

• A pie chart shows the partitioning of a total into its component parts.

• The diagram is a circle, and the components are slices of the circle.

• The size of each slice represents the proportion of that component out of the total.

• The angle of a component X is calculated as:

\[ \text{Degree of } X = \frac{\text{value of component } X}{\text{total value of the components}} \times 360^\circ \]

Example: The following data show the marital status of 40 women who came to St. Paul HMMC for contraceptive services. Present the data using a pie diagram.

Marital status   Married   Widowed   Separated   Single
Frequency           8        12         16          4



• The angle of the slice for married women is calculated as:

\[ \text{Degree of married women} = \frac{\text{number of married women}}{\text{total women}} \times 360^\circ = \frac{8}{40} \times 360^\circ = 72^\circ \]

Similarly, the angles of the slices for widowed, separated and single women are 108°, 144° and 36°, respectively.
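
The same calculation can be written as a short sketch that converts each frequency into a slice angle. The function name is hypothetical; the frequencies are those of the 40 women in the example.

```python
def pie_angles(freqs):
    """Angle of each slice: (component value / total) x 360 degrees."""
    total = sum(freqs.values())
    return {name: 360 * f / total for name, f in freqs.items()}


# Marital status of the 40 women in the example.
marital_status = {"Married": 8, "Widowed": 12, "Separated": 16, "Single": 4}
angles = pie_angles(marital_status)
for name, angle in angles.items():
    print(name, angle)
```

The angles always add up to 360 degrees, which is a quick check that no category has been left out.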
[Graph 2.3: Pie diagram of the marital status of 40 women who came for contraceptive service to St. Paul HMMC: Married 20%, Widowed 30%, Separated 40%, Single 10%]

Pie charts
Divide a complete circle (a pie) into slices, each
corresponding to a category, with the central angle and
hence the area of the slice proportional to the category
relative frequency.
Example 1.4b (Pie Chart)

Figure 1.3 Pie chart showing the number of students of each category

Graphical description of quantitative data:
• Stem and Leaf displays
• Widely used in exploratory data analysis when the data set
is small.
• In order to explain what a stem is and what a leaf is, we consider the data from Table 1.4.1 (Daniel, Biostatistics: A Foundation for Analysis in the Health Sciences).
• Steps to follow in constructing a stem and leaf display:
– Divide each observation in the data set into two parts, the stem and the leaf.
– List the stems in order in a column, starting with the smallest stem and ending with the largest.
– Proceed through the data set, placing the leaf for each observation in the appropriate stem row.
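
The three steps can be sketched in code as below. This is a minimal illustration, assuming non-negative whole-number observations with the tens digit(s) as the stem and the units digit as the leaf; the ages are hypothetical and are not the values of Table 1.4.1.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group each value into stem (tens) and leaf (units), stems in order.

    Stems that have no observations are simply omitted in this sketch.
    """
    stems = defaultdict(list)
    for v in sorted(values):          # sorting keeps leaves ordered in each row
        stem, leaf = divmod(v, 10)    # step 1: split into stem and leaf
        stems[stem].append(leaf)      # step 3: place the leaf in its stem row
    return dict(sorted(stems.items()))  # step 2: list stems smallest to largest


# Hypothetical ages, for illustration only.
ages = [34, 48, 52, 30, 45, 61, 58, 47, 39, 55]
for stem, leaves in stem_and_leaf(ages).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Because every leaf is kept, the original observations can be read back from the display, and the number of leaves in a row is the frequency of that class.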
Example 1.5

Table 1.4.1 contains a list of the ages of subjects who participated in the study on smoking cessation discussed in Example 1.4.1. As can be seen, this unordered table requires considerable searching for us to ascertain such elementary information as the age of the youngest and oldest subjects.

The stem and leaf display of Figure 2.3.8 partitions the data set into 11 classes corresponding to 11 stems; here, two-line stems are used. The number of leaves in each class gives the class frequency.
Advantages of a stem and leaf display over a frequency distribution (considered in the next section):
1. The original data are preserved.
2. A stem and leaf display arranges the data in an orderly fashion and makes it easy to determine certain numerical characteristics to be discussed in the coming topics.
3. The classes and the numbers falling in them are quickly determined once we have selected the digits that we want to use for the stems and leaves.

Histogram
• When plotting histograms, the phenomenon of interest is plotted along the horizontal axis, while the vertical axis represents the number, proportion or percentage of observations per class interval, depending on whether the histogram is a frequency histogram, a relative frequency histogram or a percentage histogram, respectively.

• Histograms are essentially vertical bar charts in which the rectangular bars are constructed at the midpoints of the classes.

• Example 3.7 Below we present the frequency histogram
for the data set considered above, for which the
frequency table is constructed in Table 2.3.2.



• Remark: When comparing two or more sets of data, the various histograms cannot be constructed on the same graph, because superimposing the vertical bars of one on another would cause difficulty in interpretation.

• For such cases it is necessary to construct relative frequency or percentage polygons.



Polygons
• As with histograms, when plotting polygons the phenomenon of interest is plotted along the horizontal axis, while the vertical axis represents the number, proportion or percentage of observations per class interval, depending on whether the polygon is a frequency polygon, a relative frequency polygon or a percentage polygon, respectively.
• For example, the frequency polygon is a line graph connecting the midpoints of each class interval in a data set, plotted at a height corresponding to the frequency of the class.



• Example 3.8 Figure 2.3.4 is a frequency
polygon constructed from data in Table 2.3.2.



Cumulative distributions and cumulative polygons
• Other useful methods of presentation which facilitate
data analysis and interpretation are the construction of
cumulative distribution tables and the plotting of
cumulative polygons.
• A cumulative frequency distribution enables us to see
how many observations lie above or below certain
values, rather than merely recording the number of items
within intervals.



Ogive curve
• We may, for example, be interested in knowing
the number of patients whose weight is less than
50 Kg or more than say 60 Kg.
• To get this information it is necessary to change the form
of the frequency distribution from a ‘simple’ to a
‘cumulative’ distribution.
• An ogive curve turns a cumulative frequency distribution into a graph.
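
Changing a simple frequency distribution into a cumulative one amounts to taking running totals. The sketch below does this for hypothetical weight classes (the boundaries and frequencies are made up) and pairs each upper class boundary with its cumulative frequency, which gives the points an ogive would connect.

```python
from itertools import accumulate


def cumulative_frequencies(freqs):
    """Running totals of the class frequencies ('less than' cumulation)."""
    return list(accumulate(freqs))


# Hypothetical weight classes (kg): upper class boundaries and frequencies.
upper_boundaries = [49.5, 59.5, 69.5, 79.5]
freqs = [4, 10, 8, 3]

cum = cumulative_frequencies(freqs)
ogive_points = list(zip(upper_boundaries, cum))
print(ogive_points)
```

Read this way, the first point says that 4 patients weigh less than 49.5 kg; the number weighing more than a boundary is the total minus the cumulative frequency at that boundary.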



• Example: Heart rate of patients admitted in hospital Y, 2013.
Box and Whisker plot
• It is another way to display information when the
objective is to illustrate certain locations in the
distribution.
• A box is drawn with the top of the box at the third
quartile and the bottom at the first quartile.
• The location of the mid‐point of the distribution is
indicated with a horizontal line in the box.
• Finally, straight lines, or whiskers, are drawn from the
centre of the top of the box to the largest observation and
from the centre of the bottom of the box to the smallest
observation.
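
The five values a box and whisker plot needs (smallest observation, first quartile, median, third quartile, largest observation) can be computed as in the sketch below. It uses Python's standard `statistics.quantiles` with the `inclusive` method; this is one of several quartile conventions, so other software may give slightly different quartiles. The data and the function name are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a box and whisker plot."""
    # n=4 splits the data into quartiles and returns the three cut points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical observations, for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(five_number_summary(data))
```

The box spans Q1 to Q3, the line inside the box is the median, and the whiskers run out to the minimum and maximum.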



[Figure: A box and whisker diagram]


Scatter plot
• Most studies in medicine involve measuring more than
one characteristic, and graphs displaying the relationship
between two characteristics are common in the literature.
• When both the variables are qualitative then we can use a
multiple bar graph.
• When one of the characteristics is qualitative and the other
is quantitative, the data can be displayed in box and
whisker plots.
• To illustrate the relationship between two characteristics
when both are quantitative variables we use bivariate
plots (also called scatter plots or scatter diagrams).



Scatter plot



Line graph
• Useful for assessing the trend of a particular situation over time.
• Helps in monitoring the trend of epidemics.
• Time, in weeks, months or years, is marked along the horizontal axis.
• Values of the quantity being studied are marked on the vertical axis.
• Values for each category are connected by a continuous line.
• Sometimes two or more lines are drawn on the same graph using the same scale so that they are comparable.

Example: Infant and under-five mortality rates in Ethiopia, 1970-2005 (Tefera Darge 2011; EDHS, 2000, 2005)

        1970-75   1975-80   1980-85   1985-90   1990-95   1995-2000   2000-05
IMR       239      219.4     199.5      190       165        141        123
U5MR      160      138.8     127        104.8      95         83         77

[Graph 2.4: Line graph of infant and under-five mortality rates in Ethiopia, 1970-2005 (Tefera Darge 2011; EDHS, 2000, 2005)]


[Figure: Line graph of the number of microscopically confirmed malaria cases by species (all positive, P. falciparum, P. vivax) and month at Zeway malaria control unit, 2003. Vertical axis: number of confirmed malaria cases; horizontal axis: months (Jan to Dec)]

Line graph cont..
• A line graph can also be used to depict the relationship between two continuous variables, like a scatter diagram.

The following graph shows the level of zidovudine (AZT) in the blood of AIDS patients at several times after administration of the drug, for patients with normal fat absorption and for patients with fat malabsorption.

Line graph cont…..
[Figure: Response to administration of zidovudine in two groups of AIDS patients (fat malabsorption, normal fat absorption) in hospital X, 1999. Vertical axis: blood zidovudine concentration; horizontal axis: time since administration (min)]


Descriptive summary statistics
• Introduction
• In the previous section data were collected and
appropriately summarized into tables and charts.
• Next a variety of descriptive summary measures will be
developed.
• These descriptive measures are useful for analyzing and
interpreting quantitative data, whether collected in raw
form (ungrouped data) or summarized into frequency
distributions (grouped data)



Types of numerical descriptive measures
Four types of characteristics which describe a
data set pertaining to some numerical variable or
phenomenon of interest are:
1. Location
2. Dispersion
3. Relative standing
4. Shape



numerical descriptive measures
• In any analysis and/or interpretation of numerical data, a
variety of descriptive measures representing the properties
of location, variation, relative standing and shape may be
used to extract and summarize the salient features of the
data set.
• If these descriptive measures are computed from a sample
of data, they are called statistics. In contrast, if these
descriptive measures are computed from an entire
population of data, they are called parameters.
• Since statisticians usually take samples rather than use
entire populations, our primary emphasis deals with
statistics rather than parameters.



Measures of central tendency (MCT)
• On the scale of values of a variable there is a
certain stage at which the largest number of items
tend to cluster.
• Since this stage is usually in the centre of the
distribution, the tendency of the statistical data to
get concentrated at certain values is called “central
tendency”.
• The various methods of determining the actual
value at which the data tends to concentrate are
called measures of central tendency.

Biostatistics -Notes WA , SPH AAU ,2016


Measures of central tendency (MCT)
• The most important objective of calculating measure of
central tendency is to determine a single figure which may
be used to represent a whole series involving magnitude of
the same variable.
• In that sense it is an even more compact description of the
statistical data than the frequency distribution.
• Since a measure of central tendency represents the entire
data set, it facilitates comparison within one group or between
groups of data.




Characteristics of a good measure of central tendency
A measure of central tendency is good or satisfactory if it
possesses the following characteristics.
1. It should be based on all the observations
2. It should not be affected by the extreme values
3. It should be as close to the maximum number of values
as possible
4. It should have a definite value
5. It should not require complicated or tedious
calculations
6. It should be capable of further algebraic treatment
7. It should be stable with regard to sampling


Measures of location (or central tendency)
A. Arithmetic Mean
a) Ungrouped data
• Sample mean:
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• Population mean:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{\text{sum of the values of all observations in the population}}{\text{total number of observations in the population}}$$


Arithmetic Mean
b) Grouped data
• In calculating the mean from grouped data, we assume
that all values falling into a particular class interval are
located at the mid-point of the interval. It is calculated as
follows:
$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$
where,
k = the number of class intervals
m_i = the mid-point of the ith class interval
f_i = the frequency of the ith class interval
Example 1: If the marks of five medical students are 80, 75, 60, 50 and 90, the mean
mark of the students is calculated as:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \dots + X_5}{5} = \frac{80 + 75 + 60 + 50 + 90}{5} = 71$$
• Therefore, the mean mark of the students is 71.
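
The calculation in Example 1 can be checked with a few lines of Python. This is only an illustrative sketch; the function name `arithmetic_mean` is not from the notes. It also shows that the deviations from the mean add up to zero.

```python
# Marks of the five medical students from Example 1.
marks = [80, 75, 60, 50, 90]


def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


mean_mark = arithmetic_mean(marks)
print(mean_mark)  # 71.0

# Deviations of each mark from the mean; their sum is zero.
deviations = [x - mean_mark for x in marks]
print(sum(deviations))  # 0.0
```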

Exercise: You measure the body lengths (in inches) of 10 full-term infants at
birth and record the following: 17.5, 19.5, 17.5, 19, 20, 21, 18,
19.5, 18, 10.75. Compute the mean length of the infants for these
data.



Example 2: The ages of the patients diagnosed on a given day are given below.
Compute the mean age of the patients diagnosed that day.

Age of patients (year)    54    64    74    84    94    104
Number of patients         2     3     5    10     3      2

The mean age of the patients can be calculated as:
$$\bar{X} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i} = \frac{x_1 f_1 + x_2 f_2 + \dots + x_k f_k}{f_1 + f_2 + \dots + f_k}$$
$$\bar{X} = \frac{54(2) + 64(3) + 74(5) + 84(10) + 94(3) + 104(2)}{2 + 3 + 5 + 10 + 3 + 2} = \frac{2000}{25} = 80$$
Hence, the mean age of the patients diagnosed on that day was 80 years.
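
A minimal sketch of the same frequency-weighted calculation in Python (the helper name `frequency_mean` is illustrative): each age is multiplied by its frequency, the products are summed (2000), and the total is divided by the number of patients (25).

```python
# Ages and number of patients from Example 2.
ages = [54, 64, 74, 84, 94, 104]
counts = [2, 3, 5, 10, 3, 2]


def frequency_mean(values, freqs):
    """Mean of values that each occur with a given frequency."""
    total = sum(x * f for x, f in zip(values, freqs))
    return total / sum(freqs)


mean_age = frequency_mean(ages, counts)
print(mean_age)  # 80.0
```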



Properties of the arithmetic mean
• For a given set of data there is one and only one arithmetic
mean.
• The arithmetic mean is easily understood and easy to
compute.
• The algebraic sum of the deviations of the given values from
their arithmetic mean is always zero.
• The arithmetic mean possesses all the characteristics of a
good measure of central tendency except No. 2: it is greatly
affected by extreme values.
• In the case of grouped data, if any class interval is open-ended,
the arithmetic mean cannot be calculated.


Mean Sensitive to Outliers
[Figure: histogram of nights of stay (x-axis from 0 to 50 nights); annotated means: 12.0 and 15.3]


Advantage and disadvantage of Mean

Advantage
• Mathematical center of a distribution.
• Just as far from scores above it as it is from scores below it.
• Good for interval and ratio data.
• Includes all the values of the data set and is unique.
• Inferential statistics is based on mathematical properties of the mean.

Disadvantage
• It is affected by extreme values and skewed distributions that are not
representative of the rest of the data.
• May not exist in the data.
Measures of location (or central tendency)
• Example: Consider 189 subjects with values 48, 35, …, 66.

By definition, the mean is calculated as:
$$\bar{x} = \frac{48 + 35 + \dots + 66}{189} = 55.032$$


Median
Definition 3.2
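• The median m of a sample of n observations arranged in
ascending or descending order is the middle number that
divides the data set into two equal halves: one half of the
items lie above this point, and the other half lie below it.
a) Ungrouped data
With the ordered observations written as x_1 ≤ x_2 ≤ … ≤ x_n:
$$\tilde{x} = \text{Median} = \begin{cases} x_k & \text{if } n = 2k - 1 \ (n \text{ is odd}) \\ \frac{1}{2}\left(x_k + x_{k+1}\right) & \text{if } n = 2k \ (n \text{ is even}) \end{cases}$$
Equivalently, in terms of positions in the ordered data:
$$\text{median}(X) = \begin{cases} \left(\frac{n+1}{2}\right)^{\text{th}} \text{ value} & \text{when } n \text{ (size of the data) is odd} \\ \frac{1}{2}\left[\left(\frac{n}{2}\right)^{\text{th}} + \left(\frac{n+2}{2}\right)^{\text{th}}\right] \text{ value} & \text{when } n \text{ is even} \end{cases}$$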
Example: Find the median of 6, 2, 7, 13, 4, 9, 15, 1, 12.

Arrange the data in increasing order: 1, 2, 4, 6, 7, 9, 12, 13, 15.

The sample size is n = 9 (odd), so the median is the $\left(\frac{n+1}{2}\right)^{\text{th}}$ value.
The median of the data is therefore the $\left(\frac{9+1}{2}\right)^{\text{th}}$ value, i.e. the 5th value, which is 7.

Exercise: Compute the sample median for the following birth weight data:
3265, 3314, 2581, 2759, 2834, 2838, 2841, 3031, 3200, 3245, 3260, 3323,
4146, 3609, 3484, 3101, 3248, 2069, 3649, 3541.


Median
• In calculating the median from grouped data, we assume
that the values within a class‐interval are evenly
distributed through the interval.
• The first step is to locate the class interval in which the
median lies. We use the following procedure.
• Find n/2 and identify the class interval with the smallest
cumulative frequency that is at least n/2.
• To find a unique median value, use the following
interpolation formula:
$$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$$
• where,
L_m = lower true class boundary of the interval containing the
median
F_c = cumulative frequency of the interval just before the
median class interval (the row above it in the frequency table)
f_m = frequency of the interval containing the median
W = class interval width
n = total number of observations
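
The procedure can be sketched in Python using the standard interpolation formula, median = Lm + ((n/2 − Fc)/fm) × W, with the symbols defined above. The frequency table below is hypothetical, made up only for illustration, and the function name `grouped_median` is an assumption.

```python
def grouped_median(lower_bounds, freqs, width):
    """Median of grouped data by linear interpolation within the median class.

    lower_bounds: lower true class boundary of each interval, in order
    freqs:        frequency of each interval
    width:        common class interval width W
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("frequency table is empty")
    cumulative = 0  # Fc: cumulative frequency before the current class
    for lower, f in zip(lower_bounds, freqs):
        if cumulative + f >= n / 2:  # first class whose cumulative frequency reaches n/2
            return lower + (n / 2 - cumulative) / f * width
        cumulative += f


# Hypothetical table: classes 9.5-19.5, 19.5-29.5, 29.5-39.5, 39.5-49.5
lower_bounds = [9.5, 19.5, 29.5, 39.5]
freqs = [5, 8, 12, 5]  # n = 30, so n/2 = 15 falls in the third class
print(grouped_median(lower_bounds, freqs, 10))  # 29.5 + (15 - 13)/12 * 10, about 31.17
```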



Properties of the median
• There is only one median for a given set of data
• The median is easy to calculate
• Extreme values in data set do not affect the median as
strongly as they do the mean.
• Median can be calculated even in the case of open end
intervals
• It is not a good representative of data if the number of
items is small



Advantage and Disadvantage of Median

Advantage
• Not influenced by extreme scores or skewed distributions.
• Good for ordinal data.
• Easier to compute than the mean.

Disadvantage
• May not exist in the data.
• Does not take the actual values into account.


• Example 1: Find the median of the data set consisting of the observations
7, 4, 3, 5, 6, 8, 10.
Solution: First, we arrange the data set in ascending order:
3 4 5 6 7 8 10.
Since the number of observations is odd, n = 2 × 4 − 1 = 7, the median is m = x4
= 6. Half of the remaining observations, namely 3, 4 and 5, lie below the
value 6 and the other half, namely 7, 8 and 10, lie
above the value 6.
Example 2: Suppose we have an even number of observations: 7, 4, 3,
5, 6, 8, 10, 1. Find the median of this data set.
Solution: First, we arrange the data set in ascending order:
1 3 4 5 6 7 8 10.
Since the number of observations is even, n = 2 × 4 = 8, by definition
Median = (x4 + x5)/2 = (5 + 6)/2 = 5.5
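
Both cases (odd and even n) can be sketched as a small Python function; the name `median` is illustrative and the data are those of Examples 1 and 2.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # n odd: the ((n + 1)/2)th value
    return (ordered[mid - 1] + ordered[mid]) / 2  # n even: mean of (n/2)th and (n/2 + 1)th


print(median([7, 4, 3, 5, 6, 8, 10]))     # 6
print(median([7, 4, 3, 5, 6, 8, 10, 1]))  # 5.5
```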



Mode
a) Ungrouped data
• The mode of a data set is the value that occurs with
the greatest frequency, i.e., is repeated most often in the
data set.
• If all the values are different there is no mode; on the
other hand, a set of values may have more than one
mode.
b) Grouped data
• In designating the mode of grouped data, we usually refer
to the modal class, where the modal class is the class
interval with the highest frequency.
• Mode = lm + [A1/(A1 + A2)]W, where lm is the lower boundary of the modal
class, A1 is the difference between the frequency of the modal class and the
frequency of the class immediately above (before) it, A2 is the difference
between the frequency of the modal class and the frequency of the class
immediately below (after) it, and W is the width of the class interval.
Properties of mode
• The mode can be used as a summary measure for
nominal, ordinal, discrete and continuous data; in
general, however, it is more appropriate for nominal
and ordinal data.
• It is not affected by extreme values
• It can be calculated for distributions with open end
classes
• Often its value is not unique
• The main drawback of mode is that often it does not
exist
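
A short sketch of how the mode of ungrouped data behaves, including the cases of several modes and of no mode. The function name `modes` and the three data sets are hypothetical, chosen only for illustration.

```python
from collections import Counter


def modes(values):
    """Return all most frequent values, or an empty list if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:  # every value occurs once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)


print(modes([2, 3, 3, 5, 7]))   # [3]      one mode
print(modes([1, 1, 2, 2, 3]))   # [1, 2]   two modes (bimodal)
print(modes([4, 8, 15, 16]))    # []       all values differ, so no mode
```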



Advantage and Disadvantage of Mode

Advantage
• Good for nominal data.
• Like the median, the mode is not unduly affected by extreme values.
• We can use the mode no matter how large, how small, or how spread out the
values in the data set happen to be.
• Easiest to compute and understand.

Disadvantage
• The mode is not used as often to measure central tendency as are the mean
and the median.
• Too often, there is no modal value because the data set contains no values
that occur more than once.
• Ignores most of the information in a distribution.
• When data sets contain two, three, or many modes, they are difficult to
interpret and compare.
Example: Find the mode of the data set given below.


Geometric mean (GM)
• If x_1, x_2, ..., x_n are n positive observed values, then
$$GM = \sqrt[n]{x_1 x_2 \cdots x_n}$$
and
$$\log GM = \frac{1}{n}\sum_{i=1}^{n} \log x_i$$
• The geometric mean is generally used with data
measured on a logarithmic scale.


Harmonic mean (HM)
• Just as the geometric mean is based on an arithmetic
mean of logarithms, so the harmonic mean is based on an
arithmetic mean of reciprocals.
• We define it as the reciprocal of the arithmetic mean of
the reciprocals of the given numbers.
• If the given numbers are x_1, x_2, ..., x_n, then
$$HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$


Weighted mean (WM)
• If the given numbers are x_1, x_2, ..., x_k and have
known weights w_1, w_2, ..., w_k, then
$$WM = \frac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i}$$
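
The three means can be sketched in Python using their standard definitions: the geometric mean as the antilog of the mean of logarithms, the harmonic mean as the reciprocal of the mean of reciprocals, and the weighted mean as the weight-scaled sum divided by the total weight. The function names and the small data sets are hypothetical, for illustration only.

```python
import math


def geometric_mean(values):
    """Antilog of the arithmetic mean of the logarithms (values must be positive)."""
    return math.exp(sum(math.log(x) for x in values) / len(values))


def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(values) / sum(1 / x for x in values)


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


print(geometric_mean([2, 8]))            # about 4, since sqrt(2 * 8) = 4
print(harmonic_mean([40, 60]))           # about 48, e.g. average speed over two equal distances
print(weighted_mean([70, 90], [1, 3]))   # 85.0
```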



Comparing the Mean, Median and Mode
• In general, the three measures of central tendency of a data set
(the mean, the median and the mode) are different. For
example, for the data set on age of 189 subjects, mean
= 55.032, median = 54 and mode = 53.
• If all observations in a data set are arranged symmetrically
about a single value, then that value is the mean and the
median and, for a single-peaked distribution, also the mode.
• Which of these three measures of central tendency is
best? The best measure of central tendency for a data set
depends on the type of descriptive information you want.


Percentiles and Quartiles

• The quartiles are the set of values which divide the distribution
into four parts such that there is an equal number of
observations in each part.
– Q1 = [(n+1)/4]th observation
– Q2 = [2(n+1)/4]th observation
– Q3 = [3(n+1)/4]th observation


Percentiles and Quartiles
• Percentiles divide the data into 100 parts with an equal number of
observations in each part.
• It follows that the 25th percentile is the first quartile, the 50th
percentile is the median and the 75th percentile is the third
quartile.
• Position of the pth percentile = p(n+1), where p is the required percentile
expressed as a proportion (e.g. 0.25 for the 25th percentile)


Percentile Cont....

The pth percentile is a value that is ≥ p% of the
observations and ≤ the remaining (100 − p)%.
With p written as a proportion, the pth percentile is:
– The observation at position p(n+1), if
p(n+1) is an integer
– The average of the kth and (k+1)th observations if
p(n+1) is not an integer, where k is the largest
integer less than p(n+1).
• If p(n+1) = 3.6, take the average of the 3rd and 4th observations.
• P50 = 50th percentile = Q2, P25 = 25th percentile = Q1
Example
Given a sample of size n = 60, find the 10th
percentile of the data set.
p(n+1) = 0.10(60+1) = 6.1, so the 10th percentile is the
average of the 6th and 7th ordered observations.
10% of the observations are less than or equal to this
value and 90% of them are greater than or equal to
it.


Exercise: Birth weight (gm) data for 20 infants
2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101,
3200, 3245, 3248, 3260, 3265, 3314, 3323, 3484,
3541, 3609, 3649, 4146
Question
1. Compute Q3 and the 10th and 90th percentiles.
Answer
Q3: 0.75(20+1) = 15.75 = Average of 15th and 16th value =
(3323+3484)/2 = 3403.5gm
10th percentile = 0.1(20+1) = 2.1 = Average of 2nd and 3rd value =
(2581+2759)/2 = 2670gm
90th percentile = 0.9(20+1) = 18.9 = Average of 18th and 19th value
= (3609+3649)/2 = 3629gm
Alternatively, linear interpolation between neighbouring values may be used:
Q3 = 15th + 0.75(16th − 15th) = 3323 + 0.75(3484 − 3323) = 3443.75gm
10th percentile = 2nd + 0.1(3rd − 2nd) = 2581 + 0.1(2759 − 2581) = 2598.8gm
We estimate that 80% of birth weights fall between 2670 and 3629gm.
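
The p(n+1) rule with averaging of the two neighbouring observations can be sketched as follows. The function name `percentile` is illustrative, and the rounding of the position is an assumption added to guard against floating-point noise. Statistical software often uses other interpolation rules, so its results may differ slightly.

```python
import math

birth_weights = [2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
                 3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]


def percentile(values, p):
    """pth percentile (p as a proportion) using the position p(n + 1)."""
    ordered = sorted(values)
    n = len(ordered)
    position = round(p * (n + 1), 9)  # rounding guards against floating-point noise
    if position < 1 or position > n:
        raise ValueError("percentile position falls outside the data")
    k = math.floor(position)
    if position == k:
        return ordered[k - 1]  # integer position: take that observation
    return (ordered[k - 1] + ordered[k]) / 2  # otherwise average the kth and (k+1)th


print(percentile(birth_weights, 0.10))  # 2670.0
print(percentile(birth_weights, 0.90))  # 3629.0
print(percentile(birth_weights, 0.75))  # 3403.5
```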



Percentiles
• Simply divide the data into 100 pieces.
• Percentiles are not dependent on the distribution of the
data.



Using measures of central tendency
• Given a set of observations, an investigator may naturally
ask which measure of central tendency is best to use with
the data.
• Two factors are important in making this decision:
1. The scale of measurement
2. The shape of the distribution of observations



1. The arithmetic mean is used for interval and ratio data
and for symmetric distribution.
2. The median and quartiles are used for ordinal, interval
and ratio data whose distribution is skewed.
3. For nominal data the mode is the appropriate MCT.
4. The geometric mean is used primarily for observations
measured on a logarithmic scale.
5. The harmonic mean is a suitable MCT when the data
pertain to rates and time.
6. The weighted mean is commonly used in the construction of
index numbers.


Measures of variability
• The measure of central tendency alone is not
enough to have a clear idea about the distribution
of the data.
• Moreover, two or more sets may have the same
mean and/or median but they may be quite
different.
• Thus, to have a clear picture of the data, one needs
a measure of dispersion or variability
(scatter) amongst the observations in the set.


Measures of variability
• Reporting only an average without an accompanying measure
of variability may misrepresent a set of data.
– Two data sets can have the same average but very
different variability.


Variation is important: a non-statistician drowning in a river of average depth 0.3 meter.
Objectives of Measuring Variation
1. To judge the reliability of a measure of central tendency
2. To compare two or more sets of data with regard to their
variability
3. To control variability itself, as in quality control, body
temperature, etc.
4. To make further statistical analysis or to facilitate the use of
other statistical measures.
Range (R)
• R = xmax – xmin, where
xmax is the largest value and xmin is the smallest value.
Example: For the data set 100, 95, 125, 45, 70, the range is calculated
as:
R = xmax – xmin
R = 125 – 45
Range = 80.
Properties of Range
• Range and relative range are easy to calculate and simple to understand.
• Both cannot be computed for grouped data with open ended classes.
• They do not tell us anything about the distribution of values in the
series.
Exercise 1: Find the range of the monthly salaries of ten workers in a certain
health center, given below: 462, 480, 534, 624, 498, 552, 606, 588, 516,
570.
Interquartile range (IQR)
• IQR = Q3 ‐ Q1, where
Q3 is the third quartile and Q1 is the first quartile.
Example: Suppose the first and third quartiles for weights of
girls 12 months of age are 8.8 Kg and 10.2 Kg respectively.
The interquartile range is therefore
IQR = 10.2 Kg – 8.8 Kg = 1.4 Kg,
i.e., 50% of infant girls at 12 months weigh between
8.8 and 10.2 Kg.
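
As a quick arithmetic check, the range example (100, 95, 125, 45, 70) and the interquartile range of the quoted quartiles can be computed as below. This is a minimal sketch; the variable names are illustrative.

```python
# Range: largest value minus smallest value.
values = [100, 95, 125, 45, 70]
data_range = max(values) - min(values)
print(data_range)  # 80

# Interquartile range from the quoted quartiles (Kg).
q1, q3 = 8.8, 10.2
iqr = q3 - q1
print(round(iqr, 1))  # 1.4
```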



Properties
• It is a simple and versatile measure
• It encloses the central 50% of the observations
• It is not based on all observations but only on two specific
values
• It is important in selecting cut‐off points in the formulation
of clinical standards
• Since it excludes the lowest and highest 25% values, it is
not affected by extreme values
• It is not capable of further algebraic treatment



Quartile deviation (QD)
• QD = (Q3 - Q1)/2

Coefficient of quartile deviation (CQD)
• CQD = (Q3 - Q1)/(Q3 + Q1)
• CQD is a pure number (unitless) and is useful for
comparing the variability among the middle 50% of
observations


Variance and standard deviation
• A measure of dispersion based on the scatter of the values
about their mean.
• The population variance of a population of N
observations x_1, x_2, ..., x_N with mean μ is defined by the formula
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
• The variance is the average of the squares of the deviations
taken from the mean.
• That is, the sum of squared deviations from the mean divided by
the number of deviations gives the average squared
deviation, known as the variance


Sample Variance
• For a sample, the sum of squared deviations from the sample mean is
divided by n − 1 rather than by the number of deviations n:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Why divide by n‐1
• Samples give us estimates of population parameters
(population mean and variance)
• Dividing by n underestimates the population variance on average, and
this is easily demonstrated
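
One way to demonstrate this is by simulation. The sketch below is illustrative only: the population, the sample size and the number of trials are all hypothetical choices. It draws many small samples and averages the two versions of the variance. On average, the divide-by-n version comes out near (n − 1)/n times the population variance, while the divide-by-(n − 1) version comes out close to it.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 values.
population = [random.gauss(100, 15) for _ in range(10_000)]
mu = sum(population) / len(population)
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)


def variance(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / divisor


n = 5            # small sample size, where the bias is easy to see
trials = 20_000
avg_div_n = 0.0
avg_div_n_minus_1 = 0.0
for _ in range(trials):
    sample = random.choices(population, k=n)  # simple random sample with replacement
    avg_div_n += variance(sample, n) / trials
    avg_div_n_minus_1 += variance(sample, n - 1) / trials

print("population variance:      ", round(sigma2, 1))
print("average when dividing by n:  ", round(avg_div_n, 1))
print("average when dividing by n-1:", round(avg_div_n_minus_1, 1))
```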



Another feature about n‐1
• In many statistical tests we combine variances from groups, and
for each one we lose what is referred to as a
degree of freedom.
• As noted already, in order to make estimates from samples
to a population certain conditions have to be met.
• An additional one is that the deviation
scores around the mean must add up to zero.
• For each sample estimate we therefore lose a degree of
freedom: all numbers on which the estimate is based are
free to vary except one.


Variance and standard deviation
(A) Ungrouped data
• Let X_1, X_2, ..., X_N be the measurements on N
population units with mean μ; then
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
• The sample variance of the set x_1, x_2, ..., x_n of n
observations is
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
(B) Grouped data
$$s^2 = \frac{\sum_{i=1}^{k} f_i (m_i - \bar{x})^2}{\sum_{i=1}^{k} f_i - 1}$$
• Where
m_i = the mid‐point of the ith class interval
f_i = the frequency of the ith class interval
$\bar{x}$ = the sample mean
k = the number of class intervals


Example: The blood sugar levels of a small population are 80, 70, 95, 100, 125.
Calculate the variance and standard deviation of the data.

Solution: As the data are collected from the whole population, the variance is calculated
using:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
• But first the mean is calculated as:
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{80 + 70 + 95 + 100 + 125}{5} = \frac{470}{5} = 94$$
• To calculate the variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{(80 - 94)^2 + (70 - 94)^2 + (95 - 94)^2 + (100 - 94)^2 + (125 - 94)^2}{5} = \frac{1770}{5} = 354$$
• The standard deviation will be:
$$S.D = \sigma = \sqrt{\text{variance}} = \sqrt{354} = 18.8$$
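
The same population variance and standard deviation can be reproduced with a short Python sketch (the function name `population_variance` is illustrative):

```python
import math

blood_sugar = [80, 70, 95, 100, 125]


def population_variance(values):
    """Average squared deviation from the population mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


sigma2 = population_variance(blood_sugar)
sigma = math.sqrt(sigma2)
print(sigma2)           # 354.0
print(round(sigma, 1))  # 18.8
```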


For grouped data with frequencies, the population variance is
calculated as:
$$\sigma^2 = \frac{\sum f_i (X_i - \mu)^2}{N}, \quad \text{where } N = \sum f_i$$
The standard deviation is the square root of the variance,
i.e. $S.D = \sigma = \sqrt{\text{variance}}$

• Example: In a study, the weights of six newborn babies were recorded
as below. Find the variance and S.D.

Weight (Kg)    1.5    2.5    3
Frequency       2      3     1

Solution: Before calculating the variance, the mean weight of the
babies is calculated:
$$\mu = \frac{\sum X_i f_i}{\sum f_i} = \frac{1.5(2) + 2.5(3) + 3(1)}{2 + 3 + 1} = \frac{13.5}{6} = 2.25$$
$$\sigma^2 = \frac{\sum f_i (X_i - \mu)^2}{N} = \frac{2(1.5 - 2.25)^2 + 3(2.5 - 2.25)^2 + 1(3 - 2.25)^2}{6} = \frac{1.125 + 0.1875 + 0.5625}{6} = \frac{1.875}{6} = 0.3125$$

• Hence, the variance of the weights of the newborn babies is 0.3125 Kg².

• And the standard deviation will be:
$$S.D = \sqrt{\text{variance}} = \sqrt{0.3125} = 0.559 \text{ Kg}$$
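
A minimal Python sketch of the grouped (frequency) calculation, with each deviation squared before it is multiplied by its frequency: 2(0.75)² + 3(0.25)² + 1(0.75)² = 1.875, and 1.875/6 = 0.3125. The variable names are illustrative.

```python
import math

weights_kg = [1.5, 2.5, 3.0]
freqs = [2, 3, 1]

N = sum(freqs)                                            # 6 babies
mu = sum(f * x for x, f in zip(weights_kg, freqs)) / N    # 2.25
# Square each deviation first, then weight it by its frequency.
sigma2 = sum(f * (x - mu) ** 2 for x, f in zip(weights_kg, freqs)) / N
sigma = math.sqrt(sigma2)

print(mu)               # 2.25
print(sigma2)           # 0.3125
print(round(sigma, 3))  # 0.559
```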



Sample Variance (S²)
For ungrouped data, the sample variance is calculated using:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
where $\bar{X}$ is the sample mean and n is the total
number of observations in the sample.

• Note: for sample data we divide by (n − 1) rather than by n (the divisor used for the
population variance), because this gives an unbiased estimator of the
population variance.

• Sample Standard Deviation (S)
$$S.D = \sqrt{\text{variance}} = \sqrt{S^2}$$
For grouped data the sample variance is calculated as:
$$S^2 = \frac{\sum_{i=1}^{k} f_i (X_i - \bar{X})^2}{\sum_{i=1}^{k} f_i - 1}$$
Example: A sample of 6 children was taken from a population, with ages
17, 18, 19, 20, 22, 24. Calculate:
A) the variance B) the standard deviation
• First the sample mean is calculated as:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{17 + 18 + 19 + 20 + 22 + 24}{6} = \frac{120}{6} = 20$$
As a sample is considered, the variance is calculated as:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{(17 - 20)^2 + (18 - 20)^2 + (19 - 20)^2 + (20 - 20)^2 + (22 - 20)^2 + (24 - 20)^2}{6 - 1} = \frac{9 + 4 + 1 + 0 + 4 + 16}{5} = \frac{34}{5} = 6.8$$
The S.D can be calculated as
$$S.D = \sqrt{\text{variance}} = \sqrt{6.8} = 2.61$$
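
The sample calculation can be cross-checked against Python's standard `statistics` module, whose `variance` and `stdev` functions use the n − 1 divisor. The hand-written `sample_variance` helper below is an illustrative sketch.

```python
import statistics

ages = [17, 18, 19, 20, 22, 24]


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)


s2 = sample_variance(ages)
print(round(s2, 2))                         # 6.8
print(round(statistics.variance(ages), 2))  # 6.8, same n - 1 divisor
print(round(statistics.stdev(ages), 2))     # 2.61
```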



Exercise: Calculate the variance and standard deviation for the following data.

1) 19, 20, 24, 12, 17, 22, 18, 20, 23, 17.
2) Age    Frequency
   22     3
   23     2
   24     4
   26     1

Answers (treating each data set as a sample):
            Q1          Q2
Mean        19.2        23.4
SD          3.489667    1.264911
Variance    12.17778    1.6


Properties
• The main demerit of the variance is that its unit is the
square of the unit of measurement of the variate values.
• The variance gives more weight to extreme
values than to those near the mean,
because the differences are squared.
• The drawbacks of the variance are overcome by the standard
deviation.


Standard deviation (σ, S)
• It is the positive square root of the variance.



Properties
• Standard deviation is considered to be the best measure
of dispersion and is used widely because of the
properties of the theoretical normal curve.

• There is however one difficulty with it. If the units of
measurement of the variables of two series are not the
same, then their variability cannot be compared by
comparing the values of the standard deviations.


Coefficient of variation (CV)
• In situations where either two series have different units of
measurements, or their means differ sufficiently in size, the
coefficient of variation should be used as a measure of
dispersion.
• It is the best measure to compare the variability of two
series or sets of observations.
• A series with a smaller coefficient of variation is considered
more consistent.
• The coefficient of variation of a series of variate values is the
ratio of the standard deviation to the mean, multiplied by
100: CV = (S / mean) × 100.


• Example 3.6 Suppose that each day laboratory technician
A completes on average 40 analyses with a standard deviation of 5.
Technician B completes on average 160 analyses per day with a
standard deviation of 15. Which employee shows less
variability?
• At first glance, it appears that technician B has three times
more variation in the output rate than technician A. But B
completes analyses at a rate 4 times faster than A. Taking all
this information into account, we compute the coefficient of
variation for both technicians:
CV(A) = (5/40) × 100 = 12.5% and CV(B) = (15/160) × 100 = 9.4%
• Relative to the mean output, technician B therefore shows less
variability.


Example: In counts of red blood cells (RBC) per ml of plasma,
Abebe and Alemu obtained the following results. Which of the two lab technicians
performed the more reliable (consistent) measurement?

Laboratory technician    Abebe    Alemu
Mean count                 79       64
Standard deviation         23       11

Solution:
Alemu: $$CV = \frac{S}{\bar{x}} \times 100 = \frac{11}{64} \times 100 = 17.19\%$$
Abebe: $$CV = \frac{S}{\bar{x}} \times 100 = \frac{23}{79} \times 100 = 29.11\%$$

• Interpretation: Abebe's measurements have more variability (less
consistency) than Alemu's measurements.
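
The comparison can be reproduced with a small Python sketch (the function name `coefficient_of_variation` is illustrative). The technician with the smaller CV is the more consistent one.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100


cv_abebe = coefficient_of_variation(23, 79)
cv_alemu = coefficient_of_variation(11, 64)
print(round(cv_abebe, 1))   # 29.1
print(round(cv_alemu, 1))   # 17.2
print("More consistent:", "Alemu" if cv_alemu < cv_abebe else "Abebe")
```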



Characteristics of a distribution
• The measures of central tendency and variation discussed before do not
reveal the entire story about frequency distributions.
• Two distributions may have the same mean and standard deviation but
may differ in the shape of the distribution.
• Further description of their characteristics is necessary, and this is provided by
skewness.
• In a symmetrical distribution the values of mean, median and mode are
alike. The term ‘skewness’ refers to lack of symmetry or departure from
symmetry.
• If extremely low or extremely high observations are present in a
distribution, then the mean tends to shift towards those scores.


Skewness
• The skewness of a distribution is measured by comparing the
relative positions of the mean, median and mode.
• Distribution is symmetrical:
Mean = Median = Mode
• Distribution skewed right:
Median lies between mode and mean, and mode is less than mean
• Distribution skewed left:
Median lies between mode and mean, and mode is greater than
mean
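
As a rough illustration of this rule, the sketch below compares only the mean and the median. This is a simplification: it is a quick heuristic, not a formal skewness coefficient. The function name `skew_direction` and the data are hypothetical.

```python
import statistics


def skew_direction(values):
    """Rough direction of skew from the relative position of mean and median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "skewed right"  # a few large values pull the mean upward
    if mean < median:
        return "skewed left"   # a few small values pull the mean downward
    return "approximately symmetrical"


# Hypothetical nights of stay with one very long stay.
print(skew_direction([1, 2, 2, 3, 3, 3, 4, 5, 40]))  # skewed right
print(skew_direction([1, 2, 3, 4, 5]))               # approximately symmetrical
print(skew_direction([1, 10, 10, 10]))               # skewed left
```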



• Based on the type of skewness, distributions can be:
• a) Negatively skewed distribution: occurs when the majority of scores are at
the right end of the curve and a few small scores are scattered at the left
end.
• b) Positively skewed distribution: occurs when the majority of scores are
at the left end of the curve and a few extremely large scores are scattered at
the right end.
• c) Symmetrical distribution: it is neither positively nor negatively
skewed. A curve is symmetrical if one half of the curve is the mirror image
of the other half.


Introduction to Probability



Objective

• To provide an understanding of probability and
its applications
• To calculate probabilities using frequency
distributions
• To explain probability distributions and set the
ground for the development of statistical inference


Introduction to sets
• A set is a collection of objects. Sets are usually designated
by capital letters A, B, ... etc.
Example: A = {a, b, c, d}. Here “a” is a member of set
“A”, denoted as a ∈ A.
• Universal set (U): the set of all objects under consideration.
• Empty/null set (∅): a set that contains no members.
• Given two sets A and B: if being a member of A implies being a
member of B, then A is a subset of B, denoted as A ⊆ B.


Introduction to sets
• Two sets A and B are equal if A and B have the same members.
• If A ∪ B = C, then set C is A union B and contains the elements
which are in A or in B or in both.
• If D = A ∩ B, then set D is A intersection B and consists of the
elements which are in both A and B.
• Example: A = {1, 2, 3, 4, 5}, B = {a, b, 1, 2, 5, c, 6}
• A ∪ B = {1, 2, 3, 4, 5, 6, a, b, c}
• A ∩ B = {1, 2, 5}
Basic characteristics of sets
1. A ∪ ∅ = A, A ∩ ∅ = ∅, A ∪ U = U, A ∩ U = A

2. A ∪ A = A, A ∩ A = A

3. A ∪ B = B ∪ A; A ∩ B = B ∩ A

4. (A ∪ B) ∪ C = A ∪ (B ∪ C); (A ∩ B) ∩ C = A ∩ (B ∩ C)

5. A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C); A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)

6. (Ac)c = A

7. (A ∪ B)c = Ac ∩ Bc; (A ∩ B)c = Ac ∪ Bc
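
These identities can be checked on concrete sets with Python's built-in `set` type. In the sketch below A and B are the sets from the earlier example; the universal set U and the third set C are hypothetical choices made only so that complements and the distributive laws can be evaluated.

```python
A = {1, 2, 3, 4, 5}
B = {"a", "b", 1, 2, 5, "c", 6}
C = {2, 3, "a", 7}      # hypothetical third set
U = A | B | {7, 8}      # hypothetical universal set containing A, B and C


def complement(s):
    """Elements of the universal set U that are not in s."""
    return U - s


print(sorted(A & B))    # [1, 2, 5]  (intersection)
print(len(A | B))       # 9          (union has nine members)

# Identity, idempotent and commutative laws
assert A | set() == A and A & set() == set() and A | U == U and A & U == A
assert A | A == A and A & A == A
assert A | B == B | A and A & B == B & A
# Distributive laws
assert A | (B & C) == (A | B) & (A | C)
assert A & (B | C) == (A & B) | (A & C)
# Double complement and De Morgan's laws
assert complement(complement(A)) == A
assert complement(A | B) == complement(A) & complement(B)
assert complement(A & B) == complement(A) | complement(B)
```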


Probability
• Probability is the language of chance. The deliberate use of
chance is the central idea of statistical designs for producing data.
• Probability provides the necessary tools to capture the
uncertain state of our knowledge.
• We define a probabilistic experiment to be any process that produces
outcomes which are not predictable in advance.


Probability
• Probabilities are used in everyday communication.
– A patient has a 50 – 50 chance of surviving a certain
operation
– The chance of a 30 year old woman to celebrate her 70th
birthday is 30%
• Because medicine is an inexact science, physicians seldom can
predict an outcome with absolute certainty.
• Example 1
• To formulate a diagnosis, a physician must rely on available
diagnostic information about a patient:
– History and physical examination
– Laboratory studies, X‐ray findings, ECG, etc.
Probability
• Because no test result is absolutely accurate, a result does not settle
the diagnosis; it only changes the probability of the presence (or absence)
of a disease.

Example 2
– We may hear a physician say that a patient has a 50-50 chance
of surviving a certain operation.
– Another physician may say that she is 95 percent certain that a
patient has a particular disease.


Probability
• An understanding of probability is fundamental for
quantifying the uncertainty that is inherent in the decision
making process.
• Probability theory also allows us to draw conclusions
about a population of patients based on known information
about a sample of patients drawn from that population.


Basic terms
• A random experiment is an experiment for which the
outcome cannot be predicted with certainty, but all
possible outcomes can be identified prior to its
performance, and it may be repeated under the same
conditions.
• We call a phenomenon random if:
– The exact outcome is not predictable in advance.
– However, there is a predictable long-term pattern that can be
described by the distribution of outcomes of very many trials.
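
The long-term pattern can be illustrated by simulating coin tosses: a single toss is unpredictable, but the proportion of heads settles near 0.5 as the number of tosses grows. This is a hypothetical sketch using Python's pseudo-random generator with a fixed seed.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable


def proportion_of_heads(n_tosses):
    """Simulate n_tosses fair coin tosses and return the relative frequency of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


for n in (10, 100, 10_000, 100_000):
    print(n, proportion_of_heads(n))  # the proportion drifts toward 0.5 as n grows

long_run = proportion_of_heads(100_000)
```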
Basic terms
• Sample space is the set of all possible outcomes of a
random experiment. It is denoted by S, and P(S) = 1.
• In tossing a single six-sided die once, the sample space is
S = {1, 2, 3, 4, 5, 6}.
• Equally likely: A set of events is equally likely if none of
them can be expected to happen in preference to
another.
– E.g. when a fair coin is tossed, heads and tails are equally
likely outcomes.


Basic terms
• Mutually exclusive events: events are mutually exclusive if the occurrence
of one of them precludes the occurrence of all others.
Two events A and B are mutually exclusive if they cannot
occur at the same time:
P (A ∩ B) = 0
Example:
o A coin toss cannot produce heads and tails
simultaneously.
o The weight of an individual can’t be classified
simultaneously as “underweight”, “normal” and
“overweight”.
Basic terms
• Independent events: Two events A and B are
independent
– if the probability of the first one happening is the same no
matter how the second one turns out, i.e.
– the outcome of one event has no effect on the occurrence
or non-occurrence of the other.
Example:
– The outcomes of the first and second coin tosses are
independent.


Basic terms
• Experiment = any process with an uncertain outcome

– When an experiment is performed, one and only one
outcome is obtained.

• Event = something that may or may not happen when the
experiment is performed

– An event either occurs or it does not occur.

– Events are represented by uppercase letters such as A, B, & C


Examples
1. Experiment is a blood test to determine HIV status. Possible
outcomes are {HIV +} and {HIV -}.
– A1 could be the event that a test comes out positive.

– A2 could be the event that a test comes out negative.

2. Experiment is a blood test and further screening to determine
HIV status (HIV+ or HIV-) and AIDS status (D+ or D-).
Events are:
– {(HIV +;D+)}; {(HIV +;D-)}; {(HIV -;D+)}; {(HIV -;D-)}


3. Experiment is to record the number of people that get tested for
HIV in one week at a given clinic. Suppose 500 is the maximum
possible number of tests given in a week. Then any non-negative
integer less than or equal to 500 is a conceivable outcome.
Events are: {0}; {1}; {2}; … ; {500}
• Note that unions and intersections of events are events.
A1 is the event that more than 100 people get tested.
A2 is the event that fewer than 220 people get tested.
A3 = A1 ∩ A2 is the event that more than 100 people but fewer
than 220 get tested.

• The probability of an event A, denoted by P(A), is, in general,
the chance that A will happen. But how do we measure the chance of
occurrence, i.e., how do we determine the probability of an event?
4. Consider a box containing 100 marbles, 90 of them red and
the other 10 blue.
– If the question is: ‘‘What proportion of the marbles in the
box are red?’’, someone who saw the box’s contents would
answer ‘‘90%.’’
– But if the question is: ‘‘If I take one marble at random,
do you think I would have a red one?’’, the answer
would be ‘‘90% chance.’’
– The first 90% represents a proportion; the second 90%
indicates a probability.


Approaches to probability
1. Subjective probability: Definitions of probability as a
quantitative measure of the “degree of certainty” of the
observer of the experiment.

2. Classical definition: Definitions that reduce the concept
of probability to the more primitive notion of “equal
likelihood”.

3. Statistical definition: Definitions that take as their point
of departure the “relative frequency” of occurrence of the
event in a large number of trials.

Approaches to probability
1. Subjective probability: measures the confidence, or degree of
belief, that a particular individual has in the truth of a particular
proposition.
– E.g. if someone says that he is 95% certain that a cure for
AIDS will be discovered within 5 years, then he means that
Pr(discovery of cure for AIDS within 5 years) = 95%.
• Although the subjective view of probability has enjoyed
increased attention over the years, it has not been fully
accepted by scientists.


Approaches to probability
2. The classical definition of probability:
– The probability P(A) of an event A is equal to the number
of possible simple events (outcomes) favorable to A
divided by the total number of possible simple events of
the experiment, i.e.,
P(A) = m / N
where m = number of simple events into which the event A
can be decomposed and N = total number of possible simple
events of the experiment.

Example 1. Consider the experiment of tossing a
balanced coin. P(H) = P(T) = 1/2.


Example 2. Consider the experiment of tossing a
balanced die. Let Dk be the event that k dots (k = 1, 2, 3, 4, 5, 6)
are observed on the upper face of the die. Therefore,
P(Dk) = 1/6 (k = 1, 2, 3, 4, 5, 6).

– Let Dodd be the event that an odd number of dots is
observed, and

– Deven the event that an even number of dots is observed.

– We have P(Dodd) = 3/6 = 1/2, P(Deven) = 3/6 = 1/2.

– Let A be the event that fewer than 6 dots are
observed; then P(A) = 5/6.
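
The classical definition can be checked by direct counting. The following is a minimal Python sketch, not part of the original notes; the function name classical_probability is a hypothetical helper, and it assumes all outcomes in the sample space are equally likely.

```python
from fractions import Fraction

def classical_probability(event, sample_space):
    """P(A) = m / N, assuming every outcome in sample_space is equally likely."""
    favorable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favorable), len(sample_space))

die = {1, 2, 3, 4, 5, 6}                       # sample space of one die toss
p_odd = classical_probability({1, 3, 5}, die)  # event: odd number of dots
p_even = classical_probability({2, 4, 6}, die) # event: even number of dots
p_less_than_6 = classical_probability({k for k in die if k < 6}, die)

print(p_odd, p_even, p_less_than_6)  # 1/2 1/2 5/6
```

Using Fraction keeps the results as exact ratios, which matches the m/N form used on the slide.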
Approaches to probability
3. The statistical/relative frequency probability:
The absolute frequency f(A) of an event A in n trials is the
number of times A occurs, and the relative frequency of A in
these trials is:
P(A) = f(A) / n

Example 1. Suppose that of 158 people who attended a
dinner party, 99 were ill due to food poisoning. The
probability of illness for a person selected at random is
Pr (illness) = 99/158 = 0.63 or 63%


Example 2. The record of a certain health center showed
that out of 10000 smokers, 2940 developed lung cancer.
If one smoker is randomly selected from this group,
what is the probability that he will develop lung cancer?
Let L = the smoker develops lung cancer.
P(L) = 2940/10000 = 0.294

– Note: We will adopt the relative frequency interpretation
of probability, which says that the probability that an
event A occurs is equal to the proportion of the time that
A occurs if we repeat the random experiment again and
again to infinity:
P(A) = lim (n → ∞) f(A) / n
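
The long-run behaviour of the relative frequency can be illustrated by simulation. This is a rough Python sketch under stated assumptions: the experiment is a simulated fair die, the event is "odd number of dots", and the function name relative_frequency and the seed value are arbitrary choices, not from the notes.

```python
import random

def relative_frequency(event, trials, rng):
    """Return f(A)/n for a simulated fair die tossed `trials` times."""
    hits = sum(1 for _ in range(trials) if rng.randint(1, 6) in event)
    return hits / trials

rng = random.Random(2016)  # fixed seed so that the run is repeatable
odd = {1, 3, 5}            # event A: an odd number of dots

# The relative frequency should settle near the classical value 1/2 as n grows.
for n in (10, 100, 10_000, 100_000):
    print(n, round(relative_frequency(odd, n, rng), 3))
```

For small n the relative frequency can be far from 1/2; only for large n does it stabilise, which is the sense in which the relative frequency interpretation speaks of repeating the experiment "again and again".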



Properties of probability
• The mathematical development of probability starts with
three basic rules or axioms:
1. The numerical value of a probability always lies between
0 and 1, inclusive: 0 ≤ P(E) ≤ 1
– A value of 0 means the event cannot occur
– A value of 1 means the event definitely will occur
– A value of 0.5 means that the probability that the event
will occur is the same as the probability that it will not
occur.


Properties of probability
2. The sum of the probabilities of all mutually
exclusive and exhaustive outcomes is equal to 1.
– P(E1) + P(E2) + .... + P(En) = 1.

3. For any two events A and B, P(A or B) is:

– P(A or B) = P(A) + P(B) - P(A and B) (Addition rule)

– For two mutually exclusive events A and B,
P(A or B) = P(A) + P(B).


Properties of probability

4. For any two independent events A and B:
P(A and B) = P(A) P(B) (Multiplication rule)

5. The complement of an event A, denoted by Ā or Ac, is
the event that A does not occur; then P(Ac) = 1 − P(A)
(complementary events)


Basic Probability Rules
1. Addition rule
A. If events A and B are mutually exclusive:

– P(A or B) = P(A) + P(B)

– P(A and B) = 0

– If not mutually exclusive:

– P(A or B) = P(A) + P(B) - P(A and B)

– P(A or B) = P(event A or event B occurs or they both occur)


Example: The probabilities below represent years of
schooling completed by mothers of newborn infants

1. What is the probability that a
mother has completed < 12
years of schooling?
2. What is the probability that a
mother has completed 12 or
more years of schooling?


Class work
The probability that at least three individuals
among the five develop hepatitis B is



Basic Probability Rules
– What is the probability that a mother has
completed < 12 years of schooling?
P(≤ 8 years) = 0.056 and
P(9-11 years) = 0.159
– Since these two events are mutually exclusive,
P(≤ 8 or 9-11) = P(≤ 8 U 9-11)
= P(≤ 8) + P(9-11) = 0.056 + 0.159
= 0.215
– What is the probability that a mother has completed 12 or
more years of schooling?
P(≥ 12) = P(12 or 13-15 or ≥ 16) = P(12 U 13-15 U ≥ 16)
= P(12) + P(13-15) + P(≥ 16)
= 0.321 + 0.218 + 0.230
= 0.769

Basic Probability Rules
B. If A and B are not mutually exclusive events,
then subtract the overlap:
P(A U B) = P(A) + P(B) − P(A ∩ B)


Basic Probability Rules
2. Multiplication rule
– If A and B are independent events, then
P(A ∩ B) = P(A) × P(B)

– More generally, if A and B are dependent,
P(A ∩ B) = P(A) P(B|A) = P(B) P(A|B)
P(A and B) denotes the probability that A and B both
occur at the same time.


Conditional Probability
– Refers to the probability of an event, given that another
event is known to have occurred.
– “What happened first is assumed”

– Hint: when thinking about conditional probabilities,
think in stages. Think of the two events A and B occurring
chronologically, one after the other, either in time or
space.
• Conditional probabilities are probabilities based on the
knowledge that some event has occurred.


Conditional Probability
• Conditional probabilities are denoted by P(B/A) or
P(Event/conditioning event).
• The formula for calculating a sample conditional
probability is :



Conditional Probability
The conditional probability that event B occurs
given that event A has already occurred is denoted
P(B|A) and is defined as

P(B|A) = P(A ∩ B) / P(A)

provided that P(A) ≠ 0.


Conditional Probability
Example 1.
Table 1. A study investigating the effect of prolonged exposure to
bright light on retina damage in premature infants.

                 Retinopathy   Retinopathy   TOTAL
                 YES           NO
Bright light     18            3             21
Reduced light    21            18            39
TOTAL            39            21            60


• Pr(D+/reduced light) = Pr(D+ & reduced light) / Pr(reduced light)
= (21/60) / (39/60) = 21/39 = 54%
• Pr(D+/bright light) = Pr(D+ & bright light) / Pr(bright light)
= (18/60) / (21/60) = 18/21 = 86%


Conditional Probability
• We want to know whether the probability of retinopathy
for the bright-light infants differs from the probability of
retinopathy for the reduced-light infants.
These probabilities are conditional probabilities.

• We want to compare the probability of retinopathy, given
that the infant was exposed to bright light, with the
probability given that the infant was exposed to reduced light.

• Exposure to bright light and exposure to reduced light are
conditioning events, events we want to take into account
when calculating conditional probabilities.

Conditional Probability
• For the retinopathy data, the conditional probability of
retinopathy, given exposure to bright light, is
• P(Retinopathy/exposure to bright light)
= (No. of infants with retinopathy exposed to bright light)
/ (No. of infants exposed to bright light)
= 18/21 = 0.86
• P(Retinopathy/exposure to reduced light)
= (No. of infants with retinopathy exposed to reduced light)
/ (No. of infants exposed to reduced light)
= 21/39 = 0.54
• The conditional probabilities suggest that premature infants
exposed to bright light have a higher risk of retinopathy than
premature infants exposed to reduced light.
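
The same calculation can be written as joint probability divided by the probability of the conditioning event. The sketch below is illustrative Python, not part of the original notes; the counts come from Table 1, while the dictionary layout and function names are assumptions made for this example.

```python
from fractions import Fraction

# Counts from Table 1: (exposure, retinopathy) -> number of infants
counts = {
    ("bright", "yes"): 18, ("bright", "no"): 3,
    ("reduced", "yes"): 21, ("reduced", "no"): 18,
}
total = sum(counts.values())  # 60 infants in all

def p_joint(exposure, outcome):
    """P(exposure and outcome)."""
    return Fraction(counts[(exposure, outcome)], total)

def p_exposure(exposure):
    """Marginal probability of the conditioning event."""
    return sum(p_joint(exposure, o) for o in ("yes", "no"))

def p_outcome_given_exposure(outcome, exposure):
    """P(outcome | exposure) = P(exposure and outcome) / P(exposure)."""
    return p_joint(exposure, outcome) / p_exposure(exposure)

print(round(float(p_outcome_given_exposure("yes", "bright")), 2))   # 0.86
print(round(float(p_outcome_given_exposure("yes", "reduced")), 2))  # 0.54
```

The division by 60 cancels, which is why the shortcut 18/21 and 21/39 on the slide gives the same answers.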
Class work
Table 2 shows the frequency of cocaine use by gender
among adult cocaine users.

Lifetime frequency      Male    Female    Total
of cocaine use
1-19 times              32      7         39
20-99 times             18      20        38
more than 100 times     25      9         34
Total                   75      36        111

1. What is the probability that a randomly picked person is male?
2. What is the probability that a randomly picked person has used cocaine
more than 100 times?
3. Given that the selected person is male, what is the probability that he
has used cocaine more than 100 times?
4. Given that the person has used cocaine less than 100 times, what is the
probability that the person is female?
5. What is the probability that a randomly picked person is male and has
used cocaine more than 100 times?

Conditional Probability
1. For independent events A and B,
P(A/B) = P(A).
2. For non-independent events A and B,
P(A and B) = P(A/B) P(B) (General Multiplication Rule)

3. Bayes' theorem:
P(A/B) = P(B/A) P(A) / P(B)


Conditional Probability
Home work
From a city population, the probability of selecting a male or a smoker
is 7/10, a male smoker is 2/5, and a male, if a smoker is already
selected, is 2/3. Find the probability of selecting (a) a non-smoker, (b)
a male, and (c) a smoker, if a male is first selected.
Let A: a male is selected
B: a smoker is selected. We are given
P(A ∪ B) = 7/10, P(A ∩ B) = 2/5, P(A|B) = 2/3
The probability of selecting a non-smoker is
P(Bc) = 1 − P(B) = 1 − P(A ∩ B) / P(A|B)
[since P(A|B) = P(A ∩ B) / P(B)]
= 1 − (2/5)/(2/3) = 1 − 3/5 = 2/5
The probability of selecting a male (by the addition rule) is:
P(A) = P(A ∪ B) + P(A ∩ B) − P(B)
= (7/10) + (2/5) − (3/5) = 1/2
Class work
Find the probability of selecting a smoker if a male is first selected,
i.e. P(B|A) = ?
Home work
1. Consider the experiment of tossing a fair die and
define the following events:
A = {Observe an even number of dots}
B = {Observe a number of dots less than or equal to 4}.
Are events A and B independent?
2. Suppose that three programmers are designing computer code for a
project: Mr. A has designed 60% of the code, Mr. B 30% and Mr. C
10%. Suppose further that Mr. A has a bug in 3% of his work, Mr. B
in 7% of his work, and Mr. C in 5% of his.
A. What percentage of the code written has a bug?
B. Given that you find a bug in a line of code, who is most likely to
have written it? Who is least likely?
C. How does the ordering compare to the unconditional probabilities
and why does this relationship make sense?

Bayes' Theorem
• In the health sciences field a widely used application of probability
laws and concepts is found in the evaluation of screening tests and
diagnostic criteria.
• Of interest to clinicians is an enhanced ability to correctly predict
the presence or absence of a particular disease from a knowledge of
test results (positive or negative) and/ or the status of presenting
symptoms (present or absent).



Bayes' Theorem
• Also of interest is information regarding the likelihood of
positive and negative test results and the likelihood of the
presence or absence of a particular symptom in patients
with and without a particular disease.

• In our consideration of screening tests, we must be aware
of the fact that they are not always perfect. That is, a
testing procedure may yield a false positive or a false
negative.


Bayes Theorem
Total probability
If the event B may occur together with one and only one
of n mutually exclusive events A1, A2, ..., An then
P(B)= P(A1)P(B|A1)+P(A2)P(B|A2)+ ...+P(An)P(B|An).

Bayes’s Formula
If the event B may occur together with one and only one
of n mutually exclusive events A1, A2, ..., An then

\[ P(A_k \mid B) = \frac{P(A_k)\,P(B \mid A_k)}{P(B)} = \frac{P(A_k)\,P(B \mid A_k)}{\sum_{j=1}^{n} P(A_j)\,P(B \mid A_j)} \]



Sensitivity and Specificity
• Data for assessing the sensitivity and specificity of a test are usually
of the form
                    Disease Category
Test result    Diseased (+)    Nondiseased (-)    total
+              A               B                  A+B
-              C               D                  C+D
total          A+C             B+D                1.00
Sensitivity: the proportion of diseased people who would
be correctly classified,
estimated by Sens = A/(A + C).
Specificity: the proportion of non-diseased people who
would be correctly classified,
estimated by Spec = D/(B + D).

Sensitivity and Specificity
• The prevalence of a disease is the percent of the population
with the disease estimated by R = (A + C)/(A + B + C + D).
Note that a random sample is required to estimate prevalence.
• Positive Predictive Value: is the proportion of people who
tested positive that truly are positive.
estimated by PPV =A/(A + B).
• Negative Predictive Value: is the proportion of people who
tested negative that truly are negative.
estimated by NPV =D/(C + D).
• False Negative: The probability of a false negative is the
probability of testing negative given a truly positive condition.
• False Positive: The probability of a false positive is the
probability of testing positive given a truly negative condition.
Example 1
Data for assessing the sensitivity and specificity of a test:
                    Disease Category
Test result    Diseased (+)    Nondiseased (-)    total
+              10000           5000               15000
-              1000            84000              85000
total          11000           89000              100000
 The estimated Sensitivity is Sens = A/(A + C)=90.9%
 The estimated Specificity is Spec = D/(B + D)=94.4%
 The estimated prevalence is R = (A + C)/(A + B + C + D)=11.00%.
 The estimated PPV is PPV =A/(A + B)=66.7%
 The estimated NPV is NPV =D/(C + D)=98.8%
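
The five estimates above can be reproduced from the four cell counts. The following is a small Python sketch, not part of the original notes; the function name screening_metrics is a hypothetical helper, and the last lines check (as an added illustration) that Bayes' formula applied to sensitivity, specificity and prevalence gives the same PPV.

```python
def screening_metrics(a, b, c, d):
    """a = test+ & diseased, b = test+ & nondiseased,
    c = test- & diseased, d = test- & nondiseased."""
    n = a + b + c + d
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "prevalence": (a + c) / n,
        "ppv": a / (a + b),
        "npv": d / (c + d),
    }

m = screening_metrics(a=10000, b=5000, c=1000, d=84000)
for name, value in m.items():
    print(f"{name}: {value:.1%}")
# sensitivity: 90.9%
# specificity: 94.4%
# prevalence: 11.0%
# ppv: 66.7%
# npv: 98.8%

# PPV via Bayes' formula: P(D+ | T+) = Sens*R / (Sens*R + (1 - Spec)*(1 - R))
sens, spec, prev = m["sensitivity"], m["specificity"], m["prevalence"]
ppv_bayes = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
print(f"{ppv_bayes:.1%}")  # 66.7%
```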
PROBABILITY DISTRIBUTION



Probability distribution
• Every random variable has a corresponding probability
distribution.

• A probability distribution applies the theory of probability
to describe the behavior of the random variable.

• The term probability distribution, or just distribution, refers
to the way data are distributed, and is used to draw
conclusions about a set of data.


Probability distribution
• A probability distribution is a listing of all the possible values
that a random variable can take, along with their
probabilities.

• A probability distribution of a random variable can be
displayed by a table, a graph or a mathematical formula.

• A random variable is any quantity or characteristic that is
able to assume a number of different values such that any
particular outcome is determined by chance.

• Random variables can be either discrete or continuous.

Example: sample space for three tosses of a coin
• HHH HHT HTH THH
• TTT TTH THT HTT

Number of heads (x) and probability P(X = x):
• 0 1/8
• 1 3/8
• 2 3/8
• 3 1/8


Probability distribution
• The domain of a random variable is the sample space and its
range is a set of real numbers.
Example 1. Number of HIV+ patients upon taking a single
blood test to determine the status.

Example 2. Observe 100 babies to be born in a clinic. The
number of boys born is a random
variable. It may take values from 0 to 100.

Example 3. Select one student from a university and
measure his/her height and record this height by x. Then x
is a random variable, assuming values from, say, 100
cm to 250 cm depending on the specific student.

Basic definition
 A discrete random variable is able to assume only a finite or
countable number of outcomes
 A continuous random variable can take on any value in a specified
interval.
Example 1 Experiment is surgery on two people. Outcomes are {ss,sf,fs,ff}.
Example2 Experiment is to observe the number of people that get tested for
HIV in one week at a given clinic. Suppose 500 is the maximum
possible number of tests given in a week. Then any non-negative
integer less than or equal to 500 is a conceivable outcome.
X = number of tests in a given week.
Example3 Experiment is to record the number of places that a person has
lived in his or her lifetime. Possible outcomes are {1; 2; 3; …,}
X = number of places a person has lived.
Example4 . Experiment is to record the sex of a person. Outcomes {m, f}



Discrete Probability distributions
• For a discrete random variable X, a probability
distribution is a function that assigns to any possible value
x of X the probability P(X = x).
Two Requirements for a Probability Distribution:
1. The sum of the probabilities of all the events in the
sample space must equal 1; that is,
ΣP(X = x) = 1.
2. The probability of each event in the sample space must
be between 0 and 1, inclusive. That is, 0 ≤ P(X = x) ≤ 1.


Example1:
• Consider again the experiment of taking a single blood
test to determine HIV status. Let the random variable X
denote the number of positive tests.
• Then X(HIV+)=1, X(HIV-)=0
If we knew that the prevalence of HIV was 0.11, then
P(X = 1) = 0.11 and P(X = 0) = 0.89
• These two equations completely describe the probability
distribution of the discrete (dichotomous) random
variable X.



Example 2. Consider the value on the face showing
up from tossing a die.
• The probability distribution of this variable is
Value on face    1     2     3     4     5     6
Probability      1/6   1/6   1/6   1/6   1/6   1/6
• Notice that the total probability is 1.


• Example 3
The data show the number of diagnostic services
a patient receives (X) and the probability of each value:
Number of services (x)   0       1       2       3       4       5
P(X = x)                 0.671   0.229   0.053   0.031   0.010   0.006


• What is the probability that a patient receives exactly 3
diagnostic services?
P(X=3) = 0.031
• What is the probability that a patient receives at most one
diagnostic service?
P (X≤1) = P(X = 0) + P(X = 1)
= 0.671 + 0.229
= 0.900
• What is the probability that a patient receives at least four
diagnostic services?
P (X≥4) = P(X = 4) + P(X = 5)
= 0.010 + 0.006
= 0.016
Expected Value of a Discrete Random variable
• The average value assumed by a random variable is called
its expected value, or the population mean.
• It is represented by E(X) or µ, where µ = E(X) = Σ x P(X = x).
Example: expected value for the diagnostic service data:
E(X) = 0(0.671) + 1(0.229) + 2(0.053) + 3(0.031) + 4(0.010)
+ 5(0.006)
= 0.498 ≈ 0.5
• We would expect an average of 0.5 services for each visit.


Variance of a Discrete Random Variable
• The variance of a random variable X is called the
population variance and is represented by Var(X) or σ²:
σ² = Σ (xi − µ)² P(X = xi)
The variance for the diagnostic service data above is
σ² = Σ (xi − µ)² P(X = xi) = (0 − 0.5)²(0.671) + (1 − 0.5)²(0.229)
+ (2 − 0.5)²(0.053) + (3 − 0.5)²(0.031) + (4 − 0.5)²(0.010)
+ (5 − 0.5)²(0.006) = 0.782
Standard deviation = σ = √0.782 = 0.884
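
The expected value and variance of a discrete random variable can be computed directly from its probability table. Below is a minimal Python sketch, not part of the original notes; the probabilities are those used in the diagnostic service calculations, and the function names are hypothetical. Unlike the hand calculation, the code uses the unrounded mean 0.498, which gives the same results to three decimals.

```python
import math

# P(X = x) for the number of diagnostic services per visit
pmf = {0: 0.671, 1: 0.229, 2: 0.053, 3: 0.031, 4: 0.010, 5: 0.006}

def expected_value(pmf):
    """E(X) = sum of x * P(X = x)."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = sum of (x - mu)^2 * P(X = x)."""
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

mu = expected_value(pmf)
var = variance(pmf)
sd = math.sqrt(var)
print(round(mu, 3), round(var, 3), round(sd, 3))  # 0.498 0.782 0.884
```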



Factorials

• Given the positive integer n, the product of all the whole
numbers from n down through 1 is called n factorial and is
written n!.

• n! = n × (n-1) × (n-2) × … × 2 × 1 = n × (n-1)!

• By definition, 0! = 1.


Factorials
• Permutation: An ordered arrangement of objects.

• Combination: An arrangement of objects without
regard to order.
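
Factorials, permutations and combinations are available in Python's standard math module (math.perm and math.comb require Python 3.8 or later). The short sketch below is an added illustration, not part of the original notes; the values n = 5 and r = 2 are arbitrary, and the standard counting formulas n!/(n-r)! and n!/(r!(n-r)!) are written out as checks.

```python
import math

n, r = 5, 2  # arbitrary illustrative values

print(math.factorial(n))  # 120: 5! = 5 x 4 x 3 x 2 x 1
print(math.perm(n, r))    # 20: ordered arrangements of 2 objects out of 5
print(math.comb(n, r))    # 10: selections of 2 objects out of 5, order ignored

# The same counts written with factorials
n_perm = math.factorial(n) // math.factorial(n - r)
n_comb = math.factorial(n) // (math.factorial(r) * math.factorial(n - r))
print(n_perm == math.perm(n, r), n_comb == math.comb(n, r))  # True True
```

Each unordered selection of r objects corresponds to r! ordered arrangements, which is why the number of combinations is the number of permutations divided by r!.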



Binomial distribution
• It is one of the most widely encountered discrete
distributions.
• The origin of the binomial distribution lies in Bernoulli trials.
• When a single trial of some experiment can result in only
one of two mutually exclusive outcomes (success or
failure; dead or alive; sick or well; male or female) the trial
is called a Bernoulli trial.
Example 1.
– Let X represent smoking status; X=1 smoker and X=0
non-smoker. The two outcomes are mutually exclusive.
– Take the case of the USA: in 1987, 29% of the adults in the USA
were smokers, therefore Pr (X=1) = 0.29 and Pr (X=0) =
1-0.29 = 0.71.


Binomial distribution
• Suppose each trial can have only two outcomes, A (success) and
B (failure):
• Pr (X=success) = Pr (X=1) = p
• Pr (X=failure) = Pr (X=0) = 1-p

• If an experiment is repeated n times and the outcome is
independent from one trial to another, the probability
P(X = x) that exactly x successes occur is
\[ \Pr(X = x) = \frac{n!}{x!\,(n-x)!}\; p^{x} (1-p)^{n-x} \]
where n (the number of trials) and p (the probability of success
on each trial) are the parameters of the binomial distribution, x is
the number of successes, and n!, read as “n factorial” or “factorial n”,
is the product of all integers from 1 to n inclusive. By definition
1! = 0! = 1.


Binomial distribution
 Example 2
 Suppose now we randomly select two individuals in USA, see the
smoking status of the two persons,
 What is the probability
– That both are non smokers?
– one is a smoker?
– both are smokers?
 If Pr (X=1) = p and pr (X=0) = 1- p, then the above can be calculated
using the multiplicative rule.
Outcome of X
Person 1   Person 2   Probability                               No. of smokers
0          0          (1 − p)(1 − p) = 0.71 × 0.71 = 0.50       0
0          1          (1 − p) p = 0.71 × 0.29 = 0.21            1
1          0          p (1 − p) = 0.29 × 0.71 = 0.21            1
1          1          p p = 0.29 × 0.29 = 0.08                  2
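
The table above can be reproduced with the binomial formula for n = 2 and p = 0.29. This is a minimal Python sketch, not part of the original notes; binomial_pmf is a hypothetical helper name. The two rows with exactly one smoker are combined by the formula, so P(X = 1) = 2 × 0.29 × 0.71.

```python
import math

def binomial_pmf(x, n, p):
    """Pr(X = x) = n! / (x! (n - x)!) * p**x * (1 - p)**(n - x)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 2, 0.29  # two people, each a smoker with probability 0.29
for x in range(n + 1):
    print(x, round(binomial_pmf(x, n, p), 4))
# 0 0.5041
# 1 0.4118
# 2 0.0841
```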
Characteristics of a Binomial Distribution
1. The experiment consists of n identical trials. There are
only two possible mutually exclusive outcomes, A and B, on each
trial.
2. The probability of A remains the same from trial to trial.
This probability is denoted by p, and the probability of B
is denoted by q. Note that q = 1 − p.
3. The trials are independent.
4. The binomial random variable X is the number of A’s in n
trials. n and p are the parameters of the binomial
distribution.
5. The mean is np and the variance is np(1 − p).


 The general form of the Binomial pmf is given by:
\[ b(x; n, p) = {}^{n}C_{x}\, p^{x} q^{\,n-x}, \qquad q = 1 - p, \]
and its cumulative distribution function (cdf) is given by:
\[ F(x) = B(x; n, p) = \sum_{i=0}^{x} b(i; n, p) = \sum_{i=0}^{x} {}^{n}C_{i}\, p^{i} q^{\,n-i} \]

It is important to observe that the binomial random variable
X is the sum of n independent Bernoulli random variables Xi,
i.e., X = X1 + X2 + ... + Xn,
where Xi represents the Bernoulli rv at the ith trial, whose value is
equal to 0 or 1 (0 for failure and 1 for success), so that the range
of X is Rx = {0, 1, 2, ..., n}.
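
The cdf and the mean/variance formulas can be checked numerically. The Python sketch below is an added illustration, not part of the original notes; the function names are hypothetical, p = 0.29 is the smoking probability used earlier, and n = 10 is an arbitrary choice. It sums the pmf to obtain the cdf and compares the mean and variance computed from the pmf with np and np(1 − p).

```python
import math

def binomial_pmf(x, n, p):
    """b(x; n, p) = nCx * p**x * q**(n - x), with q = 1 - p."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def binomial_cdf(x, n, p):
    """F(x) = B(x; n, p) = sum of b(i; n, p) for i = 0..x."""
    return sum(binomial_pmf(i, n, p) for i in range(x + 1))

n, p = 10, 0.29  # n = 10 is an arbitrary illustrative choice

# Mean and variance computed from the definition, using the pmf
mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
var = sum((x - mean) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1))

print(round(mean, 4), round(n * p, 4))           # both should equal np
print(round(var, 4), round(n * p * (1 - p), 4))  # both should equal np(1 - p)
print(round(binomial_cdf(n, n, p), 6))           # cdf at x = n is the total probability
```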



 Class work 1
1. Each child born to a particular set of parents has a probability
of 0.25 of having blood type O. If these parents have 5
children, what is the probability that
a. Exactly two of them have blood type O
b. At most 2 have blood type O
c. At least 4 have blood type O
d. 2 do not have blood type O.


Class work 2
2. Suppose you take a sample of N independent biologists
to determine how many of them use valid statistical
methods.
• In particular, you have a sample of N independent,
identically distributed Bernoulli RVs Yi, with p = P(Yi = 1).
• What is the distribution of the number of successes
Y = Σ(i = 1..N) Yi in N trials? Y ~ Bin(y; N, p)
• Calculate the probability that 0 out of 10 biologists use valid
statistical methods when the probability of using valid statistical
methods is 0.8.


The Poisson distribution
• The Poisson distribution is a discrete probability distribution
used to model the number of occurrences of an event that takes
place infrequently in time or space.
• It is applicable to counts of events over a given interval of
time or space, for example:
– number of patients arriving at an emergency
department in a day
– number of new cases of HIV diagnosed at a clinic in a
month
– daily number of new cases of breast cancer notified
to a cancer registry
– number of abnormal cells in a fixed area of
histological slides from a series of liver biopsies

The Poisson distribution
• The theoretical situation giving rise to data of this type is
easiest to describe in relation to events occurring over
time (or space) at a fixed rate on average, but where each
event occurs independently and at random.
• Such data will have a Poisson distribution.
• Suppose events happen randomly and independently in
time at a constant rate. If events happen with rate λ
events per unit time, the probability of x events
happening in unit time is:
\[ P(X = x) = \frac{e^{-\lambda}\,\lambda^{x}}{x!} \]

• where x = 0, 1, 2, . . . is a potential outcome of X
• t is the length of the time segment of interest; over a segment
of length t the expected number of events is λt, which then
takes the place of λ in the formula
• The constant λ (lambda) represents the rate at which
the event occurs, or the expected number of events
per unit time
• e = 2.71828
• It depends on just one parameter, which is the mean (λ)


Three assumptions of Poisson distribution
1. The probability that a single event occurs within a
given small subinterval is proportional to the
length of the subinterval
2. The rate at which the event occurs is constant over
the entire interval t
3. Events occurring in consecutive subintervals are
independent of each other



Example 1
The daily number of new registrations of cancer is 2.2 on average.
• What is the probability of
a) Getting no new cases
b) Getting 1 case
c) Getting 2 cases
d) Getting 3 cases
e) Getting 4 cases
Solution
• a) P(X=0) = 0.111
• b) P(X=1) = 0.244
• c) P(X=2) = 0.268
• d) P(X=3) = 0.197
• e) P(X=4) = 0.108
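
These five probabilities follow from the Poisson formula P(X = x) = e^(−λ) λ^x / x! with λ = 2.2. The sketch below is an added Python illustration, not part of the original notes; poisson_pmf is a hypothetical helper name.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) = exp(-lam) * lam**x / x! for x = 0, 1, 2, ..."""
    return math.exp(-lam) * lam**x / math.factorial(x)

lam = 2.2  # average number of new cancer registrations per day
for x in range(5):
    print(x, round(poisson_pmf(x, lam), 3))
# 0 0.111
# 1 0.244
# 2 0.268
# 3 0.197
# 4 0.108
```

The five values add up to about 0.93; the remainder is the probability of 5 or more registrations in a day.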
The Poisson distribution
• Characteristics;
• The Poisson distribution is very asymmetric when its mean
is small
• With large means it becomes nearly symmetric
• It has no theoretical maximum value, but the probabilities
tail off towards zero very quickly
• λ is the parameter of the Poisson distribution
• The mean is λ and the variance is also λ.



Probability distribution of continuous variables
• In many circumstances, the outcome of a random
variable is not limited to categories or counts.
Example 1
– Suppose X represents the continuous variable
‘Height’; rarely is an individual exactly 170 cm
tall.
– X can assume an infinite number of intermediate
values: 170.1, 170.2, 170.3 etc.

• Because a continuous random variable X can take on an
uncountably infinite number of values, the probability
associated with any one particular value is equal to
zero.


Probability distribution of continuous variables
• However, the probability that X will assume
some value in the interval enclosed by two
values, say x1 and x2, can be greater than zero and is
given by the area under the probability density curve f(x)
between x1 and x2:
\[ P(x_1 \le X \le x_2) = \int_{x_1}^{x_2} f(x)\,dx \]

• As a continuous variable can take an infinite
number of values, it helps to visualize the
probability distribution as a curve and
probabilities as ‘area under the curve’.
• The most important such curve is the normal distribution.


Normal Distribution
• The Normal Distribution is by far the most important
probability distribution in statistics.
• It is also sometimes known as the Gaussian distribution,
after the mathematician Gauss.
• The distributions of many medical measurements in
populations follow a normal distribution (e.g. serum uric
acid levels, cholesterol levels, blood pressure, height and
weight).
• The normal distribution is a theoretical, continuous
probability distribution whose equation is:
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}} \qquad \text{for } -\infty < x < +\infty \]

Normal Distribution
• The probability, under the normal distribution, of any given
interval between a and b is:
\[ P(a \le X \le b) = \int_{a}^{b} \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\,dx \]


Characteristics of the Normal Distribution
1. It is a probability distribution of a continuous variable. It
extends from minus infinity (−∞) to plus infinity (+∞).

2. It is unimodal, bell-shaped and symmetrical about x = μ.

3. It is determined by two parameters: the mean μ
(read as ‘mu’) and the standard deviation σ (read as ‘sigma’).
– Changing μ alone shifts the entire normal curve to the left or
right.
– Changing σ alone changes the degree to which the distribution
is spread out.
– The mean μ can be any number (negative, positive or zero).
– The standard deviation σ must be a positive number.

Characteristics of the Normal Distribution
4. The height of the frequency curve, which is called the
probability density, cannot be taken as the probability of a
particular value.
– This is because for a continuous variable there are infinitely
many possible values so that the probability of any specific
value is zero.
5. An observation from a normal distribution can be related to the
standard normal distribution (SND), which has a published
table.
– Thus an observation x from a normal distribution with
mean μ and standard deviation σ can be related to the
standard normal distribution by calculating:
SND = Z = (x - μ) / σ
6. Areas under the curve:

– μ ± 1 SD contains about 68%;
– μ ± 2 SD contains about 95%;
– μ ± 3 SD contains about 99.7%

7. The distribution is completely determined by
the parameters μ and σ.
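
The areas within 1, 2 and 3 standard deviations, and the standardisation Z = (x − μ)/σ, can be checked with Python's standard library (statistics.NormalDist, available from Python 3.8). This is an added sketch, not part of the original notes; the function names and the example values x = 130, μ = 120, σ = 10 are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)  # the standard normal distribution

def area_within(k):
    """Area under the standard normal curve between -k and +k SD."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973

def z_score(x, mu, sigma):
    """Standardise an observation: Z = (x - mu) / sigma."""
    return (x - mu) / sigma

# Hypothetical observation: x = 130 from a normal distribution with mu = 120, sigma = 10
z = z_score(x=130, mu=120, sigma=10)
print(z, round(std_normal.cdf(z), 4))  # 1.0 0.8413
```

The last line gives the area to the left of z, which is the quantity tabulated in many published standard normal tables; some tables instead give the area between 0 and z.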
Normal curve



Normal probability
• Normal curve area for Z value of 1.95 in the table
