Rediet Eristu
Rediet.E 3
Methods of Evaluation
Quizzes
Assignment
Test (1&2)
Final exam
Lesson objective
At the end of this chapter, students are expected to know:
• Definition and types of statistics
Introduction
• What is statistics?
- we use statistics every day, often without realising it.
• Statistics: A field of study concerned with:
– collection, organization, analysis, summarization and
interpretation of numerical data, &
– the drawing of inferences about a body of data when
only a small part of the data is observed
Biostatistics?
- The application of statistical methods to the fields of
biological and medical sciences
- Statistical methods make it possible to
methodically distinguish between true differences
among observations and random variations caused by
chance alone
· Concerned with interpretation of biological data & the
communication of information derived from these data
· Has central role in medical investigations
Rationale for studying Biostatistics
Facts are now measured quantitatively in medicine and public
health
Limitation of statistics
It deals with aggregates of facts: no importance is attached to
individual items
Statistical data are only approximately correct, not
mathematically exact
Phases of statistical investigation
I. Collection of data
II. Organization of data
III. Presentation of the data
IV. Analysis of data: The process of extracting
relevant information from the data
V. Inference
Types of Statistics
1. Descriptive statistics:
Descriptive statistics are methods for organizing and
summarizing data
Types of Statistics
2. Inferential statistics:
• Inferential statistics are methods for using sample data to
make general conclusions (inferences) about populations
E.g.: In a study of the prevalence of HIV among adolescents in
Ethiopia, a random sample of adolescents in Lideta Kifle
Ketema of Addis Ababa (AA) was included
Basic terms cont . . .
o Census
A census is the collection of data from every member
of the population
o Parameter
A parameter is a numerical measurement describing
some characteristic of a population
o Statistic
A statistic is a numerical measurement describing
some characteristic of a sample
Basic terms cont . . .
• Data are observations (such as measurements,
genders, survey responses) that have been collected
• It is the raw material for statistics
• Can be obtained from:
– Routinely kept records, literature
– Surveys
– Counting
– Experiments
– Reports
– Observation
– Etc
Haileab.f (Bsc.) 19
Chapter 2
Descriptive Statistics
Descriptive Statistics
• Techniques used to organize and summarize a
set of data in a concise way
– Organization of data
– Summarization of data
– Presentation of data
• Numbers that have not been summarized and
organized are called raw data
Variable
• Variable: A characteristic which takes different values in
different persons, places, or things
1. Discrete: It can only have a limited number of
discrete values (usually whole numbers).
– E.g., the number of pregnancies a mother has had in her life. You
can’t have 2.5 pregnancies
• Characterized by gaps or interruptions in the values
(integers).
• Can assume only whole numbers
2. Continuous variable: It can have an infinite number of
possible values in any given interval.
• Can take any value within a defined range
Scales of measurement
1. Nominal scale:
• The simplest type of data, in which the values fall
into unordered categories or classes
• Consists of “naming” observations or classifying
them into various mutually exclusive and
collectively exhaustive categories
• Uses names, labels, or symbols to assign each
measurement to a category
– Examples: Blood type, sex, race, marital status,
etc.
• If nominal data can take on only two possible
values, they are called dichotomous or binary
2. Ordinal scale:
• Assigns each measurement to one of a limited
number of categories that are ranked in terms of
order
• Although non-numerical, can be considered to have
a natural ordering
• Examples: Patient status, cancer stages, social
class, Likert scales etc.
Example of ordinal scale:
3. Interval scale:
- Measured on a continuum and differences between any
two numbers on a scale are of known size
Example: Temperature in °F on 4 consecutive days
Days:      A   B   C   D
Temp. °F:  50  55  60  65
For these data, not only is day A with 50° cooler than
day D with 65°, but it is 15° cooler
- It has no true zero point. “0” is arbitrarily chosen and
doesn’t reflect the absence of temperature
4. Ratio scale:
- Measurement begins at a true zero point and the
scale has equal intervals
- Examples: Height, age, weight, BP, etc.
• Note on meaningfulness of “ratio”:
– Someone who weighs 80 kg is two times as heavy as
someone else who weighs 40 kg. This is true even if
weight had been measured in other units
Scales of Measurement
• Nominal = Naming
• Ordinal = Naming + Order
• Interval = Naming + Order + Equal Intervals
• Ratio = Naming + Order + Equal Intervals + True
Zero
Degree of precision in measuring (lowest to highest):
Nominal → Ordinal → Interval → Ratio
Exercise:- Consider the following Scales of measurement
(types of data) and answer questions A to D
1. Blood group
2. Temperature (Celsius)
3. Sex
4. Job satisfaction index (1-5)
5. Number of heart attacks
6. Calendar year
7. Serum uric acid (mg/100ml)
8. Number of cases of each reportable disease reported by a
health worker
9. The average weight gain of six 1-year old dogs with a special
diet supplement was 950 grams last month.
10. Injury severity (a score between 1 and 3 is allocated
depending on the severity): scores 1 and 3 show mild and
very severe respectively
Exercise cont.----
A) Identify the type of data (nominal, ordinal, interval, ratio).
Confirm your answers by giving your own examples
Chapter 3
Source of Data
Internal and External Source of Data
Internal Sources of Data
o Many institutions and departments have information about their
regular functions, for their own internal purposes
o When that information is used in any survey, it is called an
internal source of data collection
o E.g., public health institutes, nursing association members, etc.
External Sources of Data
o When information is collected from outside agencies, it is called
an external source of data
o Such data are either primary or secondary
o This type of information can be collected by census or by
sampling, through conducting surveys
Primary Data
• Primary data are those which are collected for
the first time
• They are real-time data collected directly by the
researcher
• Primary data collection is the process of collecting and making
use of the data
• These data are originated by the researcher
specifically to address the research problem
Method of Collecting Primary Data
1. Direct personal Investigation ( i.e. Interview
Method)
2. Indirect oral investigation ( i.e. through
enumerators)
3. Investigation through Local reporters
Questionnaire
4. Investigation through mailed Questionnaire
5. Investigation through Observation
Secondary Data
• Secondary data are those that have already been
collected by others
• These are usually in journals, periodicals, research
publications, official records etc.
• Secondary data may be available in the published
or unpublished form
• When it is not possible to collect the data by the
primary method, the investigator uses the secondary
method
• These data were collected for some purpose other than
the problem at hand
Method of Collecting Secondary Data
1. Published Sources
a) International Publication
b) Government Publications
Difference between Primary and Secondary Data
• Data collection methods?
Data collection methods
Before any statistical work can be done data must be
collected.
Data collection is a crucial stage in the planning and
implementation of a study.
Data collection techniques allow us to systematically
collect data about our objects of study (people, objects,
and phenomena) and about the setting in which they
occur.
In the collection of data we have to be systematic. If
data are collected haphazardly (lacking order or organization), it
will be difficult to answer our research questions in a
conclusive way.
Data collection methods…
For quantitative data, we usually use questionnaires
(standard or structured)
- The questionnaire could be self-administered or
interviewer-administered (either face-to-face or
telephone, or other electronic media such as
online internet)
Types of Questions
For example
Can you describe exactly what the traditional birth
attendant did when your labor started?
What do you think are the reasons for a high drop-out
rate of village health committee members?
What would you do if you noticed that your daughter
(school girl) had a relationship with a teacher?
Closed Questions
Closed questions offer a list of possible options or
answers from which the respondents must choose
When designing closed questions one should try to:
Offer a list of options that are exhaustive and
mutually exclusive
Closed Questions…
For example
What is your marital status?
1. Single
2. Married/living together
3. Separated
4. Divorced
5. Widowed
Have you ever gone to the local village health worker for
treatment?
1. Yes
2. No
Closed questions may also be used if one does not want to
waste the time of the respondent and interviewer by
obtaining more information than one needs
Problems in gathering data
It is important to recognize some of the main problems that may
be faced when collecting data so that they can be addressed in the
selection of appropriate collection methods and in the training of
the staff involved
Common problems might include:
Language barriers
Lack of adequate time
Expense
Inadequately trained and experienced staff
Invasion of privacy
Suspicion (mistrust)
Bias (systematic, not random, error in a study that leads to an
incorrect estimate (RR) of the association between exposure and
disease)
Cultural norms (e.g. which may preclude (prevent) men
Methods of data collection summary
Types of data Data type by source Methods of data collection
primary Observation
Presentation of Results
For data to be more easily appreciated and to draw
quick comparisons, it is often useful to arrange the data
in the form of a table, or in one of a number of different
graphical forms
Statistical tables
• A statistical table is an orderly and systematic
presentation of numerical data in rows and
columns
Importance of statistical Tabulation
Statistical data arranged in tables have some definite
advantages over those descriptively stated
Parts of a table
a) Title
b) Captions
c) Stubs
d) Body
e) Head note
f) Foot note
g) Source
Most of the time, the first four parts are present in all tables, while
the presence of the remaining three depends upon the specific
purpose
Parts of a table cont...
a) Titles : It explains - What the data are about
- from where the data are collected
- time period of the data
- how the data are classified
b) Captions: The titles of the columns are given in captions.
In case there is a sub-division of any column there
would be sub-caption headings also
12 19 27 36 42 59
15 22 31 39 43 61
17 23 31 41 44 65
18 26 34 41 54 67
• Data contain information, and
summarization is a way of making it easier to
determine the nature of this information
• The actual summarization and organization of
data starts from frequency distribution
• Frequency distribution: A table which has a
list of each of the possible values that the data
can assume along with the number of times
each value occurs
Relative frequency: it is useful at times to know the proportion,
rather than the number, of values falling within a particular
class interval
a) Table for qualitative variable: Count the number of cases
in each category
- Example 1: The intensive care unit type of 25 patients entering
the ICU at a given hospital:
1. Medical
2. Surgical
3. Cardiac
4. Other
ICU Type    Frequency      Relative Frequency
            (How often)    (Proportionately often)
Medical     12             0.48
Surgical     6             0.24
Cardiac      5             0.20
Other        2             0.08
Total       25             1.00
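The relative frequencies in the ICU table can be reproduced with a short Python sketch. The counts are taken from the table; the function name relative_frequencies is only illustrative.

```python
# Counts per ICU type, taken from the table above.
icu_counts = {"Medical": 12, "Surgical": 6, "Cardiac": 5, "Other": 2}


def relative_frequencies(counts):
    """Divide each category count by the total number of cases."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


rel = relative_frequencies(icu_counts)
for category, count in icu_counts.items():
    # Print the category, its frequency, and its relative frequency.
    print(f"{category:<10}{count:>5}{rel[category]:>8.2f}")
```

The relative frequencies always add up to 1, which is a quick check that no category was missed.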
b) Table for Quantitative variable:
- Select a set of continuous, non-overlapping
intervals such that each value can be placed in
one, and only one, of the intervals
- The first consideration is how many intervals to
include
To determine the number of class intervals and the
corresponding width, we may use:
Sturges' rule:
$K = 1 + 3.322 \log_{10} n$
$W = \frac{L - S}{K}$
where
K = number of class intervals
n = no. of observations
W = width of the class interval
L = the largest value
S = the smallest value
Example:
– Leisure time (hours) per week for 40 college
students:
23 24 18 14 20 36 24 26 23 21 16 15 19 20 22
14 13 10 19 27 29 22 38 28 34 32 23 19 21 31
16 28 19 18 12 27 15 21 25 16
K = 1 + 3.322 (log 40) = 6.32 ≈ 6
Maximum value = 38, Minimum value = 10
Width = (38-10)/6 = 4.67 ≈ 5
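The calculation above can be checked with a minimal Python sketch. It assumes the usual form of Sturges' rule with the constant 3.322 and a base-10 logarithm, and it rounds the width up to the next whole number so that the classes cover the full range; the function name is illustrative.

```python
import math

# Leisure time (hours) per week for the 40 college students in the example.
leisure_hours = [
    23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20, 22,
    14, 13, 10, 19, 27, 29, 22, 38, 28, 34, 32, 23, 19, 21, 31,
    16, 28, 19, 18, 12, 27, 15, 21, 25, 16,
]


def sturges_classes(n):
    """Number of class intervals suggested by Sturges' rule (unrounded)."""
    return 1 + 3.322 * math.log10(n)


k_exact = sturges_classes(len(leisure_hours))
k = round(k_exact)                                   # number of class intervals
width_exact = (max(leisure_hours) - min(leisure_hours)) / k
width = math.ceil(width_exact)                       # round the width up (assumption)

print(f"K = {k_exact:.2f}, rounded to {k}")
print(f"W = {width_exact:.2f}, rounded up to {width}")
```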
Time (Hours)   Frequency   Relative Frequency   Cumulative Relative Frequency
10-14           5          0.125                0.125
15-19          11          0.275                0.400
20-24          12          0.300                0.700
25-29           7          0.175                0.875
30-34           3          0.075                0.950
35-39           2          0.050                1.000
Total          40          1.000
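As a rough sketch, the frequency table above can be rebuilt from the raw leisure-time data in Python. The starting value 10, the width 5 and the 6 classes come from the worked example; the function name frequency_table and the tuple layout of each row are illustrative choices.

```python
# Leisure time (hours) per week for the 40 college students in the example.
leisure_hours = [
    23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20, 22,
    14, 13, 10, 19, 27, 29, 22, 38, 28, 34, 32, 23, 19, 21, 31,
    16, 28, 19, 18, 12, 27, 15, 21, 25, 16,
]


def frequency_table(values, start, width, n_classes):
    """Return rows of (class label, frequency, relative freq., cumulative relative freq.)."""
    n = len(values)
    rows = []
    cumulative = 0
    for i in range(n_classes):
        lower = start + i * width
        upper = lower + width - 1      # class limits for whole-number data
        freq = sum(1 for v in values if lower <= v <= upper)
        cumulative += freq
        rows.append((f"{lower}-{upper}", freq, freq / n, cumulative / n))
    return rows


table = frequency_table(leisure_hours, start=10, width=5, n_classes=6)
for label, freq, rel, cum_rel in table:
    print(f"{label:<8}{freq:>4}{rel:>8.3f}{cum_rel:>8.3f}")
```

Each value falls in one, and only one, class because the class limits do not overlap.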
• Cumulative frequencies: obtained when the frequencies of two or
more successive classes are added together
Simple Frequency Distribution
• Primary and secondary cases of syphilis morbidity
by age, 1989
Age group Cases
(years) Number Percent
2. Diagrammatic Representation
Importance of diagrammatic representation:
Graphical Presentation…
Limitations of Graphical presentation
Diagrammatic presentation is used only for purposes of
comparison. It is not to be
used when comparison is either not possible or not
necessary
Diagrammatic presentation is not an alternative to
tabulation. It only strengthens the textual exposition
of a subject, and cannot serve as a complete substitute
for statistical data
It can give only an approximate idea and as such
where greater accuracy is needed diagrams will not
be suitable
They fail to bring to light small differences
General directions for the construction of diagrams
Specific types of graphs include:
For nominal and ordinal data:
• Bar graph
• Pie chart
For quantitative data:
• Histogram
• Stem-and-leaf plot
• Box plot
• Scatter plot
• Line graph
• Others
1. Bar charts (or graphs)
2.2 Sub-divided bar chart
• If there are different quantities forming the
sub-divisions of the totals, simple bars may
be sub-divided in the ratio of the various
sub-divisions to exhibit the relationship of
the parts to the whole
• The order in which the components are
shown in a “bar” is followed in all bars used
in the diagram
– Example: Stacked and 100% Component bar
charts
Example: Plasmodium species distribution for
confirmed malaria cases, Zeway, 2003
[Figure: 100% component bar chart. Vertical axis: Percent (0 to 100); horizontal axis: August, October, December 2003; components: P. falciparum, P. vivax, Mixed]
2.3. Multiple bar graph
• Bar charts can be used to represent the
relationships among more than two
variables
• The following figure shows the relationship
between children’s reports of breathlessness
and cigarette smoking by themselves and
their parents
Prevalence of self-reported breathlessness among school
children, 1998
[Figure: multiple bar graph. Vertical axis: Breathlessness, per cent (0 to 35); horizontal axis: Parents smoking (Neither, One, Both)]
We can see from the graph quickly that the prevalence of the symptoms
increases both with the child’s smoking and with that of their parents
2. Pie chart
• Shows the relative frequency for each category
by dividing a circle into sectors, the angles of
which are proportional to the relative frequency
• Used for a single categorical variable
• Pie chart is important for depicting discrete
variables with relatively few categories
• Use percentage distributions
Distribution of cause of death for females, in England and Wales, 1989
[Figure: pie chart with the following sectors]
Circulatory system: 42%
Neoplasms: 30%
Respiratory system: 13%
Others: 8%
Digestive system: 4%
Injury and poisoning: 3%
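Since the angle of each sector is proportional to its relative frequency, the angles for this pie chart can be worked out as in the following Python sketch. The percentages are those of the cause-of-death example; the function name sector_angles is illustrative.

```python
# Percentage of deaths by cause (females, England and Wales, 1989), from the example.
cause_percent = {
    "Circulatory system": 42,
    "Neoplasms": 30,
    "Respiratory system": 13,
    "Others": 8,
    "Digestive system": 4,
    "Injury and poisoning": 3,
}


def sector_angles(percentages):
    """Angle in degrees for each sector: 360 times the relative frequency."""
    total = sum(percentages.values())
    return {cause: 360 * value / total for cause, value in percentages.items()}


angles = sector_angles(cause_percent)
for cause, angle in angles.items():
    print(f"{cause:<22}{angle:>7.1f} degrees")
```

The angles add up to 360 degrees because the percentages add up to 100.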
3. Histogram
• A histogram is a special type of bar graph that
displays a frequency distribution with continuous class
intervals
• To construct a histogram, we draw the class
boundaries on a horizontal line and the frequencies on
a vertical line
[Figure: histogram. Vertical axis: No. of women (0 to 40); horizontal axis: Age group, with class boundaries 14.5-19.5, 19.5-24.5, 24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5, 44.5-49.5]
Two problems with histograms
1. They are somewhat difficult to construct
2. The actual values within the respective
groups are lost and difficult to reconstruct
4. Stem-and-Leaf Plot
• A quick way to organize data to give visual impression
similar to a histogram while retaining much more detail
on the data
• Example: 43, 28, 34, 61, 77, 82, 22, 47, 49, 51, 29, 36,
66, 72, 41
Stem | Leaf
2 | 2 8 9
3 | 4 6
4 | 1 3 7 9
5 | 1
6 | 1 6
7 | 2 7
8 | 2
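The stem-and-leaf plot above can be generated with a small Python sketch. It assumes two-digit whole numbers, so the tens digit is the stem and the units digit is the leaf; the function name stem_and_leaf is illustrative.

```python
from collections import defaultdict

# The 15 observations from the example above.
values = [43, 28, 34, 61, 77, 82, 22, 47, 49, 51, 29, 36, 66, 72, 41]


def stem_and_leaf(data):
    """Group sorted values by stem (tens digit); leaves are the units digits."""
    stems = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    return dict(stems)


plot = stem_and_leaf(values)
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

Unlike a histogram, every original value can be read back from the plot, for example stem 4 with leaves 1 3 7 9 gives 41, 43, 47 and 49.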
Example: 3031, 3101, 3265, 3260, 3245, 3200, 3248,
3323, 3314, 3484, 3541, 3649 (BWT in g)
5. Frequency polygon
• A frequency distribution can be portrayed graphically
in yet another way by means of a frequency polygon
• It is a special kind of line graph
[Figure: frequency polygon. Vertical axis from 200 to 600; plotted variable labelled N1AGEMOTH]
6. Ogive Curve (The Cumulative Frequency Polygon)
Cumulative frequency of 25 ICU patients
7. Scatter plot
• For two quantitative variables we use
bivariate plots (also called scatter plots or
scatter diagrams)
• A scatter diagram is constructed by drawing X- and Y-axes.
[Figure: scatter plot. Vertical axis: Saturation of bile (0 to 140); horizontal axis: Age (0 to 80)]
8. Line graph
• Useful for assessing the trend of a particular situation over time
• Helps in monitoring the trend of epidemics
• The time, in weeks, months or years, is marked along the
horizontal axis, and
• Values of the quantity being studied are marked on the vertical
axis
• Values for each category are connected by a continuous line
• Sometimes two or more lines are drawn on the same graph
using the same scale so that the plotted lines are comparable
No. of microscopically confirmed malaria cases by species and
month at Zeway malaria control unit, 2003
[Figure: line graph. Vertical axis: No. of confirmed malaria cases (0 to 2100); horizontal axis: Months (Jan to Dec); lines: Positive, P. falciparum, P. vivax]
Chapter 5
Summarizing Data
Properties of frequency distribution
Other than the mentioned advantages of using diagrams,
we use graphical representations to demonstrate three
properties of frequency distributions:
central location,
variation or dispersion, and
skewness
When we graph frequency distribution data, we
often find that a large part of
the observations are clustered around a central value. This
clustering is known as the central location or central
tendency of a frequency distribution
The central values that result from the various
methods are known collectively as measures of
central location
Fig. Three curves identical in shape with different central location
Measures of Central Tendency/ Measures of Location
Properties of the arithmetic mean
Uniqueness: For a given set of data there is one and only
one arithmetic mean
Simplicity: The mean is easily understood and easy to
compute
Center of gravity: The algebraic sum of the deviations of
the given values from their arithmetic mean is always
zero, i.e. $\sum (x_i - \bar{x}) = 0$. So, the mean is the center of gravity of
the given data set
Sensitivity: Since each and every value in a set of data
enters into the computation of the mean, it is greatly
affected by extreme values
So, in a skewed distribution, it is an undesirable measure of
central tendency
2. Median
An alternative measure of central location, perhaps second in
popularity to the arithmetic mean
Suppose there are n observations in a sample. If these
observations are ordered from smallest to largest, then the
median is defined as follows:
The median is a value such that at least half of the
observations are less than or equal to the median and at least
half of the observations are greater than or equal to the median.
Median means middle, and the median is the middle of a set
of data that has been put into rank order
To find the median of a data set:
Arrange the data in ascending order
Find the middle observation of this ordered data
Median…
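If n is odd: Median = the $\left(\frac{n+1}{2}\right)$th observation
If n is even: Median = the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations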
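The usual rule for the median (the middle observation when n is odd, the average of the two middle observations when n is even) can be sketched in Python as follows. The function name median_by_position is illustrative; the two data sets are the 15 values of the stem-and-leaf example and the 40 leisure-time values used earlier in these notes.

```python
def median_by_position(values):
    """Median as the middle observation (odd n) or the mean of the two middle ones (even n)."""
    ordered = sorted(values)            # arrange the data in ascending order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]          # the ((n+1)/2)th observation
    # average of the (n/2)th and (n/2 + 1)th observations
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_sample = [43, 28, 34, 61, 77, 82, 22, 47, 49, 51, 29, 36, 66, 72, 41]
even_sample = [
    23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20, 22,
    14, 13, 10, 19, 27, 29, 22, 38, 28, 34, 32, 23, 19, 21, 31,
    16, 28, 19, 18, 12, 27, 15, 21, 25, 16,
]

print(median_by_position(odd_sample))   # 8th of 15 ordered values: 47
print(median_by_position(even_sample))  # mean of the 20th and 21st ordered values: 21.0
```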
3. Mode
Mode is the value appearing most frequently
It can be obtained by counting the number of appearances of
each observation in the list
Important for summarising nominal/categorical types of data
Disadvantages:
With a small number of observations, there may be no mode.
In addition, there may sometimes be more than one mode,
such as when dealing with a bimodal (two-peaked)
distribution
Example
a. 22, 66, 69, 70, 73. (no modal value)
b. 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 (modal
value = 3.0 kg)
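Counting appearances to find the mode can be sketched in Python as below. The function name modes is illustrative, and returning an empty list when every value appears only once is a convention chosen here to match example (a); the last data set is made up purely to show a bimodal case.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means there is no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                       # every value appears once: no modal value
    return [value for value, count in counts.items() if count == highest]


example_a = [22, 66, 69, 70, 73]
example_b = [1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5]
bimodal = [1, 1, 2, 2, 3]               # hypothetical data with two modes

print(modes(example_a))  # []
print(modes(example_b))  # [3.0]
print(modes(bimodal))    # [1, 2]
```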
Properties of Mode
It is not affected by extreme values
It can be calculated for distributions with open
end classes
Often its value is not unique
The main drawback of mode is that often it
does not exist
Central Tendency cont---
Quartiles: quantiles which divide the distribution into
four equal parts
- The 25th percentile demarcates the first quartile (Q1)
- The median or 50th percentile demarcates the second
quartile (Q2)
- The 75th percentile demarcates the third quartile (Q3)
- The 100th percentile demarcates the fourth
quartile (Q4)
Skewness cont---
If extremely low or extremely high observations are
present in a distribution, then the mean tends to shift
towards those scores
Based on the type of skewness, distributions can be:
Symmetrical distribution: It is neither
positively nor negatively skewed. A curve is
symmetrical if one half of the curve is the mirror image of
the other half
If the distribution is symmetric and has only one
mode, all three measures (mean, median and mode) are the same, an
example being the normal distribution
Positively skewed distribution: Occurs when the
majority of scores are at the left end of the curve
and a few extreme large scores are scattered at the
right end
Negatively skewed distribution: Occurs when the
majority of scores are at the right end of the curve
and a few small scores are scattered at the left end
Summary
Given a set of observations, an investigator may
naturally ask which measure of central tendency is best
to use with the data
Two factors are important in making this decision:
1. The scale of measurement
2. The shape of the distribution of observations
Moreover, two or more sets may have the same mean
and/or median but they may be quite different in their variability
The range for the dataset can be computed by first
arranging all the observations in ascending order:
1.98, 2.02, 2.33, 2.33, 2.43, 2.51, 2.88, 2.98, 3.01, 3.25.
Range = Maximum - Minimum = 3.25 - 1.98 = 1.27
Because it is based upon only the two extreme cases in the entire
distribution, the range may be considerably changed if
either of the extreme cases happens to drop out, while
the removal of any other case would not affect it at all
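The sensitivity of the range to the extreme cases can be illustrated with a short Python sketch using the ten observations above. The function name data_range is illustrative.

```python
# The ten observations from the example, already in ascending order.
observations = [1.98, 2.02, 2.33, 2.33, 2.43, 2.51, 2.88, 2.98, 3.01, 3.25]


def data_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)


full_range = data_range(observations)

# Drop the largest observation (an extreme case): the range changes.
without_max = data_range(observations[:-1])

# Drop a middle observation (2.43): the range is unaffected.
without_middle = data_range([v for v in observations if v != 2.43])

print(round(full_range, 2), round(without_max, 2), round(without_middle, 2))
```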
IQR Cont---
Example …
1st quartile = The {1/4 (n+1)}th observation = (2.25) th
Variance:
While the inter-quartile range eliminates the problem of outliers
it creates another problem in that you are eliminating half of
your data
The solution to both problems is to measure variability from the
center of the distribution
• Mathematically, the formula for the sample variance is
defined as:
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
4. Standard Deviation
Standard Deviation:
The sample and population standard deviations are
denoted by S and σ (by convention) respectively
The standard deviation(S.D.), is just the positive
square root of the variance
It expresses exactly the same information as the
variance, but re-scaled to be in the same units as the
mean
The best measure of dispersion for normally distributed data
Mathematically, the population standard deviation is:
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
Sample data (in m²):
101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137,
140, 145
Example 1
Find the variance and standard deviation of the
above distribution
Solutions
The mean of the sample is 125 m².
Variance (sample):
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(101-125)^2 + (105-125)^2 + \dots + (145-125)^2}{15 - 1} = \frac{2502}{14} = 178.71\ \text{m}^4$
Hence, the standard deviation:
$s = \sqrt{178.71} = 13.37\ \text{m}^2$
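The arithmetic in Example 1 can be verified step by step with a minimal Python sketch. The 15 values are the sample listed before the example; the variable names are illustrative.

```python
import math

# The 15 sample values (in square metres) from Example 1.
areas = [101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145]

n = len(areas)
mean = sum(areas) / n                                  # sample mean
sum_sq_dev = sum((x - mean) ** 2 for x in areas)       # sum of squared deviations
variance = sum_sq_dev / (n - 1)                        # sample variance (divide by n - 1)
std_dev = math.sqrt(variance)                          # positive square root of the variance

print(mean)                  # 125.0
print(sum_sq_dev)            # 2502.0
print(round(variance, 2))    # 178.71
print(round(std_dev, 2))     # 13.37
```

Dividing by n - 1 rather than n is what distinguishes the sample variance from the population variance.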
Example 2
Consider the dataset about current age of women
which was collected from 240 women
The variance for the dataset can be computed as:
5. Coefficient of variation
The standard deviation is an absolute measure of deviation
of observations around their mean and is expressed in the
same units as the data
Because of this, the standard deviation is not directly
used for comparing variability between data sets
The coefficient of variation is often used for this purpose
The coefficient of variation (CV) is defined by:
$CV = \frac{S}{\bar{x}} \times 100\%$
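As a sketch, taking the coefficient of variation as the sample standard deviation divided by the mean, expressed as a percentage, it can be computed in Python as below. The data are the 15 values from Example 1; the function name and the conversion from square metres to square centimetres are illustrative, added to show that the CV does not depend on the unit of measurement.

```python
import statistics


def coefficient_of_variation(values):
    """CV = (sample standard deviation / mean) x 100, in per cent."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# The 15 sample values (in square metres) from Example 1.
areas_m2 = [101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145]

# The same measurements expressed in square centimetres (1 m2 = 10,000 cm2).
areas_cm2 = [x * 10_000 for x in areas_m2]

cv_m2 = coefficient_of_variation(areas_m2)
cv_cm2 = coefficient_of_variation(areas_cm2)

print(round(cv_m2, 2))    # 10.69
print(round(cv_cm2, 2))   # 10.69
```

Because the units cancel, the CV can be used to compare the variability of data sets measured in different units or with very different means.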
Summary
Data type vs Measure of central tendency and
dispersion
Thank you very much