
University of Gondar

College of Medicine and Health Science


department of Epidemiology and Biostatistics

Biostatistics for 4th year Psychiatry students

Rediet Eristu

Rediet .E (MPH in Biostat) 1


Course content
• Chapter 1: Introduction to Statistics / Biostatistics
• Chapter 2: Descriptive Statistics
• Chapter 3: Source of data and methods of data collection
• Chapter 4: Data organization and presentation
• Chapter 5: Summarization of data
• Chapter 6: Probability and probability distributions
• Chapter 7: Sampling techniques and sample size determination
• Chapter 8: Inference and estimation
• Chapter 9: Demography and health survey

What is expected from you

• Attendance at all classes (100%)
• If you miss three days of class, you will not be allowed to take the exam
• Participation in class
• Feedback on the lecturing method
• Submission of assignments on time

Methods of Evaluation
• Quizzes
• Assignment
• Tests (1 and 2)
• Final exam

Lesson objective
At the end of this chapter, you are expected to know:
• The definition and types of statistics
• Definitions of common terms in statistics
• The rationale for and limitations of statistics

Introduction
• What is statistics?
- We use statistics every day, often without realising it.
• Statistics: A field of study concerned with:
– the collection, organization, analysis, summarization and
interpretation of numerical data, and
– the drawing of inferences about a body of data when
only a small part of the data is observed

Biostatistics?
- The application of statistical methods to the fields of
biological and medical sciences. These methods make it
possible to methodically distinguish between true differences
among observations and random variations caused by
chance alone
· Concerned with the interpretation of biological data and the
communication of information derived from these data
· Has a central role in medical investigations

Rationale for studying Biostatistics
• Facts are now measured quantitatively in medicine and public
health
• The planning, conduct, and interpretation of much
medical/public health research are becoming increasingly
reliant on statistical technology. For example:
- Is this new drug or procedure better than the one commonly in
use? How much better?
- In testing a new drug, how many patients must be treated, and in
what manner, in order to demonstrate its worth?
- Which group of the population is more affected by
malaria or fistula?

Limitations of statistics
• It deals with aggregates of facts: no importance is attached to
individual items
• Statistical data are only approximately correct, not
mathematically exact

Phases of statistical investigation
I. Collection of data
II. Organization of data
III. Presentation of the data
IV. Analysis of data: The process of extracting
relevant information from the data
V. Inference

Types of Statistics
1. Descriptive statistics:
Descriptive statistics are methods for organizing and
summarizing data
• Help to identify the general features and trends in a set
of data and to extract useful information
• For example, tables or graphs are used to organize data,
and descriptive values such as the average score are
used to summarize data

Types of Statistics
2. Inferential statistics:
• Inferential statistics are methods for using sample data to
make general conclusions (inferences) about populations

• Because a sample is typically only a part of the whole
population, sample data provide only limited information
about the population. As a result, sample statistics are
generally imperfect representatives of the corresponding
population parameters
• Topics that are common in inferential
statistics: principles of probability, estimation, confidence
intervals, comparison of two or more means or proportions,
hypothesis testing, etc.

Basic Terms in statistics
• Population:
A population is the complete collection of all elements
(scores, people, measurements, and so on) to be
studied. The collection is complete in the sense that it
includes all subjects to be studied
• Target population:
– A collection of items that have something in
common and about which we wish to draw conclusions at a
particular time
• Study population:
The subset of the target population that has at least some
chance of being sampled
• Sample:
A subset of a study population, about which
information is actually obtained
• Sampling:
The technique of selecting a
representative portion of the entire population

E.g.: In a study of the prevalence
of HIV among adolescents in
Ethiopia, a random sample of
adolescents in Lideta Kifle
Ketema of Addis Ababa (AA) was included

• Target population: All adolescents in Ethiopia
• Study population: All adolescents in Addis Ababa
• Sample: Adolescents in Lideta Kifle Ketema who were included
in the study

[Figure: nested diagram showing the sample inside the study population, which lies inside the target population]

Basic terms cont . . .
o Census
• A census is the collection of data from every member
of the population
o Parameter
• A parameter is a numerical measurement describing
some characteristic of a population
o Statistic
• A statistic is a numerical measurement describing
some characteristic of a sample
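
The difference between a parameter and a statistic can be illustrated with a minimal Python sketch. The population values below are hypothetical (invented for illustration, not taken from any study), and the sample size of 5 and the random seed are arbitrary assumptions.

```python
import random
import statistics

# Hypothetical population: ages (years) of all 20 members of a small group.
# These values are invented for illustration only.
population = [15, 16, 17, 18, 19, 15, 16, 17, 18, 19,
              14, 15, 16, 17, 18, 19, 20, 16, 17, 18]

# Parameter: a numerical measurement describing the whole population.
population_mean = statistics.mean(population)

# Statistic: the same measurement computed from a sample only.
rng = random.Random(1)              # fixed seed so the run is repeatable
sample = rng.sample(population, 5)  # simple random sample of 5 members
sample_mean = statistics.mean(sample)

print("Parameter (population mean):", population_mean)
print("Statistic (sample mean):", sample_mean)
```

The sample mean will usually differ somewhat from the population mean, which is why a statistic is treated as an estimate of the parameter rather than as its exact value.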

Basic terms cont . . .
• Data are observations (such as measurements,
genders, survey responses) that have been collected
• It is the raw material for statistics
• Can be obtained from:
– Routinely kept records, literature
– Surveys
– Counting
– Experiments
– Reports
– Observation
– Etc
Haileab.f (Bsc.) 19
Chapter 2

Descriptive Statistics

Descriptive Statistics
• Techniques used to organize and summarize a
set of data in a concise way
– Organization of data
– Summarization of data
– Presentation of data
• Numbers that have not been summarized and
organized are called raw data

Variable
• Variable: A characteristic which takes different values in
different persons, places, or things

• Any aspect of an individual or object that is measured (e.g.,
BP) or recorded (e.g., age, sex) and can take different values
• There may be one variable in a study or many
• E.g., a study of the treatment outcome of TB
• Variables can be broadly classified into:
– Categorical (or qualitative) variables, or
– Quantitative (or numerical) variables
• Categorical variable: A variable or characteristic which
cannot be measured in quantitative form but can only be
sorted by name or category
E.g., sex, marital status, etc.
• Quantitative variable: A variable that can be measured
(or counted) and expressed numerically.
E.g., height, weight, number of children, etc.

Quantitative variables are of two types:
1. Discrete variable: It can only have a limited number of
distinct values (usually whole numbers).
– E.g., the number of pregnancies a mother has had in her life. You
can’t have 2.5 pregnancies
• Characterized by gaps or interruptions in the values
(integers).
• Can assume only whole numbers

2. Continuous variable: It can have an infinite number of
possible values in any given interval.
• Can take any value within a defined range

• Does not possess gaps or interruptions
• Examples: blood pressure, height, weight, time. Weight
is continuous since it can take on any value within a range
(e.g., 34.575 kg).

Scales of measurement

• Not all measurements are the same.
• Measuring weight, e.g., 40 kg
• Measuring the status of a patient on a scale, e.g.,
“improved”, “stable”, “not improved”
• There are four types of scales of measurement

1. Nominal scale:
• The simplest type of data, in which the values fall
into unordered categories or classes
• Consists of “naming” observations or classifying
them into various mutually exclusive and
collectively exhaustive categories
• Uses names, labels, or symbols to assign each
measurement.
– Examples: Blood type, sex, race, marital status,
etc.
• If nominal data can take on only two possible
values, they are called dichotomous or binary
2. Ordinal scale:
• Assigns each measurement to one of a limited
number of categories that are ranked in terms of
order
• Although non-numerical, can be considered to have
a natural ordering
• Examples: Patient status, cancer stages, social
class, Likert scales etc.

Example of ordinal scale:

• Pain level:
1. None
2. Mild
3. Moderate
4. Severe
• The numbers have LIMITED meaning: 4>3>2>1 is all we
know apart from their utility as labels

3. Interval scale:
- Measured on a continuum, and differences between any
two numbers on the scale are of known size
Example: Temperature in °F on 4 consecutive days
Days:      A   B   C   D
Temp. °F:  50  55  60  65
For these data, not only is day A with 50° cooler than
day D with 65°, but it is 15° cooler
- It has no true zero point. “0” is arbitrarily chosen and
doesn’t reflect the absence of temperature

4. Ratio scale:
- Measurement begins at a true zero point and the
scale has equal intervals
- Examples: Height, age, weight, BP, etc.
• Note on the meaningfulness of “ratio”:
– Someone who weighs 80 kg is two times as heavy as
someone else who weighs 40 kg. This is true even if
weight had been measured in other units
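
The contrast between interval and ratio scales can be checked numerically. The sketch below is an illustration only: it reuses the 50 °F and 65 °F temperatures and the 40 kg and 80 kg weights from the slides, and the function names and the kilogram-to-pound factor (about 2.20462) are assumptions introduced here.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9

def kg_to_pounds(weight_kg):
    """Convert kilograms to pounds (approximate conversion factor)."""
    return weight_kg * 2.20462

# Interval scale: the ratio of two temperatures depends on the unit used,
# because the zero point of each temperature scale is arbitrary.
ratio_f = 65 / 50
ratio_c = fahrenheit_to_celsius(65) / fahrenheit_to_celsius(50)
print("Temperature ratio in Fahrenheit:", round(ratio_f, 2))
print("Temperature ratio in Celsius:", round(ratio_c, 2))

# Ratio scale: the ratio of two weights is the same in any unit,
# because zero weight means the same thing in every unit.
ratio_kg = 80 / 40
ratio_lb = kg_to_pounds(80) / kg_to_pounds(40)
print("Weight ratio in kg:", ratio_kg)
print("Weight ratio in pounds:", round(ratio_lb, 2))
```

Because the temperature ratio changes with the unit, statements such as "twice as hot" are not meaningful on an interval scale, whereas "twice as heavy" is meaningful on a ratio scale.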

Scales of Measurement

• Nominal = Naming
• Ordinal = Naming + Order
• Interval = Naming + Order + Equal Intervals
• Ratio = Naming + Order + Equal Intervals + True
Zero

Degree of precision in measuring (lowest to highest)

Nominal < Ordinal < Interval < Ratio

Exercise:- Consider the following Scales of measurement
(types of data) and answer questions A to D
1. Blood group
2. Temperature (Celsius)
3. Sex
4. Job satisfaction index (1-5)
5. Number of heart attacks
6. Calendar year
7. Serum uric acid (mg/100ml)
8. Number of cases of each reportable disease reported by a
health worker
9. The average weight gain of six 1-year old dogs with a special
diet supplement was 950 grams last month.
10. Injury severity (a score between 1 and 3 is allocated
depending on the severity); scores 1 and 3 indicate mild and
very severe injury, respectively

Exercise cont.
A) Identify the type of data (nominal, ordinal, interval, ratio).
Confirm your answers by giving your own examples
B) Identify the types of data which are qualitative and
quantitative
C) Identify the types of data which are numerical discrete and
numerical continuous
D) Which nominal scales are dichotomous? Which ones are
multichotomous?

Chapter 3

Source of data and methods of data collection

Source of Data

[Diagram: sources of data, divided into internal and external sources and into primary and secondary sources]

Internal and External Source of Data

Internal Sources of Data
• Many institutions and departments have information about their
regular functions, for their own internal purposes
• When that information is used in any survey, it is
called an internal source of data
• E.g., public health institutes and nursing association
members, etc.

External Sources of Data
o When information is collected from outside
agencies, it is called an external source of data
o Such data are either primary or secondary
o This type of information can be collected by census
or sampling methods, by conducting surveys

Primary Data
• Primary data are those which are collected for
the first time
• They are real-time data collected by the
researcher himself or herself
• This is the process of collecting and making
use of the data
• These data are originated by the researcher
specifically to address the research problem

Methods of Collecting Primary Data
1. Direct personal investigation (i.e., interview
method)
2. Indirect oral investigation (i.e., through
enumerators)
3. Investigation through local reporters
4. Investigation through mailed questionnaire
5. Investigation through observation

Secondary Data
• Secondary data are those that have already been
collected by others
• These are usually in journals, periodicals, research
publications, official records etc.
• Secondary data may be available in the published
or unpublished form
• When it is not possible to collect the data by a
primary method, the investigator goes for a secondary
method
• These data were collected for some purpose other than
the problem at hand

Methods of Collecting Secondary Data
1. Published sources
a) International publications
b) Government publications
c) Commercial, research and educational institutions,
unions, organizations, etc.
2. Unpublished sources

Difference between Primary and Secondary Data

Primary Data                           Secondary Data
• Real-time data                       • Past data
• Sure about sources of data           • Not sure about sources of data
• Help to give results/findings        • Help in refining the problem
• Costly and time-consuming process    • Cheap and not time-consuming process
• Avoid bias in response data          • Cannot know whether the data are biased or not
• More flexible                        • Less flexible

• Data collection methods?

Data collection methods
• Before any statistical work can be done, data must be
collected.
• Data collection is a crucial stage in the planning and
implementation of a study.
• Data collection techniques allow us to systematically
collect data about our objects of study (people, objects,
and phenomena) and about the setting in which they
occur.
• In the collection of data we have to be systematic. If
data are collected haphazardly (lacking order or organization), it
will be difficult to answer our research questions in a
conclusive way.

Data collection methods…

The choice of methods of data collection is based on:
♣ The accuracy of the information they will yield
♣ Practical considerations, such as the need for
personnel, time, equipment and other facilities, in
relation to what is available
♣ The types of data/information to be collected

Data collection methods…

• Accuracy and “practicability” are often inversely
correlated. A method providing more satisfactory
information will often be a more expensive or
inconvenient one
♣ Therefore, accuracy must be balanced against
practical considerations (resources and other practical
limitations)

Data collection methods…
• For quantitative data, we usually use questionnaires
(standard or structured)
- The questionnaire could be self-administered or
interviewer-administered (either face-to-face, by
telephone, or through other electronic media such as
the internet)

Data collection methods…

- A self-administered questionnaire is filled in by the study
subjects themselves, on the spot or by mail
- Self-administered questionnaires are suitable for
literate study subjects, simple questions that don't
need further clarification, and sensitive matters (e.g.
sexual issues, criminal activities, substance abuse)

Data collection methods…

- Interviewer-administered questionnaires are suited
for illiterate study subjects, complex questions that
need further clarification, non-private or non-sensitive
issues, and situations in which the emotional
reactions of study subjects are to be recorded

Data collection methods…

• For qualitative data, the common methods of
collection are focus group discussions, in-depth
interviews (unstructured/semi-structured), observation
(participant/non-participant), and case studies

Types of Questions

• Depending on how questions are asked and recorded, we
can distinguish two major possibilities: open-ended
questions and closed questions
Open-ended questions
Open-ended questions permit free responses that should
be recorded in the respondent’s own words. The
respondent is not given any possible answers to choose
from
Such questions are useful to obtain information on:
• Facts with which the researcher is not very familiar,
• Opinions, attitudes, and suggestions of informants, or
• Sensitive issues

Open-ended questions…

For example:
• Can you describe exactly what the traditional birth
attendant did when your labor started?
• What do you think are the reasons for a high drop-out
rate of village health committee members?
• What would you do if you noticed that your daughter
(a school girl) had a relationship with a teacher?

Closed Questions
• Closed questions offer a list of possible options or
answers from which the respondents must choose
• When designing closed questions one should try to:
– Offer a list of options that are exhaustive and
mutually exclusive
– Keep the number of options as small as possible
• Closed questions are useful if the range of possible
responses is known

Closed Questions…

For example:
• What is your marital status?
1. Single
2. Married/living together
3. Separated
4. Divorced
5. Widowed
• Have you ever gone to the local village health worker for
treatment?
1. Yes
2. No
Closed questions may also be used if one does not want to
waste the time of the respondent and interviewer by
obtaining more information than one needs

Problems in gathering data
• It is important to recognize some of the main problems that may
be faced when collecting data so that they can be addressed in the
selection of appropriate collection methods and in the training of
the staff involved
• Common problems might include:
– Language barriers
– Lack of adequate time
– Expense
– Inadequately trained and experienced staff
– Invasion of privacy
– Suspicion (mistrust)
– Bias: systematic (not random) error in a study that leads to an
incorrect estimate (e.g., of the relative risk, RR) of the association between exposure and
disease
– Cultural norms (e.g. which may preclude (prevent) men

Methods of data collection summary

Types of data   Data type by source   Methods of data collection
Qualitative     Primary               FGD
                Primary               In-depth interview
                Primary               Observation
Quantitative    Primary / secondary   Questionnaires
                                      - open/closed
                                      - structured
                                      - self/interviewer administered
                Primary / secondary   - Observation
                                      - Use of documentary sources

Chapter 4

Methods of Data Organization and Presentation

Presentation of Results
• For data to be more easily appreciated and to draw
quick comparisons, it is often useful to arrange the data
in the form of a table, or in one of a number of different
graphical forms
• Quite often, the presentation of data in a meaningful
way is done by preparing a frequency distribution. If
this is not done, the raw data will not convey any
meaning, and patterns in them (if any) may not be
detected

Statistical tables
• A statistical table is an orderly and systematic
presentation of numerical data in rows and
columns

- Tables are often the best way to show small data
sets

Importance of statistical Tabulation
Statistical data arranged in tables have some definite
advantages over those descriptively stated

a) Tabulated data can be more easily understood than facts stated
in the form of a description
b) They leave a lasting impression
c) They facilitate comparison (make it easier to compare)
d) Statistical tables make the summation of items
and the detection of errors and omissions easier
e) When data are tabulated, all unnecessary details and
repetitions are avoided

Parts of a table
a) Title
b) Captions
c) Stubs
d) Body
e) Head note
f) Foot note
g) Source

Most of the time, the first four parts are present in all tables, while
the presence of the remaining three depends upon the specific
purpose

Parts of a table cont...
a) Title: It explains
- what the data are about
- from where the data were collected
- the time period of the data
- how the data are classified
b) Captions: The titles of the columns are given in captions.
If a column is sub-divided, there
will be sub-caption headings also
c) Stubs: The titles of the rows are called stubs
d) Body: Contains the numerical data
e) Head note: A statement below the title which clarifies the content
of the table
f) Foot note: A statement below the table which clarifies some
specific items given in the table, e.g. it explains omissions, etc.
g) Source: The source of the data should be stated

Constructing a statistical table

- There are no hard and fast rules, but the following guidelines help:
a) The title of the table with its four aspects explained earlier
should be written at the top. If the data are not original, their
source should be given below the table
b) Figures to be compared should be placed in adjacent
columns or rows
c) Emphasis should be given to important items while
constructing the table. Such items should be properly
placed in the table. The most prominent position is the top-most
row and the extreme "left" column
d) If the rows are very long, then the stub headings should also
be given at the right side of the table

Constructing a statistical table cont...
e) In headings, whenever possible use the singular. A
column of years should be headed 'Year' and not
'Years'
f) A statistical table must contain a sub-total for each
separate class of data and a grand total for all
combined classes
g) When necessary, unit designations must be written at
the top of the columns
h) Overall, tables should be clearly labelled. The reader
should be able to determine without difficulty what
is tabulated

1. Frequency Distributions (Tables)
• Ordered array: A simple arrangement of individual observations in the order of
magnitude
• Very difficult with large sample size

12 19 27 36 42 59
15 22 31 39 43 61
17 23 31 41 44 65
18 26 34 41 54 67

• Data contain information, and
summarization is a way of making it easier to
determine the nature of this information
• The actual summarization and organization of
data starts from frequency distribution
• Frequency distribution: A table which has a
list of each of the possible values that the data
can assume along with the number of times
each value occurs

Relative frequency: at times it is useful to know the proportion,
rather than the number, of values falling within a particular
class interval
a) Table for qualitative variable: Count the number of cases
in each category
- Example 1: The intensive care unit (ICU) type of 25 patients entering
the ICU at a given hospital:
1. Medical
2. Surgical
3. Cardiac
4. Other

ICU Type    Frequency      Relative Frequency
            (How often)    (Proportionately often)
Medical     12             0.48
Surgical    6              0.24
Cardiac     5              0.20
Other       2              0.08
Total       25             1.00
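
The relative frequencies in the ICU table can be reproduced with a short Python sketch. It starts from the category counts given in the table (the individual patient records are not shown in the slides), and the variable names are illustrative assumptions.

```python
# Frequencies of ICU type for the 25 patients, taken from the table above.
icu_counts = {"Medical": 12, "Surgical": 6, "Cardiac": 5, "Other": 2}

total = sum(icu_counts.values())

# Relative frequency = category frequency / total number of observations.
relative = {icu_type: count / total for icu_type, count in icu_counts.items()}

for icu_type, count in icu_counts.items():
    print(f"{icu_type:<10}{count:>4}{relative[icu_type]:>8.2f}")
print(f"{'Total':<10}{total:>4}{sum(relative.values()):>8.2f}")
```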

b) Table for Quantitative variable:
- Select a set of continuous, non-overlapping
intervals such that each value can be placed in
one, and only one, of the intervals
- The first consideration is how many intervals to
include

To determine the number of class intervals and the
corresponding width, we may use:

Sturges' rule:
$$K = 1 + 3.322 \log_{10} n$$
$$W = \frac{L - S}{K}$$
where
K = number of class intervals      n = number of observations
W = width of the class interval    L = the largest value
S = the smallest value

Example:
– Leisure time (hours) per week for 40 college
students:
23 24 18 14 20 36 24 26 23 21 16 15 19 20 22
14 13 10 19 27 29 22 38 28 34 32 23 19 21 31
16 28 19 18 12 27 15 21 25 16
K = 1 + 3.322 (log 40) = 6.32 ≈ 6
Maximum value = 38, Minimum value = 10
Width = (38 - 10)/6 = 4.67 ≈ 5
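
The calculation above can be sketched in Python as follows. The function names are illustrative assumptions. Rounding K to the nearest whole number and rounding the width up to the next whole number follow what the worked example does (6.32 to 6 and 4.67 to 5), not a fixed rule.

```python
import math

def sturges_classes(n):
    """Unrounded number of class intervals: K = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)

def class_width(largest, smallest, k):
    """Unrounded class width: W = (L - S) / K."""
    return (largest - smallest) / k

k_raw = sturges_classes(40)         # about 6.32 for n = 40
k = round(k_raw)                    # nearest whole number of classes
w_raw = class_width(38, 10, k)      # about 4.67
w = math.ceil(w_raw)                # round up so the classes cover all values

print(f"K = {k_raw:.2f}, rounded to {k}")
print(f"W = {w_raw:.2f}, rounded up to {w}")
```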

Time       Frequency   Relative    Cumulative
(Hours)                Frequency   Relative Frequency
10-14      5           0.125       0.125
15-19      11          0.275       0.400
20-24      12          0.300       0.700
25-29      7           0.175       0.875
30-34      3           0.075       0.950
35-39      2           0.050       1.000
Total      40          1.00
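
The frequency table above can be rebuilt from the 40 raw leisure-time values with a minimal Python sketch. The class limits are those used in the table, and the variable names are illustrative assumptions.

```python
# Leisure time (hours) per week for the 40 college students in the example.
hours = [23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20, 22,
         14, 13, 10, 19, 27, 29, 22, 38, 28, 34, 32, 23, 19, 21, 31,
         16, 28, 19, 18, 12, 27, 15, 21, 25, 16]

# Class limits (inclusive) with a width of 5, as in the table.
classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]

n = len(hours)
rows = []
cumulative = 0
for lower, upper in classes:
    # Count the observations falling inside this class interval.
    frequency = sum(lower <= h <= upper for h in hours)
    cumulative += frequency
    rows.append((f"{lower}-{upper}", frequency, frequency / n, cumulative / n))

for label, frequency, rel, cum_rel in rows:
    print(f"{label:<8}{frequency:>4}{rel:>8.3f}{cum_rel:>8.3f}")
```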

• Cumulative frequency: The sum of the frequencies of a class
and of all classes below it
• Cumulative relative frequency: The proportion of the total
number of observations that have a value either in that
interval or below it
• Mid-point: The value of the interval which lies midway
between the lower and the upper limits of a class
• True limits: Those limits that make an interval of a
continuous variable continuous in both directions
• Used for smoothing of the class intervals
• Subtract 0.5 from the lower limit
and add it to the upper limit

Time (Hours)   True limit     Mid-point   Frequency
10-14          9.5 – 14.5     12          5
15-19          14.5 – 19.5    17          11
20-24          19.5 – 24.5    22          12
25-29          24.5 – 29.5    27          7
30-34          29.5 – 34.5    32          3
35-39          34.5 – 39.5    37          2
Total                                     40
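
The true limits and mid-points in the table can be computed as in the following Python sketch. The function names are illustrative assumptions, and the default adjustment of 0.5 assumes the data are recorded to the nearest whole unit, as in this example.

```python
def true_limits(lower, upper, unit=1):
    """True class limits: subtract half a unit from the lower limit
    and add half a unit to the upper limit."""
    return (lower - unit / 2, upper + unit / 2)

def mid_point(lower, upper):
    """Value midway between the lower and upper limits of a class."""
    return (lower + upper) / 2

classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]

for lower, upper in classes:
    low_true, up_true = true_limits(lower, upper)
    print(f"{lower}-{upper}: true limits {low_true} to {up_true}, "
          f"mid-point {mid_point(lower, upper)}")
```

Note that the upper true limit of one class equals the lower true limit of the next, which is what makes the intervals continuous.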

Simple Frequency Distribution
• Primary and secondary cases of syphilis morbidity
by age, 1989
Age group (years)   Number of cases   Percent
0-14                230               0.5
15-19               4378              10.0
20-24               10405             23.6
25-29               9610              21.8
30-34               8648              19.6
35-44               6901              15.7
45-54               2631              6.0
>54                 1278              2.9
Total               44081             100

Two Variable Table
• Primary and secondary cases of syphilis morbidity
by age and sex, 1989
Age group           Number of cases
(years)             Male      Female    Total
0-14                40        190       230
15-19               1710      2668      4378
20-24               5120      5285      10405
25-29               5304      4306      9610
30-34               5537      3111      8648
35-44               5004      1897      6901
45-54               2144      487       2631
>54                 1147      131       1278
Total               26006     18075     44081

Tables can also be used to present
three or more variables
Variable Frequency (n) Percent
Sex
Male
Female
Age (yrs)
15-19
20-24
25-29
Religion
Christian
Muslim
Occupation
Student
Farmer
Merchant

2. Diagrammatic Representation

Pictorial representations of numerical data

Importance of diagrammatic representation:

1. Diagrams have greater attraction than
mere figures
2. They give a quick overall impression of the
data
3. They have greater memorizing value than
mere figures
4. They facilitate comparison
5. They are used to understand patterns and trends

Graphical Presentation…
Limitations of graphical presentation
• The technique of diagrammatic presentation is made
use of only for purposes of comparison. It is not to be
used when comparison is either not possible or not
necessary
• Diagrammatic presentation is not an alternative to
tabulation. It only strengthens the textual exposition
of a subject, and cannot serve as a complete substitute
for statistical data
• It can give only an approximate idea, and as such,
where greater accuracy is needed, diagrams will not
be suitable
• They fail to bring to light small differences

General directions for the construction of diagrams

• The first thing is the selection of a proper scale. All the
significant characteristics of the figures should be clearly
exhibited by the diagram, and it should also suit the size
of the paper
• The vertical and horizontal scales should be clearly shown
on the diagram itself: the former on the left-hand side
and the latter at the bottom of the diagram
• Neatness should be strictly observed and the diagram should be
drawn with the aid of geometrical instruments

Directions cont…
• The heading (title) should be written in bold letters and
should be self-explanatory. The source should be
indicated if the data were not collected by yourself
• Various shades of colours can be used
• Diagrams should be as simple as possible so that the
reader can understand their meaning clearly and easily

N.B. The graph must present a truthful impression
of the data

Specific types of graphs include:
For nominal and ordinal data:
• Bar graph
• Pie chart
For quantitative data:
• Histogram
• Stem-and-leaf plot
• Box plot
• Scatter plot
• Line graph
• Others

1. Bar charts (or graphs)

• The bar graph is especially suitable for nominal
and ordinal data
• Categories are listed on the horizontal axis (X-axis)
• Frequencies or relative frequencies are
represented on the Y-axis (ordinate)
• The heights of the bars represent the value of the
frequency (actual number or percentage) for
each category

1.1 Simple bar chart: It is a one-dimensional
diagram
E.g. Bar chart for the type of ICU for 25
patients

1.2 Sub-divided bar chart
• If there are different quantities forming the
sub-divisions of the totals, simple bars may
be sub-divided in the ratio of the various
sub-divisions to exhibit the relationship of
the parts to the whole
• The order in which the components are
shown in a “bar” is followed in all bars used
in the diagram
– Example: Stacked and 100% Component bar
charts

Example: Plasmodium species distribution for
confirmed malaria cases, Zeway, 2003

[Figure: 100% component bar chart. Vertical axis: percent (0 to 100); horizontal axis: August, October and December 2003; bar components: P. falciparum, P. vivax and mixed]

1.3 Multiple bar graph
• Bar charts can be used to represent the
relationships among more than two
variables
• The following figure shows the relationship
between children’s reports of breathlessness
and cigarette smoking by themselves and
their parents

Prevalence of self-reported breathlessness among school
children, 1998

[Figure: multiple bar graph. Vertical axis: breathlessness, per cent (0 to 35); horizontal axis: parents smoking (neither, one, both); bars: child never smoked, child smoked occasionally, child smoked one/week or more]

We can see quickly from the graph that the prevalence of the symptom
increases both with the child’s smoking and with that of their parents

2. Pie chart
• Shows the relative frequency for each category
by dividing a circle into sectors, the angles of
which are proportional to the relative frequency
• Used for a single categorical variable
• The pie chart is useful for depicting discrete
variables with relatively few categories
• Uses percentage distributions

Distribution of cause of death for females in England and Wales, 1989

[Figure: pie chart with the following sectors]
Circulatory system      42%
Neoplasms               30%
Respiratory system      13%
Others                  8%
Digestive system        4%
Injury and poisoning    3%
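
Since the angle of each sector is proportional to the relative frequency, the sector angles for the cause-of-death percentages in this figure can be worked out as in the sketch below. The variable names are illustrative assumptions.

```python
# Percentages read from the pie chart of causes of death (females, 1989).
cause_percent = {
    "Circulatory system": 42,
    "Neoplasms": 30,
    "Respiratory system": 13,
    "Others": 8,
    "Digestive system": 4,
    "Injury and poisoning": 3,
}

# Sector angle = relative frequency x 360 degrees.
angles = {cause: pct / 100 * 360 for cause, pct in cause_percent.items()}

for cause, angle in angles.items():
    print(f"{cause:<22}{cause_percent[cause]:>4}%{angle:>8.1f} degrees")
print("Sum of angles:", round(sum(angles.values()), 1))
```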

3. Histogram
• Histograms are a special type of bar graph in which
frequency distributions with continuous class
intervals are turned into graphs
• To construct a histogram, we draw the class
boundaries on the horizontal axis and the frequencies on
the vertical axis
• Non-overlapping intervals that cover all of the data
values must be used
• The area of each bar is proportional to the frequency
of observations in the interval

Example: Distribution of the age of women at the time of marriage

Age group   15-19   20-24   25-29   30-34   35-39   40-44   45-49
Number      11      36      28      13      7       3       2

[Figure: histogram titled "Age of women at the time of marriage". Vertical axis: number of women (0 to 40); horizontal axis: age group, with class boundaries 14.5-19.5, 19.5-24.5, 24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5 and 44.5-49.5]

Two problems with histograms
1. They are somewhat difficult to construct
2. The actual values within the respective
groups are lost and difficult to reconstruct

⇒ The other graphic display (stem-and-leaf
plot) overcomes these problems

4. Stem-and-Leaf Plot
• A quick way to organize data to give a visual impression
similar to a histogram while retaining much more detail
on the data
• Similar to a histogram: it serves the same purpose and
reveals the presence or absence of symmetry
• Its advantage over the histogram is that it preserves the
information contained in the individual items
• Most effective with relatively small data sets

Example

• 43, 28, 34, 61, 77, 82, 22, 47, 49, 51, 29, 36,
66, 72, 41
Stem | Leaf
2 | 2 8 9
3 | 4 6
4 | 1 3 7 9
5 | 1
6 | 1 6
7 | 2 7
8 | 2
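
The plot above can be produced with a minimal Python sketch. The function name is an illustrative assumption, and splitting each value into a tens digit (stem) and a units digit (leaf) assumes two-digit whole numbers, as in this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values by stem (tens digit), keeping the units digits as leaves."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)

data = [43, 28, 34, 61, 77, 82, 22, 47, 49, 51, 29, 36, 66, 72, 41]
plot = stem_and_leaf(data)

for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

Every original value can be read back from the plot by joining a stem with one of its leaves, which is the advantage over a histogram noted earlier.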

Example: 3031, 3101, 3265, 3260, 3245, 3200, 3248,
3323, 3314, 3484, 3541, 3649 (birth weight, BWT, in g)

Stem   Leaf              Number
30     31                1
31     01                1
32     65 60 45 00 48    5
33     23 14             2
34     84                1
35     41                1
36     49                1

5. Frequency polygon
• A frequency distribution can be portrayed graphically
in yet another way by means of a frequency polygon
• It is special kind of line graph

• To draw a frequency polygon, we connect the mid-points
of the tops of the bars of the histogram by
straight lines (i.e. by connecting the class mid-points)
• The total area under the frequency polygon is equal to
the area under the histogram
• Useful when comparing two or more frequency
distributions by drawing them on the same diagram

Frequency polygon for the ages of 2087 mothers with <5
children, Adami Tulu, 2003

[Figure: frequency polygon. Vertical axis: frequency (0 to 700); horizontal axis: age of mother (N1AGEMOTH), 15.0 to 55.0; Mean = 27.6, Std. Dev = 6.13, N = 2087]

6. Ogive Curve (The Cumulative Frequency Polygon)

• Sometimes it may be necessary to know the number of
items whose values are more or less than a certain
amount
• We may, for example, be interested to know the number of
patients whose weight is <50 kg or >60 kg
• To get this information it is necessary to change the
form of the frequency distribution from a ‘simple’ to a
‘cumulative’ distribution
• The ogive curve turns a cumulative frequency distribution into
a graph
• Are much more common than frequency polygons

Rediet.E 101
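As a minimal sketch of the change from a 'simple' to a 'cumulative' distribution, the Python snippet below accumulates class frequencies. The class boundaries and frequencies are hypothetical values invented for illustration; they are not data from the lecture.

```python
from itertools import accumulate

# Hypothetical weight classes: upper class boundaries (kg) and simple frequencies
upper_boundaries = [39.5, 49.5, 59.5, 69.5, 79.5]
frequencies = [3, 7, 9, 4, 2]

# 'Less than' cumulative frequencies: running total of the simple frequencies
less_than = list(accumulate(frequencies))

# 'More than' cumulative frequencies: what remains above each boundary
total = less_than[-1]
more_than = [total - c for c in less_than]

for boundary, below, above in zip(upper_boundaries, less_than, more_than):
    print(f"weight < {boundary}: {below:2d}   weight > {boundary}: {above:2d}")
```

Plotting `less_than` (or `more_than`) against the upper class boundaries gives the ogive.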
[Figure: Cumulative frequency of 25 ICU patients]
7. Scatter plot
• Most studies in medicine involve measuring more than one characteristic, and graphs displaying the relationship between two characteristics are common in the literature
• When both variables are qualitative, we can use a multiple bar graph
• When one of the characteristics is qualitative and the other is quantitative, the data can be displayed in box-and-whisker plots
• For two quantitative variables we use bivariate plots (also called scatter plots or scatter diagrams)
• In the study on percentage saturation of bile, information was collected on the age of each patient to see whether a relationship existed between the two measures
• A scatter diagram is constructed by drawing X- and Y-axes
• Each point or dot (•) represents a pair of values measured for a single study subject

Age and percentage saturation of bile for women patients in hospital Z, 1998
[Figure: scatter plot; x-axis: Age, 0 to 80; y-axis: Saturation of bile, 0 to 160]
8. Line graph
• Useful for assessing the trend of a particular situation over time
• Helps in monitoring the trend of epidemics
• The time, in weeks, months or years, is marked along the horizontal axis, and
• Values of the quantity being studied are marked on the vertical axis
• Values for each category are connected by a continuous line
• Sometimes two or more lines are drawn on the same graph, using the same scale, so that they are comparable
No. of microscopically confirmed malaria cases by species and month at Zeway malaria control unit, 2003
[Figure: line graph; x-axis: Months, Jan to Dec; y-axis: No. of confirmed malaria cases, 0 to 2100; lines: Positive, P. falciparum, P. vivax]
Chapter 5

Summarizing Data

Properties of frequency distribution
Besides the advantages of diagrams already mentioned, we use graphical representations to demonstrate three properties of frequency distributions:
• central location,
• variation or dispersion, and
• skewness
When we graph a frequency distribution, we often find that a large part of the observations cluster around a central value. This clustering is known as the central location or central tendency of a frequency distribution
• The central values that result from the various methods are known collectively as measures of central location
• Frequency distributions of some characteristics of human populations tend to be symmetrical. On the other hand, the graph of suicide data was asymmetrical (not symmetrical). A distribution that is asymmetrical is said to be skewed
[Figure: Three curves identical in shape with different central locations]
Measures of Central Tendency/ Measures of Location
• “Central tendency”: the tendency of statistical data to get concentrated at certain values
• Measures of central tendency: the various methods of determining the actual value at which the data tend to concentrate. Hence, a measure of central tendency is a value that tends to sum up or describe the mass of the data
• A measure of central location is the single value that best represents a whole series of values of the same variable/characteristic, such as the age or height of a group of persons
• These measures of central tendency include the arithmetic mean, median and mode
1. Arithmetic Mean/simple Mean
Definition: the arithmetic mean is the sum of all observations divided by the number of observations
• It is the arithmetic average and is commonly called simply “mean” or “average”
- It is usually denoted by µ (population mean) or x̄ (sample mean)
• Let X1, X2, ..., XN be the N measurements obtained from N subjects. Then the mean of these ungrouped measurements is defined as:

$$\mu = \frac{X_1 + X_2 + \dots + X_N}{N} = \frac{1}{N}\sum_{i=1}^{N} X_i$$
Properties of the arithmetic mean
• Uniqueness: For a given set of data there is one and only one arithmetic mean
• Simplicity: The mean is easily understood and easy to compute
• Center of gravity: The algebraic sum of the deviations of the given values from their arithmetic mean is always zero, i.e. ∑(xi - x̄) = 0. So, the mean is the center of gravity of the given data set
• Sensitivity: Since each and every value in a set of data enters into the computation of the mean, it is greatly affected by extreme values
• So, in a skewed distribution, it is an undesirable measure of central tendency
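The center-of-gravity and sensitivity properties can be checked numerically. The sketch below reuses the 15 values from the stem-and-leaf example; the helper name `mean` and the extreme value 500 are assumptions made only for illustration.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their number."""
    return sum(values) / len(values)

# Values from the stem-and-leaf example
data = [43, 28, 34, 61, 77, 82, 22, 47, 49, 51, 29, 36, 66, 72, 41]

m = mean(data)

# Center of gravity: deviations from the mean sum to zero (up to rounding error)
deviation_sum = sum(x - m for x in data)

# Sensitivity: one hypothetical extreme value shifts the mean noticeably
with_outlier = data + [500]

print("mean:", m)
print("sum of deviations:", deviation_sum)
print("mean with one extreme value:", mean(with_outlier))
```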
2. Median
• An alternative measure of central location, perhaps second in popularity to the arithmetic mean
• Suppose there are n observations in a sample. If these observations are ordered from smallest to largest, then the median is defined as follows:
• The median is a value such that at least half of the observations are less than or equal to it and at least half of the observations are greater than or equal to it
• Median means middle, and the median is the middle of a set of data that has been put into rank order
• To find the median of a data set:
- Arrange the data in ascending order
- Find the middle observation of this ordered data
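Median…
• If the number of observations n is ODD, then the median is the middle data point

Median = the $\left(\frac{n+1}{2}\right)$th observation of the ordered data

• If the number of observations n is EVEN, then the median is the average of the two values around the middle

Median = the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations of the ordered data

• Extreme values do NOT affect the median, making the median a good alternative to the mean to measure central tendency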
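The odd and even cases can be sketched as follows. The function name `median` is an assumption for illustration; the odd-sized data set is the stem-and-leaf example and the even-sized one is the set of eight patient ages used in the interquartile range example, while the value 820 is a hypothetical outlier.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((n+1)/2)th observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of the (n/2)th and (n/2 + 1)th

odd_data = [43, 28, 34, 61, 77, 82, 22, 47, 49, 51, 29, 36, 66, 72, 41]   # n = 15
even_data = [18, 59, 24, 42, 21, 23, 24, 32]                              # n = 8

print(median(odd_data))    # 8th ordered value
print(median(even_data))   # average of the 4th and 5th ordered values

# Replacing the largest value with a hypothetical extreme one leaves the median unchanged
odd_with_outlier = [820 if x == 82 else x for x in odd_data]
print(median(odd_with_outlier))
```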
Properties of Median
• Uniqueness: There is only one median for a given set of data
• Simplicity: The median is easy to compute
• Insensitivity: The median is a positional average. In contrast to the mean, it is not influenced to the same extent by extreme values
3. Mode
• The mode is the value appearing most frequently
• It can be obtained by counting the number of appearances of each observation in the list
• Important for summarising nominal/categorical types of data
• Disadvantages:
- With a small number of observations, there may be no mode
- Sometimes there may be more than one mode, as in a bimodal (two-peaked) distribution
• Example
a. 22, 66, 69, 70, 73 (no modal value)
b. 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 (modal value = 3.0 kg)
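Counting appearances, as described above, might be sketched like this. The function name `modes` and the convention of returning an empty list when no value repeats are assumptions made for this illustration; the bimodal list is hypothetical.

```python
from collections import Counter

def modes(values):
    """Return all most frequent values; an empty list if every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []   # no value repeats, so there is no mode
    return [value for value, count in counts.items() if count == highest]

example_a = [22, 66, 69, 70, 73]
example_b = [1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5]
bimodal = [1, 1, 2, 3, 3]   # hypothetical data with two modes

print(modes(example_a))
print(modes(example_b))
print(modes(bimodal))
```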
Properties of Mode
• It is not affected by extreme values
• It can be calculated for distributions with open-ended classes
• Often its value is not unique
• The main drawback of the mode is that often it does not exist
Central Tendency cont---
• Quartiles: quantiles that divide the distribution into four equal parts
- The 25th percentile demarcates the first quartile (Q1)
- The median or 50th percentile demarcates the second quartile (Q2)
- The 75th percentile demarcates the third quartile (Q3)
- The 100th percentile demarcates the fourth quartile (Q4)
Skewness cont---
• If extremely low or extremely high observations are present in a distribution, the mean tends to shift towards those scores
• Based on the type of skewness, distributions can be:
• Symmetrical distribution: neither positively nor negatively skewed. A curve is symmetrical if one half of the curve is the mirror image of the other half
• If the distribution is symmetric and has only one mode, all three measures are the same; an example is the normal distribution
• Positively skewed distribution: occurs when the majority of scores are at the left end of the curve and a few extremely large scores are scattered at the right end

For positively skewed distributions (where the upper, or right, tail of the distribution is longer than the lower, or left, tail) the measures are ordered as follows:
mode < median < mean
• Negatively skewed distribution: occurs when the majority of scores are at the right end of the curve and a few small scores are scattered at the left end

For negatively skewed distributions (where the left tail of the distribution is longer than the right tail), the reverse ordering occurs:
mean < median < mode
Summary
• Which measure of central tendency is best to use with the data? Two factors are important in making this decision:
1. The scale of measurement
2. The shape of the distribution of observations
Therefore,
• The arithmetic mean is used for interval and ratio scale data with a symmetric distribution (i.e. a normally or approximately normally distributed data set)
• The median and quartiles are used for ordinal, interval and ratio scale data whose distribution is skewed
• For nominal data, the mode is the appropriate measure of central tendency
• The geometric mean is used primarily for observations measured on a logarithmic or exponential scale, such as titrated values
Skewness:
• The skewness of a distribution is measured by comparing the relative positions of the mean, median and mode
• The distribution is symmetrical when
Mean = Median = Mode
• A distribution that has the central location to the left and a tail off to the right is said to be “positively skewed” or “skewed to the right”
Mode < Median < Mean
• A distribution that has the central location to the right and a tail off to the left is said to be “negatively skewed” or “skewed to the left”
Mean < Median < Mode
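The orderings above can be illustrated with two small data sets. Both lists are hypothetical values chosen for this sketch (the second is a mirror image of the first); only the standard-library `statistics` functions are used.

```python
from statistics import mean, median, mode

# Hypothetical data with a long right tail (one large value)
right_skewed = [1, 2, 2, 3, 4, 5, 20]

# Mirror image of the same data: long left tail
left_skewed = [21 - x for x in right_skewed]

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(name, "mode:", mode(data), "median:", median(data), "mean:", round(mean(data), 2))
```

In the first list the single large value pulls the mean above the median, and in the mirrored list the single small value pulls it below.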
[Figure: normally distributed, positively skewed and negatively skewed curves]
2. Measures of Dispersion/ Variation
• Measures of dispersion or variability give us information about the spread of the scores: how closely the rest of the data fall about the central value of our distribution
• Moreover, two or more data sets may have the same mean and/or median and still be quite different
• Thus, to have a clear picture of the data, one needs a measure of dispersion or variability (scatteredness) among the observations in the set
1. RANGE:
• It is the difference between the largest and smallest observations in the data
R = largest value - smallest value in the data set

EXAMPLE: Consider the data on the weight of 10 newborn children at University of Gondar hospital within a month: 2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43.
• The range for the data set can be computed by first arranging all observations in ascending order:
1.98, 2.02, 2.33, 2.33, 2.43, 2.51, 2.88, 2.98, 3.01, 3.25.
• Range = Maximum - Minimum = 3.25 - 1.98 = 1.27
• Because it is based on the two extreme cases in the entire distribution, the range may change considerably if either extreme case drops out, while the removal of any other case would not affect it at all
• It wastes information: it takes no account of the rest of the data
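A short sketch of the range calculation on the newborn weights above, including what happens when an extreme or a middle observation is dropped. The helper name `value_range` is an assumption introduced for illustration.

```python
def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)

# Newborn weights from the example above
weights = [2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43]

full_range = value_range(weights)

# Dropping the largest value changes the range ...
without_max = [w for w in weights if w != max(weights)]
# ... while dropping a middle value does not
without_middle = [w for w in weights if w != 2.51]

print(round(full_range, 2))
print(round(value_range(without_max), 2))
print(round(value_range(without_middle), 2))
```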
2. The interquartile range (IQR):
• It reflects the variability among the middle 50 percent of the observations in a data set
• It is the difference between the third and the first quartiles: IQR = Q3 - Q1
• To compute it, we first sort the data in ascending order, then find the data values corresponding to the first quartile (Q1) and the third quartile (Q3)
Example
Given the following data set (age of patients) find the
interquartile range!
18,59,24,42,21,23,24,32
1. sort the data from lowest to highest
18 21 23 24 24 32 42 59

2. find the bottom and the top quarters of the data
3. find the difference (interquartile range) between the two quartiles
Example …
• 1st quartile = the {1/4 (n+1)}th observation = (2.25)th observation = 21 + (23-21) x 0.25 = 21.5
• 3rd quartile = the {3/4 (n+1)}th observation = (6.75)th observation = 32 + (42-32) x 0.75 = 39.5
Hence, IQR = 39.5 - 21.5 = 18
• i.e. the middle 50% of the patients' ages lie between 21.5 and 39.5
• The best measure of dispersion for skewed data
• The interquartile range is preferable to the range because it is less prone to distortion by a single large or small value. That is, outliers in the data do not affect the interquartile range
• Also, it can be computed when the distribution has open-ended classes
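The (n+1) positional rule used in this example can be sketched as below. The function name `quantile_n_plus_1` is an assumption for illustration. Statistical software uses several different quartile conventions, so other tools may return slightly different values for the same data.

```python
def quantile_n_plus_1(values, p):
    """Quantile at proportion p using the p*(n+1)th ordered observation,
    interpolating linearly between neighbouring observations."""
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1)          # 1-based position, may be fractional
    lower = int(position)           # whole part of the position
    fraction = position - lower     # fractional part of the position
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

# Ages of patients from the example
ages = [18, 59, 24, 42, 21, 23, 24, 32]

q1 = quantile_n_plus_1(ages, 0.25)
q3 = quantile_n_plus_1(ages, 0.75)
iqr = q3 - q1

print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
```

Python's built-in `statistics.quantiles(ages, n=4)` uses the same (n+1) convention by default and should agree with these values.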
3. Variance
• While the interquartile range eliminates the problem of outliers, it creates another problem: it ignores half of the data
• The solution to both problems is to measure variability from the center of the distribution
• Variance measures how far, on average, scores deviate from the mean
• To compute the variance, we start by computing the deviation of each observation from the mean
• By the center-of-gravity property of the mean, the sum of these deviations is always zero
Variance:
• Hence, to avoid this problem, we take the square of each deviation from the mean
• Thus the variance is defined as the sum of the squared deviations of each observation from the mean, divided by the total number of observations (or by n - 1 for a sample)
• Mathematically, the population variance is defined as:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

• Mathematically, the sample variance is defined as:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
4. Standard Deviation
• The sample and population standard deviations are denoted by S and σ (by convention) respectively
• The standard deviation (S.D.) is just the positive square root of the variance
• It expresses exactly the same information as the variance, but re-scaled to be in the same units as the mean
• The best measure of dispersion for normally distributed data
• Mathematically, the population standard deviation is:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
Standard Deviation:
• The sample standard deviation can be defined as:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

• Example 1: The areas of sprayable surfaces with DDT from a sample of 15 houses are measured as follows (in m²):

101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145
Example 1
• Find the variance and standard deviation of the above distribution
• Solution
The mean of the sample is 125 m².
Variance (sample): $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = \frac{(101-125)^2 + (105-125)^2 + \dots + (145-125)^2}{15-1} = \frac{2502}{14} = 178.71\ \text{m}^4$
Hence, the standard deviation: $s = \sqrt{178.71} = 13.37\ \text{m}^2$
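The arithmetic in Example 1 can be reproduced step by step. This is a sketch that follows the sample formulas given earlier; the variable names are chosen for illustration, and the population variance is included only for contrast.

```python
from math import sqrt

# Areas of sprayable surfaces (m²) for the 15 houses in Example 1
areas = [101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145]

n = len(areas)
mean_area = sum(areas) / n

# Sum of squared deviations from the mean
sum_squares = sum((x - mean_area) ** 2 for x in areas)

sample_variance = sum_squares / (n - 1)   # divide by n - 1 for a sample
sample_sd = sqrt(sample_variance)         # positive square root of the variance
population_variance = sum_squares / n     # divide by N if the 15 houses were the whole population

print("mean:", mean_area)
print("sum of squared deviations:", sum_squares)
print("sample variance:", round(sample_variance, 2))
print("sample SD:", round(sample_sd, 2))
```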
Example 2
• Consider a data set on the current age of women, collected from 240 women
• The variance for the data set can be computed as:

• The standard deviation can be computed as:
5. Coefficient of variation
• The standard deviation is an absolute measure of the deviation of observations around their mean and is expressed in the same unit as the data
• Because of this, the standard deviation cannot be used directly to compare the variability of data sets with different units or very different means
• The coefficient of variation is often used for this purpose
• The coefficient of variation (CV) is defined by:

$$CV = \frac{s}{\bar{x}}$$

(often multiplied by 100 and reported as a percentage)

• The coefficient of variation is most useful in comparing the variability of several different samples, each with different means
Coefficient of variation…
• CV is a relative measure free from the unit of measurement
• Example

Weights of newborn elephants (kg): 929, 853, 878, 939, 895, 972, 937, 841, 801, 826
n = 10, x̄ = 887.1, s = 56.50, CV = 0.0637

Weights of newborn mice (kg): 0.72, 0.42, 0.63, 0.31, 0.59, 0.38, 0.79, 0.96, 1.06, 0.89
n = 10, x̄ = 0.68, s = 0.255, CV = 0.375

⇒ Mice show greater birth-weight variation
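The comparison in this example can be reproduced with the standard-library `statistics` module. The helper name `coefficient_of_variation` is an assumption for illustration. Note that the code uses the unrounded mean for the mice (0.675), so it gives a CV of about 0.378 rather than the 0.375 obtained from the rounded mean 0.68.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean (unit-free)."""
    return stdev(values) / mean(values)

# Birth weights from the example
elephants = [929, 853, 878, 939, 895, 972, 937, 841, 801, 826]
mice = [0.72, 0.42, 0.63, 0.31, 0.59, 0.38, 0.79, 0.96, 1.06, 0.89]

cv_elephants = coefficient_of_variation(elephants)
cv_mice = coefficient_of_variation(mice)

print("elephants: SD =", round(stdev(elephants), 2), "CV =", round(cv_elephants, 4))
print("mice:      SD =", round(stdev(mice), 3), "CV =", round(cv_mice, 3))
```

Although the elephants have by far the larger standard deviation, the mice have the larger CV, which is why the standard deviation alone is not suitable for this comparison.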
When to use the coefficient of variation
• When comparison groups have very different means (CV is suitable as it expresses the standard deviation relative to its corresponding mean)
• When different units of measurement are involved, e.g. group 1 unit is mm and group 2 unit is gm (CV is suitable for comparison as it is unit-free)
• In such cases, the standard deviation should not be used for comparison
Summary
Data type vs measure of central tendency and dispersion

Data type | Central tendency | Measure of dispersion
Nominal   | Mode             | none (IQR and range require ordered data)
Ordinal   | Median           | IQR, range
Interval  | Mean             | SD
Ratio     | Mean             | SD
Thank you very much
