1 Introduction To Statistics 2021

Introduction to
Statistics
By Leul Deribe (BSc, MPH/RH)
1
INTRODUCTION
• Statistics
• Data
• Scientific method
• Collection, organization, presentation, analysis and
interpretation of data
• Biostatistics
• The application of statistics on biological or life science
data
2
INTRODUCTION …
• Data
• Measurements taken on variables
• Measurement
• Assigning values to objects
• Statistical data: numerical descriptions of things.

• These descriptions may take the form of counts or
measurements.
• Not all 'numerical descriptions' are statistical data.
3
INTRODUCTION …
• Descriptive statistics: statistical procedures used
to summarize, organize, and simplify data.
• A number that conveys a particular characteristic of a
set of data.
• Summarizes a set of data with one number or graph.
• Statistical inference: deals with techniques for
drawing conclusions about the population.
• Uses measurements from a sample to reach conclusions
about a larger, unmeasured population.
4
Characteristics of statistical data
• They must be in aggregates: statistics are 'numbers of facts.' A
single fact, even though numerically stated, cannot be called
statistics.
• They must be affected to a marked extent by a multiplicity of
causes: only aggregates of facts that grow out of a
'variety of circumstances' qualify.
• They must be enumerated or estimated according to a
reasonable standard of accuracy
• They must have been collected in a systematic manner for a
predetermined purpose.
• They must be placed in relation to each other. That is, they
must be comparable.
5
Type of variables
• Variable: A characteristic which takes different
values in different persons, places, or things.
• Something that exists in more than one amount or
in more than one form.
• Any aspect of an individual or object that is
measured (e.g., BP) or recorded (e.g., age, sex) and
takes any value
• Variables can be qualitative (or categorical) or
quantitative (or numerical variables).
6
Qualitative variable:
• A variable or characteristic which cannot be
measured in quantitative form
• can only be identified by name or categories,
• for instance place of birth, ethnic group, type of
drug, stages of breast cancer (I, II, III, or IV), degree
of pain (minimal, moderate, severe or unbearable).
7
• Quantitative variable: A variable that can be
measured (or counted) and expressed numerically.
• Height, weight, number of children, etc.
• Has the notion of magnitude.
8
Quantitative variables are divided into two types:
1. Discrete: It can only have a limited number of discrete
values (usually whole numbers).
• E.g., the number of episodes of diarrhoea a child has had in a
year. You can’t have 12.5 episodes of diarrhoea.
• Characterized by gaps or interruptions in the values
(integers).
• Both the order and magnitude of the values matter.
• The values aren’t just labels, but are actual measurable
quantities.
9
2. Continuous variable: It can have an infinite
number of possible values in any given interval.
• Both the magnitude and the order of the values matter
• Does not possess the gaps or interruptions
• Weight is continuous since it can take on any number of
values (e.g., 34.575 kg).
10
Scales of measurement
• Numbers mean different things in different situations.

Consider three answers that appear to be identical
but are not:
• What number were you wearing in the race? “5”
• What place did you finish in? “5”
• How many minutes did it take you to finish? “5”
• To illustrate this difference, consider another person
whose answers to the same three questions were 10,
10, and 10.
11
• “Four is twice as much as two” is true for the pure
numbers themselves and for time, length, and
amount, but it is not true for finish places in a race.
• Fourth place is not twice anything in relation to
second place: not twice as slow or twice as far
behind the second runner.
12
• Not all measurements are the same.
• Measuring weight, e.g., 40 kg
• Measuring the status of a patient on a scale, e.g.,
“improved”, “stable”, “not improved”.
• There are four types of scales of measurement.
13
1. Nominal scale:
• Measurement scale in which numbers serve only as
labels and do not indicate any quantitative
relationship.
• The simplest type of data, in which the values fall into
unordered categories or classes
• Consists of “naming” observations or classifying them
into various mutually exclusive and collectively
exhaustive categories
• Uses names, labels, or symbols to assign each
measurement.
• Examples: Blood type, sex, race, marital status, etc.
14
Example of nominal scale:
Race/Ethnicity:
1. Black
2. White
3. Latino
4. Other
• The numbers have NO meaning
• They are labels only
15
• If nominal data can take on only two possible values,
they are called dichotomous or binary.
• So sex is not just nominal, it is dichotomous (male or
female).
• Yes/no questions
• E.g., cured from TB at 6 months of Rx
16
2. Ordinal scale:
• has the characteristic of the nominal scale (different numbers mean
different things) plus the characteristic of indicating greater than or
less than.
• Assigns each measurement to one of a limited number of categories
that are ranked in terms of order.
• Measurement scale in which numbers are ranks;
• equal differences between numbers do not represent equal
differences between the things measured.
• Although non-numerical, can be considered to have a natural ordering
• Examples: Patient status, cancer stages,
social class, etc.
17
Example of ordinal scale:
• Pain level:
1. None
2. Mild
3. Moderate
4. Severe
• The numbers have LIMITED meaning: 4>3>2>1 is all we know
apart from their utility as labels
18
3. Interval scale:
- has the properties of both the nominal and ordinal scales plus
the additional property that intervals between the numbers are
equal.
- Measured on a continuum, and differences between any two
numbers on the scale are of known size.
Example: Temp. in °F on 4 consecutive days
Days:      A   B   C   D
Temp. °F:  50  55  60  65
For these data, not only is day A with 50° cooler than day D with
65°, but it is 15° cooler.
19
3. Interval scale …
- It has no true zero point. “0” is arbitrarily chosen and
doesn’t reflect the absence of temperature.
- The zero point is arbitrarily defined.
- You may not make simple ratio statements
- You may not say that 100° is twice as hot as 50° or
- that a person with an IQ of 60 is half as intelligent as a
person with an IQ of 120.
20
4. Ratio scale:
- has all the characteristics of the nominal, ordinal, and
interval scales plus one other:
- It has a true zero point, which indicates a complete
absence of the thing measured.
- On a ratio scale, zero means “none.”
- Measurement begins at a true zero point and the scale
has equal intervals.
- Examples: Height, age, weight, BP, etc.
• Note on meaningfulness of “ratio”: you can make ratio
statements
• Someone who weighs 80 kg is two times as heavy as
someone else who weighs 40 kg. This is true even if weight
had been measured in other units.
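The difference between interval and ratio scales can be checked with a little arithmetic. The sketch below is illustrative only: the helper names `fahrenheit_to_celsius` and `kg_to_lb` are made up for this example, and the pound conversion factor is an approximation. It shows that a ratio of two temperatures changes when the unit changes, while a ratio of two weights does not.

```python
def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9


def kg_to_lb(kg):
    """Convert kilograms to pounds (1 kg is roughly 2.20462 lb)."""
    return kg * 2.20462


# Interval scale: zero is arbitrary, so the ratio depends on the unit.
ratio_f = 100 / 50
ratio_c = fahrenheit_to_celsius(100) / fahrenheit_to_celsius(50)

# Ratio scale: zero means "none", so the ratio is the same in any unit.
ratio_kg = 80 / 40
ratio_lb = kg_to_lb(80) / kg_to_lb(40)

print("100 F vs 50 F:", ratio_f, "in F,", round(ratio_c, 2), "in C")
print("80 kg vs 40 kg:", ratio_kg, "in kg,", round(ratio_lb, 2), "in lb")
```

The temperature ratio is 2 in °F but not in °C, which is why "twice as hot" is not a meaningful statement, whereas "twice as heavy" holds in both kilograms and pounds.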
21
Characteristics of the four scales of measurement
22
23
[Figure: the four scales (nominal, ordinal, interval, ratio) arranged by degree of precision in measuring]
Data
• Data are numbers which can be measurements or can
be obtained by counting
• The raw material for statistics
• Can be obtained from:
• Routinely kept records, literature
• Surveys
• Counting
• Experiments
• Reports
• Observation
• Etc
24
Data
• raw score/data: Score obtained by observation or
from an experiment
25
Types of Data
1. Primary data: collected directly from the items or
individual respondents by the researcher
for the purpose of a study.
2. Secondary data: data that had already been collected and
statistically treated by certain people or organizations, and
whose information is used for another purpose by other people.
26
Methods of Data collection
• Questionnaires
• Interviews
• Focus group interviews
• Observation
• Documentary source
27
Methods of data organization
and presentation
28
Descriptive statistics (Describing variables)
• Descriptive statistics include tables, graphical/chart
displays, and calculation of summary measures such as
proportions and averages.
• The methods of describing variables differ depending on the
type of data (numerical or categorical).
29
1. Describing categorical variables
• Table of frequency distributions

• Frequency
• Relative frequency
• Cumulative frequencies
• Charts
• Bar charts
• Pie charts
30
Statistical Tables
• A table can be either a simple frequency table or a
cross tabulation.
• The simple frequency table
• is used when the individual observations involve only
a single variable
• The cross tabulation
• is used to obtain the frequency distribution of one
variable by the subsets of another variable.
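The two kinds of table can be sketched in a few lines of Python. The records below are hypothetical (invented for illustration, not taken from any study), and the variable names are assumptions; the point is only the difference between counting one variable and counting one variable within the levels of another.

```python
from collections import Counter

# Hypothetical records: (sex, cured from TB at 6 months of treatment)
records = [
    ("male", "yes"), ("female", "yes"), ("male", "no"),
    ("female", "yes"), ("male", "yes"), ("female", "no"),
    ("male", "yes"), ("female", "yes"),
]

# Simple frequency table: counts for a single variable.
sex_counts = Counter(sex for sex, _ in records)

# Cross tabulation: counts of one variable within each level of another.
cross_tab = Counter(records)

print("Sex      Freq")
for sex in ("male", "female"):
    print(f"{sex:<8} {sex_counts[sex]}")

print()
print("Sex      Cured=yes  Cured=no  Total")
for sex in ("male", "female"):
    yes = cross_tab[(sex, "yes")]
    no = cross_tab[(sex, "no")]
    print(f"{sex:<8} {yes:<10} {no:<9} {yes + no}")
```

The row totals of the cross tabulation reproduce the simple frequency table, which is a quick consistency check.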
31
Construction of tables
• Tables should be as simple as possible.
• Tables should be self-explanatory. For that purpose
• The title should be clear and to the point (a good title answers:
what? when? where? how classified?) and should be placed above
the table.
• Each row and column should be labelled.
• Numerical entries of zero should be explicitly written rather
than indicated by a dash. Dashes are reserved for missing or
unobserved data.
• Totals should be shown
• If data are not original, their source should be given in a
footnote.
32
Frequency distributions
• A simple and effective way of summarizing categorical data is to
construct a frequency distribution table.
• This is done by counting the number of observations falling into
each of the categories, or levels, of the variable.
33
Simple Frequency Distributions
• Scores arranged from highest to lowest, with the frequency shown for
each score.
• The column heading tells the name of the variable that is being
measured. The generic name for any variable is X, which is the
symbol used in formulas.
• The Frequency ( f ) column shows how frequently a score
occurred.
• The tally marks are used when you construct a rough draft
version and are not usually included in the final form
• N is the number of scores and is found by summing the numbers
in the f column.
• A useful way to present a set of data because you pick up valuable
information with just a glance.
34
Raw data
35
Steps to construct simple frequency distributions
1. Find the highest and lowest scores. Highest score is 35; lowest is 5.
2. In column form, write in descending order all the numbers. 35 to 5.
3. At the top of the column, name the variable being measured. Satisfaction With
Life Scale scores.
4. Start with the number in the upper left-hand corner of the scores, draw a line
under it, and place a tally mark beside that number in the column of numbers.
Underline 15 in Table 2.1, and place a tally mark beside 15 in Table 2.2.
5. Continue underlining and tallying for all the unorganized scores.
6. Add a column labeled f (frequency).
7. Count the number of tallies by each score and enter the count in the f column.
2, 1, 2, 4, . . . , 0, 2.
8. Add the numbers in the f column. If the sum is equal to N, you haven’t left out
any scores. Sum = 100.
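The steps above can be sketched in Python. The scores below are hypothetical, since the deck's raw data table is not reproduced in the text; only the range (highest 35, lowest 5) follows the example. `Counter` plays the role of the tally marks.

```python
from collections import Counter

# Hypothetical raw scores (not the deck's actual raw data).
scores = [15, 22, 30, 22, 35, 5, 27, 30, 22, 15, 27, 30]

# Step 1: find the highest and lowest scores.
highest, lowest = max(scores), min(scores)

# Steps 4-7: tally every score and count the tallies.
counts = Counter(scores)

# Steps 2-3 and 6: list every value in descending order with its frequency f.
table = [(x, counts[x]) for x in range(highest, lowest - 1, -1)]

print("X    f")
for x, f in table:
    if f:  # rows with f = 0 are skipped only to keep the printout short
        print(f"{x:<4} {f}")

# Step 8: the f column must add up to N.
n = sum(f for _, f in table)
assert n == len(scores)
print("N =", n)
```

A full table would also print the rows with f = 0, as the step 7 example does.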
36
37
38
Group frequency
• Compilation of scores into equal-sized ranges (class
intervals), with the frequency shown for each
interval.
• class interval
• A range of scores in a grouped frequency distribution.
• The midpoint of each interval represents all the
scores in that interval.
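A grouped frequency distribution can be sketched as follows. The scores, the interval width of 5, and the starting lower limit of 5 are all assumptions chosen for illustration; the deck does not prescribe them.

```python
# Hypothetical scores, grouped into class intervals of width 5.
scores = [15, 22, 30, 22, 35, 5, 27, 30, 22, 15, 27, 30]
width = 5
lowest_limit = 5  # lower limit of the first interval (an assumption)

grouped = {}
for x in scores:
    # Lower and upper limits of the class interval that contains x.
    lower = lowest_limit + ((x - lowest_limit) // width) * width
    upper = lower + width - 1
    grouped[(lower, upper)] = grouped.get((lower, upper), 0) + 1

print("Interval  Midpoint  f")
for (lower, upper), f in sorted(grouped.items(), reverse=True):
    midpoint = (lower + upper) / 2  # stands in for all scores in the interval
    print(f"{lower}-{upper:<6} {midpoint:<9} {f}")
```

Grouping trades detail for readability: the three scores of 22 are now reported only as "three scores in 20-24", represented by the midpoint 22.0.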
39
40
Relative Frequency
• It is useful to compute the proportions, or percentages, of
observations in each category.
• The distribution of proportions is called the relative frequency
distribution of the variable.
41
Cumulative frequency
• The cumulative frequency of a category is the number of
observations in the category plus the observations in all categories
smaller than it.
• Cumulative relative frequency
• is the proportion of observations in the category plus observations in all
categories smaller than it, and
• is obtained by dividing the cumulative frequency by the total number of
observations
42
Table 2. Frequencies of serum cholesterol levels for
1067 US males of ages 25-34 (1976-1980).
---------------------------------------------------------------------------
Cholesterol level
(mg/100 ml)        Freq   Relative freq (%)   Cum. freq   Cum. rel. freq (%)
---------------------------------------------------------------------------
80-119               13          1.2               13             1.2
120-159             150         14.1              163            15.3
160-199             442         41.4              605            56.7
200-239             299         28.0              904            84.7
240-279             115         10.8             1019            95.5
280-319              34          3.2             1053            98.7
320-359               9          0.8             1062            99.5
360-399               5          0.5             1067           100
---------------------------------------------------------------------------
Total              1067        100
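As a check on the definitions, the relative and cumulative columns can be recomputed from the frequency column alone. The sketch below uses the frequencies from the cholesterol table above; the variable names are illustrative.

```python
from itertools import accumulate

intervals = ["80-119", "120-159", "160-199", "200-239",
             "240-279", "280-319", "320-359", "360-399"]
freqs = [13, 150, 442, 299, 115, 34, 9, 5]

n = sum(freqs)                          # total number of observations
rel = [100 * f / n for f in freqs]      # relative frequency (%)
cum = list(accumulate(freqs))           # cumulative frequency
cum_rel = [100 * c / n for c in cum]    # cumulative relative frequency (%)

for row in zip(intervals, freqs, rel, cum, cum_rel):
    print("{:<8} {:>4} {:>6.1f} {:>5} {:>6.1f}".format(*row))
```

Rounded to one decimal place, the computed columns agree with the table.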
43
Table 1. Distribution of birth weight of
newborns between 1976-1996 at TAH.
BWT        Freq.   Rel. freq (%)   Cum. freq   Cum. rel. freq (%)
Very low      43        0.4             43            0.4
Low          793        8              836            8.4
Normal      8870       88.9           9706           97.3
Big          268        2.7           9974          100
Total       9974      100
44
Graphs of Frequency
Distributions
45
Charts
• Bar charts: display the frequency distribution for nominal or ordinal data.
• The vertical axis should always start from 0 but the horizontal can start from
anywhere.
• The bars should be of equal width and should be separated from one
another so as not to imply continuity
46
[Figure: two bar charts of birth weight (BWT) category (Very low, Low, Normal, Big), one showing frequencies (43, 793, 8870, 268) and one showing percentages (0%, 8%, 89%, 3%)]
Figure 1. Bar charts showing frequency distribution of the variable ‘BWT’
described in Table 1.
47
Bar charts for comparison
• In order to compare the distribution of a variable
for two or more groups, bars are often drawn
alongside each other for the groups being compared in a
single bar chart
48
[Figure: two grouped bar charts of BWT category (Low, Normal, Big) by antenatal care (Yes, No), one showing frequency and one showing percent]
Fig 2. Bar chart indicating categories of birth weight of 9975 newborns grouped by
antenatal follow-up of the mothers
49
Pie chart
• Pie chart: displays the frequency distribution for
nominal or ordinal data.
• In a pie chart the various categories into which the
observations fall are represented as sectors of a
circle,
• each sector represents either the frequency or the
relative frequency of observations
• the angles are proportional to the frequency or the
relative frequency.
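Because the angles are proportional to the frequencies, each sector angle is 360° multiplied by the category's share of the total. The sketch below works this out for the birth weight frequencies given in this deck's BWT table; the variable names are illustrative.

```python
# Frequencies of birth weight categories, from the BWT table in this deck.
bwt = {"Very low": 43, "Low": 793, "Normal": 8870, "Big": 268}

total = sum(bwt.values())

# Sector angle = 360 degrees x (category frequency / total frequency).
angles = {cat: 360 * f / total for cat, f in bwt.items()}

for cat, angle in angles.items():
    share = 100 * bwt[cat] / total
    print(f"{cat:<9} {bwt[cat]:>5} {share:>5.1f}% {angle:>6.1f} degrees")
```

The "Normal" sector takes up about 320 of the 360 degrees, and the four angles add up to a full circle.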
50
Pie chart…
Figure 8. Pie chart showing distribution of subjects
by their educational status, Jimma 2009.
51
[Figure: two pie charts of BWT category (Very low, Low, Normal, Big), one showing frequencies (43, 793, 8870, 268) and one showing percentages (0.4, 8, 88.9, 2.7)]
Figure 3. Pie charts showing frequency distribution of the variable ‘BWT’
described in Table 1.
52
2. Describing numerical variables
• Graphs
• Histograms
• Frequency polygons
• Cumulative frequency polygons
53
Graphs...
• Diagrams have greater attraction than mere figures.
• They give delight to the eye, add a spark of interest and as such
catch the attention as much as the figures dispel it.
• They help in deriving the required information in less time and
without any mental strain.
• They have greater memorizing value than mere figures. This is so
because the impression left by the diagram is of a lasting nature.
• They facilitate comparison
54
Graphs…
• Every graph should be self-explanatory and as simple as possible.
• Titles are usually placed below the graph and should again
answer: what? where? when? how classified?
• Legends or keys should be used to differentiate variables if more
than one is shown.
• The axis labels should be placed to read from the left side and from
the bottom.
• The units into which the scale is divided should be clearly indicated.
• The numerical scale representing frequency must start at zero or a
break in the line should be shown.
55
Histograms
• Are frequency distributions with continuous class intervals that
have been turned into graphs.
• Given a set of numerical data, we can obtain an impression of the
shape of its distribution by constructing a histogram.
• A histogram is constructed by choosing a set of non-overlapping intervals
(class intervals) and counting the number of observations that
fall in each class.
• The number of observations in each class is called the frequency.
56
Histograms …
• It is necessary that the class intervals be non-overlapping
so that each observation falls in one and only one interval.
• Except for the two boundary intervals, class intervals are usually
chosen to be of equal width.
• If this is not the case, the histogram could give a
misleading impression of the shape of the data.
57
58
[Figure: histogram of birth weight (x-axis) against frequency (y-axis); Mean = 3126, Std. Dev = 502.34, N = 9975]
Fig 5. A histogram displaying frequency distribution of birth weight of
newborns at Tikur Anbessa Hospital
59
Frequency polygons
• Instead of drawing bars for each class interval, sometimes a
single point is drawn at the midpoint of each class interval and
consecutive points are joined by straight lines.
• Graphs drawn in this way are called frequency polygons (line
graphs).
• Frequency polygons are superior to histograms for comparing
two or more sets of data.
60
61
[Figure: frequency polygons of birth weight (x-axis, 500 to 5000) against percent (y-axis) for males and females]
Fig.6. Frequency polygon of birth weight of 9975 newborns at
Tikur Anbessa Hospital for males and females
62
Table 5. Frequencies of serum cholesterol levels for 1067 US
males of ages 25-34, 1976-1980
---------------------------------------------------------------------------
Cholesterol level
(mg/100 ml)        Freq   Relative freq (%)   Cum. freq   Cum. rel. freq (%)
---------------------------------------------------------------------------
80-119               13          1.2               13             1.2
120-159             150         14.1              163            15.3
160-199             442         41.4              605            56.7
200-239             299         28.0              904            84.7
240-279             115         10.8             1019            95.5
280-319              34          3.2             1053            98.7
320-359               9          0.8             1062            99.5
360-399               5          0.5             1067           100
---------------------------------------------------------------------------
Total              1067        100
63
Table 6. Frequencies of serum cholesterol levels for 1227 US
males of ages 55-64, 1976-1980
---------------------------------------------------------------------------
Cholesterol level
(mg/100 ml)        Freq   Relative freq (%)   Cum. freq   Cum. rel. freq (%)
---------------------------------------------------------------------------
80-119                5          0.4                5             0.4
120-159              48          3.9               53             4.3
160-199             265         21.6              318            25.9
200-239             458         37.3              776            63.2
240-279             281         22.9             1057            86.1
280-319             128         10.4             1185            96.5
320-359              35          2.9             1220            99.4
360-399               7          0.6             1227           100
---------------------------------------------------------------------------
Total              1227        100
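The points of a frequency polygon are pairs of (class midpoint, relative frequency). The sketch below computes them for the two age groups from the frequency columns of the cholesterol tables in this deck; `polygon_points` is a helper name made up for this example. Relative frequencies are used rather than raw counts because the two groups differ in size (1067 and 1227).

```python
intervals = [(80, 119), (120, 159), (160, 199), (200, 239),
             (240, 279), (280, 319), (320, 359), (360, 399)]
freq_25_34 = [13, 150, 442, 299, 115, 34, 9, 5]
freq_55_64 = [5, 48, 265, 458, 281, 128, 35, 7]


def polygon_points(intervals, freqs):
    """Return (midpoint, relative frequency in %) for each class interval."""
    n = sum(freqs)
    return [((lo + hi) / 2, 100 * f / n)
            for (lo, hi), f in zip(intervals, freqs)]


young = polygon_points(intervals, freq_25_34)
old = polygon_points(intervals, freq_55_64)

print("Midpoint  25-34 (%)  55-64 (%)")
for (mid, y), (_, o) in zip(young, old):
    print(f"{mid:<9} {y:>9.1f} {o:>10.1f}")
```

Joining each group's points with straight lines gives the two polygons; the peak sits at midpoint 179.5 for ages 25-34 and at 219.5 for ages 55-64, which is the shift the comparison is meant to reveal.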
64
[Figure: two panels plotting serum cholesterol level (mg/100 ml) for ages 25-34 and ages 55-64, one showing relative frequency (%) and one showing cumulative relative frequency (%)]
Fig. 7. Frequency polygon (Ogive curves vs survival curves) and cumulative frequency
polygons of serum cholesterol levels for 2294 males aged 25-34 and 55-64 years, 1976-1980
65
THANK YOU!
66
