
Introduction to Statistics

By Leul Deribe (BSc, MPH/RH)

1
INTRODUCTION
• Statistics
• Data
• Scientific method
• Collection, organization, presentation, analysis and
interpretation of data

• Biostatistics
• The application of statistics to biological or life science
data

2
INTRODUCTION …
• Data
• Measurements taken on variables

• Measurement
• Assigning values to objects

• Statistical data: numerical descriptions of things.
• These descriptions may take the form of counts or
measurements.
• Not all numerical descriptions are statistical data.

3
INTRODUCTION …
• Descriptive statistics: statistical procedures used
to summarize, organize, and simplify data.
• A descriptive statistic is a number that conveys a particular
characteristic of a set of data.
• It summarizes a set of data with one number or graph.

• Statistical inference: techniques for drawing
conclusions about a population.
• It uses measurements from a sample to reach conclusions
about a larger, unmeasured population.

4
Characteristics of statistical data
• They must be in aggregates: statistics are 'numbers of facts.' A
single fact, even though numerically stated, cannot be called
statistics.
• They must be affected to a marked extent by a multiplicity of
causes: only aggregates of facts that grow out of a 'variety of
circumstances' qualify.
• They must be enumerated or estimated according to a
reasonable standard of accuracy
• They must have been collected in a systematic manner for a
predetermined purpose.
• They must be placed in relation to each other. That is, they
must be comparable.
5
Type of variables
• Variable: a characteristic which takes different
values in different persons, places, or things.
• Something that exists in more than one amount or
in more than one form.
• Any aspect of an individual or object that is
measured (e.g., BP) or recorded (e.g., age, sex) and
can take more than one value
• Variables can be qualitative (or categorical) or
quantitative (or numerical variables).

6
Qualitative variable:
• A variable or characteristic which cannot be
measured in quantitative form
• can only be identified by name or categories,
• for instance place of birth, ethnic group, type of
drug, stages of breast cancer (I, II, III, or IV), degree
of pain (minimal, moderate, severe or unbearable).

7
• Quantitative variable: A variable that can be
measured (or counted) and expressed numerically.

• Examples: height, weight, number of children, etc.

• Has the notion of magnitude.

8
Quantitative variable is divided into two:
1. Discrete: It can only have a limited number of discrete
values (usually whole numbers).
• E.g., the number of episodes of diarrhoea a child has had in a
year. A child cannot have 12.5 episodes of diarrhoea.
• Characterized by gaps or interruptions in the values
(integers).
• Both the order and magnitude of the values matter.
• The values aren’t just labels, but are actual measurable
quantities.

9
2. Continuous variable: It can have an infinite
number of possible values in any given interval.
• Both the magnitude and the order of the values matter
• Does not possess the gaps or interruptions
• Weight is continuous since it can take on any value
within an interval (e.g., 34.575 kg).

10
Scales of measurement

• Numbers mean different things in different situations.
• Consider three answers that appear to be identical
but are not:
• What number were you wearing in the race? “5”
• What place did you finish in? “5”
• How many minutes did it take you to finish? “5”
• To illustrate this difference, consider another person
whose answers to the same three questions were 10,
10, and 10.

11
Scales of measurement

• “Four is twice as much as two” is true for the pure
numbers themselves and for time, length, and
amount, but it is not true for finish places in a race.

• Fourth place is not twice anything in relation to
second place: not twice as slow or twice as far
behind the second runner.

12
Scales of measurement

• Not all measurements are the same.
• Measuring weight: e.g., 40 kg
• Measuring the status of a patient on a scale:
“improved”, “stable”, “not improved”.
• There are four types of scales of measurement.

13
1. Nominal scale:
• Measurement scale in which numbers serve only as
labels and do not indicate any quantitative
relationship.
• The simplest type of data, in which the values fall into
unordered categories or classes
• Consists of “naming” observations or classifying them
into various mutually exclusive and collectively
exhaustive categories
• Uses names, labels, or symbols to assign each
measurement.
• Examples: Blood type, sex, race, marital status, etc.

14
Example of nominal Scale:

Race/Ethnicity:
1. Black
2. White
3. Latino
4. Other
• The numbers have NO meaning
• They are labels only

15
• If nominal data can take on only two possible values,
they are called dichotomous or binary.
• So sex is not just nominal, it is dichotomous (male or
female).
• Yes/no questions
• E.g., cured of TB at 6 months of treatment

16
2. Ordinal scale:
• has the characteristic of the nominal scale (different numbers mean
different things) plus the characteristic of indicating greater than or
less than.
• Assigns each measurement to one of a limited number of categories
that are ranked in terms of order.
• Measurement scale in which numbers are ranks;
equal differences between numbers do not represent equal
differences between the things measured.
• Although non-numerical, the categories can be considered to have a
natural ordering.
• Examples: patient status, cancer stages,
social class, etc.

17
Example of ordinal scale:

• Pain level:
1. None
2. Mild
3. Moderate
4. Severe
• The numbers have LIMITED meaning
• 4>3>2>1 is all we know, apart from their utility as labels

18
3. Interval scale:
- Has the properties of both the nominal and ordinal scales plus
the additional property that intervals between the numbers are
equal.
- Measured on a continuum, and differences between any two
numbers on the scale are of known size.
Example: Temperature in °F on 4 consecutive days
Days:        A   B   C   D
Temp. (°F):  50  55  60  65
For these data, day A with 50° is not only cooler than day D with
65°, it is 15° cooler.
19
3. Interval scale:
- It has no true zero point. “0” is arbitrarily defined and
does not reflect the absence of temperature.
- You may not make simple ratio statements:
- You may not say that 100° is twice as hot as 50°, or
- that a person with an IQ of 60 is half as intelligent as a
person with an IQ of 120.

20
4. Ratio scale:
- has all the characteristics of the nominal, ordinal, and
interval scales plus one other:
- It has a true zero point, which indicates a complete
absence of the thing measured.
- On a ratio scale, zero means “none.”
- Measurement begins at a true zero point and the scale
has equal intervals.
- Examples: Height, age, weight, BP, etc.
• Note on the meaningfulness of “ratio”: you can make ratio
statements.
• Someone who weighs 80 kg is two times as heavy as
someone else who weighs 40 kg. This is true even if weight
had been measured in other units.
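
The contrast between the interval and ratio scales can be checked with a short calculation. The Python sketch below is only an illustration: the helper name `fahrenheit_to_celsius` and the conversion constant are assumptions introduced for the example, not part of the slides. It suggests that a ratio of weights survives a change of unit, while a ratio of temperatures does not.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9

KG_TO_LB = 2.20462  # approximate pounds per kilogram (illustrative constant)

# Ratio scale: weight has a true zero, so the ratio is the same in kg and lb.
weight_ratio_kg = 80 / 40
weight_ratio_lb = (80 * KG_TO_LB) / (40 * KG_TO_LB)

# Interval scale: the zero of the Fahrenheit scale is arbitrary, so the
# ratio changes when the same two temperatures are expressed in Celsius.
temp_ratio_f = 100 / 50
temp_ratio_c = fahrenheit_to_celsius(100) / fahrenheit_to_celsius(50)

print("weight ratio (kg, lb):", weight_ratio_kg, round(weight_ratio_lb, 6))
print("temperature ratio (F, C):", temp_ratio_f, round(temp_ratio_c, 2))
```

The two weight ratios agree, whereas "twice as hot" in °F is no longer "twice" once the readings are converted to °C.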
21
Characteristics of the four scales of measurement

22
[Figure: degree of precision in measuring for the nominal, ordinal, interval, and ratio scales]
23
Data
• Data are numbers which can be measurements or can
be obtained by counting
• The raw material for statistics
• Can be obtained from:
• Routinely kept records, literature
• Surveys
• Counting
• Experiments
• Reports
• Observation
• Etc

24
Data
• Raw score/data: a score obtained by observation or
from an experiment

25
Types of Data

1. Primary data: data collected directly from the items or
individual respondents by the researcher
for the purpose of a study.

2. Secondary data: data that have been collected and statistically
treated by other people or organizations, and whose information is
then used for another purpose by other people.

26
Methods of Data collection

• Questionnaires
• Interviews
• Focus group interviews
• Observation
• Documentary source

27
Methods of data organization
and presentation

28
Descriptive statistics (Describing variables)

• Descriptive statistics include tables, graphical/chart
displays, and the calculation of summary measures such as
proportions and averages.

• The methods of describing variables differ depending on the
type of data (numerical or categorical).

29
1. Describing categorical variables

• Table of frequency distributions
• Frequency
• Relative frequency
• Cumulative frequencies

• Charts
• Bar charts
• Pie charts

30
Statistical Tables
• A table can be either a simple frequency table or a
cross tabulation.
• The simple frequency table
• is used when the individual observations involve only
a single variable.
• The cross tabulation
• is used to obtain the frequency distribution of one
variable by the subsets of another variable.
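
As a rough illustration of the two kinds of table, the Python sketch below counts a handful of made-up records. The records, the variable names, and the category labels are all hypothetical and chosen only for the example.

```python
from collections import Counter

# Hypothetical records: (antenatal care, birth weight category) per newborn.
records = [
    ("Yes", "Normal"), ("Yes", "Normal"), ("No", "Low"),
    ("Yes", "Low"), ("No", "Normal"), ("Yes", "Big"),
    ("No", "Normal"), ("Yes", "Normal"),
]

# Simple frequency table: counts for a single variable.
bwt_counts = Counter(bwt for _, bwt in records)

# Cross tabulation: counts for each combination of the two variables.
cross_tab = Counter(records)

print("Simple frequency table:", dict(bwt_counts))
for care in ("Yes", "No"):
    row = {bwt: cross_tab[(care, bwt)] for bwt in ("Low", "Normal", "Big")}
    print("Antenatal care =", care, row)
```

A `Counter` returns 0 for combinations that never occur, so empty cells are shown as an explicit zero rather than left blank.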

31
Construction of tables
• Tables should be as simple as possible.
• Tables should be self-explanatory. For that purpose:
• The title should be clear and to the point (a good title answers:
what? when? where? how classified?) and should be placed above
the table.
• Each row and column should be labelled.
• Numerical entries of zero should be explicitly written rather
than indicated by a dash. Dashes are reserved for missing or
unobserved data.
• Totals should be shown
• If data are not original, their source should be given in a
footnote.
32
Frequency distributions

• A simple and effective way of summarizing categorical data is to
construct a frequency distribution table.

• This is done by counting the number of observations falling into
each of the categories, or levels, of the variable.

33
Simple Frequency Distributions
• Scores arranged from highest to lowest, with the frequency shown for
each score.
• The heading of the score column tells
the name of the variable that is being measured. The generic
name for any variable is X, which is the symbol used in formulas.
• The Frequency ( f ) column shows how frequently a score
occurred.
• The tally marks are used when you construct a rough draft
version and are not usually included in the final form.
• N is the number of scores and is found by summing the numbers
in the f column.
• A useful way to present a set of data because you pick up valuable
information with just a glance.

34
Raw data

35
Steps to construct simple frequency distributions

1. Find the highest and lowest scores. Highest score is 35; lowest is 5.
2. In column form, write in descending order all the numbers. 35 to 5.
3. At the top of the column, name the variable being measured. Satisfaction With
Life Scale scores.
4. Start with the number in the upper left-hand corner of the scores, draw a line
under it, and place a tally mark beside that number in the column of numbers.
Underline 15 in Table 2.1, and place a tally mark beside 15 in Table 2.2.
5. Continue underlining and tallying for all the unorganized scores.
6. Add a column labeled f (frequency).
7. Count the number of tallies by each score and enter the count in the f column.
2, 1, 2, 4, . . . , 0, 2.
8. Add the numbers in the f column. If the sum is equal to N, you haven’t left out
any scores. Sum = 100.
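
The steps above can be sketched in a few lines of Python. The scores below are hypothetical, since the raw-data table is not reproduced in this text, and counting replaces the manual tally marks.

```python
from collections import Counter

# Hypothetical raw scores (not the data from the raw-data slide).
scores = [15, 20, 20, 35, 5, 15, 27, 20, 35, 5]

# Steps 4-7: tallying is the same as counting how often each score occurs.
counts = Counter(scores)

# Steps 1-2: list every possible score from highest to lowest; scores that
# never occurred get a frequency of 0.
highest, lowest = max(scores), min(scores)
table = [(x, counts[x]) for x in range(highest, lowest - 1, -1)]

# Step 8: the frequencies must add up to N.
n = sum(f for _, f in table)
assert n == len(scores)

print("X   f")
for x, f in table:
    if f:  # print only observed scores to keep the output short
        print(f"{x:<3} {f}")
print("N =", n)
```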

36
37
38
Group frequency
• Compilation of scores into equal-sized ranges (class
intervals), with the frequency shown for each
interval.
• Class interval: a range of scores in a grouped frequency
distribution.
• The midpoint of each interval represents all the
scores in that interval.
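
A minimal Python sketch of grouping follows. The scores, the interval width of 5, and the starting value of 5 are assumptions made for the example only.

```python
# Hypothetical raw scores; the width and starting point are illustrative.
scores = [5, 7, 12, 13, 14, 18, 21, 22, 22, 29, 31, 35]
width = 5
start = 5  # lower limit of the lowest class interval

grouped = {}
for x in scores:
    # Lower limit of the equal-sized interval that contains x.
    lower = start + ((x - start) // width) * width
    upper = lower + width - 1
    grouped[(lower, upper)] = grouped.get((lower, upper), 0) + 1

# Print from the highest interval down, with the midpoint that
# represents all scores in that interval.
for (lower, upper), f in sorted(grouped.items(), reverse=True):
    midpoint = (lower + upper) / 2
    print(f"{lower}-{upper}  midpoint={midpoint}  f={f}")
```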

39
40
Relative Frequency

• It is useful to compute the proportion, or percentage, of
observations in each category.

• The distribution of proportions is called the relative frequency
distribution of the variable.

41
Cumulative frequency

• The cumulative frequency of a category is the number of
observations in the category plus the observations in all categories
smaller than it.

• Cumulative relative frequency
• is the proportion of observations in the category plus observations in all
categories smaller than it, and
• is obtained by dividing the cumulative frequency by the total number of
observations.
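
The relative, cumulative, and cumulative relative frequencies defined above can be computed as in the following Python sketch. It uses the birth weight counts from the birth weight table in this deck; the variable names are illustrative.

```python
from itertools import accumulate

# Birth weight counts as reported in the birth weight table of this deck.
categories = ["Very low", "Low", "Normal", "Big"]
freq = [43, 793, 8870, 268]

n = sum(freq)
rel_freq = [100 * f / n for f in freq]          # relative frequency (%)
cum_freq = list(accumulate(freq))               # running totals
cum_rel_freq = [100 * c / n for c in cum_freq]  # cumulative frequency / N (%)

for name, f, rf, cf, crf in zip(categories, freq, rel_freq, cum_freq, cum_rel_freq):
    print(f"{name:<9}{f:>6}{rf:>7.1f}{cf:>7}{crf:>7.1f}")
```

The last cumulative frequency always equals N, and the last cumulative relative frequency always equals 100%, which gives a quick check on the arithmetic.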

42
Table 2. Frequencies of serum cholesterol levels for
1067 US males of ages 25-34, 1976-1980.
-----------------------------------------------------------------------
Cholesterol level
(mg/100 ml)       Freq   Rel. freq (%)   Cum. freq   Cum. rel. freq (%)
-----------------------------------------------------------------------
80-119              13             1.2          13                  1.2
120-159            150            14.1         163                 15.3
160-199            442            41.4         605                 56.7
200-239            299            28.0         904                 84.7
240-279            115            10.8        1019                 95.5
280-319             34             3.2        1053                 98.7
320-359              9             0.8        1062                 99.5
360-399              5             0.5        1067                  100
-----------------------------------------------------------------------
Total             1067             100

43
Table 1. Distribution of birth weight of
newborns between 1976-1996 at TAH.

BWT         Freq.   Rel. freq (%)   Cum. freq   Cum. rel. freq (%)
Very low       43             0.4          43                  0.4
Low           793               8         836                  8.4
Normal       8870            88.9        9706                 97.3
Big           268             2.7        9974                  100
Total        9974             100

44
Graphs of Frequency Distributions

45
Charts

• Bar charts: display the frequency distribution for nominal or ordinal data.

• The vertical axis should always start from 0, but the horizontal axis can
start from anywhere.

• The bars should be of equal width and should be separated from one
another so as not to imply continuity

46
[Figure: two bar charts of BWT (Very low, Low, Normal, Big), one showing frequencies (43, 793, 8870, 268) and one showing percentages (0%, 8%, 89%, 3%)]

Figure 1. Bar charts showing frequency distribution of the variable ‘BWT’
described in Table 1.
47
Bar charts for comparison

• In order to compare the distribution of a variable
for two or more groups, bars are often drawn
alongside each other for the groups being compared in a
single bar chart.

48
[Figure: paired bar charts of BWT (Low, Normal, Big) by antenatal care (Yes, No), one showing frequency and one showing percent]

Fig 2. Bar chart indicating categories of birth weight of 9975 newborns grouped by
antenatal follow-up of the mothers

49
Pie chart

• Pie chart: displays the frequency distribution for
nominal or ordinal data.

• In a pie chart the various categories into which the
observations fall are represented as sectors of a
circle:
• each sector represents either the frequency or the
relative frequency of the observations in a category;
• the angles are proportional to the frequency or the
relative frequency.
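
Because the angles are proportional to the relative frequencies, each sector's angle is the relative frequency multiplied by 360°. The Python sketch below applies this to the birth weight counts from the birth weight table in this deck; the variable names are illustrative.

```python
# Birth weight counts as reported in the birth weight table of this deck.
freq = {"Very low": 43, "Low": 793, "Normal": 8870, "Big": 268}

n = sum(freq.values())

# Each sector's angle is its relative frequency times 360 degrees.
angles = {name: 360 * f / n for name, f in freq.items()}

for name, angle in angles.items():
    print(f"{name:<9}{angle:6.1f} degrees")
print("Total:", round(sum(angles.values()), 1), "degrees")
```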

50
Pie chart…

Figure 8. Pie chart showing distribution of subjects
by their educational status, Jimma 2009.

51
[Figure: two pie charts of BWT (Very low, Low, Normal, Big), one showing frequencies (43, 793, 8870, 268) and one showing percentages (0.4, 8, 88.9, 2.7)]

Figure 3. Pie charts showing frequency distribution of the variable ‘BWT’
described in Table 1.
52
2. Describing numerical variables

• Graphs
• Histograms
• Frequency polygons
• Cumulative frequency polygons

53
Graphs...

• Diagrams have greater attraction than mere figures.

• They give delight to the eye, add a spark of interest, and
catch the attention as much as the figures dispel it.

• They help in deriving the required information in less time and
with less mental strain.

• They have greater memorizing value than mere figures, because
the impression left by a diagram is of a lasting nature.

• They facilitate comparison.

54
Graphs…
• Every graph should be self-explanatory and as simple as possible.
• Titles are usually placed below the graph and should again
answer the questions what? where? when? how classified?
• Legends or keys should be used to differentiate variables if more
than one is shown.
• The axis labels should be placed to read from the left side and from
the bottom.
• The units into which the scale is divided should be clearly indicated.
• The numerical scale representing frequency must start at zero or a
break in the line should be shown.
55
Histograms
• Are frequency distributions with continuous class intervals that
have been turned into graphs.

• Given a set of numerical data, we can obtain an impression of the
shape of its distribution by constructing a histogram.

• A histogram is constructed by choosing a set of non-overlapping intervals
(class intervals) and counting the number of observations that
fall in each class.

• The number of observations in each class is called the frequency.

56
Histograms …

• It is necessary that the class intervals be non-overlapping
so that each observation falls in one and only one interval.

• Except for the two end intervals, class intervals are usually
chosen to be of equal width.

• If this is not the case, the histogram could give a
misleading impression of the shape of the data.
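
The counting behind a histogram can be sketched as follows. The birth weights, the class width of 500 g, and the starting value of 2000 g are hypothetical values chosen for the example, and the bars are drawn as rows of asterisks.

```python
# Hypothetical birth weights in grams; width and start are illustrative.
weights = [2100, 2450, 2500, 2980, 3000, 3120, 3300, 3499, 3500, 4100]
width = 500
start = 2000

# Half-open intervals [lower, lower + width) do not overlap, so each
# observation falls in one and only one class.
n_classes = (max(weights) - start) // width + 1
counts = [0] * n_classes
for w in weights:
    counts[(w - start) // width] += 1

# Text version of the histogram: one row of asterisks per class interval.
for i, f in enumerate(counts):
    lower = start + i * width
    print(f"{lower}-{lower + width - 1}: {'*' * f} ({f})")
```

Note that 3499 and 3500 land in different classes, which is the point of making the intervals non-overlapping.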

57
58
[Figure: histogram of birth weight; vertical axis: frequency (0 to 2000); Std. Dev = 502.34, Mean = 3126, N = 9975]

Fig 5. A histogram displaying frequency distribution of birth weight of
newborns at Tikur Anbessa Hospital
59
Frequency polygons

• Instead of drawing bars for each class interval, sometimes a
single point is drawn at the midpoint of each class interval and
consecutive points are joined by straight lines.

• Graphs drawn in this way are called frequency polygons (line
graphs).

• Frequency polygons are superior to histograms for comparing
two or more sets of data.
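
The points of a frequency polygon can be computed as in this Python sketch, which uses the class limits and counts from the two serum cholesterol tables in this deck. The function name `polygon_points` is an assumption introduced for the example. Relative frequencies are used so that groups of different sizes can be compared.

```python
# Class limits and counts taken from the two serum cholesterol tables
# in this deck (males aged 25-34 and 55-64).
classes = [(80, 119), (120, 159), (160, 199), (200, 239),
           (240, 279), (280, 319), (320, 359), (360, 399)]
freq_25_34 = [13, 150, 442, 299, 115, 34, 9, 5]
freq_55_64 = [5, 48, 265, 458, 281, 128, 35, 7]

def polygon_points(classes, freq):
    """Return (midpoint, relative frequency in %) for each class interval."""
    n = sum(freq)
    return [((low + high) / 2, 100 * f / n)
            for (low, high), f in zip(classes, freq)]

younger = polygon_points(classes, freq_25_34)
older = polygon_points(classes, freq_55_64)

print("midpoint  25-34 (%)  55-64 (%)")
for (mid, y), (_, o) in zip(younger, older):
    print(f"{mid:8.1f}  {y:9.1f}  {o:9.1f}")
```

Joining each group's points with straight lines on the same axes gives the two polygons to compare.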

60
61
[Figure: frequency polygons of birth weight (500 to 5000) by sex (males, females); vertical axis: % (0 to 50)]

Fig. 6. Frequency polygon of birth weight of 9975 newborns at
Tikur Anbessa Hospital for males and females
62
Table 5. Frequencies of serum cholesterol levels for 1067 US
males of ages 25-34, 1976-1980

-----------------------------------------------------------------------
Cholesterol level
(mg/100 ml)       Freq   Rel. freq (%)   Cum. freq   Cum. rel. freq (%)
-----------------------------------------------------------------------
80-119              13             1.2          13                  1.2
120-159            150            14.1         163                 15.3
160-199            442            41.4         605                 56.7
200-239            299            28.0         904                 84.7
240-279            115            10.8        1019                 95.5
280-319             34             3.2        1053                 98.7
320-359              9             0.8        1062                 99.5
360-399              5             0.5        1067                  100
-----------------------------------------------------------------------
Total             1067             100

63
Table 6. Frequencies of serum cholesterol levels for 1227 US
males of ages 55-64, 1976-1980

-----------------------------------------------------------------------
Cholesterol level
(mg/100 ml)       Freq   Rel. freq (%)   Cum. freq   Cum. rel. freq (%)
-----------------------------------------------------------------------
80-119               5             0.4           5                  0.4
120-159             48             3.9          53                  4.3
160-199            265            21.6         318                 25.9
200-239            458            37.3         776                 63.2
240-279            281            22.9        1057                 86.1
280-319            128            10.4        1185                 96.5
320-359             35             2.9        1220                 99.4
360-399              7             0.5        1227                  100
-----------------------------------------------------------------------
Total             1227             100

64
[Figure: two panels comparing ages 25-34 and 55-64; left panel: relative frequency (%) by serum cholesterol level (mg/100ml); right panel: cumulative relative frequency (%) by serum cholesterol level (mg/100ml)]

Fig. 7. Frequency polygon (Ogive curves Vs survival curves) and Cumulative frequency
polygons of serum cholesterol levels for 2294 males aged 25-34 and 55-64 years, 1976-1980

65
THANK YOU!

66
