
Applied Research Methods

Birhanu Teshome (PhD)


EMAIL: birhanu.teshome@aau.edu.et

Ethiopian Civil Service University


Institute of Tax and Customs Administration

Oct. 2017
Course Contents
Chapter I: Basic concepts in statistics (overview)
– What is Statistics?
– Methods of data collection
– Some Basic Terms in Statistics
– Sampling Techniques
– Criteria for the acceptability of a sampling method
Chapter II: Classification and presentation of statistical data
(overview)
– Scales of measurement and types of classification of variables
– Grouped frequency distribution
– Graphical presentation of data
Chapter III: Measures of central tendency and dispersion
– Measures of central tendency
– Measures of dispersion
– General shape of distributions
Chapter IV: Estimation of sample size
– Sample size determination with continuous data
– Sample size determination for proportions
Chapter V: Introduction to tests of hypotheses
– Review of hypothesis testing
– Data analysis
– Parametric and non-parametric statistics (tests)

Chapter VI: Parametric tests of hypotheses concerning means,
variances and proportions
– Independent samples tests: tests of equality of two population
variances, tests concerning the difference between two means, tests of
mean difference between several populations and analysis of variance
(ANOVA)
– Paired-samples t-test (differences between dependent groups)
– Hypothesis test for the difference between two proportions
Chapter VII: Non-parametric tests
– Tests concerning independent samples (groups): Kruskal-Wallis test,
Mann–Whitney U test
– Tests concerning dependent samples (groups): Friedman test on
related ordinal measures, Wilcoxon signed-rank test, Cochran test on
related binary measures, McNemar test

Chapter VIII: Tests of association


– The Chi-square test – test of association between categorical variables
– The Pearson coefficient of correlation and test of its significance
– The Spearman rank correlation coefficient
Chapter IX: The simple linear regression model
– What is a regression model?
– Simple linear regression
– Precision and standard errors
– Inference concerning regression coefficients
Chapter X: The multiple linear regression model
– Introduction
– The coefficient of determination
– Test of model adequacy: Analysis of variance (ANOVA)
– Tests on regression coefficients
Chapter XI: The logistic regression model
– Introduction
– Binary logistic regression
– Tests on regression coefficients
Chapter XII: Overview of scientific research
– Research problem
– Review of literature
– Research design
– Sample design
– Data analysis
– Discussion of results, interpretation and conclusions
Teaching methods and further info.
• Teaching methods: Lectures, group project and assignments
• Software: Mainly SPSS
• Assessment: test (20%), group project, presentations (40%)
and a final written exam (40%).
• Attendance: Regular attendance in this course is expected.
• Participation: Active participation is expected.
• Electronic communication: I will occasionally send lecture
notes, assignments and reading materials through e-mail, so
please check your account regularly.

Introduction
Definition and classifications of statistics:

• Statistics can be defined in two senses:
• In the plural sense: statistics are the raw data themselves,
like statistics of births, statistics of deaths, statistics of
students, statistics of imports and exports, etc.
• In the singular sense: statistics is the field of study that
deals with the collection, organization, presentation,
analysis and interpretation of numerical data.

Applications, Uses & Limitations of
statistics
Applications of statistics:

• In almost all fields of human endeavor: planning, policy making,
marketing decisions, industry (especially quality control),
health-related studies, …
• Almost all human beings encounter numerical facts in their daily
lives, e.g. a taxi fare.
• Applicable in processes such as the invention of certain drugs and
assessing the extent of environmental pollution
• Statistics are everywhere: just look at any newspaper or other
literature.
Uses of statistics
Some uses of statistics:
• Presents facts in a definite and precise form
• Data reduction
• Measures the magnitude of variations in data
• Furnishes a technique of comparison
• Estimating unknown population characteristics
• Formulating and testing hypotheses
• Studying the relationship between two or more variables
• Forecasting future events
Limitations of statistics
As a science statistics has its own limitations. Some of its
limitations:
• Deals only with quantitative information
• Deals only with aggregates of facts, not with individual
data items
• Statistical data are only approximately, not mathematically,
correct
• Statistics can easily be misused and should therefore be
used by experts.
Misuses of statistics
Many people, knowingly or unknowingly, use data in the wrong
manner:
• Unrepresentative/ Inadequate sample
• Unfair comparison
• Unwarranted conclusion:
– may result from making false assumptions.
– may result from using the wrong average. E.g., assume two monthly
incomes of 1,000,000 and 1,000. The arithmetic average (500,500)
describes neither income and may give a wrong idea.
• Suppression of unfavorable results: hiding unfavorable, though
true, facts emerging from statistical study
• Use of inefficient statistical models or mistakes in arithmetic
Classification of statistics
Based on the usage of statistical data, statistics is broadly divided
into two mutually exclusive groups: descriptive and inferential

Descriptive statistics:
• Used to describe the basic features of the data in a study
• Provide simple summaries about the sample and the measures
• Ways of organizing and summarizing data
• Helps to identify the general features and trends in a set of data
and extracting useful information
• Also very important in conveying the final results of a study

Descriptive Statistics
• Collect data
– e.g., survey
• Present data
– e.g., tables and graphs
• Summarize data
– e.g., sample mean $\bar{X} = \frac{\sum X_i}{n}$
Inferential statistics
• A method used to generalize from a sample to a population
• E.g., the average income of all families in Ethiopia (the
population) can be estimated from figures obtained from a few
thousand families (the sample)
• It is important because statistical data usually arise from
samples.
• Statistical techniques based on probability theory are required
• Provides the basis for predictions, forecasts, and estimates that
are used to transform information into knowledge
Inferential Statistics
• Estimation
– e.g., estimate the population mean using the sample mean
– Confidence interval
• Hypothesis testing
– e.g., test the claim that the population mean weight is 56 kg
– Comparison of two or more means or proportions

Inference is the process of drawing conclusions or making decisions
about a population based on sample results.
Descriptive Vs Inferential
Classify the following sentences as belonging to the area of
descriptive statistics or inferential statistics.
• As a result of recent cutbacks by oil-producing nations, we
expect the price of gasoline to double in the next year.
• At least 5% of all killings reported last year in city Z were
due to terrorism.
• Mr. Y concludes that his chance of passing this course is at
least 90%, based on the statistic that 85% of his seniors passed
the course last year.
• Of all patients who received this particular type of drug at
clinic Y, 75% later developed significant side effects.
Key Definitions
• A population is the collection of all items
of interest or under investigation
– N represents the population size
• A sample is an observed subset of the population
– n represents the sample size
[Figure: a sample drawn from the population; information from the
sample is used to make inferences about the population]
Role of statistics: using information from a sample to make
inferences about the population
Generalizability
If the sample is not representative of the population, the
conclusions will be restricted to the sample and cannot be
generalized to the target population!
Key Definitions
Target population: A collection of items that have something in
common for which we wish to draw conclusions at a particular
time. E.g., All financial offices in Ethiopia
• Defining the target population is an important and often difficult part
of the study. E.g., in a political poll, should the target population
be all adults eligible to vote? All registered voters? All persons who
voted in the last election?
• The choice of target population will profoundly affect the statistics
that result.
Study (Sampled) Population: The subset of the target population
that has at least some chance of being sampled
• The specific population group from which samples are drawn
and data are collected
Example: In a study of the prevalence of HIV among adolescents in
Ethiopia, a random sample of adolescents in Bole KK of AA was
included.
• Target population: All adolescents in Ethiopia
• Study population: All adolescents in Addis Ababa
• Sample: Adolescents in Bole KK who were included in the study
[Figure: the sample nested inside the study population, which is
nested inside the target population]
Parameter and Statistic

• A parameter is a specific characteristic of a population
– Values calculated using population data
– E.g., the mean (µ) age of the target population
• A statistic is a specific characteristic of a sample
– Values computed from sample data
– E.g., sample mean age ($\bar{x}$)
Stages in statistical investigation
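[Figure: the stages of a statistical investigation, in order:
formulate the problem, data, organization, presentation, analysis of
data, interpretation. The diagram labels the earlier stages
"Descriptive Statistics" and the later stages "Inferential Statistics".]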
Formulating the problem
• Research begins with a problem (or problems)
– The problem need not be earth-shaking
• Identifying the problem can actually be the hardest part of
research
• Sources of research problems:
– Observation, literature reviews, professional conferences, experts.
• A good research project should:
– Address an important question
– Advance knowledge
Research Design
The first stage in any statistical investigation should be to:
• Get a clear understanding of the physical background to
the situation under study
• Clarify the objectives very carefully
• Formulate the objectives in statistical terms
We cannot study all subjects (e.g. all pregnant women) living in
a given geographical area, so the design must also settle:
• Sampling techniques
• Sample size calculation, study design
• Method of data collection
• Etc.
Methods of Data Collection
Data are facts or figures from which conclusions can be drawn.
• In order to draw valid conclusions, it is important to have
‘good’ data
• Data are gathered with the aim of meeting predetermined objectives.
• The data themselves form the foundation of statistical analyses,
and hence the data must be carefully and accurately collected.
• Can be obtained from:
– Routinely kept records, literature, surveys, experiments,
reports, observation, etc.
• Who needs information?
– Governments, businesses, organizations, and individuals all need
information for their day-to-day lives
Types of Data
Primary data: collected from the items or individual respondents
directly by the researcher for the purpose of a study.
• You collect the data yourself
• The data you collect are unique to you and your research and,
until you publish, no one else has access to them
• Methods of collecting primary data: interviews, questionnaires,
observation (measurement) and diaries
Secondary data: data that have already been collected by someone
else or by another organization (e.g., researchers, institutions,
NGOs, …)
• Some sources: official statistics, scholarly journals, reference
books, research institutes, universities, libraries, library search
engines, computerized databases and the World Wide Web.
Method of primary data collection
Questionnaire: a popular means of data collection
• Written questions are mailed or hand-delivered to respondents
• Difficult to design and often requires many rewrites before an
acceptable questionnaire is produced.
Advantages:
• Can be used as a method in its own right or as a basis for
interviewing or a telephone survey.
• Relatively cheap
• Can be posted, e-mailed or faxed
• Can cover a wide geographic area and a large number of people or
organizations
• Avoids embarrassment on the part of the respondent.
• Possible anonymity of respondent.
• No interviewer bias
Method of primary data collection:
Questionnaire
Disadvantages:
• Historically low response rate (although inducements may help)
• Assumes no literacy problems
• No control over who completes it
• Not possible to give assistance if required
• Time delay whilst waiting for responses to be returned
• Several reminders may be required.
• Respondent can read all questions beforehand and then decide
whether to complete it or not, e.g. because it is too long, too
complex, uninteresting, or too personal.
Primary data collection: Interviewing
• Primarily used to gain an understanding of the underlying
reasons and motivations for people’s attitudes…
• Interviews can be undertaken on a personal one-to-one basis
or in a group.
• Can be conducted at work, at home, on the street, in a
shopping center, or at some other agreed location.
Advantages:
• Serious approach by respondent, resulting in accurate information.
• Good response rate, completed and immediate.
• Possible in-depth questions.
• Interviewer in control and can give help if there is a problem.
• Can investigate motives and feelings.
• Can use recording equipment.
• If one interviewer is used, uniformity of approach.
Primary data collection: Interviewing
Disadvantages:
• Time consuming.
• Geographic limitations.
• Can be expensive.
• Need to set up interviews.
• Normally need a set of questions.
• Respondent bias– tendency to please or impress, create false
personal image, or end interview quickly.
• Embarrassment possible if personal questions.
• Transcription and analysis can present problems– subjectivity.
• If many interviewers, training is required!

Secondary data collection
Advantages of secondary data
• Saves time and money
• Avoids data collection problems
• Provides a basis for comparison
Disadvantages
• Quality of documentation
• Data quality control
• Level of observation
• Data availability
• Outdated data
Sampling Techniques

Why Sample?
• Researchers often use sample survey methodology to
obtain information about a larger population by selecting
and measuring a sample from that population.
• Less time consuming than a census
• Less costly to administer than a census
• It is possible to obtain statistical results of a sufficiently
high precision based on samples.
• Avoids destructive testing
• The only option when the population is infinite


Why Census?
Census: a complete enumeration of the entire population.
E.g., the population and housing censuses conducted in Ethiopia in
1984, 1994 and 2007.
Sometimes taking a census makes more sense than using
a sample. Some of the reasons include:
• Universality
• Detailedness
• Representativeness
• Due to the variability in the characteristics of the population,
scientific sample designs should be applied to select a
representative sample.
• If not, there is a high risk of distorting the view of the
population.
• Sampling enables us to estimate the characteristics of a
population by directly observing a portion of the population.
• Researchers are not interested in the sample itself, but in
what can be learned from the sample and how this
information can be applied to the entire population.
[Figure: information flows from the sample to the population]
• It is essential that a sample be correctly defined and
organized.
• If the wrong questions are posed to the wrong people,
reliable information will not be obtained, and applying the
results to the entire population will lead to wrong conclusions.
Steps needed to select a sample and ensure
that this sample will fulfill its goals

1. Establish the study's objectives
– The first step in planning a useful and efficient survey is to
specify the objectives in as much detail as possible.
– Clarifying the aims of the survey is critical to its ultimate
success.
– Without objectives, the survey is unlikely to generate
valuable results.
– The initial users and uses of the data should be identified
at this stage.
2. Define the target population

• Reference population (or target population):
– the population of interest to whom the researchers would like to
make generalizations.
– the total population for which the information is required.
• Sampling population: the subset of the target population from
which a sample will be drawn.
• Study population: the actual group in which the study is
conducted = Sample
• Study unit: the units on which information will be collected:
persons, households, etc.
Example: Researchers are interested in the factors associated
with ART use among HIV/AIDS patients attending certain
hospitals in a given Region.
• Target population = All ART patients in the Region
• Sampling population = All ART patients in, e.g., 3 hospitals in
the Region
• Sample: drawn from the sampling population
2. Define the target population

• Defining the target population is an important and often difficult
part of the study.
– E.g., in a political poll, should the target population be all adults
eligible to vote? All registered voters? All persons who voted in the last
election?
• The choice of target population will profoundly affect the
statistics that result.
• The target population is defined by the following characteristics:
– Nature of data required
– Geographic location
– Reference period
– Other characteristics, such as socio-demographic
characteristics
3. Decide on the data to be collected

– The data requirements of the survey must be established.
– To ensure that the requirements are operationally sound, the
necessary data terms and definitions also need to be determined.
4. Set the level of precision
• There is a level of uncertainty associated with estimates
coming from a sample.
• Setting the level of precision required in the estimate means
specifying the acceptable margin of error and the confidence
level
• Researchers can estimate the sampling error associated with
a particular sampling plan, and try to minimize it.

↑ Sample size ≡ ↑ Precision ≡ ↑ Cost


• Sample-to-sample variation causes sampling error
• Acceptable precision is important
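The link between sample size and precision can be sketched numerically. The sketch below is only an illustration under stated assumptions: a simple random sample, a proportion as the quantity being estimated, and the usual normal-approximation margin of error z·sqrt(p(1−p)/n). The function name and the defaults p = 0.5 and z = 1.96 (95% confidence) are illustrative choices, not part of the course material.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion estimated from a
    simple random sample of size n (normal approximation, assumed here)."""
    return z * math.sqrt(p * (1 - p) / n)

# Larger samples give smaller margins of error (higher precision).
for n in (100, 400, 1600):
    print(n, margin_of_error(n))
```

Under this formula, quadrupling the sample size halves the margin of error, which is why extra precision becomes increasingly costly.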


5. Decide on the methods of measurement
• Choose the measuring instrument and the method of approach to the
population
• Data about a person’s state of health may be obtained from
statements that he/she makes or from a medical examination
• The survey may employ a self-administered questionnaire or an
interview

6. Prepare the frame
• List of all members of the population from which the sample
will be taken
• The elements must not overlap
The sample design
• Sample design: how the sample will be collected.
• Estimation techniques: how the results from the sample will
be extended to the whole population.
• Measures of precision: how the sampling error will be
measured.
Other Considerations
• Sample size determination
• Questionnaire development
• Pretest
• Organization of the field work
• Data collection, data entry
• Summary and analysis of the data (edit the completed
questionnaires, decide on computation procedures)
Sampling
• Sampling: The process of selecting a portion of the
population to represent the entire population.
• Main concerns in sampling:
– Ensure that the sample represents the population, and
– Ensure that the findings can be generalized.
Basic questions while selecting a SAMPLE:
• What is the group of people (STUDY POPULATION) from which
we want to draw a sample?
• How many people do we need in our sample?
• How will these people be selected?
Advantages of sampling:
• Feasibility: Sampling may be the only feasible method of
collecting information.
• Reduced cost: Sampling reduces demands on resources
such as finance, personnel, and material.
• Greater speed: Data can be collected and summarized
more quickly
• Greater accuracy: Sampling may lead to more accurate data
collection
• Sampling error: Precise allowance can be made for
sampling error
Disadvantages of sampling:
• There is always a sampling error.
• Sampling may create a feeling of
discrimination within the population.

Errors in sampling
1) Sampling error: errors caused by the act of taking a sample.
They cause sample results to be different than results of a
census.
– They cannot be avoided or totally eliminated.
– Can be controlled by selecting a “large” sample
• Random sampling error – deviation between the sample
statistic and the population parameter caused by chance in
selecting a random sample. The margin of error in a confidence
statement includes only random sampling error.
2) Non-sampling error: errors not related to the act of selecting a
sample from the population. They can be present in a census.
- Observational error
- Respondent error
- Lack of preciseness of definition
- Errors in editing and tabulation of data
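The point made under sampling error, that it cannot be eliminated but shrinks as the sample grows, can be illustrated with a small simulation. This is a sketch under assumptions: the population is made-up data (10,000 values between 0 and 100), samples are simple random samples, and the function name is hypothetical.

```python
import random

def mean_abs_sampling_error(population, n, reps=500, seed=1):
    """Average |sample mean - population mean| over repeated simple random samples."""
    rng = random.Random(seed)
    mu = sum(population) / len(population)
    errors = []
    for _ in range(reps):
        sample = rng.sample(population, n)
        errors.append(abs(sum(sample) / n - mu))
    return sum(errors) / reps

# Hypothetical population: 10,000 made-up values between 0 and 100.
rng = random.Random(0)
population = [rng.uniform(0, 100) for _ in range(10_000)]

small = mean_abs_sampling_error(population, n=10)
large = mean_abs_sampling_error(population, n=1000)
print(small, large)  # the error for n=1000 is much smaller, but not zero
```

Non-sampling errors are not captured by such a simulation: they would remain even if the whole population were enumerated.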
Errors in Sampling
• Most sample surveys are afflicted by errors other than random
sampling error.
• These errors introduce bias that makes a confidence interval
basically meaningless.
• Good sampling technique includes reducing all sources of
error.
• Part of this includes random sampling and confidence
statements.

Sampling Errors
• Random sampling error
– Margin of error & confidence statement
• Bad sampling methods
– Voluntary response & convenience samples
• Under-coverage bias
– Occurs when some groups in the population are left out
of the process of choosing a sample.
– Limited sampling frame
– Homeless
– Subjects excluded who are in hospitals, motels, etc.

• Sampling errors in careful sample surveys are
usually small.
Non-sampling Errors
• Response error (incorrect response)
– A subject may lie or not remember (period of time
questions)
– People may lie, especially if the questions are embarrassing: Age,
weight, income
– People may not remember (period of time questions)
• How many movies did you watch last year?
• Non-response bias
– Occurs when an individual chosen refuses to provide
answers or cannot be contacted.
• Measurement bias
– Interviewer bias
• Occurs when an interviewer (because of social position, poor
training, etc.) influences the response in a systematic way.
– Question wording bias
• Occurs when questions have leading phrases, loaded words, or
ambiguities that influence the response.
• Processing errors
Questions to Ask Before You Believe a
survey result
• What was the (target) population?

• How was the sample selected? Randomly?

• How large was the sample & margin of error?

• What was the response rate?

• How were the subjects contacted?

• When was the survey conducted?

• What were the exact questions asked?

• Who carried out the survey?


General Suggestions
• Think through the survey, include what critics might ask.

• Do a pilot survey

• Follow ups increase response rates.

• Do the best you can with what you have.

• Be honest about limitations (& report).

How to live with non-sampling errors
• Non-sampling errors, such as non-response, are always
there.

• Substitute other households (subjects) for the non-responders
(not always!).
• Weight the responses in an attempt to correct for sources of
bias.
– E.g., if too many women are in the sample, the survey gives
more weight to the responses of men.
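One common way to weight responses is to give each group a weight equal to its population share divided by its sample share. The sketch below assumes this simple scheme; the function name, the 70/30 sample split and the 50/50 population split are hypothetical figures used only for illustration.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight for each group = population share / sample share."""
    n = sum(sample_counts.values())
    return {group: population_shares[group] / (count / n)
            for group, count in sample_counts.items()}

# Hypothetical survey: 70 women and 30 men, population assumed to be 50/50.
weights = poststratification_weights(
    {"women": 70, "men": 30},
    {"women": 0.5, "men": 0.5},
)
print(weights)  # men get a weight above 1, women a weight below 1
```

After weighting, each group contributes in proportion to its assumed population share rather than its share of the sample.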

Sampling Methods
Two broad divisions:

A. Probability (Random) sampling methods

B. Non-probability sampling methods

A. Probability Sampling

• Involves random selection of a sample


• Every sampling unit has a known and non-zero
probability of selection into the sample.
• Involves the selection of a sample from a population,
based on chance

• Often yields representative samples
• Probability sampling is:
– more complex,
– more time-consuming and
– usually more costly than non-probability sampling.

• Why use probability sampling, then?
– Because study samples are randomly selected and their
probability of inclusion can be calculated, reliable estimates
can be produced and inferences can be made about the population.
• There are several different ways in which a probability
sample can be selected.

• The method chosen depends on a number of factors, such as
– the available sampling frame,
– how spread out the population is,
– how costly it is to survey members of the population
Most common probability
sampling methods

1. Simple random sampling
2. Systematic random sampling
3. Stratified random sampling
4. Cluster sampling
5. Multi-stage sampling
1. Simple random sampling
• Every member of the population has an equal chance of being
selected
• Objects are selected independently
• A simple random sample is the ideal against which other
sampling methods are compared
• The required number of individuals is selected at random
from the sampling frame, a list or a database of all individuals
in the population
To use the SRS method:
– Make a numbered list of all the units in the population
– Each unit should be numbered from 1 to N (where N is
the size of the population)
– Select the required number of units at random.
The randomness of the sample is ensured by:
– Use of “lottery” methods
– Table of random numbers
– Computer programs
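The "computer programs" option can be sketched with Python's standard `random` module. The function name and the figures N = 500 and n = 60 are illustrative assumptions; the frame is taken to be the unit numbers 1 to N.

```python
import random

def simple_random_sample(N, n, seed=None):
    """Draw n distinct unit numbers from a frame numbered 1..N,
    with every subset of size n equally likely."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, N + 1), n))

sample = simple_random_sample(N=500, n=60, seed=42)
print(sample[:10])  # first ten selected unit numbers
```

Fixing the seed makes the draw reproducible, which is useful when the selection has to be documented or audited.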

SRS has certain limitations:

• Requires a sampling frame
• Difficult if the reference population is dispersed.
• Minority subgroups of interest may not be selected.
• It can be expensive and is often not feasible in practice
• Since it gives each element in the population an equal chance
of being chosen in the sample, it may result in samples that
are spread out over a large geographical area. Such a
geographic distribution of the sample would be very costly to
implement
2. Systematic random sampling
• often used instead of random sampling

• Selection of individuals from the sampling frame systematically
rather than randomly
• Individuals are taken at regular intervals down the list
• The starting point is chosen at random
2. Systematic random sampling
• Taking individuals at fixed intervals (every kth) based on the
sampling fraction

• Important if the reference population is arranged in some
order:
– Order of registration of patients
– Numerical order of house numbers
– Students’ registration books
Steps in systematic random sampling
1. Number the units on your frame from 1 to N (where N is the
total population size).

2. Determine the sampling interval (K) by dividing the number
of units in the population by the desired sample size.
3. Select a number between one and K at random. This
number is called the random start and would be the first
number included in your sample.
4. Select every Kth unit after that first number
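The four steps can be sketched in a few lines of Python. This is a minimal illustration that assumes N is an exact multiple of n, so that the interval K is a whole number; the function name and its parameters are hypothetical.

```python
import random

def systematic_sample(N, n, start=None, seed=None):
    """Select every K-th unit from a frame numbered 1..N, with K = N // n.
    Assumes N is an exact multiple of n."""
    K = N // n                                      # step 2: sampling interval
    if start is None:
        start = random.Random(seed).randint(1, K)   # step 3: random start in 1..K
    return list(range(start, N + 1, K))[:n]         # step 4: every K-th unit

print(systematic_sample(N=400, n=100, start=3)[:5])  # [3, 7, 11, 15, 19]
```

When N is not a multiple of n, the interval has to be rounded or a circular variant used; that case is not handled here.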

Example
• To select a sample of 100 from a population of 400, you
would need a sampling interval of 400 ÷ 100 = 4.

• Therefore, K = 4.

• You will need to select one unit out of every four units to
end up with a total of 100 units in your sample.

• Select a number between 1 and 4 from a table of random
numbers.
• If you choose 3, the third unit on your frame would be the
first unit included in your sample.
• The sample might consist of the following units to make up
a sample of 100: 3 (the random start), 7, 11, 15, 19...395,
399 (up to N, which is 400 in this case).
• Using the above example, you can see that with a systematic
sampling approach there are only four possible samples that
can be selected, corresponding to the four possible random
starts:

A. 1, 5, 9, 13...393, 397
B. 2, 6, 10, 14...394, 398
C. 3, 7, 11, 15...395, 399
D. 4, 8, 12, 16...396, 400
• Each member of the population belongs to only one of the four
samples and each sample has the same chance of being
selected.

• The main difference from SRS: with SRS, any combination of 100 units
would have a chance of making up the sample, while with
systematic sampling, there are only four possible samples.
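The contrast between the four systematic samples and the number of possible simple random samples can be checked directly. The sketch below uses the figures from the example (N = 400, n = 100); the variable names are illustrative.

```python
import math

N, n = 400, 100
K = N // n  # sampling interval

# Every possible systematic sample: one per random start 1..K.
systematic_samples = [list(range(start, N + 1, K)) for start in range(1, K + 1)]
print(len(systematic_samples))  # 4

# Number of distinct samples of size n under simple random sampling.
srs_samples = math.comb(N, n)
print(len(str(srs_samples)))    # number of digits in that count
```

Each unit appears in exactly one of the four systematic samples, whereas under SRS the number of possible samples is astronomically large.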
Advantages of Systematic random
sampling
• The systematic sampling design is simple and convenient.
• The time and work involved in sampling using this method
are relatively low.
• The results obtained are also found to be generally
satisfactory, provided care is taken to see that there are no
periodic features associated with the sampling interval.
• If populations are sufficiently large, systematic sampling can
often be expected to yield results similar to those obtained by
proportional stratified sampling.
Disadvantages of Systematic random
sampling
• The main limitation of the method is that it becomes less
representative if we are dealing with populations having
“hidden periodicities”.
• If the population is ordered in a systematic way with respect
to the characteristics the investigator is interested in, then it is
possible that only certain types of items will be included in the
sample, or at least more of certain types than others.
Note: Systematic sampling should not be used when a cyclic
repetition is inherent in the sampling frame.
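The danger of cyclic repetition can be shown with a deliberately extreme, made-up frame: 400 houses in which every fourth house is a corner house, sampled with an interval of K = 4. The frame and all names below are hypothetical.

```python
# Hypothetical frame of 400 houses in which every 4th house is a corner house.
N, K = 400, 4
is_corner = [unit % 4 == 0 for unit in range(1, N + 1)]  # units 4, 8, 12, ...

def share_of_corner_houses(start):
    """Share of corner houses in the systematic sample with the given start."""
    sample = range(start, N + 1, K)
    return sum(is_corner[u - 1] for u in sample) / len(sample)

true_share = sum(is_corner) / N
print(true_share)                                             # 0.25
print([share_of_corner_houses(s) for s in range(1, K + 1)])   # [0.0, 0.0, 0.0, 1.0]
```

Because the sampling interval matches the period of the frame, every possible sample contains either no corner houses or only corner houses, and none comes close to the true share.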

3. Stratified random sampling
• It is used when the population is known to be heterogeneous
with regard to some factors, and those factors are used for
stratification
• Using stratified sampling, the population is divided into
homogeneous, mutually exclusive groups called strata
• A population can be stratified by any variable that is available
for all units prior to sampling (e.g., income (low, medium &
high), age, sex, province of residence, etc.)
• A separate sample is taken independently from each stratum.
• Any of the sampling methods mentioned in this section (and
others that exist) can be used to sample within each stratum.
Why do we need to create strata?
• It can make the sampling strategy more efficient.

• A larger sample is required to get a more accurate estimate
if a characteristic varies greatly from one unit to another.
– For example, if every person in a population had the same salary, then
a sample of one individual would be enough to get a precise estimate
of the average salary.
• Efficiency gain using stratification.
– If you create strata within which units share similar characteristics
(e.g., income) and are considerably different from units in other strata
(e.g., occupation) then you would only need a small sample from each
stratum to get a precise estimate of total income for that stratum.
– Then you could combine these estimates to get a precise estimate of
total income for the whole population.
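The efficiency argument can be seen in its most extreme form with a made-up population in which incomes are identical within each stratum. All figures and names below are hypothetical; real strata are only approximately homogeneous, so the gain is smaller in practice.

```python
# Hypothetical population: three strata whose members have identical incomes.
strata = {
    "low": [1_000] * 50,
    "medium": [5_000] * 30,
    "high": [20_000] * 20,
}
true_total = sum(sum(incomes) for incomes in strata.values())

# Take a single unit from each stratum and scale it up by the stratum size.
estimate = sum(incomes[0] * len(incomes) for incomes in strata.values())
print(true_total, estimate)  # 600000 600000
```

With no variation inside the strata, a sample of just three units (one per stratum) reproduces the population total exactly, whereas a simple random sample of three units could miss an entire stratum.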
• Stratified sampling is superior to SRS because it reduces sampling error
• If you use an SRS approach in the whole population without
stratification, the sample would need to be larger than the
total of all stratum samples to get an estimate of total income
with the same level of precision.
• Stratified sampling ensures an adequate sample size for
sub-groups in the population of interest.
• When a population is stratified, each stratum becomes an
independent population and you will need to decide the
sample size for each stratum.
• Equal allocation:
– Allocate equal sample size to each stratum

• Proportionate allocation:
$$n_j = \frac{n}{N} \, N_j$$
– $n_j$ is the sample size of the jth stratum
– $N_j$ is the population size of the jth stratum
– $n = n_1 + n_2 + \dots + n_k$ is the total sample size
– $N = N_1 + N_2 + \dots + N_k$ is the total population size
Example: Proportionate Allocation

Village            A     B     C     D     Total
HHs (households)   100   150   120   130   500
Sample size        ?     ?     ?     ?     60
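The allocation for this example can be worked out by applying the proportionate rule, sample size of a stratum = (n / N) × stratum size, to each village. The function below is a minimal sketch with a hypothetical name; it returns unrounded allocations, and rounding to whole households is left as a separate decision.

```python
def proportionate_allocation(stratum_sizes, n):
    """Allocate n_j = (n / N) * N_j to each stratum (unrounded)."""
    N = sum(stratum_sizes.values())
    return {name: n * N_j / N for name, N_j in stratum_sizes.items()}

households = {"A": 100, "B": 150, "C": 120, "D": 130}
allocation = proportionate_allocation(households, n=60)
print(allocation)
```

Villages A and B receive 12 and 18 households; C and D receive 14.4 and 15.6, which round to 14 and 16, so the rounded allocation still totals 60.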

4. Cluster sampling
• Is preferable when the population is subdivided in to groups
or clusters that are internally heterogonous and externally
homogenous
• Sometimes it is too expensive to carry out SRS
– Population may be large and scattered.
– Complete list of the study population unavailable
– Travel costs can become expensive if interviewers have to
survey people from one end of the country to the other.
• Cluster sampling is the most widely used method for reducing cost
• The clusters should be similar to one another (each internally
heterogeneous), unlike stratified sampling, where the strata differ
from one another (each internally homogeneous)
76
Steps in cluster sampling
• Cluster sampling divides the population into groups
or clusters.
• Clusters are selected randomly to represent the total
population.
• Then all units within selected clusters are included in
the sample.
• No units from non-selected clusters are included in
the sample.
– They are represented by those from selected clusters.
• This differs from stratified sampling, where some
units are selected from each group.
77
Example
• In a school based study, we assume students of the same
school are homogeneous.

• We can randomly select sections and include all students of
the selected sections only

Advantages
• Cost reduction
• It creates 'pockets' of sampled units instead of spreading the
sample over the whole territory.
• Sometimes a list of all units in the population is not available,
while a list of all clusters is either available or easy to create.
78
Disadvantages
• Creates a loss of efficiency when compared with SRS.
• Final sample size may be larger or smaller than you expected.
– You do not have total control over the final sample size.
– Not all schools have the same number of students, and you
must interview every student in each selected school.
• Neighboring units tend to be more alike, resulting in a
sample that may not represent the whole spectrum of
opinions or situations present in the overall population.

79
5. Multi-stage Sampling
• Similar to cluster sampling, except that it involves
picking a sample from within each chosen cluster, rather
than including all units in the cluster.
• This type of sampling requires at least two stages.
• The primary sampling unit (PSU) is the sampling unit in
the first sampling stage.
• The secondary sampling unit (SSU) is the sampling unit in
the second sampling stage, etc.

80
Woreda (PSU) → Kebele (SSU) → Sub-Kebele (TSU) → HH

81
• In the first stage, large groups or clusters are identified
and selected. These clusters contain more population
units than are needed for the final sample.
• In the second stage, population units are picked from
within the selected clusters (using any of the possible
probability sampling methods) for a final sample.
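
The two stages described above can be sketched in Python. This is only an illustration: the function `two_stage_sample`, the frame of six clusters and the household labels are all hypothetical, and simple random sampling is assumed at both stages.

```python
import random

def two_stage_sample(clusters, n_clusters, n_units, seed=None):
    """Stage 1: draw clusters at random. Stage 2: draw units within each."""
    rng = random.Random(seed)
    # Stage 1: simple random sample of cluster names (the PSUs).
    chosen = rng.sample(sorted(clusters), n_clusters)
    # Stage 2: simple random sample of units inside each chosen cluster.
    return {name: rng.sample(clusters[name], min(n_units, len(clusters[name])))
            for name in chosen}

# Hypothetical frame: 6 clusters, each listing 20 households.
frame = {f"cluster_{k}": [f"hh_{k}_{i}" for i in range(1, 21)]
         for k in range(1, 7)}

sample = two_stage_sample(frame, n_clusters=2, n_units=5, seed=1)
for name, units in sample.items():
    print(name, units)
```

Only the list of clusters is needed up front. Unit lists are required just for the clusters that were actually selected.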

Advantages and Disadvantages
• No need to have a list of all of the units in the population.
– All you need is a list of clusters and a list of the units in the
selected clusters.
• Cost reduction.
• Saves a great amount of time and effort.
• Disadvantage: the sample size is bigger than for an SRS.
82


B. Non-probability sampling
• In non-probability sampling, every item has an unknown
chance of being selected.
• In non-probability sampling, there is an assumption that
there is an even distribution of a characteristic of interest
within the population.

• This is what makes the researcher believe that any sample
would be representative and that, because of this, results will be
accurate.
• For probability sampling, randomness is a feature of the
selection process, rather than an assumption about the
structure of the population.
83
• In non-probability sampling, since elements are chosen
arbitrarily, there is no way to estimate the probability of any
one element being included in the sample.
• Also, no assurance is given that each item has a chance of
being included, making it impossible either to estimate
sampling variability or to identify possible bias.
• Reliability cannot be measured in non-probability sampling;
the only way to address data quality is to compare some of
the survey results with available information about the
population.
• Still, there is no assurance that the estimates will meet an
acceptable level of error.
• Researchers are reluctant to use these methods because
there is no way to measure the precision of the resulting
sample.
84
• Despite these drawbacks, non-probability sampling
methods can be useful when descriptive comments
about the sample itself are desired.

• They are quick, inexpensive and convenient.

• There are also other circumstances in research when it is
infeasible or impractical to conduct probability sampling.

85
Most common types of non-probability sampling
1. Convenience or haphazard sampling
2. Volunteer sampling
3. Judgment sampling
4. Quota sampling
5. Snowball sampling technique
86
1. Convenience or haphazard sampling

• Convenience sampling is sometimes referred to as
haphazard or accidental sampling.
• It is not normally representative of the target population
because sample units are only selected if they can be
accessed easily and conveniently.
• The obvious advantage is that the method is easy to use,
but that advantage is greatly offset by the presence of
bias.
• Although useful applications of the technique are limited, it
can deliver accurate results when the population is
homogeneous.
87
• For example, a scientist could use this method
to determine whether a lake is polluted or not.

– Assuming that the lake water is well-mixed, any
sample would yield similar information.
– A scientist could safely draw water anywhere on the
lake without worrying about whether or not the
sample is representative.

88
2. Volunteer sampling
• As the term implies, this type of sampling occurs when
people volunteer to be involved in the study.
• In psychological experiments or pharmaceutical trials
(drug testing), for example, it would be difficult and
unethical to enlist random participants from the general
public.
• In these instances, the sample is taken from a group of
volunteers.
• Sometimes, the researcher offers payment to attract
respondents.

• In exchange, the volunteers accept the possibility of a
lengthy, demanding or sometimes unpleasant process.
89
• Sampling voluntary participants as opposed to the
general population may introduce strong biases.

• Often in opinion polling, only the people who care
strongly enough about the subject tend to respond.
• The silent majority does not typically respond,
resulting in large selection bias.

90
3. Judgment sampling
• This approach is used when a sample is taken
based on certain judgments about the overall
population.

• The underlying assumption is that the investigator
will select units that are characteristic of the
population.
• The critical issue here is objectivity: how much can
judgment be relied upon to arrive at a typical
sample?
91
• Judgment sampling is subject to the researcher's biases.
• Since any preconceptions the researcher may have are
reflected in the sample, large biases can be introduced if
these preconceptions are inaccurate.
• Researchers often use this method:
– in exploratory studies like pre-testing of questionnaires
and focus groups.
– in laboratory settings where the choice of experimental
subjects (i.e., animal, human) reflects the investigator's
pre-existing beliefs about the population.
• One advantage of judgment sampling is the reduced
cost and time involved in acquiring the sample.
92
4. Quota sampling
• Sampling is done until a specific number of units (quotas) for
various sub-populations have been selected.

• Since there are no rules as to how these quotas are to be
filled, quota sampling is really a means for satisfying sample
size objectives for certain sub-populations.
• Assumption: persons selected are similar to those not selected.
– Such strong assumptions are rarely valid.

93
Quota sampling:
• is generally less expensive than random sampling.
• is easy to administer, especially since the tasks of listing the
whole population, randomly selecting the sample and following up
on non-respondents can be omitted from the procedure.
• is an effective sampling method when information is urgently
required, and can be conducted without sampling frames.
• may be the only appropriate sampling method where the
population has no suitable frame.
• does not meet the basic requirement of randomness.
– Some units may have no chance of selection or the chance of
selection may be unknown. Therefore, the sample may be
biased.
94
5. Snowball sampling
• A special non-probability method used when the desired
sample characteristic is rare.
• It may be extremely difficult or cost-prohibitive to locate
respondents in these situations.
• A technique for selecting a research sample where existing
study subjects recruit future subjects from among their
acquaintances.
• Thus the sample group appears to grow like a rolling
snowball.
• This sampling technique is often used for hidden populations
which are difficult for researchers to access; example
populations would be drug users or commercial sex workers.
95
Snowball sampling

• Because sample members are not selected from a sampling
frame, snowball samples are subject to numerous biases.
• For example, people who have many friends are more likely
to be recruited into the sample.
• It dramatically lowers search costs, at the expense of
introducing bias.

96
Non-Probability Sampling: Inherent concerns related to
generalizability and representation
97
Variable
Variable: an attribute or characteristic which may take on
different values in different persons, places, ...
• Any aspect of an individual or object that is measured (e.g.,
weight, height) or recorded (e.g., age, gender) and can take
different values.
• There may be one or many variables in a study.
• Variables are often specified according to their type and
intended use, and hence classified as qualitative or
quantitative variables.

98
Types of Variable/Data

Variable/Data
• Categorical/Qualitative (defined categories or groups)
  E.g.: marital status, registered to vote?, region
• Numerical/Quantitative
  – Discrete (counted items)
    E.g.: number of children, defects per hour
  – Continuous (measured characteristics)
    E.g.: weight, height

There are different statistical methods for each type!
99


Levels of measurement
Quantitative data:
• Ratio data: differences between measurements, true zero exists.
  E.g.: height, age, BP
• Interval data: differences between measurements but no true zero.
  E.g.: temperature in °F
Qualitative data:
• Ordinal data: ordered categories (rankings, order, or scaling).
  E.g.: response to treatment
• Nominal data: categories (no ordering or direction).
  E.g.: ethnic group
100
Exercise 2
What type of variable is each of the following?
a) Region
b) Blood group
c) Health status: very sick, sick and cured
d) Age of an employee in a company
e) Student mark
f) No. of movies seen this summer
g) Income
h) Income class (poor, medium, rich)
i) Test result (negative, positive)
• Quantitative or categorical?
• Continuous or discrete?
• Nominal, ordinal, interval or ratio scale?
101
Assignment
2. Match each measurement scale with its permissible arithmetic
operations

Measurement scales    Arithmetic operations
Nominal               < or > operations
Ordinal               Only + and − of scale values
Interval              × and ÷ of scale values
Ratio                 Counting

102
Recap: Why is level of measurement important?
• Level of measurement helps to decide how to organize and
present the data
• Knowing the level of measurement helps to decide how to
interpret data
• Knowing the level of measurement helps to decide what type
of statistical analysis is appropriate

103
Stages in statistical investigation
1. Formulate the problem
2. Data
3. Organization
4. Presentation
5. Analysis of data
6. Interpretation
Descriptive statistics: organization and presentation
Inferential statistics: analysis of data and interpretation

104
Methods of Data Organization
and Presentation

105
Data Organization
• Data in raw form are usually not easy to use for decision
making
• Some type of organization is needed
– Table (frequency distribution)
– Graph
• The type of graph/frequency distribution to use depends
on the variable being summarized

106
Frequency Distributions (Tables)
• The actual summarization and organization of data starts
from the frequency distribution.
• Frequency distribution: a table which lists each of
the possible values that the data can assume along with the
number of times each value occurs.
• A frequency distribution summarizes data by condensing the
raw data into a more useful form.
• It allows for a quick visualization of the data.

107
Frequency Distributions (Tables)
• For nominal and ordinal data, frequency distributions are
often used as a summary.
• The percentage of times each value occurs, or the relative
frequency, is often listed.
• Tables make it easier to see how the data are distributed.
108
 Select a set of continuous, non-overlapping intervals such
that each value can be placed in one, and only one, of the
intervals.
 The first consideration is how many intervals to include
 A common rule of thumb states that there should be no fewer
than six intervals and no more than 15.
110
To determine the number of class intervals and the
corresponding width, we may use:

Sturges' rule: $K = 1 + 3.322 \log_{10} n$

Interval width: $w = \dfrac{\text{largest value} - \text{smallest value}}{\text{number of desired intervals}} = \dfrac{L - S}{K}$

where
K = number of class intervals
n = number of observations
w = width of the class interval
L = the largest value
S = the smallest value

111
Example: A manufacturer of insulation randomly selects 20 winter
days and records the daily high temperature:
24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27

Solution:
1. Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
2. Find range: 58-12 = 46
3. Select number of classes (K): K = 1 + 3.322 log(20) = 5.32 ≈ 5
4. Compute interval width: 46/5 = 9.2, rounded up to 10
5. Determine interval boundaries: 10 but less than 20, 20 but
less than 30, . . . , 50 but less than 60
6. Count observations & assign to classes

112
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
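
As a rough check of the counting step, the temperature data can be grouped with a few lines of Python. The helper names `sturges_classes` and `grouped_frequency` are hypothetical. Starting the first class at 10 and using a width of 10 follow the worked solution, and each class is treated as "lower bound but less than upper bound".

```python
import math

def sturges_classes(n):
    """Number of classes K = 1 + 3.322 * log10(n), rounded to an integer."""
    return round(1 + 3.322 * math.log10(n))

def grouped_frequency(data, start, width, k):
    """Count the values falling in each class [lower, lower + width)."""
    table = []
    for i in range(k):
        lower = start + i * width
        upper = lower + width
        count = sum(lower <= x < upper for x in data)
        table.append((lower, upper, count))
    return table

temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

k = sturges_classes(len(temps))             # Sturges' rule for n = 20
raw_width = (max(temps) - min(temps)) / k   # range divided by K
width = 10                                  # assumption: rounded up to a convenient width
table = grouped_frequency(temps, start=10, width=width, k=k)

for lower, upper, count in table:
    print(f"{lower} but less than {upper}: {count}")
```

With these choices every observation falls in exactly one class, and the class frequencies sum to the 20 recorded days.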

113
Exercise 3
Construct a grouped data frequency distribution for Leisure
time (hours) per week for 40 college students:

23 24 18 14 20 36 24 26 23 21 16 15 19 20 22 14 13 10 19 27

29 22 38 28 34 32 23 19 21 31 16 28 19 18 12 27 15 21 25 16

114
Cumulative frequency: obtained when the frequencies of two or more
classes are added.
Cumulative relative frequency: the percentage of the total number of
observations that have a value either in that interval or below it.

K = 1 + 3.322 log(40) = 6.32 ≈ 6
Maximum = 38, Minimum = 10
Width = (38 − 10)/6 = 4.67 ≈ 5

Time (Hours)   Frequency   Relative Frequency   Cumulative Relative Frequency
10-14          5           0.125                0.125
15-19          11          0.275                0.400
20-24          12          0.300                0.700
25-29          7           0.175                0.875
30-34          3           0.075                0.950
35-39          2           0.050                1.00
Total          40          1.00
115
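
The table can be reproduced with a short Python sketch, assuming the six inclusive classes 10-14 through 35-39 shown above. The variable names are illustrative only.

```python
from itertools import accumulate

hours = [23, 24, 18, 14, 20, 36, 24, 26, 23, 21,
         16, 15, 19, 20, 22, 14, 13, 10, 19, 27,
         29, 22, 38, 28, 34, 32, 23, 19, 21, 31,
         16, 28, 19, 18, 12, 27, 15, 21, 25, 16]

# Inclusive class limits, as in the table.
classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]

freq = [sum(lo <= h <= hi for h in hours) for lo, hi in classes]
rel_freq = [f / len(hours) for f in freq]
cum_rel_freq = list(accumulate(rel_freq))   # running total of relative frequencies

for (lo, hi), f, r, c in zip(classes, freq, rel_freq, cum_rel_freq):
    print(f"{lo}-{hi}: {f:2d}  {r:.3f}  {c:.3f}")
```

The last cumulative relative frequency should equal 1, which is a quick way to confirm that no observation was missed.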
Graphical Presentation of Data

116
Importance of diagrammatic representation
• Diagrams are more attractive than mere figures
• They give a quick overall impression of the data
• They are easier to remember than mere figures
• They facilitate comparison
• They are used to understand patterns and trends
• Well-designed graphs can be a powerful means of
communicating a great deal of information
• When graphs are poorly designed, they not only
convey the message ineffectively, but are often
misleading.

117
Specific types of graphs include:

Categorical variables:
• Frequency distribution
• Bar chart
• Pie chart
• Pareto diagram

Numerical variables:
• Line chart
• Frequency distribution
• Histogram and ogive
• Stem-and-leaf display
• Scatter plot

118
• Bar charts and pie charts are often used for qualitative
(category) data

Hospital Patients by Unit

Hospital Unit     Number of Patients
Cardiac Care      1,052
Emergency         2,245
Intensive Care    340
Maternity         552
Surgery           4,630

[Figure: bar chart of number of patients per year by hospital unit]
119
Data presentation using
Histogram

120
Scatter Plot
• Scatterplots (for quantitative variables) plot the response
variable on the vertical axis and the explanatory variable on the
horizontal axis.

Descriptive Statistics: Numerical Summary Measures
 Single numbers that quantify the characteristics of a
distribution of values
 Measures of central tendency (location)
 Measures of dispersion

122
Describing Data Numerically
• Central tendency: arithmetic mean, median, mode
• Variation: range, interquartile range, variance, standard
deviation, coefficient of variation

123
Measures of Central Tendency: Overview
• Mean: arithmetic average, $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
• Median: midpoint of ranked values
• Mode: most frequently observed value

124
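Arithmetic Mean
• The arithmetic mean (mean) is the most common
measure of central tendency
• For a population of N values:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$
where the $x_i$ are the population values and N is the population size.
• For a sample of size n:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
where the $x_i$ are the observed values.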
125
Properties of Arithmetic Mean
 The most common measure of central tendency
 Easy to calculate and understand (simple).
 For a given set of data there is one and only one
arithmetic mean (uniqueness).
 Influenced by each and every value in a data set
 Greatly affected by the extreme values (outliers).

126
Weighted Mean
• The weighted mean is a special type of arithmetic mean, used
when each value has its own weight.
• Some of the observations in a data set may have greater
importance.
• E.g., the final exam in a course is given more weight than
the mid-exam and tests.
• Let $w_1, w_2, \ldots, w_n$ be the weights assigned to observations
$x_1, x_2, \ldots, x_n$ respectively; then the weighted mean is defined as:
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}$$
127
Weighted Mean
Example: An entrance exam for a job consists of 25% English,
50% Mathematics, 5% Typing and 20% Accounting. If an
applicant who took the entrance exam scored 48% in English,
35% in Mathematics, 80% in Typing and 50% in Accounting,
his average score is:
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \frac{0.25(48) + 0.50(35) + 0.05(80) + 0.20(50)}{0.25 + 0.50 + 0.05 + 0.20} = 43.5$$

Exercise: A teacher attaches weights of 2 to homework, 3 to the
mid-term exam and 5 to the final exam. If a student scores 90, 50 and
60 for HW, MT and FE, respectively, what is his/her average
academic performance? Ans: 63
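
Both weighted-mean calculations can be checked with a small Python sketch. The function `weighted_mean` is a hypothetical helper written for this illustration, not a library routine.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Entrance exam: English, Mathematics, Typing, Accounting.
exam_score = weighted_mean([48, 35, 80, 50], [0.25, 0.50, 0.05, 0.20])

# Exercise: homework, mid-term exam, final exam with weights 2, 3, 5.
course_score = weighted_mean([90, 50, 60], [2, 3, 5])

print(round(exam_score, 1), round(course_score, 1))
```

Because the result divides by the sum of the weights, the weights do not need to add up to 1.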
128
Median
• In an ordered list, the median is the “middle” number (50%
above, 50% below)

[Figure: two dot plots on a 0 to 10 axis, each with median = 3]

• Not affected by extreme values
• The location of the median:
$$\text{Median position} = \frac{n+1}{2} \text{ position in the ordered data}$$
129
Median
• If the number of values is odd, the median is the middle
number
• If the number of values is even, the median is the average of
the two middle numbers

Note that $\frac{n+1}{2}$ is not the value of the median, only the
position of the median in the ranked data
130
131
Properties of median
 There is only one median for a given set of data
(uniqueness)
 The median is easy to calculate
 Median is a positional average and hence it is insensitive
to very large or very small values
 Median can be calculated even in the case of open end
intervals
 It is determined mainly by the middle points and less
sensitive to the remaining data points (weakness).

132
Mode

 Value that occurs most often

 Not affected by extreme values

 Used for either numerical or categorical data

 It is possible to have more than one mode or no mode

[Figure: two dot plots on a 0 to 14 axis, one with mode = 9 and one with no mode]

133
Mode
 Examples: Compute mode for the following data
sets:
1. 1, 2, 3, 4, 4, 4, 4, 5, 5, 6
• Mode is 4 “Unimodal”
2. 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8
• There are two modes – 2 & 5
• This distribution is said to be “bi-modal”
3. 2.62, 2.75, 2.76, 2.86, 3.05, 3.12
• No mode, since all the values are different
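
A minimal Python sketch of the same three cases follows. The function `modes` is hypothetical, and it adopts the convention used above that a data set in which every value occurs once has no mode.

```python
from collections import Counter

def modes(data):
    """Return all most-frequent values, or [] if every value occurs once."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []   # convention assumed here: all values different means no mode
    return sorted(value for value, count in counts.items() if count == top)

print(modes([1, 2, 3, 4, 4, 4, 4, 5, 5, 6]))          # unimodal
print(modes([1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8]))    # bi-modal
print(modes([2.62, 2.75, 2.76, 2.86, 3.05, 3.12]))    # no mode
```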

135
Class Exercise
Annual per capita carbon dioxide emissions (metric tons) for n = 8 largest
nations in population size

Bangladesh 0.3, Brazil 1.8, China 2.3, India 1.2, Indonesia 1.4,
Pakistan 0.7, Russia 9.9, U.S. 20.1

Compute the mean and median of the carbon dioxide emissions data.
Which one is the best measure of central tendency? Why?

Ordered sample: 0.3, 0.7, 1.2, 1.4, 1.8, 2.3, 9.9, 20.1
Median = (1.4 + 1.8)/2 = 1.6
Mean = (0.3 + 0.7 + 1.2 + … + 20.1)/8 = 4.7
Mean sensitive to “outliers” (median often preferred for highly skewed
distributions)
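
The two summary values can be confirmed with Python's standard `statistics` module. The dictionary below simply restates the data from the exercise.

```python
from statistics import mean, median

emissions = {
    "Bangladesh": 0.3, "Brazil": 1.8, "China": 2.3, "India": 1.2,
    "Indonesia": 1.4, "Pakistan": 0.7, "Russia": 9.9, "U.S.": 20.1,
}

values = sorted(emissions.values())
co2_mean = mean(values)       # pulled upward by the two largest values
co2_median = median(values)   # average of the 4th and 5th ordered values

print(f"mean = {co2_mean:.1f}, median = {co2_median:.1f}")
```

The mean is roughly three times the median here, which illustrates why the median is often preferred for highly skewed data.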
137
Exercise

Sample Data (xi) : 10 12 14 15 17 18 18 24

Compute the mean and median of this sample data

138
Measures of Dispersion
• Measures that quantify the variation or dispersion of a set of
data from its central location
• Dispersion refers to the variety exhibited by the values of the
data.
• Measures of variation give information on the spread or
variability of the data values.
• The amount of dispersion is small when the values are close
together.
• If all the values are the same, there is no dispersion.

140
Why measures of dispersion?
• The measures of dispersion are helpful in statistical
investigation
• Some of the main objectives of measuring dispersion are:
– To determine the reliability of an average
– To compare the variability of two or more series
– To facilitate the use of other statistical measures: dispersion
serves as the basis of many other statistical measures such as
correlation, regression, testing of hypotheses, etc.

141
Comparing standard deviation

[Figure: dot plots of three data sets on an 11 to 21 axis, each with mean = 15.5]

Data A: mean = 15.5, s = 3.338
Data B: mean = 15.5, s = 0.926
Data C: mean = 15.5, s = 4.570
143
Measures of Dispersion
Variation: range, interquartile range, variance, standard deviation,
coefficient of variation

• Two or more sets may have the same mean and/or median but they
may be quite different (same center, different variation)
144
Range (R)
• Simplest measure of variation
• Difference between the largest and the smallest observations
in a sample
• Range = Maximum value – Minimum value
• Example –Compute range of the following dataset:
Data values: 5, 9, 12, 16, 23, 34, 37, 42
– Range = 42-5 = 37

• A data set with a higher range exhibits more variability

145
Properties of range
• It is the simplest, crude measure and can be easily
understood
• It takes into account only two values, which makes it a
poor measure of dispersion
• The larger the sample size, the larger the range tends to be
• Very sensitive to extreme observations

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
146
Variance (σ², s²)
• First used by Karl Pearson in 1893
• Average (approximately) of squared deviations of values from
the mean
• Variance is used to measure the dispersion of values relative
to the mean.
– Population variance = σ²
– Sample variance = s²
• When values are close to their mean (narrow range) the
dispersion is less than when there is scattering over a wide
range.

147
Variance (σ², s²)
Sample variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
where
$\bar{x}$ = arithmetic mean
n = sample size
$x_i$ = ith value of the variable X

Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
where $\mu = \dfrac{\sum_{i=1}^{N} X_i}{N}$ is the population mean.

148
Standard deviation (σ, s)
• It is the square root of the variance
$$\sigma = \sqrt{\sigma^2} \quad \text{and} \quad s = \sqrt{s^2}$$
• This produces a measure having the same scale as that of
the individual values.

149
Sample Standard Deviation computation
Following are the survival times of 11 patients after heart
transplant surgery. Calculate their sample variance and SD.

150
Exercise 3

Sample Data (xi) : 10 12 14 15 17 18 18 24

Compute the sample standard deviation (SD)

151
Sample Standard Deviation computation
Sample Data (xi) : 10 12 14 15 17 18 18 24

n = 8, mean = $\bar{x}$ = 16

$$s = \sqrt{\frac{(10 - \bar{x})^2 + (12 - \bar{x})^2 + (14 - \bar{x})^2 + \cdots + (24 - \bar{x})^2}{n - 1}}$$
$$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}}$$
$$= \sqrt{\frac{130}{7}} = 4.3095$$
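
The arithmetic can be verified step by step in Python. The sketch below computes the sum of squared deviations directly and then compares the result with `statistics.stdev`, which uses the same n − 1 divisor.

```python
import math
from statistics import mean, stdev

data = [10, 12, 14, 15, 17, 18, 18, 24]

x_bar = mean(data)                                   # sample mean
squared_devs = [(x - x_bar) ** 2 for x in data]      # 36, 16, 4, 1, 1, 4, 4, 64
sum_sq = sum(squared_devs)                           # sum of squared deviations
sample_var = sum_sq / (len(data) - 1)                # divide by n - 1
sample_sd = math.sqrt(sample_var)

print(x_bar, sum_sq, round(sample_sd, 4))
print(round(stdev(data), 4))                         # library value for comparison
```

For this data set the squared deviations add up to 130, so the sample standard deviation is the square root of 130/7, about 4.31.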
152
Properties of SD
• SD is considered to be the best measure of dispersion and is
used widely because of the properties of the theoretical
normal curve
• Each value in the data set is used in the calculation
• Values far from the mean are given extra weight
(because deviations from the mean are squared)
• The SD has the advantage of being expressed in the same
units of measurement as the mean
• However, if the units of measurement of the variables in two
data sets are not the same, then their variability can't be
compared by comparing the values of SD.

153
Coefficient of variation (CV)
• When two data sets have different units of measurement, or
their means differ sufficiently in size, the CV should be used
as a measure of dispersion
• It is the best measure to compare the variability of two series
or sets of observations
• Can be used to compare two or more sets of data measured
in different units
• Measures relative variation
• Shows variation relative to the mean:
$$CV = \left(\frac{s}{\bar{x}}\right) \times 100\%$$
• Data with a smaller coefficient of variation are considered more
consistent (less dispersed)
154
Comparing CV
Stock A:
Average price last year = $50
Standard deviation = $5
$$CV_A = \left(\frac{s}{\bar{x}}\right) \times 100\% = \frac{\$5}{\$50} \times 100\% = 10\%$$

Stock B:
Average price last year = $100
Standard deviation = $5
$$CV_B = \left(\frac{s}{\bar{x}}\right) \times 100\% = \frac{\$5}{\$100} \times 100\% = 5\%$$

Both stocks have the same standard deviation, but stock B is
less variable relative to its price.
155
Comparing CV

SD Mean
SBP 15mm 130mm
Cholesterol 40mg/dl 200mg/dl

Compare CV of SBP and Cholesterol level

156
Comparing CV

SD Mean CV (%)
SBP 15mm 130mm 11.5
Cholesterol 40mg/dl 200mg/dl 20.0

“Cholesterol is more variable than systolic blood pressure”
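
The comparison can be reproduced with a one-line function. The name `coefficient_of_variation` is hypothetical; the inputs are the SD and mean values from the table.

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: (s / x-bar) * 100."""
    return 100 * sd / mean

cv_sbp = coefficient_of_variation(sd=15, mean=130)           # systolic blood pressure
cv_cholesterol = coefficient_of_variation(sd=40, mean=200)   # cholesterol

print(f"SBP: {cv_sbp:.1f}%, cholesterol: {cv_cholesterol:.1f}%")
```

Because the CV is unit-free, the two variables can be compared even though they are measured in different units.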

157
Standard Score
• A standard score for a sample value in a data set is obtained
by subtracting the mean of the data set from the value and dividing
the result by the standard deviation of the data set.
$$Z = \frac{X - \bar{X}}{S}$$
• Basically, the standard score (z-score) tells us how many
standard deviations a specific value is above or below the
mean value of the data set.
• i.e. the z-score is the number of standard deviations the data
value falls above (positive z-score) or below (negative
z-score) the mean for the data set.
158
Standard Score
• Ex. Suppose a student scored 65% in a statistics test and 70%
in a mathematics test. In which subject did he perform better?
• To answer this question, we need to compare the score of the
student with the average score of all students who sat for these
exams (simply comparing 65 and 70 may lead to a wrong conclusion).
We can compute and compare their Z score values.

Exercise: what is the Z-score for the value of 14 in the following
sample data set?

3 8 6 14 4 12 7 10

159
Standard Score
Exercise: what is the Z-score for the value of 14 in the following
sample data set?
3 8 6 14 4 12 7 10

Mean = 8, SD = 3.8173. Thus
$$Z = \frac{X - \bar{X}}{S} = \frac{14 - 8}{3.8173} = 1.57$$
The data value of 14 is located 1.57 standard deviations above
the mean of 8, as the positive z-score indicates.
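
The same z-score can be computed in Python with the standard `statistics` module. `stdev` is the sample standard deviation (n − 1 divisor), which matches the SD quoted above.

```python
from statistics import mean, stdev

data = [3, 8, 6, 14, 4, 12, 7, 10]
value = 14

x_bar = mean(data)    # sample mean
s = stdev(data)       # sample standard deviation (n - 1 divisor)
z = (value - x_bar) / s

print(f"mean = {x_bar}, SD = {s:.4f}, z = {z:.2f}")
```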

160
General Shape of Distributions
Histograms and box plots can be quite useful in suggesting
the shape of a probability distribution.

For a distribution that is symmetric, approximately half of the data
values lie to the left of the mean, and approximately half of the
data values lie to the right of the mean.

A symmetric, uni-modal, bell-shaped distribution is called a Normal
distribution.

E.g.: weight, height, IQ, etc.
161
General Shape of Distributions
For a distribution that is skewed right (positively Skewed), the bulk
of the data values (including the median) lie to the left of the
mean, and there is a long tail on the right side.

Eg: Annual income


162
General Shape of Distributions
For a distribution that is skewed left (negatively Skewed), the
bulk of the data values (including the median) lie to the right of the
mean, and there is a long tail on the left side.

Eg: score on easy exam


163
Identifying Outliers
Example: Annual per capita carbon dioxide emissions (metric
tons) for n = 8 largest nations in population size

Bangladesh 0.3, Brazil 1.8, China 2.3, India 1.2, Indonesia 1.4,
Pakistan 0.7, Russia 9.9, U.S. 20.1. Compute a measure of
central value.
Ordered sample: 0.3, 0.7, 1.2, 1.4, 1.8, 2.3, 9.9, 20.1

Median = (1.4 + 1.8)/2 = 1.6
Mean = (0.3 + 0.7 + 1.2 + … + 20.1)/8 = 4.7
The mean is sensitive to “outliers” (the median is often preferred
for highly skewed distributions)
164
Identifying Outliers
Outliers are observations that are far from the center of the
distribution.
Box plots have a box from LQ to UQ, with the median marked. They
portray a five-number summary of the data:
minimum, LQ, median, UQ, maximum, except for outliers, which are
identified separately.

[Figure: box plot showing 1 outlier]

165
Identifying Outliers

Outlier = observation falling
• below Q1 − 1.5(IQR), or
• above Q3 + 1.5(IQR), where IQR = Q3 − Q1, or
• with a z-score Zj above 3 or below −3

Ex. If Q1 = 2, Q3 = 10, then IQR = 8 and outliers are values above
10 + 1.5(8) = 22 or below 2 − 1.5(8) = −10
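
The 1.5 × IQR rule can be sketched in Python as follows. The helper names `iqr_fences` and `is_outlier` are hypothetical, and the quartiles are taken as given rather than estimated from data, since quartile conventions differ between textbooks and software.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(x, q1, q3, k=1.5):
    """True if x lies strictly outside the fences."""
    lower, upper = iqr_fences(q1, q3, k)
    return x < lower or x > upper

lower, upper = iqr_fences(q1=2, q3=10)
print(lower, upper)
print(is_outlier(25, q1=2, q3=10), is_outlier(22, q1=2, q3=10))
```

With Q1 = 2 and Q3 = 10, a value of 25 is flagged while a value exactly on the upper fence (22) is not.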

166
Remedial Measures for Outliers
If outliers exist, their potentially large squared errors may have a
strong influence on the fitted model (regression line)
Be sure to examine your data graphically for outliers and
extreme points
Decide, based on your model and logic, whether the extreme
points should remain or be treated differently.

Outliers
1. Check if these are simply incorrectly recorded data.
2. Fit the model with and without the outlier.
Do the results change much?
If not, report the results including the outlier, but note that it is present.
If results do change substantially, report both.
3. Use a robust estimation procedure.
167