Uploaded by daniel

statistics

EMAIL: birhanu.teshome@aau.edu.et

Institute of Tax and Customs Administration

Oct. 2017

Course Contents

Chapter I: Basic concepts in statistics (overview)

– What is Statistics?

– Methods of data collection

– Some Basic Terms in Statistics

– Sampling Techniques

– Criteria for the acceptability of a sampling method

Chapter II: Classification and presentation of statistical data

(overview)

– Scales of measurement and types of classification of variables

– Grouped frequency distribution

– Graphical presentation of data

Chapter III: Measures of central tendency and dispersion

– Measures of central tendency

– Measures of dispersion

– General shape of distributions

Chapter IV: Estimation of sample size

– Sample size determination with continuous data

– Sample size determination for proportions

Chapter V: Introduction to tests of hypotheses

– Review of hypothesis testing

– Data analysis

– Parametric and non-parametric statistics (tests)

variances and proportions

– Independent samples tests: tests of equality of two population

variances, tests concerning the difference between two means, tests of

mean difference between several populations and analysis of variance

(ANOVA)

– Paired-samples t-test (differences between dependent groups)

– Hypothesis test for the difference between two proportions

Chapter VII: Non-parametric tests

– Tests concerning independent samples (groups): Kruskal-Wallis test,

Mann–Whitney U test

– Tests concerning dependent samples (groups): Friedman test on

related ordinal measures, Wilcoxon signed-rank test, Cochran test on

related binary measures, McNemar test

– The Chi-square test – test of association between categorical variables

– The Pearson coefficient of correlation and test of its significance

– The Spearman rank correlation coefficient

Chapter IX: The simple linear regression model

– What is a regression model?

– Simple linear regression

– Precision and standard errors

– Inference concerning regression coefficients

Chapter X: The multiple linear regression model

– Introduction

– The coefficient of determination

– Test of model adequacy: Analysis of variance (ANOVA)

– Tests on regression coefficients

Chapter X: The Logistic regression model

– Introduction

– Binary logistic regression

– Tests on regression coefficients

Chapter XI: Overview of scientific research

– Research problem

– Review of literature

– Research design

– Sample design

– Data analysis

– Discussion of results, interpretation and conclusions

Teaching methods and further info.

• Teaching methods: Lectures, group project and assignments

• Software: Mainly SPSS

• Assessment: test (20%), group project, presentations (40%)

and a final written exam (40%).

• Attendance: Regular attendance in this course is expected.

• Participation: Active participation is expected.

• Electronic communication: I will occasionally send lecture

notes, assignments and reading materials through e-mail, so

please check your account regularly.

Introduction

Definition and classifications of statistics:

• In the plural sense: statistics are the raw data themselves,

like statistics of births, statistics of deaths, statistics of

students, statistics of imports and exports, etc.

• In the singular sense: statistics is the field of study that

deals with the collection, organization, presentation,

analysis and interpretation of numerical data.

Applications, Uses & Limitations of

statistics

Applications of statistics:

In planning, policy making, marketing decisions, in industries

especially in quality control area, health related studies,…

obtaining numerical facts e.g. a taxi price.

extent of environmental pollution

literature.

Uses of statistics

Some uses of statistics:

It presents facts in a definite and precise form

Data reduction

Measures the magnitude of variations in data

Furnishes a technique of comparison

Estimating unknown population characteristics

Testing and formulating of hypothesis

Studying the relationship between two or more variables

Forecasting future events

Limitations of statistics

As a science statistics has its own limitations. Some of its

limitations:

Deals with only quantitative information

data items

correct

used by experts.

Misuses of statistics

Many people, knowingly or unknowingly, use data in the wrong manner:

• Unrepresentative/ Inadequate sample

• Unfair comparison

• Unwarranted conclusion:

– may be as a result of making false assumptions.

– may be the use of wrong average. Eg: Assume monthly incomes

of 1,000,000 and 1,000. The use of an arithmetic average in such

a case may give a wrong idea.

• Suppression of unfavorable results: hiding unfavorable, though

true, facts emerging from statistical study

• Use of inefficient statistical models or mistake in arithmetic
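
The "wrong average" misuse above can be illustrated with a small sketch. The income list is an assumption made for illustration only (one earner at 1,000,000 and nine at 1,000); the slide gives only the two values.

```python
from statistics import mean, median

# Hypothetical monthly incomes (assumed for illustration):
# one very high earner and nine low earners.
incomes = [1_000_000] + [1_000] * 9

avg = mean(incomes)    # arithmetic average, pulled up by the single large value
mid = median(incomes)  # middle value, unaffected by the extreme income

print("mean:", avg)
print("median:", mid)
```

Under these assumed figures the mean is about a hundred times the median, so reporting the mean alone would give a misleading picture of a typical income.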

Classification of statistics

Based on the usage of statistical data, statistics is divided broadly into two mutually exclusive groups:

Descriptive statistics:

• Used to describe the basic features of the data in a study

• Provide simple summaries about the sample and the measures

• Ways of organizing and summarizing data

• Helps to identify the general features and trends in a set of data

and extracting useful information

• Also very important in conveying the final results of a study

Descriptive Statistics

Collect data

e.g., Survey

Present data

e.g., Tables and graphs

Summarize data

e.g., Sample mean = $\bar{X} = \frac{\sum X_i}{n}$

Inferential statistics

• is a method used to generalize from a sample to a population

• Eg, the average income of all families (the population) in

Ethiopia can be estimated from figures obtained from a few

thousands (the sample) families

• It is important because statistical data usually arises from

sample.

• Statistical techniques based on probability theory are required; they are used to transform information into knowledge

Inferential Statistics

Estimation

e.g: Estimate the population mean

using the sample mean

Confidence interval

Hypothesis testing

e.g., Test the claim that the population

mean weight is 56 kg.

comparison of two or more means or

proportions

making decisions about a population based on

sample results

Descriptive Vs Inferential

Classify the following sentences as belonging to the area of

descriptive statistics or inferential statistics.

As a result of recent cutbacks by oil-producing nations, we

expect the price of gasoline to double in the next year.

due to terrorism.

at least 90% based on the statistics that 85% of his

seniors passed the course last year.

a clinic Y, 75% later developed significant side effect.

Key Definitions

• A population is the collection of all items

of interest or under investigation

N represents the population size

population

• n represents the sample size

[Figure: a sample is drawn from the population; the role of statistics is to use information from the sample to make inferences about the population]

Generalizability

If the sample is not representative

of the population, the conclusions

will be restricted to the sample &

could not be generalized to the

target population!

Key Definitions

Target population: A collection of items that have something in

common for which we wish to draw conclusions at a particular

time. E.g., All financial offices in Ethiopia

• Defining the target population is an important and often difficult part

of the study. For example, in a political poll, should the target population

be all adults eligible to vote? All registered voters? All persons who

voted in the last election?

• The choice of target population will profoundly affect the statistics

that result.

Study (Sampled) Population: The subset of the target population

that has at least some chance of being sampled

and data are collected

Example: In a study of the

prevalence of HIV among

adolescents in Ethiopia, a

random sample of adolescents in

Bole KK of AA were included.

Target population: All adolescents in Ethiopia

Study population: All adolescents in Addis Ababa

Sample: Adolescents in Bole KK who were included in the study

Parameter and Statistic

Parameter:

– Values calculated using population data

– E.g., the mean (µ) age of the target population

Statistic:

– Values computed from sample data

– E.g., sample mean age ($\bar{x}$)

Stages in statistical investigation

Interpretation

Inferential Statistics

Analysis of Data

Presentation

Descriptive Statistics

Organization

Problem

Formulating the problem

• Research begins with a problem/problems

The problem need not be Earth-shaking

research

• Sources of research problem:

– Observation, Literature reviews, Professional conferences, Experts.

Advance knowledge

Research Design

The first stage in any statistical investigation should be to:

Get a clear understanding of the physical background to

the situation under study

We can not study all subjects (e.g. all pregnant women) living in

a given geographical area

Sampling techniques

Sample size calculation, Study design

Method of data collection

Etc.

Methods of Data Collection

Data are facts or figures from which conclusions can be drawn.

In order to draw valid conclusions, it is important to have ‘good’ data.

Data are gathered with the aim of meeting predetermined objectives.

The data form the foundation of statistical analyses and hence must be carefully and accurately collected.

Can be obtained from:

Routinely kept records, literature, Surveys, Experiments,

Reports, Observation, etc.

Who needs info?

Governments, businesses, organizations and individuals all need information for their day-to-day activities

Types of Data

Primary data: collected from the items or individual respondents

directly by the researcher for the purpose of a study.

you collect the data yourself

the data you collect is unique to you and your research and,

until you publish, no one else has access to it

Methods of collecting primary data: interviews, questionnaires,

observation (measurement) and diaries

Secondary data: data that have already been collected by someone else or by an organization (e.g., researchers, institutions, other NGOs,…)

Some sources: official statistics, scholarly journals, reference books, research institutes, universities, libraries, library search engines, computerized databases and the World Wide Web.

Method of primary data collection

Questionnaire: a popular means of data collection

written questions are mailed or hand-delivered to respondents

is difficult to design and often requires many rewrites before an acceptable questionnaire is produced.

Advantages:

Can be used as a method in its own right or as a basis for

interviewing or a telephone survey.

Relatively cheap

Can be posted, e-mailed or faxed

Can cover wide geographic area, a large number of people or

organizations

Avoids embarrassment on the part of the respondent.

Possible anonymity of respondent.

No interviewer bias

Method of primary data collection:

Questionnaire

Disadvantages:

help)

• Assumes no literacy problems

• No control over who completes it

• Not possible to give assistance if required

• Time delay whilst waiting for responses to be returned

• Several reminders may be required.

• Respondent can read all questions beforehand and then

decide whether to complete or not. E.g: it is too long, too

complex, uninteresting, or too personal.

Primary data collection: Interviewing

is primarily used to gain an understanding of the underlying

reasons & motivations for people’s attitudes…

Interviews can be undertaken on a personal one-to-one basis

or in a group.

can be conducted at work, at home, on the street, in a

shopping center, or some other agreed location.

Advantages:

Serious approach by respondent resulting in accurate info.

Good response rate, completed and immediate.

Possible in-depth questions.

Interviewer in control and can give help if there is a problem.

Can investigate motives and feelings.

Can use recording equipment.

If one interviewer is used, uniformity of approach.

Primary data collection: Interviewing

Disadvantages:

• Time consuming.

• Geographic limitations.

• Can be expensive.

• Need to set up interviews.

• Normally need a set of questions.

• Respondent bias– tendency to please or impress, create false

personal image, or end interview quickly.

• Embarrassment possible if personal questions.

• Transcription and analysis can present problems– subjectivity.

• If many interviewers, training is required!

Secondary data collection

Advantages

• Saves time and money

• Avoids data collection problems

• Provides a basis for comparison

Disadvantages

• Quality of documentation

• Data quality control

• Level of observation

• Data availability

• Outdated data

Sampling Techniques

Why Sample?

Researchers often use sample survey methodology to

obtain information about a larger population by selecting

and measuring a sample from that population.

high precision based on samples.

Why Census?

Census: is a complete enumeration of the entire population

E.g., population and housing censuses conducted in Ethiopia: 1984, 1994 and 2007.

a sample. Some of the reasons include:

Universality

Detailedness

Representativeness

• Due to the variability in the characteristics of the population,

scientific sample designs should be applied to select a

representative sample.

population.

population by directly observing a portion of the population.

what can be learned from the sample—and how this

information can be applied to the entire population.

Sample Information

Population

It is essential that a sample should be correctly defined and

organized.

If the wrong questions are posed to the wrong people, reliable information will not be obtained, and this will lead to wrong conclusions when the results are applied to the entire population.

Steps needed to select a sample and ensure

that this sample will fulfill its goals

– The first step in planning a useful and efficient survey is to

specify the objectives with as much detail as possible.

success.

valuable results.

at this stage.

2. Define the target population

– the population of interest to whom the researchers would like to

make generalizations.

– is the total population for which the information is required.

• Sampling population: the subset of the target population from

which a sample will be drawn.

• Study population: the actual group in which the study is

conducted = Sample

• Study unit: the units on which information will be collected:

persons, households, etc.

Researchers are interested to know about factors associated

with ART use among HIV/AIDS patients attending certain

hospitals in a given Region

patients in the Region

ART patients in, e.g. 3,

hospitals in the Region

Sample

2. Define the target population

part of the study.

– For example, in a political poll, should the target population be all adults

eligible to vote? All registered voters? All persons who voted in the last

election?

statistics that result.

The target population is defined by the following characteristics:

• Nature of data required

• Geographic location

• Reference period

• Other characteristics, such as socio-demographic

characteristics

3. Decide on the data to be collected

established.

sound, the necessary data terms and definitions also

need to be determined.

4. Set the level of precision

There is a level of uncertainty associated with estimates

coming from a sample.

specifying the acceptable margin of error and the confidence

level

a particular sampling plan, and try to minimize it.

Sample-to-sample variation causes sampling error
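
As a rough illustration of how a margin of error and a confidence level fit together, the sketch below uses the usual normal-approximation formula for a sample proportion from a simple random sample. The function name and the values p = 0.5 and n = 400 are assumptions chosen for illustration, not figures from the course.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(p, n, confidence=0.95):
    """Approximate margin of error for a sample proportion p
    from a simple random sample of size n (normal approximation)."""
    # Two-sided critical value, roughly 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p * (1 - p) / n)

moe = margin_of_error(p=0.5, n=400)
moe_larger_sample = margin_of_error(p=0.5, n=1600)

print(round(moe, 3))
print(round(moe_larger_sample, 3))
```

Quadrupling the sample size halves the margin of error in this approximation, which is one reason the precision target is fixed before the sample size is calculated.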

5. Decide on the methods of measurement

Choose measuring instrument and method of approach to the

population

statements that he/she makes or from a medical examination

interviewing

6. Preparing Frame

List of all members of the population from which the sample

will be taken

The sample design

Sample design: how the sample will be collected.

Estimation techniques: how the results from the sample will

be extended to the whole population.

Measures of precision: how the sampling error will be

measured.

Other Considerations

• Sample size determination

• Questionnaire development

• Pretest

• Organization of the field work

• Data collection, Data entry

• Summary and analysis of the data (edit the completed questionnaires, decide on computation procedures)

Sampling

• Sampling: The process of selecting a portion of the

population to represent the entire population.

– Ensure that the sample represents the population, and

we want to draw a sample?

How will these people be selected?

Advantages of sampling:

• Feasibility: Sampling may be the only feasible method of

collecting information.

• Reduced cost: Sampling reduces demands on resources such as finance, personnel and material.

• Greater speed: Data can be collected and summarized

more quickly

• Greater accuracy: Sampling may lead to better accuracy of

collecting data

• Sampling error: Precise allowance can be made for

sampling error

Disadvantages of sampling:

• There is always a sampling error.

• Sampling may create a feeling of

discrimination within the population.

Errors in sampling

1) Sampling error: errors caused by the act of taking a sample.

They cause sample results to be different than results of a

census.

– They cannot be avoided or totally eliminated.

– Can be controlled by selecting a “large” sample

• Random sampling error – deviation between the sample

statistic and the population parameter caused by chance in

selecting a random sample. The margin of error in a confidence

statement includes only random sampling error.

2) Non-sampling error: errors not related to the act of selecting a

sample from the population. They can be present in a census.

- Observational error

- Respondent error

- Lack of preciseness of definition

- Errors in editing and tabulation of data
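
The claim that random sampling error can be controlled by taking a larger sample can be checked with a small simulation. This is only a sketch: the population of 10,000 incomes is made up, and the sample sizes and number of repeats are arbitrary choices.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 incomes (values are made up).
population = [random.gauss(5000, 1500) for _ in range(10_000)]
mu = mean(population)  # the parameter a census would report

def average_sampling_error(n, repeats=200):
    """Mean absolute gap between the sample mean and mu,
    taken over many simple random samples of size n."""
    gaps = [abs(mean(random.sample(population, n)) - mu) for _ in range(repeats)]
    return mean(gaps)

errors = {n: average_sampling_error(n) for n in (25, 100, 400)}
for n, err in errors.items():
    print(n, round(err, 1))
```

The simulation covers only random sampling error. Non-sampling errors such as respondent error would not shrink as n grows.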

Errors in Sampling

• Most sample surveys are afflicted by errors other than random

sampling.

• These errors introduce bias that makes a confidence interval

basically meaningless.

• Good sampling technique includes reducing all sources of

error.

• Part of this includes random sampling and confidence

statements.

Sampling Errors

• Random sampling error

– Margin of error & confidence statement

• Bad sampling methods

– Voluntary response & convenience samples

• Under-coverage bias

– Occurs when some groups in the population are left out

of the process of choosing a sample.

– Limited sampling frame

– Homeless

– Subjects excluded who are in hospitals, motels, etc.

usually small.

Non-sampling Errors

• Response error (incorrect response)

– A subject may lie or not remember (period of time

questions)

– People may lie, especially if the questions are embarrassing: Age,

weight, income

– People may not remember (period of time questions)

• How many movies did you watch last year?

• Non-response bias

– Occurs when an individual chosen refuses to provide

answers or cannot be contacted.

• Measurement bias

– Interviewer bias

• Occurs when an interviewer (because of social position, poor

training, etc.) influences the response in a systematic way.

– Question wording bias

• Occurs when questions have leading phrases, loaded words, or

ambiguities that influence the response.

• Processing errors

Questions to Ask Before You Believe a

survey result

• What was the (target) population?

General Suggestions

• Think through the survey, include what critics might ask.

• Do a pilot survey

How to live with non-sampling errors

• Non-sampling errors, such as non-response, are always

there.

(Not always!).

bias.

– If too many women are in the sample, the survey gives

more weight to men.

Sampling Methods

Two broad divisions:

A. Probability Sampling

• Every sampling unit has a known and non-zero

probability of selection into the sample.

• Involves the selection of a sample from a population,

based on chance

• Probability sampling is:

– more complex,

– more time-consuming and

– usually more costly than non-probability sampling.

– because study samples are randomly selected and their

probability of inclusion can be calculated,

reliable estimates can be produced and

inferences can be made about the population.

• There are several different ways in which a probability

sample can be selected.

as

– the available sampling frame,

Most common probability

sampling methods

2. Systematic random sampling

3. Stratified random sampling

4. Cluster sampling

5. Multi-stage sampling

1. Simple random sampling

Every member of the population has an equal chance of being

selected

sample methods are compared

from the sampling frame, a list or a database of all individuals

in the population

To use a SRS method:

the size of the population)

Use of “lottery” methods

Computer programs
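
A computer-based "lottery" for simple random sampling might look like the sketch below. The frame of 400 numbered units and the sample size of 100 are assumptions for illustration.

```python
import random

# Hypothetical sampling frame: units numbered 1 to N.
N = 400
frame = list(range(1, N + 1))

n = 100                    # desired sample size
rng = random.Random(42)    # seeded generator so the draw can be repeated

# Draw n distinct units; every unit in the frame has the same chance of selection.
srs = rng.sample(frame, n)

print(sorted(srs)[:10])
```

`random.sample` draws without replacement, so no unit can appear twice in the sample.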

SRS has certain limitations:

Difficult if the reference population is dispersed.

Minority subgroups of interest may not be selected.

It can be expensive and often not feasible in practice

of being chosen in the sample, it may result in samples that

are spread out over a large geographical area. Such a

geographic distribution of the sample would be very costly to

implement

2. Systematic random sampling

• often used instead of random sampling

rather than randomly

2. Systematic random sampling

• Taking individuals at fixed intervals (every kth) based on the

sampling fraction

order:

– Order of registration of patients

– Numerical order of house numbers

– Student’s registration books

Steps in systematic random sampling

1. Number the units on your frame from 1 to N (where N is the

total population size).

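2. Determine the sampling interval (K) by dividing the number of units in the population by the desired sample size.

3. Select a number between one and K at random. This number is called the random start and would be the first number included in your sample.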

4. Select every Kth unit after that first number

Example

• To select a sample of 100 from a population of 400, you

would need a sampling interval of 400 ÷ 100 = 4.

• Therefore, K = 4.

• You will need to select one unit out of every four units to

end up with a total of 100 units in your sample.

numbers.

first unit included in your sample;

a sample of 100: 3 (the random start), 7, 11, 15, 19...395, 399 (up to N, which is 400 in this case).
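
The worked example (N = 400, n = 100, K = 4, random start 3) can be reproduced with a short sketch. The function name and the optional `start` argument are hypothetical, and the code assumes N is an exact multiple of n.

```python
import random

def systematic_sample(N, n, start=None, rng=random):
    """Select every K-th unit from a frame numbered 1..N.
    Assumes N is an exact multiple of n."""
    k = N // n                      # sampling interval K
    if start is None:
        start = rng.randint(1, k)   # random start between 1 and K inclusive
    return list(range(start, N + 1, k))

sample = systematic_sample(N=400, n=100, start=3)
print(sample[:5], "...", sample[-2:])
```

Leaving `start` unset draws the random start automatically, so each of the K possible samples is equally likely to be chosen.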

Using the above example, you can see that with a systematic

sample approach there are only four possible samples that

can be selected, corresponding to the four possible random

starts:

A. 1, 5, 9, 13...393, 397

B. 2, 6, 10, 14...394, 398

C. 3, 7, 11, 15...395, 399

D. 4, 8, 12, 16...396, 400

• Each member of the population belongs to only one of the four

samples and each sample has the same chance of being

selected.

would have a chance of making up the sample, while with

systematic sampling, there are only four possible samples.

Advantages of Systematic random

sampling

The systematic sampling design is simple and convenient.

are relatively low.

satisfactory, provided care is taken to see that there are no

periodic features associated with the sampling interval.

often be expected to yield results similar to those obtained by

proportional stratified sampling.

Disadvantages of Systematic random

sampling

The main limitation of the method is that it becomes less

representative if we are dealing with populations having

“hidden periodicities”.

to the characteristics the investigator is interested in, then it is

possible that only certain types of items will be included in the sample, or at least more of certain types than others.

repetition is inherent in the sampling frame.

3. Stratified random sampling

• It is done when the population is known to be heterogeneous with regard to some factors, and those factors are used for stratification

• Using stratified sampling, the population is divided into

homogeneous, mutually exclusive groups called strata, and

• A population can be stratified by any variable that is available

for all units prior to sampling (e.g., income (low, medium &

high), age, sex, province of residence, etc.)

others that exist) can be used to sample within each stratum.

Why do we need to create strata?

• It can make the sampling strategy more efficient.

if a characteristic varies greatly from one unit to the other.

– For example, if every person in a population had the same salary, then

a sample of one individual would be enough to get a precise estimate

of the average salary.

• Efficiency gain using stratification.

– If you create strata within which units share similar characteristics

(e.g., income) and are considerably different from units in other strata

(e.g., occupation) then you would only need a small sample from each

stratum to get a precise estimate of total income for that stratum.

total income for the whole population.

• Is superior to SRS because it reduces sampling error

• If you use a SRS approach in the whole population without

stratification, the sample would need to be larger than the

total of all stratum samples to get an estimate of total income

with the same level of precision.

groups in the population of interest.

independent population and you will need to decide the

sample size for each stratum.

• Equal allocation:

– Allocate equal sample size to each stratum

• Proportionate allocation:

$n_j = \frac{n}{N} \times N_j$

– nj is sample size of the jth stratum

– Nj is population size of the jth stratum

– n = n1 + n2 + ...+ nk is the total sample size

– N = N1 + N2 + ...+ Nk is the total population size

Example: Proportionate Allocation

Village       A    B    C    D    Total

Households    100  150  120  130  500

Sample size   ?    ?    ?    ?    60
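
One way to work this exercise is to apply the proportionate allocation rule, sample size for a stratum = (n / N) × stratum population, to each village. The sketch below does that and rounds to whole households; the variable names are illustrative.

```python
# Households per village (from the table) and the total sample size.
households = {"A": 100, "B": 150, "C": 120, "D": 130}
n = 60
N = sum(households.values())   # total population size, 500

# Proportionate allocation: n_j = (n / N) * N_j for each stratum j.
exact = {village: n * Nj / N for village, Nj in households.items()}

# Round to whole households.
allocation = {village: round(x) for village, x in exact.items()}

print(exact)
print(allocation, "total:", sum(allocation.values()))
```

Here the exact shares are 12, 18, 14.4 and 15.6, which round to 12, 18, 14 and 16 and still add up to 60. In general, simple rounding need not reproduce the total exactly, so the rounded figures should be checked against n.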

4. Cluster sampling

• Is preferable when the population is subdivided into groups or clusters that are internally heterogeneous and externally homogeneous (similar to one another)

• Sometimes it is too expensive to carry out SRS

– Population may be large and scattered.

– Complete list of the study population unavailable

– Travel costs can become expensive if interviewers have to

survey people from one end of the country to the other.

• Cluster sampling is the method most widely used to reduce the cost

• The clusters should be similar to one another (each internally heterogeneous), unlike stratified sampling, where the strata differ from one another and are internally homogeneous

Steps in cluster sampling

• Cluster sampling divides the population into groups

or clusters.

population

• then all units within selected clusters are included in

the sample.

No units from non-selected clusters are included in

the sample

– they are represented by those from selected clusters.
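
The steps above can be sketched as follows. The schools, their sizes and the number of clusters drawn are all hypothetical.

```python
import random

# Hypothetical clusters: five schools of different sizes, each a list of student IDs.
sizes = [30, 45, 25, 40, 35]
schools = {
    f"School {i}": [f"S{i}-{j}" for j in range(1, size + 1)]
    for i, size in enumerate(sizes, start=1)
}

rng = random.Random(7)

# Stage 1: select clusters at random.
chosen = rng.sample(sorted(schools), 2)

# Stage 2: include every unit of each selected cluster; no other units enter the sample.
sample = [student for school in chosen for student in schools[school]]

print(chosen, len(sample))
```

Because the clusters differ in size, the final sample size depends on which schools happen to be drawn.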

units are selected from each group.

Example

• In a school based study, we assume students of the same

school are homogeneous.

the selected sections only

Advantages

• Cost reduction

• It creates 'pockets' of sampled units instead of spreading the

sample over the whole territory.

• Sometimes a list of all units in the population is not available,

while a list of all clusters is either available or easy to create.

Disadvantages

• Creates a loss of efficiency when compared with SRS.

– You do not have total control over the final sample size.

– Since not all schools have the same number of students and you

must interview every student,

sample that does not represent the whole spectrum of

opinions or situations present in the overall population.

5. Multi-stage Sampling

• Similar to the cluster sampling, except that it involves

picking a sample from within each chosen cluster, rather

than including all units in the cluster.

the first sampling stage.

the second sampling stage, etc.

Woreda (primary sampling unit, PSU) → Kebele (secondary sampling unit, SSU) → Sub-Kebele (tertiary sampling unit, TSU) → Household (HH)

• In the first stage, large groups or clusters are identified

and selected. These clusters contain more population

units than are needed for the final sample.

• In the second stage, population units are picked from

within the selected clusters (using any of the possible

probability sampling methods) for a final sample.

• No need to have a list of all of the units in the population.

– All you need is a list of clusters and list of the units in the selected

clusters.

• cost reduction.

• saves a great amount of time and effort.
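
A two-stage version of this design can be sketched as below. The kebeles, household IDs and stage sizes are made up, and simple random sampling is assumed at both stages.

```python
import random

rng = random.Random(11)

# Hypothetical frame: 10 kebeles (clusters), each with 50 listed households.
kebeles = {
    f"Kebele {i}": [f"K{i}-HH{j}" for j in range(1, 51)]
    for i in range(1, 11)
}

# First stage: select 3 kebeles at random.
stage1 = rng.sample(sorted(kebeles), 3)

# Second stage: select 10 households at random within each chosen kebele,
# rather than taking every household as cluster sampling would.
stage2 = {k: rng.sample(kebeles[k], 10) for k in stage1}

sample = [hh for hhs in stage2.values() for hh in hhs]
print(stage1, len(sample))
```

Only the three selected kebeles need a household list, so a list of every household in the population is not required.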

B. Non-probability sampling

• In non-probability sampling, every item has an unknown

chance of being selected.

• In non-probability sampling, there is an assumption that

there is an even distribution of a characteristic of interest

within the population.

would be representative and because of that, results will be

accurate.

selection process, rather than an assumption about the

structure of the population.

In non-probability sampling, since elements are chosen

arbitrarily, there is no way to estimate the probability of any

one element being included in the sample.

being included, making it impossible either to estimate

sampling variability or to identify possible bias

the only way to address data quality is to compare some of

the survey results with available information about the

population.

acceptable level of error.

there is no way to measure the precision of the resulting sample.

• Despite these drawbacks, non-probability sampling

methods can be useful when descriptive comments

about the sample itself are desired.

researches, when it is unfeasible or impractical to

conduct probability sampling.

Most common types of non-

probability sampling

1. Convenience or haphazard sampling

2. Volunteer sampling

3. Judgment sampling

4. Quota sampling

5. Snowball sampling technique

1. Convenience or haphazard sampling

haphazard or accidental sampling.

because sample units are only selected if they can be

accessed easily and conveniently.

but that advantage is greatly offset by the presence of

bias.

can deliver accurate results when the population is

homogeneous.

• For example, a scientist could use this method

to determine whether a lake is polluted or not.

sample would yield similar information.

lake without bothering about whether or not the

sample is representative

2. Volunteer sampling

• As the term implies, this type of sampling occurs when

people volunteer to be involved in the study.

• In psychological experiments or pharmaceutical trials

(drug testing), for example, it would be difficult and

unethical to enlist random participants from the general

public.

• In these instances, the sample is taken from a group of

volunteers.

• Sometimes, the researcher offers payment to attract

respondents.

lengthy, demanding or sometimes unpleasant process.

• Sampling voluntary participants as opposed to the

general population may introduce strong biases.

strongly enough about the subject tend to respond.

resulting in large selection bias.

3. Judgment sampling

• This approach is used when a sample is taken

based on certain judgments about the overall

population.

will select units that are characteristic of the

population.

judgment be relied upon to arrive at a typical

sample?


• Judgment sampling is subject to the researcher's biases

reflected in the sample, large biases can be introduced if

these preconceptions are inaccurate.

– in exploratory studies like pre-testing of questionnaires

and focus groups.

– in laboratory settings where the choice of experimental

subjects (i.e., animal, human) reflects the investigator's

pre-existing beliefs about the population.

cost and time involved in acquiring the sample.


4. Quota sampling

• Sampling is done until a specific number of units (quotas) for

various sub-populations have been selected.

filled, quota sampling is really a means for satisfying sample

size objectives for certain sub-populations.

those not selected.

Such strong assumptions are rarely valid.


Quota sampling is:

generally less expensive than random sampling.

whole population, randomly selecting the sample and following-up

on non-respondents can be omitted from the procedure.

required and can be conducted without sampling frames.

population has no suitable frame.

– Some units may have no chance of selection or the chance of

selection may be unknown. Therefore, the sample may be

biased.


5. Snowball sampling

• A special non-probability method used when the desired sample characteristic is rare.

respondents in these situations.

study subjects recruit future subjects from among their

acquaintances.

snowball.

• This sampling technique is often used for hidden populations that are difficult for researchers to access, for example drug users or commercial sex workers.


Snowball sampling

frame, snowball samples are subject to numerous biases.

• For example, people who have many friends are more likely

to be recruited into the sample.

introducing bias!


Non-Probability Sampling: Inherent concerns related to

generalizability and representation


Variable

Variable: an attribute or characteristic that may take on different values in different persons, places, …

Weight, Height) or recorded (e.g., age, gender) and takes any

value.

There may be one or many variables in a study.

intended use and hence classified as qualitative and

quantitative variables


Types of Variable/Data

Variable/Data

• Categorical/Qualitative (defined categories or groups)

– Eg: marital status, registered to vote?, region

• Numerical/Quantitative

– Discrete (counted items). Eg: number of children, defects per hour

– Continuous (measured characteristics). Eg: weight, height

Levels of measurement

Quantitative data

• Ratio data: differences between measurements, true zero exists. Eg: height, age, BP

• Interval data: differences between measurements but no true zero. Eg: temperature in °F

Qualitative data

• Ordinal data: ordered categories (rankings, order, or scaling). Eg: response to treatment

• Nominal data: categories (no ordering or direction). Eg: ethnic group

Exercise 2

What type of variable is each of the following?

a) Region

b) Blood group

c) Health status: very sick, sick and cured

d) Age of an employee in a company

e) Student mark

f) No. of movies seen this summer

g) Income

h) Income class (poor, medium, rich)

i) Test result (negative, positive)

• Quantitative or categorical?

• Continuous or discrete?

• Nominal, ordinal, interval or ratio scale?

Assignment

2. Match each scale of measurement with its permissible arithmetic operations

Scales: Nominal, Ordinal, Interval, Ratio

Operations (listed in no particular order):

• < or > operations

• Only + and - of scale values

• Multiplication and division of scale values

• Counting

Recap: Why is the level of measurement important?

The level of measurement helps you decide:

• how to present the data

• how to interpret the data

• what type of statistical analysis is appropriate


Stages in statistical investigation

Interpretation

Inferential Statistics

Analysis of Data

Presentation

Descriptive Statistics

Organization

Problem


Methods of Data Organization

and Presentation


Data Organization

Data in raw form are usually not easy to use for decision

making

Graph

on the variable being summarized


Frequency Distributions

(Tables)

The actual summarization and organization of data starts

from frequency distribution.

the possible values that the data can assume along with the

number of times each value occurs.

A frequency distribution summarizes data by condensing the

raw data into a more useful form

allows for a quick visualization of the data


Frequency Distributions

(Tables)

• For nominal and ordinal data, frequency distributions are

often used as a summary.

• Example:

is often listed

• Tables make it easier to see how the data are distributed


Select a set of continuous, non-overlapping intervals such

that each value can be placed in one, and only one, of the

intervals.

The first consideration is how many intervals to include

A common rule of thumb states that there should be no fewer

than six intervals and no more than 15.


To determine the number of class intervals and the corresponding width, we may use:

\[ K = 1 + 3.322 \log_{10} n \]

\[ W = \frac{L - S}{K} \]

that is, interval width = range / number of desired intervals, where

K = number of class intervals

n = number of observations

W = width of the class interval

L = the largest value

S = the smallest value

Example: A manufacturer of insulation randomly selects 20 winter

days and records the daily high temperature:

24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27

Solution:

1. Sort raw data in ascending order:

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

2. Find range: 58 - 12 = 46

3. Select number of classes (K): K = 1 + 3.322 log10(20) = 5.32 ≈ 5

4. Compute interval width: 10 (46/5 = 9.2, then round up)

5. Determine interval boundaries: 10 but less than 20, 20 but less than 30, . . . , 50 but less than 60

6. Count observations & assign to classes
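The six steps above can be sketched in Python. This is a minimal illustration, not the only way to build the table: it assumes Sturges' rule with base-10 logarithms for K, and it assumes the width is rounded up to the next multiple of 10, as in the worked example. The variable names are chosen here for illustration.

```python
import math
from collections import Counter

# Daily high temperatures from the example (20 winter days)
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

n = len(temps)
k = round(1 + 3.322 * math.log10(n))          # number of classes (Sturges' rule)
data_range = max(temps) - min(temps)          # largest value minus smallest value

# Assumption: round the raw width (range / k) up to the next multiple of 10
width = math.ceil(data_range / k / 10) * 10
start = (min(temps) // width) * width         # lower boundary of the first class

# Count how many observations fall in each class [lower, upper)
counts = Counter((t - start) // width for t in temps)
freq_table = [(start + i * width, start + (i + 1) * width, counts.get(i, 0))
              for i in range(k)]

for lower, upper, freq in freq_table:
    print(f"{lower} but less than {upper}: {freq}")
```

Each class includes its lower boundary and excludes its upper boundary, so every observation falls in exactly one interval.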


Exercise 3

Construct a grouped data frequency distribution for Leisure

time (hours) per week for 40 college students:

23 24 18 14 20 36 24 26 23 21 16 15 19 20 22 14 13 10 19 27

29 22 38 28 34 32 23 19 21 31 16 28 19 18 12 27 15 21 25 16


K = 1 + 3.322 log10(40) = 6.32 ≈ 6

Maximum = 38, Minimum = 10

Width = (38 - 10)/6 = 4.67 ≈ 5

Cumulative frequencies: obtained when the frequencies of two or more classes are added.

Cumulative relative frequency: the percentage of the total number of observations that have a value either in that interval or below it.

Leisure time (hours)   Frequency   Relative frequency   Cumulative relative frequency
10-14                  5           0.125                0.125
15-19                  11          0.275                0.400
20-24                  12          0.300                0.700
25-29                  7           0.175                0.875
30-34                  3           0.075                0.950
35-39                  2           0.050                1.000
Total                  40          1.000
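As a check on the table, the frequencies, relative frequencies and cumulative relative frequencies can be recomputed from the raw leisure-time data. The sketch below takes the class limits 10-14, ..., 35-39 from the table; the variable names are illustrative.

```python
from itertools import accumulate

# Leisure time (hours) per week for 40 college students
hours = [23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20, 22, 14, 13, 10, 19, 27,
         29, 22, 38, 28, 34, 32, 23, 19, 21, 31, 16, 28, 19, 18, 12, 27, 15, 21, 25, 16]

# Class limits (both ends inclusive), as in the table
class_limits = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]

n = len(hours)
freqs = [sum(lo <= h <= hi for h in hours) for lo, hi in class_limits]
rel_freqs = [f / n for f in freqs]
# Cumulative relative frequency: running total of the frequencies divided by n
cum_rel_freqs = [c / n for c in accumulate(freqs)]

for (lo, hi), f, r, c in zip(class_limits, freqs, rel_freqs, cum_rel_freqs):
    print(f"{lo}-{hi}: {f:2d}  {r:.3f}  {c:.3f}")
```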

Graphical Presentation of Data


Importance of diagrammatic

representation

Diagrams have greater attraction than mere figures

They give quick overall impression of the data

They are easier to remember than mere figures

They facilitate comparison

Used to understand patterns and trends

Well designed graphs can be powerful means of

communicating a great deal of information

When graphs are poorly designed, they not only convey the message ineffectively but are often misleading.

Specific types of graphs include:

Categorical variables:

• Bar chart

• Pie chart

• Pareto diagram

Numerical variables:

• Frequency distribution

• Histogram and ogive

• Stem-and-leaf display

• Scatter plot

• Bar charts and Pie charts are often used for qualitative

(category) data

Hospital Patients by Unit

Hospital Unit     Number of Patients
Cardiac Care      1,052
Emergency         2,245
Intensive Care    340
Maternity         552
Surgery           4,630

[Figure: bar chart of number of patients by hospital unit, vertical axis up to 5000]

Data presentation using

Histogram


Scatter Plot

Scatterplots (for quantitative variables)

plot response variable on vertical axis,

explanatory variable on horizontal axis

Descriptive Statistics: Numerical

Summary Measures

Single numbers that quantify the characteristics of a

distribution of values

Measures of central tendency (location)

Measures of dispersion


Describing Data Numerically

• Central tendency: mean, median, mode

• Variation: range, variance, standard deviation, coefficient of variation

Measures of Central Tendency

Overview

Central Tendency

• Mean: arithmetic average, \( \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \)

• Median: midpoint of ranked values

• Mode: most frequently observed value

Arithmetic Mean

The arithmetic mean (mean) is the most common

measure of central tendency

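For a population of size N:

\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N} \]

where the \( x_i \) are the population values and \( N \) is the population size.

For a sample of size n:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]

where the \( x_i \) are the observed values and \( n \) is the sample size.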

Properties of Arithmetic Mean

The most common measure of central tendency

Easy to calculate and understand (simple).

For a given set of data there is one and only one

arithmetic mean (uniqueness).

Influenced by each and every value in a data set

Greatly affected by the extreme values (outliers).


Weighted Mean

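The weighted mean is a special type of arithmetic mean, used when each value has its own weight.

Some of the observations in a data set may have greater importance, e.g., a final exam compared to mid-exam and test.

If the values \( x_1, x_2, \ldots, x_n \) have weights \( w_1, w_2, \ldots, w_n \), respectively, then the weighted mean is defined as:

\[ \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} \]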

Weighted Mean

Example: An entrance exam for a job consists of 25% English,

50% Mathematics, 5% Typing and 20% Accounting. If an

applicant who took the entrance exam scored 48% in English,

35% in Mathematics, 80% in Typing and 50% in Accounting,

his average score is:

\[ \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \frac{0.25(48) + 0.50(35) + 0.05(80) + 0.20(50)}{0.25 + 0.50 + 0.05 + 0.20} = 43.5 \]

… term exam and 5 for final exam. If a student scores 90, 50 and 60 for HM, MT and FE, respectively, what is his/her average academic performance? Ans: 63
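The entrance-exam calculation can be reproduced with a few lines of Python. This is a small sketch; the function name `weighted_mean` is chosen here for illustration and is not part of the course material.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Scores (%) in English, Mathematics, Typing and Accounting
scores = [48, 35, 80, 50]
# Share of each subject in the entrance exam
weights = [0.25, 0.50, 0.05, 0.20]

result = weighted_mean(scores, weights)
print(round(result, 1))
```

Because the function divides by the sum of the weights, the weights do not need to add up to 1; raw weights such as credit hours work as well.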


Median

In an ordered list, the median is the “middle” number (50%

above, 50% below)

[Figure: two data sets plotted on axes from 0 to 10, each with median = 3]

\[ \text{Median position} = \frac{n+1}{2} \text{ position in the ordered data} \]

Median

If the number of values is odd, the median is the middle

number

If the number of values is even, the median is the average of

the two middle numbers

Note that \( \frac{n+1}{2} \) is not the value of the median, only the position of the median in the ranked data

Properties of median

There is only one median for a given set of data

(uniqueness)

The median is easy to calculate

Median is a positional average and hence it is insensitive

to very large or very small values

Median can be calculated even in the case of open end

intervals

It is determined mainly by the middle points and less

sensitive to the remaining data points (weakness).


Mode

[Figure: dot plots on a 0 to 14 axis illustrating a data set with mode = 9 and a data set with no mode]


Mode

Examples: Compute mode for the following data

sets:

1. 1, 2, 3, 4, 4, 4, 4, 5, 5, 6

• Mode is 4 “Unimodal”

2. 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8

• There are two modes – 2 & 5

• This distribution is said to be “bi-modal”

3. 2.62, 2.75, 2.76, 2.86, 3.05, 3.12

• No mode, since all the values are different
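The three cases above (unimodal, bi-modal, no mode) can be checked with a short Python function. This is a sketch under one stated assumption: following the convention of these examples, a data set in which every value occurs exactly once is treated as having no mode. The function name `modes` is illustrative.

```python
from collections import Counter

def modes(data):
    """Return the list of modes, or an empty list if every value occurs once."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        # Convention used in the examples: all values different means no mode
        return []
    return sorted(value for value, count in counts.items() if count == highest)

set_1 = [1, 2, 3, 4, 4, 4, 4, 5, 5, 6]
set_2 = [1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8]
set_3 = [2.62, 2.75, 2.76, 2.86, 3.05, 3.12]

print(modes(set_1))  # one mode: unimodal
print(modes(set_2))  # two modes: bi-modal
print(modes(set_3))  # empty list: no mode
```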



Class Exercise

Annual per capita carbon dioxide emissions (metric tons) for n = 8 largest

nations in population size

Bangladesh 0.3, Brazil 1.8, China 2.3, India 1.2, Indonesia 1.4,

Pakistan 0.7, Russia 9.9, U.S. 20.1

Compute the mean and median of the carbon dioxide emissions data.

Which one is the best measure of central tendency? Why?

Ordered sample: 0.3, 0.7, 1.2, 1.4, 1.8, 2.3, 9.9, 20.1

Median = (1.4 + 1.8)/2 = 1.6

Mean = (0.3 + 0.7 + 1.2 + … + 20.1)/8 = 4.7

The mean is sensitive to outliers (the median is often preferred for highly skewed distributions).
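The mean and median of the emissions data can be verified with Python's standard `statistics` module. The sketch below simply recomputes the two values; the dictionary layout is an illustrative choice.

```python
import statistics

# Annual per capita CO2 emissions (metric tons)
emissions = {
    "Bangladesh": 0.3, "Brazil": 1.8, "China": 2.3, "India": 1.2,
    "Indonesia": 1.4, "Pakistan": 0.7, "Russia": 9.9, "U.S.": 20.1,
}

values = sorted(emissions.values())
mean = statistics.mean(values)      # sum of the values divided by n
median = statistics.median(values)  # average of the two middle values (n is even)

print(f"mean = {mean:.1f}, median = {median:.1f}")
```

The two largest values (Russia and the U.S.) pull the mean well above the median, which is why the median describes the typical country better here.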


Measures of Dispersion

Measures that quantify the variation or dispersion of a set of

data from its central location

Dispersion refers to the variety exhibited by the values of the

data.

Measures of variation give information on the spread or

variability of the data values.

The amount may be small when the values are close

together.

If all the values are the same, there is no dispersion.

Why measures of Dispersion

The measures of dispersion are helpful in statistical

investigation

correlation, regression, testing of hypothesis etc.


Comparing standard deviation

[Figure: three data sets plotted on an axis from 11 to 21]

Data set   Mean   s
Data A     15.5   3.338
Data B     15.5   0.926
Data C     15.5   4.570

Measures of Dispersion

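Variation: range, variance, standard deviation, coefficient of variation

Two data sets may have the same mean and/or median but they may be quite different in variation.

[Figure: two distributions with the same center and different variation]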

Range (R)

• Simplest measure of variation

• Difference between the largest and the smallest observations

in a sample

• Range = Maximum value – Minimum value

• Example –Compute range of the following dataset:

Data values: 5, 9, 12, 16, 23, 34, 37, 42

– Range = 42-5 = 37


Properties of range

It is the simplest crude measure and can be easily

understood

poor measure of dispersion

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5

Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120

Range = 120 - 1 = 119
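The effect of a single extreme value on the range can be seen by computing it for the two data sets above. A minimal sketch, with an illustrative helper name `data_range`:

```python
def data_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)

# The two data sets differ only in their last value (5 versus 120)
set_a = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
set_b = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

print(data_range(set_a))
print(data_range(set_b))
```

Only the largest and smallest observations enter the calculation, so changing one value changes the range completely while the other 24 observations are ignored.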


Variance (σ², s²)

• first used by Karl Pearson in 1893

the mean

to the mean.

– Population variance = σ²

– Sample variance = s²

dispersion is less than when there is scattering over a wide

range.


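Variance (σ², s²)

Sample variance:

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

where

\( \bar{x} \) = arithmetic mean

\( n \) = sample size

\( x_i \) = ith value of the variable X

Population variance:

\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \]

where \( \mu = \frac{\sum_{i=1}^{N} X_i}{N} \) is the population mean.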
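The difference between the two formulas is only the divisor: n - 1 for a sample and N for a population. The sketch below applies both to the eight values used in the range example (5, 9, 12, ..., 42) and compares the hand-written formulas with `statistics.variance` and `statistics.pvariance` from Python's standard library. Treating the same values once as a sample and once as a population is purely for illustration.

```python
import statistics

data = [5, 9, 12, 16, 23, 34, 37, 42]

n = len(data)
mean = sum(data) / n
sum_sq_dev = sum((x - mean) ** 2 for x in data)  # sum of squared deviations

sample_var = sum_sq_dev / (n - 1)   # s^2: divide by n - 1
population_var = sum_sq_dev / n     # sigma^2: divide by N

# The standard library implements the same two definitions
print(sample_var, statistics.variance(data))
print(population_var, statistics.pvariance(data))
```

For the same data the sample variance is always the larger of the two, because the same sum of squared deviations is divided by a smaller number.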


Standard deviation (σ, s)

• It is the square root of the variance

\( \sigma = \sqrt{\sigma^2} \) and \( s = \sqrt{s^2} \)

the individual values.


Sample Standard Deviation computation

Following are the survival times of 11 patients after heart

transplant surgery. Calculate their sample variance and SD.

Exercise 3

Sample Standard Deviation

computation

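Sample Data (x_i): 10 12 14 15 17 18 18 24

n = 8, Mean = \( \bar{x} \) = 16

\[ s = \sqrt{\frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + \cdots + (24-16)^2}{8 - 1}} = \sqrt{\frac{130}{7}} = 4.3095 \]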
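The computation can be checked step by step in Python. For these eight values the squared deviations from the mean of 16 are 36, 16, 4, 1, 1, 4, 4 and 64, which add up to 130. The sketch compares the manual calculation with `statistics.stdev` from the standard library, which also uses the n - 1 divisor.

```python
import math
import statistics

sample = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(sample)
mean = sum(sample) / n                                # 16.0
squared_devs = [(x - mean) ** 2 for x in sample]      # (x_i - mean)^2 for each value
sum_sq_dev = sum(squared_devs)

s_manual = math.sqrt(sum_sq_dev / (n - 1))            # sample standard deviation
s_library = statistics.stdev(sample)                  # same definition (n - 1 divisor)

print(sum_sq_dev, round(s_manual, 4), round(s_library, 4))
```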


Properties of SD

SD is considered to be the best measure of dispersion and is

used widely because of the properties of the theoretical

normal curve

Each value in the data set is used in the calculation

Values far from the mean are given extra weight

(because deviations from the mean are squared)

The SD has the advantage of being expressed in the same

units of measurement as the mean

However, if the units of measurement of the variables in two data sets are not the same, then their variability can't be compared by comparing the values of SD.


Coefficient of variation (CV)

When two data sets have different units of measurements, or

their means differ sufficiently in size, the CV should be used

as a measure of dispersion

It is the best measure to compare the variability of two series or sets of observations

Can be used to compare two or more sets of data measured in different units

Measures relative variation: it shows variation relative to the mean

\[ CV = \frac{s}{\bar{x}} \cdot 100\% \]

Data with a smaller coefficient of variation is considered more consistent (less dispersed)

Comparing CV

Stock A:

Average price last year = $50

Standard deviation = $5

\[ CV_A = \frac{s}{\bar{x}} \cdot 100\% = \frac{5}{50} \cdot 100\% = 10\% \]

Stock B:

Average price last year = $100

Standard deviation = $5

\[ CV_B = \frac{s}{\bar{x}} \cdot 100\% = \frac{5}{100} \cdot 100\% = 5\% \]

Both stocks have the same standard deviation, but stock B is less variable relative to its price.
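The stock comparison is a one-line formula in code. The sketch below uses an illustrative function name, `coefficient_of_variation`, and returns the CV as a percentage.

```python
def coefficient_of_variation(sd, mean):
    """CV = (standard deviation / mean) * 100, expressed in percent."""
    return 100 * sd / mean

cv_a = coefficient_of_variation(sd=5, mean=50)    # Stock A
cv_b = coefficient_of_variation(sd=5, mean=100)   # Stock B

print(f"CV of stock A = {cv_a:.1f}%")
print(f"CV of stock B = {cv_b:.1f}%")
```

Because the units cancel in the ratio, the same function can be used to compare variables measured in different units.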


Comparing CV

Variable       SD          Mean         CV (%)
SBP            15 mmHg     130 mmHg     11.5
Cholesterol    40 mg/dl    200 mg/dl    20.0

Standard Score

A standard score for a sample value in a data set is obtained by subtracting the mean of the data set from the value and dividing the result by the standard deviation of the data set.

\[ Z = \frac{X - \bar{X}}{S} \]

Basically, the standard score (z-score) tells us how many

standard deviations a specific value is above or below the

mean value of the data set.

i.e. the z-score is the number of standard deviations the data

value falls above (positive z-score) or below (negative z-

score) the mean for the data set.


Standard Score

Ex. Suppose a student scored 65% in a statistics test and 70%

in mathematics test. In which subject did he perform better?

with the average score of all students who sat for these exams (simply comparing 65 and 70 may lead to a wrong conclusion).

We can compute and compare the Z-score values for the two subjects.


Standard Score

Exercise: what is the Z-score for the value of 14 in the following

sample data set?

3 8 6 14 4 12 7 10

\[ Z = \frac{X - \bar{X}}{S} = \frac{14 - 8}{3.8173} = 1.57 \]

The data value of 14 is located 1.57 standard deviations above

the mean 8 because the z-score is positive.
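The same z-score can be computed in Python with the standard `statistics` module, using the sample standard deviation (n - 1 divisor) as in the solution above. The helper name `z_score` is illustrative.

```python
import statistics

data = [3, 8, 6, 14, 4, 12, 7, 10]

mean = statistics.mean(data)   # sample mean
s = statistics.stdev(data)     # sample standard deviation (n - 1 divisor)

def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

z = z_score(14, mean, s)
print(f"mean = {mean}, s = {s:.4f}, z = {z:.2f}")
```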


General Shape of Distributions

Histograms and box plots can be quite useful in suggesting

the shape of a probability distribution.

values lie to the left of the mean, and approximately half of the

data values lie to the right of the mean.

distribution.


Eg: Weight, Height, IQ, etc.

General Shape of Distributions

For a distribution that is skewed right (positively Skewed), the bulk

of the data values (including the median) lie to the left of the

mean, and there is a long tail on the right side.


General Shape of Distributions

For a distribution that is skewed left (negatively Skewed), the

bulk of the data values (including the median) lie to the right of the

mean, and there is a long tail on the left side.


Identifying Outliers

Example: Annual per capita carbon dioxide emissions (metric

tons) for n = 8 largest nations in population size

Bangladesh 0.3, Brazil 1.8, China 2.3, India 1.2, Indonesia 1.4,

Pakistan 0.7, Russia 9.9, U.S. 20.1. Compute a measure of

central value.

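Ordered sample: 0.3, 0.7, 1.2, 1.4, 1.8, 2.3, 9.9, 20.1

Median = (1.4 + 1.8)/2 = 1.6

Mean = (0.3 + 0.7 + 1.2 + … + 20.1)/8 = 4.7

The mean is sensitive to outliers (the median is often preferred for highly skewed distributions)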


Identifying Outliers

Outliers are observations that are far from the center of the

distribution.

A box plot has a box from the lower quartile (LQ) to the upper quartile (UQ), with the median marked. It portrays a five-number summary of the data: Minimum, LQ, Median, UQ, Maximum, except for outliers, which are identified separately.

[Figure: box plot showing 1 outlier]

Identifying Outliers

An observation is a potential outlier if it falls below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR), where IQR = Q3 - Q1,

or if its z-score Zj is above 3 or less than -3.

1.5(8) = 22
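The 1.5(IQR) rule can be applied to the carbon dioxide emissions data as follows. This is a sketch under one explicit assumption: quartiles are computed with `statistics.quantiles` and its default "exclusive" method. Other quartile conventions give somewhat different values of Q1 and Q3 and may flag a different set of points.

```python
import statistics

# Annual per capita CO2 emissions (metric tons), ordered sample
values = [0.3, 0.7, 1.2, 1.4, 1.8, 2.3, 9.9, 20.1]

# Assumption: default 'exclusive' quartile method of statistics.quantiles
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(f"Q1 = {q1:.3f}, Q3 = {q3:.3f}, IQR = {iqr:.3f}")
print(f"fences: ({lower_fence:.3f}, {upper_fence:.3f})")
print("outliers:", outliers)
```

With this quartile convention only the largest value (the U.S.) lies beyond the upper fence.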


Remedial Measures for Outliers

If outliers exist, their potentially large squared errors may have a

strong influence on the fitted model (regression line)

Be sure to examine your data graphically for outliers and

extreme points

Decide, based on your model and logic, whether the extreme points should remain or be treated differently.

Outliers

1. Check if these are simply incorrectly recorded data.

2. Fit the model with and without the outlier.

Do the results change much?

If not, report the results including the outlier, but note that it is present.

If results do change substantially, report both.

3. Use a robust estimation procedure.
