
Lecture 2

Collecting Sample Data


 If sample data are not collected in an
appropriate way, the data may be so
completely useless that no amount of
statistical torturing can salvage them.
 The method used to collect sample data
influences the quality of the statistical
analysis.
 Of particular importance is the simple
random sample.

SIS 1037Y 2023/2024 2


 Experiment:
◦ Apply a treatment to a part of the group.
 Simulation:
◦ Use a mathematical model (often with a computer)
to reproduce the conditions of a situation or process.
 Census:
◦ A count or measure of the entire population.
 Sampling:
◦ A count or measure of part of the population.



 Statistical methods are driven by the data that
we collect. We typically obtain data from two
distinct sources: observational studies and
experiments.
 Observational study
◦ observing and measuring specific
characteristics without attempting to modify the
subjects being studied.
 Experiment
◦ apply some treatment and then observe its
effects on the subjects (subjects in
experiments are called experimental units)



 The Pew Research Center surveyed 2252
adults and found that 59% of them go online
wirelessly.
◦ This is an observational study because the adults had
no treatment applied to them.
 In the largest public health experiment ever
conducted, 200,745 children were given the
Salk vaccine, while another 201,229 children
were given a placebo.
◦ The vaccine injections constitute a treatment that
modified the subjects, so this is an example of an
experiment.



 The goal of the researcher is to use the
highest level of measurement possible.
◦ Example: Two ways of asking about smoking
behaviour. Which is better, A or B?
 (A) Do you smoke? Yes / No
 (B) How many cigarettes did you smoke in the
last 3 days (72 hours)? __

◦ (A) is nominal, so the best we can get from these data
are frequencies.
◦ (B) is ratio, so we can compute the mean, median,
mode and frequencies.
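
As a rough illustration of why (B) yields more than (A), the sketch below summarises both kinds of answer with Python's standard library. The responses are invented purely for illustration; they are not data from any survey.

```python
from collections import Counter
from statistics import mean, median, mode

# Hypothetical responses, invented for illustration only.
answers_a = ["Yes", "No", "No", "Yes", "No", "No"]  # (A) nominal
answers_b = [12, 0, 0, 30, 0, 6]                    # (B) ratio: cigarettes in 72 hours

# Nominal data: category frequencies are the only meaningful summary.
freq_a = Counter(answers_a)

# Ratio data: frequencies plus mean, median and mode.
summary_b = {
    "mean": mean(answers_b),
    "median": median(answers_b),
    "mode": mode(answers_b),
    "frequencies": Counter(answers_b),
}

# The ratio answers can also be collapsed back to the nominal question.
smokers_from_b = sum(1 for count in answers_b if count > 0)

print(freq_a)
print(summary_b)
print(smokers_from_b)
```

Note that the ratio-level answers can always be reduced to the Yes/No form, but not the other way round.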



 Example Two: Comparing Soft Drinks. Which
is better, A or B?
◦ (A) Please rank the taste of the following soft drinks
from 1 to 5 (1 = best, 2 = next best, etc.): __Coke
__Pepsi __7Up __Sprite __Dr Pepper
◦ (B) Please rate each of the following brands of soft
drink:
Coke: (1) excellent (2) very good (3) good (4) fair (5) poor (6) very poor (7) awful
Pepsi: (1) excellent (2) very good (3) good (4) fair (5) poor (6) very poor (7) awful
7Up: (1) excellent (2) very good (3) good (4) fair (5) poor (6) very poor (7) awful
Sprite: (1) excellent (2) very good (3) good (4) fair (5) poor (6) very poor (7) awful
Dr Pepper: (1) excellent (2) very good (3) good (4) fair (5) poor (6) very poor (7) awful



◦ Scale (B) is almost interval and is usually treated as such;
means are computed. We call this a rating scale.
 Incidentally, if you hate all five soft drinks, your responses
will show it.
◦ With scale (A), we have no way of knowing whether you
hate all five soft drinks.

 Rating scales: what level of measurement?
Probably better than ordinal, but not necessarily
exactly interval. Certainly not ratio.
◦ Is the interval between “excellent” and “good” equal to
the interval between “poor” and “very poor”? Probably
not. Even so, researchers typically assume that a rating scale is
interval.



 Nominal - categories only
 Ordinal - categories with some order
 Interval - differences but no natural zero
point
 Ratio - differences and a natural zero point



 Nonprobability Samples – based on convenience or
judgment
◦ Convenience (or chunk) sample - students in a class, mall
intercept
◦ Judgment sample - based on the researcher’s judgment as to
what constitutes “representativeness”; e.g., he/she might say these
20 stores are representative of the whole chain.
◦ Quota sample - interviewers are given quotas based on
demographics; for instance, each may be told to interview
100 subjects: 50 males and 50 females and, within each group of 50, say, 10
nonwhite and 40 white.

 The problem with a nonprobability sample is that we do not
know how representative our sample is of the population.



 Random Sample: Each member of the population
has an equal chance of being selected.
 Simple Random Sample: All samples of the same
size are equally likely.
 Assign a number to each member of the
population. Random numbers can be generated
by a random number table, software program or
a calculator. Data from members of the
population that correspond to these numbers
become members of the sample.
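
The numbering-and-drawing procedure above can be sketched in a few lines of Python. The population size (500), sample size (10) and seed are arbitrary assumptions chosen for illustration; `random.sample` draws without replacement, so every set of 10 numbers is equally likely.

```python
import random

def simple_random_sample(population_size, n, seed=None):
    """Number the members 1..population_size and draw n of them without replacement."""
    rng = random.Random(seed)  # the seed only makes the illustration repeatable
    return sorted(rng.sample(range(1, population_size + 1), n))

# Hypothetical population of 500 numbered members.
chosen = simple_random_sample(population_size=500, n=10, seed=1)
print(chosen)
```

The members whose numbers appear in `chosen` would then supply the sample data.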



 In order to be of use, a sample must be
representative of the population it comes
from. There are several ways of forming
random samples.
 Use of a random number table or the random
selection of numbers using a technology tool.



 Choose a starting value at random. Then
choose sample members at regular intervals.
 We say we choose every kth member. In this
example, k = 5. Every 5th member of the
population is selected.
 Advantages and disadvantages of a
systematic sample.
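
A minimal sketch of the every-kth-member rule, assuming a hypothetical population of 100 numbered members and k = 5 as in the example. The starting point is chosen at random among the first k members.

```python
import random

def systematic_sample(population, k, seed=None):
    """Pick a random start among the first k members, then take every kth member."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # index 0..k-1
    return population[start::k]

# Hypothetical population: members numbered 1 to 100.
members = list(range(1, 101))
sample = systematic_sample(members, k=5, seed=2)
print(sample)
```

One known weakness is visible in the code: if the list has a pattern that repeats every k members, the slice `population[start::k]` picks up only one phase of that pattern.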



 Convenience Sample: Choose readily available
members of the population for your sample.
Use results that are easy to get.
 Advantages and disadvantages of a
convenience sample.
 People who volunteer might have certain
characteristics that would bias a sample.



 Divide the population into groups (strata) that share
the same characteristics and select a random sample
from each group. Strata could be age groups,
genders or levels of education, for example.
 Different types of sampling techniques are used to
ensure the sample represents the population. A
stratified sample is used to ensure that various
segments of the population are represented in the
sample. If a random sample of diabetic people were
selected to test the effects of a new drug and you
wanted to be sure that people in various age groups
were included in your sample, you could separate the
population into three strata: people 25 and under,
people 26-50 and people over 50.
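
The age-group example can be sketched as follows. The list of people is hypothetical, the function names are made up for this illustration, and the boundaries assume that 25-year-olds belong to the youngest stratum. A simple random sample of the same size is drawn inside each stratum.

```python
import random

def age_stratum(age):
    """Assign an age to one of three strata (boundaries are an assumption)."""
    if age <= 25:
        return "25 and under"
    if age <= 50:
        return "26-50"
    return "over 50"

def stratified_sample(people, per_stratum, seed=None):
    """Group people by stratum, then draw a random sample from each group."""
    rng = random.Random(seed)
    strata = {}
    for person in people:
        strata.setdefault(age_stratum(person["age"]), []).append(person)
    return {name: rng.sample(group, per_stratum) for name, group in strata.items()}

# Hypothetical population of 300 people aged 18 to 77.
people = [{"id": i, "age": 18 + i % 60} for i in range(300)]
by_stratum = stratified_sample(people, per_stratum=5, seed=3)
for name, group in by_stratum.items():
    print(name, [p["id"] for p in group])
```

Drawing equal numbers per stratum is only one option; sampling each stratum in proportion to its share of the population is another common choice.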



 Divide the population into individual units or
groups and randomly select one or more
units. The sample consists of all members
from selected unit(s).
 A cluster is a naturally occurring group. If the
tests were done at 3 different clinics, each
cluster would be a clinic. Note that each clinic
would have people of all ages. Use this type
of sampling when all clusters have similar
characteristics.
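
A small sketch of the clinic example, with invented clinic names and patient labels. Whole clusters are selected at random, and every member of a selected cluster enters the sample.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose n_clusters clusters and keep all of their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = [member for name in chosen for member in clusters[name]]
    return chosen, sample

# Hypothetical clinics and patients, for illustration only.
clinics = {
    "Clinic A": ["a1", "a2", "a3"],
    "Clinic B": ["b1", "b2"],
    "Clinic C": ["c1", "c2", "c3", "c4"],
}
chosen_clinics, patients = cluster_sample(clinics, n_clusters=1, seed=4)
print(chosen_clinics, patients)
```

Contrast this with stratified sampling, where some members are drawn from every group rather than all members from a few groups.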



 Collect data by using some combination of
the basic sampling methods.

 In a multistage sample design, pollsters
select a sample in different stages, and each
stage might use different methods of
sampling.
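
One possible two-stage design, sketched under invented assumptions: stage 1 selects whole districts at random (a cluster step), and stage 2 draws a simple random sample of households inside each selected district. The district and household names are hypothetical, and real polls may combine stages differently.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: pick clusters at random. Stage 2: random sample within each."""
    rng = random.Random(seed)
    stage1 = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in stage1}

# Hypothetical frame: 8 districts with 50 households each.
districts = {
    f"district_{d}": [f"d{d}_household_{h}" for h in range(50)]
    for d in range(8)
}
picked = two_stage_sample(districts, n_clusters=3, n_per_cluster=5, seed=5)
for district, households in picked.items():
    print(district, households)
```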



 Random
 Systematic
 Convenience
 Stratified
 Cluster
 Multistage



 Different types of observational studies and
experimental designs.
 Cross-sectional study
◦ Data are observed, measured, and collected at one
point in time.
 Retrospective (or case control) study
◦ Data are collected from the past by going back in
time (examine records, interviews, and so on …).
 Prospective (or longitudinal or cohort) study
◦ Data are collected in the future from groups sharing
common factors (called cohorts).



 Randomization is used when subjects are
assigned to different groups through a
process of random selection.
 The logic is to use chance as a way to create
two groups that are similar.
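
Random assignment to two groups can be sketched as a shuffle followed by a split. The 20 numbered subjects and the group names are assumptions made for illustration.

```python
import random

def randomize(subjects, seed=None):
    """Shuffle the subjects and split them into two groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subjects numbered 1 to 20.
treatment_group, placebo_group = randomize(range(1, 21), seed=6)
print(sorted(treatment_group))
print(sorted(placebo_group))
```

Because chance alone decides who lands in which group, neither the researcher nor the subjects can steer particular kinds of people into one group.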



 Replication is the repetition of an experiment
on more than one subject.
◦ Samples should be large enough so that the erratic
behavior that is characteristic of very small samples
will not disguise the true effects of different
treatments.
◦ It is used effectively when there are enough
subjects to recognize the differences from different
treatments.
 Use a sample size that is large enough to let us see
the true nature of any effects, and obtain the
sample using an appropriate method, such as one
based on randomness.
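
The erratic behaviour of very small samples can be seen in a simulation. This is only a sketch: the true effect of 0.5, the normally distributed outcomes, and the group sizes of 5 and 500 are all assumptions chosen for illustration.

```python
import random
from statistics import mean, stdev

def estimated_effect(n, true_effect, rng):
    """Simulate one experiment with n subjects per group; return the observed difference in means."""
    control = [rng.gauss(0, 1) for _ in range(n)]
    treated = [rng.gauss(true_effect, 1) for _ in range(n)]
    return mean(treated) - mean(control)

rng = random.Random(7)
# Repeat the experiment 200 times at each sample size.
small = [estimated_effect(5, 0.5, rng) for _ in range(200)]
large = [estimated_effect(500, 0.5, rng) for _ in range(200)]

spread_small = stdev(small)  # how much the estimate jumps around with n = 5
spread_large = stdev(large)  # how much it jumps around with n = 500
print(spread_small, spread_large)
```

With only 5 subjects per group the estimated effect varies widely from one run to the next and can hide the true effect; with 500 per group the estimates cluster near it.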



 Blinding is a technique in which the subject
doesn’t know whether he or she is receiving a
treatment or a placebo.
 Blinding allows us to determine whether the
treatment effect is significantly different from
a placebo effect, which occurs when an
untreated subject reports improvement in
symptoms.



 Double-Blind
◦ Blinding occurs at two levels:
◦ (1) The subject doesn’t know whether he or she is
receiving the treatment or a placebo.
◦ (2) The experimenter does not know whether he or
she is administering the treatment or placebo.
 Confounding
◦ occurs in an experiment when the experimenter is
not able to distinguish between the effects of
different factors.
◦ Try to plan the experiment so that confounding
does not occur.



 Completely Randomized Experimental Design
◦ Assign subjects to different treatment groups through a
process of random selection.
 Randomized Block Design
◦ A block is a group of subjects that are similar, but blocks
differ in ways that might affect the outcome of the
experiment.
 Matched Pairs Design
◦ Compare exactly two treatment groups using subjects
matched in pairs that are somehow related or have similar
characteristics.
 Rigorously Controlled Design
◦ Carefully assign subjects to different treatment groups, so
that those given each treatment are similar in ways that are
important to the experiment.
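
To make the randomized block idea concrete, the sketch below assigns treatments at random within each block, which is the usual way such a design is carried out. The blocks, subject labels and treatment names are hypothetical.

```python
import random

def randomized_block_assignment(blocks, treatments, seed=None):
    """Within each block, shuffle the subjects and deal out treatments in turn."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, subjects in blocks.items():
        shuffled = list(subjects)
        rng.shuffle(shuffled)
        for i, subject in enumerate(shuffled):
            assignment[subject] = (block_name, treatments[i % len(treatments)])
    return assignment

# Hypothetical blocks of similar subjects.
blocks = {
    "men": ["m1", "m2", "m3", "m4", "m5", "m6"],
    "women": ["w1", "w2", "w3", "w4", "w5", "w6"],
}
treatments = ["treatment", "placebo"]
plan = randomized_block_assignment(blocks, treatments, seed=8)
for subject, (block, treatment) in sorted(plan.items()):
    print(subject, block, treatment)
```

A completely randomized design would instead shuffle all twelve subjects together, which could by chance leave one block over-represented in a treatment group.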





Summary
Three very important considerations in the
design of experiments are the following:
1. Use randomization to assign subjects to
different groups.
2. Use replication by repeating the experiment
on enough subjects so that effects of
treatment or other factors can be clearly seen.
3. Control the effects of variables by using such
techniques as blinding and a completely
randomized experimental design.
 No matter how well you plan and execute the sample
collection process, there is likely to be some error in
the results.
 Sampling error
◦ the difference between a sample result and the true
population result; such an error results from chance sample
fluctuations.
 Nonsampling error
◦ sample data incorrectly collected, recorded, or analyzed
(such as by selecting a biased sample, using a defective
instrument, or copying the data incorrectly).
 Nonrandom sampling error
◦ result of using a sampling method that is not random, such
as using a convenience sample or a voluntary response
sample.
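
Sampling error in the sense defined above can be observed directly when the whole population is known. The sketch below uses a made-up population of 10,000 values, so the true mean can be computed and compared with the means of five random samples of size 100.

```python
import random
from statistics import mean

rng = random.Random(9)

# Hypothetical population: 10,000 values between 0 and 40.
population = [rng.randint(0, 40) for _ in range(10_000)]
true_mean = mean(population)

# Sampling error = sample result minus true population result.
errors = [mean(rng.sample(population, 100)) - true_mean for _ in range(5)]

print(true_mean)
print(errors)
```

Each sample was drawn properly at random, yet each gives a slightly different error; this chance fluctuation is distinct from the nonsampling and nonrandom sampling errors listed above, which come from how the data were collected.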



 Data errors that arise from issues with survey responses.
◦ subject lies – the question may be too personal, or the subject tries to give
the socially acceptable response (examples: “Have you ever used an
illegal drug?” “Have you ever driven a car while intoxicated?”)
◦ subject makes a mistake – the subject may not remember the answer
(e.g., “How much money do you have invested in the stock
market?”)
◦ interviewer makes a mistake – in recording or understanding
the subject’s response
◦ interviewer cheating – the interviewer wants to speed things up, so
s/he makes up some answers and pretends the respondent gave
them.
◦ interviewer effects – vocal intonation, age, gender, race, clothing and
mannerisms of the interviewer may influence responses. An elderly
woman dressed very conservatively asking young people about
usage of illegal drugs may get different responses than a young
interviewer wearing jeans, with tattoos and a nose
ring.



 If the rate of response is low, the sample may not be representative.
The people who respond may be different from the rest of the
population. Usually, respondents are more educated and more
interested in the topic of the survey. Thus, it is important to achieve
a reasonably high rate of response. (How to do this? Use follow-ups.)
◦ Which sample is better?

                    Sample 1      Sample 2
Sample size         n = 2,000     n = 1,000,000
Rate of response    90%           20%

◦ Answer: A small but representative sample can be useful in making inferences, but a
large and probably unrepresentative sample is useless, and there is no way to correct for it.
Thus, sample 1 is better than sample 2.
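
One way to see why the low response rate matters more than the sample size is a worst-case calculation. Suppose, purely as an assumption, that 60% of respondents in each sample answer "yes" to some question. Nothing is known about the nonrespondents, so the share among everyone selected could lie anywhere between "no nonrespondent says yes" and "every nonrespondent says yes". The sketch ignores ordinary sampling error.

```python
def worst_case_bounds(respondent_share, response_rate):
    """Range of the share among all selected people when nonrespondents are unknown."""
    low = response_rate * respondent_share  # no nonrespondent says yes
    high = low + (1 - response_rate)        # every nonrespondent says yes
    return low, high

# Sample sizes and response rates from the table; the 60% share is hypothetical.
samples = {
    "Sample 1": {"n": 2_000, "rate": 0.90},
    "Sample 2": {"n": 1_000_000, "rate": 0.20},
}

results = {}
for name, info in samples.items():
    respondents = round(info["n"] * info["rate"])
    low, high = worst_case_bounds(0.60, info["rate"])
    results[name] = (respondents, low, high)
    print(f"{name}: {respondents} respondents, share between {low:.2f} and {high:.2f}")
```

Under this assumption Sample 1 pins the share down to roughly 0.54 to 0.64, while Sample 2, despite its 200,000 respondents, leaves it anywhere from about 0.12 to 0.92.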



Comments?
