
Chapter 1

Defining and Collecting Data

ALWAYS LEARNING Copyright © 2020 Pearson Education Ltd.


Objectives
In this chapter you learn:
 To understand issues that arise when defining
variables.
 How to define variables.
 To understand the different measurement scales.
 How to collect data.
 To identify different ways to collect a sample.
 To understand the issues involved in data
preparation.
 To understand the types of survey errors.



Classifying Variables By Type
DCOVA
 Categorical (qualitative) variables take categories
as their values such as “yes”, “no”, or “blue”,
“brown”, “green”.

 Numerical (quantitative) variables have values that
represent a counted or measured quantity.
 Discrete variables arise from a counting process.
 Continuous variables arise from a measuring process.



Examples of Types of Variables
DCOVA

Question                                                        Responses          Variable Type
Do you have a Facebook profile?                                 Yes or No          Categorical
How many text messages have you sent in the past three days?    ---------------    Numerical (discrete)
How long did the mobile app update take to download?            ---------------    Numerical (continuous)



Types of Variables
DCOVA
Variables
  Categorical
    Nominal (Defined Categories). Examples: Marital Status, Political Party, Eye Color
    Ordinal (Ordered Categories). Examples: Ratings (Good, Better, Best; Low, Med, High)
  Numerical
    Discrete (Counted items). Examples: Number of Children, Defects per hour
    Continuous (Measured characteristics). Examples: Weight, Voltage



DCOVA
Measurement Scales
A nominal scale classifies data into distinct categories in
which no ranking is implied.
Categorical Variables              Categories
Do you have a Facebook profile?    Yes, No
Type of investment                 Growth, Value, Other
Cellular Provider                  AT&T, Sprint, Verizon, Other, None

EX: Four different beverages are sold at a fast-food restaurant: soft
drinks, tea, coffee, and bottled water.



Measurement Scales (cont.)
DCOVA
An ordinal scale classifies data into distinct
categories in which ranking is implied.

Categorical Variable              Ordered Categories
Student class designation         Freshman, Sophomore, Junior, Senior
Product satisfaction              Very unsatisfied, Fairly unsatisfied, Neutral, Fairly satisfied, Very satisfied
Faculty rank                      Professor, Associate Professor, Assistant Professor, Instructor
Standard & Poor’s bond ratings    AAA, AA, A, BBB, BB, B, CCC, CC, C, DDD, DD, D
Student Grades                    A, B, C, D, F

EX: U.S. businesses are listed by size: small, medium, and large.



Measurement Scales (cont.)
DCOVA
 An interval scale is an ordered scale in which the
difference between measurements is a meaningful
quantity but the measurements do not have a true
zero point.

 A ratio scale is an ordered scale in which the
difference between the measurements is a
meaningful quantity and the measurements have a
true zero point. EX: The time it takes to download a
video from the Internet is measured.
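
The interval/ratio distinction can be checked numerically: on a ratio scale, the ratio of two measurements does not depend on the unit, while on an interval scale it does. The following is a minimal sketch only. The temperature example and every numeric value are illustrative assumptions that do not come from the slides.

```python
# Ratio scale: download time has a true zero point, so "twice as long"
# means the same thing in seconds and in minutes.
time_a_sec, time_b_sec = 20.0, 10.0   # hypothetical download times
ratio_in_seconds = time_a_sec / time_b_sec
ratio_in_minutes = (time_a_sec / 60) / (time_b_sec / 60)

# Interval scale: temperature in degrees Celsius or Fahrenheit has an
# arbitrary zero point, so the ratio changes when the unit changes.
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32

temp_a_c, temp_b_c = 20.0, 10.0       # hypothetical temperatures
ratio_in_celsius = temp_a_c / temp_b_c
ratio_in_fahrenheit = (
    celsius_to_fahrenheit(temp_a_c) / celsius_to_fahrenheit(temp_b_c)
)

print(ratio_in_seconds, ratio_in_minutes)      # same ratio in both units
print(ratio_in_celsius, ratio_in_fahrenheit)   # ratio depends on the unit
```

Differences remain meaningful on both scales; only statements such as "twice as much" require a true zero point.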

Interval and Ratio Scales
DCOVA



Exercise 1:

For each of the following variables, determine whether the
variable is categorical or numerical and determine its
measurement scale. If the variable is numerical, determine
whether the variable is discrete or continuous.
a. Number of cellphones in the household
b. Monthly data usage (in MB)
c. Number of text messages exchanged per month
d. Voice usage per month (in minutes)
e. Whether the cellphone is used for email (yes / no)



Exercise 2:

Suppose the following information is collected from Robert
Keeler on his application for a home mortgage loan at the Metro
County Savings and Loan Association.
a. Monthly payments: $2,227
b. Number of jobs in past 10 years: 1
c. Annual family income: $96,000
d. Marital status: Married

Classify each of the responses by type of data and measurement
scale.



Exercise 3:
For each of the following variables, determine whether the
variable is categorical or numerical and determine its measurement
scale. If the variable is numerical, determine whether the variable
is discrete or continuous.
a. Name of Internet service provider
b. Time, in hours, spent surfing the Internet per week
c. Whether the individual uses a mobile phone to connect to the
Internet
d. Amount of money spent on clothing in the past month
e. Favorite department store
f. Most likely time period during which shopping for clothing
takes place (weekday, weeknight, or weekend)
g. Number of pairs of shoes owned.



Data Is Collected From Either A
Population or A Sample
DCOVA

POPULATION
A population contains all of the items or
individuals of interest that you seek to study.

SAMPLE
A sample contains only a portion of a
population of interest.



Population vs. Sample DCOVA

Population                                  Sample
All the items or individuals about which    A portion of the population
you want to draw conclusion(s).             of items or individuals.
A Population of Size 40                     A Sample of Size 4



Collecting Data Via Sampling Is Used
When Doing So Is
DCOVA

 Less time consuming than selecting every item
in the population.

 Less costly than selecting every item in the
population.

 Less cumbersome and more practical than
analyzing the entire population.



Parameter or Statistic? DCOVA

 A population parameter summarizes the value
of a specific variable for a population.

 A sample statistic summarizes the value of a
specific variable for sample data.



1. Parameter: There are exactly 100 Senators in the 109th Congress of
the United States, and 55% of them are Republicans. The figure of
55% is a parameter because it is based on the entire population of all
100 Senators.
2. Statistic: In 1936, Literary Digest polled 2.3 million adults in the
United States, and 57% said that they would vote for Alf Landon for
the presidency.
That figure of 57% is a statistic because it is based on a sample, not
the entire population of all adults in the United States.
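
The same distinction can be sketched in code: a parameter is computed from every item in the population, a statistic from a sample only. This is a minimal illustration. The population values, the seed, and the sizes (40 and 4) are made-up assumptions, not data from the slides.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 40 values (e.g., ages); purely illustrative.
population = [rng.randint(18, 65) for _ in range(40)]

# Parameter: summarizes the variable for the entire population.
parameter = mean(population)

# Statistic: summarizes the variable for a sample of 4 items.
sample = rng.sample(population, 4)
statistic = mean(sample)

print("population mean (parameter):", parameter)
print("sample mean (statistic):", statistic)
```

The statistic will usually differ from the parameter, and it changes when a different sample is drawn.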



Sources Of Data Arise From
The Following Activities DCOVA
 Capturing data generated by ongoing business
activities.
 Distributing data compiled by an organization or
individual.
 Compiling the responses from a survey.
 Conducting a designed experiment and
recording the outcomes.
 Conducting an observational study and
recording the results.



Examples of Data Collected From
Ongoing Business Activities
DCOVA
 A bank studies years of financial transactions to
help it identify patterns of fraud.

 Economists utilize data on searches done via
Google to help forecast future economic
conditions.

 Marketing companies use tracking data to
evaluate the effectiveness of a web site.



Examples Of Data Distributed
By An Organization or Individual
DCOVA
 Financial data on a company provided by
investment services.

 Industry or market data from market research
firms and trade associations.

 Stock prices, weather conditions, and sports
statistics in daily newspapers.



Examples of Survey Data
DCOVA
 A survey asking people which laundry detergent
has the best stain-removing abilities.

 Political polls of registered voters during
political campaigns.

 People being surveyed to determine their
satisfaction with a recent product or service
experience.



Examples of Data From A
Designed Experiment
DCOVA
 Consumer testing of different versions of a
product to help determine which product should
be pursued further.

 Material testing to determine which supplier’s
material should be used in a product.

 Market testing on alternative product
promotions to determine which promotion to
use more broadly.

Examples of Data Collected
From Observational Studies
DCOVA
 Market researchers utilizing focus groups to
elicit unstructured responses to open-ended
questions.

 Measuring the time it takes for customers to be
served in a fast food establishment.

 Measuring the volume of traffic through an
intersection to determine if some form of
advertising at the intersection is justified.

Observational Studies & Designed
Experiments Have A Common Objective
DCOVA
 Both are attempting to quantify the effect that a
process change (called a treatment) has on a
variable of interest.

 In an observational study, there is no direct
control over which items receive the treatment.

 In a designed experiment, there is direct control
over which items receive the treatment.



Sources of Data DCOVA

 Primary Sources: The data collector is the one
using the data for analysis:
 Data from a political survey.
 Data collected from an experiment.
 Observed data.
 Secondary Sources: The person performing
data analysis is not the data collector:
 Analyzing census data.
 Examining data from print journals or data published
on the internet.



1. The American Community Survey (www.census.gov/acs) provides
data every year about communities in the United States. Addresses are
randomly selected, and respondents are required to supply answers to a
series of questions.
a. Which of the sources of data best describes the American Community
Survey?
b. Is the American Community Survey based on a sample or a
population?



A Sampling Process Begins With A
Sampling Frame
DCOVA

 The sampling frame is a listing of items that
make up the population.
 Frames are data sources such as population
lists, directories, or maps.
 Inaccurate or biased results can occur if a
frame excludes certain groups or portions of the
population.
 Using different frames to generate data can
lead to dissimilar conclusions.



Types of Samples DCOVA

Samples
  Nonprobability Samples: Judgment, Convenience
  Probability Samples: Simple Random, Systematic, Stratified, Cluster



Types of Samples:
Nonprobability Sample DCOVA

 In a nonprobability sample, items included are
chosen without regard to their probability of
occurrence.

 In convenience sampling, items are selected based
only on the fact that they are easy, inexpensive, or
convenient to sample.

 In a judgment sample, you get the opinions of
pre-selected experts in the subject matter.



Types of Samples:
Probability Sample DCOVA

 In a probability sample, items in the
sample are chosen on the basis of known
probabilities.

Probability Samples: Simple Random, Systematic, Stratified, Cluster



Probability Sample:
Simple Random Sample DCOVA

 Every individual or item from the frame has an
equal chance of being selected.

 Selection may be with replacement (selected
individual is returned to frame for possible
reselection) or without replacement (selected
individual isn’t returned to the frame).

 Samples are obtained from a table of random
numbers or computer random number
generators.
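
With a computer random number generator, the two selection modes can be sketched as follows. This is an illustration under assumptions: the frame is a hypothetical list of item numbers 1 to 850, and the seed and sample size of 5 are arbitrary.

```python
import random

rng = random.Random(42)        # fixed seed so the sketch is repeatable
frame = list(range(1, 851))    # hypothetical frame: item numbers 1..850

# Without replacement: a selected item is not returned to the frame,
# so no item can appear twice in the sample.
without_replacement = rng.sample(frame, 5)

# With replacement: a selected item is returned to the frame,
# so the same item may be selected again.
with_replacement = rng.choices(frame, k=5)

print(without_replacement)
print(with_replacement)
```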

Selecting a Simple Random Sample
Using A Random Number Table DCOVA

Sampling Frame For Population With 850 Items

Item Name    Item #
Bev R.       001
Ulan X.      002
.            .
.            .
.            .
Joann P.     849
Paul F.      850

Portion Of A Random Number Table

49280 88924 35779 00283 81163 07275
11100 02340 12860 74697 96644 89439
09893 23997 20048 49420 88872 08401

The First 5 Items in a simple random sample

Item # 492
Item # 808
Item # 892 -- does not exist so ignore
Item # 435
Item # 779
Item # 002
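
The table lookup above can be reproduced in code by reading the digits three at a time and ignoring codes that are not in the frame. The function name `select_from_table` is hypothetical, and skipping repeated codes (sampling without replacement) is an assumption; the slide does not say which mode is used.

```python
def select_from_table(table_rows, frame_size, sample_size, digits=3):
    """Read a random number table `digits` at a time, left to right."""
    # Join all rows into one continuous stream of digits.
    stream = "".join("".join(row.split()) for row in table_rows)
    selected = []
    for start in range(0, len(stream) - digits + 1, digits):
        code = int(stream[start:start + digits])
        # Ignore codes outside the frame (000 or above frame_size)
        # and, by assumption, codes that were already selected.
        if 1 <= code <= frame_size and code not in selected:
            selected.append(code)
        if len(selected) == sample_size:
            break
    return selected

table = [
    "49280 88924 35779 00283 81163 07275",
    "11100 02340 12860 74697 96644 89439",
    "09893 23997 20048 49420 88872 08401",
]
first_five = select_from_table(table, frame_size=850, sample_size=5)
print(first_five)  # [492, 808, 435, 779, 2]
```

Code 892 is read but ignored because the frame has only 850 items, which matches the slide.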



Probability Sample:
Systematic Sample DCOVA
 Decide on sample size: n
 Divide frame of N individuals into groups of k
individuals: k=N/n
 Randomly select one individual from the 1st
group
 Select every kth individual thereafter
EX: N = 40, n = 4, k = 10 (the first group contains 10 individuals).
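
The steps above can be sketched as follows, using the slide's values N = 40 and n = 4. The function name `systematic_sample` and the seed are assumptions, and the sketch assumes N is an exact multiple of n.

```python
import random

def systematic_sample(frame, n, rng):
    """Pick one random item from the first group, then every kth item."""
    k = len(frame) // n           # k = N / n (assumes N is a multiple of n)
    start = rng.randrange(k)      # random position within the first group
    return frame[start::k][:n]    # every kth individual thereafter

frame = list(range(1, 41))        # hypothetical frame: individuals 1..40
sample = systematic_sample(frame, n=4, rng=random.Random(7))
print(sample)                     # four individuals spaced k = 10 apart
```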



Probability Sample:
Stratified Sample DCOVA

 Divide population into two or more subgroups (called
strata) according to some common characteristic.
 A simple random sample is selected from each subgroup,
with sample sizes proportional to strata sizes.
 Samples from subgroups are combined into one.
 This is a common technique when sampling a population of
voters, stratifying across racial or socio-economic lines.
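
A minimal sketch of proportional stratified sampling follows. The strata names, their sizes (60, 30, 10), and the function name `stratified_sample` are hypothetical. Rounding is used for the allocation, so in general the stratum sizes may not add up exactly to n.

```python
import random

def stratified_sample(strata, n, rng):
    """Draw a simple random sample from each stratum, sized proportionally."""
    total = sum(len(items) for items in strata.values())
    combined = []
    for items in strata.values():
        size = round(n * len(items) / total)     # proportional allocation
        combined.extend(rng.sample(items, size)) # SRS within the stratum
    return combined                              # samples combined into one

# Hypothetical population of 100 items split into three strata.
strata = {
    "A": [("A", i) for i in range(60)],
    "B": [("B", i) for i in range(30)],
    "C": [("C", i) for i in range(10)],
}
sample = stratified_sample(strata, n=10, rng=random.Random(3))
counts = {name: sum(1 for s, _ in sample if s == name) for name in strata}
print(counts)  # {'A': 6, 'B': 3, 'C': 1}
```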



Probability Sample
Cluster Sample DCOVA

 Population is divided into several “clusters,” each representative of
the population.

 A simple random sample of clusters is selected.

 All items in the selected clusters can be used, or items can be
chosen from a cluster using another probability sampling technique.

 A common application of cluster sampling involves election exit polls,
where certain election districts are selected and sampled.

Figure: a population divided into 16 clusters, with randomly selected
clusters forming the sample.
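
Cluster sampling can be sketched in the same way. The 16 clusters match the figure; the cluster size of 5, the choice of 4 clusters, and the seed are assumptions made for illustration. Here every item in a selected cluster is used.

```python
import random

rng = random.Random(5)  # fixed seed so the sketch is repeatable

# Hypothetical population: 16 clusters of 5 items each.
clusters = {
    c: [f"cluster{c}-item{i}" for i in range(5)]
    for c in range(16)
}

# Step 1: take a simple random sample of clusters.
chosen_clusters = rng.sample(sorted(clusters), 4)

# Step 2: use all items in the selected clusters.
sample = [item for c in chosen_clusters for item in clusters[c]]

print(chosen_clusters)
print(len(sample))  # 20
```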



Probability Sample:
Comparing Sampling Methods
DCOVA
 Simple random sample and Systematic sample:
 Simple to use.
 May not be a good representation of the population’s
underlying characteristics.
 Stratified sample:
 Ensures representation of individuals across the
entire population.
 Cluster sample:
 More cost effective.
 Less efficient (need larger sample to acquire the
same level of precision).



The registrar of a university with a population of N = 4,000
full-time students is asked by the president to conduct a survey to
measure satisfaction with the quality of life on campus. The
following table contains a breakdown of the 4,000 registered
full-time students, by gender and class designation:

The registrar intends to take a probability sample of n = 200
students and project the results from the sample to the entire
population of full-time students.
a. If the frame available from the registrar’s files is an alphabetical
listing of the names of all N = 4,000 registered full-time
students, what type of sample could you take? Discuss.
b. What is the advantage of selecting a simple random sample
in (a)?
c. What is the advantage of selecting a systematic sample in (a)?
d. If the frame available from the registrar’s files is a list of the
names of all N = 4,000 registered full-time students compiled
from eight separate alphabetical lists, based on the gender and
class designation breakdowns shown in the class designation
table, what type of sample should you take? Discuss.
e. Suppose that each of the N = 4,000 registered full-time students lived
in one of the 10 campus dormitories. Each dormitory
accommodates 400 students. It is college policy to fully integrate
students by gender and class designation in each dormitory. If
the registrar is able to compile a listing of all students by dormitory,
explain how you could take a cluster sample.



Data Cleaning Is An Important Data
Preprocessing Task Prior To Analysis
DCOVA
Data cleaning corrects irregularities in the data:
 Invalid variable values, including:
 Non-numerical data for numerical variable.
 Invalid categorical values for a categorical variable.
 Numeric values outside a defined range.
 Coding errors, including:
 Inconsistent categorical values.
 Inconsistent case for categorical values.
 Extraneous characters.
 Data integration errors, including:
 Redundant columns.
 Duplicated rows.
 Differing column lengths.
 Different units of measure or scale for numerical variables.
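
Several of these checks can be automated. The sketch below flags, but does not fix, four of the irregularities listed above. The two-column layout (gender, age), the valid codes, the age range of 18 to 100, and the function name `find_problems` are all assumptions made for illustration.

```python
def find_problems(rows, valid_genders=("F", "M"), age_range=(18, 100)):
    """Return (row index, problem description) pairs for suspect rows."""
    problems = []
    seen = set()
    for index, row in enumerate(rows):
        if row in seen:                       # duplicated rows
            problems.append((index, "duplicated row"))
        seen.add(row)
        gender, age = row
        if gender not in valid_genders:       # invalid categorical value
            problems.append((index, "invalid categorical value"))
        try:
            age_value = float(age)
        except ValueError:                    # non-numerical data
            problems.append((index, "non-numerical data"))
            continue
        if not age_range[0] <= age_value <= age_range[1]:
            problems.append((index, "numeric value outside defined range"))
    return problems

# Hypothetical raw data: (gender, age) as imported text.
rows = [("F", "34"), ("M", "abc"), ("New York", "41"), ("M", "250"), ("F", "34")]
for problem in find_problems(rows):
    print(problem)
```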



Examples Of Coding Errors
DCOVA
Copy-and-paste or data import can result in poor
recording or entry of data.

Categorical variable: Gender. Correct coding: F or M.
 Correctable error: Female.
 Invalid data: New York.
 Correctable or software tolerated: m.
 Extraneous and nonprintable characters:
 Leading or trailing space(s): _F or F_.
 Other nonprintable characters may also be leading or trailing
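
The correctable cases above can be handled with a small cleaning function. This is a sketch under assumptions: the name `clean_gender` and the alias table are hypothetical, and `str.strip()` removes only leading and trailing whitespace, so other nonprintable characters would need separate handling.

```python
def clean_gender(raw):
    """Map a raw entry to 'F' or 'M', or None if it cannot be corrected."""
    value = raw.strip().upper()               # drop leading/trailing spaces, unify case
    aliases = {"FEMALE": "F", "MALE": "M"}    # assumed spelled-out variants
    value = aliases.get(value, value)
    return value if value in ("F", "M") else None   # None flags invalid data

for raw in ["Female", "New York", "m", " F", "F "]:
    print(repr(raw), "->", clean_gender(raw))
```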



Data Integration Errors From Combining
Two Different Computerized Data Sources
DCOVA
 Correcting data integration errors often requires
time-consuming manual effort.
 Some examples:
 Variable names or definitions may differ.
 Duplicated rows (observations) may also occur.
 Different units of measurement (or scale) may not be
obvious without human interpretation.



Stacked vs Unstacked Data
DCOVA
 For unstacked data you create separate
numerical variables for different groups (e.g.,
genders or locations).

 For stacked data you create a single column for
the variable of interest and create additional
columns for the potential grouping variables.
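
The two layouts hold the same information and can be converted into each other. The sketch below uses made-up values for two hypothetical locations, "North" and "South".

```python
# Unstacked: one numerical variable (column) per group.
unstacked = {
    "North": [12, 15, 11],
    "South": [20, 18, 22],
}

# Stacked: one column for the grouping variable, one for the variable of interest.
stacked = [
    (group, value)
    for group, values in unstacked.items()
    for value in values
]
print(stacked)

# Converting back: collect the values for each group again.
restacked = {}
for group, value in stacked:
    restacked.setdefault(group, []).append(value)
print(restacked == unstacked)  # True
```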



An amusement park company owns three hotels on an adjoining
site. A guest relations manager wants to study the time it takes
for shuttle buses to travel from each of the hotels to the
amusement park entrance. Data were collected on a particular
day that recorded the travel times in minutes.
a. Explain how the data could be organized in an unstacked
format.
b. Explain how the data could be organized in a stacked format



A hotel management company runs 10 hotels in a resort area. The
hotels have a mix of pricing: some hotels have budget-priced rooms,
some have moderate-priced rooms, and some have deluxe-priced rooms.
Data are collected that indicate the number of rooms that are occupied at
each hotel on each day of a month.
Explain how the 10 hotels can be recoded into these three price
categories.



Evaluating Survey Worthiness
DCOVA
 What is the purpose of the survey?
 Is the survey based on a probability sample?
 Coverage error – appropriate frame?
 Nonresponse error – follow up.
 Measurement error – good questions elicit good
responses.
 Sampling error – always exists.



Types of Survey Errors DCOVA

 Coverage error or selection bias:
 Exists if some groups are excluded from the frame and have
no chance of being selected.
 Nonresponse error or bias:
 People who do not respond may be different from those who
do respond.
 Sampling error:
 Variation from sample to sample will always exist.
 Measurement error:
 Due to weaknesses in question design and / or respondent
error.



Types of Survey Errors (continued)
DCOVA

 Coverage error: Excluded from frame

 Nonresponse error: Follow up on nonresponses

 Sampling error: Random differences from sample to sample

 Measurement error: Bad or leading question

Nonresponse Error
Not everyone is willing to respond to a survey. Nonresponse error
arises from failure to collect data on all items in the sample and results
in a nonresponse bias.



Ethical Issues About Surveys
DCOVA
 Coverage error and nonresponse error can be
leveraged by survey designers to purposely
bias survey results.
 Sampling error can be an ethical issue if the
findings are purposely not reported with the
associated margin of error.
 Measurement error can be an ethical issue:
 Survey sponsor chooses leading questions.
 Interviewer purposely leads respondents in a
particular direction.
 Respondent(s) willfully provide false information.

Chapter Summary
In this chapter we have discussed:
 Understanding issues that arise when defining
variables.
 How to define variables.
 Understanding the different measurement scales.
 How to collect data.
 Identifying different ways to collect a sample.
 Understanding the issues involved in data
preparation.
 Understanding the types of survey errors.



1. What is the difference between a sample and a population?
2. What is the difference between a statistic and a parameter?
3. What is the difference between a categorical variable and a
numerical variable?
4. What is the difference between a discrete numerical variable
and a continuous numerical variable?
5. What is the difference between a nominal scaled variable and
an ordinal scaled variable?
6. What is the difference between an interval scaled variable and
a ratio scaled variable?
7. What is the difference between probability sampling and nonprobability sampling?
8. What is the difference between a missing value and an outlier?
9. What is the difference between unstacked and stacked variables?
10. What is the difference between coverage error and nonresponse error?
11. What is the difference between sampling error and measurement error?



1. Results of a 2017 Computer Services, Inc. (CSI) survey of a sample of
163 bank executives reveal insights on banking priorities among
financial institutions (goo.gl/mniYMM). As financial institutions begin
planning for a new year, of utmost importance is boosting profitability
and identifying growth areas.
The results show that 55% of bank institutions note customer experience
initiatives as an area in which spending is expected to increase.
Implementing a customer relationship management (CRM) solution was
ranked as the top most important omnichannel strategy to pursue with
41% of institutions citing digital banking enhancements as the greatest
anticipated strategy to enhance the customer experience.
a. Describe the population of interest.
b. Describe the sample that was collected.
c. Describe a parameter of interest.
d. Describe the statistic used to estimate the parameter in (c)



2. Three professors examined awareness of four widely
disseminated retirement rules among employees at the University
of Utah. These rules provide simple answers to questions about
retirement planning (R. N. Mayer, C. D. Zick, and M. Glaittle,
“Public Awareness of Retirement Planning Rules of Thumb,”
Journal of Personal Finance, 2011 10(1), 12–35). At the time of
the investigation, there were approximately 10,000 benefited
employees, and 3,095 participated in the study. Demographic data
collected on these 3,095 employees included gender, age (years),
education level (years completed), marital status, household
income ($), and employment category.
a. Describe the population of interest.
b. Describe the sample that was collected.
c. Indicate whether each of the demographic variables mentioned
is categorical or numerical.



3. IBM Survey The computer giant IBM has 329,373 employees
and 637,133 stockholders. A vice president plans to conduct a
survey to study the numbers of shares held by individual
stockholders.
a. Are the numbers of shares held by stockholders discrete or
continuous?
b. Identify the level of measurement (nominal, ordinal, interval,
ratio) for the numbers of shares held by stockholders.
c. If the survey is conducted by telephoning 20 randomly selected
stockholders in each of the 50 United States, what type of
sampling (random, systematic, convenience, stratified, cluster) is
being used?
d. If a sample of 1000 stockholders is obtained, and the average
(mean) number of shares is calculated for this sample, is the result
a statistic or a parameter?
e. What is wrong with gauging stockholder views about employee
benefits by mailing a questionnaire that IBM stockholders could
complete and mail back?

4. IBM Survey Identify the type of sampling (random, systematic,
convenience, stratified, cluster) used when a sample of the 637,133
stockholders is obtained as described. Then determine whether the sampling
scheme is likely to result in a sample that is representative of the population
of all 637,133 stockholders.
a. A complete list of all stockholders is compiled and every 500th name is
selected.
b. At the annual stockholders’ meeting, a survey is conducted of all who
attend.
c. Fifty different stockbrokers are randomly selected, and a survey is made of
all their clients who own shares of IBM.
d. A computer file of all IBM stockholders is compiled so that they are all
numbered consecutively, then random numbers generated by computer are
used to select the sample of stockholders.
e. All of the stockholder zip codes are collected, and 5 stockholders are
randomly selected from each zip code.

