
Levels of Measurement

Nominal: Qualitative variables whose values cannot be put into a meaningful order.
Eye color, race/ethnicity, ZIP code, political affiliation, …

Ordinal: Qualitative variables whose values can be put into a meaningful order.
Level of agreement, socioeconomic status, …

Interval: Quantitative variables whose values lie on a scale with no natural starting point.
Differences between values are meaningful, but ratios are not.
Birth year, temperature (Fahrenheit or Celsius), …

Ratio: Quantitative variables whose values lie on a scale with a natural starting point. Both
differences and ratios are meaningful.
Age, height, salary, GPA, …

A study was conducted to assess student eating patterns in high schools across the United
States. The study analyzed the impact of vending machines and school policies on student food
consumption. Determine the level of measurement of the following variables considered in the
study.
a. Number of snack and soft drink vending machines in the school
Ratio
b. Class rank (Freshman, Sophomore, Junior, Senior)
Ordinal
c. Whether or not the school has a closed campus policy during lunch
Nominal

d. Number of days per week a student eats school lunch
Ratio

1.3 & 1.4 Sampling Methods

Populations and Samples


When conducting a statistical study, the individuals under investigation are called the
population of interest.
When the population is small, it may be possible to collect information from every individual.
The process of attempting to collect information from every individual in the population is
called a census.
When the population is large, it may not be possible to collect information from every
individual, in which case researchers may try to collect information from a subset of the
population called a sample. The process of collecting information from a sample is called a
survey.
Information collected from a group of individuals (either a population or a sample) is called
data.

A number that represents a property of the population is called a parameter.


A number that represents a property of a sample is called a statistic.
Calculating the exact value of a parameter is difficult because it requires collecting data from
the entire population (i.e., a census). Therefore, we usually just estimate its value by conducting
a survey.
A statistic is an estimate. A parameter is the number we are trying to estimate!
Types of Sampling
In order for a statistic to be a good estimate of a parameter, the sample used to calculate the
statistic needs to be representative of the entire population. Thus, the method used to select
the sample can have a major impact on the accuracy of our estimate.
The following sampling methods all have advantages and disadvantages:

• Convenience sampling
• Simple random sampling
• Systematic sampling
• Cluster sampling
• Stratified sampling
Convenience sampling selects participants based on how convenient it is to collect data from
them.

• Pros: cheap and easy
• Cons: unlikely to be representative of the population, self-selection bias (e.g.,
RateMyProfessor.com)

Random sampling selects participants in such a way that each individual has an equal chance of
being selected.
Simple random sampling selects participants in such a way that every possible sample has an
equal chance of being selected.

• Pros: highly likely to be representative of the population, hence fair and unbiased
• Cons: expensive, difficult to implement
How to generate random numbers

• TI-83/84 Plus: MATH > PROB > randIntNoRep(
• Microsoft Excel: =RANDBETWEEN(lower, upper)
• Random.org
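
The tools above can also be replaced by a few lines of code. The following is a minimal sketch in Python, which is not one of the tools named in these notes; the function name simple_random_sample and the roster size of 76 students are assumptions chosen for illustration.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Return sample_size distinct roster numbers from 1..population_size."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so every possible subset of
    # the requested size is equally likely to be chosen.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

# Hypothetical roster of 76 students, sample of 10.
roster_sample = simple_random_sample(76, 10, seed=1)
print(roster_sample)
```

Passing a seed only makes the draw repeatable for demonstration; leave it out to get a fresh sample each time.
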
Systematic sampling picks a random starting point, then selects every kth individual after that,
where
k = N / n ↓ (round down)

To select a systematic sample of size n from your class roster, divide the total number of
students N by the desired sample size n, then round down. For example, if N = 76 and n = 10,
k = 76/10 = 7.6 ↓ = 7
Use Random.org to choose a random number p between 1 and k. List every kth number starting
with p. For example, if p = 4 and k = 7,
4, 11, 18, 25, 32, 39, 46, 53, 60, 67

• Pros: straightforward, likely to be representative …
• Cons: … unless the population follows a pattern
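
The systematic sampling procedure can be sketched in code as follows. This is an illustrative Python version, not part of the original notes; the function name systematic_sample is an assumption, and random.randint stands in for Random.org when no starting point is supplied.

```python
import random

def systematic_sample(population_size, sample_size, start=None):
    """List the positions chosen by a systematic sample."""
    k = population_size // sample_size  # N / n, rounded down
    if start is None:
        # Random starting point p between 1 and k (inclusive).
        start = random.randint(1, k)
    # Take the starting position, then every kth position after it.
    return [start + i * k for i in range(sample_size)]

# Class roster example from the notes: N = 76, n = 10, p = 4.
print(systematic_sample(76, 10, start=4))
# [4, 11, 18, 25, 32, 39, 46, 53, 60, 67]
```
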

Cluster sampling involves dividing the population into groups, called clusters, then randomly
selecting some of the clusters. All the members from the chosen clusters are included in the
sample.
For example, divide the class into rows, then randomly select three rows. All the students
sitting in those rows belong to the sample.

• Pros: more convenient than simple random sampling, likely to be representative of the
population if each cluster is heterogeneous, i.e., consists of dissimilar individuals.
• Cons: may not be representative of the population if the individuals within each cluster
have something in common (besides being in the same cluster)
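
A minimal Python sketch of cluster sampling, using the classroom-rows idea. The classroom layout (5 rows of 4 seats), the student labels, and the function name cluster_sample are all hypothetical and chosen only for illustration.

```python
import random

def cluster_sample(clusters, num_clusters, seed=None):
    """Randomly pick whole clusters and return every member of each."""
    rng = random.Random(seed)
    # Choose the clusters at random, not the individuals.
    chosen = rng.sample(sorted(clusters), num_clusters)
    # Everyone in a chosen cluster joins the sample.
    return [member for name in chosen for member in clusters[name]]

# Hypothetical classroom: 5 rows with 4 students in each row.
rows = {f"row {r}": [f"student {r}-{s}" for s in range(1, 5)]
        for r in range(1, 6)}

sample = cluster_sample(rows, 3, seed=0)
print(len(sample))  # 12: three whole rows of four students each
```
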

Stratified sampling involves dividing the population into subgroups, called strata, then taking a
proportionately sized sample from each stratum.

• Pros: allows the researchers to control for confounding variables
• Cons: complicated, requires knowing a lot about the participants

How might the researchers studying the effectiveness of the Moderna COVID-19 vaccine stratify
their sample?
Possible answers include: age, race/ethnicity, gender, body-mass index, pre-existing health
conditions …
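
To make "proportionately sized" concrete, here is a rough Python sketch of stratified sampling. It assumes that individuals are drawn at random within each stratum, which the definition above does not spell out; the strata (class ranks with 60, 30, and 10 students) and the function name stratified_sample are hypothetical.

```python
import random

def stratified_sample(strata, sample_size, seed=None):
    """Draw from each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = []
    for name, members in strata.items():
        # Proportional allocation. Rounding can make the total differ
        # slightly from sample_size when the shares are not whole numbers.
        share = round(sample_size * len(members) / total)
        # Assumption: a simple random draw within the stratum.
        sample.extend(rng.sample(members, share))
    return sample

# Hypothetical population of 100 students split by class rank.
strata = {
    "freshman": [f"F{i}" for i in range(60)],
    "sophomore": [f"S{i}" for i in range(30)],
    "junior": [f"J{i}" for i in range(10)],
}
chosen = stratified_sample(strata, 10, seed=2)
print(len(chosen))  # 10: six freshmen, three sophomores, one junior
```
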

Which Sampling Method?


Example 1. American Airlines conducts a customer-satisfaction survey by randomly selecting 87
of its flights and distributing the survey to all of the passengers on the selected flights.
Cluster sampling
Example 2. A Gallup survey conducted May 3–18, 2021, asked respondents the following
question:
Do you think marriages between same-sex couples should or should not be recognized
by the law as valid, with the same rights as traditional marriages?
The survey was based on a random sample of 1016 adults, aged 18 and older, living in all 50
U.S. states and the District of Columbia, including a minimum quota of 70% cellphone
respondents and 30% landline respondents, with additional minimum quotas by time zone
within region. Landline and cellular telephone numbers are selected using random-digit-dial
methods.
Stratified sampling

Example 3. A quality-control engineer at a bottling plant wants to verify that the filling machine
is filling each bottle with 12 ounces of soda. The engineer decides to measure the amount of
soda in 50 bottles knowing that the bottling machine fills 4178 bottles per day. He calculates
k = 4178 / 50 = 83.56 ↓ = 83
then randomly starts with the 58th bottle, selecting every 83rd bottle thereafter:
58, 141, 224, 307, …, 4125
Systematic sampling
1.5 Sampling Errors, Bias, and Misleading Statistics

Sampling Error and Bias


Different samples contain different individuals, so our estimates will vary from sample to
sample.
Sampling error is the difference between our estimate and the true value of the parameter that
is caused by random variation between the population and the sample. In general, the more
data we collect, the more accurate our estimate will be, so we should be wary of studies whose
sample size is small.
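
A small simulation can illustrate how sampling error tends to shrink as the sample grows. This Python sketch uses an invented population of 10,000 values, so the numbers it prints are purely illustrative; the helper name average_sampling_error is an assumption.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population; its mean plays the role of the parameter.
population = [rng.gauss(50, 10) for _ in range(10_000)]
parameter = statistics.mean(population)

def average_sampling_error(sample_size, repetitions=200):
    """Average distance between the sample mean and the parameter."""
    errors = []
    for _ in range(repetitions):
        sample = rng.sample(population, sample_size)
        statistic = statistics.mean(sample)
        errors.append(abs(statistic - parameter))
    return statistics.mean(errors)

small_n_error = average_sampling_error(10)
large_n_error = average_sampling_error(1000)
print(small_n_error, large_n_error)
```

In a run like this, the typical error for samples of 1000 should come out well below the typical error for samples of 10.
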
Nonsampling error is caused by nonrandom factors that unduly influence the outcome of the
study. Remember, when selecting a random sample, every member of the population should
have an equal chance of being selected. In particular, the results of a survey can only be
generalized to the population from which the sample was taken.

• Sampling bias, also called selection bias, occurs when some members of the population
are more likely to be selected than others.
• Self-selection bias occurs when the sample consists of individuals who choose to
respond, such as many online polls.
• Response bias can occur due to the phrasing of survey questions, the environment in
which a study is conducted, the behavior of the survey taker, etc.
• Self-funded or self-interest studies are studies performed by a person or organization in
order to support their own claim.

Example. In the first clinical trial for the Moderna COVID-19 vaccine, all of the participants were
at least 18 years old. Therefore, we cannot generalize the results of the trial to children.
Moderna needed to conduct a second clinical trial to test whether the vaccine was safe and
effective in children 12 to < 18 years old, and they are currently conducting a third clinical trial
involving children from 6 months to < 12 years old.
Example. In 1936, both The Literary Digest and the American Institute of Public Opinion
conducted surveys to predict the outcome of the upcoming U.S. presidential election. The
Literary Digest predicted that the Republican candidate, Alfred Landon, would win the election
based on a sample of 2.3 million individuals. The American Institute of Public Opinion, founded
by George Gallup, predicted that the Democratic candidate, Franklin D. Roosevelt, would win
based on a much smaller sample of 50,000 individuals.
The Literary Digest surveyed its subscribers, registered automobile owners, and people with
telephones. What is wrong with this sample?
The sample is not representative of the population: the election took place during the height of
the Great Depression, and most subscribers to the magazine, households with telephones, and
automobile owners were Republicans, the party of Landon. Essentially, there was
undercoverage of Democrats by The Literary Digest.

Even if our sampling method is unbiased, the sample we select may not be representative of
the population if it contains an outlier (an observation that differs significantly from the rest of
the data). Outliers may result from atypical individuals, inaccurate responses, or data-entry
error. Outliers should be carefully scrutinized because they might unduly influence the results
of the study.

Example. An economist records the following prices at six randomly selected gas stations in
Texas:
$2.81 2.91 2.95 2.96 3.09 4.23
Calculate the average gas price at these six stations. Is this a good estimate for the average
price of gas in Texas?
$3.16. No. The $4.23 price is an outlier that unduly influences the average, pulling it up
toward itself. A good estimate of the average should lie somewhere in the middle of the data.
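
The arithmetic can be checked with a short Python sketch. Recomputing the average without the $4.23 station is an illustrative step added here to show the size of the outlier's pull; it is not a recommendation to discard the observation.

```python
import statistics

prices = [2.81, 2.91, 2.95, 2.96, 3.09, 4.23]

# Average of all six stations, including the outlier.
mean_all = statistics.mean(prices)
# Average of the first five stations only, for comparison.
mean_without = statistics.mean(prices[:-1])

print(round(mean_all, 2))      # 3.16
print(round(mean_without, 2))  # 2.94
```

With the outlier included, the average is higher than five of the six recorded prices.
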
Misleading Statistics
Accurate data can still be misleading if it is displayed improperly. When creating a display, try to
ensure that it leaves the viewer with an accurate impression.
Which of the following line graphs is misleading? Why?
Manipulating the Vertical Scale: Bar Graphs
Truncating the y-axis of a bar graph exaggerates the differences between the categories.

The heights of the bars should always be proportional to the value each bar represents.
The Terri Schiavo case was a right-to-die case stretching from 1998 to 2005 that garnered
national attention. After the court ordered that she be removed from life support according
to her husband’s wishes, CNN used a graph similar to the one below comparing support for
the court’s decision by political party.

Remedy: Use Excel to draw a bar graph representing the same data, but with the y-axis
starting at zero.
[Figure: bar graph with the vertical axis running from 0 to 70 and one bar each for Democrats,
Republicans, and Independents]

Microsoft Excel: Insert > 2-D Column > Format Axis… (type the data into Excel and highlight it
before following these steps to draw the bar graph)
Manipulating the Vertical Scale: Line Graphs
When drawing a line graph, truncating the vertical scale may be necessary to show small
changes over time. In such cases, it is good practice to break the vertical axis in order to draw
the viewer’s attention to the fact that vertical differences have been exaggerated.

Area Distortions
Draw a small square. Then draw another square whose sides are twice as long as those of the
first square.

How much bigger is the second square?


Four times (and not two times as one may have initially thought)
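
As a short worked equation, write s for the side length of the first square (a symbol introduced here only for illustration). The ratio of the two areas is then:

```latex
\[
\frac{(2s)^2}{s^2} = \frac{4s^2}{s^2} = 4
\]
```

Doubling both dimensions of a picture therefore quadruples its area.
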
Avoid 3-D Displays
How tall is the bar representing April? Not clear.

[Figure: 3-D bar graph with bars for Jan, Feb, Mar, Apr, and May; vertical axis marked at 0 and 10]

How tall is the bar representing April? 10 units tall.


[Figure: 2-D bar graph with bars for Jan, Feb, Mar, Apr, and May; vertical axis marked at 0, 10,
and 12]

Note: Both bar graphs are based on the same data.


Chapter 2 Graphic Displays of Data

2.1 Frequency Tables


Twenty MATH 1680 students were asked how many siblings they have. The data is shown
below. Construct a frequency table.
1 1 3 2 1 2 1 3 5 0
2 3 1 1 5 4 1 3 0 2

Siblings   Frequency
0          2
1          7
2          4
3          4
4          1
5          2
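
The same tally can be produced in code. This is a minimal Python sketch using collections.Counter; the variable names are chosen for illustration.

```python
from collections import Counter

# Sibling counts reported by the twenty students.
siblings = [1, 1, 3, 2, 1, 2, 1, 3, 5, 0,
            2, 3, 1, 1, 5, 4, 1, 3, 0, 2]

# Counter maps each distinct value to the number of times it occurs.
frequency = Counter(siblings)

print("Siblings   Frequency")
for value in sorted(frequency):
    print(f"{value:<10} {frequency[value]}")
```

The frequencies should add up to 20, the number of students surveyed, which is a quick check that no response was missed.
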
