Nominal: Qualitative variables whose values cannot be put into a meaningful order.
Eye color, race/ethnicity, ZIP code, political affiliation, …
Ordinal: Qualitative variables whose values can be put into a meaningful order.
Level of agreement, socioeconomic status, …
Interval: Quantitative variables whose values lie on a scale with no natural starting point.
Differences between values are meaningful, but ratios are not.
Birth year, temperature (Fahrenheit or Celsius), …
Ratio: Quantitative variables whose values lie on a scale with a natural starting point. Both
differences and ratios are meaningful.
Age, height, salary, GPA, …
A study was conducted to assess student eating patterns in high schools across the United
States. The study analyzed the impact of vending machines and school policies on student food
consumption. Determine the level of measurement of the following variables considered in the
study.
a. Number of snack and soft drink vending machines in the school
Ratio
b. Class rank (Freshman, Sophomore, Junior, Senior)
Ordinal
c. Whether or not the school has a closed campus policy during lunch
Nominal
Convenience sampling
Simple random sampling
Systematic sampling
Cluster sampling
Stratified sampling
Convenience sampling selects participants based on how convenient it is to collect data from
them.
Random sampling selects participants in such a way that each individual has an equal chance of
being selected.
Simple random sampling selects participants in such a way that every possible sample has an
equal chance of being selected.
Pros: highly likely to be representative of the population, hence fair and unbiased
Cons: expensive, difficult to implement
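As a minimal sketch of simple random sampling, the snippet below draws 10 students from a roster of 76 using Python's standard library. The roster (students numbered 1 to 76), the function name, and the seed are illustrative assumptions, not part of the original notes.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members so that every size-n subset is equally likely."""
    rng = random.Random(seed)  # a seed only makes the draw repeatable
    return rng.sample(list(population), n)

# Hypothetical roster: students numbered 1 through 76.
roster = range(1, 77)
sample = simple_random_sample(roster, 10, seed=1)
print(sorted(sample))
```

Because `random.sample` draws without replacement, no student can appear in the sample twice.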
How to generate random numbers
To select a systematic sample of size n from your class roster, divide the total number of
students N by the desired sample size n, then round down to get the sampling interval k. For
example, if N = 76 and n = 10,
k = ⌊76/10⌋ = ⌊7.6⌋ = 7
Use Random.org to choose a random number p between 1 and k. List every kth number starting
with p. For example, if p = 4 and k = 7,
4, 11, 18, 25, 32, 39, 46, 53, 60, 67
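The roster procedure above can be sketched in Python as follows. The function name and its parameters are illustrative assumptions; the random start is drawn with the standard library here rather than with Random.org.

```python
import random

def systematic_sample(N, n, p=None, seed=None):
    """Return n positions: every k-th position starting at p, with k = N // n."""
    k = N // n  # divide N by n and round down
    if p is None:
        # Random start between 1 and k, inclusive.
        p = random.Random(seed).randint(1, k)
    return [p + i * k for i in range(n)]

print(systematic_sample(76, 10, p=4))
# [4, 11, 18, 25, 32, 39, 46, 53, 60, 67]
```

Since p is at most k, the last position p + (n - 1)k never exceeds nk, which is at most N, so every selected position is on the roster.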
Cluster sampling involves dividing the population into groups, called clusters, then randomly
selecting some of the clusters. All the members from the chosen clusters are included in the
sample.
Divide the class into rows, then randomly select three rows. All the students sitting in those rows
belong to the sample.
Pros: more convenient than simple random sampling, likely to be representative of the
population if each cluster is heterogeneous, i.e., consists of dissimilar individuals.
Cons: may not be representative of the population if the individuals within each cluster
have something in common (besides being in the same cluster)
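A minimal sketch of the classroom-rows example: the rows are the clusters, three rows are chosen at random, and everyone in those rows joins the sample. The seating chart, student names, and function name are hypothetical.

```python
import random

def cluster_sample(clusters, m, seed=None):
    """Randomly pick m clusters and return them with all of their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)  # choose clusters, not individuals
    members = [person for name in chosen for person in clusters[name]]
    return chosen, members

# Hypothetical seating chart: each row of the classroom is one cluster.
rows = {
    "row 1": ["Ana", "Ben", "Cy"],
    "row 2": ["Dee", "Eli", "Fay"],
    "row 3": ["Gus", "Hal", "Ivy"],
    "row 4": ["Jo", "Kai", "Lee"],
    "row 5": ["Max", "Nia", "Oz"],
}
chosen, members = cluster_sample(rows, 3, seed=0)
print(chosen)
print(members)
```

Note that randomness enters only at the cluster level: once a row is chosen, all of its students are included.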
Stratified sampling involves dividing the population into subgroups, called strata, then taking a
proportionately sized sample from each stratum.
How might the researchers studying the effectiveness of the Moderna COVID-19 vaccine stratify
their sample?
Possible answers include: age, race/ethnicity, gender, body-mass index, pre-existing health
conditions …
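The "proportionately sized sample from each stratum" in the definition of stratified sampling can be sketched as below. The age-group strata, their sizes, and the function name are hypothetical; shares are rounded, so in general the stratum sizes may not add up to exactly n.

```python
import random

def stratified_sample(strata, n, seed=None):
    """Sample from each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    N = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        share = round(n * len(members) / N)  # proportional allocation
        sample[name] = rng.sample(members, share)
    return sample

# Hypothetical population of 100 participant IDs, stratified by age group.
strata = {
    "18-39": list(range(0, 40)),    # 40% of the population
    "40-64": list(range(40, 70)),   # 30%
    "65+":   list(range(70, 100)),  # 30%
}
sample = stratified_sample(strata, 10, seed=2)
print({name: len(ids) for name, ids in sample.items()})
# {'18-39': 4, '40-64': 3, '65+': 3}
```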
Example 3. A quality-control engineer at a bottling plant wants to verify that the filling machine
is filling each bottle with 12 ounces of soda. The engineer decides to measure the amount of
soda in 50 bottles knowing that the bottling machine fills 4178 bottles per day. He calculates
k = ⌊4178 / 50⌋ = ⌊83.56⌋ = 83
then randomly starts with the 58th bottle, selecting every 83rd bottle thereafter:
58, 141, 224, 307, …, 4125
Sampling method used: systematic sampling
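As a quick check of the bottling example, the sketch below regenerates the selected bottle numbers with Python's built-in `range` and confirms that starting at bottle 58 with an interval of 83 yields exactly 50 bottles out of 4178. The variable names are illustrative.

```python
bottles_per_day = 4178
sample_size = 50

k = bottles_per_day // sample_size  # 4178 / 50 = 83.56, rounded down to 83
start = 58                          # the random starting bottle from the example

# Every k-th bottle from the start, up to the last bottle of the day.
selected = list(range(start, bottles_per_day + 1, k))

print(k)              # 83
print(selected[:4])   # [58, 141, 224, 307]
print(selected[-1])   # 4125
print(len(selected))  # 50
```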
1.5 Sampling Errors, Bias, and Misleading Statistics
Sampling bias, also called selection bias, occurs when some members of the population
are more likely to be selected than others.
Self-selection bias occurs when the sample consists of individuals who choose to
respond, such as many online polls.
Response bias can occur due to the phrasing of survey questions, the environment in
which a study is conducted, the behavior of the survey taker, etc.
Self-funded or self-interest studies are studies performed by a person or organization in
order to support their own claim.
Example. In the first clinical trial for the Moderna COVID-19 vaccine, all of the participants were
at least 18 years old. Therefore, we cannot generalize the results of the trial to children.
Moderna needed to conduct a second clinical trial to test whether the vaccine was safe and
effective in children 12 to < 18 years old, and they are currently conducting a third clinical trial
involving children from 6 months to < 12 years old.
Example. In 1936, both The Literary Digest and the American Institute of Public Opinion
conducted surveys to predict the outcome of the upcoming U.S. presidential election. The
Literary Digest predicted that the Republican candidate, Alfred Landon, would win the election
based on a sample of 2.3 million individuals. The American Institute of Public Opinion, founded
by George Gallup, predicted that the Democratic candidate, Franklin D. Roosevelt, would win
based on a much smaller sample of 50,000 individuals.
The Literary Digest surveyed its subscribers, registered automobile owners, and people with
telephones. What is wrong with this sample?
The sample is not representative of the population: the election took place during the height of
the Great Depression, and most of the magazine's subscribers, households with telephones, and
automobile owners were Republicans, the party of Landon. Essentially, the Literary Digest
sample undercovered Democrats.
Even if our sampling method is unbiased, the sample we select may not be representative of
the population if it contains an outlier (an observation that differs significantly from the rest of
the data). Outliers may result from atypical individuals, inaccurate responses, or data-entry
error. Outliers should be carefully scrutinized because they might unduly influence the results
of the study.
Example. An economist records the following prices at six randomly selected gas stations in
Texas:
$2.81, $2.91, $2.95, $2.96, $3.09, $4.23
Calculate the average gas price at these six stations. Is this a good estimate for the average
price of gas in Texas?
(2.81 + 2.91 + 2.95 + 2.96 + 3.09 + 4.23) / 6 = 18.95 / 6 ≈ $3.16. No: the $4.23 price is an
outlier that unduly influences the average, pulling it up toward itself. A good estimate of the
average should lie somewhere in the middle of the data.
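The effect of the outlier can be checked numerically. The sketch below recomputes the mean from the six prices and, as an addition not made in the notes, compares it with the median and with the mean of the five remaining prices.

```python
from statistics import mean, median

prices = [2.81, 2.91, 2.95, 2.96, 3.09, 4.23]

print(round(mean(prices), 2))       # 3.16  (pulled upward by the 4.23 station)
print(round(median(prices), 3))     # 2.955 (midpoint of 2.95 and 2.96)
print(round(mean(prices[:-1]), 2))  # 2.94  (mean with the outlier set aside)
```

Five of the six stations charge less than the $3.16 mean, which is why it is a poor summary of these data.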
Misleading Statistics
Accurate data can still be misleading if it is displayed improperly. When creating a display, try to
ensure that it leaves the viewer with an accurate impression.
Which of the following line graphs is misleading? Why?
Manipulating the Vertical Scale: Bar Graphs
Truncating the y-axis of a bar graph exaggerates the differences between the categories.
The heights of the bars should always be proportional to the value each bar represents.
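To see how truncation exaggerates a difference, the sketch below compares the ratio of two drawn bar heights when the axis starts at zero and when it starts at 50. The values 62 and 54 and the baseline of 50 are hypothetical numbers chosen for illustration, not data from any graph in these notes.

```python
def drawn_height_ratio(a, b, baseline=0.0):
    """Ratio of the drawn heights of two bars when the axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)

# Hypothetical percentages for two categories.
a, b = 62, 54

print(round(drawn_height_ratio(a, b), 2))  # 1.15: bars proportional to the values
print(drawn_height_ratio(a, b, baseline=50))  # 3.0: one bar looks three times as tall
```

With the axis at zero, the first bar is about 15% taller, matching the data; with the axis truncated at 50, the same data produce a bar that appears three times as tall.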
The Terri Schiavo case was a right-to-die case stretching from 1998 to 2005 that garnered
national attention. After the court ordered that she be removed from life support according
to her husband’s wishes, CNN used a graph similar to the one below comparing support for
the court’s decision by political party.
Remedy: Use Excel to draw a bar graph representing the same data, but with the y-axis
starting at zero.
[Figure: bar graph with the vertical axis running from 0 to 70 in steps of 10; categories: Democrats, Republicans, Independents]
Microsoft Excel: Insert > 2-D Column > Format Axis… (type the data into Excel and highlight it
before following these steps to draw the bar graph)
Manipulating the Vertical Scale: Line Graphs
When drawing a line graph, truncating the vertical scale may be necessary to show small
changes over time. In such cases, it is good practice to break the vertical axis in order to draw
the viewer’s attention to the fact that vertical differences have been exaggerated.
Area Distortions
Draw a small square. Then draw another square whose sides are twice as long as those of the
first square.
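The arithmetic behind this exercise can be sketched in a few lines: scaling the sides of a square by a factor multiplies its area by the square of that factor. The function name is an illustrative assumption.

```python
def area_ratio(side_scale):
    """Factor by which a square's area grows when its sides are scaled."""
    return side_scale ** 2

print(area_ratio(2))  # 4: doubling the sides quadruples the area
print(area_ratio(3))  # 9
```

So a picture drawn twice as wide and twice as tall to represent a value that is twice as large covers four times the area.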
[Figure: two graphs, each with the months Jan through May on the horizontal axis and vertical-axis ticks at 0 and 10]
Siblings Frequency
0 2
1 7
2 4
3 4
4 1
5 2