
STQM 260: Introduction to Statistics

Professor Scott Foos


Spring 2021

Chapter 1: Statistics & Problem Solving

Chapter 2: Data, Reality & Problem Solving

Chapter 3: Visualizing Data

Chapter 4: Describing and Summarizing Data from One Variable

Chapter 5: Discovering Relationships

Chapter 6: Probability, Randomness, and Uncertainty

Chapter 7: Discrete Probability Distributions

Chapter 8: Continuous Probability Distributions

Chapter 9: Samples and Sampling Distributions

Chapter 10: Estimation: Single Samples


STQM 260, S21, Pg. 2

Chapters 1 & 2

Statistics and mathematics have similarities but are different. Mathematics solves problems with 100%
certainty and has only one correct answer. Statistics, because of variability and randomness, does not
solve problems with 100% certainty and frequently has multiple reasonable answers.

Statistics: The study of data including collecting, organizing, summarizing and analyzing information to
draw conclusions or answer questions. A ‘statistic’ is a measurement that describes a sample.

Anecdotal claim: A conclusion based on very little data.

Populations vs. Samples: It’s important to know which one you are working with.

 Population: The complete set of information. Measurements that describe populations are called
parameters and are usually denoted with Greek letters. A census is a population (it’s supposed to
be).

 Sample: A subset or portion of the population. Measurements that describe a sample are called
statistics and are usually denoted with English letters.

Statistics may be biased (misleading or misrepresentative) due to poor samples. On the other hand,
parameters cannot be biased because all parties are represented in a population so it never misrepresents
itself. You should question statistics and you should accept parameters. Statistics are estimates of
parameters.
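
The relationship between a parameter and a statistic can be sketched in a few lines of Python. The population values below are made up purely for illustration, and the names `population`, `sample`, `mu`, and `x_bar` are hypothetical. The point is only that the sample mean estimates, but usually does not equal, the population mean.

```python
import random
from statistics import mean

# Hypothetical population: ages of all 10 members of a small club (made-up values).
population = [19, 22, 25, 31, 34, 38, 41, 47, 52, 61]

# Parameter: describes the whole population (usually a Greek letter, mu).
mu = mean(population)

# Statistic: describes a sample (usually an English letter, x-bar).
random.seed(1)  # fixed seed so the sketch is repeatable
sample = random.sample(population, 4)
x_bar = mean(sample)

print("population mean (parameter):", mu)
print("sample mean (statistic):", x_bar)
```

Changing the seed gives a different sample and a different statistic, while the parameter stays the same.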

Example 1.1: A clinical trial studied 100 people who were diagnosed with a rash to determine if sun
exposure affected their rashes. The 100 people were randomly divided into two groups of 50; one group
was exposed to 3 hours of a sun lamp and the other was protected from the UV rays. Their rashes were
monitored before and after the 3-hour period and the people were classified as (1) Rash worsened, (2) Rash
improved, or (3) Rash did not change. Identify the population and the sample in this study.

Population:____________________________________________________________________

Sample:______________________________________________________________________

Random Variable: Any characteristic that varies from individual to individual within a population.
Knowing the type of variable will help you choose the appropriate graph, formula, or test method.

 Qualitative: Attributes, categories, or characteristics that are not naturally numeric like ethnicity,
degree level of education, and zip-code. Qualitative is often referred to as categorical.

 Quantitative: Values that are numeric like age, grade point average, and temperature. Don’t be
fooled by numbers that are codes (zip-code is qualitative…code for a location). Quantitative is
often referred to as numerical.

Quantitative data is preferred over qualitative data because there are more tools/options for analyzing it…
it contains more information.


Example 1.2: Classify each random variable as qualitative or quantitative.

_______Height _______Gender

_______Marital status _______Number of mutations in a strand of DNA

_______Student ID number _______Blood type

Qualitative Categories – Nominal vs. Ordinal

 Nominal: Variables that are categorical but without any natural ordering or hierarchy. For
example, hair color (brown, black, blonde, red, or white/silver).

 Ordinal: Variables that are categorical but with a natural ordering or hierarchy. For example,
class standing (freshman, sophomore, junior, or senior).

Ordinal data contains more information than nominal; nominal is the weakest type of data.

Example 1.3: Classify each qualitative variable as nominal or ordinal.

_______Letter grade earned in a course (A, B, C, D, F).

_______Method of payment (cash, check, credit card).

_______Beverage size (S, M, L)

_______Marital status (Single, Married, Divorced, Widowed)



Quantitative Categories – Discrete vs. Continuous

 Discrete: Variables that have a finite or a countable number of outcomes. For example, the
number of mutations in a strand of DNA=0, 1, 2, …

 Continuous: Variables that have an infinite number of outcomes which are usually measured and
rounded and fall on some interval. For example, temperature in kelvins = [0, ∞)

Example 1.4: Classify each quantitative variable as discrete or continuous.

_______Length of a newborn

_______Number of tornadoes in a year

_______Number of people in a group of 100 who have attached earlobes

_______Number of calories in a sandwich

Quantitative Categories – Interval vs. Ratio

 Interval: Numerical values where differences between values are meaningful but there is no ‘true
zero’ which means ratios are not meaningful.

 Ratio: Numerical values where both differences and ratios between values are meaningful
because there is a ‘true zero.’

Example 1.5: Classify each quantitative variable as interval or ratio.

_______Length of a newborn _______Temperature in Celsius

_______Temperature in Kelvin _______Dress sizes (0, 2, 4…)



Levels of Measurement Summary

Data can be grouped into 4 levels: nominal, ordinal, interval, and ratio. Nominal is the simplest, and
ratio the most sophisticated. Each level possesses the characteristics of the preceding level, plus an
additional quality.

 Nominal is hardly a measurement. It refers to quality more than quantity. A nominal level of
measurement is simply a matter of distinguishing by name. Nominal time of day - categories; no
additional information.

 Ordinal refers to order in measurement. An ordinal scale indicates direction, in addition to
providing nominal information. Ordinal time of day - indicates direction or order of occurrence;
spacing between is uneven.

 Interval scales provide information about order, and also possess equal intervals. Interval time of
day - equal intervals; analog (12-hr.) clock, difference between 1 and 2 pm is same as difference
between 11 and 12 am.

 Ratio scales have an absolute zero (a point where none of the quality being measured exists) in
addition to possessing the qualities of nominal, ordinal, and interval scales. Using a ratio scale
permits comparisons such as being ‘twice’ as high, or ‘half’ as much. Ratio time of day - 24-hr.
time has an absolute 0 (midnight); 14 o'clock is twice as long from midnight as 7 o'clock.
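
A short Python sketch may help show why ratios need a true zero. It uses two temperatures chosen only for illustration (the variable names are hypothetical) and compares the ratio computed on the Celsius scale with the ratio computed after converting to kelvins.

```python
# Celsius is an interval scale: 0 degrees C does not mean "no temperature".
c_low, c_high = 10.0, 20.0

# Differences are meaningful on an interval scale.
diff_c = c_high - c_low            # 10 degrees warmer

# The ratio 20/10 = 2 is NOT meaningful: 20 C is not "twice as hot" as 10 C.
naive_ratio = c_high / c_low

# Kelvin has a true zero, so convert before taking ratios.
k_low, k_high = c_low + 273.15, c_high + 273.15
true_ratio = k_high / k_low        # about 1.035

print(diff_c, naive_ratio, round(true_ratio, 3))
```

The difference of 10 degrees is the same on both scales, but only the ratio taken in kelvins describes how much "more" temperature there is.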

Measurement at the interval or ratio level is desirable because we can use more powerful statistical
procedures. Sometimes ordinal data are treated as though they were interval; for example, subjective
rating scales (1 = terrible, 2 = poor, 3 = fair, 4 = good, 5 = excellent). The scale probably does not meet
the requirement of equal intervals -- we don't know that the difference between 2 (poor) and 3 (fair) is the
same as the difference between 4 (good) and 5 (excellent). In order to take advantage of more powerful
statistical techniques, researchers often assume that the intervals are equal.
 

Data Sources:

Census: A census is a list of all the individuals in a population. An example is the US Census held every
10 years (this is only an example though). While answers have 100% certainty, this data may be difficult
or too costly to obtain.

Existing Source: An existing source is an appropriate data set that has already been collected and can be used
for this study. Using one saves collection time and money, but a suitable source may not exist.

Survey Sample: A survey sample is a study in which only a subset of the population is considered and there
is no attempt to influence the value of the variable of interest. While this data can save time and money,
choosing an appropriate sample can be difficult.

Observational Study: An observational study is one where there is no attempt to influence the value of
the variable. A survey sample is an example of an observational study. An observational study is also
called an ex post facto (after the fact) study. This data can detect associations between variables, but it
cannot establish causation.

Designed Experiment: A designed experiment is a controlled study. The purpose of designed
experiments is to control as many factors as possible to isolate the effects of a particular factor. Control
usually means that a treatment was applied to a group of individuals and then the treated group is
compared to a control (untreated) group. Experimenters should be able to control most of the variables.
A designed experiment lets you analyze individual factors, but it cannot be run when the variables cannot be
controlled, which may be the case for moral/ethical reasons.

 Factors or Explanatory Variables: Factors, also called explanatory variables, are the variables
in a designed experiment that are controlled. They have values that can be changed by the
researcher and are considered possible causes.

 Response Variable: The designed experiment analyzes the effects of the factors on the response
variable. Response variables are not controlled, have values that are measured by the researcher,
and measure the effects.

Always be careful of…

 Lurking Variables: Variables that were not studied that actually cause changes in the response
variable. Example: In children it can be shown that as shoe size increases, vocabulary size
increases. The age of the child is the lurking variable in the relationship.

 Confounding Variables: Variables that seem to both cause changes in the response but their
effects are confounded together. Example: A car dealership noticed an increase in sales after
offering a lower interest rate plus a $1000 discount towards the financing of a new car loan. The
lower interest rate and the $1000 discount are confounded variables. It cannot be determined how
much each one individually affected the increase in sales.

Example 1.6: Determine whether each study depicts an observational study or an experimental study.

Study 1: Rats with cancer are divided into two groups. One group receives 5 mg of a medication that
is thought to fight cancer, and the other receives 10 mg. After 2 years, the spread of the cancer is
measured.

Study 2: A study to determine whether there is a relation between the rate of cancer and an individual’s
proximity to high-tension wires.

Sampling Techniques

Sampling: The process of choosing the sample is called sampling. Good sampling techniques include an
element of randomness. Proper sampling is critical for reliable results.

Simple Random Sample (SRS): A simple random sample is one chosen so that every possible sample of size n out of
a population of N is equally likely to be selected. Simple random sampling requires that we
have a list of all the individuals within a population; this list is called a frame. If we do not have a frame,
then a different sampling method must be used.

An easy SRS method would be to draw names from a hat.
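
Drawing names from a hat can be mimicked in Python. This is a minimal sketch that assumes a small made-up frame; the names in `frame` and the sample size `n` are hypothetical.

```python
import random

# Hypothetical frame: a list of every individual in the population (N = 10).
frame = ["Ana", "Ben", "Cal", "Dee", "Eli", "Fay", "Gus", "Hal", "Ivy", "Jo"]

n = 3  # desired sample size

# random.sample draws n distinct individuals without replacement,
# so every possible sample of size n has the same chance of being chosen.
srs = random.sample(frame, n)
print(srs)
```

Each run prints a different set of three names because no seed is fixed.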

What you should avoid is a convenience sample…

Convenience Sampling: A convenience sample is obtained when we choose individuals in an easy, or
convenient way like “asking around” and like self-selecting surveys (when individuals voluntarily
respond to television or radio polls). Convenience sampling has little statistical validity because the
design is poor; however, there are times when convenience sampling could be useful as a rough guess.

Other EFFECTIVE Sampling Techniques:

Systematic Sampling: A systematic sample is obtained when we choose every kth individual in a
population. Systematic sampling is appropriate when we do not have a frame. The first individual
selected corresponds to a random number between 1 and k.
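
The selection rule can be sketched in Python as follows. The function name `systematic_sample` is hypothetical, and the numbered list stands in for individuals encountered in order (for example, people passing a checkpoint); in practice no complete list is required.

```python
import random

def systematic_sample(individuals, k):
    """Return every kth individual, starting at a random position from 1 to k."""
    start = random.randint(1, k)   # random start between 1 and k, inclusive
    # Convert the 1-based position to a 0-based index, then step by k.
    return individuals[start - 1::k]

# Hypothetical population of 100 individuals, numbered 1 to 100.
population = list(range(1, 101))
sample = systematic_sample(population, k=10)
print(sample)
```

With 100 individuals and k = 10, the sample always contains 10 individuals spaced 10 apart; only the starting point is random.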

Stratified Sampling: A stratified sample is obtained when we choose a simple random sample from
subgroups of a population. This is appropriate when the population is made up of distinct
(non-overlapping) groups called strata. Within each stratum, the individuals are likely to have a common
attribute. For example, strata can be based on gender, political affiliation, income, level of education, etc.

Example: A poll about parking at a university is to be given and researchers identified three strata,
1: Resident students, 2: Commuter students, and 3: Faculty and staff. It is reasonable to assume
that the opinions within each group are similar but it’s also reasonable to assume that
opinions differ between groups. Assume that the sizes of the strata are:

Resident students – 5,000
Commuter students – 4,000
Faculty and staff – 1,000

If we wish to obtain a sample of size n = 100 that reflects the same relative proportions (desirable
but not mandatory), we would want to choose simple random samples of:

50 resident students
40 commuter students
10 faculty and staff
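
The proportional split can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are hypothetical, and integer division works here because the numbers divide evenly (other stratum sizes would need a rounding rule).

```python
# Stratum sizes from the parking-poll example.
strata = {
    "Resident students": 5000,
    "Commuter students": 4000,
    "Faculty and staff": 1000,
}

n = 100                    # total sample size
N = sum(strata.values())   # population size: 10,000

# Each stratum's share of the sample matches its share of the population.
allocation = {name: n * size // N for name, size in strata.items()}
print(allocation)
```

A simple random sample of the computed size would then be drawn separately within each stratum.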

Cluster Sampling: A cluster sample is obtained when we divide our population into groups or clusters
and then randomly choose a set of groups and sample all individuals within those groups. Cluster
sampling is appropriate when it is very time consuming or expensive to choose the individuals one at a
time.

Example: Testing the fill amount of cans of Pepsi. It is time consuming to wait for
individual cans to fill. It is wasteful to open an entire box of 12 cans to just test one.
If we would like to test 240 cans, we could randomly select 20 boxes and test all 12 cans
within each. This reduces the time and expense required.
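
The can-testing example can be sketched in Python. The production run of 100 boxes is an assumption made for illustration (the example does not say how many boxes exist), and the names `boxes`, `chosen_boxes`, and `cans_to_test` are hypothetical.

```python
import random

# Hypothetical production run: 100 boxes, each holding 12 cans.
# Each can is identified by (box number, can number).
boxes = {b: [(b, c) for c in range(1, 13)] for b in range(1, 101)}

# Step 1: randomly select 20 boxes (the clusters).
chosen_boxes = random.sample(sorted(boxes), 20)

# Step 2: test every can in each selected box.
cans_to_test = [can for b in chosen_boxes for can in boxes[b]]
print(len(cans_to_test))  # 20 boxes x 12 cans = 240
```

The randomness is applied to whole boxes, not to individual cans, which is what distinguishes a cluster sample from a simple random sample of 240 cans.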

Multistage Sampling: A multistage sample is obtained using a combination of simple random sampling,
stratified sampling, systematic sampling, and/or cluster sampling. Many large-scale samples (the US
census in non-census years, for example) use multistage sampling.

Example 1.7: Identify the type of sampling method.

__________________A farmer divides his orchard into 50 subsections, randomly selects 4, and
samples all the trees within the 4 subsections to approximate the yield of his orchard.

__________________A group of lobbyists has a list of the 100 senators of the U.S. To determine the
Senate’s position regarding student loan rates, they decide to talk with every seventh senator on the list
starting with the third.

__________________A school official divides the student population into five classes: freshman,
sophomore, junior, senior, and graduate student. The official takes a simple random sample from each
class and asks the members’ opinions regarding student services.

__________________A survey regarding teacher quality (think ratemyprofessor.com) is administered
online by a market research firm and is available to anyone who would like to take it.

__________________A teacher has each of her students write their name on an index card. She
shuffles the cards and draws five.

Chapters 1 & 2 Self Practice

1) True or false:

A) Statistics is the study of data.
B) A statistic is a measurement that describes a population.
C) Statistics are used to answer questions with 100% certainty.
D) Anecdotal claims are very trustworthy.
E) A parameter can be biased.
F) Interval data contains more information/power than ordinal data.
G) Ratio data involves a true zero.
H) Nominal data is the weakest level of data.
I) In an experiment, the response variable is the dependent variable, factors and lurking variables
are potential independent variables, and sometimes two or more factors are confounded.
J) Factors are also called explanatory variables.
K) Causation can be determined from well-designed experiments.
L) Causation can be determined from observational studies.

ANSWER:
True are statements: A, F, G, H, I, J, K
False are statements: B, C, D, E, L

2) Classify the random variables as discrete or continuous.

A) The amount of fingernail growth measured in millimeters after a month.
B) The number of left-handed people in a crowd.

ANSWER: A is continuous and B is discrete

3) Classify the random variables as nominal, ordinal, interval, or ratio.

A) Weight of a molecule.
B) Time of day on a 12-hour clock.
C) Whether amino acids are charged, hydrophobic, or polar
D) Level of pain on a scale of 1 to 10 (1 being the least in pain and 10 being in extreme pain).

ANSWER: A is ratio, B is interval, C is nominal, D is ordinal

4) Classify each data source as a designed experiment or observational study.

A) A psychologist wants to study the interaction between parents and children at a local park. She
sits on a bench near the playground and takes notes on how involved the parents are in their
children’s play on the playground equipment.
B) A scientist wants to know how removing a gene affects the growing height of a plant. She has
a control group in which she does not change the gene sequence and she has another group in
which she removes the gene she wants to test. She records the plants’ heights.

ANSWER: A is an observational study, B is a designed experiment



5) Classify the following as simple random, convenience, systematic, cluster, or stratified samples.

A) A group of environmental students asks every 10th person on the school’s student
registry how they feel about implementing a recycling facility, starting with the 4th
person, who was chosen randomly.

B) Levodopa is a drug used to help control the symptoms of Parkinson’s disease. Doctors
wish to choose 80 out of 200 patients with Parkinson’s disease to test the effectiveness of
Levodopa. They assign each patient a number from 1 to 200. They then choose 80
numbers using a random number table.

C) A chemist divides a group of flasks of acids and bases into 7 categories based on pH.
She then takes a simple random sample from each of the 7 categories in order to use in an
experiment.

D) A florist sells roses by the dozen and currently has 120 roses separated in 10 different
bunches. She randomly picks 2 bunches and samples the 24 roses to determine the
average number of petals per rose.

E) On election day, a pollster for Fox News positions herself outside a polling place near
her home. She then asks the first 50 voters leaving the facility to complete a survey.

ANSWER: A is systematic, B is simple random, C is stratified, D is cluster, and E is convenience
