
Exploring Data

Mrs. Watkins
AP Statistics
Unit 1: Chapters 1-4
Statistics is the study of DATA and VARIATION
Data: systematically recorded information
Variation: concept that data values will be different from subject to subject
Who or what do we study?
• Population: entire collection of subjects
about which information is desired
Ex: teenagers who purchase shoes
without laces
• Sample: a subset of population which is
used to gather information
Ex: a random sample of 500 teenagers who
purchased shoes without laces in the
last month
Sample is a subset of the population
Each member of the sample is called an observational unit (or case or subject or experimental unit)
Samples
• We sample because it is simpler to
use a small group
• We sample because it is cheaper
to use a small group
We must ensure our sample is
RANDOM and
REPRESENTATIVE
Measures
Population Parameters: a measure of the
population, like the population MEAN
(average)
Symbol: μ (mu)
Sample Statistic: a measure of the sample,
like the sample MEAN (average)
Symbol: x̄ (x bar)
Variables
• Variable: characteristic of observational
unit
• Quantitative: uses numerical values that
are quantities
• Categorical: uses labels or groups that
are not quantities
Quantitative Variables
Two Types :
Discrete-- “countable”
# of people, # of plants, # of cars

Continuous-- “measurable”
cost, pulse rate, temperature,
weight
Categorical Variables
Two Types :
Binary– two labels
Yes/No or other two-group measures

Non-binary– More than two labels possible


Most variables are non-binary
For some companies now, data on gender is non-binary.
Example:
A company which makes sneakers
wants to survey 500 teens about
preferences for no-lace sneakers. Identify
each:
Population:
Sample:
Observational Unit:
Possible Variables:
Example:
Classify each variable:
a) Weight of package delivered

b) Age of customer

c) Highest Level of Education of patient


Example:
Identify the Observational Unit:
a) Survival rates are collected about the 10 most common cancers in the US
b) Color preference for cars purchased by 200 customers at local Toyota dealership
c) Customers at outlet store are asked their zip code for marketing purposes
Categorical
Data Graphs
Bar Chart

Best Used: To display counts or percentages for categorical data
**should have space between bars
Advantage: Easiest to make
Example: Elementary School
Circle Graph

Best Used: To display counts or percentages for categorical data
**should be labeled with percents
Advantage: visually appealing
Example: Elementary School
Legend: Conewago, Londonderry, East Hanover, South Hanover, Nye
Frequency Tables
Variable Tally Freq. Rel Freq
9th III
10th IIIIIII
11th IIIIIIIII
12th IIIII

Purpose: To organize raw data


Cumulative Relative
Frequency
Variable Rel. Freq Cum. Rel.Freq
9th
10th
11th
12th

Should add up to 100% or 1.00 (or close)
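As a rough sketch, the two tables above can be filled in with a few lines of Python. The counts assume each I in the Tally column is one tally mark (3, 7, 9, 5); the variable names are illustrative only.

```python
# Counts read off the Tally column (assumption: one "I" = one tally mark).
counts = {"9th": 3, "10th": 7, "11th": 9, "12th": 5}

total = sum(counts.values())

rows = []
running = 0.0
for grade, freq in counts.items():
    rel = freq / total      # relative frequency
    running += rel          # cumulative relative frequency
    rows.append((grade, freq, rel, running))

print(f"{'Grade':<6}{'Freq':>6}{'Rel Freq':>10}{'Cum Rel Freq':>14}")
for grade, freq, rel, cum in rows:
    print(f"{grade:<6}{freq:>6}{rel:>10.3f}{cum:>14.3f}")
```

The last cumulative relative frequency comes out at 1.00 (or very close, after rounding), which is the check the slide asks for.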


Quantitative
Data Graphs
Graphs for Quantitative Data

Dotplot
Stem/Leaf Plot
Histogram
Dot Plot

Best used: small data sets with small range
Advantage: Shows distribution of discrete values; shows gaps in data
Stem/Leaf Plot

Best used: Small data set, two digits
Advantage: data values are preserved, quick
Back to Back Stem/Leaf Plot
These graphs share a common stem
Histogram

Best used: Large range, large amount of data
Advantage: Usually made by computer/calculator, can see shape easily
Histograms—two types
FREQUENCY
showing actual counts for each variable
value on vertical axis

RELATIVE FREQUENCY
showing proportion/percent for each
variable value on vertical axis
How to make histogram on TI (pages 42 and 71 in textbook)
1. Stat—Edit—type in values
2. To sort—Stat—Edit—Sort A
3. To make histogram—StatPlot—turn on
Select Histogram—Name X List as which
list you want to display
Freq: 1 (leave as 1)
Press Zoom 9
Press Trace to see values and adjust by
using Window
SOCS
When describing a distribution of data,
put on your socs!
DATA ANALYSIS
AP questions will ask you to “comment on
the distribution”
S: SHAPE? Symmetric, skewed, bimodal

O: OUTLIERS? Any unusual values, gaps


C: CENTER ? Middle of the data

S: SPREAD? Range of data

ALWAYS DESCRIBE IN CONTEXT OF DATA
Skewed Right—most data are low
values, a few high
Examples: income, housing prices,
number of speeding tickets per
driver in a year
Skewed Left: most data are high
values, a few low
Examples: scores on honors level
exam, blood pressure among
overweight patients, prices of auto
insurance for teens
Symmetric—data relatively equal on both
sides of center
Examples: body temp, pulse, IQ
Bimodal—two peaks in distribution
Uniform: relatively equal
distribution
Statistical Measures
and Data Distribution
Mrs. Watkins
AP Statistics
Unit 1, Chapters 5,6
MEASURES OF CENTER
Mean: arithmetic average of all data values
population mean: μx (read “mu”)
sample mean: x̄ (read “x bar”)

Median: the middle value in a data set
also referred to as 2nd quartile, Q2, and 50th percentile, P50
Midrange: average of the extremes
(High + Low) / 2

Mode: the most common value in a data set (best for categorical data)
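A minimal sketch of the four measures of center, using Python's standard `statistics` module. The scores are hypothetical, invented only to illustrate the definitions above.

```python
from statistics import mean, median, mode

# Hypothetical quiz scores, invented for illustration only.
scores = [70, 75, 75, 80, 85, 90, 99]

midrange = (max(scores) + min(scores)) / 2   # (High + Low) / 2

print("mean    :", mean(scores))
print("median  :", median(scores))
print("midrange:", midrange)
print("mode    :", mode(scores))
```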
RESISTANCE
Resistant Measures: measures that are
NOT affected by extreme data values
Non-resistant Measures: measures
that ARE affected by extreme data
values
Mean, Midrange: NON-resistant
Median: resistant
SHAPE
If the mean > median, then data distribution
is skewed RIGHT. The mean is in the tail.

If the mean < median, then data distribution is skewed LEFT. The mean is in the tail.

If the mean ≈ median, then data distribution is approximately SYMMETRIC.
MEASURES OF
SPREAD
Range: Maximum – minimum
This is a single value measure
Resistant? NO

IQR (Interquartile Range): Q3 - Q1


This is a single value measure
Resistant? YES
5 Number Summary
5 important numbers in data set:
Min: lowest value
Q1: first quartile (25th percentile)
Med: middle (50th percentile)
Q3: third quartile (75th percentile)
Max: highest value
Q1, Med, Q3, may not be actual data values
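The sketch below computes a 5 number summary for a small hypothetical data set. It assumes one common quartile convention (Q1 and Q3 are the medians of the lower and upper halves, leaving out the middle value when n is odd); other conventions exist and can give slightly different quartiles.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]                 # values below the median position
    upper = data[len(data) - half:]     # values above the median position
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical data, invented for illustration only.
data = [2, 4, 5, 7, 8, 10, 11, 15]
print(five_number_summary(data))
```

With an even number of values, Q1, Med, and Q3 here are all averages of two neighbors, so none of them is an actual data value.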
BOXPLOT
graphical display of data using 5 number summary
(if outliers shown, called “modified box plot”)
5 # Summary Law: {60, 68, 74, 85, 94}
5# Summary Business: {65, 76, 86, 95, 100}
OUTLIERS
Outliers: unusually large or small data
values
Can see on modified box plot
IQR Test for Outliers
Calculate (IQR )
Multiply IQR x (1.5) = constant K
Q1 - K = outlier lower fence
Q3 + K = outlier upper fence
If any data values fall below the lower fence or above the upper fence, they are outliers
Example: IQR Test
A college student looks for a used textbook
on-line and finds the following costs:
83 94 85 88 78 28 80
Are there outliers in this data set?
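One way to work this example is to follow the IQR test step by step. The sketch assumes the median-of-halves convention for Q1 and Q3 (middle value left out because n = 7 is odd); a different quartile convention could move the fences slightly.

```python
from statistics import median

costs = [83, 94, 85, 88, 78, 28, 80]    # textbook costs from the example

data = sorted(costs)
half = len(data) // 2
q1 = median(data[:half])                # median of the lower half
q3 = median(data[len(data) - half:])    # median of the upper half

iqr = q3 - q1
k = 1.5 * iqr                           # constant K
lower_fence = q1 - k
upper_fence = q3 + k
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1 =", q1, " Q3 =", q3, " IQR =", iqr)
print("fences:", lower_fence, "to", upper_fence)
print("outliers:", outliers)
```

Under that convention Q1 = 78, Q3 = 88, IQR = 10, the fences are 63 and 103, and the $28 textbook is the only value outside them.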
STANDARD DEVIATION
a measure of the average amount of
deviation from the mean among the data
values

Population St. Deviation: σx (read “sigma of x”)


Sample St. Deviation: sx (read s of x)

We use sx because we usually do not have entire population.
NOT RESISTANT
VARIANCE
*the square of the standard deviation
*what you get before taking square root
NOT RESISTANT
Population Variance: σ²
Sample Variance: s²
This measure not used much in elementary
statistics but you need to know what it is.
Formulas for Standard
Deviation
Sample: $s_x = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
Population: $\sigma_x = \sqrt{\dfrac{\sum (x - \mu)^2}{n}}$

Variance is the number you get before the square root is taken
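The two standard deviation formulas (divide by n - 1 for a sample, by n for a population) can be written out directly. This is a sketch on hypothetical data; the function names are illustrative, and the standard library's `statistics.stdev` and `statistics.pstdev` compute the same quantities.

```python
from math import sqrt
from statistics import pstdev, stdev

def sample_sd(values):
    """s_x: square root of the sum of squared deviations over n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))

def population_sd(values):
    """sigma_x: square root of the sum of squared deviations over n."""
    n = len(values)
    mu = sum(values) / n
    return sqrt(sum((x - mu) ** 2 for x in values) / n)

# Hypothetical data, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print("sample sd         :", round(sample_sd(data), 4))
print("population sd     :", round(population_sd(data), 4))
print("sample variance   :", round(sample_sd(data) ** 2, 4))   # value before the square root
print("library cross-check:", round(stdev(data), 4), round(pstdev(data), 4))
```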
“Comment on the
distribution”
You now have numbers to support your
statements, rather than just graphs.
SHAPE: how is the data distributed?
OUTLIERS: do you have any outliers?
CENTER: where is the middle?
SPREAD: how widely does the data vary?
Unusual Features: gaps, clusters
ADJUSTMENTS TO DATA SET
What would happen to the statistical
measures if one very low or very high data
value was added to the set?

Mean:
Standard Deviation:
Median:
IQR:
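A quick experiment can help answer this question. The sketch adds one very high value to a hypothetical data set and compares the four measures before and after; the `iqr` helper assumes the median-of-halves quartile convention.

```python
from statistics import mean, median, stdev

def iqr(values):
    """IQR using the median-of-halves rule (middle value excluded when n is odd)."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[len(data) - half:]) - median(data[:half])

def summary(values):
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "median": median(values),
        "iqr": iqr(values),
    }

# Hypothetical data set, invented for illustration only.
base = [10, 12, 13, 14, 15, 16, 18]
with_extreme = base + [60]      # add one very high value

for label, data in (("original", base), ("with extreme value", with_extreme)):
    print(label, {name: round(value, 2) for name, value in summary(data).items()})
```

In this run the mean and standard deviation move a lot, while the median and IQR barely change, which is what "resistant" means.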
TRANSFORMATIONS TO
DATA
What would happen to the statistical
measures if each data value had a
constant added to or subtracted from it?

Mean:
Standard Deviation:
Median:
IQR:
TRANSFORMATIONS TO
DATA
What would happen to the statistical
measures if each data value had a
constant multiplied or divided by it?
Mean:
Standard Deviation:
Median:
IQR:
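Both transformation questions (adding a constant, multiplying by a constant) can be explored with a short experiment. This is a sketch on hypothetical data; it uses `statistics.quantiles` with its default method for the quartiles, which is an assumption about convention and not necessarily the one your calculator uses.

```python
from statistics import mean, median, quantiles, stdev

def describe(values):
    q1, _, q3 = quantiles(values, n=4)   # three cut points: Q1, median, Q3
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "median": median(values),
        "iqr": q3 - q1,
    }

# Hypothetical data set, invented for illustration only.
data = [10, 12, 13, 14, 15, 16, 18]
shifted = [x + 5 for x in data]   # add a constant to every value
scaled = [x * 2 for x in data]    # multiply every value by a constant

for label, values in (("original", data), ("plus 5", shifted), ("times 2", scaled)):
    print(f"{label:<9}", {name: round(value, 3) for name, value in describe(values).items()})
```

Adding a constant shifts the measures of center but leaves the measures of spread alone; multiplying by a constant rescales all four.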
MEASURES OF
POSITION
These give a numerical approximation of
where a single data value stands
compared to the whole distribution
Quartiles: mark 25th, 50th, 75th percentiles
Percentiles: mark what percent of data
are equal to or below a certain value
Z SCORE
Standardized Score: how a single
value compares to entire data set
in terms of position in distribution
z = (individual value - mean) / st. deviation
Population: $z = \dfrac{x - \mu}{\sigma_x}$
Sample: $z = \dfrac{x - \bar{x}}{s_x}$
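A z score is a one-line calculation. The sketch below uses hypothetical exam results to show how z scores let you compare values from two different distributions; the function name and numbers are illustrative only.

```python
def z_score(x, mean, sd):
    """Standardized score: how many st. deviations x lies from the mean."""
    return (x - mean) / sd

# Hypothetical exam results, invented for illustration only.
z_math = z_score(85, mean=70, sd=10)        # math: score 85, class mean 70, sd 10
z_english = z_score(80, mean=75, sd=2.5)    # English: score 80, class mean 75, sd 2.5

print("math z    :", z_math)
print("english z :", z_english)
```

Although 85 is the higher raw score, the English score sits further above its own class mean in standard deviation units.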
NORMAL MODEL
shows how continuous data is distributed
symmetrically along an interval according
to empirical rule
Empirical Rule:
68% of data within 1 st. deviation of μ
95 % of data within 2 st. deviations of μ
99.7% of data within 3 st. deviations of μ
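The percentages in the Empirical Rule are rounded areas under the standard normal curve. As a check, they can be reproduced with `statistics.NormalDist` from Python's standard library:

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)   # standard normal model

areas = {}
for k in (1, 2, 3):
    # area between -k and +k st. deviations from the mean
    areas[k] = standard.cdf(k) - standard.cdf(-k)
    print(f"within {k} st. deviation(s): {areas[k]:.4f}")
```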
OUTLIER TEST
Using Empirical Rule:
Data values more than 2 st. deviations away from mean (|z| > 2) are mild outliers

Data values more than 3 st. deviations away from mean (|z| > 3) are extreme outliers
NORMAL CURVE

a theoretical ideal about how traits/characteristics are distributed

Many human traits are approximately normally distributed, such as height, body temp, IQ, pulse

Avoid using “normal” when describing data; say “approximately normal or symmetric” unless clearly mound-shaped, bell-shaped
NORMAL CURVE
Normal curve—symmetric, mound-shaped

Area under curve = 1 for whole curve

A z score can be used to establish what % of the curve lies below or above the z score, and so the probability of a data value falling in that region.
Normal Curve Example #1
Studies on car safety report that stopping
distances follow a normal model. Suppose
that one model of car traveling at 62 mph
has a mean stopping distance of 155 feet
with st. dev. = 5 feet.
Draw the model:
Normal Curve Example #1
a. What proportion of cars will stop in less than 145 feet?

b. What proportion of cars will need more than 160 feet to stop?

c. What proportion of cars will stop between 145 and 165 feet?
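One way to check answers to this example is with `statistics.NormalDist`, which plays the role of the normal model with mean 155 feet and st. dev. 5 feet. This is a sketch, not the calculator method; the variable names are illustrative.

```python
from statistics import NormalDist

stopping = NormalDist(mu=155, sigma=5)   # stopping distance model, in feet

p_less_145 = stopping.cdf(145)                      # a. area below 145 (z = -2)
p_more_160 = 1 - stopping.cdf(160)                  # b. area above 160 (z = +1)
p_between = stopping.cdf(165) - stopping.cdf(145)   # c. area from z = -2 to z = +2

print(f"a. P(X < 145)       = {p_less_145:.4f}")
print(f"b. P(X > 160)       = {p_more_160:.4f}")
print(f"c. P(145 < X < 165) = {p_between:.4f}")
```

Part c should agree with the Empirical Rule figure of about 95% within 2 st. deviations.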
PERCENTILES USING
NORMAL CURVE
1. Find a z score(s)
2. Use calculator: normalcdf under DISTR
Looking for area > z score: normalcdf (z, ∞)
Looking for area < z score: normalcdf (-∞, z)
Looking for area between z scores:
normalcdf (z1, z2)
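For readers working away from the calculator, a Python stand-in for these three cases might look like the sketch below. The helper name `normalcdf` is chosen here to mirror the calculator command; it is not a built-in Python function.

```python
from math import inf
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

z = 1.0
area_above = normalcdf(z, inf)       # area > z score
area_below = normalcdf(-inf, z)      # area < z score
area_between = normalcdf(-1.0, 1.0)  # area between two z scores

print(round(area_above, 4), round(area_below, 4), round(area_between, 4))
```

The areas above and below the same z score always add to 1, which is a handy check on the sign of infinity.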
Normal Curve Example #2
• Data from health studies show that the
distribution of human pregnancies is
approximately normal with a mean of 270
and st. dev = 15 days.
• Draw the model:
Normal Curve Example #2
a. What proportion of pregnancies will last more than 280 days?

b. What proportion of pregnancies will last less than 236 days?

c. What proportion of pregnancies will last between 290 and 310 days?
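This example can also be worked the two-step way described earlier: first convert each cutoff to a z score, then find the area under the standard normal curve. The sketch below assumes the stated model (mean 270 days, st. dev. 15 days); the names are illustrative.

```python
from statistics import NormalDist

MEAN_DAYS, SD_DAYS = 270, 15
standard = NormalDist()               # standard normal: mean 0, st. dev. 1

def z(x):
    """z score of a pregnancy length x under the stated model."""
    return (x - MEAN_DAYS) / SD_DAYS

p_more_280 = 1 - standard.cdf(z(280))                   # a. area above z(280)
p_less_236 = standard.cdf(z(236))                       # b. area below z(236)
p_between = standard.cdf(z(310)) - standard.cdf(z(290)) # c. area between the two z scores

print(f"a. z = {z(280):.2f}, proportion = {p_more_280:.4f}")
print(f"b. z = {z(236):.2f}, proportion = {p_less_236:.4f}")
print(f"c. z from {z(290):.2f} to {z(310):.2f}, proportion = {p_between:.4f}")
```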
FINDING CUT OFF SCORES
If you are given a percentile or probability,
and need to determine the “cut off score”
1. Sketch curve to determine where z score is
located.
2. Determine if you want area above or below this
percentile
3. Use INVNORM on calculator
invnorm(percentile)= z score
4. Use z score formula to solve for x.
Inverse Norm Example #1
• Data from health studies show that the
distribution of human pregnancies is
approximately normal with a mean of 270
and st. dev = 15 days.
• Find the 90th percentile for human
pregnancies:
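A sketch of the cut off score steps for this example, using `NormalDist.inv_cdf` as a stand-in for the calculator's inverse normal command (both take the area to the left of the cutoff):

```python
from statistics import NormalDist

MEAN_DAYS, SD_DAYS = 270, 15

# Step 3: z score with 90% of the area below it.
z_90 = NormalDist().inv_cdf(0.90)

# Step 4: solve z = (x - mean) / st. dev. for x.
cutoff_days = MEAN_DAYS + z_90 * SD_DAYS

print(f"z = {z_90:.4f}")
print(f"90th percentile = {cutoff_days:.1f} days")
```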
Inverse Norm Example #2
• Data from a standardized test of 3rd grade
reading ability show that the distribution of
reading ability is approximately normal
with an approx. mean of 85 and st. dev = 6.
• Find the score interval for the middle 70%
of 3rd grade reading abilities:
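For a "middle 70%" interval, the remaining 30% is split evenly between the two tails, so the cutoffs are the 15th and 85th percentiles. A sketch under the stated model (mean 85, st. dev. 6), with illustrative names:

```python
from statistics import NormalDist

MEAN_SCORE, SD_SCORE = 85, 6
middle = 0.70
tail = (1 - middle) / 2               # 15% in each tail

standard = NormalDist()
z_low = standard.inv_cdf(tail)        # z score at the 15th percentile
z_high = standard.inv_cdf(1 - tail)   # z score at the 85th percentile

low_score = MEAN_SCORE + z_low * SD_SCORE
high_score = MEAN_SCORE + z_high * SD_SCORE

print(f"z scores: {z_low:.4f} and {z_high:.4f}")
print(f"middle 70% of scores: {low_score:.1f} to {high_score:.1f}")
```

Because the normal curve is symmetric, the two z scores are opposites and the interval is centered on the mean.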
Does the data fit a normal
model?
1. Check mean and median—how close are
they, in context of data?
2. Make a NORMAL PROBABILITY PLOT
on calculator. It should be approx. linear.
3. Make a BOXPLOT on calculator. It
should be approx. symmetric.

AVOID histograms on calculator to check.
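The idea behind check 2 can be sketched in code: pair each sorted data value with the z score a normal model would expect at that position, then see how close to a straight line the pairs fall. The plotting positions `(i + 0.5) / n` are one common convention and an assumption here (calculators may use a slightly different one), and both data sets are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean

def normal_plot_points(values):
    """Pair each sorted value with its expected z score under a normal model."""
    data = sorted(values)
    n = len(data)
    standard = NormalDist()
    expected_z = [standard.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(data, expected_z))

def correlation(pairs):
    """Correlation coefficient r: near 1 means the plot is close to linear."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data sets, invented for illustration only.
roughly_symmetric = [61, 64, 66, 67, 68, 69, 70, 71, 72, 74, 77]
right_skewed = [1, 1, 2, 2, 3, 3, 4, 6, 9, 15, 40]

r_symmetric = correlation(normal_plot_points(roughly_symmetric))
r_skewed = correlation(normal_plot_points(right_skewed))
print(f"roughly symmetric: r = {r_symmetric:.3f}")
print(f"right skewed     : r = {r_skewed:.3f}")
```

The correlation is only a numerical summary of "approx. linear"; looking at the plot itself is still the check the slide describes.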


Mean versus Median Check
Example: the mean spending on a laptop
among college students is $825 and median
spending is $749.

This means that the distribution is likely not normal, as these values are not close enough to assert symmetry of the distribution. Since the mean is well above the median, the distribution is likely skewed right.
Normal Probability Plot Check
Boxplot Check
Histogram Check
