
A. P. Statistics
Important Concepts

Here is a list of important concepts by chapter. You should skim the chapters that seem unfamiliar (based on this list), and look at the chapter reviews.
Chapter 1
Distribution of a variable tells what values the variable takes and how often it takes them.
Describe a distribution of a quantitative variable by describing shape, center, and spread.
Describe symmetric distributions using the mean and standard deviation; use the 5-number summary for skewed distributions.
Mean is not resistant and is always pulled toward the tail.
Standard deviation is always positive and equals zero only when all observations are identical.
Five number summary: Min, Q1, Median, Q3, Max. Q1 is the 25th percentile, which means that 25% of observations are at or below that value. Q3 is the 75th percentile, which means that 75% of observations are at or below that value. The median is the 50th percentile.
Frequency histogram has values of the quantitative variable on one axis and frequency on the other axis. Relative frequency histogram has values of the quantitative variable on one axis and the proportion or percent of observations on the other axis.
Cumulative frequency histogram, or ogive, gives the percent or frequency of observations at or below each value. A cumulative relative frequency histogram (ogive) displays percentiles on one axis.
Outliers may be identified using the 1.5 x IQR rule, or by using a modified box plot on the calculator.
Mean and standard deviation are NOT resistant. Median and quartiles are resistant.
Use median and IQR as measures of center and spread (respectively) if data is strongly skewed or has outliers.
Graphs to display univariate, quantitative data: boxplot, stemplot, histogram, dotplot. (Note: a box plot does not give information about individual observations.)
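The five-number summary and the 1.5 x IQR rule above can be sketched in a few lines of Python. This is a minimal illustration with made-up data; note that quartile conventions vary slightly between textbooks and calculators, so boundary values may differ a little from TI-83/84 output.

```python
# Five-number summary and 1.5 x IQR outlier rule (illustrative data).
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number = (min(data), q1, median, q3, max(data))

iqr = q3 - q1
low_fence = q1 - 1.5 * iqr    # observations below this are flagged
high_fence = q3 + 1.5 * iqr   # observations above this are flagged
outliers = [x for x in data if x < low_fence or x > high_fence]

print("five-number summary:", five_number)
print("outliers:", outliers)   # 100 lies far beyond the upper fence
```

Because the mean is not resistant, the single value 100 drags it upward, while the median and quartiles barely move; that is exactly why the summary above is preferred for skewed data.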
Chapter 2
Density curve has area of 1 and is always on or above the x-axis.
Area under the curve in a certain range is the same as the proportion of observations in that range. (Use area formulas for geometric density curves such as rectangles, triangles, and trapezoids.)
Normal density curve is mound-shaped (or bell-shaped), with mean = median. Areas can be found by converting observations to values on the standard normal curve,

    z = (statistic - parameter) / (std. dev. of statistic)

then using Normalcdf(left bound, right bound) or the table.
The Empirical rule (68-95-99.7) applies to all normal density curves.
Use InvNorm to find the standardized observation associated with a certain percentile ranking: InvNorm(percentile rank (as a proportion)) = z-statistic. Then use the z-formula to convert from the z-value to the observation value.
Use a normal probability plot to determine if data can be modeled by a normal curve. The plot looks like a scatterplot (last type of graph in the STAT PLOT menu); if the plot looks linear, then the data can be modeled by a normal density curve. If the plot shows a definite curve, then the data is skewed. If one observation is set apart from the others on the left or right, then that observation is possibly an outlier.
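What Normalcdf and the z-formula are doing can be reproduced with the standard library, since math.erf gives the standard normal CDF exactly. The function names below are my own, not calculator or textbook names.

```python
# Normal-curve areas from scratch: standardize, then take a CDF difference.
import math

def standard_normal_cdf(z):
    """P(Z <= z) for the standard normal curve."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normalcdf(left, right, mean=0.0, sd=1.0):
    """Area under a normal curve between the left and right bounds."""
    z_left = (left - mean) / sd     # standardize each bound: z = (x - mean)/sd
    z_right = (right - mean) / sd
    return standard_normal_cdf(z_right) - standard_normal_cdf(z_left)

# Empirical rule check: about 68% of observations fall within 1 sd of the mean.
print(round(normalcdf(-1, 1), 4))
```

Running the same call with bounds of -2, 2 and -3, 3 reproduces the 95% and 99.7% figures of the Empirical rule.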
Chapter 3
Bivariate data: Two measures recorded on each individual.
Use a scatterplot to determine if there is a relationship between two quantitative variables.
Positive association means positive slope: as values of the explanatory variable increase, values of the response variable increase. Or, above-average values of one variable are associated with above-average values of the other variable.
To describe a plot, give the strength, form, and direction of the relationship.
Correlation is a measure of the strength of a linear relationship. It also gives the direction (sign).
The correlation coefficient satisfies -1 <= r <= 1, where r = 1 is a perfect line with positive slope and r = -1 is a perfect line with negative slope. Properties of correlation are on p. 132.
A correlation of r = 0, or r close to zero, could mean no association at all (randomly scattered points) or a non-linear association, such as a quadratic.
Least squares regression line means that the line produces the smallest sum of squared residuals possible for the data.
The least squares line always passes through (x-bar, y-bar).
The least squares regression line can be obtained using LinReg on the calculator or using the means and standard deviations of the two data sets. (See formula on formula sheet.)
The slope of the least squares line tells the amount that the y-variable changes for each unit of change in the x-variable.
Coefficient of determination, r^2, is the percent of the variation in the response variable that is explained by the linear relationship with the explanatory variable.
Residual is observed y minus predicted y, or y - y-hat.
If the residual plot has no pattern, that is evidence that the model selected is a good fit for the data. (Residuals are plotted against the x-values or against the predicted y-values.)
Influential point sharply affects the regression line if removed; points that are extreme in the x-direction can be influential; influential points may have small residuals. (Problems on p. 1__ are good to look at for practice!)
Outliers in a regression do not fit the pattern of the data; they generally have large residuals.
Correlation does not mean causation!! Even if you have a perfect correlation, that does not mean that the x-variable causes changes in the y-variable. The correlation could be due to lurking variables.
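The formula-sheet route to the least-squares line (slope b = r * s_y / s_x, intercept a = y-bar - b * x-bar) can be sketched directly. The data below are invented for illustration.

```python
# Least-squares line from means, standard deviations, and r.
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)

# correlation: average product of z-scores (dividing by n - 1)
r = sum((xi - x_bar) * (yi - y_bar)
        for xi, yi in zip(x, y)) / ((len(x) - 1) * s_x * s_y)

b = r * s_y / s_x          # slope
a = y_bar - b * x_bar      # intercept: line passes through (x-bar, y-bar)

print(f"y-hat = {a:.3f} + {b:.3f}x, r = {r:.4f}, r^2 = {r*r:.4f}")
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
```

The residuals always sum to zero for a least-squares fit, which is a handy arithmetic check after computing the line by hand.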
Chapter 4
To transform exponential data so that it's linear, plot (x, log y) or (x, ln y). The LinReg equation will be of the form:

    ln y-hat = a + bx

To transform data from a power model, plot (log x, log y) or (ln x, ln y). The LinReg equation will be of the form:

    ln y-hat = a + b ln x
Predictions from extrapolation, or predicting outside the range of the data, are not reliable, as the pattern may not continue.
A lurking variable is a variable that has an important effect on the relationship between the variables in a study, but is not included among the variables studied.
A relationship between two variables can be due to: causation (x causes y; very hard to demonstrate, only with a controlled experiment); common response (both the x and y variables are responding to some third variable); confounding (the effect of the explanatory variable on the y variable is hopelessly mixed up with the effects of other variables).
Simpson's Paradox refers to the reversal of the direction of a relationship when data from several groups are combined to form a single group.
Categorical data can be analyzed using a two-way table.
Graphs for categorical data: bar graph, pie graph, segmented bar graph.
To see if there is a relationship between two categorical variables, compare conditional probabilities.
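The exponential transform above can be verified numerically: the y-values below follow y = 3 * 2^x exactly (invented data), so after taking ln y the fit is perfectly linear with slope ln 2 and intercept ln 3.

```python
# Linearizing exponential data: regress ln y on x.
import math

x = [0, 1, 2, 3, 4]
y = [3 * 2 ** xi for xi in x]          # 3, 6, 12, 24, 48
ln_y = [math.log(yi) for yi in y]      # transformed response

n = len(x)
x_bar = sum(x) / n
ly_bar = sum(ln_y) / n
b = sum((xi - li_bar_dev) * 0 for xi, li_bar_dev in []) or \
    sum((xi - x_bar) * (li - ly_bar) for xi, li in zip(x, ln_y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
a = ly_bar - b * x_bar

# Model: ln y-hat = a + b x, so y-hat = e^a * (e^b)^x
print(f"slope b = {b:.4f} (ln 2 = {math.log(2):.4f})")
print(f"intercept a = {a:.4f} (ln 3 = {math.log(3):.4f})")
```

Undoing the logs gives back the original model: e^a is the starting value 3 and e^b is the growth factor 2, which is why a linear pattern in (x, ln y) signals exponential growth.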

Chapter 5
A census is when every individual in a population is used in a study; a sample is when only a portion of a population is used in a study.
A study is biased if one outcome is systematically favored over other outcomes.
Sampling bias may be due to: voluntary response, undercoverage, convenience sampling. Random sampling reduces the chance of bias.
Sample size is NOT bias!! Yes, larger sample sizes are more accurate (less spread), but sample size does not result in one outcome being systematically favored over another.
Non-sampling bias cannot be corrected by random sampling. Sources of non-sampling bias are: poor wording of questions, or nonresponse.
A sample is a simple random sample (SRS) if every group of n individuals has an equal chance of being selected.
To create an SRS, number the individuals in the population from 1 up, and then select individuals using a random number table. You must use the same number of digits for each label, so you may need to label individuals as 01, 02, etc., or 001, 002, etc.
A stratified sample (which is NOT an SRS) is one in which individuals are first divided into separate groups, or strata, and then an SRS is selected from each stratum.
An observational study occurs when the experimenter observes individuals and measures variables of interest but does not attempt to influence the responses.
In an experiment, the researcher deliberately imposes a treatment on individuals in order to observe how they respond to the treatment.
The basic principles of experimental design are: control (control the effects of lurking variables, done by comparing groups such as a control group and a treatment group, and by blindness); randomization (randomly assigning subjects to groups); replication (performing the experiment on many subjects to reduce chance variation in results). (Good idea to read p. 2__.)
Completely randomized experiment is one in which all experimental units are assigned completely at random to groups. (This is the opposite of a block design!)
Blind experiment means that the subject does not know whether he is receiving the real treatment, or that the individual interacting with the experimental unit does not know.
Double blind experiment means that neither the subject nor the people in contact with the subject know which treatment the subject is receiving.
Blindness and double-blindness are used to control the placebo effect. The placebo effect is the phenomenon that humans tend to respond to any treatment, even an inactive one.
Block design is one in which experimental units that are similar are grouped together and assigned to treatment groups within the block. Only subjects within a block are compared. When blocking, put LIKE THINGS together so that the variable being controlled is constant within the block.
Blocking is used to reduce variation and to control lurking variables by grouping subjects according to those lurking variables.
Matched pairs is a special case of blocking in which each block consists of only two experimental units, or one experimental unit receiving two treatments. The order of treatments must be randomized since there is no randomization within groups. Either one subject receives both treatments (in random order), or two subjects that are alike in every important way are compared, with one randomly receiving one treatment and the other receiving the other treatment.
Simulation can be used to represent the results of a certain situation. To simulate a situation, assign numbers to outcomes so that each number occurs with the same frequency (percent) as its outcome. Use a random number table (or calculator) to find outcomes according to the numbers that come up.
When designing a simulation: describe the phenomenon to be simulated; describe how numbers will be assigned to outcomes; state the stopping condition; state assumptions (independence, special cases such as repeated numbers); carry out many repetitions.
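The simulation recipe above can be sketched for an invented question: the probability of at least one six in four rolls of a fair die. Number assignment: each roll is an integer 1-6 from the generator; stopping condition: four rolls per trial; repetitions: 100,000 trials. The true answer, 1 - (5/6)^4 which is about 0.5177, is known here, so we can watch the estimate converge.

```python
# Designed simulation: estimate P(at least one six in four rolls).
import random

random.seed(2024)           # fixed seed so the run is reproducible
trials = 100_000
hits = 0
for _ in range(trials):
    rolls = [random.randint(1, 6) for _ in range(4)]
    if 6 in rolls:          # "success" outcome for this trial
        hits += 1

estimate = hits / trials
print(f"estimated P(at least one six) = {estimate:.4f} (true ~ 0.5177)")
```

The independence assumption is built in because each call to the generator is independent of the last, which is exactly the assumption you must state when writing up a simulation design.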
Chapter 6
Probability is the proportion of times that a certain event occurs in a long series of repetitions.
The Law of Large Numbers describes the phenomenon that as we perform more trials, the ratio of occurrences to trials gets closer to the true probability.
The complement of an event is 1 - P(event). The complement is denoted P(event^c).
Two events are mutually exclusive if they cannot occur simultaneously; that is, the joint probability is zero: P(A and B) = 0. (Two events that are mutually exclusive cannot be independent.)
Two events, A and B, are independent if P(A) = P(A | B). Events that can occur simultaneously may or may not be independent.
The joint probability of two (or more) events is the probability that both occur simultaneously (AND). If two events are independent, then the joint probability is P(A and B) = P(A)P(B).
The union, or probability that at least one event occurs, is (on formula sheet):

    P(A or B) = P(A) + P(B) - P(A and B)

Conditional probability of two events is the probability that one event occurs given that the other event has already occurred (on formula sheet):

    P(A | B) = P(A and B) / P(B)

The probability of at least one is 1 - P(none).
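These rules can all be checked by brute-force enumeration of the 36 equally likely outcomes of two dice. The events are invented for illustration: A = "sum is 7" and B = "first die shows 3" (a classic pair that turns out to be independent).

```python
# Verifying the addition rule, conditional probability, and independence
# by counting outcomes in a finite sample space.
from fractions import Fraction

space = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
A = {o for o in space if o[0] + o[1] == 7}   # sum is 7
B = {o for o in space if o[0] == 3}          # first die shows 3

def p(event):
    return Fraction(len(event), len(space))

p_union = p(A) + p(B) - p(A & B)          # addition rule
assert p_union == p(A | B)                # matches direct counting

p_a_given_b = p(A & B) / p(B)             # conditional probability
print("P(A) =", p(A), " P(A|B) =", p_a_given_b)
```

Since P(A) = P(A | B) = 1/6, knowing the first die shows 3 tells you nothing about whether the sum is 7, which is the meaning of independence, and indeed P(A and B) = P(A)P(B) here.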
Chapter 7
A random variable is a variable whose value is a numerical outcome of a random
phenomenon.
A discrete random variable is one in which there are a countable number of outcomes. The distribution of a discrete random variable is a table (or histogram) showing each possible outcome along with the probability of that outcome. To find the probability that a discrete random variable falls in a range, add the probabilities of all of the outcomes in the range.
A continuous random variable is one in which the variable takes on every possible value in an interval. The distribution of a continuous random variable is a density curve. To find the probability for a continuous random variable, find the area under the density curve.
A normal distribution is a special distribution of a continuous random variable.
The mean of a random variable, or expected value, is (on formula sheet)

    mu_X = sum of x_i * p_i

(Multiply each outcome by its probability and add them all up.) This gives the average outcome per trial if the phenomenon were repeated many times.
The variance of a random variable is (on formula sheet)

    Var(X) = sigma_X^2 = sum of (x_i - mu_X)^2 * p_i

To get the standard deviation, take the square root of the variance.
You can find the mean and standard deviation of a random variable on your calculator by entering the outcomes in L1 and the probabilities in L2 and then doing 1-Var Stats L1,L2.
Special rules:
a) If a constant is added to every number in a distribution, then the mean of the new distribution is the old mean plus the constant, and the standard deviation stays the same. (Adding does not change the spread; it just shifts the distribution.)
b) If every number in a distribution is multiplied by a constant, then the mean of the new distribution is the old mean times the constant, and the standard deviation of the new distribution is the old standard deviation times the constant. (The variance is the square of the standard deviation, so the variance is multiplied by the square of the constant.)
c) If a new distribution is formed by adding or subtracting randomly selected individuals from two existing distributions, then the mean of the sums (or differences) is the sum (or difference) of the means. That is,

    mu_(X+Y) = mu_X + mu_Y    or    mu_(X-Y) = mu_X - mu_Y

This calculation works regardless of whether the observations are independent.
d) If a new distribution is formed by adding or subtracting randomly selected individuals from two existing distributions, then the variance of the sums (or differences) is the sum of the variances, provided that the observations are independent. That is,

    sigma^2_(X+Y) = sigma^2_(X-Y) = sigma^2_X + sigma^2_Y

To get the standard deviation, take the square root of this result. Note that we always ADD variances.
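The expected-value and variance formulas above mirror what 1-Var Stats L1,L2 computes. Here is a minimal sketch with an invented probability model (X = number of prizes won).

```python
# Mean and variance of a discrete random variable from its table.
outcomes = [0, 1, 2]
probs = [0.5, 0.3, 0.2]
assert abs(sum(probs) - 1.0) < 1e-12        # a legitimate probability model

mu = sum(x * p for x, p in zip(outcomes, probs))               # sum x_i p_i
var = sum((x - mu) ** 2 * p for x, p in zip(outcomes, probs))  # sum (x_i - mu)^2 p_i
sd = var ** 0.5

print(f"mean = {mu:.2f}, variance = {var:.4f}, sd = {sd:.4f}")
```

Note the weights are probabilities, not counts: that is why the calculator needs the probabilities in L2 rather than plain 1-Var Stats on L1.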
Chapter 8
Binomial distribution is a special probability distribution in which: there are two outcomes for the event; there is a fixed number of observations; the observations are independent; the probability of success is the same for each observation.
To find the probability of exactly k successes in n trials of a binomial phenomenon, either use the formula (on formula sheet)

    P(X = k) = nCk * p^k * (1 - p)^(n - k)

or use Binompdf on your calculator: Binompdf(number of observations, probability of success, number of successes we want).
The binomial distribution is symmetric if p = 0.5, skewed right if p is close to zero, and skewed left if p is close to 1.
The mean of a binomial distribution is (on formula sheet) mu_X = np, and the standard deviation of a binomial distribution is sigma_X = sqrt(np(1 - p)).
Geometric distribution is a special probability distribution in which: there are two outcomes, success or fail, for the event; the observations are independent; the probability of success is the same for each observation; the variable of interest is the number of trials until we see the first success.
To find the probability of the first success occurring on the kth observation, use the formula

    P(X = k) = (1 - p)^(k - 1) * p

The geometric distribution is always skewed right.
The mean of a geometric distribution is (NOT on formula sheet) mu_X = 1/p.
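Both formulas above can be written out with math.comb. The numbers (n = 5 coin flips, p = 0.5) are chosen so the answers are easy to verify by hand; the function names are mine, not calculator commands.

```python
# Binomial and geometric probabilities from the formulas.
import math

def binom_pmf(n, p, k):
    """P(X = k): exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def geom_pmf(p, k):
    """P(X = k): first success on trial k -> (k-1) failures, then a success."""
    return (1 - p)**(k - 1) * p

n, p = 5, 0.5
print(binom_pmf(n, p, 3))          # 10/32 = 0.3125, same as Binompdf(5, .5, 3)
print(geom_pmf(p, 3))              # (1/2)^2 * (1/2) = 0.125

mean = n * p                       # mu = np
sd = math.sqrt(n * p * (1 - p))    # sigma = sqrt(np(1-p))
```

A quick sanity check: the binomial probabilities for k = 0 through n must sum to 1, since exactly one of those outcomes happens.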
Chapter 9
Larger sample sizes are more accurate; increasing sample size decreases sampling variability.
The sampling distribution of a statistic is the distribution of the statistic in all possible samples of a certain size.
The distribution of the sample mean, x-bar, has mean mu_(x-bar) = mu (where mu is the mean of the population from which the sample is drawn) and standard deviation sigma_(x-bar) = sigma / sqrt(n).
The Central Limit Theorem gives the mean and standard deviation of the sampling distribution of sample means as stated above and, more importantly, says that if the sample size is large (n >= 30), the sampling distribution of sample means will be approximately normal regardless of the shape of the distribution of the population. (If n < 30, then the sampling distribution of sample means will mimic the shape of the population, and will be more like it the smaller the sample size is. The sampling distribution of sample means is normal if the distribution of the population is normal.)
The sampling distribution of sample proportions, p-hat, has mean mu_(p-hat) = p (where p is the population proportion) and standard deviation sigma_(p-hat) = sqrt(p(1 - p)/n).
The sampling distribution of sample proportions will be approximately normal if np >= 10 and n(1 - p) >= 10.
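The Central Limit Theorem can be watched in action with a short simulation. The population here is strongly right-skewed (an exponential distribution with mean 1 and sd 1, my choice for illustration), yet means of samples of size n = 30 pile up around mu with spread close to sigma / sqrt(n).

```python
# CLT demo: sample means from a skewed population.
import math
import random
import statistics

random.seed(7)
n = 30                      # sample size (large enough: n >= 30)
reps = 20_000               # number of samples drawn

sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

center = statistics.mean(sample_means)       # should be near mu = 1
spread = statistics.stdev(sample_means)      # should be near sigma/sqrt(n)
print(f"mean of x-bars = {center:.3f} (mu = 1)")
print(f"sd of x-bars   = {spread:.3f} (sigma/sqrt(n) = {1/math.sqrt(n):.3f})")
```

A histogram of sample_means would look roughly bell-shaped even though the population itself is nothing like a bell; that is the whole point of the theorem.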
Chapter 10
A confidence interval is used to estimate a population parameter.
An N% confidence interval is interpreted as follows: N% of all intervals that could be obtained contain the population parameter, so we're fairly confident that our interval contains the population parameter.
Increasing the confidence level will increase the margin of error. Decreasing confidence and/or increasing sample size will decrease the margin of error.
The P-value of a statistic is the probability of obtaining a statistic as extreme as (or more extreme than) the one you got, if the null hypothesis is true.
A statistic is significant if it is unlikely to occur by random chance, i.e., the statistic (or sample) is unusual or rare. The smaller the P-value (closer to zero), the more significant the statistic and the stronger the evidence against the null hypothesis.
A Type I error occurs when we incorrectly reject the null hypothesis when it is really true. The probability of a Type I error is alpha (the significance level). A Type I error will occur alpha of the time by random chance.
A Type II error occurs when we incorrectly accept the null hypothesis when the alternative is really true. The probability of a Type II error is called beta and is the area under the "true" distribution that falls in the acceptance region of the hypothesized distribution.
The power of a test is the probability that the test will reject the null if the alternative is really true. Power = 1 - beta; power is the complement of the probability of a Type II error.
If the sample size is increased, the probability of a Type II error decreases and power increases. (The probability of a Type I error is still alpha.)
If the significance level (alpha) is made smaller (from 0.05 to 0.01), then the probability of a Type II error is increased and power is decreased.
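Alpha and power can both be estimated by simulation. The scenario is invented: a one-sided z-test of H0: mu = 0 against Ha: mu > 0 with known sigma = 1, n = 20, and alpha = 0.05 (reject when z > 1.645). We draw many samples from the "null world" (mu = 0) and the "alternative world" (mu = 0.5) and count rejections.

```python
# Simulated Type I error rate and power for a one-sided z-test.
import math
import random

random.seed(11)

def reject_rate(true_mu, n=20, reps=10_000, crit=1.645):
    rejections = 0
    for _ in range(reps):
        xbar = sum(random.gauss(true_mu, 1.0) for _ in range(n)) / n
        z = (xbar - 0.0) / (1.0 / math.sqrt(n))   # statistic computed under H0
        if z > crit:
            rejections += 1
    return rejections / reps

type1 = reject_rate(0.0)    # null true: rejection rate should be near 0.05
power = reject_rate(0.5)    # alternative true: P(reject H0), roughly 0.72 here
print(f"Type I error rate ~ {type1:.3f}, power ~ {power:.3f}")
```

Rerunning with a larger n raises the power while leaving the Type I rate near alpha, which is the trade-off described above.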
Chapter 11
A t-distribution is used when we don't know the population standard deviation and we want to estimate a population mean. The mean of the distribution is always zero.
A t-distribution is bell-shaped, but is more variable (wider and flatter) than a standard normal curve for small sample sizes. It is more variable because the standard deviation is calculated from the sample, so it varies from sample to sample. As the sample size increases without bound, the t-distribution becomes closer to a normal distribution.
For dependent samples (Matched Pairs), do a one-sample t-test on the list of differences. The null hypothesis would be H0: mu_diff = 0.
For two independent samples, use a two-sample t-test with null hypothesis H0: mu_1 = mu_2.
For a 2-sample t-test, the (conservative) degrees of freedom is the smaller sample size minus 1.
SE is standard error. For sample means, SE is an estimate of the standard deviation of the sampling distribution (sigma / sqrt(n)) and is

    SE_(x-bar) = s / sqrt(n)

The SE for the difference of two means is

    sqrt(s_1^2 / n_1 + s_2^2 / n_2)
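The two-sample t statistic and its SE follow directly from the formulas above, with the conservative df = (smaller n) - 1 used when working by hand. The two samples are invented for illustration.

```python
# Two-sample t statistic with the conservative degrees of freedom.
import math
import statistics

group1 = [10, 12, 14, 16]
group2 = [9, 11, 13]

m1, m2 = statistics.mean(group1), statistics.mean(group2)
s1, s2 = statistics.stdev(group1), statistics.stdev(group2)
n1, n2 = len(group1), len(group2)

se = math.sqrt(s1**2 / n1 + s2**2 / n2)    # SE of (x-bar1 - x-bar2)
t = (m1 - m2) / se                          # tests H0: mu1 = mu2
df = min(n1, n2) - 1                        # conservative degrees of freedom

print(f"t = {t:.4f} with df = {df}")
```

The calculator's 2-SampTTest uses a more complicated fractional-df formula instead of min(n1, n2) - 1, so its P-value will be slightly smaller; the conservative choice errs on the safe side.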
Chapter 12
The standard error of p-hat is sqrt(p-hat(1 - p-hat)/n). Use this when creating a confidence interval to estimate the population proportion.
The sampling distribution of sample proportions is approximately normal if np >= 10 and n(1 - p) >= 10. For a 1-prop z-test, use the p from the null hypothesis. For a confidence interval, use p-hat.
Use a one-proportion z-test to compare an unknown population proportion to a known population proportion. The standardized statistic is

    z = (p-hat - p) / sqrt(p(1 - p)/n)

where p is the known proportion stated in the null hypothesis (since we assume the null is true).
You may use 1-Prop Z Test on the calculator to test if a population proportion is equal to a given value, but you still must show all steps, including the appropriate calculation for the z-statistic.
The SE for a difference of two proportions is

    sqrt(p-hat_1(1 - p-hat_1)/n_1 + p-hat_2(1 - p-hat_2)/n_2)

Use this SE when estimating the difference between two population proportions. (Do NOT pool the p-hats when calculating a confidence interval, because we have no null hypothesis that assumes the proportions are equal.)
The sampling distribution for the difference of two proportions is approximately normal if n(p-hat) >= 5 and n(1 - p-hat) >= 5 for BOTH p-hats. If checking "nearly normal" for a two-proportion z-test, you may use the pooled p-hat in the place of both p-hats when checking the nearly normal condition.
Use 2-Prop Z-Test on the calculator to see if two population proportions are equal when you have two independent samples.
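A one-proportion z-test can be computed by hand in a few lines, matching what 1-Prop Z Test reports. The counts are invented: 58 successes in n = 100 trials, testing H0: p = 0.5 against Ha: p > 0.5. Note the null p goes into the SE (we assume the null is true); p-hat would be used instead for a confidence interval.

```python
# One-proportion z-test computed from the formula.
import math

def standard_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

successes, n = 58, 100
p0 = 0.5                                   # value from the null hypothesis
p_hat = successes / n

z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = 1.0 - standard_normal_cdf(z)     # one-sided (Ha: p > p0)

print(f"z = {z:.2f}, one-sided P-value = {p_value:.4f}")
```

With a P-value just above 0.05, this sample is suggestive but not significant at the 5% level, a good reminder that "close to alpha" still means "fail to reject."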
Chapter 13
The Chi-square distribution is always skewed to the right. Also, the mean shifts to the right as the degrees of freedom increase.
Use a Chi-square Goodness of Fit test if you want to see if the distribution of a single categorical variable "fits" some idealized distribution. Create the chi-square statistic using lists. Remember to always use COUNTS, not percents, in your lists!
Use a Chi-square Test of Independence if you want to see if two variables measured on one set of individuals (sample or population) are independent, mainly from a two-way table. The null hypothesis is that the variables are independent, or that there is no relationship between the variables.
Use a Chi-square Test of Homogeneity if you want to see if the values of a categorical variable are distributed the same way for two or more populations. The test works the same way as a test of independence, except the null hypothesis is that the populations are homogeneous.
Expected counts for a two-way table:

    expected count = (row total x column total) / grand total
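The expected-count formula and the chi-square statistic can be sketched for a small two-way table. The observed counts are invented and chosen so every expected count comes out to 25, making the arithmetic easy to check.

```python
# Expected counts and chi-square statistic for a two-way table.
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# expected = (row total)(column total) / (grand total)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_sq = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(observed))
    for j in range(len(observed[0]))
)
df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"chi-square = {chi_sq}, df = {df}")
```

With df = 1 the 5% critical value is 3.84, so a statistic of 4 would (just) lead to rejecting the null of independence for this made-up table.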
Chapter 14
To test whether a linear model is appropriate, we make an inference about the parameter beta, where beta is the slope of the population regression line.
Null: assume there is no linear relationship, so beta is zero. H0: beta = 0.
Degrees of freedom is n - 2, where n is the number of observations.
To make a decision, look at the t-statistic and p-value from the computer output.
To create a confidence interval for the population slope, use the t-statistic with n - 2 degrees of freedom and use the SE from the computer output.
Hypothesis test decisions:
Quantitative variable(s):
  Use a 1-sample t-test if comparing a known population mean to an unknown population mean.
  Use a 2-sample t-test if comparing two unknown population means when the two samples are independent.
  Use a 1-sample t-test for matched pairs if you want to know if there's a difference in two means given that the two samples are NOT independent (matched on some variable).
  Use a 1-sample t-interval to estimate an unknown population mean.
  Use a 2-sample t-interval to estimate the difference between two unknown population means.
Categorical variable(s):
  Use a 1-proportion z-test if comparing a known population proportion to an unknown population proportion.
  Use a 2-proportion z-test if comparing two unknown population proportions when the samples are independent.
  Use a 1-proportion z-interval to estimate an unknown population proportion.
  Use a 2-proportion z-interval to estimate the difference between two unknown population proportions.
  Use a Chi-square goodness of fit test if testing to see if the distribution of one categorical variable for one population matches some given distribution.
  Use a Chi-square test of homogeneity if you want to know if the distribution of one categorical variable is the same for two or more independent samples.
  Use a Chi-square test of independence if you want to determine if there is a relationship between two categorical variables.