
# Statistics 250

Interactive Lecture Notes
Solutions

Fall 2015 - Winter 2016

Dr. Brenda Gunderson
Department of Statistics
University of Michigan

Table of Contents

Intro
1: Summarizing Data
2: Sampling, Surveys, and Gathering Data
3: Probability
4: Random Variables
Part 1: Distribution for a Sample Proportion
Part 2: Estimating Proportions with Confidence
Part 1: Distribution for a Sample Mean
Part 2: Confidence Interval for a Population Mean
Part 1: Distribution for a Sample Mean Difference
Part 2: Confidence Interval for a Population Mean Difference
10: ANOVA: Analysis of Variance
11: Relationships between Quantitative Variables: Regression
12: Relationships between Categorical Variables: Chi-Square

Stat 250 Gunderson Lecture Notes
Introduction

"Statistics ... the most important science in the whole world: for upon it depends the
practical application of every other science and of every art: the one science essential to [...]
for it only gives results of our experience." - Florence Nightingale, Statistician

Definitions:

Statistics are numbers measured for some purpose.

Statistics is a collection of procedures and principles for gathering data and
analyzing information in order to help people make decisions when faced with
uncertainty.

Course Goal: Learn various tools for using data to gain understanding and make sound decisions.

1: Summarizing Data

"You must never tell a thing. You must illustrate it. We learn through
the eye and not the noggin."

- Will Rogers (1879-1935)

Simple summaries of data can tell an interesting story and are easier to digest than long lists.
So we will begin by looking at some data.

Raw Data

Raw data correspond to numbers and category labels that have been collected or measured but
have not yet been processed in any way. Shown below is a set of RAW DATA: information
about a sample of n = 86 college students. For each student we are provided with whether they
feel sleep deprived and their reported typical amount of sleep per night (in hours). The
information we have is organized into variables. In this case these 86 college students are a
subset from a larger population of all college students, so we have sample data.

Definitions:
A variable is a characteristic that differs from one individual to the next.

Sample data are collected from a subset of a larger population.

Population data are collected when all individuals in a population are measured.

A statistic is a summary measure of sample data.

A parameter is a summary measure of population data.

Types of Variables

We have 2 variables in our data set. Next we want to distinguish between the different types of
variables: different types of variables provide different kinds of information, and the type will
guide what kinds of summaries (graphs/numerical) are appropriate.

Could you compute the AVERAGE AMOUNT OF SLEEP for these 86 students? YES
Could you compute the AVERAGE SLEEP DEPRIVED STATUS for these 86 students? NO
(could code, but the coding would be arbitrary: 0 and 1, or could use any two values like 1 and 203)
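To see why an "average" of coded categories is arbitrary, here is a minimal Python sketch. It assumes the response counts tallied elsewhere in these notes (51 Yes, 35 No); the function name `mean_of_codes` is made up for illustration.

```python
# 35 "No" and 51 "Yes" responses, as tallied for the sleep-deprived data.
n_no, n_yes = 35, 51

def mean_of_codes(code_no, code_yes):
    """Average of the numeric codes when No -> code_no and Yes -> code_yes."""
    return (n_no * code_no + n_yes * code_yes) / (n_no + n_yes)

mean_01 = mean_of_codes(0, 1)        # with 0/1 codes this is just the proportion of "Yes"
mean_1_203 = mean_of_codes(1, 203)   # a different, equally arbitrary coding

print(round(mean_01, 3), round(mean_1_203, 2))   # 0.593 120.79
```

The two codings describe the same responses yet give very different "averages", so the result reflects the coding rather than the students.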

SLEEP DEPRIVED STATUS is said to be a CATEGORICAL variable,

AMOUNT OF SLEEP is a QUANTITATIVE variable.

Definitions:
A categorical variable places an individual or item into one of several groups or
categories. When the categories have an ordering or ranking, it is called an ordinal
variable.

A quantitative variable takes numerical values for which arithmetic operations such as
adding and averaging make sense. Other names for quantitative variable are:
measurement variable and numerical variable.

Try It!
For each variable listed below, give its type as categorical or quantitative.

Age (years): QUANTITATIVE
Typical Classroom Seat Location (Front, Middle, Back): CATEGORICAL
Number of songs on an iPod: QUANTITATIVE
Time spent studying material for this class in the last 24-hour period (in hours): QUANTITATIVE
Soft Drink Size (small, medium, large, super-sized): CATEGORICAL (ordinal)
The "And then..." count recorded in a psychology study on children (details will be provided) [...] just heard: QUANTITATIVE

[...] a variable is modeled discretely (because its values are countable) or whether it would
be modeled continuously (because it can take any value in an interval or collection of [...]

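DATA SET = DEPRIVED
From Utts, Jessica M. and Robert F. Heckard. Mind on Statistics, Fourth Edition. 2012. Used with permission.

| Feel Sleep Deprived? | Amount Sleep per Night (hours) | Feel Sleep Deprived? | Amount Sleep per Night (hours) |
|---|---|---|---|
| No | 9 | No | 8 |
| No | 7 | No | 7 |
| No | 8 | No | 9 |
| Yes | 7 | Yes | 7 |
| Yes | 7 | Yes | 7 |
| Yes | 8 | Yes | 7 |
| Yes | 7 | Yes | 7 |
| Yes | 8 | Yes | 6 |
| No | 10 | No | 8 |
| No | 8 | Yes | 6 |
| No | 9 | No | 9 |
| No | 8 | No | 8 |
| Yes | 8 | Yes | 7 |
| Yes | 4 | Yes | 8 |
| Yes | 6 | No | 8 |
| No | 8 | No | 8 |
| No | 10 | Yes | 7 |
| No | 4 | Yes | 7 |
| Yes | 7 | Yes | 7 |
| Yes | 8 | No | 7 |
| No | 9 | Yes | 7 |
| No | 9 | Yes | 8 |
| No | 7 | No | 7 |
| Yes | 8 | Yes | 7 |
| No | 9 | Yes | 7 |
| No | 9 | Yes | 7 |
| No | 8 | Yes | 8 |
| No | 6 | Yes | 6 |
| No | 9 | Yes | 6 |
| Yes | 7 | Yes | 8 |
| Yes | 11 | No | 9 |
| Yes | 7 | No | 7 |
| No | 9 | Yes | 8 |
| Yes | 7 | Yes | 6 |
| No | 8 | Yes | 7 |
| Yes | 7 | Yes | 8 |
| Yes | 7 | Yes | 5 |
| Yes | 9 | Yes | 6 |
| Yes | 1 | No | 7 |
| Yes | 7 | No | 8 |
| Yes | 6 | Yes | 8 |
| No | 8 | Yes | 7 |
| Yes | 6 | Yes | 6 |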

Our data set is somewhat large, containing a lot of measurements in a long list. Presented as a
table listing, we can view the record of a particular college student, but it is just a listing, and it is not
easy to find the largest value for the amount of sleep or the number of students who felt they
are sleep deprived. We would like to learn appropriate ways to summarize the data.

Summarizing Categorical Variables

Numerical Summaries
How would you go about summarizing the SLEEP DEPRIVED STATUS data? The first step is to
simply count how many individuals/items fall into each category. Since percents are generally
more meaningful than counts, the second step is to calculate the percent (or proportion) of
individuals/items that fall into each category.

Note: Count = Frequency. Percent or Proportion is ok!

| Sleep Deprived? | Count | Percent |
|---|---|---|
| Yes | 51 | (51/86)*100 = 59.3% |
| No | 35 | (35/86)*100 = 40.7% |
| Total | 86 | 100% |

The table above provides both the frequency distribution and the relative frequency
distribution for the variable SLEEP DEPRIVED STATUS.
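The two steps (count, then convert to percent) can be sketched in Python as follows. This is only an illustration: the list `responses` is rebuilt from the counts in the table rather than read from a data file, and the variable names are arbitrary.

```python
from collections import Counter

# Responses rebuilt from the summary counts: 51 "Yes" and 35 "No".
responses = ["Yes"] * 51 + ["No"] * 35

counts = Counter(responses)                  # step 1: frequency distribution
n = sum(counts.values())                     # total number of students
percents = {label: round(100 * c / n, 1)     # step 2: relative frequency, in percent
            for label, c in counts.items()}

for label in ("Yes", "No"):
    print(f"{label:>3}: count = {counts[label]}, percent = {percents[label]}%")
```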

Visual Summaries
There are two simple visual summaries for categorical data: a bar graph or a pie chart.
If you were making one: Don't forget to label each axis and show some values on each axis!

counts:
Deprived
No Yes
35 51
percentages:
Deprived
No Yes
40.7 59.3

Aside: Does it matter whether the No or Yes bar is given first? No, the categories are not
ordinal here, so we should not comment on shape (i.e. do not use words like "skewed" or
"increasing pattern" here).

Pie Chart: Another graph for categorical data, which helps us see what part of the whole each
group forms.

Pie charts are not as easy to draw by hand.
It is not as easy to compare sizes of pie
pieces versus comparing heights of bars.

Thus we will prefer to use a bar graph for
categorical data.

Recap: We have discussed that some variables are categorical and others are
quantitative. We have seen that bar graphs and pie charts can be used to display data
for categorical variables. We turn next to displaying the data for quantitative
variables.

Exploring Features of Quantitative Data with Pictures

Recall our Sleep Deprived Data for n = 86 college students. We have data on two variables: sleep
deprived status and amount of sleep per night. How would you go about summarizing the sleep
hours data? These measurements do vary. How do they vary? What is the range of values? What
is the pattern of variation?

Find the smallest value = __1__ and largest value = __11__

Take this overall range and break it up into intervals (of equal width).
What might be reasonable here?
Perhaps by 2's; but we need to watch the endpoints.

Summary Table:

| Class | Frequency (or count) | Relative Frequency (or proportion) | Percent |
|---|---|---|---|
| [0, 2] | 1 | 1/86 = 0.012 | 1.2% |
| (2, 4] | 2 | 2/86 = 0.023 | 2.3% |
| (4, 6] | 12 | 0.139 | 13.9% |
| (6, 8] | 56 | 0.651 | 65.1% |
| (8, 10] | 14 | 0.163 | 16.3% |
| (10, 12] | 1 | 0.012 | 1.2% |

Watch the endpoints: different software will use different endpoint conventions.
With the table, we can draw a histogram.

Graph for quantitative data = Histogram:

Note: if we divide each count by 86, we would have proportions, but the picture would
look the same.

Note: each bar represents a class, and the base of the bar covers the class.

The above table and histogram show the distribution of this quantitative variable SLEEP HOURS,
that is, the overall pattern of how often the possible values occur.
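As a rough illustration of how the class counts arise, the Python sketch below bins the sleep hours into width-2 classes with the right endpoint included. The dictionary `value_counts` is a tally of the 86 raw values, and `class_index` is a made-up helper name; software packages may use a different endpoint convention.

```python
# Sleep hours for the 86 students, stored as value -> number of students.
value_counts = {1: 1, 4: 2, 5: 1, 6: 11, 7: 31, 8: 25, 9: 12, 10: 2, 11: 1}

# Class boundaries 0, 2, ..., 12. The first class is closed, [0, 2];
# every later class is left-open and right-closed, e.g. (2, 4].
edges = [0, 2, 4, 6, 8, 10, 12]

def class_index(x, edges):
    """Return the index of the class that contains x."""
    if x == edges[0]:
        return 0
    for i in range(len(edges) - 1):
        if edges[i] < x <= edges[i + 1]:
            return i
    raise ValueError(f"{x} is outside the class range")

frequency = [0] * (len(edges) - 1)
for value, count in value_counts.items():
    frequency[class_index(value, edges)] += count

n = sum(frequency)
for i, f in enumerate(frequency):
    left = "[" if i == 0 else "("
    print(f"{left}{edges[i]}, {edges[i + 1]}]: count = {f}")
```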

R Histograms (default on the left and customized on the right):


How to interpret?
Look for Overall Pattern
Three summary characteristics of the overall distribution of the data:

Shape (approximately symmetric, skewed, bell-shaped, uniform)

Location (center, average)
Approximately the middle value or where it would balance

Range (overall and then where most of the observations are)

Look for deviations from Overall Pattern

Describe the distribution for SLEEP HOURS:
Approximately bell-shaped, symmetric distribution, unimodal,
centered around 7 hours, with most values between 4 and 10 hours.
No apparent outliers.

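[Figure: histogram with Count on the vertical axis and Response on the horizontal axis]

What would it tell you?

We would call this a bimodal distribution. There appear to be two subgroups of observations. It [...]
separately for each group.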


NO SPACE BETWEEN BARS! Unless there are no observations in that interval.
How Many Classes? Use your judgment: generally somewhere between 6 and 15
intervals.
Better to use relative frequencies on the y-axis when comparing two or more sets of
observations.
Software has defaults and many options to modify settings.

One More Example:

A study was conducted in Detroit, Michigan to find
out the number of hours children aged 8 to 12
years spent watching television on a typical day.
A listing of all households in a certain housing area
having children aged 8 to 12 years was first
constructed. Out of the 100 households in this
listing, a random sample of 20 households was
selected and all children aged 8 to 12 years in the
selected households were interviewed.

The following histogram was obtained for all the
children aged 8 to 12 years interviewed.

Margin note: How many are in the first class? 3

a. Complete the sentence: Based on this histogram, the distribution of number of hours
spent watching TV is unimodal, with a slight skewness to the ___left___.

b. Assuming that all children interviewed are represented in the histogram, what is the total
number of children interviewed?

3 + 6 + 9 + 10 + 4 = 32

c. What proportion of children spent less than 2 hours watching television?

[...] interviewed children? If so, report it. If not, explain why not.

No, it is somewhere between 4 and 5,
but the exact value is not known for sure.


Numerical Summaries of Quantitative Variables

We have discussed some interesting features of a quantitative data set and learned how to look
for them in pictures (graphs). Section 2.5 focuses on numerical summaries of the center and the
spread.

Notation for a generic raw set of data:
$x_1, x_2, x_3, \ldots, x_n$ where n = number of items in the data set, or sample size

Two basic measures of location or center:

Mean: the numerical average value.
We represent the mean of a sample (called a statistic) by

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Median: the middle value when the data are arranged from smallest to largest.

n odd: M = middle observation; n even: M = average of the two middle observations

Try It! French Fries
Weight measurements for 16 small orders of French fries (in grams).

78 72 69 81 63 67 65 75
79 74 71 83 71 79 80 69

What should we do with data first? Graph it!

Based on our histogram, the distribution of weight is unimodal and approximately symmetric,
so computing numerical summaries is reasonable. The weights (in grams) range from the 60s to
the lower 80s, centered around the lower 70s.


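1. Compute the mean weight.

$$\bar{x} = \frac{78 + 72 + 69 + \cdots + 80 + 69}{16} = \frac{1176}{16} = 73.5 \text{ grams}$$

Does 73.5 make sense? (Yes, look at the histogram.) Would 83? (No.)

2. Compute the median weight.
Ordered: 63, 65, 67, 69, 69, 71, 71, 72, 74, 75, 78, 79, 79, 80, 81, 83

(n+1)/2 = (16+1)/2 = 8.5, so average the 8th and 9th observations => (72+74)/2 = 73 grams
Note: half of the observations are above 73 and half are below it.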
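The two calculations can be checked with a short Python sketch. The helper `median_by_position` is a hypothetical name that follows the odd/even rule stated in these notes; the standard-library `statistics` functions are used only as a cross-check.

```python
from statistics import mean, median

# Weights (in grams) of the 16 small orders of French fries.
weights = [78, 72, 69, 81, 63, 67, 65, 75,
           79, 74, 71, 83, 71, 79, 80, 69]

def median_by_position(data):
    """Median: middle value if n is odd, average of the two middle values if n is even."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

x_bar = sum(weights) / len(weights)    # 1176 / 16
m = median_by_position(weights)        # average of the 8th and 9th ordered values

print(x_bar, m)                        # 73.5 73.0
assert x_bar == mean(weights) and m == median(weights)
```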

Median would stay the same. Mean would decrease.

Note: The mean is __sensitive__ to extreme observations.

The median is __resistant__ to extreme observations.

Most graphical displays would have detected such an outlying value.

So the median is better if there are outliers or the distribution is strongly skewed.
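A small Python sketch makes the contrast concrete. The recording error below (63 grams entered as 36 grams) is a made-up assumption for illustration, not a value from the notes.

```python
from statistics import mean, median

weights = [78, 72, 69, 81, 63, 67, 65, 75,
           79, 74, 71, 83, 71, 79, 80, 69]

# Hypothetical error: the smallest order, 63 g, is recorded as 36 g.
with_error = [36 if w == 63 else w for w in weights]

print(mean(weights), median(weights))         # 73.5 73.0
print(mean(with_error), median(with_error))   # 71.8125 73.0
```

The mean drops because every value enters the sum, while the median depends only on the middle of the ordered list and is unchanged.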

Some Pictures: Mean versus Median


Midterms are returned and the average
was reported as 76 out of 100.
How should you feel? Happy to just be above average?
Often what is missing when the average of something is reported is a corresponding measure
of spread or variability. Here we discuss various measures of variation, each useful in some
situations, each with some limitations.

Range:

Range = High value - Low value = Maximum - Minimum

Percentiles: The pth percentile is the value such that p% of the observations fall at or below
that value.

Some common percentiles:
Median: 50th percentile, Q2 or .50
First quartile: 25th percentile, Q1 or .25 (median of values below median)
Third quartile: 75th percentile, Q3 or .75 (median of values above median)

Five-Number Summary:

Variable Name and Units (n = number of observations)
Median:    M
Quartiles: Q1, Q3     <= IQR =>
Extremes:  Min, Max   <= range =>

Provides a quick overview of the data values and information about the center and spread.
Divides the data set into approximate quarters.

Try It! French Fries Data
Ordered: 63, 65, 67, 69, 69, 71, 71, 72, 74, 75, 78, 79, 79, 80, 81, 83

Weight of Fries (in grams) (n = 16 orders)
Median:    73
Quartiles: 69, 79
Extremes:  63, 83

Range: 83 - 63 = 20 grams

IQR: 79 - 69 = 10 grams
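A minimal Python sketch of the five-number summary, taking each quartile as the median of the values below or above the median position, as defined in these notes. The function name `five_number_summary` is hypothetical, and statistical software may use a slightly different quartile rule.

```python
from statistics import median

ordered = [63, 65, 67, 69, 69, 71, 71, 72, 74, 75, 78, 79, 79, 80, 81, 83]

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max), with quartiles as medians of the two halves."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]              # values below the median position
    upper = xs[half + n % 2:]      # values above it (the middle value is skipped if n is odd)
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

low, q1, m, q3, high = five_number_summary(ordered)
iqr = q3 - q1                      # interquartile range
data_range = high - low            # overall range
print(low, q1, m, q3, high, iqr, data_range)
```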


IQR = Q3 - Q1

And confirming these values using R we have:

    > numSummary(FrenchFries[,"Weight"], statistics=c("mean", "sd", "IQR",
    +   "quantiles"), quantiles=c(0,.25,.5,.75,1))
     mean     sd IQR 0% 25% 50% 75% 100%  n
     73.5 6.0663  10 63  69  73  79   83 16

Example: Test Scores
The five-number summary for the distribution of test scores for a very large math class is
provided below:

Test Score (points) (n = 1200 students)
Median:    58
Quartiles: 46, 78
Extremes:  34, 95

1. What is the test score interval containing the lowest 1/4 of the students?

34 to 46 points

[...] scored higher than you?

75%

[...] scored higher than you?

Between 50% and 75%

[...] to get an A on the test?

Need a score of 78 or higher

Boxplots
A boxplot is a graphical representation of the five-number
summary.
Steps:
1. Label an axis with values to cover the minimum and maximum of the data.
2. Make a box with ends at the quartiles Q1 and Q3.
3. Draw a line in the box at the median M.
4. Check for possible outliers using the 1.5*IQR rule and, if any, plot them individually.
5. Extend lines from the ends of the box to the smallest and largest observations that are not possible outliers.

Note: Possible outliers are observations that are more than 1.5*IQR outside the quartiles,
that is, observations that are below Q1 - 1.5*IQR or observations that are above Q3 + 1.5*IQR.


Try It! French Fries Data
Ordered: 63, 65, 67, 69, 69, 71, 71, 72, 74, 75, 78, 79, 79, 80, 81, 83

The five-number summary:

Weight of Fries (in grams) (n = 16 orders)
Median:    73
Quartiles: 69, 79
Extremes:  63, 83

From the boxplot shown, we see there are no points plotted separately, so there are no outliers
by the 1.5(IQR) rule.

Verify there are no outliers using this rule.

IQR = 79 - 69 = 10 grams

1.5*IQR = 1.5(10) = 15 grams

Lower boundary (fence) = Q1 - 1.5*IQR = 69 - 15 = 54

Are there any observations that fall below this lower boundary?
No, so no low outliers.

Upper boundary (fence) = Q3 + 1.5*IQR = 79 + 15 = 94

Are there any observations that fall above this upper boundary?
No, so no high outliers.
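The same check can be written as a few lines of Python. This is a sketch that takes the quartiles 69 and 79 as given from the five-number summary; the variable names are arbitrary.

```python
ordered = [63, 65, 67, 69, 69, 71, 71, 72, 74, 75, 78, 79, 79, 80, 81, 83]
q1, q3 = 69, 79                          # quartiles from the five-number summary

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr             # boundary for low outliers
upper_fence = q3 + 1.5 * iqr             # boundary for high outliers
outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence, outliers)    # 54.0 94.0 []
```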


What if the largest weight of 83 grams was actually 95 grams?

Ordered: 63, 65, 67, 69, 69, 71, 71, 72, 74, 75, 78, 79, 79, 80, 81, 95

Then the five-number summary would be:

Weight of Fries (in grams) (n = 16 orders)
Median:    73
Quartiles: 69, 79
Extremes:  63, 95

The IQR and 1.5*IQR would be the same, so the boundaries for checking for possible
outliers are again 54 and 94.

Now we would have one potential high outlier, the maximum value of 95.

The modified boxplot when we have this one outlier is shown.

Why is the line extending out on the top side now drawn out to just 81?

81 is the largest value in the data set that is not an outlier.
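To illustrate where the whiskers of the modified boxplot end, here is a short Python sketch. The function `whiskers_and_outliers` is a made-up helper, and the quartiles 69 and 79 are taken from the summary above rather than recomputed.

```python
def whiskers_and_outliers(data, q1, q3):
    """Return (low whisker end, high whisker end, outliers) under the 1.5*IQR rule."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    # Whiskers stop at the most extreme observations that are not outliers.
    return min(inside), max(inside), outliers

modified = [63, 65, 67, 69, 69, 71, 71, 72, 74, 75, 78, 79, 79, 80, 81, 95]
low_whisker, high_whisker, outliers = whiskers_and_outliers(modified, 69, 79)

print(low_whisker, high_whisker, outliers)   # 63 81 [95]
```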

Notes on Boxplots:
Side-by-side boxplots are good for ... comparing 2 or more sets of observations.

Watch out: points plotted individually are ... still part of the data set, so don't ignore them!

Can't confirm ... shape from a boxplot only (a histogram is better for showing shape).

[...] (so appropriate credit can be given on exam/quiz).


Try It: Side-by-side boxplots
[...] point scale) was computed for each child. Side-by-side boxplots of the children's standardized [...]

[Figure: side-by-side boxplots grouped by "Do you have breakfast?" (No, Yes)]

a. What is (approx) the lowest [...] have breakfast?

___4.5___ points

b. Complete the following sentence:

Among the children who did not eat breakfast, [...]

c. Consider the following statement: The symmetry in the boxplot for the children not eating [...]

True or False? False

Features of Bell-Shaped Distributions

[...] somewhat symmetric. If we were to draw a curve to smooth out the tops of the bars of the
histogram, it would resemble the shape of a bell, and thus could be called bell-shaped.

One fairly common distribution of measurements with this shape has a special name, called a
normal distribution or normal curve. We will see normal curves in more detail when we study [...]
homogeneous set of measurements), a useful measure of spread is called the standard
deviation. In fact, the mean and the standard deviation are two summary measures that
completely specify a normal curve.


We will refer to it as a kind of "average distance" of the observations from the mean. But it
actually is the square root of the average of the squared deviations of the observations from the
mean. Since that is a bit cumbersome, we like to think of the standard deviation as, roughly,
the average distance the observations fall from the mean. Here is a quick look at the formula
for the standard deviation when the data are a sample from a larger population:

$$s = \text{sample standard deviation} = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

Note: The squared standard deviation, denoted by $s^2$, is called the variance. We emphasize the
standard deviation since it is in the original units.

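Example: Consider this sample of n = 5 scores: 94, 97, 99, 103, 107.
[...] deviates from the mean. Then consider the average of these deviations.

[Figure: the five scores marked on a number line from 90 to 110, with the mean at 100. For example, deviation from mean = 107 - 100 = 7]

How the calculations are done:

| x | x - 100 | (x - 100)^2 |
|---|---|---|
| 94 | -6 | 36 |
| 97 | -3 | 9 |
| 99 | -1 | 1 |
| 103 | 3 | 9 |
| 107 | 7 | 49 |
| Sum: 500 | Sum: 0 | Sum: 104 |

Calculations:
1. Mean: $\bar{x} = 500/5 = 100$ (used in columns 2 & 3)
2. Variance: $s^2 = 104/(5 - 1) = 26$
3. Standard deviation: $s = \sqrt{26} \approx 5.1$

Note the number of decimal places used for s.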
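The column-by-column calculation can be mirrored in a short Python sketch (variable names are arbitrary):

```python
from math import sqrt

scores = [94, 97, 99, 103, 107]
n = len(scores)

x_bar = sum(scores) / n                       # 500 / 5 = 100
deviations = [x - x_bar for x in scores]      # column 2: -6, -3, -1, 3, 7
squared = [d ** 2 for d in deviations]        # column 3: 36, 9, 1, 9, 49

variance = sum(squared) / (n - 1)             # 104 / 4 = 26
s = sqrt(variance)                            # about 5.1

print(sum(deviations), sum(squared), variance, round(s, 1))   # 0.0 104.0 26.0 5.1
```

The deviations always sum to zero, which is why they are squared before averaging.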


Try It! French Fries Data
Weight measurements for 16 small orders of French fries (in grams).

78 72 69 81 63 67 65 75
79 74 71 83 71 79 80 69

The mean was computed earlier to be 73.5. Find the standard deviation for this data.

$$s = \sqrt{\frac{(78 - 73.5)^2 + (72 - 73.5)^2 + \cdots + (69 - 73.5)^2}{16 - 1}} = \sqrt{36.8} \approx 6.1 \text{ grams}$$

[...] we will have a calculator or computer do it for us.
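As one way of letting a computer do it, this Python sketch uses the standard-library `statistics` module, whose `variance` and `stdev` functions divide by n - 1 (the sample formulas).

```python
from statistics import stdev, variance

weights = [78, 72, 69, 81, 63, 67, 65, 75,
           79, 74, 71, 83, 71, 79, 80, 69]

s2 = variance(weights)    # sample variance: 552 / 15
s = stdev(weights)        # sample standard deviation

print(round(s2, 1), round(s, 4))   # 36.8 6.0663
```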

Interpretation:

The weights of small orders of French fries are roughly
__6.1 grams__ away from their mean weight of 73.5 grams, on average.

OR

[...] from their mean weight of 73.5 grams.

Like the mean, s is ... sensitive to extreme observations.

So use the mean and standard deviation for __reasonably symmetric, bell-shaped distributions__.

The five-number summary is better for skewed distributions or if there are outliers.


Data sets are commonly treated as if they represent a sample from a larger population. A
numerical summary based on a sample is called a statistic. The sample mean and sample
standard deviation are two such statistics. However, if you have all measurements for an
entire population, then a numerical summary would be referred to as a parameter.

The symbols for the mean and standard deviation for a population are different, and the
formula for the standard deviation is also slightly different. A population mean is
represented by the Greek letter μ (mu), and a population standard deviation is
represented by the Greek letter σ (sigma). The formula for the population standard
deviation is below.

[...] chapter and beyond.

Population standard deviation:

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$

where N is the size of the population.
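The only difference between the two formulas is the divisor. The Python sketch below contrasts them on five scores (94, 97, 99, 103, 107), treating those scores as an entire population purely for illustration; `pstdev` divides by N and `stdev` divides by n - 1.

```python
from statistics import pstdev, stdev

scores = [94, 97, 99, 103, 107]

sigma = pstdev(scores)    # population formula: sqrt(104 / 5)
s = stdev(scores)         # sample formula:     sqrt(104 / 4)

print(round(sigma, 2), round(s, 2))   # 4.56 5.1
```

Dividing by the smaller number n - 1 always makes the sample standard deviation slightly larger than the population version computed from the same values.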

Empirical Rule
For bell-shaped curves, approximately

68% of the values fall within 1 standard deviation of the mean in either direction.

95% of the values fall within 2 standard deviations of the mean in either direction.

99.7% of the values fall within 3 standard deviations of the mean in either direction.


Try It! Amount of Sleep

The typical amount of sleep per night for college students has a bell-shaped distribution with a
mean of 7 hours and a standard deviation of 1.7 hours.

Work: 7 - 1.7 = 5.3 and 7 + 1.7 = 8.7

Verify the values below that complete the sentences.

Draw a picture of the distribution showing the mean and intervals based on the empirical rule.
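The three empirical-rule intervals for this example can be computed with a small Python sketch. The function name `empirical_rule_intervals` is hypothetical, and the results are rounded to one decimal place.

```python
def empirical_rule_intervals(mean, sd):
    """Map each approximate percentage to the interval mean +/- k standard deviations."""
    return {pct: (round(mean - k * sd, 1), round(mean + k * sd, 1))
            for k, pct in ((1, 68), (2, 95), (3, 99.7))}

intervals = empirical_rule_intervals(7, 1.7)
for pct, (low, high) in intervals.items():
    print(f"about {pct}% of students sleep between {low} and {high} hours")
```

For the sleep data this gives 5.3 to 8.7 hours (68%), 3.6 to 10.4 hours (95%), and 1.9 to 12.1 hours (99.7%).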

Suppose last night you slept 11 hours.
How many standard deviations from the mean are you? (11 - 7)/1.7 = 2.35

Suppose last night you slept only 5 hours.
How many standard deviations from the mean are you? (5 - 7)/1.7 = -1.18

The standard deviation is a useful yardstick for measuring how far an individual value falls from
the mean. The standardized score or z-score is the distance between the observed value and
the mean, measured in terms of number of standard deviations. Values that are above the mean
have positive z-scores, and values that are below the mean have negative z-scores.

Standardized score or z-score:

$$z = \frac{\text{observed value} - \text{mean}}{\text{standard deviation}}$$
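A minimal Python sketch of the z-score formula and its inverse, using the sleep example (mean 7 hours, standard deviation 1.7 hours). The function names are illustrative, not from the notes.

```python
def z_score(observed, mean, sd):
    """Number of standard deviations the observed value lies from the mean."""
    return (observed - mean) / sd

def value_from_z(z, mean, sd):
    """Inverse of z_score: the value that sits z standard deviations from the mean."""
    return mean + z * sd

z_long = z_score(11, 7, 1.7)     # slept 11 hours: above the mean, so positive
z_short = z_score(5, 7, 1.7)     # slept 5 hours: below the mean, so negative

print(round(z_long, 2), round(z_short, 2))   # 2.35 -1.18
```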


Try It! Scores on a Final Exam
Scores on the final in a course have approximately a bell-shaped distribution.
The mean score was 70 points and the standard deviation was 10 points.

What was Rob's score? 70 + 2(10) = 90 points