
# 1692016

Data screening - StatWiki

Data screening
From StatWiki
LESSON: Data Screening (http://www.kolobkreations.com/DataScreening.pptx)
VIDEO TUTORIAL: Data Screening (http://youtu.be/1KuM5e0aFgU)
conduct further statistical analyses. Data must be screened in order to ensure the data is useable, reliable, and valid for testing causal

Contents
1 Missing Data
2 Outliers
2.1 Univariate
2.2 Multivariate
3 Normality
4 Linearity
5 Homoscedasticity
6 Multicollinearity

Missing Data
If you are missing much of your data, this can cause several problems. The most apparent problem is that there simply won't be enough data points to run your analyses. The EFA, CFA, and path models require a certain number of data points in order to compute estimates. This number increases with the complexity of your model. If you are missing several values in your data, the analysis just won't run.
then you will have male-biased data. Perhaps only 50% of the females reported their gender, but 95% of the males reported gender. If you use gender in your causal models, then you will be heavily biased toward males, because you will not end up using the unreported responses.
To find out how many missing values each variable has, in SPSS go to Analyze, then Descriptive Statistics, then Frequencies. Enter the variables in the variables list. Then click OK. The table in the output will show the number of missing values for each variable.
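Outside SPSS, the same per-variable count can be sketched in a few lines of Python. This is only an illustration: the `survey` records, the variable names, and the convention that `None` or an empty string marks a missing response are all assumptions, not something taken from the lesson.

```python
def missing_summary(rows, variables):
    """Return {variable: (n_missing, share_missing)} for a list of dict records.

    Assumption: a response is missing when it is None or an empty string.
    Adjust this to however your own export encodes missing values.
    """
    n = len(rows)
    summary = {}
    for var in variables:
        n_missing = sum(1 for row in rows if row.get(var) in (None, ""))
        summary[var] = (n_missing, n_missing / n if n else 0.0)
    return summary


# Hypothetical survey records, for illustration only.
survey = [
    {"age": 34, "gender": "F", "q1": 4},
    {"age": None, "gender": "M", "q1": 5},
    {"age": 51, "gender": None, "q1": None},
    {"age": 28, "gender": "M", "q1": 3},
]

for var, (count, share) in missing_summary(survey, ["age", "gender", "q1"]).items():
    print(f"{var}: {count} missing ({share:.0%})")
```

The share column is the number to compare against whatever missing-data threshold you settle on.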
The threshold for missing data is flexible, but generally, if you are missing more than 10% of the responses on a particular variable, or from a particular respondent, that variable or respondent may be problematic. There are several ways to deal with problematic variables.
Just don't use that variable.
If it makes sense, impute the missing values. This should only be done for continuous or interval data (like age or Likert-scale responses), not for categorical data (like gender).
however, if the number of missing responses is greater than 10%.
To impute values in SPSS, go to Transform, Replace Missing Values, then select the variables that need imputing, and hit OK. See the screenshots below. In this screenshot, I use the Mean replacement method. But there are other options, including Median replacement. Typically with Likert-type data, you want to use median replacement, because means are less meaningful in these scenarios. For more information on when to use which type of imputation, refer to: Lynch (2003)
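As a rough Python counterpart to the median replacement option, here is a minimal sketch. It assumes missing responses are stored as `None`; the `likert_item` values are made up for illustration, and this is not a reproduction of what SPSS does internally.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing observed, so nothing to impute from
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical responses to one 5-point Likert item.
likert_item = [4, 5, None, 3, 4, None, 2]
print(impute_median(likert_item))
```

With an even number of observed values, `median` can return a half-point such as 3.5; `statistics.median_low` is one way to keep the fill value on the original scale.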

http://statwiki.kolobkreations.com/index.php?title=Data_screening

simply do not have the data for that person. My recommendation is to first determine which variables will actually be used in your model (often we collect data on more variables than we actually end up using in our model), then determine if the respondent is problematic. If so, then remove that respondent from the analysis.
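The recommendation above (restrict the check to the variables that will be in the model, then flag respondents) could look something like this in Python. The variable names, the `None` coding for missing values, and the default 10% cut-off are assumptions for the sake of the example.

```python
def problematic_respondents(rows, model_variables, threshold=0.10):
    """Return the indices of respondents who are missing more than
    `threshold` of the variables that will actually be used in the model."""
    flagged = []
    for index, row in enumerate(rows):
        n_missing = sum(1 for var in model_variables if row.get(var) is None)
        if n_missing / len(model_variables) > threshold:
            flagged.append(index)
    return flagged


# Hypothetical data: "unused" was collected but is not part of the model.
model_vars = ["q1", "q2", "q3", "q4", "q5"]
respondents = [
    {"q1": 4, "q2": 5, "q3": 3, "q4": 4, "q5": 2, "unused": None},
    {"q1": None, "q2": None, "q3": 3, "q4": 4, "q5": 2, "unused": 1},
    {"q1": 1, "q2": 2, "q3": 3, "q4": 4, "q5": 5, "unused": None},
]

flagged = problematic_respondents(respondents, model_vars)
kept = [row for i, row in enumerate(respondents) if i not in flagged]
print(flagged, len(kept))
```

Note that the first and third respondents are kept even though they skipped `unused`, because that variable is not in the model.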

Outliers
Outliers can influence your results, pulling the mean away from the median. Two types of outliers exist: outliers for individual variables, and outliers for the model.

Univariate
To detect outliers on each variable, just produce a boxplot in SPSS (as demonstrated in the video). Outliers will appear at the extremes, and will be labeled, as in the figure below. If you have a really high sample size, then you may want to remove the outliers.
extreme (1 or 5) is not really representative outlier behavior.
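For readers without SPSS, the usual boxplot convention (points more than 1.5 interquartile ranges beyond the quartiles) can be sketched in Python. The 1.5 multiplier and the quartile method used by `statistics.quantiles` are assumptions here; SPSS computes its hinges slightly differently, so borderline cases may not match. The `ages` data are invented.

```python
from statistics import quantiles  # available from Python 3.8


def boxplot_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical ages with one implausible entry.
ages = [23, 25, 27, 29, 30, 31, 33, 35, 36, 95]
print(boxplot_outliers(ages))
```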

Another type of outlier is an unengaged respondent. Sometimes respondents will enter '3, 3, 3, 3, ...' for every single survey item. This participant was clearly not engaged, and their responses will throw off your results. Other patterns indicative of unengaged respondents are '1, 2, 3, 4, 5, 1, 2, ...' or '1, 1, 1, 1, 5, 5, 5, 5, 1, 1, ...'. There are multiple ways to identify and eliminate these unengaged respondents:
third and two thirds of the way through my surveys. I am always astounded at how many I catch this way...
responded strongly agree to both of these items, then they were not paying attention: "I am very hungry", "I don't have much appetite right now".
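Two of the checks described above lend themselves to a quick script. The sketch below flags respondents whose answers have zero spread (the '3, 3, 3, 3' pattern) and respondents who strongly agree with two contradictory items. The item names, the coding of "strongly agree" as 5, and the zero-spread cut-off are assumptions; patterns like '1, 2, 3, 4, 5, 1, 2' have plenty of spread and would not be caught this way.

```python
from statistics import pstdev


def is_straight_liner(responses, max_sd=0.0):
    """True when a respondent's answers vary no more than max_sd."""
    return pstdev(responses) <= max_sd


def failed_contradiction_check(row, item_a, item_b, strongly_agree=5):
    """True when two opposite items both received 'strongly agree'."""
    return row[item_a] == strongly_agree and row[item_b] == strongly_agree


# Hypothetical respondent who answered 3 to everything.
print(is_straight_liner([3, 3, 3, 3, 3, 3]))

# Hypothetical pair of contradictory attention-check items.
answers = {"very_hungry": 5, "no_appetite": 5}
print(failed_contradiction_check(answers, "very_hungry", "no_appetite"))
```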

Multivariate
Multivariate outliers refer to records that do not fit the standard sets of correlations exhibited by the other records in the dataset, with regard to your causal model. So, if all but one person in the dataset reports that diet has a positive effect on weight loss, but this one guy reports that he gains weight when he diets, then his record would be considered a multivariate outlier. To detect these influential multivariate outliers, you need to calculate the Mahalanobis d-squared. This is a simple matter in AMOS. See the video tutorial for
more will show up. It is a slippery slope.
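To make the Mahalanobis d-squared less of a black box, here is a small Python sketch for the two-variable case, where the covariance matrix can be inverted by hand. It is an illustration under stated assumptions (two variables only, sample covariance, invented diet data); AMOS works across all observed variables in the model and may report slightly different quantities.

```python
from statistics import mean


def mahalanobis_d2(xs, ys):
    """Squared Mahalanobis distance of each (x, y) record from the centroid,
    using the sample covariance matrix of the two variables."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    d2 = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # (dx, dy) times the inverse covariance matrix times (dx, dy)
        d2.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return d2


# Hypothetical data: weeks on a diet and kilograms lost.
# The last person reports gaining weight while dieting.
diet_weeks = [1, 2, 3, 4, 5, 6, 7, 8]
weight_loss = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 7.2, -2.0]

d2 = mahalanobis_d2(diet_weeks, weight_loss)
worst = max(range(len(d2)), key=d2.__getitem__)
print(worst, round(d2[worst], 2))
```

The record with the largest d-squared is the one that departs most from the pattern the other records share. Deciding how large is too large is a separate judgment; comparing d-squared against a chi-square distribution is a common convention.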
A more conservative approach that I would recommend is to examine the influential cases indicated by the Cook's distance. Here is a video explaining what this is and how to do it. This video also discusses multicollinearity.
VIDEO TUTORIAL: Multivariate Assumptions (https://youtu.be/J2EkjIeKPE)
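For a sense of what the Cook's distance measures, the sketch below computes it in Python for a regression with a single predictor, using the textbook formula (squared residual scaled by the mean squared error, weighted by leverage). The one-predictor restriction, the invented data, and the cut-offs mentioned afterward are assumptions for illustration; the video walks through the SPSS route.

```python
from statistics import mean


def cooks_distance(xs, ys):
    """Cook's distance for each case in a simple regression of y on x."""
    n = len(xs)
    p = 2  # estimated parameters: intercept and slope
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mse = sum(e * e for e in residuals) / (n - p)
    distances = []
    for x, e in zip(xs, residuals):
        leverage = 1 / n + (x - mx) ** 2 / sxx
        distances.append((e * e / (p * mse)) * leverage / (1 - leverage) ** 2)
    return distances


# Hypothetical data: the last case runs against the trend of the others.
diet_weeks = [1, 2, 3, 4, 5, 6, 7, 8]
weight_loss = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 7.2, -2.0]

cooks = cooks_distance(diet_weeks, weight_loss)
most_influential = max(range(len(cooks)), key=cooks.__getitem__)
print(most_influential, round(cooks[most_influential], 2))
```

Commonly quoted cut-offs are a distance above 1, or above 4 divided by the sample size. Both are conventions rather than hard rules.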

Normality
Normality refers to the distribution of the data for a particular variable. We usually assume that the data is normally distributed, even though it usually is not! Normality is assessed in many different ways: shape, skewness, and kurtosis (flat/peaked).
Shape: To discover the shape of the distribution in SPSS, build a histogram (as shown in the video tutorial) and plot the normal curve. If the histogram does not match the normal curve, then you likely have normality issues. You can also look at the boxplot to determine normality.
Skewness: Skewness means that the responses did not fall into a normal distribution, but were heavily weighted toward one end of the scale. Income is an example of a commonly right-skewed variable: most people make between 20 and 70 thousand dollars in the USA, but there is a smaller group that makes between 70 and 100, and an even smaller group that makes between 100 and 150, and a much smaller group that makes between 150 and 250, etc. all the way up to Bill Gates and Mark
There are two rules on Skewness:
(1) If your skewness value is greater than 1 then you are positive (right) skewed, if it is less than -1 you are negative (left) skewed, if it is in between, then you are fine. Some published thresholds are a bit more liberal and allow for up to +/-2.2,
(2) If the absolute value of the skewness is less than three times the standard error, then you are fine; otherwise you are skewed.
Using these rules, we can see from the table below, that all three variables are fine using the first rule, but using the second rule, they are all negative (left) skewed.
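Both rules are easy to apply by script once you have the skewness and its standard error. The sketch below uses the adjusted sample skewness and the standard-error formula that, to my knowledge, match what SPSS reports; treat that correspondence, the function names, and the invented income figures as assumptions.

```python
from math import sqrt
from statistics import mean, stdev


def skewness(values):
    """Adjusted (bias-corrected) sample skewness."""
    n = len(values)
    m, s = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in values)


def skewness_se(n):
    """Standard error of the skewness for a sample of size n."""
    return sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def check_skewness(values, cutoff=1.0):
    """Apply rule 1 (absolute cut-off) and rule 2 (three standard errors)."""
    g1 = skewness(values)
    se = skewness_se(len(values))
    return {
        "skewness": g1,
        "std_error": se,
        "rule_1_ok": abs(g1) <= cutoff,
        "rule_2_ok": abs(g1) < 3 * se,
    }


# Hypothetical incomes in thousands of dollars, with one very high earner.
incomes = [25, 30, 32, 35, 40, 45, 50, 60, 70, 400]
print(check_skewness(incomes))
```

Passing `cutoff=2.2` applies the more liberal threshold instead of 1.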

Skewness looks like this:

Kurtosis:
Kurtosis refers to the outliers of the distribution of data. Data that have outliers have large kurtosis. Data without outliers have low kurtosis. The kurtosis (excess kurtosis) of the normal distribution is 0. The rule for evaluating whether or not your kurtosis is problematic is the same as rule two above:
If the absolute value of the kurtosis is less than three times the standard error, then the kurtosis is not significantly different from that of the normal distribution; otherwise you have kurtosis issues. Although a looser rule is an overall kurtosis score of 2.200 or less (rather than 1.00) (Sposito et al., 1983).
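The same kind of check works for kurtosis. This Python sketch computes the bias-corrected excess kurtosis and its standard error using the standard sample formulas, which I believe are the ones SPSS uses; the data and function names are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def excess_kurtosis(values):
    """Bias-corrected sample excess kurtosis (0 for a normal distribution)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    z4 = sum(((v - m) / s) ** 4 for v in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * z4 - correction


def kurtosis_se(n):
    """Standard error of the excess kurtosis for a sample of size n."""
    se_skew = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return 2 * se_skew * sqrt((n * n - 1) / ((n - 3) * (n + 5)))


def kurtosis_ok(values):
    """True when |kurtosis| is below three standard errors."""
    return abs(excess_kurtosis(values)) < 3 * kurtosis_se(len(values))


# Hypothetical scores: evenly spread, then the same scores plus one outlier.
flat = list(range(1, 11))
with_outlier = flat + [50]
print(round(excess_kurtosis(flat), 2), round(excess_kurtosis(with_outlier), 2))
```

Adding the single outlier moves the kurtosis from negative (flat) to strongly positive, which is the behavior described above.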
Kurtosis looks like this:
Bimodal:
One other issue you may run into with the distribution of your data is a bimodal distribution. This means that the data has multiple (two) peaks, rather than peaking at the mean. This may indicate there are moderating variables affecting this data. A bimodal distribution looks like this:

Transformations:
VIDEO TUTORIAL: Transformations (http://youtu.be/twwT6FgwlAo)
When you have extremely non-normal data, it will influence your regressions in SPSS and AMOS. In such cases, if you have non-Likert-scale variables (so, variables like age, income, revenue, etc.), you can transform them prior to including them in your model.
He also references his article in the video.
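As one concrete possibility, a logarithmic transformation is a common choice for right-skewed variables such as income. The sketch below is an assumption on my part about which transformation to show (the surviving text does not name one), and the shift of 1 is simply a convention for keeping zero values defined.

```python
from math import log


def log_transform(values, shift=1.0):
    """Natural-log transform; the shift keeps zero values defined."""
    return [log(v + shift) for v in values]


# Hypothetical incomes in thousands of dollars.
incomes = [20, 35, 50, 70, 150, 900]
transformed = log_transform(incomes)
print([round(t, 2) for t in transformed])
```

The transform keeps the ordering of respondents but pulls the long right tail in, so the extreme earner no longer dominates the scale.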

Linearity
Perhaps the most elegant approach (easy and clear-cut, yet rigorous) is the deviation from linearity test available in the ANOVA test in SPSS. In SPSS go to Analyze, Compare Means, Means. Put the composite IVs and DVs in the lists, then click on Options, and select "Test for Linearity". Then in the ANOVA table in the output window, if the Sig value for Deviation from Linearity is less than 0.05, the relationship between IV and DV is not linear, and thus is problematic (see the screenshots below). Issues of linearity can sometimes be fixed by removing outliers (if the significance is borderline), or through transforming the data. In the screenshot below, we can see that the first relationship is linear (Sig = .268), but the second relationship is nonlinear (Sig = .003).
If this test turns up odd results, then simply perform an OLS linear regression between each IV->DV pair. If the sig value is less than 0.05, then the relationship can be considered "sufficiently" linear. While this approach is somewhat less rigorous, it has the benefit of working every time! You can also do a curvilinear regression ("curve estimation") to see if the relationship is more linear than nonlinear.
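The logic behind the deviation from linearity test can be sketched in Python. The idea is to treat each distinct IV value as a group, split the between-group variation into a linear part and a leftover part, and compare the leftover part with the within-group variation. The sketch returns only the F statistic and its degrees of freedom; turning that into a Sig value needs an F distribution (for example `scipy.stats.f.sf`), which is left out to keep the example dependency-free. The data are invented, and the IV must have repeated values for the grouping to work.

```python
from collections import defaultdict
from statistics import mean


def deviation_from_linearity_f(xs, ys):
    """Return (F, df_deviation, df_within) for deviation from linearity."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    n, k = len(xs), len(groups)
    mx, grand = mean(xs), mean(ys)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((y - mean(g)) ** 2 for g in groups.values() for y in g)
    sxy = sum((x - mx) * (y - grand) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    ss_linear = sxy ** 2 / sxx  # variation explained by a straight line
    ss_deviation = ss_between - ss_linear
    df_deviation, df_within = k - 2, n - k
    f_stat = (ss_deviation / df_deviation) / (ss_within / df_within)
    return f_stat, df_deviation, df_within


# Hypothetical IV with five levels, two respondents per level.
iv = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
dv_linear = [1.9, 2.1, 3.9, 4.1, 5.9, 6.1, 7.9, 8.1, 9.9, 10.1]
dv_curved = [0.9, 1.1, 3.9, 4.1, 8.9, 9.1, 15.9, 16.1, 24.9, 25.1]

f_linear = deviation_from_linearity_f(iv, dv_linear)[0]
f_curved = deviation_from_linearity_f(iv, dv_curved)[0]
print(round(f_linear, 3), round(f_curved, 1))
```

A large F for the curved relationship corresponds to a small Sig value in the SPSS table, which is the signal that the relationship is not linear.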

Homoscedasticity
VIDEO TUTORIAL: Plotting Homoscedasticity (http://youtu.be/WrQ1O_He63Q?hd=1)
id=HVmsxuaQl2oC&pg=PA581&dq=homoscedasticity+residual+scatterplots&hl=en&sa=X&ei=iF5TUeiRGcnOyQGg
4HICQ&ved=0CEIQ6AEwAg#v=onepage&q=homoscedasticity%20residual%20scatterplots&f=false)
Homoscedasticity is a nasty word that means that the variable's residual (error) exhibits consistent variance across different levels of the variable. There are good reasons for desiring this. For more information, see Hair et al. 2010 chapter 2. :) A simple way to determine if a relationship is homoscedastic is to do a simple scatterplot with the variable on the y-axis and the variable's residual on the x-axis. To see a step-by-step guide on how to do this, watch the video tutorial. If the plot comes up with a consistent pattern as in the figure below, then we are good; we have homoscedasticity! If there is not a consistent pattern, then the relationship is considered heteroskedastic. This can be fixed by transforming the data or by splitting the data by subgroups (such as two groups for gender). You

Schools of thought on homoscedasticity are still out. Some suggest that evidence of heteroskedasticity is not a problem (and is
this test unless specifically requested to by a reviewer.
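If you want a number to go with the scatterplot, one crude option is to compare the spread of the residuals at low and high values of the predictor. The Python sketch below does that for a one-predictor regression. It is a rough stand-in for eyeballing the plot, not a formal test and not the procedure from the video; the data and the idea of splitting at the midpoint are assumptions.

```python
from statistics import mean


def ols_residuals(xs, ys):
    """Residuals from a simple least-squares regression of y on x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def residual_spread_ratio(xs, ys):
    """Mean squared residual in the upper half of x divided by the lower half.

    Ratios near 1 suggest consistent spread; ratios far from 1 suggest the
    spread changes with x.
    """
    residuals = ols_residuals(xs, ys)
    ordered = [r for _, r in sorted(zip(xs, residuals))]
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]
    return mean(r * r for r in upper) / mean(r * r for r in lower)


# Hypothetical data: constant noise versus noise that grows with x.
xs = list(range(1, 11))
signs = [1, -1] * 5
steady = [2 * x + s for x, s in zip(xs, signs)]
fanning = [2 * x + s * 0.2 * x for x, s in zip(xs, signs)]

ratio_steady = residual_spread_ratio(xs, steady)
ratio_fanning = residual_spread_ratio(xs, fanning)
print(round(ratio_steady, 2), round(ratio_fanning, 2))
```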

Multicollinearity
Multicollinearity is not desirable. It means that the variance our independent variables explain in our dependent variable is overlapping, and thus the independent variables are not each explaining unique variance in the dependent variable. The way to check this is to calculate a Variance Inflation Factor (VIF) for each independent variable after running a multivariate regression. The rules of thumb for the VIF are as follows:
VIF < 3: not a problem
VIF > 3: potential problem
VIF > 5: very likely problem
VIF > 10: definitely problem
The tolerance value in SPSS is directly related to the VIF (tolerance = 1/VIF), and values less than 0.10 are strong indications of multicollinearity issues. For particulars on how to calculate the VIF in SPSS, watch the step-by-step video tutorial. The easiest method for fixing
much unique explanation of variance anyway.
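To show where the VIF comes from, here is a Python sketch for the simplest case of two independent variables, where the VIF of each is 1 / (1 - r^2) and r is the correlation between them. With more predictors you would regress each IV on all the others and use that R-squared instead. The data and function names are invented, and treating a VIF of exactly 3 as "not a problem" is my own choice, since the rules of thumb above do not cover that boundary.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF shared by both IVs when the model has exactly two IVs."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r * r)


def interpret_vif(vif):
    """Map a VIF onto the rules of thumb listed above."""
    if vif > 10:
        return "definitely problem"
    if vif > 5:
        return "very likely problem"
    if vif > 3:
        return "potential problem"
    return "not a problem"


# Hypothetical IVs that overlap heavily.
hours_studied = [2, 4, 5, 7, 8, 10]
pages_read = [21, 39, 52, 68, 83, 99]

vif = vif_two_predictors(hours_studied, pages_read)
print(round(vif, 1), round(1 / vif, 3), interpret_vif(vif))
```

The second printed value is the tolerance, which is the number to compare against the 0.10 guideline.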
O'Brien, R. M. 2007. A Caution Regarding Rules of Thumb for Variance Inflation Factors. Quality & Quantity, 41, 673
Retrieved from "http://statwiki.kolobkreations.com/index.php?title=Data_screening&oldid=350425"

This page was last modified on 27 April 2016, at 11:12.