1692016

Data screening - StatWiki

Data screening
From StatWiki
LESSON: Data Screening (http://www.kolobkreations.com/DataScreening.pptx)
VIDEO TUTORIAL: Data Screening (http://youtu.be/1KuM5e0aFgU)
Data screening (sometimes referred to as "data screaming") is the process of ensuring your data is clean and ready to go before you conduct further statistical analyses. Data must be screened in order to ensure the data is useable, reliable, and valid for testing causal theory. In this section I will focus on six specific issues that need to be addressed when cleaning (not cooking) your data.
Do you know of some citations that could be used to support the topics and procedures discussed in this section? Please email them to me (mailto:james.eric.gaskin@gmail.com) with the name of the section, procedure, or subsection that they support. Thanks!

Contents
1 Missing Data
2 Outliers
2.1 Univariate
2.2 Multivariate
3 Normality
4 Linearity
5 Homoscedasticity
6 Multicollinearity

Missing Data
If you are missing much of your data, this can cause several problems. The most apparent problem is that there simply won't be enough data points to run your analyses. The EFA, CFA, and path models require a certain number of data points in order to compute estimates. This number increases with the complexity of your model. If you are missing several values in your data, the analysis just won't run.
Additionally, missing data might represent bias issues. Some people may not have answered particular questions in your survey because of some common issue. For example, if you asked about gender, and females are less likely to report their gender than males, then you will have male-biased data. Perhaps only 50% of the females reported their gender, but 95% of the males reported gender. If you use gender in your causal models, then you will be heavily biased toward males, because you will not end up using the unreported responses.
To find out how many missing values each variable has, in SPSS go to Analyze, then Descriptive Statistics, then Frequencies. Enter the variables in the variables list. Then click OK. The table in the output will show the number of missing values for each variable.
The threshold for missing data is flexible, but generally, if you are missing more than 10% of the responses on a particular variable, or from a particular respondent, that variable or respondent may be problematic. There are several ways to deal with problematic variables.
Just don't use that variable.
If it makes sense, impute the missing values. This should only be done for continuous or interval data (like age or Likert-scale responses), not for categorical data (like gender).
If your dataset is large enough, just don't use the responses that had missing values for that variable. This may create a bias, however, if the number of missing responses is greater than 10%.
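The 10% guideline can also be checked outside SPSS. The following is a minimal Python sketch, assuming the responses are held as a list of dictionaries with None marking a missing answer; the sample records and the helper name missing_share are hypothetical.

```python
# Hypothetical survey records; None marks a missing response.
responses = [
    {"age": 34, "gender": "F", "q1": 4},
    {"age": None, "gender": "M", "q1": 5},
    {"age": 29, "gender": None, "q1": None},
    {"age": 41, "gender": "M", "q1": 3},
]

THRESHOLD = 0.10  # the flexible 10% guideline from the text


def missing_share(values):
    """Return the fraction of entries in `values` that are None."""
    values = list(values)
    return sum(v is None for v in values) / len(values)


# Share of missing responses per variable (column).
variables = list(responses[0].keys())
by_variable = {v: missing_share(r[v] for r in responses) for v in variables}

# Share of missing responses per respondent (row).
by_respondent = [missing_share(r.values()) for r in responses]

flagged_variables = [v for v, s in by_variable.items() if s > THRESHOLD]
flagged_respondents = [i for i, s in enumerate(by_respondent) if s > THRESHOLD]

print(by_variable)
print(flagged_variables, flagged_respondents)
```

Checking both columns and rows mirrors the two cases in the text: a problematic variable and a problematic respondent.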
To impute values in SPSS, go to Transform, Replace Missing Values, then select the variables that need imputing, and hit OK. See the screenshots below. In this screenshot, I use the Mean replacement method. But there are other options, including Median replacement. Typically with Likert-type data, you want to use median replacement, because means are less meaningful in these scenarios. For more information on when to use which type of imputation, refer to: Lynch (2003)
(http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.177.857&rep=rep1&type=pdf)
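To illustrate the difference between mean and median replacement, here is a small Python sketch. It assumes None marks a missing value and replaces it with the median or mean of all observed values; the helper name impute and the sample item are hypothetical, and this is not a reproduction of the SPSS procedure.

```python
from statistics import mean, median


def impute(values, method="median"):
    """Replace None entries with the median (or mean) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if method == "median" else mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical 5-point Likert item with two missing responses.
likert_item = [4, 5, None, 2, 4, None, 1]

print(impute(likert_item))          # median replacement
print(impute(likert_item, "mean"))  # mean replacement
```

With this sample the median fill is 4, a value a respondent could actually have chosen, while the mean fill is 3.2, which is not a point on the scale. That is one way to see why the median is usually preferred for Likert-type items.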

http://statwiki.kolobkreations.com/index.php?title=Data_screening

Handling problematic respondents is somewhat more difficult. If a respondent did not answer a large portion of the questions, their other responses may be useless when it comes to testing causal models. For example, if they answered questions about diet, but not about weight loss, for this individual we cannot test a causal model that argues that diet has a positive effect on weight loss. We simply do not have the data for that person. My recommendation is to first determine which variables will actually be used in your model (often we collect data on more variables than we actually end up using in our model), then determine if the respondent is problematic. If so, then remove that respondent from the analysis.

Outliers
Outliers can influence your results, pulling the mean away from the median. Two types of outliers exist: outliers for individual variables, and outliers for the model.

Univariate
VIDEO TUTORIAL: Detecting Univariate Outliers (http://www.youtube.com/watch?v=vB0WMDUlJQ)
To detect outliers on each variable, just produce a boxplot in SPSS (as demonstrated in the video). Outliers will appear at the extremes, and will be labeled, as in the figure below. If you have a really high sample size, then you may want to remove the outliers. If you are working with a smaller dataset, you may want to be less liberal about deleting records. However, this is a tradeoff, because outliers will influence small datasets more than large ones. Lastly, outliers do not really exist in Likert scales. Answering at the extreme (1 or 5) is not really representative outlier behavior.
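For readers without SPSS, the points a boxplot labels as outliers can be approximated in code. This is a sketch in Python, assuming the conventional fence of 1.5 times the interquartile range beyond the quartiles; the helper name iqr_outliers and the sample ages are hypothetical, and quartile definitions differ slightly between packages, so borderline cases may not match SPSS exactly.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # quartile method differs by software
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical ages with one implausible entry.
ages = [23, 25, 27, 29, 30, 31, 33, 35, 36, 38, 120]
print(iqr_outliers(ages))
```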

Another type of outlier is an unengaged respondent. Sometimes respondents will enter '3, 3, 3, 3, ...' for every single survey item. This participant was clearly not engaged, and their responses will throw off your results. Other patterns indicative of unengaged respondents are '1, 2, 3, 4, 5, 1, 2, ...' or '1, 1, 1, 1, 5, 5, 5, 5, 1, 1, ...'. There are multiple ways to identify and eliminate these unengaged respondents:
Include attention traps that request the respondent to "answer somewhat agree for this item if you are paying attention". I usually include two of these in opposite directions (i.e., one says somewhat agree and one says somewhat disagree) at about a third and two-thirds of the way through my surveys. I am always astounded at how many I catch this way...
See if the participant answered reverse-coded questions in the same direction as normal questions. For example, if they responded strongly agree to both of these items, then they were not paying attention: "I am very hungry", "I don't have much appetite right now".
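Two of the checks described above can be sketched in Python. The code assumes 5-point Likert responses stored as one list per respondent, flags respondents whose answers show no variation at all (the '3, 3, 3, 3, ...' pattern), and flags respondents who strongly agree with both an item and its reverse-coded counterpart. The data, the helper names, and the choice of which items form the reverse-coded pair are all hypothetical.

```python
from statistics import pstdev

# Hypothetical 5-point Likert responses, one list per respondent.
respondents = {
    "r1": [3, 3, 3, 3, 3, 3],
    "r2": [4, 2, 5, 3, 4, 1],
    "r3": [5, 5, 4, 5, 5, 5],
}


def straight_lined(answers, min_sd=0.0):
    """True when the answers vary no more than min_sd (e.g. '3, 3, 3, ...')."""
    return pstdev(answers) <= min_sd


def contradicts(item, reversed_item, agree=5):
    """True when both an item and its reverse-coded twin get 'strongly agree'."""
    return item == agree and reversed_item == agree


flagged = [rid for rid, a in respondents.items() if straight_lined(a)]
print(flagged)

# Assumes the first and last items are a normal / reverse-coded pair.
inconsistent = [rid for rid, a in respondents.items() if contradicts(a[0], a[-1])]
print(inconsistent)
```

A zero standard deviation only catches the constant pattern. Cycling patterns such as '1, 2, 3, 4, 5, 1, 2, ...' would need a separate check or a visual inspection.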

Multivariate
VIDEO TUTORIAL: Detecting Multivariate Influential Outliers (http://www.youtube.com/watch?v=0vtgynhkH60)
Multivariate outliers refer to records that do not fit the standard sets of correlations exhibited by the other records in the dataset, with regard to your causal model. So, if all but one person in the dataset reports that diet has a positive effect on weight loss, but this one guy reports that he gains weight when he diets, then his record would be considered a multivariate outlier. To detect these influential multivariate outliers, you need to calculate the Mahalanobis d-squared. This is a simple matter in AMOS. See the video tutorial for the particulars. As a warning, however, I almost never address multivariate outliers, as it is very difficult to justify removing them just because they don't match your theory. Additionally, you will nearly always find multivariate outliers; even if you remove them, more will show up. It is a slippery slope.
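To make the Mahalanobis d-squared concrete, here is a minimal Python sketch for the two-variable case, where the inverse of the 2x2 covariance matrix can be written out by hand. The function name and the diet and weight-loss numbers are hypothetical, and this is not the AMOS procedure; it only ranks records by their distance from the centroid.

```python
from statistics import mean


def mahalanobis_d2(xs, ys):
    """Squared Mahalanobis distance of each (x, y) record from the centroid."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Sample variances and covariance (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    return [
        ((x - mx) ** 2 * syy - 2 * (x - mx) * (y - my) * sxy + (y - my) ** 2 * sxx) / det
        for x, y in zip(xs, ys)
    ]


# Hypothetical data: diet adherence score and weight lost (kg).
diet = [1, 2, 3, 4, 5, 6, 7, 8]
loss = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, -3.0]  # last record runs against the trend

d2 = mahalanobis_d2(diet, loss)
most_extreme = max(range(len(d2)), key=d2.__getitem__)
print(most_extreme, round(d2[most_extreme], 2))
```

The record that gains weight while dieting gets the largest d-squared. A formal cutoff would typically compare each value against a chi-square distribution, which is not shown here.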
A more conservative approach that I would recommend is to examine the influential cases indicated by the Cook's distance. Here is a video explaining what this is and how to do it. This video also discusses multicollinearity.
VIDEO TUTORIAL: Multivariate Assumptions (https://youtu.be/J2EkjIeKPE)

Normality
VIDEO TUTORIAL: Detecting Normality Issues (http://www.youtube.com/watch?v=w8wf6lBh8M)
Normality refers to the distribution of the data for a particular variable. We usually assume that the data is normally distributed, even though it usually is not! Normality is assessed in many different ways: shape, skewness, and kurtosis (flat/peaked).
Shape: To discover the shape of the distribution in SPSS, build a histogram (as shown in the video tutorial) and plot the normal curve. If the histogram does not match the normal curve, then you likely have normality issues. You can also look at the boxplot to determine normality.
Skewness: Skewness means that the responses did not fall into a normal distribution, but were heavily weighted toward one end of the scale. Income is an example of a commonly right-skewed variable: most people make between 20 and 70 thousand dollars in the USA, but there is a smaller group that makes between 70 and 100, and an even smaller group that makes between 100 and 150, and a much smaller group that makes between 150 and 250, etc., all the way up to Bill Gates and Mark Zuckerberg. Addressing skewness may require transformations of your data (if continuous), or removing influential outliers.
There are two rules on skewness:
(1) If your skewness value is greater than 1 then you are positive (right) skewed, if it is less than -1 you are negative (left) skewed, if it is in between, then you are fine. Some published thresholds are a bit more liberal and allow for up to +/-2.2, instead of +/-1.
(2) If the absolute value of the skewness is less than three times the standard error, then you are fine; otherwise you are skewed.
Using these rules, we can see from the table below that all three variables are fine using the first rule, but using the second rule, they are all negative (left) skewed.
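The two skewness rules can be applied in a few lines of Python. This sketch assumes the adjusted sample skewness and its standard error as commonly reported by statistical packages; the function names and the income figures are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def skewness(values):
    """Adjusted sample skewness (G1)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in values)


def skewness_se(n):
    """Standard error of sample skewness for a sample of size n."""
    return sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def check_skew(values, cutoff=1.0):
    """Apply both rules; each flag is True when the variable passes."""
    g1 = skewness(values)
    rule1_ok = abs(g1) <= cutoff
    rule2_ok = abs(g1) < 3 * skewness_se(len(values))
    return g1, rule1_ok, rule2_ok


# Hypothetical incomes in thousands of dollars, with a long right tail.
income = [22, 25, 28, 30, 33, 35, 38, 41, 45, 52, 60, 75, 110, 240]
g1, rule1_ok, rule2_ok = check_skew(income)
print(round(g1, 2), rule1_ok, rule2_ok)
```

Passing cutoff=2.2 to check_skew applies the more liberal published threshold instead of 1.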

Skewness looks like this:

Kurtosis:
Kurtosis refers to the outliers of the distribution of data. Data that have outliers have large kurtosis. Data without outliers have low kurtosis. The kurtosis (excess kurtosis) of the normal distribution is 0. The rule for evaluating whether or not your kurtosis is problematic is the same as rule two above:
If the absolute value of the kurtosis is less than three times the standard error, then the kurtosis is not significantly different from that of the normal distribution; otherwise you have kurtosis issues. A looser rule, however, is an overall kurtosis score of 2.200 or less (rather than 1.00) (Sposito et al., 1983).
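The kurtosis rule can be sketched the same way in Python. The code assumes the adjusted sample excess kurtosis and its standard error as commonly reported by statistical packages; the function names and the scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def excess_kurtosis(values):
    """Adjusted sample excess kurtosis (G2); 0 for a normal distribution."""
    n = len(values)
    m, s = mean(values), stdev(values)
    fourth = sum(((v - m) / s) ** 4 for v in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


def kurtosis_se(n):
    """Standard error of sample excess kurtosis for a sample of size n."""
    skew_se = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return 2 * skew_se * sqrt((n * n - 1) / ((n - 3) * (n + 5)))


# Hypothetical scores: a tight cluster plus two extreme values.
scores = [48, 49, 50, 50, 50, 51, 51, 52, 49, 50, 51, 50, 20, 85]
g2 = excess_kurtosis(scores)
passes = abs(g2) < 3 * kurtosis_se(len(scores))
print(round(g2, 2), passes)
```

The two extreme scores drive the kurtosis well above three standard errors, which matches the description of kurtosis as a reflection of outliers.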
Kurtosis looks like this:

Bimodal:
One other issue you may run into with the distribution of your data is a bimodal distribution. This means that the data has multiple (two) peaks, rather than peaking at the mean. This may indicate there are moderating variables affecting this data. A bimodal distribution looks like this:

Transformations:
VIDEO TUTORIAL: Transformations (http://youtu.be/twwT6FgwlAo)
When you have extremely non-normal data, it will influence your regressions in SPSS and AMOS. In such cases, if you have non-Likert-scale variables (so, variables like age, income, revenue, etc.), you can transform them prior to including them in your model. Gary Templeton has published an excellent article on this and created a YouTube video showing how to conduct the transformation. He also references his article in the video.

Linearity
Linearity refers to the consistent slope of change that represents the relationship between an IV and a DV. If the relationship between the IV and the DV is radically inconsistent, then it will throw off your SEM analyses. There are dozens of ways to test for linearity. Perhaps the most elegant (easy and clear-cut, yet rigorous) is the deviation from linearity test available in the ANOVA test in SPSS. In SPSS go to Analyze, Compare Means, Means. Put the composite IVs and DVs in the lists, then click on Options, and select "Test for Linearity". Then in the ANOVA table in the output window, if the Sig value for Deviation from Linearity is less than 0.05, the relationship between IV and DV is not linear, and thus is problematic (see the screenshots below). Issues of linearity can sometimes be fixed by removing outliers (if the significance is borderline), or through transforming the data. In the screenshot below, we can see that the first relationship is linear (Sig = .268), but the second relationship is nonlinear (Sig = .003).
If this test turns up odd results, then simply perform an OLS linear regression between each IV->DV pair. If the sig value is less than 0.05, then the relationship can be considered "sufficiently" linear. While this approach is somewhat less rigorous, it has the benefit of working every time! You can also do a curvilinear regression ("curve estimation") to see if the relationship is more linear than nonlinear.

Homoscedasticity
VIDEO TUTORIAL: Plotting Homoscedasticity (http://youtu.be/WrQ1O_He63Q?hd=1)
Encyclopedia of Research Design, Volume 1 (2010), Sage Publications, pg. 581 (http://books.google.ca/books?
id=HVmsxuaQl2oC&pg=PA581&dq=homoscedasticity+residual+scatterplots&hl=en&sa=X&ei=iF5TUeiRGcnOyQGg
4HICQ&ved=0CEIQ6AEwAg#v=onepage&q=homoscedasticity%20residual%20scatterplots&f=false)
Homoscedasticity is a nasty word that means that the variable's residual (error) exhibits consistent variance across different levels of the variable. There are good reasons for desiring this. For more information, see Hair et al. 2010 chapter 2. :) A simple way to determine if a relationship is homoscedastic is to do a simple scatterplot with the variable on the y-axis and the variable's residual on the x-axis. To see a step-by-step guide on how to do this, watch the video tutorial. If the plot comes up with a consistent pattern as in the figure below, then we are good; we have homoscedasticity! If there is not a consistent pattern, then the relationship is considered heteroskedastic. This can be fixed by transforming the data or by splitting the data by subgroups (such as two groups for gender). You can read more about transformations in Hair et al. 2010 ch. 4.

The jury is still out on homoscedasticity. Some suggest that evidence of heteroskedasticity is not a problem (and is actually desirable and expected in moderated models), and so we shouldn't worry about testing for homoscedasticity. I never conduct this test unless specifically requested to by a reviewer.

Multicollinearity
VIDEO TUTORIAL: Detecting Multicollinearity (http://www.youtube.com/watch?v=oPXjQCtyoG0)
Multicollinearity is not desirable. It means that the variance our independent variables explain in our dependent variable overlaps, so the independent variables are not each explaining unique variance in the dependent variable. The way to check this is to calculate a Variance Inflation Factor (VIF) for each independent variable after running a multiple regression. The rules of thumb for the VIF are as follows:
VIF < 3: not a problem
VIF > 3: potential problem
VIF > 5: very likely problem
VIF > 10: definitely problem
The tolerance value in SPSS is the reciprocal of the VIF (tolerance = 1/VIF), and values less than 0.10 are strong indications of multicollinearity issues. For particulars on how to calculate the VIF in SPSS, watch the step-by-step video tutorial. The easiest method for fixing multicollinearity issues is to drop one of the problematic variables. This won't hurt your R-square much because that variable doesn't add much unique explanation of variance anyway.
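As an illustration of what the VIF measures, the following Python sketch computes it without any statistics package. It relies on the identity that each predictor's VIF, 1 / (1 - R-square) from regressing that predictor on the others, equals the matching diagonal entry of the inverse of the predictors' correlation matrix. The helper names and the three predictors are hypothetical, and the small matrix inversion is only meant for a handful of variables.

```python
from math import sqrt
from statistics import mean


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    centered = []
    for col in columns:
        m = mean(col)
        centered.append([v - m for v in col])
    norms = [sqrt(sum(v * v for v in col)) for col in centered]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
         for cj, nj in zip(centered, norms)]
        for ci, ni in zip(centered, norms)
    ]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each predictor: the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]


# Hypothetical predictors: x2 closely tracks x1, x3 is largely unrelated.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.0, 8.3]
x3 = [5, 3, 6, 2, 7, 1, 8, 4]

vifs = vif([x1, x2, x3])
for name, value in zip(("x1", "x2", "x3"), vifs):
    print(name, round(value, 1), "tolerance", round(1 / value, 3))
```

In this example x1 and x2 both land far above the VIF > 10 threshold while x3 stays below 3, so dropping either x1 or x2 would be the simple fix described above.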
For a more critical examination of multicollinearity, please refer to:
O'Brien, R. M. 2007. A Caution Regarding Rules of Thumb for Variance Inflation Factors. Quality & Quantity, 41, 673-690. (http://link.springer.com/article/10.1007/s11135-006-9018-6)
Retrieved from "http://statwiki.kolobkreations.com/index.php?title=Data_screening&oldid=350425"


This page was last modified on 27 April 2016, at 11:12.
This page has been accessed 9,680 times.
